All of lore.kernel.org
 help / color / mirror / Atom feed
* sata_nv ADMA controller lockup investigation
@ 2007-02-15  4:30 Robert Hancock
  2007-02-15 10:46 ` Jeff Garzik
  0 siblings, 1 reply; 5+ messages in thread
From: Robert Hancock @ 2007-02-15  4:30 UTC (permalink / raw)
  To: linux-kernel, linux-ide, Jeff Garzik, Allen Martin; +Cc: nas, B.Steinbrink

While testing out some libata FUA changes I was working on, I was 
inadvertently able to reproduce the kind of NCQ command timeouts in 
sata_nv that a few people have reported. I since verified that the FUA 
stuff had nothing to do with it as it still happens even with FUA 
disabled. However I'm somewhat at a loss as to how to further debug 
this, so I'm posting my findings in the hope that somebody has some more 
ideas (or anyone at NVIDIA decides to come forth with a tip or two).

The conditions in which I can reproduce this are with:

ext3 filesystem mounted with -o barrier=1
Two instances of a program which truncates a file, then writes single bytes
to it, fsyncing after each one.
Simultaneously, repeatedly writing 100MB from /dev/zero to a file
using dd.

A command timeout usually happens within a few minutes. With my working copy
loaded up with a ton of extra debugging, the exception report for one of
these looks like this. My comments are indented.

ata4: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 cpb count 0x0 next cpb idx 0x0
	This is just dumping all of the ADMA registers when the timeout
	happened.
ata4: last intr at 1171511467:501179, status 0x1540
	This shows the time of the last interrupt in seconds:microseconds and the
	ADMA status register contents at that time.
ata4: cmd 61/08:00:40:36:75/00:00:0c:00:00/40 tag 0 at 1171511467:360525 done 1171511467:393064, stat before 0x400 after 0x400
ata4: cmd 61/40:00:80:a1:64/00:00:0a:00:00/40 tag 0 at 1171511467:393928 done 1171511467:394345, stat before 0x500 after 0x400
ata4: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 at 1171511467:394400 done 1171511467:425548, stat before 0x500 after 0x400
ata4: cmd 61/08:00:c0:a1:64/00:00:0a:00:00/40 tag 0 at 1171511467:425556 done 1171511467:425694, stat before 0x500 after 0x400
ata4: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 at 1171511467:425699 done 1171511467:433896, stat before 0x500 after 0x400
ata4: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 at 1171511467:433958 done 1171511467:433971, stat before 0x500 after 0x400
ata4: cmd 61/08:00:c8:a1:64/00:00:0a:00:00/40 tag 0 at 1171511467:433978 done 1171511467:434152, stat before 0x500 after 0x400
ata4: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 at 1171511467:434160 done 1171511467:442326, stat before 0x500 after 0x400
ata4: cmd 61/08:00:d0:a1:64/00:00:0a:00:00/40 tag 0 at 1171511467:442389 done 1171511467:442843, stat before 0x500 after 0x400
ata4: cmd 61/08:08:88:7e:75/00:00:0c:00:00/40 tag 1 at 1171511467:442395 done 1171511467:442846, stat before 0x400 after 0x400
ata4: cmd 61/e8:10:08:58:77/01:00:0c:00:00/40 tag 2 at 1171511467:442419 done 1171511467:445010, stat before 0x400 after 0x400
ata4: cmd 61/e8:18:f0:59:77/01:00:0c:00:00/40 tag 3 at 1171511467:442437 done 1171511467:447182, stat before 0x0 after 0x0
ata4: cmd 61/e8:20:d8:5b:77/01:00:0c:00:00/40 tag 4 at 1171511467:442455 done 1171511467:449343, stat before 0x0 after 0x0
ata4: cmd 61/e8:28:c0:5d:77/01:00:0c:00:00/40 tag 5 at 1171511467:442475 done 1171511467:451543, stat before 0x0 after 0x0
ata4: cmd 61/30:30:a8:5f:77/00:00:0c:00:00/40 tag 6 at 1171511467:442481 done 1171511467:451833, stat before 0x0 after 0x0
ata4: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 at 1171511467:451858 done 1171511467:492486, stat before 0x500 after 0x400
ata4: cmd 61/08:00:d8:a1:64/00:00:0a:00:00/40 tag 0 at 1171511467:492498 done 1171511467:492666, stat before 0x500 after 0x400
ata4: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 at 1171511467:492671 done 1171511467:500909, stat before 0x500 after 0x400
ata4: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 at 1171511467:501167 done 1171511467:501181, stat before 0x500 after 0x400
ata4: cmd 61/08:00:e0:a1:64/00:00:0a:00:00/40 tag 0 at 1171511467:501187 done 0:0, stat before 0x500 after 0x400
	These lines show the last 20 commands issued, the contents of the
	taskfile, the tag, the time in sec:usec they were issued,
	the time in sec:usec they completed (0:0 for still incomplete),
	the ADMA status register contents before issuing the command,
	and the register contents after issuing the command.
ata4: CPB 0: ctl_flags 0x1f, resp_flags 0x0
	Contents of the outstanding CPB's flags, showing that the controller
	seems not to have touched it, released and done flags are clear.
ata4: timeout waiting for ADMA IDLE, stat=0x400
ata4: timeout waiting for ADMA LEGACY, stat=0x400
	As part of error handling we try to switch the controller back to
	legacy mode. We time out waiting for the controller to show
	IDLE, and then clear the GO bit, and then time out waiting for it
	to show the LEGACY state. Right after this we beat it over the head
	with NV_ADMA_CTL_CHANNEL_RESET which finally seems to restore its
	senses, until one of these happens again.
ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen
ata4.00: cmd 61/08:00:e0:a1:64/00:00:0a:00:00/40 tag 0 cdb 0x0 data 4096 out
         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata4: soft resetting port
ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete
SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

The pattern of the hang always seems to be that it's an NCQ command that
immediately follows a cache flush. When that NCQ command is issued, the
IDLE bit clears as we would expect, but then it stays like that
with just the STOPPED bit set. No interrupt is raised and the CPB response
flags are not updated to indicate the command was released (i.e. received
by the drive) or completed. It's like the controller accepted the command
and then stalled before actually doing anything with it.

My guess it's either a bug in the controller that needs to be somehow
worked around, or we're somehow not operating the controller "properly".
The latter is very difficult for me to determine since the only
documentation I have other than the original NVIDIA-provided driver for
this controller is the ADMA standard, which this controller only vaguely
follows (for example, not all of the status bits on this controller
are in the ADMA standard and some of the meanings for those that are
there seem to be different).

My only real clue at this point is maybe there's something wrong with
the way we are giving the commands to the controller, using the
APPEND register. That's another thing that's not in the ADMA standard.
It's curious that only the post-cache-flush command is having issues,
and normal NCQ operation seems fine. Maybe it's related to that tag 0
being reused repeatedly?

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/




^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: sata_nv ADMA controller lockup investigation
  2007-02-15  4:30 sata_nv ADMA controller lockup investigation Robert Hancock
@ 2007-02-15 10:46 ` Jeff Garzik
  2007-02-16  4:52   ` Robert Hancock
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Garzik @ 2007-02-15 10:46 UTC (permalink / raw)
  To: Robert Hancock; +Cc: linux-kernel, linux-ide, Allen Martin, nas, B.Steinbrink

Robert Hancock wrote:
> It's curious that only the post-cache-flush command is having issues,
> and normal NCQ operation seems fine. Maybe it's related to that tag 0
> being reused repeatedly?


If you take cache flush out of the equation, what happens when NCQ is 
enabled with a queue depth of 1 (to reproduce tag-0-used-repeatedly 
condition)?

	Jeff



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: sata_nv ADMA controller lockup investigation
  2007-02-15 10:46 ` Jeff Garzik
@ 2007-02-16  4:52   ` Robert Hancock
  2007-03-20 22:01     ` Neil Schemenauer
  0 siblings, 1 reply; 5+ messages in thread
From: Robert Hancock @ 2007-02-16  4:52 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel, linux-ide, Allen Martin, nas, B.Steinbrink

Jeff Garzik wrote:
> Robert Hancock wrote:
>> It's curious that only the post-cache-flush command is having issues,
>> and normal NCQ operation seems fine. Maybe it's related to that tag 0
>> being reused repeatedly?
> 
> 
> If you take cache flush out of the equation, what happens when NCQ is 
> enabled with a queue depth of 1 (to reproduce tag-0-used-repeatedly 
> condition)?
> 
>     Jeff

I was able to reproduce it in my same test case with NCQ depth set to 1.
Of course, there were still some cache flushes going on there, so I'm not
certain that test really told us anything new. I'm rather doutbful that it's
related to reusing the same tag now, though, based on the tests I've been
doing. It may be something to do with switching between NCQ and non-NCQ
commands, maybe the controller isn't able to handle doing that too rapidly.
This patch seems to fix the problem - or at least it hasn't failed the tests that
I've run so far. It's not really ideal though, so I'd like to do some more
investigation/testing before proclaiming it as a fix. Experimentally, it
appears that 10 microseconds is not enough delay, but 20 seems to work better.

Hints from the peanut gallery remain welcome.

--- linux-2.6.20-git6edit/drivers/ata/sata_nv.c.before_hacking	2007-02-15 18:19:13.000000000 -0600
+++ linux-2.6.20-git6edit/drivers/ata/sata_nv.c	2007-02-15 22:36:02.000000000 -0600
@@ -219,6 +219,7 @@
 	void __iomem *		gen_block;
 	void __iomem *		notifier_clear_block;
 	u8			flags;
+	int			last_issue_ncq;
 };
 
 struct nv_host_priv {
@@ -1260,6 +1261,7 @@
 {
 	struct nv_adma_port_priv *pp = qc->ap->private_data;
 	void __iomem *mmio = pp->ctl_block;
+	int curr_ncq = (qc->tf.protocol == ATA_PROT_NCQ);
 
 	VPRINTK("ENTER\n");
 
@@ -1274,6 +1276,14 @@
 	/* write append register, command tag in lower 8 bits
 	   and (number of cpbs to append -1) in top 8 bits */
 	wmb();
+
+	if(curr_ncq != pp->last_issue_ncq) {
+	   	/* Seems to need some delay before switching between NCQ and non-NCQ
+		   commands, else we get command timeouts and such. */
+		udelay(20);
+		pp->last_issue_ncq = curr_ncq;
+	}
+
 	writew(qc->tag, mmio + NV_ADMA_APPEND);
 
 	DPRINTK("Issued tag %u\n",qc->tag);

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: sata_nv ADMA controller lockup investigation
  2007-02-16  4:52   ` Robert Hancock
@ 2007-03-20 22:01     ` Neil Schemenauer
  2007-03-20 23:24       ` Robert Hancock
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Schemenauer @ 2007-03-20 22:01 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Jeff Garzik, linux-kernel, linux-ide, Allen Martin, B.Steinbrink

Not sure if this helps.  I'm getting this reset with 2.6.21-rc4.
After the reset the controller seems to work again.

sata_nv 0000:00:07.0: version 3.3
ACPI: PCI Interrupt Link [APSI] enabled at IRQ 22
ACPI: PCI Interrupt 0000:00:07.0[A] -> Link [APSI] -> GSI 22 (level, low) -> IRQ 22
sata_nv 0000:00:07.0: Using ADMA mode
PCI: Setting latency timer of device 0000:00:07.0 to 64
ata1: SATA max UDMA/133 cmd 0xffffc2000001e480 ctl 0xffffc2000001e4a0 bmdma 0x000000000001cc00 irq 22
ata2: SATA max UDMA/133 cmd 0xffffc2000001e580 ctl 0xffffc2000001e5a0 bmdma 0x000000000001cc08 irq 22
scsi0 : sata_nv
ata1: SATA link down (SStatus 0 SControl 300)
scsi1 : sata_nv
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-7: Maxtor 6V300F0, VA111630, max UDMA/133
ata2.00: 586114704 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata2.00: configured for UDMA/133
scsi 1:0:0:0: Direct-Access     ATA      Maxtor 6V300F0   VA11 PQ: 0 ANSI: 5
ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61
SCSI device sda: 586114704 512-byte hdwr sectors (300091 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
SCSI device sda: 586114704 512-byte hdwr sectors (300091 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 sda9 >
sd 1:0:0:0: Attached scsi disk sda
sd 1:0:0:0: Attached scsi generic sg0 type 0
ACPI: PCI Interrupt Link [APSJ] enabled at IRQ 21
ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [APSJ] -> GSI 21 (level, low) -> IRQ 21
sata_nv 0000:00:08.0: Using ADMA mode
PCI: Setting latency timer of device 0000:00:08.0 to 64
ata3: SATA max UDMA/133 cmd 0xffffc20000020480 ctl 0xffffc200000204a0 bmdma 0x000000000001b800 irq 21
ata4: SATA max UDMA/133 cmd 0xffffc20000020580 ctl 0xffffc200000205a0 bmdma 0x000000000001b808 irq 21
scsi2 : sata_nv
ata3: SATA link down (SStatus 0 SControl 300)
scsi3 : sata_nv
ata4: SATA link down (SStatus 0 SControl 300)
pata_amd 0000:00:06.0: version 0.2.8
PCI: Setting latency timer of device 0000:00:06.0 to 64
ata5: PATA max UDMA/133 cmd 0x00000000000101f0 ctl 0x00000000000103f6 bmdma 0x000000000001e000 irq 14
ata6: PATA max UDMA/133 cmd 0x0000000000010170 ctl 0x0000000000010376 bmdma 0x000000000001e008 irq 15
scsi4 : pata_amd
ata5.00: ATAPI, max UDMA/33
ata5.01: ATAPI, max MWDMA2
ata5.00: configured for UDMA/33
ata5.01: configured for MWDMA2
scsi5 : pata_amd
ATA: abnormal status 0x8 on port 0x0000000000010177
scsi 4:0:0:0: CD-ROM            PLEXTOR  DVDR   PX-504A   1.02 PQ: 0 ANSI: 5
sr0: scsi3-mmc drive: 40x/40x writer cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
sr 4:0:0:0: Attached scsi CD-ROM sr0
sr 4:0:0:0: Attached scsi generic sg1 type 5
scsi 4:0:1:0: CD-ROM            COMPAQ   SC-140S          SE04 PQ: 0 ANSI: 5
sr1: scsi3-mmc drive: 1x/40x cd/rw xa/form2 cdda tray
sr 4:0:1:0: Attached scsi CD-ROM sr1
sr 4:0:1:0: Attached scsi generic sg2 type 5

[...]

ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
ata2: CPB 1: ctl_flags 0x1f, resp_flags 0x2
ata2: timeout waiting for ADMA IDLE, stat=0x400
ata2: timeout waiting for ADMA LEGACY, stat=0x400
ata2.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x2 frozen
ata2.00: cmd 61/00:08:72:44:22/02:00:21:00:00/40 tag 1 cdb 0x0 data 262144 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata2: soft resetting port
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: configured for UDMA/133
ata2: EH complete
SCSI device sda: 586114704 512-byte hdwr sectors (300091 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA


$ lspci
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:0c.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 80)
01:0d.0 RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 15)
05:00.0 VGA compatible controller: ATI Technologies Inc R430 [Radeon X800 XL] (PCIe)
05:00.1 Display controller: ATI Technologies Inc R430 [Radeon X800 XL] (PCIe) Secondary


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: sata_nv ADMA controller lockup investigation
  2007-03-20 22:01     ` Neil Schemenauer
@ 2007-03-20 23:24       ` Robert Hancock
  0 siblings, 0 replies; 5+ messages in thread
From: Robert Hancock @ 2007-03-20 23:24 UTC (permalink / raw)
  To: Neil Schemenauer
  Cc: Jeff Garzik, linux-kernel, linux-ide, Allen Martin, B.Steinbrink

Neil Schemenauer wrote:
> Not sure if this helps.  I'm getting this reset with 2.6.21-rc4.
> After the reset the controller seems to work again.
> 
...
> ata2.00: ATA-7: Maxtor 6V300F0, VA111630, max UDMA/133
> ata2.00: 586114704 sectors, multi 16: LBA48 NCQ (depth 31/32)

...

> ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
> ata2: CPB 1: ctl_flags 0x1f, resp_flags 0x2
> ata2: timeout waiting for ADMA IDLE, stat=0x400
> ata2: timeout waiting for ADMA LEGACY, stat=0x400
> ata2.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x2 frozen
> ata2.00: cmd 61/00:08:72:44:22/02:00:21:00:00/40 tag 1 cdb 0x0 data 262144 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata2: soft resetting port
> ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata2.00: configured for UDMA/133
> ata2: EH complete
> SCSI device sda: 586114704 512-byte hdwr sectors (300091 MB)
> sda: Write Protect is off
> sda: Mode Sense: 00 3a 00 00
> SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

That one looks like a drive-side issue. CPB resp_flags 2 indicates the 
drive accepted the command and the controller is still waiting for a 
response.

Could be that this is another drive that needs to be added to the NCQ 
blacklist, some similar Maxtor models seem to have issues..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2007-03-20 23:24 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-15  4:30 sata_nv ADMA controller lockup investigation Robert Hancock
2007-02-15 10:46 ` Jeff Garzik
2007-02-16  4:52   ` Robert Hancock
2007-03-20 22:01     ` Neil Schemenauer
2007-03-20 23:24       ` Robert Hancock

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.