* libata fails to recover from HSM violation involving DRQ status
@ 2007-04-28 20:15 Mark Lord
  2007-04-28 20:18 ` Mark Lord
                   ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Mark Lord @ 2007-04-28 20:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, IDE/ATA development list

Tejun,

While working on the new hdparm (version 7.0, released today),
I ran into trouble when a buggy SG_IO/ATA_16 packet caused
the libata EH to get confused.

I triggered this by accident, issuing an IDENTIFY command
which incorrectly specified ATA_PROT_NODATA.  My error, for sure,
but libata never recovered from the "stuck DRQ bit" that resulted.
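
For readers who have not built one of these packets by hand, here is a rough
sketch of how an ATA PASS-THROUGH (16) CDB encodes the protocol, using the SAT
field layout. It is not the packet from this report (that one is not shown
here). ATA_PROT_NODATA is libata's internal name; in the CDB the matching value
is the SAT "non-data" protocol. The helper name build_ata16 is made up for
illustration, and Python is used only to keep the sketch short.

```python
# Sketch of an ATA PASS-THROUGH (16) CDB, following the SAT field layout.
# build_ata16() is a hypothetical helper, not part of hdparm or the kernel.

ATA_16 = 0x85             # SCSI opcode for ATA PASS-THROUGH (16)
PROTO_NON_DATA = 3        # SAT protocol field: no data phase
PROTO_PIO_IN = 4          # SAT protocol field: PIO data-in
ATA_CMD_IDENTIFY = 0xEC   # IDENTIFY DEVICE


def build_ata16(command, protocol, sectors=0, from_device=False):
    """Return a 16-byte CDB for a 28-bit ATA command."""
    cdb = bytearray(16)
    cdb[0] = ATA_16
    cdb[1] = protocol << 1   # bits 4:1 = protocol, bit 0 = EXTEND (left 0)
    if sectors:
        # T_DIR (bit 3) set for device-to-host, BYT_BLOK (bit 2) = blocks,
        # T_LENGTH (bits 1:0) = 2: the length lives in the sector count field.
        cdb[2] = (0x08 if from_device else 0x00) | 0x04 | 0x02
    cdb[6] = sectors & 0xFF  # sector count (7:0)
    cdb[14] = command        # ATA command register
    return bytes(cdb)


# IDENTIFY returns one 512-byte block, so it needs a PIO data-in protocol.
good = build_ata16(ATA_CMD_IDENTIFY, PROTO_PIO_IN, sectors=1, from_device=True)

# The kind of mismatch described above: same command, but no data phase declared.
bad = build_ata16(ATA_CMD_IDENTIFY, PROTO_NON_DATA)
```

With the second form the drive still raises DRQ to hand over its 512 bytes,
while the host side has been told that no data phase will happen.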

In the IDE driver, we had code to try and cope with stuck DRQ,
by just looping and reading from the data port a few times.
That could have been done better, but it worked a lot of the time,
back in those simpler days.
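
A minimal sketch of that "loop and read the data port" idea, run against a
simulated device. This is not the old IDE code; FakeDevice, drain_drq and the
65536-word bound are all assumptions made for illustration.

```python
# Sketch of draining a stuck DRQ by reading and discarding data words.
# FakeDevice stands in for real port I/O; nothing here is kernel code.

ATA_DRQ = 0x08       # Status bit 3: device has data to transfer
IDLE_STATUS = 0x50   # DRDY | DSC, i.e. ready with nothing pending


class FakeDevice:
    """Simulated drive with `pending` 16-bit words left in its data register."""

    def __init__(self, pending):
        self.pending = pending

    def status(self):
        return IDLE_STATUS | (ATA_DRQ if self.pending else 0)

    def read_data_word(self):
        if self.pending:
            self.pending -= 1
        return 0  # the content is thrown away anyway


def drain_drq(dev, max_words=65536):
    """Read and discard words while DRQ is set; return the number read.

    The bound keeps a wedged controller from spinning here forever.
    """
    count = 0
    while count < max_words and dev.status() & ATA_DRQ:
        dev.read_data_word()
        count += 1
    return count
```

A drive left holding one IDENTIFY block (256 words) would come back with DRQ
clear after drain_drq(FakeDevice(256)). If DRQ is still set when the bound is
hit, the caller has to fall back to a reset.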

I don't know what you try in libata-eh, but perhaps it can be tweaked?
Below is the 'dmesg' from that system before I hit the big red button.

I can also supply a program that will generate this lockup on demand,
for testing purposes.

Cheers

Mark


sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0 
         res 58/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
ata1: soft resetting port
ata1.00: configured for UDMA/100
ata1: EH complete
SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0 
         res 58/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
ata1: soft resetting port
ata1.00: configured for UDMA/100
ata1: EH complete
SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0 
         res 58/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
ata1: soft resetting port
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata1: port is slow to respond, please be patient (Status 0xd8)
ata1: port failed to respond (30 secs, Status 0xd8)
ata1: soft resetting port
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1.00: limiting speed to UDMA/100:PIO3
ata1: failed to recover some devices, retrying in 5 secs
ata1: port is slow to respond, please be patient (Status 0xd8)
ata1: port failed to respond (30 secs, Status 0xd8)
ata1: soft resetting port
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ATA: abnormal status 0xD8 on port 0x000101f7
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
ata1: EH complete
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 120322555
Buffer I/O error on device sda1, logical block 15018230
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 127883035
Buffer I/O error on device sda1, logical block 15963290
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 31060987
Buffer I/O error on device sda1, logical block 3860534
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 31060995
Buffer I/O error on device sda1, logical block 3860535
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 3860536
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 3860537
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 31780267
Buffer I/O error on device sda1, logical block 3950444
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 3950445
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43692707
Buffer I/O error on device sda1, logical block 5439499
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43692739
Buffer I/O error on device sda1, logical block 5439503
lost page write due to I/O error on sda1
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693195
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693243
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693259
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693299
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693315
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693331
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693355
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693419
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693451
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693467
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693523
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693563
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693603
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693619
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693635
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693651
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 43693683
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 54442019
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 54442051
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 92386395
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 94024779
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 94024803
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 96384107
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 96384211
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 96384227
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 96384275
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 117617635
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 117617707
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 118141547
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 118288219
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 126005883
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 126300947
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 155417787
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 155519227
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 155519251
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 155519347
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 155519539
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 155521219
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 155589019
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 155589523
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 155595347
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 155596443
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 275427915
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 275427939
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 176787
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 27701915
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 27702155
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 27702243
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 27702427
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 27704179
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 27891339
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 27964171
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 28253219
Aborting journal on device sda1.
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 28253315
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30847571
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30882475
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30941587
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30956683
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30956795
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30958155
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30958435
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30958499
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30958563
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30958659
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30958915
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30959171
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30959427
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30959683
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30959939
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30960195
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30960451
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30960707
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30960963
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30961219
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30961475
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30961731
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30961987
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30962243
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30964363
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30964619
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30964875
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30965131
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30965387
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30965643
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30965899
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30966155
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30966411
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30966667
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30966923
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30967179
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30967435
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30967691
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30967947
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30968283
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30968539
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30968771
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30969027
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30969283
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30969539
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30997643
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30997707
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 126005891
EXT3-fs error (device sda1): ext3_get_inode_loc: unable to read inode block - inode=7864507, block=15728647
EXT3-fs error (device sda1) in ext3_reserve_inode_write: IO failure
EXT3-fs error (device sda1) in ext3_dirty_inode: IO failure
journal commit I/O error
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30958443
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30958475
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30958499
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30961371
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 30967803
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 31780267
__journal_remove_journal_head: freeing b_committed_data
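
For reference, the first byte of each "res" line above is the ATA Status
register, and the later "abnormal status" lines print the same register. A small
decoder, using the traditional bit names (bits 1 and 2 are obsolete in newer
ATA revisions), might look like this. The names decode_status and
ATA_STATUS_BITS are made up for illustration.

```python
# Decode an ATA Status register value into its set bit names.

ATA_STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "DSC"),   # seek complete
    (0x08, "DRQ"),   # data request
    (0x04, "CORR"),  # corrected data (obsolete)
    (0x02, "IDX"),   # index (obsolete)
    (0x01, "ERR"),   # error
]


def decode_status(status):
    """Return the names of the bits set in `status`, highest bit first."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


for value in (0x58, 0xD8):
    print(f"0x{value:02X}: {' '.join(decode_status(value))}")
```

So 0x58 is DRDY, DSC and DRQ with no ERR bit, and 0xD8 is the same with BSY
also set: DRQ stays asserted throughout.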

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 20:15 libata fails to recover from HSM violation involving DRQ status Mark Lord
@ 2007-04-28 20:18 ` Mark Lord
  2007-04-28 20:30 ` Alan Cox
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 37+ messages in thread
From: Mark Lord @ 2007-04-28 20:18 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Tejun,
> 
> While working on the new hdparm (version 7.0, released today),
> I ran into trouble when a buggy SG_IO/ATA_16 packet caused
> the libata EH to get confused.
> 
> I triggered this by accident, issuing an IDENTIFY command
> which incorrectly specified ATA_PROT_NODATA.  My error, for sure,
> but libata never recovered from the "stuck DRQ bit" that resulted.
...

This was on 2.6.21, with ata_piix.

Cheers


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 20:15 libata fails to recover from HSM violation involving DRQ status Mark Lord
  2007-04-28 20:18 ` Mark Lord
@ 2007-04-28 20:30 ` Alan Cox
  2007-04-28 20:37 ` Jeff Garzik
  2007-04-28 22:09 ` Mark Lord
  3 siblings, 0 replies; 37+ messages in thread
From: Alan Cox @ 2007-04-28 20:30 UTC (permalink / raw)
  To: Mark Lord; +Cc: Tejun Heo, Jeff Garzik, Alan Cox, IDE/ATA development list

> In the IDE driver, we had code to try and cope with stuck DRQ,
> by just looping and reading from the data port a few times.
> That could have been done better, but it worked a lot of the time,
> back in those simpler days.

It works very well. The current "old" IDE has some changes in the area
but those are basically to handle one or two controllers whose internal
state machine flushes the data queue so we don't hang the box solid
trying to flush it ourselves.



* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 20:15 libata fails to recover from HSM violation involving DRQ status Mark Lord
  2007-04-28 20:18 ` Mark Lord
  2007-04-28 20:30 ` Alan Cox
@ 2007-04-28 20:37 ` Jeff Garzik
  2007-04-28 20:44   ` Mark Lord
  2007-04-28 21:25   ` Alan Cox
  2007-04-28 22:09 ` Mark Lord
  3 siblings, 2 replies; 37+ messages in thread
From: Jeff Garzik @ 2007-04-28 20:37 UTC (permalink / raw)
  To: Mark Lord; +Cc: Tejun Heo, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Tejun,
> 
> While working on the new hdparm (version 7.0, released today),
> I ran into trouble when a buggy SG_IO/ATA_16 packet caused
> the libata EH to get confused.
> 
> I triggered this by accident, issuing an IDENTIFY command
> which incorrectly specified ATA_PROT_NODATA.  My error, for sure,
> but libata never recovered from the "stuck DRQ bit" that resulted.
> 
> In the IDE driver, we had code to try and cope with stuck DRQ,
> by just looping and reading from the data port a few times.
> That could have been done better, but it worked a lot of the time,
> back in those simpler days.
> 
> I don't know what you try in libata-eh, but perhaps it can be tweaked?
> Below is the 'dmesg' from that system before I hit the big red button.

I am reluctant to do anything about this.

All manner of things can go wrong, if the taskfile protocol specified 
disagrees with the taskfile contents.

At that point you are in undefined territory, since libata will happily 
ARM a DMA controller or otherwise program controller registers in 
preparation for the requested taskfile protocol.  Data corruption, hard 
locks, anything could happen at that point.

Maybe we do need to recover from a stuck DRQ bit, but I'll wait until 
that symptom shows up with a different catalyst.

	Jeff





* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 20:37 ` Jeff Garzik
@ 2007-04-28 20:44   ` Mark Lord
  2007-04-28 20:50     ` Jeff Garzik
  2007-04-28 21:25   ` Alan Cox
  1 sibling, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-28 20:44 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Tejun Heo, Alan Cox, IDE/ATA development list

Jeff Garzik wrote:
> Mark Lord wrote:
>..
>> I triggered this by accident, issuing an IDENTIFY command
>> which incorrectly specified ATA_PROT_NODATA.  My error, for sure,
>> but libata never recovered from the "stuck DRQ bit" that resulted.
..
> Maybe we do need to recover from a stuck DRQ bit, but I'll wait until 
> that symptom shows up with a different catalyst.

It's a failure mode that occurs very often (as far as failures go)
with the IDE driver.  *Lots* of occurrences.

So as more things migrate to libata, we'll eventually have to deal
with it here, too.  I'm just trying to give us a chance to fix it
before somebody loses data over it.

Actually, I'm not so sure that this problem hasn't *already* been
posted to this very mailing list.

http://lkml.org/lkml/2006/10/1/264
http://www.mail-archive.com/linux-ide@vger.kernel.org/msg05078.html
...

Cheers


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 20:44   ` Mark Lord
@ 2007-04-28 20:50     ` Jeff Garzik
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff Garzik @ 2007-04-28 20:50 UTC (permalink / raw)
  To: Mark Lord; +Cc: Tejun Heo, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Actually, I'm not so sure that this problem hasn't *already* been
> posted to this very mailing list.
> 
> http://lkml.org/lkml/2006/10/1/264
> http://www.mail-archive.com/linux-ide@vger.kernel.org/msg05078.html
> ...

What Tejun said at the end of that thread :)

That one is a phy-level problem, when it starts complaining about 
10b-to-8b decode and non-recoverable communication errors.

I'm keeping an open mind, but with the drivers being different from each 
other, I want to see how libata encounters that failure mode in the field.

	Jeff




* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 20:37 ` Jeff Garzik
  2007-04-28 20:44   ` Mark Lord
@ 2007-04-28 21:25   ` Alan Cox
  2007-04-28 21:35     ` Mark Lord
  2007-04-28 21:38     ` Jeff Garzik
  1 sibling, 2 replies; 37+ messages in thread
From: Alan Cox @ 2007-04-28 21:25 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Mark Lord, Tejun Heo, Alan Cox, IDE/ATA development list

> I am reluctant to do anything about this.

This one does need dealing with. It happens in the real world and the old
IDE paths for this do get triggered and used now and then (we know this
because bugs in them were found). All it takes is a device and a
controller disagreeing about the length of a data transfer to get in a
mess. In theory resetting the bus should get you out of this; I'm
surprised we didn't get out that way.

> All manner of things can go wrong, if the taskfile protocol specified 
> disagrees with the taskfile contents.

True, but at the point where you are trying to do error recovery and DRQ is
wedged on, it's a good idea to pull the remaining data out of the FIFO.

> At that point you are in undefined territory, since libata will happily 
> ARM a DMA controller or otherwise program controller registers in 
> preparation for the requested taskfile protocol.  Data corruption, hard 
> locks, anything could happen at that point.

SG_IO and other userspace interfaces can mean we issue a command that
ends up causing variants of this kind of confusion.

Alan


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 21:25   ` Alan Cox
@ 2007-04-28 21:35     ` Mark Lord
  2007-04-28 21:38     ` Jeff Garzik
  1 sibling, 0 replies; 37+ messages in thread
From: Mark Lord @ 2007-04-28 21:35 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Tejun Heo, Alan Cox, IDE/ATA development list

Alan Cox wrote:
>> I am reluctant to do anything about this.
> 
> This one does need dealing with. It happens in the real world and the old
> IDE paths for this do get triggered and used now and then (we know this
> because bugs in them were found). All it takes is a device and a
> controller disagreeing about the length of a data transfer to get in a
> mess. In theory resetting the bus should get you out of this; I'm
> surprised we didn't get out that way.
..
> SG_IO and other userspace interfaces can mean we issue a command that
> ends up causing variants of this kind of confusion.

That last one doesn't really worry me -- it has to be deliberately
done by the sysadmin.

But the history of real-world cases is definitely of concern,
especially since it's quite likely a rather simple fix.

I think failed WRITE_DMA requests (IDNF or ECC faults) were one source.

Cheers


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 21:25   ` Alan Cox
  2007-04-28 21:35     ` Mark Lord
@ 2007-04-28 21:38     ` Jeff Garzik
  2007-04-28 21:41       ` Mark Lord
  2007-04-28 23:56       ` Alan Cox
  1 sibling, 2 replies; 37+ messages in thread
From: Jeff Garzik @ 2007-04-28 21:38 UTC (permalink / raw)
  To: Alan Cox; +Cc: Mark Lord, Tejun Heo, Alan Cox, IDE/ATA development list

Alan Cox wrote:
>> I am reluctant to do anything about this.
> 
> This one does need dealing with. It happens in the real world and the old
> IDE paths for this do get triggered and used now and then (we know this
> because bugs in them were found). All it takes is a device and a
> controller disagreeing about the length of a data transfer to get in a

How would they disagree (excluding human error)?


> mess. In theory resetting the bus should get you out of this; I'm
> surprised we didn't get out that way.

Indeed.


>> All manner of things can go wrong, if the taskfile protocol specified 
>> disagrees with the taskfile contents.
> 
> True, but at the point where you are trying to do error recovery and DRQ is
> wedged on, it's a good idea to pull the remaining data out of the FIFO.

It's not really a good idea for SATA.  The "FIFO" is often co-emulated by
the SATA controller and SATA phy.  You just want to kick SATA really 
hard (i.e. bus reset and friends).

	Jeff





* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 21:38     ` Jeff Garzik
@ 2007-04-28 21:41       ` Mark Lord
  2007-04-29  3:17         ` Tejun Heo
  2007-04-28 23:56       ` Alan Cox
  1 sibling, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-28 21:41 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Alan Cox, Tejun Heo, Alan Cox, IDE/ATA development list

Jeff Garzik wrote:
> 
> It's not really a good idea for SATA.  The "FIFO" is often co-emulated by
> the SATA controller and SATA phy.  You just want to kick SATA really 
> hard (i.e. bus reset and friends).

Sure.  So why don't we do that now?


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 20:15 libata fails to recover from HSM violation involving DRQ status Mark Lord
                   ` (2 preceding siblings ...)
  2007-04-28 20:37 ` Jeff Garzik
@ 2007-04-28 22:09 ` Mark Lord
  2007-04-29  3:04   ` Tejun Heo
  3 siblings, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-28 22:09 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Tejun Heo, Alan Cox, IDE/ATA development list

Mark Lord wrote:
>..
> I triggered this by accident, issuing an IDENTIFY command
> which incorrectly specified ATA_PROT_NODATA.  My error, for sure,
> but libata never recovered from the "stuck DRQ bit" that resulted.
...
> sda: Mode Sense: 00 3a 00 00
> SCSI device sda: write cache: enabled, read cache: enabled, doesn't 
> support DPO or FUA
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0 
>         res 58/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> ata1: soft resetting port
> ata1.00: configured for UDMA/100
> ata1: EH complete
> SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
> sda: Write Protect is off
> sda: Mode Sense: 00 3a 00 00
> SCSI device sda: write cache: enabled, read cache: enabled, doesn't 
> support DPO or FUA
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0 
>         res 58/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> ata1: soft resetting port
> ata1.00: configured for UDMA/100
> ata1: EH complete
...
(over and over)

Say.. is this problem as simple as excessive retries for an SG_IO command?
There shouldn't really be *any* retries here, and it should eventually
just fail the command rather than shut down the port.

Or am I just reading the logs wrong?


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 21:38     ` Jeff Garzik
  2007-04-28 21:41       ` Mark Lord
@ 2007-04-28 23:56       ` Alan Cox
  1 sibling, 0 replies; 37+ messages in thread
From: Alan Cox @ 2007-04-28 23:56 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Mark Lord, Tejun Heo, Alan Cox, IDE/ATA development list

> > This one does need dealing with. It happens in the real world and the old
> > IDE paths for this do get triggered and used now and then (we know this
> > because bugs in them were found). All it takes is a device and a
> > controller disagreeing about the length of a data transfer to get in a
> 
> How would they disagree (excluding human error)?

Human error with SG_IO is quite sufficient, or controller bugs. It
happens: we see it happen and stuff gets stuck that way sometimes when you
get a timeout. For some controllers a failure gets stuck because of the
FIFO magic they do.

> It's not really a good idea for SATA.  The "FIFO" is often co-emulated by
> the SATA controller and SATA phy.  You just want to kick SATA really 
> hard (i.e. bus reset and friends).

Possibly we need a per-controller ->drain_fifo method, where available. It's
precisely because the FIFO is "magic" that some of the PATA controllers
get stuck in a mess (e.g. they hold off the IRQ until the FIFO is drained).
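
As a rough illustration of that shape, here is a sketch with an optional
per-controller hook that recovery calls only when the driver supplies one. No
such libata interface is being quoted here: PortOps, recover_stuck_drq and the
"drain first, then reset" ordering are all assumptions.

```python
# Sketch of an optional per-controller drain hook used during recovery.
# Every name below is hypothetical; this is not libata's ops table.


class PortOps:
    """Per-controller operations; drain_fifo is optional and may be None."""

    def __init__(self, name, drain_fifo=None):
        self.name = name
        self.drain_fifo = drain_fifo


def recover_stuck_drq(ops, log):
    """Drain the FIFO if the controller provides a way to, then reset."""
    if ops.drain_fifo is not None:
        drained = ops.drain_fifo()
        log.append(f"{ops.name}: drained {drained} words")
    # Controllers without a hook (e.g. SATA, as argued above) go straight to reset.
    log.append(f"{ops.name}: reset")
    return log
```

A PATA driver whose FIFO must be emptied before it raises an IRQ would pass a
drain_fifo callable; a SATA driver would leave it unset and only get the reset.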

Alan


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 22:09 ` Mark Lord
@ 2007-04-29  3:04   ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2007-04-29  3:04 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jeff Garzik, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Mark Lord wrote:
>> ..
>> I triggered this by accident, issuing an IDENTIFY command
>> which incorrectly specified ATA_PROT_NODATA.  My error, for sure,
>> but libata never recovered from the "stuck DRQ bit" that resulted.
> ...
>> sda: Mode Sense: 00 3a 00 00
>> SCSI device sda: write cache: enabled, read cache: enabled, doesn't
>> support DPO or FUA
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
>>         res 58/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
>> ata1: soft resetting port
>> ata1.00: configured for UDMA/100
>> ata1: EH complete
>> SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
>> sda: Write Protect is off
>> sda: Mode Sense: 00 3a 00 00
>> SCSI device sda: write cache: enabled, read cache: enabled, doesn't
>> support DPO or FUA
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
>>         res 58/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
>> ata1: soft resetting port
>> ata1.00: configured for UDMA/100
>> ata1: EH complete
> ...
> (over and over)
> 
> Say.. is this problem as simple as excessive retries for an SG_IO command?
> There shouldn't really be *any* retries here, and it should eventually
> just fail the command rather than shut down the port.
> 
> Or am I just reading the logs wrong?

libata EH isn't trying to retry the command.  It's trying to revalidate
the device after resetting it to make sure that the device is still
there and listening to commands.  As the device fails to respond to
reset and the following IDENTIFY, libata EH assumes that the device is
dead one way or the other and gives up on the device after a few
reset/revalidate retries.
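
To make the distinction concrete, here is a minimal userspace sketch of
that reset/revalidate loop.  It is not the libata-eh code: struct sim_dev,
sim_reset_and_revalidate(), sim_recover() and the max_tries parameter are
all made up for illustration.  The point is only that the loop retries
recovery of the device, not the failed command itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model, for illustration only. */
struct sim_dev {
	int resets_needed;	/* resets until the device answers; 0 = never */
	int resets_done;	/* resets issued so far */
	bool enabled;		/* cleared once recovery gives up */
};

/* One recovery attempt: reset, then check the device still answers. */
static bool sim_reset_and_revalidate(struct sim_dev *d)
{
	d->resets_done++;
	return d->resets_needed > 0 && d->resets_done >= d->resets_needed;
}

/*
 * Retry recovery a bounded number of times.  The failed command is
 * never re-issued here; only the reset/revalidate step is repeated.
 */
static bool sim_recover(struct sim_dev *d, int max_tries)
{
	int tries;

	assert(d != NULL);
	for (tries = 0; tries < max_tries; tries++) {
		if (sim_reset_and_revalidate(d))
			return true;
	}
	d->enabled = false;	/* treat the device as dead */
	return false;
}
```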

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-28 21:41       ` Mark Lord
@ 2007-04-29  3:17         ` Tejun Heo
  2007-04-29  3:46           ` Jeff Garzik
                             ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Tejun Heo @ 2007-04-29  3:17 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Jeff Garzik wrote:
>>
>> It's not really a good idea for SATA.  The "FIFO" is often co-emulated by
>> the SATA controller and SATA phy.  You just want to kick SATA really
>> hard (i.e. bus reset and friends).
> 
> Sure.  So why don't we do that now?

We do that.  It's just that ata_piix is lacking SControl access so all
we can do is SRST, not PHY hardreset.  I don't think draining the
FIFO/whatever on most SATA controllers would be necessary, as PHY
hardreset would make most drives forget what they were doing.  I thought
SRST would have a similar effect.  It's supposed to reset the device's HSM
and thus clear DRQ, right?  Stuck DRQ after SRST seems odd to me.

One more thing to note is that there might be no way to drain data
safely on non-SFF (ahci/sil24...) interfaces and some controllers lock
the machine up hard when TF registers are accessed in certain unexpected
ways (unsurprisingly, sata_nv), so if we do this, it needs to be
configurable per-driver.  Another question is how SATA controllers
emulating the TF interface would react when the data port is polled after
a DMA command.  I'm pretty sure many of them would behave erratically.

Anyways, can you try to hack it into ata_bmdma_error_handler() and see
whether it actually works?  You can check for AC_ERR_HSM there and drain
data port if DRQ is set.  After HSM, ATA_NIEN is set and the port should
be quiescent at that point.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29  3:17         ` Tejun Heo
@ 2007-04-29  3:46           ` Jeff Garzik
  2007-04-29  7:45             ` Tejun Heo
  2007-04-29  3:51           ` Tejun Heo
  2007-04-29 12:07           ` Mark Lord
  2 siblings, 1 reply; 37+ messages in thread
From: Jeff Garzik @ 2007-04-29  3:46 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Mark Lord, Alan Cox, Alan Cox, IDE/ATA development list

Tejun Heo wrote:
> and thus clear DRQ, right?  Stuck DRQ after SRST seems odd to me.

Unfortunately not odd on ata_piix, which can get stuck DRQ-on somewhere 
deep inside its IDE emulation engine.  And neither draining the FIFO nor 
SRST nor a couple other tricks ever helped.  The only thing that seemed 
to make any difference was an enable/disable reset via our beloved PCS 
register.

	Jeff



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29  3:17         ` Tejun Heo
  2007-04-29  3:46           ` Jeff Garzik
@ 2007-04-29  3:51           ` Tejun Heo
  2007-04-29 11:56             ` Mark Lord
  2007-04-29 12:07           ` Mark Lord
  2 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2007-04-29  3:51 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Tejun Heo wrote:
> Mark Lord wrote:
>> Jeff Garzik wrote:
>>> It's not really a good idea for SATA.  The "FIFO" is often co-emulated by
>>> the SATA controller and SATA phy.  You just want to kick SATA really
>>> hard (i.e. bus reset and friends).
>> Sure.  So why don't we do that now?
> 
> We do that.  It's just that ata_piix is lacking SControl access so all
> we can do is SRST, not PHY hardreset.  I don't think draining the
> FIFO/whatever on most SATA controllers would be necessary, as PHY
> hardreset would make most drives forget what they were doing.  I thought
> SRST would have a similar effect.  It's supposed to reset the device's HSM
> and thus clear DRQ, right?  Stuck DRQ after SRST seems odd to me.
> 
> One more thing to note is that there might be no way to drain data
> safely on non-SFF (ahci/sil24...) interfaces and some controllers lock
> the machine up hard when TF registers are accessed in certain unexpected
> ways (unsurprisingly, sata_nv), so if we do this, it needs to be
> configurable per-driver.  Another question is how SATA controllers
> emulating the TF interface would react when the data port is polled after
> a DMA command.  I'm pretty sure many of them would behave erratically.
> 
> Anyways, can you try to hack it into ata_bmdma_error_handler() and see
> whether it actually works?  You can check for AC_ERR_HSM there and drain
> data port if DRQ is set.  After HSM, ATA_NIEN is set and the port should
> be quiescent at that point.

Oh, and one more thing, was the drive SATA or PATA?  I think it could be
the SATA SFF emulation which doesn't reset itself on SRST.  Testing
whether SRST clears DRQ on actual PATA devices would be worthwhile.  If
it's really the controller's emulation HSM that's not getting reset, the
fix definitely should be applied per-driver.

Ah.. one more thing, is this draining also needed after DMA commands or
only after PIO commands?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29  3:46           ` Jeff Garzik
@ 2007-04-29  7:45             ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2007-04-29  7:45 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Mark Lord, Alan Cox, Alan Cox, IDE/ATA development list

Jeff Garzik wrote:
> Tejun Heo wrote:
>> and thus clear DRQ, right?  Stuck DRQ after SRST seems odd to me.
> 
> Unfortunately not odd on ata_piix, which can get stuck DRQ-on somewhere
> deep inside its IDE emulation engine.  And neither draining the FIFO nor
> SRST nor a couple other tricks ever helped.  The only thing that seemed
> to make any difference was an enable/disable reset via our beloved PCS
> register.

OIC, it's the SFF emulation which doesn't reset itself on SRST.  So,
flipping PCS helps.  Maybe we should do that in prereset?

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29  3:51           ` Tejun Heo
@ 2007-04-29 11:56             ` Mark Lord
  2007-04-29 12:59               ` Mark Lord
  0 siblings, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-29 11:56 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Tejun Heo wrote:
> Tejun Heo wrote:
>..
>> Anyways, can you try to hack it into ata_bmdma_error_handler() and see
>> whether it actually works?  You can check for AC_ERR_HSM there and drain
>> data port if DRQ is set.  After HSM, ATA_NIEN is set and the port should
>> be quiescent at that point.

Sure, I'll do that here shortly.

> Oh, and one more thing, was the drive SATA or PATA?

The controller, and libata, think the drive is SATA.
(but it really is PATA with a bridge between it and the controller).

> Ah.. one more thing, is this draining also needed after DMA commands or
> only after PIO commands?

Dunno about DMA -- this instance was just an IDENTIFY command (PIO).
I think the drive also does IDENTIFY_DMA, though, so I can try that too.

Cheers

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29  3:17         ` Tejun Heo
  2007-04-29  3:46           ` Jeff Garzik
  2007-04-29  3:51           ` Tejun Heo
@ 2007-04-29 12:07           ` Mark Lord
  2007-04-29 16:36             ` Tejun Heo
  2 siblings, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-29 12:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Tejun Heo wrote:
> 
> Anyways, can you try to hack it into ata_bmdma_error_handler()

From grepping the code, I don't see how that function would ever
be called from ata_piix.  ??

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 11:56             ` Mark Lord
@ 2007-04-29 12:59               ` Mark Lord
  2007-04-29 13:13                 ` Mark Lord
  0 siblings, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-29 12:59 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Tejun Heo wrote:
>> Tejun Heo wrote:
>> ..
>>> Anyways, can you try to hack it into ata_bmdma_error_handler() and see
>>> whether it actually works?  You can check for AC_ERR_HSM there and drain
>>> data port if DRQ is set.  After HSM, ATA_NIEN is set and the port should
>>> be quiescent at that point.
> 
> Sure, I'll do that here shortly.

Okay, it recovers nicely now with the patch below,
which I'm including for illustrative purposes only.
Ideally, we would look into the qc to see how large
the request was, and determine the drain "limit" based
on that.  But I got tired of rebooting and just hardcoded
it for the time being.

For my failed IDENTIFY, it claims 255 iterations.
Which makes sense, as tf_read probably already read one word
of the 256 words in the pipeline.

Draining is a nice workaround for most problems,
but we cannot drain for a WRITE --> wrong data direction,
and I don't want to feed bad data *into* the output FIFO.
Mmm.. I guess I'll have to try a failed WRITE under the
same circumstances and see what that does.  Probably it just
recovers without any fuss, as the FIFO will be empty anyway.

>> Ah.. one more thing, is this draining also needed after DMA commands or
>> only after PIO commands?

My drive doesn't do IDENTIFY_DMA, so I fed it a READ_DMA instead
with "no data", and libata recovered without draining.

Here's the hack I used:

--- linux/drivers/ata/libata-sff.c.orig	2007-04-26 12:02:46.000000000 -0400
+++ linux/drivers/ata/libata-sff.c	2007-04-29 08:29:27.000000000 -0400
@@ -413,6 +413,24 @@
 	ap->ops->irq_on(ap);
 }
 
+static void ata_drain_fifo (struct ata_port *ap, struct ata_queued_cmd *qc)
+{
+	u8 stat = ata_chk_status(ap);
+	/*
+	 * Try to clear stuck DRQ if necessary.
+	 */
+	if ((stat & ATA_DRQ) && (!qc || qc->dma_dir != DMA_TO_DEVICE)) {
+		unsigned int i, limit = 512;
+		printk("Draining up to %u words from data FIFO.\n", limit);
+		for (i = 0; i < limit ; ++i) {
+			ioread16(ap->ioaddr.data_addr);
+			if (!(ata_chk_status(ap) & ATA_DRQ))
+				break;
+		}
+		printk("Drained %u/%u words.\n", i, limit);
+	}
+}
+
 /**
  *	ata_bmdma_drive_eh - Perform EH with given methods for BMDMA controller
  *	@ap: port to handle error for
@@ -469,7 +487,7 @@
 	}
 
 	ata_altstatus(ap);
-	ata_chk_status(ap);
+	ata_drain_fifo(ap, qc);
 	ap->ops->irq_clear(ap);
 
 	spin_unlock_irqrestore(ap->lock, flags);
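
For anyone who wants to play with the idea outside the kernel, below is a
rough userspace simulation of the drain loop with the limit derived from
the request size instead of being hardcoded.  It is only a sketch:
struct sim_port and the sim_*() helpers are invented stand-ins, not libata
functions, and the fallback of 256 words (one 512-byte sector) when the
size is unknown is an assumption, not something the patch above does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATA_DRQ 0x08	/* Status register: data request bit */

/* Hypothetical port model: a device with words still queued for the host. */
struct sim_port {
	unsigned int words_pending;	/* 16-bit words not yet read */
};

/* Simulated Status read: DRQ stays set while data is pending. */
static uint8_t sim_chk_status(const struct sim_port *p)
{
	return p->words_pending ? (0x50 | ATA_DRQ) : 0x50;
}

/* Simulated 16-bit data port read: consumes one pending word. */
static uint16_t sim_read_data(struct sim_port *p)
{
	if (p->words_pending)
		p->words_pending--;
	return 0;
}

/*
 * Read and discard data while DRQ is set, up to nbytes/2 words.
 * With nbytes == 0 (size unknown) fall back to one sector's worth.
 * Returns the number of words actually read.
 */
static unsigned int sim_drain_fifo(struct sim_port *p, size_t nbytes)
{
	unsigned int limit = nbytes ? (unsigned int)((nbytes + 1) / 2) : 256;
	unsigned int count = 0;

	assert(p != NULL);
	while (count < limit && (sim_chk_status(p) & ATA_DRQ)) {
		(void)sim_read_data(p);
		count++;
	}
	return count;
}
```

Note that for the failed IDENTIFY the request was tagged as carrying no
data, so a size taken from the request alone would be zero; some fallback
like the one above would still be needed in that case.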

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 12:59               ` Mark Lord
@ 2007-04-29 13:13                 ` Mark Lord
  2007-04-29 16:42                   ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-29 13:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

> 
>>> Ah.. one more thing, is this draining also needed after DMA commands or
>>> only after PIO commands?
> 
> My drive doesn't do IDENTIFY_DMA, so I fed it a READ_DMA instead
> with "no data", and libata recovered without draining. 

More specifically, here's what happens for READ_DMA(1 sector)
with "NON_DATA" specified (same circumstances as the failed IDENTIFY):

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd c8/00:01:00:00:00/00:00:00:00:00/40 tag 0 cdb 0x0 data 0
         res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
ata1: port is slow to respond, please be patient (Status 0xd0)
ata1: port failed to respond (30 secs, Status 0xd0)
ata1: soft resetting port
ATA: abnormal status 0xD0 on port 0x000101f7
ATA: abnormal status 0xD0 on port 0x000101f7
ata1.00: configured for UDMA/100
ata1: EH complete
SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

So no draining, and all is well again.
Odds look pretty good that this is just a PIO thing.

-ml

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 12:07           ` Mark Lord
@ 2007-04-29 16:36             ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2007-04-29 16:36 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Tejun Heo wrote:
>>
>> Anyways, can you try to hack it into ata_bmdma_error_handler()
> 
> From grepping the code, I don't see how that function would ever
> be called from ata_piix.  ??

Yeah, I meant ata_bmdma_drive_eh().  You apparently have figured that
out already.  Sorry about the confusion.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 13:13                 ` Mark Lord
@ 2007-04-29 16:42                   ` Tejun Heo
  2007-04-29 16:47                     ` Mark Lord
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2007-04-29 16:42 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
>>>> Ah.. one more thing, is this draining also needed after DMA commands or
>>>> only after PIO commands?
>>
>> My drive doesn't do IDENTIFY_DMA, so I fed it a READ_DMA instead
>> with "no data", and libata recovered without draining. 
> 
> More specifically, here's what happens for READ_DMA(1 sector)
> with "NON_DATA" specified (same circumstances as the failed IDENTIFY):
> 
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd c8/00:01:00:00:00/00:00:00:00:00/40 tag 0 cdb 0x0 data 0
>         res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
> ata1: port is slow to respond, please be patient (Status 0xd0)
> ata1: port failed to respond (30 secs, Status 0xd0)
> ata1: soft resetting port
> ATA: abnormal status 0xD0 on port 0x000101f7
> ATA: abnormal status 0xD0 on port 0x000101f7
> ata1.00: configured for UDMA/100
> ata1: EH complete
> SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
> sda: Write Protect is off
> sda: Mode Sense: 00 3a 00 00
> SCSI device sda: write cache: enabled, read cache: enabled, doesn't
> support DPO or FUA
> 
> So no draining, and all is well again.
> Odds look pretty good that this is just a PIO thing.

So, this is specific to SATA (the host side at least) piix && PIO READ,
right?  I think we can fit this code nicely into
piix_sata_error_handler() if we make sure that it triggers under the
right condition - after a PIO READ command fails due to HSM violation
caused by stuck DRQ.
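
As a strawman for "the right condition", here is a small userspace sketch
of the gating check.  Everything in it is illustrative: struct sim_result,
enum sim_dir and sim_should_drain() are invented names, the mask values
are taken from the logs in this thread (Emask 0x2 for HSM violation,
Status 0x58 with DRQ set), and the extra BSY test is an assumption on top
of what the hack checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_ERR_HSM	0x02	/* "Emask 0x2 (HSM violation)" in the logs */
#define SIM_STAT_BSY	0x80	/* Status: device busy */
#define SIM_STAT_DRQ	0x08	/* Status: data request */

/* Hypothetical data direction of the failed command. */
enum sim_dir { SIM_DIR_NONE, SIM_DIR_TO_DEVICE, SIM_DIR_FROM_DEVICE };

/* Hypothetical summary of a failed command. */
struct sim_result {
	unsigned int err_mask;	/* error mask reported by EH */
	uint8_t status;		/* Status register after the failure */
	enum sim_dir dir;	/* direction the command was issued with */
};

/*
 * Drain only after an HSM violation, with DRQ set and BSY clear, and
 * never for a write: reading the data port would be the wrong direction.
 */
static bool sim_should_drain(const struct sim_result *res)
{
	assert(res != NULL);
	if (!(res->err_mask & SIM_ERR_HSM))
		return false;
	if (res->status & SIM_STAT_BSY)
		return false;
	if (!(res->status & SIM_STAT_DRQ))
		return false;
	return res->dir != SIM_DIR_TO_DEVICE;
}
```

The check excludes writes rather than requiring SIM_DIR_FROM_DEVICE on
purpose: in the case reported here the IDENTIFY was issued as a no-data
command, so a test for "issued as a read" alone would not fire.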

Can you please perform a similar test on a native PATA device connected
to a native PATA controller?  I'm curious whether SRST makes real silicon
forget about the ongoing command.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 16:42                   ` Tejun Heo
@ 2007-04-29 16:47                     ` Mark Lord
  2007-04-29 18:49                       ` Mark Lord
  2007-05-01 13:00                       ` Mark Lord
  0 siblings, 2 replies; 37+ messages in thread
From: Mark Lord @ 2007-04-29 16:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Tejun Heo wrote:
> So, this is specific to SATA (the host side at least) piix && PIO READ,
> right?  I think we can fit this code nicely into
> piix_sata_error_handler() if we make sure that it triggers under the
> right condition - after a PIO READ command fails due to HSM violation
> caused by stuck DRQ.

Yeah, so far it's just PIO FROM DEVICE on a "SATA" device on ata_piix.
It *may* be more widespread than that, but we'll have to test some others.

> Can you please perform a similar test on a native PATA device connected
> to a native PATA controller?  I'm curious whether SRST makes real silicon
> forget about the ongoing command.

I'll dig through some other hardware here and see what I have.
This'll take a few hours.

Cheers


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 16:47                     ` Mark Lord
@ 2007-04-29 18:49                       ` Mark Lord
  2007-04-29 19:05                         ` Mark Lord
  2007-04-29 19:07                         ` Mark Lord
  2007-05-01 13:00                       ` Mark Lord
  1 sibling, 2 replies; 37+ messages in thread
From: Mark Lord @ 2007-04-29 18:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Tejun Heo wrote:
>> Can you please perform a similar test on a native PATA device connected
>> to a native PATA controller?  I'm curious whether SRST makes real silicon
>> forget about the ongoing command.
> 
> I'll dig through some other hardware here and see what I have.

Here's the first results, with sata_via and pata_via:  no issue.
This was tested without the fifo-flush hack.  Logs are below.

If anyone else is feeling brave, I've just put hdparm-7.2 up on sourceforge.net,
with a new VERY DANGEROUS command-line flag:

    hdparm --drq-hsm-error  /dev/whatever

This will issue an IDENTIFY (or PACKET_IDENTIFY) as a "non data" command
to the device of your choice, and you can then just sit back and watch
the fireworks.  The flag is clearly marked as "VERY DANGEROUS",
but seems to be quite safe here now that I've patched my kernel with the hack.

Note that it does a "sync(); sleep(1);" before issuing the fated command.

Cheers

----------------------


sata_via 0000:00:0f.0: version 2.1
ACPI: PCI Interrupt 0000:00:0f.0[B] -> GSI 20 (level, low) -> IRQ 16
sata_via 0000:00:0f.0: routed to hard irq line 10
ata1: SATA max UDMA/133 cmd 0x0001d800 ctl 0x0001d402 bmdma 0x0001c400 irq 16
ata2: SATA max UDMA/133 cmd 0x0001d000 ctl 0x0001c802 bmdma 0x0001c408 irq 16
scsi0 : sata_via
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ATA: abnormal status 0x7F on port 0x0001d807
ATA: abnormal status 0x7F on port 0x0001d807
ata1.00: ATA-6: HDS722512VLSA80, V33OA63A, max UDMA/100
ata1.00: 241254720 sectors, multi 16: LBA48 
ata1.00: configured for UDMA/100
scsi1 : sata_via
Switched to high resolution mode on CPU 0
ata2: SATA link down 1.5 Gbps (SStatus 0 SControl 300)
ATA: abnormal status 0x7F on port 0x0001d007
scsi 0:0:0:0: Direct-Access     ATA      HDS722512VLSA80  V33O PQ: 0 ANSI: 5
SCSI device sda: 241254720 512-byte hdwr sectors (123522 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
SCSI device sda: 241254720 512-byte hdwr sectors (123522 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3 sda4
sd 0:0:0:0: Attached scsi disk sda
sd 0:0:0:0: Attached scsi generic sg0 type 0
pata_via 0000:00:0f.1: version 0.2.1
ACPI: PCI Interrupt 0000:00:0f.1[A] -> GSI 20 (level, low) -> IRQ 16
ata3: PATA max UDMA/133 cmd 0x000101f0 ctl 0x000103f6 bmdma 0x0001fc00 irq 14
ata4: PATA max UDMA/133 cmd 0x00010170 ctl 0x00010376 bmdma 0x0001fc08 irq 15
scsi2 : pata_via
ATA: abnormal status 0x8 on port 0x000101f7
scsi3 : pata_via
ata4.00: ATAPI, max UDMA/66
ata4.00: configured for UDMA/66
scsi 3:0:0:0: CD-ROM            PIONEER  DVD-RW  DVR-111D 1.23 PQ: 0 ANSI: 5
scsi 3:0:0:0: Attached scsi generic sg1 type 5
ACPI: PCI Interrupt 0000:00:0a.0[A] -> GSI 17 (level, low) -> IRQ 17
....

###### Test stuck DRQ on VIA-sata (disk):

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0 
         res 58/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
ata1: soft resetting port
ATA: abnormal status 0x7F on port 0x0001d807
ATA: abnormal status 0x7F on port 0x0001d807
ata1.00: configured for UDMA/100
ata1: EH complete
SCSI device sda: 241254720 512-byte hdwr sectors (123522 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA


###### Test stuck DRQ on VIA-pata (ATAPI DVD/RW):
###### Notice how the first "ata4.00: cmd ..." line is *missing*:

         res 58/00:02:00:00:02/00:00:00:00:00/40 Emask 0x2 (HSM violation)
ata4: soft resetting port
ata4.00: configured for UDMA/66
ata4: EH complete


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 18:49                       ` Mark Lord
@ 2007-04-29 19:05                         ` Mark Lord
  2007-04-30  0:59                           ` Tejun Heo
  2007-04-29 19:07                         ` Mark Lord
  1 sibling, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-29 19:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

.. And here is another test of un-hacked 2.6.21,
this time for ata_piix with a pure PATA configuration.
Again, it passes with flying colours.


ata1: PATA max UDMA/100 cmd 0x000101f0 ctl 0x000103f6 bmdma 0x0001ffa0 irq 14
ata2: PATA max UDMA/100 cmd 0x00010170 ctl 0x00010376 bmdma 0x0001ffa8 irq 15
scsi0 : ata_piix
ata1.00: ATA-7: Maxtor 6Y160L0, YAR41BW0, max UDMA/133
ata1.00: 320173056 sectors, multi 16: LBA48 
ata1.00: configured for UDMA/100
scsi1 : ata_piix
ata2.00: ATAPI, max UDMA/66
ata2.01: ATAPI, max MWDMA2
ata2.00: configured for UDMA/66
ata2.01: configured for MWDMA2
scsi 0:0:0:0: Direct-Access     ATA      Maxtor 6Y160L0   YAR4 PQ: 0 ANSI: 5
SCSI device sda: 320173056 512-byte hdwr sectors (163929 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
SCSI device sda: 320173056 512-byte hdwr sectors (163929 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3
sd 0:0:0:0: Attached scsi disk sda
scsi 1:0:0:0: CD-ROM            PIONEER  DVD-RW  DVR-111D 1.23 PQ: 0 ANSI: 5
scsi 1:0:1:0: CD-ROM            PLEXTOR  CD-R   PX-W1210A 1.08 PQ: 0 ANSI: 5

....

####### hdparm --drq-hsm-error /dev/sda:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/40 tag 0 cdb 0x0 data 0 
         res 58/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x2 (HSM violation)
ata1: soft resetting port
ata1.00: configured for UDMA/100
ata1: EH complete
SCSI device sda: 320173056 512-byte hdwr sectors (163929 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

####### hdparm --drq-hsm-error /dev/cdrom:

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata2.00: cmd a1/00:00:00:00:00/00:00:00:00:00/40 tag 0 cdb 0x1e data 0 
         res 58/00:02:00:00:02/00:00:00:00:00/40 Emask 0x2 (HSM violation)
ata2: soft resetting port
ata2.00: configured for UDMA/66
ata2.01: configured for MWDMA2
ata2: EH complete


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 18:49                       ` Mark Lord
  2007-04-29 19:05                         ` Mark Lord
@ 2007-04-29 19:07                         ` Mark Lord
  2007-04-30  0:54                           ` Tejun Heo
  1 sibling, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-29 19:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
>
> ###### Test stuck DRQ on VIA-sata (disk):
> 
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0 
>         res 58/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)

Why do we not always put a '\n' in front of that last line above ??
Sometimes it seems to have it, and lots of times it does not have a '\n'.
Weird.

> ###### Test stuck DRQ on VIA-pata (ATAPI DVD/RW):
> ###### Notice how the first "ata4.00: cmd ..." line is *missing*:
> 
>         res 58/00:02:00:00:02/00:00:00:00:00/40 Emask 0x2 (HSM violation)
> ata4: soft resetting port
> ata4.00: configured for UDMA/66
> ata4: EH complete

And in this case, the first line of diagnostics (the "cmd" line)
is always missing.  Why?

-ml

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 19:07                         ` Mark Lord
@ 2007-04-30  0:54                           ` Tejun Heo
  2007-04-30  3:42                             ` Mark Lord
  2007-04-30 17:47                             ` Mark Lord
  0 siblings, 2 replies; 37+ messages in thread
From: Tejun Heo @ 2007-04-30  0:54 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Mark Lord wrote:
>>
>> ###### Test stuck DRQ on VIA-sata (disk):
>>
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
>>         res 58/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
> 
> Why do we not always put a '\n' in front of that last line above ??
> Sometimes it seems to have it, and lots of times it does not have a '\n'.
> Weird.
> 
>> ###### Test stuck DRQ on VIA-pata (ATAPI DVD/RW):
>> ###### Notice how the first "ata4.00: cmd ..." line is *missing*:
>>
>>         res 58/00:02:00:00:02/00:00:00:00:00/40 Emask 0x2 (HSM violation)
>> ata4: soft resetting port
>> ata4.00: configured for UDMA/66
>> ata4: EH complete
> 
> And in this case, the first line of diagnostics (the "cmd" line)
> is always missing.  Why?

Hmmm... that's very weird.  I've never seen such problems.  The report
messages are printed in ata_eh_report() and both the cmd and res lines
are printed by a single invocation of printk().  Is the log captured using
a serial console?  I think it could be a transmission error or buffer
overflow on the serial link.

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 19:05                         ` Mark Lord
@ 2007-04-30  0:59                           ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2007-04-30  0:59 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> .. And here is another test of un-hacked 2.6.21,
> this time for ata_piix with a pure PATA configuration.
> Again, it passes with flying colours.

Thanks a lot.  I'd also like to try but I'm on the road and not bored
enough (yet) to do that on my only working machine.  It's good to know
that SRST is a strong enough kick in the pants for actual ATA devices.
So, we only have to fix that stubborn ata_piix.  :-)

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-30  0:54                           ` Tejun Heo
@ 2007-04-30  3:42                             ` Mark Lord
  2007-04-30  3:58                               ` Tejun Heo
  2007-04-30 17:47                             ` Mark Lord
  1 sibling, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-30  3:42 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Tejun Heo wrote:
>
> Hmmm... that's very weird.  I've never seen such problems.  The report
> messages are printed in ata_eh_report() and both the cmd and res lines
> are printed by a single invocation of printk().  Is the log captured using
> a serial console?  I think it could be a transmission error or buffer
> overflow on the serial link.

Naw, just "dmesg > file", so it should be a pretty reliable capture.
Just odd.  Something weird.

Cheers


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-30  3:42                             ` Mark Lord
@ 2007-04-30  3:58                               ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2007-04-30  3:58 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Tejun Heo wrote:
>>
>> Hmmm... that's very weird.  I've never seen such problems.  The report
>> messages are printed in ata_eh_report() and both the cmd and res lines
>> are printed by a single invocation of printk().  Is the log captured using
>> a serial console?  I think it could be a transmission error or buffer
>> overflow on the serial link.
> 
> Naw, just "dmesg > file", so it should be a pretty reliable capture.
> Just odd.  Something weird.

The format string is pretty long with a lot of parameters.  Maybe we're
hitting some obscure bug in printk and friends?

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-30  0:54                           ` Tejun Heo
  2007-04-30  3:42                             ` Mark Lord
@ 2007-04-30 17:47                             ` Mark Lord
  2007-05-01  0:23                               ` Mark Lord
  1 sibling, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-04-30 17:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Tejun Heo wrote:
> Mark Lord wrote:
>> Mark Lord wrote:
>>> ###### Test stuck DRQ on VIA-sata (disk):
>>>
>>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>>> ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
>>>         res 58/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
>> Why do we not always put a '\n' in front of that last line above ??
>> Sometimes it seems to have it, and lots of times it does not have a '\n'.
>> Weird.
>>
>>> ###### Test stuck DRQ on VIA-pata (ATAPI DVD/RW):
>>> ###### Notice how the first "ata4.00: cmd ..." line is *missing*:
>>>
>>>         res 58/00:02:00:00:02/00:00:00:00:00/40 Emask 0x2 (HSM violation)
>>> ata4: soft resetting port
>>> ata4.00: configured for UDMA/66
>>> ata4: EH complete
>> And in this case, the first line of diagnostics (the "cmd" line)
>> is always missing.  Why?
> 
> Hmmm... that's very weird.  I've never seen such problems. 

Well, from looking at the code, we see that the last thing
before the "res" line is a "%s" for dma_str[qc->dma_dir].
If qc->dma_dir is corrupted (or just not set), then we'll get
semi-random garbage, which must be what's happening here.

The easy fix is to do this:

-                        dma_str[qc->dma_dir],
+                        dma_str[qc->dma_dir & 3],

We should do that regardless, as it's just safe programming.
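
To illustrate the masking idea, here is a minimal standalone sketch.  It
assumes a table of exactly four strings (masking with 3 only bounds the
index when the table size is a power of two).  The demo_* names and the
table contents are made up for illustration; this is not the libata source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the four DMA direction values (0..3). */
enum demo_dma_dir {
	DEMO_DMA_BIDIRECTIONAL = 0,
	DEMO_DMA_TO_DEVICE     = 1,
	DEMO_DMA_FROM_DEVICE   = 2,
	DEMO_DMA_NONE          = 3,
};

/* Illustrative string table with exactly four entries, one per direction. */
static const char *const demo_dma_str[4] = {
	[DEMO_DMA_BIDIRECTIONAL] = "bidi",
	[DEMO_DMA_TO_DEVICE]     = "out",
	[DEMO_DMA_FROM_DEVICE]   = "in",
	[DEMO_DMA_NONE]          = "",
};

/*
 * Mask the index to its low two bits so that a corrupted or unset value
 * can never read past the end of the four-entry table.
 */
const char *demo_dma_dir_name(unsigned int dir)
{
	return demo_dma_str[dir & 3];
}
```

A garbage value then maps onto one of the four valid strings instead of
an arbitrary pointer, so the "%s" in the format can't run off into the weeds.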

Tejun:  I don't have an up-to-date GIT tree here at the moment,
so perhaps you could generate a patch to put this fix into your tree for Jeff ?

I'll try and test it here first, and post again after I've done so.

Secondly, I might later have a look and see why qc->dma_dir
doesn't have a proper value..

Thanks


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-30 17:47                             ` Mark Lord
@ 2007-05-01  0:23                               ` Mark Lord
  2007-05-01  2:47                                 ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-05-01  0:23 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Tejun Heo wrote:
>> Mark Lord wrote:
>>> Mark Lord wrote:
>>>> ###### Test stuck DRQ on VIA-sata (disk):
>>>>
>>>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>>>> ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
>>>>         res 58/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM 
>>>> violation)
>>> Why do we not always put a '\n' in front of that last line above ??
>>> Sometimes it seems to have it, and lots of times it does not have a 
>>> '\n'.
>>> Weird.
>>>
>>>> ###### Test stuck DRQ on VIA-pata (ATAPI DVD/RW):
>>>> ###### Notice how the first "ata4.00: cmd ..." line is *missing*:
>>>>
>>>>         res 58/00:02:00:00:02/00:00:00:00:00/40 Emask 0x2 (HSM 
>>>> violation)
>>>> ata4: soft resetting port
>>>> ata4.00: configured for UDMA/66
>>>> ata4: EH complete
>>> And in this case, the first line of diagnostics (the "cmd" line)
>>> is always missing.  Why?
..
> Well, from looking at the code, we see that the last thing
> before the "res" line is a "%s" for dma_str[qc->dma_dir].
> If qc->dma_dir is corrupted (or just not set), then we'll get
> semi-random garbage, which must be what's happening here.

WRONG.  The qc->dma_dir turns out to be just fine (3) in this case.

And.. the messages look fine with "dmesg", but syslogd records only
the "res.." line in /var/log/messages.

Too kooky for me.


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-05-01  0:23                               ` Mark Lord
@ 2007-05-01  2:47                                 ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2007-05-01  2:47 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Mark Lord wrote:
>> Tejun Heo wrote:
>>> Mark Lord wrote:
>>>> Mark Lord wrote:
>>>>> ###### Test stuck DRQ on VIA-sata (disk):
>>>>>
>>>>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>>>>> ata1.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
>>>>>         res 58/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM
>>>>> violation)
>>>> Why do we not always put a '\n' in front of that last line above ??
>>>> Sometimes it seems to have it, and lots of times it does not have a
>>>> '\n'.
>>>> Weird.
>>>>
>>>>> ###### Test stuck DRQ on VIA-pata (ATAPI DVD/RW):
>>>>> ###### Notice how the first "ata4.00: cmd ..." line is *missing*:
>>>>>
>>>>>         res 58/00:02:00:00:02/00:00:00:00:00/40 Emask 0x2 (HSM
>>>>> violation)
>>>>> ata4: soft resetting port
>>>>> ata4.00: configured for UDMA/66
>>>>> ata4: EH complete
>>>> And in this case, the first line of diagnostics (the "cmd" line)
>>>> is always missing.  Why?
> ..
>> Well, from looking at the code, we see that the last thing
>> before the "res" line is a "%s" for dma_str[qc->dma_dir].
>> If qc->dma_dir is corrupted (or just not set), then we'll get
>> semi-random garbage, which must be what's happening here.
> 
> WRONG.  The qc->dma_dir turns out to be just fine (3) in this case.
> 
> And.. the messages look fine with "dmesg", but syslogd records only
> the "res.." line in /var/log/messages.

Oh well, it's probably throttling or just eating messages at its whim.  :-)

-- 
tejun


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-04-29 16:47                     ` Mark Lord
  2007-04-29 18:49                       ` Mark Lord
@ 2007-05-01 13:00                       ` Mark Lord
  2007-05-11  3:33                         ` Mark Lord
  1 sibling, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-05-01 13:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Tejun Heo wrote:
>> So, this is specific to SATA (the host side at least) piix && PIO READ,
>> right?  I think we can fit this code nicely into
>> piix_sata_error_handler() if we make sure that it triggers under the
>> right condition - after a PIO READ command fails due to HSM violation
>> caused by stuck DRQ.
> 
> Yeah, so far it's just PIO FROM DEVICE on a "SATA" device on ata_piix.
> It *may* be more widespread than that, but we'll have to test some others.

I retested this again today on my new pure-SATA notebook with ata_piix.
In this case, the DRQ drain is not necessary, but also doesn't harm anything.
Tested it both ways.  This is with a Hitachi HTS541612J9SA00 SATA drive.

The original fault was on ata_piix SATA, with some kind of external
bridge (on the motherboard) to a Seagate PATA drive.  Sometime in the
next few days I'll have the exact same drive, but with a SATA interface,
and we'll try that in the pure-SATA situation.

This will tell us whether it's the bridge, or the drive, that was the issue.

The fix remains the same: drain the data fifo when DRQ is left high.
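
For anyone following along, the general shape of such a drain loop is
roughly as below.  This is only a sketch against a simulated device, not
the actual patch: the demo_* names, the simulated status values, and the
word limit are all assumptions made for illustration.  The only real
constant is the DRQ bit (0x08) in the ATA status register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ATA_DRQ 0x08	/* DRQ bit in the ATA status register */

/* Hypothetical simulated device: DRQ stays set while words remain queued. */
struct demo_dev {
	size_t words_left;
};

/* Simulated status read: 0x58 with DRQ set, 0x50 once the fifo is empty. */
uint8_t demo_read_status(const struct demo_dev *dev)
{
	return dev->words_left ? (0x50 | DEMO_ATA_DRQ) : 0x50;
}

/* Simulated 16-bit data port read: consumes one queued word, if any. */
uint16_t demo_read_data(struct demo_dev *dev)
{
	if (dev->words_left)
		dev->words_left--;
	return 0;
}

/*
 * Read and discard data words while DRQ is asserted, giving up after
 * max_words so a truly wedged device cannot hang the caller forever.
 * Returns the number of words discarded.
 */
size_t demo_drain_drq(struct demo_dev *dev, size_t max_words)
{
	size_t n = 0;

	while (n < max_words && (demo_read_status(dev) & DEMO_ATA_DRQ)) {
		(void)demo_read_data(dev);
		n++;
	}
	return n;
}
```

The upper bound matters: if DRQ never drops, the loop still terminates
and the caller can fall back to a reset.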


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-05-01 13:00                       ` Mark Lord
@ 2007-05-11  3:33                         ` Mark Lord
  2007-05-11  3:35                           ` Mark Lord
  0 siblings, 1 reply; 37+ messages in thread
From: Mark Lord @ 2007-05-11  3:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Mark Lord wrote:
>> Tejun Heo wrote:
>>> So, this is specific to SATA (the host side at least) piix && PIO READ,
>>> right?  I think we can fit this code nicely into
>>> piix_sata_error_handler() if we make sure that it triggers under the
>>> right condition - after a PIO READ command fails due to HSM violation
>>> caused by stuck DRQ.
>>
>> Yeah, so far it's just PIO FROM DEVICE on a "SATA" device on ata_piix.
>> It *may* be more widespread than that, but we'll have to test some 
>> others.
> 
> I retested this again today on my new pure-SATA notebook with ata_piix.
> In this case, the DRQ drain is not necessary, but also doesn't harm 
> anything.
> Tested it both ways.  This is with a Hitachi HTS541612J9SA00 SATA drive.
> 
> The original fault was on ata_piix SATA, with some kind of external
> bridge (on the motherboard) to a Seagate PATA drive.  Sometime in the
> next few days I'll have the exact same drive, but with a SATA interface,
> and we'll try that in the pure-SATA situation.
> 
> This will tell us whether it's the bridge, or the drive, that was the 
> issue.
> 
> The fix remains the same: drain the data fifo when DRQ is left high.

Okay, I finally got round to testing this with the new pure-SATA
notebook I have here.  Same problem:  without draining the DRQ fifo,
the system *never* recovers.

But with the patch to drain DRQ, all is well.  That patch is now a keeper
for my own kernels.  Tejun, did you want to cook up a better-placed variant
of it for mainline?  I'm away for a few days now..

Cheers


* Re: libata fails to recover from HSM violation involving DRQ status
  2007-05-11  3:33                         ` Mark Lord
@ 2007-05-11  3:35                           ` Mark Lord
  0 siblings, 0 replies; 37+ messages in thread
From: Mark Lord @ 2007-05-11  3:35 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Garzik, Alan Cox, Alan Cox, IDE/ATA development list

Mark Lord wrote:
> Mark Lord wrote:
>
>> I retested this again today on my new pure-SATA notebook with ata_piix.
>> In this case, the DRQ drain is not necessary, but also doesn't harm 
>> anything.
>> Tested it both ways.  This is with a Hitachi HTS541612J9SA00 SATA drive.
>>
>> The original fault was on ata_piix SATA, with some kind of external
>> bridge (on the motherboard) to a Seagate PATA drive.  Sometime in the
>> next few days I'll have the exact same drive, but with a SATA interface,
>> and we'll try that in the pure-SATA situation.
>>
>> This will tell us whether it's the bridge, or the drive, that was the 
>> issue.
>>
>> The fix remains the same: drain the data fifo when DRQ is left high.
> 
> Okay, I finally got round to testing this with the new pure-SATA
> notebook I have here.  Same problem:  without draining the DRQ fifo,
> the system *never* recovers.
> 
> But with the patch to drain DRQ, all is well.  That patch is now a keeper
> for my own kernels.  Tejun, did you want to cook up a better-placed variant
> of it for mainline?  I'm away for a few days now..

A note for anyone confused by my two postings above:
The DRQ drain *is* needed for the Seagate notebook drives (PATA/SATA),
but not for the Hitachi notebook SATA drive I also have here.

Cheers


end of thread, other threads:[~2007-05-11  3:35 UTC | newest]

Thread overview: 37+ messages
2007-04-28 20:15 libata fails to recover from HSM violation involving DRQ status Mark Lord
2007-04-28 20:18 ` Mark Lord
2007-04-28 20:30 ` Alan Cox
2007-04-28 20:37 ` Jeff Garzik
2007-04-28 20:44   ` Mark Lord
2007-04-28 20:50     ` Jeff Garzik
2007-04-28 21:25   ` Alan Cox
2007-04-28 21:35     ` Mark Lord
2007-04-28 21:38     ` Jeff Garzik
2007-04-28 21:41       ` Mark Lord
2007-04-29  3:17         ` Tejun Heo
2007-04-29  3:46           ` Jeff Garzik
2007-04-29  7:45             ` Tejun Heo
2007-04-29  3:51           ` Tejun Heo
2007-04-29 11:56             ` Mark Lord
2007-04-29 12:59               ` Mark Lord
2007-04-29 13:13                 ` Mark Lord
2007-04-29 16:42                   ` Tejun Heo
2007-04-29 16:47                     ` Mark Lord
2007-04-29 18:49                       ` Mark Lord
2007-04-29 19:05                         ` Mark Lord
2007-04-30  0:59                           ` Tejun Heo
2007-04-29 19:07                         ` Mark Lord
2007-04-30  0:54                           ` Tejun Heo
2007-04-30  3:42                             ` Mark Lord
2007-04-30  3:58                               ` Tejun Heo
2007-04-30 17:47                             ` Mark Lord
2007-05-01  0:23                               ` Mark Lord
2007-05-01  2:47                                 ` Tejun Heo
2007-05-01 13:00                       ` Mark Lord
2007-05-11  3:33                         ` Mark Lord
2007-05-11  3:35                           ` Mark Lord
2007-04-29 12:07           ` Mark Lord
2007-04-29 16:36             ` Tejun Heo
2007-04-28 23:56       ` Alan Cox
2007-04-28 22:09 ` Mark Lord
2007-04-29  3:04   ` Tejun Heo
