linux-kernel.vger.kernel.org archive mirror
* Is it safe to ignore UDMA BadCRC errors?
@ 2003-12-29 16:07 Jonathan Kamens
  2003-12-29 16:12 ` Jonathan Kamens
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Jonathan Kamens @ 2003-12-29 16:07 UTC (permalink / raw)
  To: linux-kernel

The topic of CRC errors from IDE drives has been discussed numerous
times on this list, and I've reviewed those discussions, but I'm still
not 100% certain of the answer to this question: Is it safe for me to
ignore occasional CRC errors from my drive?

Here are the details....

The errors look like this:

  hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
  hde: dma_intr: error=0x84 { DriveStatusError BadCRC }

They don't seem to happen often enough to convince the kernel to back
down to a slower UDMA mode.

PCI IDE controller:

  PDC20262: IDE controller at PCI slot 00:0f.0
  PDC20262: chipset revision 1
  PDC20262: not 100% native mode: will probe irqs later
  PDC20262: ROM enabled at 0xfebe0000
  PDC20262: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
      ide2: BM-DMA at 0xef00-0xef07, BIOS settings: hde:DMA, hdf:DMA
      ide3: BM-DMA at 0xef08-0xef0f, BIOS settings: hdg:DMA, hdh:DMA

Seagate 160GB hard drive getting the occasional errors:

  /dev/hde:
   multcount    = 16 (on)
   IO_support   =  0 (default 16-bit)
   unmaskirq    =  1 (on)
   using_dma    =  1 (on)
   keepsettings =  0 (off)
   readonly     =  0 (off)
   readahead    =  8 (on)
   geometry     = 19457/255/63, sectors = 312581808, start = 0

  /dev/hde:

   Model=ST3160021A, FwRev=3.04, SerialNo=3JS0X3MB
   Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
   RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
   BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16
   CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455
   IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
   PIO modes:  pio0 pio1 pio2 pio3 pio4 
   DMA modes:  mdma0 mdma1 mdma2 
   UDMA modes: udma0 udma1 udma2 udma3 *udma4 udma5 
   AdvancedPM=no WriteCache=enabled
   Drive conforms to: ATA/ATAPI-6 T13 1410D revision 2: 

   * signifies the current active mode

Flat, keyed, two-position ribbon cable which looks to be in good
condition.  There is no hdf on the same channel as hde (i.e., the
slave position on the cable is empty).

Replacing the cable with a shielded, round, single-position, keyed,
ATA100/133 cable didn't get rid of the errors (and in fact seemed to
make the drive behave worse, which may mean it wasn't a very good
shielded round cable; speaking of which, can anyone recommend a good,
reliable brand of IDE cables).

Reducing to UDMA3 by adding "-X udma3" to the hdparm invocation on
reboot didn't get rid of the errors.
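
For reference, hdparm's -X option also accepts a numeric transfer-mode
value: per the hdparm manual, PIO mode n is 8+n, multiword DMA mode n is
32+n, and Ultra DMA mode n is 64+n.  The sketch below illustrates that
mapping for the modes this drive advertises.  The helper name
xfer_mode_number and the rounded nominal burst rates are illustrative
assumptions, not part of hdparm itself.

```python
import re

# Base offsets for hdparm's numeric -X argument (per the hdparm manual).
_BASE = {"pio": 8, "mdma": 32, "udma": 64}

# Nominal peak burst rates in MB/s for each Ultra DMA mode (rounded).
UDMA_MBPS = {0: 16.7, 1: 25.0, 2: 33.3, 3: 44.4, 4: 66.7, 5: 100.0}


def xfer_mode_number(name):
    """Hypothetical helper: map a name such as 'udma3' to the -X number."""
    match = re.fullmatch(r"(pio|mdma|udma)(\d)", name)
    if match is None:
        raise ValueError(f"unrecognised mode name: {name!r}")
    family, level = match.group(1), int(match.group(2))
    return _BASE[family] + level


for mode in ("udma3", "udma4", "udma5"):
    level = int(mode[-1])
    print(mode, "-> -X", xfer_mode_number(mode),
          f"({UDMA_MBPS[level]} MB/s nominal)")
```

Stepping from udma4 down to udma3 only lowers the nominal burst rate
from about 66.7 to about 44.4 MB/s, so it may not be a large enough
change to mask a marginal signal path.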

Rerouting the various IDE cables to prevent them from running parallel
to each other didn't get rid of the errors.

Should I be continuing to pursue this, or should I just ignore it if
the performance of the drive is good and the errors only happen
occasionally?

Thanks for any advice you can provide.

  Jonathan Kamens

* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-29 16:07 Is it safe to ignore UDMA BadCRC errors? Jonathan Kamens
@ 2003-12-29 16:12 ` Jonathan Kamens
  2003-12-29 19:52 ` Eric D. Mudama
  2004-01-15  2:21 ` Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?) Jonathan Kamens
  2 siblings, 0 replies; 23+ messages in thread
From: Jonathan Kamens @ 2003-12-29 16:12 UTC (permalink / raw)
  To: linux-kernel

I suppose if I were less of an idiot I would have mentioned my kernel
version in my message :-).  I'm using 2.4.22-ac4.  Here are the IDE
and PDC settings that are enabled in my .config:

  CONFIG_IDE=y
  CONFIG_BLK_DEV_IDE=y
  CONFIG_BLK_DEV_IDEDISK=y
  CONFIG_IDEDISK_MULTI_MODE=y
  CONFIG_BLK_DEV_IDECD=y
  CONFIG_BLK_DEV_IDETAPE=m
  CONFIG_BLK_DEV_IDESCSI=m
  CONFIG_BLK_DEV_IDEPCI=y
  CONFIG_IDEPCI_SHARE_IRQ=y
  CONFIG_BLK_DEV_IDEDMA_PCI=y
  CONFIG_IDEDMA_PCI_AUTO=y
  CONFIG_BLK_DEV_IDEDMA=y
  CONFIG_BLK_DEV_PDC202XX_OLD=y
  CONFIG_IDEDMA_AUTO=y
  CONFIG_BLK_DEV_PDC202XX=y
  CONFIG_BLK_DEV_IDE_MODES=y

Thanks again,

  jik

* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-29 16:07 Is it safe to ignore UDMA BadCRC errors? Jonathan Kamens
  2003-12-29 16:12 ` Jonathan Kamens
@ 2003-12-29 19:52 ` Eric D. Mudama
  2003-12-29 20:24   ` Florian Schuele
  2003-12-30 11:38   ` Jonathan Kamens
  2004-01-15  2:21 ` Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?) Jonathan Kamens
  2 siblings, 2 replies; 23+ messages in thread
From: Eric D. Mudama @ 2003-12-29 19:52 UTC (permalink / raw)
  To: linux-kernel

On Mon, Dec 29 at 11:07, Jonathan Kamens wrote:
>The topic of CRC errors from IDE drives has been discussed numerous
>times on this list, and I've reviewed those discussions, but I'm still
>not 100% certain of the answer to this question: Is it safe for me to
>ignore occasional CRC errors from my drive?
>
>Here are the details....
>
>The errors look like this:
>
>  hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
>  hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
>
>They don't seem to happen often enough to convince the kernel to back
>down to a slower UDMA mode.

0x5184 is the error code for when the drive sends you data that was
corrupted during transmission over the cable.  In general, nothing is
wrong with your drive, and a re-read from the drive will almost always
produce the proper data.
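
As an illustration of how the two bytes break down, here is a minimal
sketch that decodes the status and error registers into the labels
seen in the log lines above.  The bit masks are the standard ATA
register bits; the labels that do not appear in this thread (for
example "BadSector" or "AddrMarkNotFound") are assumptions about what
the driver would print, and the function names are made up for this
example.

```python
# Status register bits, highest bit first, using the labels the kernel prints.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the names of all bits set in the ATA status register."""
    return [name for mask, name in STATUS_BITS if status & mask]


def decode_error(error):
    """Return the names of the bits set in the ATA error register."""
    names = []
    aborted = bool(error & 0x04)
    if aborted:
        names.append("DriveStatusError")
    if error & 0x80:
        # Bit 7 is read as the interface CRC flag when the command was
        # also aborted; otherwise it keeps its older "bad block" meaning.
        names.append("BadCRC" if aborted else "BadSector")
    if error & 0x40:
        names.append("UncorrectableError")
    if error & 0x10:
        names.append("SectorIdNotFound")
    if error & 0x02:
        names.append("TrackZeroNotFound")
    if error & 0x01:
        names.append("AddrMarkNotFound")
    return names


print(decode_status(0x51), decode_error(0x84))
print(decode_status(0x51), decode_error(0x04))
```

Seen this way, 0x84 is the abort bit (0x04) plus the CRC bit (0x80),
while a bare 0x04 is only an aborted command.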

Odds are your cable is bad, regardless of how "good" it looks; you
really can't tell if you have marginal conductivity on a pin or
something else weird.  In my home system I replace the IDE cables
every few years, on my test box at work I replace them every month
since I'm doing lots of re-plugging of drives. Note that a bad cable
is *dangerous* to your filesystem, since a PIO transfer to the drive
has *no* integrity checking on the cable!

Also, those "round" cables violate the ATA spec; I can't really
recommend using them unless airflow is your #1 concern, and in
that case you're probably better off buying a SATA drive.

Generic IDE ribbon cables (between 6" and 18") seem to work fine for
most people, just go buy another $2 cable from CompUSA and see if the
problem goes away.

FYI, UDMA4 isn't that fast, only 66MB/sec... "good" (functional, not
brand name) flat cables should be able to do 100MB/sec trivially.

--eric

-- 
Eric D. Mudama
edmudama@mail.bounceswoosh.org


* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-29 19:52 ` Eric D. Mudama
@ 2003-12-29 20:24   ` Florian Schuele
  2003-12-29 20:34     ` Eric D. Mudama
  2003-12-30 11:38   ` Jonathan Kamens
  1 sibling, 1 reply; 23+ messages in thread
From: Florian Schuele @ 2003-12-29 20:24 UTC (permalink / raw)
  To: linux-kernel

On 29.12.03 12:52 -0700, Eric D. Mudama wrote:

> On Mon, Dec 29 at 11:07, Jonathan Kamens wrote:
> >The topic of CRC errors from IDE drives has been discussed numerous
> >times on this list, and I've reviewed those discussions, but I'm still
> >not 100% certain of the answer to this question: Is it safe for me to
> >ignore occasional CRC errors from my drive?
> >
> >Here are the details....
> >
> >The errors look like this:
> >
> > hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> > hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
> >
> >They don't seem to happen often enough to convince the kernel to back
> >down to a slower UDMA mode.
> 
> 0x5184 is the error code for when the drive sends you data that was
> corrupted during transmission over the cable.  In general, nothing is
> wrong with your drive, and a re-read from the drive will almost always
> produce the proper data.
> 
> Odds are your cable is bad, regardless of how "good" it looks; you
> really can't tell if you have marginal conductivity on a pin or
> something else weird.  In my home system I replace the IDE cables
> every few years, on my test box at work I replace them every month
> since I'm doing lots of re-plugging of drives. Note that a bad cable
> is *dangerous* to your filesystem, since a PIO transfer to the drive
> has *no* integrity checking on the cable!
> 
> Also, those "round" cables violate the ATA spec; I can't really
> recommend using them unless airflow is your #1 concern, and in
> that case you're probably better off buying a SATA drive.
> 
> Generic IDE ribbon cables (between 6" and 18") seem to work fine for
> most people, just go buy another $2 cable from CompUSA and see if the
> problem goes away.
> 
> FYI, UDMA4 isn't that fast, only 66MB/sec... "good" (functional, not
> brand name) flat cables should be able to do 100MB/sec trivially.

i wrote a mail to this list a few days ago.
i have the same error messages as the above.
but _only_ with kernel 2.6.0, _not_ with 2.4.20 ...
that's strange, isn't it?
after a little traffic on the hd's the system freezes.

-- 

florian schuele

* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-29 20:24   ` Florian Schuele
@ 2003-12-29 20:34     ` Eric D. Mudama
  2003-12-29 20:52       ` Florian Schuele
  0 siblings, 1 reply; 23+ messages in thread
From: Eric D. Mudama @ 2003-12-29 20:34 UTC (permalink / raw)
  To: linux-kernel

On Mon, Dec 29 at 21:24, Florian Schuele wrote:
>i wrote a mail to this list a few days ago.
>i have the same error messages as the above.
>but _only_ with kernel 2.6.0, _not_ with 2.4.20 ...
>that's strange, isn't it?
>after a little traffic on the hd's the system freezes.

Your error message wasn't the same.

You reported status 0x51 with error code 0x04.  That is the generic
catch-all error message, which usually refers to aborted commands.
Unfortunately, it's tough to know what command the system was
attempting to issue when your drive aborted the command.  There may be
some IDE debugging code you can enable to attempt to find out, but I
don't know exactly how to do that, a linux IDE person will need to
speak up... i'm just a generic IDE person.

The previous poster reported 0x51 with error code 0x84; the 0x80 bit
specifically indicates UDMA CRC errors, and I diagnosed it as such.

--
Eric D. Mudama
edmudama@mail.bounceswoosh.org


* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-29 20:34     ` Eric D. Mudama
@ 2003-12-29 20:52       ` Florian Schuele
  0 siblings, 0 replies; 23+ messages in thread
From: Florian Schuele @ 2003-12-29 20:52 UTC (permalink / raw)
  To: linux-kernel

On 29.12.03 13:34 -0700, Eric D. Mudama wrote:

> On Mon, Dec 29 at 21:24, Florian Schuele wrote:
> >i wrote a mail to this list a few days ago.
> >i have the same error messages as the above.
> >but _only_ with kernel 2.6.0, _not_ with 2.4.20 ...
> >that's strange, isn't it?
> >after a little traffic on the hd's the system freezes.
> 
> Your error message wasn't the same.

oops, sorry... i was too fast with my post...

> 
> You reported status 0x51 with error code 0x04.  That is the generic
> catch-all error message, which usually refers to aborted commands.
> Unfortunately, it's tough to know what command the system was
> attempting to issue when your drive aborted the command.  There may be
> some IDE debugging code you can enable to attempt to find out, but I
> don't know exactly how to do that, a linux IDE person will need to
> speak up... i'm just a generic IDE person.

ah, ok, perhaps i'll find a way to debug or someone else who can help.

> 
> The previous poster reported 0x51 with error code 0x84; the 0x80 bit
> specifically indicates UDMA CRC errors, and I diagnosed it as such.
> 

-- 

florian schuele

* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-29 19:52 ` Eric D. Mudama
  2003-12-29 20:24   ` Florian Schuele
@ 2003-12-30 11:38   ` Jonathan Kamens
  2003-12-30 20:06     ` Eric D. Mudama
  1 sibling, 1 reply; 23+ messages in thread
From: Jonathan Kamens @ 2003-12-30 11:38 UTC (permalink / raw)
  To: linux-kernel

Eric D. Mudama writes:
 > Odds are your cable is bad, regardless of how "good" it looks; you
 > really can't tell if you have marginal conductivity on a pin or
 > something else weird.

I replaced the cable with a brand new one and I'm still getting the
CRC errors.  What now?

Thanks,

  jik

* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-30 11:38   ` Jonathan Kamens
@ 2003-12-30 20:06     ` Eric D. Mudama
  2003-12-30 20:11       ` Jonathan Kamens
  2003-12-30 20:14       ` Ed Sweetman
  0 siblings, 2 replies; 23+ messages in thread
From: Eric D. Mudama @ 2003-12-30 20:06 UTC (permalink / raw)
  To: linux-kernel

On Tue, Dec 30 at  6:38, Jonathan Kamens wrote:
>I replaced the cable with a brand new one and I'm still getting the
>CRC errors.  What now?

Do you know if you're doing reads or writes when the errors occur?

What motherboard/chipset are you using with this drive?

Is the environment especially warm?

-- 
Eric D. Mudama
edmudama@mail.bounceswoosh.org


* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-30 20:06     ` Eric D. Mudama
@ 2003-12-30 20:11       ` Jonathan Kamens
  2003-12-30 20:25         ` Eric D. Mudama
  2003-12-30 20:14       ` Ed Sweetman
  1 sibling, 1 reply; 23+ messages in thread
From: Jonathan Kamens @ 2003-12-30 20:11 UTC (permalink / raw)
  To: Eric D. Mudama; +Cc: linux-kernel

Eric D. Mudama writes:
 > Date: 	Tue, 30 Dec 2003 13:06:43 -0700
 > 
 > On Tue, Dec 30 at  6:38, Jonathan Kamens wrote:
 > >I replaced the cable with a brand new one and I'm still getting the
 > >CRC errors.  What now?
 > 
 > Do you know if you're doing reads or writes when the errors occur?

I've been led to believe, from a response in this thread, that the
errors must be generated by reads, because someone asserted that no
CRC checking is done on writes.

 > What motherboard/chipset are you using with this drive?

The motherboard is a SuperMicro S2DGU, but the IDE controller having
the problem is not on the motherboard, it's a Promise Ultra66 card.

 > Is the environment especially warm?

No, I don't think so, and I just installed a brand new Antec 550W
power supply, so I would imagine that I've got more than enough power
and the power supply fans plus the additional case and CPU fans should
be enough to cool the machine adequately.

Thanks,

  jik

* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-30 20:06     ` Eric D. Mudama
  2003-12-30 20:11       ` Jonathan Kamens
@ 2003-12-30 20:14       ` Ed Sweetman
  1 sibling, 0 replies; 23+ messages in thread
From: Ed Sweetman @ 2003-12-30 20:14 UTC (permalink / raw)
  To: Eric D. Mudama; +Cc: linux-kernel

Eric D. Mudama wrote:
> On Tue, Dec 30 at  6:38, Jonathan Kamens wrote:
> 
>> I replaced the cable with a brand new one and I'm still getting the
>> CRC errors.  What now?
> 
> 
> Do you know if you're doing reads or writes when the errors occur?
> 
> What motherboard/chipset are you using with this drive?
> 
> Is the environment especially warm?
> 

CRC errors can easily be caused by the cable being close to a magnet or 
high electricity source.   Watch out for the case speaker and any other 
speakers around the cable.  Also, make sure you're using S.M.A.R.T. to 
check the drive itself for problems.


* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-30 20:11       ` Jonathan Kamens
@ 2003-12-30 20:25         ` Eric D. Mudama
  2003-12-30 20:30           ` Jonathan Kamens
  0 siblings, 1 reply; 23+ messages in thread
From: Eric D. Mudama @ 2003-12-30 20:25 UTC (permalink / raw)
  To: linux-kernel

On Tue, Dec 30 at 15:11, Jonathan Kamens wrote:
>Eric D. Mudama writes:
> > Do you know if you're doing reads or writes when the errors occur?
>
>I've been led to believe, from a response in this thread, that the
>errors must be generated by reads, because someone asserted that no
>CRC checking is done on writes.

Actually, PIO modes (both read and write) have no error checking at
all built into them.

UDMA modes include a checksum on every transfer, for both reads and
writes.
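
To make the idea of a per-transfer checksum concrete, here is a
minimal sketch of a 16-bit CRC computed over 16-bit words.  It assumes
the generator polynomial x^16 + x^12 + x^5 + 1 and a seed of 0x4ABA,
which is my recollection of the Ultra DMA definition; the bit ordering
is a simplification, so treat this as an illustration of the principle
rather than a bit-exact model of what the hardware computes.  The
function name is made up for this example.

```python
def crc16_words(words, seed=0x4ABA, poly=0x1021):
    """Fold a sequence of 16-bit words into a 16-bit CRC (illustrative)."""
    crc = seed
    for word in words:
        crc ^= word & 0xFFFF
        for _ in range(16):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Sender and receiver each compute the CRC over the words they saw.
sent = [0x1234, 0xABCD, 0x0000, 0xFFFF]
received = list(sent)
received[2] ^= 0x0010  # one bit flipped on the cable

if crc16_words(sent) != crc16_words(received):
    print("CRC mismatch: the transfer would be flagged and can be retried")
```

A single flipped bit always changes the result of a CRC like this one,
which is why an occasional cable glitch shows up as a BadCRC error
instead of silently corrupted data.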

Actually, if the drive is reporting the checksum error, it must be a
write (duh), and therefore the drive doesn't agree with how the card
is clocking the data, or something to that effect.

IDE controllers are pretty cheap, you might want to see if you can go
get a current generation card from promise, which will do the
Seagate's udma5 mode quite easily.  Shouldn't cost more than $35 at a
retail place like CompUSA, or $20 online.

> The motherboard is a SuperMicro S2DGU, but the IDE controller having
> the problem is not on the motherboard, it's a Promise Ultra66 card.

(see above)

--eric

-- 
Eric D. Mudama
edmudama@mail.bounceswoosh.org


* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-30 20:25         ` Eric D. Mudama
@ 2003-12-30 20:30           ` Jonathan Kamens
  2003-12-30 20:48             ` Eric D. Mudama
  0 siblings, 1 reply; 23+ messages in thread
From: Jonathan Kamens @ 2003-12-30 20:30 UTC (permalink / raw)
  To: Eric D. Mudama; +Cc: linux-kernel

Eric D. Mudama writes:
 > UDMA modes include a checksum on every transfer, for both reads and
 > writes.

This contradicts what I was told previously by another subscriber to
this list.

If it is true, then it would appear that the answer to my question "Is
it safe to ignore UDMA BadCRC errors?" is "Yes."  If all transfers are
checksummed, and if all transfers with bad checksums will be retried
by the kernel, then occasional checksum errors are harmless because
they will be retried.  Do you agree?

Thanks,

  jik

* Re: Is it safe to ignore UDMA BadCRC errors?
  2003-12-30 20:30           ` Jonathan Kamens
@ 2003-12-30 20:48             ` Eric D. Mudama
  0 siblings, 0 replies; 23+ messages in thread
From: Eric D. Mudama @ 2003-12-30 20:48 UTC (permalink / raw)
  To: linux-kernel

On Tue, Dec 30 at 15:30, Jonathan Kamens wrote:
>Eric D. Mudama writes:
> > UDMA modes include a checksum on every transfer, for both reads and
> > writes.
>
>This contradicts what I was told previously by another subscriber to
>this list.
>
>If it is true, then it would appear that the answer to my question "Is
>it safe to ignore UDMA BadCRC errors?" is "Yes."  If all transfers are
>checksummed, and if all transfers with bad checksums will be retried
>by the kernel, then occasional checksum errors are harmless because
>they will be retried.  Do you agree?

No, I don't.

Your system may always detect the errors and function happily (as
designed) but to me the $30 for the security of knowing you have 100%
functioning hardware is well worth it.

--eric

-- 
Eric D. Mudama
edmudama@mail.bounceswoosh.org


* Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)
  2003-12-29 16:07 Is it safe to ignore UDMA BadCRC errors? Jonathan Kamens
  2003-12-29 16:12 ` Jonathan Kamens
  2003-12-29 19:52 ` Eric D. Mudama
@ 2004-01-15  2:21 ` Jonathan Kamens
  2004-01-16  3:47   ` Jonathan Kamens
  2 siblings, 1 reply; 23+ messages in thread
From: Jonathan Kamens @ 2004-01-15  2:21 UTC (permalink / raw)
  To: linux-kernel

Hello everyone,

I'd like to provide an update on my efforts to understand what causes
"DriveStatusError BadCRC" errors when using UDMA drives, how to debug
these errors in general, the specific progress I've made at resolving
these errors on my system, and subsequent problems I've encountered
when doing so.

Recall that I was getting these errors from 2.4.22-ac4 on my
dual-processor (550MHz Pentium III Katmai) system:

  hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
  hde: dma_intr: error=0x84 { DriveStatusError BadCRC }

from a Seagate 160GB drive (ST3160021A) plugged into its own channel
on a Promise Ultra66 (PDC20262) controller.

The suggestion most frequently given to me and others for resolving
BadCRC errors is to replace the IDE cable with one that conforms to
the Ultra ATA spec (80-conductor, flat, two drive connectors, single
drive connected to the end connector).  I tried several such cables,
none of which made the BadCRC errors go away.

Other suggestions given to me included:

* Make sure the IDE cable is not running parallel to another cable.
* Make sure the cable is not passing near magnets inside the case,
  e.g., speaker magnets.
* Update the IDE controller's firmware.
* Check to make sure the PCI bus speed is valid (33MHz, normally).
* Make sure the PCI latency timer is set in the BIOS to at least 64.

I tried all of these suggestions, and none of them worked.

I tried swapping the drives on the controller's two channels, and the
BadCRC errors traveled with the drive.  Then I swapped the cables on
the two channels, and the errors still remained on the same drive.

Next, I bought a SIIG Ultra ATA 133 controller, compiled SIIMAGE
support into my kernel, plugged in the new SIIG controller, and moved
the drive getting the BadCRC errors over to it.  They stopped -- I
haven't seen a single BadCRC error since I moved the drive to the SIIG
controller a couple of weeks ago.

Alas, another problem has presented itself.  Twice after I installed
the SIIG controller and moved the Seagate drive to it, my system hung
(all activity seemed to stop, syslogd stopped logging, X server
stopped responding, couldn't switch VTs).  Both times, Alt-SysRq-s and
Alt-SysRq-u appeared to have no effect, but Alt-SysRq-b successfully
rebooted the system.  I couldn't get any more information because I
don't have a serial console and my monitor was in X when the hang
happened; since I couldn't switch VTs I couldn't get to one where the
magic SysRq sequences would display information.

After the second hang, I tried two more things -- moving the other
drive to the SIIG controller, such that the Promise controller no
longer has any drives on it (but it's still plugged in, and also, my
motherboard's PIIX4 controller still has a hard drive, CD-ROM and
OnStream DI-30 drive plugged into it as hda, hdc and hdd
respectively), and turning off unmask IRQ for the drives on the SIIG
controller, as suggested in other messages here.  Unfortunately, even
with these two additional steps, I'm still seeing kernel hangs, albeit
seemingly less frequently -- I just had another one about an hour ago.

I've just enabled the NMI watchdog, compiled software watchdog support
into my kernel and installed and enabled the watchdog daemon.  If
anyone can suggest anything else I can do to debug these hangs, I'm
all ears.

Thanks for reading this far. :-)

  Jonathan Kamens

* Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)
  2004-01-15  2:21 ` Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?) Jonathan Kamens
@ 2004-01-16  3:47   ` Jonathan Kamens
  2004-01-16  7:47     ` John Bradford
  0 siblings, 1 reply; 23+ messages in thread
From: Jonathan Kamens @ 2004-01-16  3:47 UTC (permalink / raw)
  To: linux-kernel

The drive which stopped reporting BadCRC errors for weeks when I
transferred it from the Promise PDC20262 Ultra66 controller to the
SIIG SIi680 Ultra133 controller just reported this:

Jan 15 22:03:13 jik kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
Jan 15 22:03:13 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }
Jan 15 22:03:20 jik kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
Jan 15 22:03:20 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }

I don't know whether it's relevant that these errors are tagged
"drive_cmd" and the BadCRC errors were tagged "dma_intr".

Are the errors above close enough to the BadCRC errors I was getting,
i.e.:

  hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
  hde: dma_intr: error=0x84 { DriveStatusError BadCRC }

for me to now start suspecting a problem with the drive, given that
two different controllers with two different chipsets are reporting
problems with it?
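
One way to keep the two kinds of error apart is to tally the log lines
by drive, driver tag, and error byte.  The following is a minimal
sketch that assumes the syslog format shown in this message; the
function name and the regular expression are illustrative, and other
kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Matches e.g. "hde: dma_intr: error=0x84 { DriveStatusError BadCRC }".
ERROR_LINE = re.compile(r"\b(hd[a-z]): (\w+): error=(0x[0-9a-fA-F]{2})")


def tally_ide_errors(lines):
    """Count IDE error lines per (drive, driver tag, error byte)."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            drive, tag, code = match.groups()
            counts[(drive, tag, int(code, 16))] += 1
    return counts


sample = [
    "Jan 15 22:03:13 jik kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }",
    "Jan 15 22:03:13 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }",
    "Jan 15 22:03:20 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }",
    "hde: dma_intr: error=0x84 { DriveStatusError BadCRC }",
]

for (drive, tag, code), count in sorted(tally_ide_errors(sample).items()):
    print(f"{drive} {tag} error=0x{code:02x} x{count}")
```

Run over a full log, a tally like this would show whether the
drive_cmd/0x04 events and the dma_intr/0x84 events ever occur on the
same controller or in the same time frame.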

Thanks,

  Jonathan Kamens

* Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)
  2004-01-16  3:47   ` Jonathan Kamens
@ 2004-01-16  7:47     ` John Bradford
  2004-01-16 15:27       ` Jonathan Kamens
  0 siblings, 1 reply; 23+ messages in thread
From: John Bradford @ 2004-01-16  7:47 UTC (permalink / raw)
  To: Jonathan Kamens, linux-kernel

Quote from Jonathan Kamens <jik@kamens.brookline.ma.us>:
> The drive which stopped reporting BadCRC errors for weeks when I
> transferred it from the Promise PDC20262 Ultra66 controller to the
> SIIG SIi680 Ultra133 controller just reported this:
> 
> Jan 15 22:03:13 jik kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> Jan 15 22:03:13 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }
> Jan 15 22:03:20 jik kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> Jan 15 22:03:20 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }

The drive doesn't seem to understand the command it was sent.

> I don't know whether it's relevant that these errors are tagged
> "drive_cmd" and the BadCRC errors were tagged "dma_intr".
> 
> Are the errors above close enough to the BadCRC errors I was getting,
> i.e.:
> 
>   hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
>   hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
> 
> for me to now start suspecting a problem with the drive, given that
> two different controllers with two different chipsets are reporting
> problems with it?

No.

John.

* Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)
  2004-01-16  7:47     ` John Bradford
@ 2004-01-16 15:27       ` Jonathan Kamens
  2004-01-16 15:46         ` John Bradford
  0 siblings, 1 reply; 23+ messages in thread
From: Jonathan Kamens @ 2004-01-16 15:27 UTC (permalink / raw)
  To: linux-kernel

John Bradford writes:
 > Quote from Jonathan Kamens <jik@kamens.brookline.ma.us>:
 > > ... hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
 > > ... hde: drive_cmd: error=0x04 { DriveStatusError }
 > 
 > The drive doesn't seem to understand the command it was sent.

I'm not sure what this means, but assuming that it's going to happen
again at some point, what do I need to do to my kernel/configuration
now to be able to capture additional debugging information the next
time it happens?

Thanks,

  jik

* Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)
  2004-01-16 15:27       ` Jonathan Kamens
@ 2004-01-16 15:46         ` John Bradford
  2004-01-16 15:48           ` Jonathan Kamens
  2004-01-16 16:12           ` Updated on UDMA BadCRC errors + subsequent problems Ed Sweetman
  0 siblings, 2 replies; 23+ messages in thread
From: John Bradford @ 2004-01-16 15:46 UTC (permalink / raw)
  To: Jonathan Kamens, linux-kernel

Quote from Jonathan Kamens <jik@kamens.brookline.ma.us>:
> John Bradford writes:
>  > Quote from Jonathan Kamens <jik@kamens.brookline.ma.us>:
>  > > ... hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
>  > > ... hde: drive_cmd: error=0x04 { DriveStatusError }
>  > 
>  > The drive doesn't seem to understand the command it was sent.
> 
> I'm not sure what this means, but assuming that it's going to happen
> again at some point,

Maybe not - the most common cause I've seen for that message in the logs is trying to access S.M.A.R.T. information when S.M.A.R.T. is disabled.

I.e., the error should be reproducible with:

# smartctl -d /dev/hda
# smartctl -a /dev/hda

Are you sure you weren't trying to get S.M.A.R.T. info from the drive at the time the error was logged?

John.

* Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)
  2004-01-16 15:46         ` John Bradford
@ 2004-01-16 15:48           ` Jonathan Kamens
  2004-01-16 16:48             ` John Bradford
  2004-01-16 16:12           ` Updated on UDMA BadCRC errors + subsequent problems Ed Sweetman
  1 sibling, 1 reply; 23+ messages in thread
From: Jonathan Kamens @ 2004-01-16 15:48 UTC (permalink / raw)
  To: John Bradford; +Cc: linux-kernel

John Bradford writes:

 > Maybe not - the most common cause I've seen for that message in the
 > logs is trying to access S.M.A.R.T. information when S.M.A.R.T. is
 > disabled.
 > 
 > I.e., the error should be reproducible with:
 > 
 > # smartctl -d /dev/hda
 > # smartctl -a /dev/hda
 > 
 > Are you sure you weren't trying to get S.M.A.R.T. info from the
 > drive at the time the error was logged?

My smartctl wants "-s off" rather than "-d", but other than that,
you're correct, that sequence of commands does cause the same error to
appear in the logs.  But why/how would SMART be disabled on the drive?
I've been running smartd on the drive for weeks with no errors of this
sort, and I fail to see how SMART would suddenly be disabled on the
drive with no action on my part, so it seems more likely that some
other condition caused the error.

Thanks,

  jik

* Re: Updated on UDMA BadCRC errors + subsequent problems
  2004-01-16 15:46         ` John Bradford
  2004-01-16 15:48           ` Jonathan Kamens
@ 2004-01-16 16:12           ` Ed Sweetman
  1 sibling, 0 replies; 23+ messages in thread
From: Ed Sweetman @ 2004-01-16 16:12 UTC (permalink / raw)
  To: John Bradford; +Cc: Jonathan Kamens, linux-kernel

John Bradford wrote:
> Quote from Jonathan Kamens <jik@kamens.brookline.ma.us>:
> 
>>John Bradford writes:
>> > Quote from Jonathan Kamens <jik@kamens.brookline.ma.us>:
>> > > ... hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
>> > > ... hde: drive_cmd: error=0x04 { DriveStatusError }
>> > 
>> > The drive doesn't seem to understand the command it was sent.
>>
>>I'm not sure what this means, but assuming that it's going to happen
>>again at some point,
> 
> 
> Maybe not - the most common cause I've seen for that message in the logs is trying to access S.M.A.R.T. information when S.M.A.R.T. is disabled.
> 
> I.e., the error should be reproducible with:
> 
> # smartctl -d /dev/hda
> # smartctl -a /dev/hda
> 
> Are you sure you weren't trying to get S.M.A.R.T. info from the drive at the time the error was logged?
> 
> John.

hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }


Some drives, i guess, report the exact error, like mine here.  These occur 
when i'm transferring from my tulip-based nic to my hdd at 8MBytes a 
second (avg). The fs is ext3.  When i'm transferring files of about 1.5GB, 
the drive seems to freak out.  Timing sources (tsc) can also lose 
so many ticks that the other time source has to be used.

What I don't understand is why the ATA drivers don't handle CRC errors
correctly.  Instead of resetting the IDE bus and turning DMA off, why
don't they throttle down the DMA modes one by one?  When the rate of
CRC errors reaches a certain reasonable number, drop a UDMA level.  If
that CRC error rate is reached again, drop another level.  Keep doing
that until you hit PIO mode.  Usually the problem is solved by simply
using a lower DMA mode.  That way my system doesn't have to reach loads
of 20 and have I/O suck up all my CPU while I'm trying to re-enable DMA
so I can actually figure out what's going on.  CRC errors are caused by
timing problems as well as physical problems around the cabling in the
computer.  Normal hdd-to-hdd transfers (which avg about 30MByte/sec) do
not cause these errors for me.
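
The step-down policy described above could be sketched roughly as
follows. This is an illustration only, not kernel code and not how the
IDE driver actually behaves. The class name `DmaThrottle`, the mode
list, the error threshold and the time window are all hypothetical
choices made for the example.

```python
from collections import deque


class DmaThrottle:
    """Drop one transfer mode each time CRC errors pile up in a window."""

    # Fastest first; the last entry is the PIO fallback.
    MODES = ["udma5", "udma4", "udma3", "udma2", "udma1", "udma0", "pio"]

    def __init__(self, max_errors=4, window_secs=60.0):
        self.max_errors = max_errors      # assumed "reasonable number"
        self.window_secs = window_secs    # assumed rate window
        self.level = 0                    # index into MODES
        self.errors = deque()             # timestamps of recent CRC errors

    @property
    def mode(self):
        return self.MODES[self.level]

    def record_crc_error(self, now):
        """Note a CRC error at time `now` (seconds); return the mode to use."""
        self.errors.append(now)
        # Forget errors that have aged out of the window.
        while self.errors and now - self.errors[0] > self.window_secs:
            self.errors.popleft()
        # Rate reached: step down one level and start counting afresh.
        if len(self.errors) >= self.max_errors and self.level < len(self.MODES) - 1:
            self.level += 1
            self.errors.clear()
        return self.mode


throttle = DmaThrottle(max_errors=3, window_secs=10.0)
for t in (0.0, 1.0, 2.0):          # a burst of three errors
    throttle.record_crc_error(t)
print(throttle.mode)               # one level below the starting mode
```

Clearing the counter after each step gives the slower mode a fair
chance before it is judged, and isolated errors spaced further apart
than the window never trigger a downgrade.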




* Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)
  2004-01-16 15:48           ` Jonathan Kamens
@ 2004-01-16 16:48             ` John Bradford
  2004-01-16 18:04               ` Jonathan Kamens
  2004-01-16 20:52               ` Alan Cox
  0 siblings, 2 replies; 23+ messages in thread
From: John Bradford @ 2004-01-16 16:48 UTC (permalink / raw)
  To: Jonathan Kamens; +Cc: linux-kernel, alan

Quote from Jonathan Kamens <jik@kamens.brookline.ma.us>:
> John Bradford writes:
> 
>  > Maybe not - the most common cause I've seen for that message in the
>  > logs is trying to access S.M.A.R.T. information when S.M.A.R.T. is
>  > disabled.
>  > 
>  > I.E. the error should be reproducible with:
>  > 
>  > # smartctl -d /dev/hda
>  > # smartctl -a /dev/hda
>  > 
>  > Are you sure you weren't trying to get S.M.A.R.T. info from the
>  > drive at the time the error was logged?
> 
> My smartctl wants "-s off" rather than "-d", but other than that,
> you're correct, that sequence of commands does cause the same error to
> appear in the logs.  But why/how would SMART be disabled on the drive?
> I've been running smartd on the drive for weeks with no errors of this
> sort, and I fail to see how SMART would suddenly be disabled on the
> drive with no action on my part,

Some motherboard BIOSes disable S.M.A.R.T. on drives connected to
their on-board controllers on each boot.  Quite possibly some PCI IDE
cards do as well.  It's possible, (but probably not likely), that by
trying the drive on different controllers a BIOS somewhere has
disabled S.M.A.R.T.

> so it seems more likely that some
> other condition caused the error.

Quite possibly, but I can only really guess as to what that might be
at this point.

I _think_ that UDMA CRC checking is only done on data transfers, not
commands.  I've CC'ed Alan in the hope of getting some confirmation on
this.  Maybe a command being corrupted on the wire could theoretically
cause that error.

John.


* Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)
  2004-01-16 16:48             ` John Bradford
@ 2004-01-16 18:04               ` Jonathan Kamens
  2004-01-16 20:52               ` Alan Cox
  1 sibling, 0 replies; 23+ messages in thread
From: Jonathan Kamens @ 2004-01-16 18:04 UTC (permalink / raw)
  To: linux-kernel, alan

John Bradford writes:
 > Some motherboard BIOSes disable S.M.A.R.T. on drives connected to
 > their on-board controllers on each boot.  Quite possibly some PCI IDE
 > cards do as well.  It's possible, (but probably not likely), that by
 > trying the drive on different controllers a BIOS somewhere has
 > disabled S.M.A.R.T.

The error occurred at least a week after the drive was moved to its
current controller.  In that week both the drive and smartd ran just
fine with no errors.

  jik


* Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)
  2004-01-16 16:48             ` John Bradford
  2004-01-16 18:04               ` Jonathan Kamens
@ 2004-01-16 20:52               ` Alan Cox
  1 sibling, 0 replies; 23+ messages in thread
From: Alan Cox @ 2004-01-16 20:52 UTC (permalink / raw)
  To: John Bradford; +Cc: Jonathan Kamens, linux-kernel, alan

On Fri, Jan 16, 2004 at 04:48:34PM +0000, John Bradford wrote:
> I _think_ that UDMA CRC checking is only done on data transfers, not
> commands.  I've CC'ed Alan in the hope of getting some confirmation on
> this.  Maybe a command being corrupted on the wire could theoretically
> cause that error.

You are correct for PATA, but the situation there is very, very unlikely,
let alone for it to be repeatable.



