* Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
@ 2007-12-01 11:26 Justin Piszcz
2007-12-01 12:13 ` Jan Engelhardt
2007-12-06 22:00 ` Andrew Morton
0 siblings, 2 replies; 13+ messages in thread
From: Justin Piszcz @ 2007-12-01 11:26 UTC (permalink / raw)
To: linux-kernel, linux-raid, linux-ide; +Cc: apiszcz
I am putting a new machine together and I have dual raptor raid 1 for the
root, which works just fine under all stress tests.
Then I have the WD 750 GB drives (not RE2, desktop ones for ~150-160 on
sale nowadays):
I ran the following:
dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde
(as it is always a very good idea to do this with any new disk)
And sometime along the way (I had gone to sleep and let it run), this
occurred:
[42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000
action 0x2 frozen
[42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
[42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb
0x0 data 512 in
[42880.680292] res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10
(ATA bus error)
[42881.841899] ata3: soft resetting port
[42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42915.919042] ata3.00: qc timeout (cmd 0xec)
[42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
[42915.919149] ata3.00: revalidation failed (errno=-5)
[42915.919206] ata3: failed to recover some devices, retrying in 5 secs
[42920.912458] ata3: hard resetting port
[42926.411363] ata3: port is slow to respond, please be patient (Status
0x80)
[42930.943080] ata3: COMRESET failed (errno=-16)
[42930.943130] ata3: hard resetting port
[42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42931.413523] ata3.00: configured for UDMA/133
[42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
[42931.413655] ata3: EH complete
[42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors
(750156 MB)
[42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
[42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
Usually when I see this sort of thing with another box I have full of
raptors, it was due to a bad raptor and I never saw it again after I
replaced the disk that it happened on, but that was using the Intel P965
chipset.
For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of
the drives (2 raptors, 3 750s) connected to the Intel ICH9 Southbridge.
I am going to do some further testing but does this indicate a bad drive?
Bad cable? Bad connector?
As you can see above, /dev/sdc stopped responding for a little bit and
then the kernel reset the port.
Why is this though? What is the likely root cause? Should I replace the
drive? Obviously this is not normal and cannot be good. The idea is to
put these drives in a RAID5, and if one of them is going to time out, it
will cause the array to go degraded, which makes the drive worthless in
a RAID5 configuration.
Can anyone offer any insight here?
Thank you,
Justin.
* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
2007-12-01 11:26 Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port) Justin Piszcz
@ 2007-12-01 12:13 ` Jan Engelhardt
2007-12-01 12:23 ` Justin Piszcz
2007-12-01 18:44 ` Bill Davidsen
2007-12-06 22:00 ` Andrew Morton
1 sibling, 2 replies; 13+ messages in thread
From: Jan Engelhardt @ 2007-12-01 12:13 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz
On Dec 1 2007 06:26, Justin Piszcz wrote:
> I ran the following:
>
> dd if=/dev/zero of=/dev/sdc
> dd if=/dev/zero of=/dev/sdd
> dd if=/dev/zero of=/dev/sde
>
> (as it is always a very good idea to do this with any new disk)
Why would you care about what's on the disk? fdisk, mkfs and
the day-to-day operation will overwrite it _anyway_.
(If you think the disk is not empty, you should look at it
and copy off all usable warez beforehand :-)
* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
2007-12-01 12:13 ` Jan Engelhardt
@ 2007-12-01 12:23 ` Justin Piszcz
[not found] ` <20071201174733.646a5c35@absurd>
2007-12-01 18:44 ` Bill Davidsen
1 sibling, 1 reply; 13+ messages in thread
From: Justin Piszcz @ 2007-12-01 12:23 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz
On Sat, 1 Dec 2007, Jan Engelhardt wrote:
>
> On Dec 1 2007 06:26, Justin Piszcz wrote:
>> I ran the following:
>>
>> dd if=/dev/zero of=/dev/sdc
>> dd if=/dev/zero of=/dev/sdd
>> dd if=/dev/zero of=/dev/sde
>>
>> (as it is always a very good idea to do this with any new disk)
>
> Why would you care about what's on the disk? fdisk, mkfs and
> the day-to-day operation will overwrite it _anyway_.
>
> (If you think the disk is not empty, you should look at it
> and copy off all usable warez beforehand :-)
>
The purpose is that with any new disk it's good to write to all the
blocks and let the drive do all of the remapping before you put 'real'
data on it. Let it crap out or fail before I put my data on it.
Justin.
* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
2007-12-01 12:13 ` Jan Engelhardt
2007-12-01 12:23 ` Justin Piszcz
@ 2007-12-01 18:44 ` Bill Davidsen
2007-12-10 8:14 ` Tejun Heo
1 sibling, 1 reply; 13+ messages in thread
From: Bill Davidsen @ 2007-12-01 18:44 UTC (permalink / raw)
To: Jan Engelhardt
Cc: Justin Piszcz, linux-kernel, linux-raid, linux-ide, apiszcz
Jan Engelhardt wrote:
> On Dec 1 2007 06:26, Justin Piszcz wrote:
>> I ran the following:
>>
>> dd if=/dev/zero of=/dev/sdc
>> dd if=/dev/zero of=/dev/sdd
>> dd if=/dev/zero of=/dev/sde
>>
>> (as it is always a very good idea to do this with any new disk)
>
> Why would you care about what's on the disk? fdisk, mkfs and
> the day-to-day operation will overwrite it _anyway_.
>
> (If you think the disk is not empty, you should look at it
> and copy off all usable warez beforehand :-)
>
Do you not test your drives for minimum functionality before using them?
Also, if you have the tools to check for relocated sectors before and
after doing this, that's a good idea as well. S.M.A.R.T is your friend.
And when writing /dev/zero to a drive, if it craps out you have less
emotional attachment to the data.
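The before/after comparison could look roughly like the sketch below. It
assumes smartmontools is installed and that the drive reports attribute 5
(Reallocated_Sector_Ct) with the raw value in the tenth column of
`smartctl -A` output, which is the usual layout but not guaranteed for
every drive. The function name `realloc_count` is made up for this
example.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 5 (Reallocated_Sector_Ct).
# Reads `smartctl -A` style output on stdin. The attribute ID is the
# first column; the raw value is assumed to be the tenth column.
# Exits non-zero if attribute 5 is not present in the input.
realloc_count() {
    awk '$1 == 5 { print $10; found = 1 } END { if (!found) exit 1 }'
}

# Intended use (needs root and a real drive, so it is left commented):
#   before=$(smartctl -A /dev/sdc | realloc_count)
#   dd if=/dev/zero of=/dev/sdc bs=1M
#   after=$(smartctl -A /dev/sdc | realloc_count)
#   echo "reallocated sectors: $before -> $after"
```

A count that grows during the write pass is the signal to watch for.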
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
2007-12-01 18:44 ` Bill Davidsen
@ 2007-12-10 8:14 ` Tejun Heo
2007-12-13 22:27 ` Bill Davidsen
0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2007-12-10 8:14 UTC (permalink / raw)
To: Bill Davidsen
Cc: Jan Engelhardt, Justin Piszcz, linux-kernel, linux-raid,
linux-ide, apiszcz
Bill Davidsen wrote:
> Jan Engelhardt wrote:
>> On Dec 1 2007 06:26, Justin Piszcz wrote:
>>> I ran the following:
>>>
>>> dd if=/dev/zero of=/dev/sdc
>>> dd if=/dev/zero of=/dev/sdd
>>> dd if=/dev/zero of=/dev/sde
>>>
>>> (as it is always a very good idea to do this with any new disk)
>>
>> Why would you care about what's on the disk? fdisk, mkfs and
>> the day-to-day operation will overwrite it _anyway_.
>>
>> (If you think the disk is not empty, you should look at it
>> and copy off all usable warez beforehand :-)
>>
> Do you not test your drive for minimum functionality before using them?
I personally don't.
> Also, if you have the tools to check for relocated sectors before and
> after doing this, that's a good idea as well. S.M.A.R.T is your friend.
> And when writing /dev/zero to a drive, if it craps out you have less
> emotional attachment to the data.
Writing all zeros isn't too useful tho. A drive failing reallocation on
write is a catastrophic failure. It means that the drive wants to
relocate but can't because it has used up all its spare space, which
usually indicates something else is seriously wrong with the drive. The
drive will have to go to the trash can. This is all serious and bad, but
the catch is that in such cases the problem usually sticks out like a
sore thumb, so either the vendor doesn't ship such a drive or you'll
find the failure very early. I personally haven't seen any such failure
yet. Maybe I'm lucky.
Most data loss occurs when the drive fails to read what it thought it
wrote successfully, so the opposite approach, reading and dumping the
whole disk to /dev/null periodically, is probably much better than
writing zeros: it allows the drive to find deteriorating sectors early,
while they are still readable, and relocate them. But then again, I
think it's overkill.
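A periodic read pass can be as small as the sketch below. It assumes GNU
dd (for the `bs=1M` suffix); `read_scan` is a name invented for this
example, and whether the drive actually relocates a weak sector during
the read is up to its firmware.

```shell
#!/bin/sh
# Read every block of a device (or file) and discard the data, giving
# the drive a chance to notice weak sectors while they are still
# readable. Returns non-zero if dd reports a read failure.
read_scan() {
    dd if="$1" of=/dev/null bs=1M 2>/dev/null
}

# Intended use (read-only, but needs permission to open the device):
#   read_scan /dev/sdc || echo "read error on /dev/sdc, check dmesg"
```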
Writing zeros to sectors is more useful as a cure than as prevention.
If your drive fails to read a sector, write whatever value to the
sector. The drive will forget about the data on the damaged sector,
reallocate it and write the new data. Of course, you lose the data which
was originally on the sector.
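As a rough sketch of that cure, assuming 512-byte sectors (as the sd log
in this thread reports) and that the failing LBA is already known from
the kernel log or a SMART self-test. `zero_sector` is a hypothetical
helper name, and the device and LBA in the usage comment are
placeholders.

```shell
#!/bin/sh
# Overwrite one 512-byte sector at the given LBA with zeros.
# DESTROYS whatever was stored in that sector.
# conv=notrunc only matters when the target is a regular file (for
# example a disk image used for a dry run); it keeps dd from
# truncating the file after the written sector.
zero_sector() {
    dev=$1
    lba=$2
    dd if=/dev/zero of="$dev" bs=512 seek="$lba" count=1 conv=notrunc 2>/dev/null
}

# Intended use (needs root; sdX and the LBA are placeholders):
#   zero_sector /dev/sdX 123456
```

Trying it against a scratch image file first is a cheap way to confirm
the seek arithmetic before pointing it at a real disk.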
I personally think it's enough to just throw in an extra disk and make
it RAID1 or 5 and rebuild the array if a read fails on one of the disks.
If a write fails or read failures continue, replace the disk. Of course,
if you wanna be extra cautious, good for you. :-)
--
tejun
* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
2007-12-10 8:14 ` Tejun Heo
@ 2007-12-13 22:27 ` Bill Davidsen
0 siblings, 0 replies; 13+ messages in thread
From: Bill Davidsen @ 2007-12-13 22:27 UTC (permalink / raw)
To: Tejun Heo
Cc: Jan Engelhardt, Justin Piszcz, linux-kernel, linux-raid,
linux-ide, apiszcz
Tejun Heo wrote:
> Bill Davidsen wrote:
>
>> Jan Engelhardt wrote:
>>
>>> On Dec 1 2007 06:26, Justin Piszcz wrote:
>>>
>>>> I ran the following:
>>>>
>>>> dd if=/dev/zero of=/dev/sdc
>>>> dd if=/dev/zero of=/dev/sdd
>>>> dd if=/dev/zero of=/dev/sde
>>>>
>>>> (as it is always a very good idea to do this with any new disk)
>>>>
>>> Why would you care about what's on the disk? fdisk, mkfs and
>>> the day-to-day operation will overwrite it _anyway_.
>>>
>>> (If you think the disk is not empty, you should look at it
>>> and copy off all usable warez beforehand :-)
>>>
>>>
>> Do you not test your drive for minimum functionality before using them?
>>
>
> I personally don't.
>
>
>> Also, if you have the tools to check for relocated sectors before and
>> after doing this, that's a good idea as well. S.M.A.R.T is your friend.
>> And when writing /dev/zero to a drive, if it craps out you have less
>> emotional attachment to the data.
>>
>
> Writing all zero isn't too useful tho. Drive failing reallocation on
> write is catastrophic failure. It means that the drive wanna relocate
> but can't because it used up all its extra space which usually indicates
> something else is seriously wrong with the drive. The drive will have
> to go to the trash can. This is all serious and bad but the catch is
> that in such cases the problem usually stands like a sore thumb so
> either vendor doesn't ship such drive or you'll find the failure very
> early. I personally haven't seen any such failure yet. Maybe I'm lucky.
>
The problem is usually not with what the vendor ships, but what the
carrier delivers. Bad handling does happen, "drop ship" can have several
meanings, and I have received shipments with the "G sensor" in the case
triggered. Zero is a handy source of data, but the important thing is to
look at the relocated sector count before and after the write. If there
are a lot of bad sectors initially, the drive is probably a poor choice
for anything critical.
> Most data loss occurs when the drive fails to read what it thought it
> wrote successfully and the opposite - reading and dumping the whole disk
> to /dev/null periodically is probably much better than writing zeros as
> it allows the drive to find out deteriorating sector early while it's
> still readable and relocate. But then again I think it's an overkill.
>
> Writing zeros to sectors is more useful as cure rather than prevention.
> If your drive fails to read a sector, write whatever value to the
> sector. The drive will forget about the data on the damaged sector and
> reallocate and write new data to it. Of course, you lose data which was
> originally on the sector.
>
> I personally think it's enough to just throw in an extra disk and make
> it RAID0 or 5 and rebuild the array if read fails on one of the disks.
> If write fails or read fail continues, replace the disk. Of course, if
> you wanna be extra cautious, good for you. :-)
>
>
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
2007-12-01 11:26 Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port) Justin Piszcz
2007-12-01 12:13 ` Jan Engelhardt
@ 2007-12-06 22:00 ` Andrew Morton
2007-12-06 22:38 ` Justin Piszcz
1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2007-12-06 22:00 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz
On Sat, 1 Dec 2007 06:26:08 -0500 (EST)
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> I am putting a new machine together and I have dual raptor raid 1 for the
> root, which works just fine under all stress tests.
>
> Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on
> sale now adays):
>
> I ran the following:
>
> dd if=/dev/zero of=/dev/sdc
> dd if=/dev/zero of=/dev/sdd
> dd if=/dev/zero of=/dev/sde
>
> (as it is always a very good idea to do this with any new disk)
>
> And sometime along the way(?) (i had gone to sleep and let it run), this
> occurred:
>
> [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000
> action 0x2 frozen
Gee we're seeing a lot of these lately.
> [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
> [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb
> 0x0 data 512 in
> [42880.680292] res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10
> (ATA bus error)
> [42881.841899] ata3: soft resetting port
> [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [42915.919042] ata3.00: qc timeout (cmd 0xec)
> [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
> [42915.919149] ata3.00: revalidation failed (errno=-5)
> [42915.919206] ata3: failed to recover some devices, retrying in 5 secs
> [42920.912458] ata3: hard resetting port
> [42926.411363] ata3: port is slow to respond, please be patient (Status
> 0x80)
> [42930.943080] ata3: COMRESET failed (errno=-16)
> [42930.943130] ata3: hard resetting port
> [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [42931.413523] ata3.00: configured for UDMA/133
> [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
> [42931.413655] ata3: EH complete
> [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors
> (750156 MB)
> [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
> [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
> enabled, doesn't support DPO or FUA
>
> Usually when I see this sort of thing with another box I have full of
> raptors, it was due to a bad raptor and I never saw it again after I
> replaced the disk that it happened on, but that was using the Intel P965
> chipset.
>
> For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of
> the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).
>
> I am going to do some further testing but does this indicate a bad drive?
> Bad cable? Bad connector?
>
> As you can see above, /dev/sdc stopped responding for a little bit and
> then the kernel reset the port.
>
> Why is this though? What is the likely root cause? Should I replace the
> drive? Obviously this is not normal and cannot be good at all, the idea
> is to put these drives in a RAID5 and if one is going to timeout that is
> going to cause the array to go degraded and thus be worthless in a raid5
> configuration.
>
> Can anyone offer any insight here?
It would be interesting to try 2.6.21 or 2.6.22.
* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
2007-12-06 22:00 ` Andrew Morton
@ 2007-12-06 22:38 ` Justin Piszcz
2007-12-06 23:05 ` Andrew Morton
0 siblings, 1 reply; 13+ messages in thread
From: Justin Piszcz @ 2007-12-06 22:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz
On Thu, 6 Dec 2007, Andrew Morton wrote:
> On Sat, 1 Dec 2007 06:26:08 -0500 (EST)
> Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>> I am putting a new machine together and I have dual raptor raid 1 for the
>> root, which works just fine under all stress tests.
>>
>> Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on
>> sale now adays):
>>
>> I ran the following:
>>
>> dd if=/dev/zero of=/dev/sdc
>> dd if=/dev/zero of=/dev/sdd
>> dd if=/dev/zero of=/dev/sde
>>
>> (as it is always a very good idea to do this with any new disk)
>>
>> And sometime along the way(?) (i had gone to sleep and let it run), this
>> occurred:
>>
>> [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000
>> action 0x2 frozen
>
> Gee we're seeing a lot of these lately.
>
>> [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
>> [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb
>> 0x0 data 512 in
>> [42880.680292] res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10
>> (ATA bus error)
>> [42881.841899] ata3: soft resetting port
>> [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> [42915.919042] ata3.00: qc timeout (cmd 0xec)
>> [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
>> [42915.919149] ata3.00: revalidation failed (errno=-5)
>> [42915.919206] ata3: failed to recover some devices, retrying in 5 secs
>> [42920.912458] ata3: hard resetting port
>> [42926.411363] ata3: port is slow to respond, please be patient (Status
>> 0x80)
>> [42930.943080] ata3: COMRESET failed (errno=-16)
>> [42930.943130] ata3: hard resetting port
>> [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> [42931.413523] ata3.00: configured for UDMA/133
>> [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
>> [42931.413655] ata3: EH complete
>> [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors
>> (750156 MB)
>> [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
>> [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
>> [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
>> enabled, doesn't support DPO or FUA
>>
>> Usually when I see this sort of thing with another box I have full of
>> raptors, it was due to a bad raptor and I never saw it again after I
>> replaced the disk that it happened on, but that was using the Intel P965
>> chipset.
>>
>> For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of
>> the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).
>>
>> I am going to do some further testing but does this indicate a bad drive?
>> Bad cable? Bad connector?
>>
>> As you can see above, /dev/sdc stopped responding for a little bit and
>> then the kernel reset the port.
>>
>> Why is this though? What is the likely root cause? Should I replace the
>> drive? Obviously this is not normal and cannot be good at all, the idea
>> is to put these drives in a RAID5 and if one is going to timeout that is
>> going to cause the array to go degraded and thus be worthless in a raid5
>> configuration.
>>
>> Can anyone offer any insight here?
>
> It would be interesting to try 2.6.21 or 2.6.22.
>
This was due to NCQ issues (disabling it fixed the problem).
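One common way to turn NCQ off at runtime is to drop the per-disk queue
depth to 1 through sysfs, sketched below. This is an illustration, not
necessarily the exact method used here. `set_queue_depth` is a
made-up helper, and its optional third argument (the sysfs root) exists
only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Set the command queue depth for one disk. A depth of 1 means the
# drive never has more than one command outstanding, which effectively
# disables NCQ for that disk.
# Arguments: disk name (e.g. sdc), depth, optional sysfs root.
set_queue_depth() {
    disk=$1
    depth=$2
    sysfs=${3:-/sys}
    attr="$sysfs/block/$disk/device/queue_depth"
    [ -w "$attr" ] || { echo "cannot write $attr" >&2; return 1; }
    echo "$depth" > "$attr"
}

# Intended use (needs root; not persistent across reboots):
#   for d in sdc sdd sde; do set_queue_depth "$d" 1; done
```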
Justin.
* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
2007-12-06 22:38 ` Justin Piszcz
@ 2007-12-06 23:05 ` Andrew Morton
0 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2007-12-06 23:05 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz
On Thu, 6 Dec 2007 17:38:08 -0500 (EST)
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>
> On Thu, 6 Dec 2007, Andrew Morton wrote:
>
> > On Sat, 1 Dec 2007 06:26:08 -0500 (EST)
> > Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> >
> >> I am putting a new machine together and I have dual raptor raid 1 for the
> >> root, which works just fine under all stress tests.
> >>
> >> Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on
> >> sale now adays):
> >>
> >> I ran the following:
> >>
> >> dd if=/dev/zero of=/dev/sdc
> >> dd if=/dev/zero of=/dev/sdd
> >> dd if=/dev/zero of=/dev/sde
> >>
> >> (as it is always a very good idea to do this with any new disk)
> >>
> >> And sometime along the way(?) (i had gone to sleep and let it run), this
> >> occurred:
> >>
> >> [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000
> >> action 0x2 frozen
> >
> > Gee we're seeing a lot of these lately.
> >
> >> [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
> >> [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb
> >> 0x0 data 512 in
> >> [42880.680292] res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10
> >> (ATA bus error)
> >> [42881.841899] ata3: soft resetting port
> >> [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> >> [42915.919042] ata3.00: qc timeout (cmd 0xec)
> >> [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
> >> [42915.919149] ata3.00: revalidation failed (errno=-5)
> >> [42915.919206] ata3: failed to recover some devices, retrying in 5 secs
> >> [42920.912458] ata3: hard resetting port
> >> [42926.411363] ata3: port is slow to respond, please be patient (Status
> >> 0x80)
> >> [42930.943080] ata3: COMRESET failed (errno=-16)
> >> [42930.943130] ata3: hard resetting port
> >> [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> >> [42931.413523] ata3.00: configured for UDMA/133
> >> [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
> >> [42931.413655] ata3: EH complete
> >> [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors
> >> (750156 MB)
> >> [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
> >> [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> >> [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
> >> enabled, doesn't support DPO or FUA
> >>
> >> Usually when I see this sort of thing with another box I have full of
> >> raptors, it was due to a bad raptor and I never saw it again after I
> >> replaced the disk that it happened on, but that was using the Intel P965
> >> chipset.
> >>
> >> For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of
> >> the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).
> >>
> >> I am going to do some further testing but does this indicate a bad drive?
> >> Bad cable? Bad connector?
> >>
> >> As you can see above, /dev/sdc stopped responding for a little bit and
> >> then the kernel reset the port.
> >>
> >> Why is this though? What is the likely root cause? Should I replace the
> >> drive? Obviously this is not normal and cannot be good at all, the idea
> >> is to put these drives in a RAID5 and if one is going to timeout that is
> >> going to cause the array to go degraded and thus be worthless in a raid5
> >> configuration.
> >>
> >> Can anyone offer any insight here?
> >
> > It would be interesting to try 2.6.21 or 2.6.22.
> >
>
> This was due to NCQ issues (disabling it fixed the problem).
>
I cannot locate any further email discussion on this topic.
Disabling NCQ at either compile time or runtime is not a "fix" and further
work should be done here to make the kernel run acceptably on that
hardware.
[parent not found: <fa.hhS4g1h0uppt8Xx/ZZfNNQfAv1Q@ifi.uio.no>]
* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
[not found] <fa.hhS4g1h0uppt8Xx/ZZfNNQfAv1Q@ifi.uio.no>
@ 2007-12-01 20:08 ` Robert Hancock
[not found] ` <fa.YIWyRfjQw18aIH2fKaze37Gwuzo@ifi.uio.no>
1 sibling, 0 replies; 13+ messages in thread
From: Robert Hancock @ 2007-12-01 20:08 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz
Justin Piszcz wrote:
> I am putting a new machine together and I have dual raptor raid 1 for
> the root, which works just fine under all stress tests.
>
> Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on
> sale now adays):
>
> I ran the following:
>
> dd if=/dev/zero of=/dev/sdc
> dd if=/dev/zero of=/dev/sdd
> dd if=/dev/zero of=/dev/sde
>
> (as it is always a very good idea to do this with any new disk)
>
> And sometime along the way(?) (i had gone to sleep and let it run), this
> occurred:
>
> [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000
> action 0x2 frozen
> [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
> [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0
> cdb 0x0 data 512 in
> [42880.680292] res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask
> 0x10 (ATA bus error)
> [42881.841899] ata3: soft resetting port
> [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [42915.919042] ata3.00: qc timeout (cmd 0xec)
> [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
> [42915.919149] ata3.00: revalidation failed (errno=-5)
> [42915.919206] ata3: failed to recover some devices, retrying in 5 secs
> [42920.912458] ata3: hard resetting port
> [42926.411363] ata3: port is slow to respond, please be patient (Status
> 0x80)
> [42930.943080] ata3: COMRESET failed (errno=-16)
> [42930.943130] ata3: hard resetting port
> [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [42931.413523] ata3.00: configured for UDMA/133
> [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
> [42931.413655] ata3: EH complete
> [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors
> (750156 MB)
> [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
> [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
> enabled, doesn't support DPO or FUA
>
> Usually when I see this sort of thing with another box I have full of
> raptors, it was due to a bad raptor and I never saw it again after I
> replaced the disk that it happened on, but that was using the Intel P965
> chipset.
>
> For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of
> the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).
>
> I am going to do some further testing but does this indicate a bad
> drive? Bad cable? Bad connector?
Could be any of the above.
>
> As you can see above, /dev/sdc stopped responding for a little bit and
> then the kernel reset the port.
It looks like the first thing that happened is that the controller
reported it lost the SATA link, and then the drive didn't respond until
it was bashed with a few hard resets.
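The link-loss reading can be checked against the SErr value in the first
log line. The sketch below decodes a few SError bits using the positions
from the SATA SError register layout; the short labels and the function
name `decode_serr` are invented for this example, and only a subset of
the defined bits is covered.

```shell
#!/bin/sh
# Decode a handful of SError diagnostic bits relevant to link trouble.
# Argument: the SErr value as printed by libata, e.g. 0x4010000.
decode_serr() {
    val=$(( $1 ))
    out=""
    [ $(( val & 0x00010000 )) -ne 0 ] && out="$out PhyRdyChg"   # bit 16
    [ $(( val & 0x00040000 )) -ne 0 ] && out="$out CommWake"    # bit 18
    [ $(( val & 0x00080000 )) -ne 0 ] && out="$out 10B8BErr"    # bit 19
    [ $(( val & 0x00200000 )) -ne 0 ] && out="$out CRCErr"      # bit 21
    [ $(( val & 0x00400000 )) -ne 0 ] && out="$out Handshake"   # bit 22
    [ $(( val & 0x04000000 )) -ne 0 ] && out="$out DevExch"     # bit 26
    echo "${out# }"
}

decode_serr 0x4010000   # prints: PhyRdyChg DevExch
```

For 0x4010000 the two set bits are PhyRdy change and device exchanged,
with no CRC or handshake errors flagged, which fits a dropped link
rather than corrupted transfers.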
>
> Why is this though? What is the likely root cause? Should I replace
> the drive? Obviously this is not normal and cannot be good at all, the
> idea is to put these drives in a RAID5 and if one is going to timeout
> that is going to cause the array to go degraded and thus be worthless in
> a raid5 configuration.
>
> Can anyone offer any insight here?
>
> Thank you,
>
> Justin.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/
[parent not found: <fa.YIWyRfjQw18aIH2fKaze37Gwuzo@ifi.uio.no>]
Thread overview: 13+ messages
2007-12-01 11:26 Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port) Justin Piszcz
2007-12-01 12:13 ` Jan Engelhardt
2007-12-01 12:23 ` Justin Piszcz
[not found] ` <20071201174733.646a5c35@absurd>
[not found] ` <Pine.LNX.4.64.0712011155110.6257@p34.internal.lan>
2007-12-02 9:11 ` Justin Piszcz
2007-12-10 8:23 ` Tejun Heo
2007-12-01 18:44 ` Bill Davidsen
2007-12-10 8:14 ` Tejun Heo
2007-12-13 22:27 ` Bill Davidsen
2007-12-06 22:00 ` Andrew Morton
2007-12-06 22:38 ` Justin Piszcz
2007-12-06 23:05 ` Andrew Morton
[not found] <fa.hhS4g1h0uppt8Xx/ZZfNNQfAv1Q@ifi.uio.no>
2007-12-01 20:08 ` Robert Hancock
[not found] ` <fa.YIWyRfjQw18aIH2fKaze37Gwuzo@ifi.uio.no>
[not found] ` <fa.ib4H8TQ3raADIWdsEBy+eSL/1RU@ifi.uio.no>
[not found] ` <fa.S4u1AwoYnqrSuegcUaP78D3SFXQ@ifi.uio.no>
[not found] ` <fa.H1nTe/xQV/oyEMTHAkOjqgqu7jY@ifi.uio.no>
[not found] ` <fa.YpQ6xCPOijQOCKsLJr1SDINFURI@ifi.uio.no>
2007-12-05 1:26 ` Robert Hancock