Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)


* Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
@ 2007-12-01 11:26 Justin Piszcz
  2007-12-01 12:13 ` Jan Engelhardt
  2007-12-06 22:00 ` Andrew Morton
  0 siblings, 2 replies; 13+ messages in thread
From: Justin Piszcz @ 2007-12-01 11:26 UTC (permalink / raw)
  To: linux-kernel, linux-raid, linux-ide; +Cc: apiszcz

I am putting a new machine together and I have dual raptor raid 1 for the 
root, which works just fine under all stress tests.

Then I have the WD 750 GB drives (not RE2, desktop ones for ~150-160 on 
sale nowadays):

I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)

And sometime along the way(?) (I had gone to sleep and let it run), this 
occurred:

[42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 
action 0x2 frozen
[42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
[42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 
0x0 data 512 in
[42880.680292]          res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 
(ATA bus error)
[42881.841899] ata3: soft resetting port
[42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42915.919042] ata3.00: qc timeout (cmd 0xec)
[42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
[42915.919149] ata3.00: revalidation failed (errno=-5)
[42915.919206] ata3: failed to recover some devices, retrying in 5 secs
[42920.912458] ata3: hard resetting port
[42926.411363] ata3: port is slow to respond, please be patient (Status 
0x80)
[42930.943080] ata3: COMRESET failed (errno=-16)
[42930.943130] ata3: hard resetting port
[42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42931.413523] ata3.00: configured for UDMA/133
[42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
[42931.413655] ata3: EH complete
[42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors 
(750156 MB)
[42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
[42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: 
enabled, doesn't support DPO or FUA
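
As a sanity check on the sd line above, the sector count can be converted 
back into a capacity. A minimal Python sketch (the function name is made up 
for illustration and is not part of any tool mentioned in this thread):

```python
def capacity_from_sectors(sectors: int, sector_size: int = 512) -> dict:
    """Convert a sector count into bytes, decimal MB and binary GiB."""
    total_bytes = sectors * sector_size
    return {
        "bytes": total_bytes,
        "MB": total_bytes // 10**6,            # decimal megabytes
        "GiB": round(total_bytes / 2**30, 2),  # binary gibibytes
    }

# Sector count taken from the "[sdc] 1465149168 512-byte hardware sectors" line.
info = capacity_from_sectors(1465149168)
print(info)
```

The MB figure agrees with the 750156 MB the kernel printed, so these are 
750 GB drives in decimal units, which is a little under 700 GiB.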

When I saw this sort of thing on another box I have full of raptors, it 
was due to a bad raptor, and I never saw it again after I replaced the 
disk it happened on; but that box uses the Intel P965 chipset.

For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of 
the drives (2 raptors, 3 750s) connected to the Intel ICH9 Southbridge.

I am going to do some further testing but does this indicate a bad drive? 
Bad cable?  Bad connector?

As you can see above, /dev/sdc stopped responding for a little bit and 
then the kernel reset the port.
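
To put a number on "a little bit", the kernel timestamps can be subtracted. 
The sketch below is only an illustration: LOG holds four lines copied from 
the log above (unwrapped), and the helper name and regular expression are 
assumptions, not an existing tool.

```python
import re

LOG = """\
[42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0x2 frozen
[42881.841899] ata3: soft resetting port
[42920.912458] ata3: hard resetting port
[42931.413655] ata3: EH complete
"""

# "[seconds] ataN[.DD]: message"
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(ata\d+)(?:\.\d+)?: (.*)$")

def stall_seconds(log: str, port: str) -> float:
    """Seconds from the first 'exception' line to the last 'EH complete' on a port."""
    start = end = None
    for line in log.splitlines():
        m = STAMP.match(line)
        if not m or m.group(2) != port:
            continue
        t, msg = float(m.group(1)), m.group(3)
        if start is None and msg.startswith("exception"):
            start = t
        elif msg.startswith("EH complete"):
            end = t
    if start is None or end is None:
        raise ValueError(f"no complete error-handling episode found for {port}")
    return end - start

print(f"{stall_seconds(LOG, 'ata3'):.1f} s")
```

For this log the gap between the exception and "EH complete" is on the 
order of 50 seconds.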

Why is this though?  What is the likely root cause?  Should I replace the 
drive?  Obviously this is not normal and cannot be good at all: the idea 
is to put these drives in a RAID5, and if one of them is going to time out, 
the array will go degraded, which makes the drive worthless in a RAID5 
configuration.

Can anyone offer any insight here?

Thank you,

Justin.


* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
  2007-12-01 11:26 Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port) Justin Piszcz
@ 2007-12-01 12:13 ` Jan Engelhardt
  2007-12-01 12:23   ` Justin Piszcz
  2007-12-01 18:44   ` Bill Davidsen
  2007-12-06 22:00 ` Andrew Morton
  1 sibling, 2 replies; 13+ messages in thread
From: Jan Engelhardt @ 2007-12-01 12:13 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz


On Dec 1 2007 06:26, Justin Piszcz wrote:
> I ran the following:
>
> dd if=/dev/zero of=/dev/sdc
> dd if=/dev/zero of=/dev/sdd
> dd if=/dev/zero of=/dev/sde
>
> (as it is always a very good idea to do this with any new disk)

Why would you care about what's on the disk? fdisk, mkfs and
the day-to-day operation will overwrite it _anyway_.

(If you think the disk is not empty, you should look at it
and copy off all usable warez beforehand :-)



* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
  2007-12-01 12:13 ` Jan Engelhardt
@ 2007-12-01 12:23   ` Justin Piszcz
       [not found]     ` <20071201174733.646a5c35@absurd>
  2007-12-01 18:44   ` Bill Davidsen
  1 sibling, 1 reply; 13+ messages in thread
From: Justin Piszcz @ 2007-12-01 12:23 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz



On Sat, 1 Dec 2007, Jan Engelhardt wrote:

>
> On Dec 1 2007 06:26, Justin Piszcz wrote:
>> I ran the following:
>>
>> dd if=/dev/zero of=/dev/sdc
>> dd if=/dev/zero of=/dev/sdd
>> dd if=/dev/zero of=/dev/sde
>>
>> (as it is always a very good idea to do this with any new disk)
>
> Why would you care about what's on the disk? fdisk, mkfs and
> the day-to-day operation will overwrite it _anyway_.
>
> (If you think the disk is not empty, you should look at it
> and copy off all usable warez beforehand :-)
>

The purpose is that with any new disk it's good to write to all the blocks 
and let the drive do all of the re-mapping before you put 'real' data on it. 
Let it crap out or fail before I put my data on it.

Justin.


[parent not found: <20071201174733.646a5c35@absurd>]

[parent not found: <Pine.LNX.4.64.0712011155110.6257@p34.internal.lan>]

* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
       [not found]       ` <Pine.LNX.4.64.0712011155110.6257@p34.internal.lan>
@ 2007-12-02  9:11         ` Justin Piszcz
  2007-12-10  8:23           ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Justin Piszcz @ 2007-12-02  9:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-raid, linux-ide, apiszcz



On Sat, 1 Dec 2007, Justin Piszcz wrote:

>
>
> On Sat, 1 Dec 2007, Janek Kozicki wrote:
>
>> Justin Piszcz said:     (by the date of Sat, 1 Dec 2007 07:23:41 -0500 
>> (EST))
>> 
>>>>> dd if=/dev/zero of=/dev/sdc
>>> 
>>> The purpose is with any new disk its good to write to all the blocks and
>>> let the drive to all of the re-mapping before you put 'real' data on it.
>>> Let it crap out or fail before I put my data on it.
>> 
>> better use badblocks. It writes data, then reads it afterwards:
>> In this example the data is semi random (quicker than /dev/urandom ;)
>> 
>> badblocks -c 10240 -s -w -t random -v /dev/sdc
>> 
>> -- 
>> Janek Kozicki                                                         |
>> 
>
> Will give this a shot and see if I can reproduce the error, thanks.
>

The badblocks run did not reproduce the error; however, when I built a 
software RAID 5 and then performed a dd:

/usr/bin/time dd if=/dev/zero of=fill_disk bs=1M

I saw this somewhere along the way:

[30189.967531] RAID5 conf printout:
[30189.967576]  --- rd:3 wd:3
[30189.967617]  disk 0, o:1, dev:sdc1
[30189.967660]  disk 1, o:1, dev:sdd1
[30189.967716]  disk 2, o:1, dev:sde1
[42332.936615] ata5.00: exception Emask 0x2 SAct 0x7000 SErr 0x0 action 
0x2 frozen
[42332.936706] ata5.00: spurious completions during NCQ issue=0x0 
SAct=0x7000 FIS=004040a1:00000800
[42332.936804] ata5.00: cmd 61/08:60:6f:4d:2a/00:00:27:00:00/40 tag 12 cdb 
0x0 data 4096 out
[42332.936805]          res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 
(HSM violation)
[42332.936977] ata5.00: cmd 61/08:68:77:4d:2a/00:00:27:00:00/40 tag 13 cdb 
0x0 data 4096 out
[42332.936981]          res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 
(HSM violation)
[42332.937162] ata5.00: cmd 61/00:70:0f:49:2a/04:00:27:00:00/40 tag 14 cdb 
0x0 data 524288 out
[42332.937163]          res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 
(HSM violation)
[42333.240054] ata5: soft resetting port
[42333.494462] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42333.506592] ata5.00: configured for UDMA/133
[42333.506652] ata5: EH complete
[42333.506741] sd 4:0:0:0: [sde] 1465149168 512-byte hardware sectors 
(750156 MB)
[42333.506834] sd 4:0:0:0: [sde] Write Protect is off
[42333.506887] sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
[42333.506905] sd 4:0:0:0: [sde] Write cache: enabled, read cache: 
enabled, doesn't support DPO or FUA

Next test, I will turn off NCQ and try to make the problem re-occur.
Does anyone else have any thoughts here?
I ran long SMART tests on all 3 disks; they all completed successfully.

Perhaps these drives need to be NCQ-blacklisted with the P35 chipset?
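
For reading the log above: SAct is a bitmask of the NCQ tags that were 
outstanding when the exception hit. A small Python sketch (the helper name 
is made up) that turns the mask into tag numbers:

```python
def ncq_tags(sact: int) -> list[int]:
    """Return the NCQ tag numbers whose bits are set in an SActive value."""
    return [tag for tag in range(32) if sact & (1 << tag)]

# SAct value from the "exception Emask 0x2 SAct 0x7000" line.
print(ncq_tags(0x7000))
```

This gives tags 12, 13 and 14, which are the three "cmd 61/..." write 
commands listed right after the exception line.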

Justin.


* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
  2007-12-02  9:11         ` Justin Piszcz
@ 2007-12-10  8:23           ` Tejun Heo
  0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2007-12-10  8:23 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz

Justin Piszcz wrote:
> The badblocks did not do anything; however, when I built a software raid
> 5 and the performed a dd:
> 
> /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M
> 
> [42332.936615] ata5.00: exception Emask 0x2 SAct 0x7000 SErr 0x0 action
> 0x2 frozen
> [42332.936706] ata5.00: spurious completions during NCQ issue=0x0
> SAct=0x7000 FIS=004040a1:00000800
> 
> Next test, I will turn off NCQ and try to make the problem re-occur.
> If anyone else has any thoughts here..?
> I ran long smart tests on all 3 disks, they all ran successfully.
> 
> Perhaps these drives need to be NCQ BLACKLISTED with the P35 chipset?

That was me being stupid.  Patches for both upstream and -stable
branches are posted.  These errors will go away.

Thanks.

-- 
tejun


* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
  2007-12-01 12:13 ` Jan Engelhardt
  2007-12-01 12:23   ` Justin Piszcz
@ 2007-12-01 18:44   ` Bill Davidsen
  2007-12-10  8:14     ` Tejun Heo
  1 sibling, 1 reply; 13+ messages in thread
From: Bill Davidsen @ 2007-12-01 18:44 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Justin Piszcz, linux-kernel, linux-raid, linux-ide, apiszcz

Jan Engelhardt wrote:
> On Dec 1 2007 06:26, Justin Piszcz wrote:
>> I ran the following:
>>
>> dd if=/dev/zero of=/dev/sdc
>> dd if=/dev/zero of=/dev/sdd
>> dd if=/dev/zero of=/dev/sde
>>
>> (as it is always a very good idea to do this with any new disk)
> 
> Why would you care about what's on the disk? fdisk, mkfs and
> the day-to-day operation will overwrite it _anyway_.
> 
> (If you think the disk is not empty, you should look at it
> and copy off all usable warez beforehand :-)
> 
Do you not test your drives for minimum functionality before using them? 
Also, if you have the tools to check for relocated sectors before and 
after doing this, that's a good idea as well. S.M.A.R.T is your friend. 
And when writing /dev/zero to a drive, if it craps out you have less 
emotional attachment to the data.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
  2007-12-01 18:44   ` Bill Davidsen
@ 2007-12-10  8:14     ` Tejun Heo
  2007-12-13 22:27       ` Bill Davidsen
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2007-12-10  8:14 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Jan Engelhardt, Justin Piszcz, linux-kernel, linux-raid,
	linux-ide, apiszcz

Bill Davidsen wrote:
> Jan Engelhardt wrote:
>> On Dec 1 2007 06:26, Justin Piszcz wrote:
>>> I ran the following:
>>>
>>> dd if=/dev/zero of=/dev/sdc
>>> dd if=/dev/zero of=/dev/sdd
>>> dd if=/dev/zero of=/dev/sde
>>>
>>> (as it is always a very good idea to do this with any new disk)
>>
>> Why would you care about what's on the disk? fdisk, mkfs and
>> the day-to-day operation will overwrite it _anyway_.
>>
>> (If you think the disk is not empty, you should look at it
>> and copy off all usable warez beforehand :-)
>>
> Do you not test your drive for minimum functionality before using them?

I personally don't.

> Also, if you have the tools to check for relocated sectors before and
> after doing this, that's a good idea as well. S.M.A.R.T is your friend.
> And when writing /dev/zero to a drive, if it craps out you have less
> emotional attachment to the data.

Writing all zeros isn't too useful tho.  A drive failing reallocation on
write is a catastrophic failure.  It means that the drive wants to relocate
but can't because it has used up all its spare space, which usually indicates
something else is seriously wrong with the drive.  The drive will have
to go to the trash can.  This is all serious and bad, but the catch is
that in such cases the problem usually sticks out like a sore thumb, so
either the vendor doesn't ship such a drive or you'll find the failure very
early.  I personally haven't seen any such failure yet.  Maybe I'm lucky.

Most data loss occurs when the drive fails to read what it thought it
wrote successfully, so the opposite approach - reading and dumping the whole
disk to /dev/null periodically - is probably much better than writing zeros,
as it allows the drive to find a deteriorating sector early, while it's
still readable, and relocate it.  But then again I think it's overkill.
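
A periodic read pass of this kind could look roughly like the Python sketch
below. It is an illustration under assumptions: the function name is made
up, the device path in the comment is a placeholder, and reads that go
through the page cache may not always reach the platters.

```python
import os

def read_scan(path: str, chunk_size: int = 1024 * 1024) -> tuple[int, list[int]]:
    """Read `path` from start to end, discarding the data.

    Returns (bytes_read, offsets_of_chunks_that_failed_to_read).
    """
    bytes_read = 0
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for files and block devices
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # Note the failing chunk and keep scanning past it.
                bad_offsets.append(offset)
                offset += want
                continue
            if not data:
                break
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets

# Example (needs root; "/dev/sdX" is a placeholder):
# total, bad = read_scan("/dev/sdX")
```

Any offsets it returns are candidates for the write-to-reallocate cure
described in the next paragraph.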

Writing zeros to sectors is more useful as a cure than as prevention.
If your drive fails to read a sector, write any value to that
sector.  The drive will forget about the data on the damaged sector,
reallocate it and write the new data.  Of course, you lose the data which
was originally on the sector.
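
The idea can be sketched in Python as below. Everything here is an
assumption for illustration: the function name is made up, and on a real
disk a buffered write like this may make the kernel read the surrounding
data first, so direct I/O or a dedicated tool such as hdparm's
--write-sector option is usually the safer route. It destroys whatever was
in the sector.

```python
import os

def overwrite_sector(path: str, lba: int, sector_size: int = 512) -> None:
    """Write one sector of zeros at logical block address `lba` and flush it."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # bytes(n) is n zero bytes; pwrite leaves the rest of the target untouched.
        os.pwrite(fd, bytes(sector_size), lba * sector_size)
        os.fsync(fd)  # push the write out of the page cache toward the device
    finally:
        os.close(fd)

# Example (needs root; device path and LBA are placeholders):
# overwrite_sector("/dev/sdX", 123456)
```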

I personally think it's enough to just throw in an extra disk and make
it RAID1 or 5 and rebuild the array if a read fails on one of the disks.
If writes fail or read failures continue, replace the disk.  Of course, if
you wanna be extra cautious, good for you.  :-)

-- 
tejun


* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
  2007-12-10  8:14     ` Tejun Heo
@ 2007-12-13 22:27       ` Bill Davidsen
  0 siblings, 0 replies; 13+ messages in thread
From: Bill Davidsen @ 2007-12-13 22:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Engelhardt, Justin Piszcz, linux-kernel, linux-raid,
	linux-ide, apiszcz

Tejun Heo wrote:
> Bill Davidsen wrote:
>   
>> Jan Engelhardt wrote:
>>     
>>> On Dec 1 2007 06:26, Justin Piszcz wrote:
>>>       
>>>> I ran the following:
>>>>
>>>> dd if=/dev/zero of=/dev/sdc
>>>> dd if=/dev/zero of=/dev/sdd
>>>> dd if=/dev/zero of=/dev/sde
>>>>
>>>> (as it is always a very good idea to do this with any new disk)
>>>>         
>>> Why would you care about what's on the disk? fdisk, mkfs and
>>> the day-to-day operation will overwrite it _anyway_.
>>>
>>> (If you think the disk is not empty, you should look at it
>>> and copy off all usable warez beforehand :-)
>>>
>>>       
>> Do you not test your drive for minimum functionality before using them?
>>     
>
> I personally don't.
>
>   
>> Also, if you have the tools to check for relocated sectors before and
>> after doing this, that's a good idea as well. S.M.A.R.T is your friend.
>> And when writing /dev/zero to a drive, if it craps out you have less
>> emotional attachment to the data.
>>     
>
> Writing all zero isn't too useful tho.  Drive failing reallocation on
> write is catastrophic failure.  It means that the drive wanna relocate
> but can't because it used up all its extra space which usually indicates
> something else is seriously wrong with the drive.  The drive will have
> to go to the trash can.  This is all serious and bad but the catch is
> that in such cases the problem usually stands like a sore thumb so
> either vendor doesn't ship such drive or you'll find the failure very
> early.  I personally haven't seen any such failure yet.  Maybe I'm lucky.
>   

The problem is usually not with what the vendor ships, but what the 
carrier delivers. Bad handling does happen, "drop ship" can have several 
meanings, and I have received shipments with the "G sensor" in the case 
triggered. Zero is a handy source of data, but the important thing is to 
look at the relocated sector count before and after the write. If there 
are a lot of bad sectors initially, the drive is probably a poor choice 
for anything critical.
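
The before/after comparison can be scripted. The Python sketch below parses 
attribute rows in the layout printed by `smartctl -A`; the function name is 
made up, and the two sample rows are invented for illustration, not readings 
from any drive in this thread.

```python
def reallocated_sectors(smartctl_output: str) -> int:
    """Return the raw value of SMART attribute 5 (Reallocated_Sector_Ct)."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0] == "5" and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])
    raise ValueError("Reallocated_Sector_Ct not found")

# Invented sample rows (hypothetical values), one taken before and one after the write pass.
BEFORE = "  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0"
AFTER = "  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8"

grown = reallocated_sectors(AFTER) - reallocated_sectors(BEFORE)
print(f"sectors reallocated during the write pass: {grown}")
```

In practice the two strings would come from running `smartctl -A` on the 
drive before and after the full write.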
-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
  2007-12-01 11:26 Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port) Justin Piszcz
  2007-12-01 12:13 ` Jan Engelhardt
@ 2007-12-06 22:00 ` Andrew Morton
  2007-12-06 22:38   ` Justin Piszcz
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2007-12-06 22:00 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz

On Sat, 1 Dec 2007 06:26:08 -0500 (EST)
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:

> I am putting a new machine together and I have dual raptor raid 1 for the 
> root, which works just fine under all stress tests.
> 
> Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on 
> sale now adays):
> 
> I ran the following:
> 
> dd if=/dev/zero of=/dev/sdc
> dd if=/dev/zero of=/dev/sdd
> dd if=/dev/zero of=/dev/sde
> 
> (as it is always a very good idea to do this with any new disk)
> 
> And sometime along the way(?) (i had gone to sleep and let it run), this 
> occurred:
> 
> [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 
> action 0x2 frozen

Gee we're seeing a lot of these lately.

> [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
> [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 
> 0x0 data 512 in
> [42880.680292]          res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 
> (ATA bus error)
> [42881.841899] ata3: soft resetting port
> [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [42915.919042] ata3.00: qc timeout (cmd 0xec)
> [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
> [42915.919149] ata3.00: revalidation failed (errno=-5)
> [42915.919206] ata3: failed to recover some devices, retrying in 5 secs
> [42920.912458] ata3: hard resetting port
> [42926.411363] ata3: port is slow to respond, please be patient (Status 
> 0x80)
> [42930.943080] ata3: COMRESET failed (errno=-16)
> [42930.943130] ata3: hard resetting port
> [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [42931.413523] ata3.00: configured for UDMA/133
> [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
> [42931.413655] ata3: EH complete
> [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors 
> (750156 MB)
> [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
> [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: 
> enabled, doesn't support DPO or FUA
> 
> Usually when I see this sort of thing with another box I have full of 
> raptors, it was due to a bad raptor and I never saw it again after I 
> replaced the disk that it happened on, but that was using the Intel P965 
> chipset.
> 
> For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of 
> the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).
> 
> I am going to do some further testing but does this indicate a bad drive? 
> Bad cable?  Bad connector?
> 
> As you can see above, /dev/sdc stopped responding for a little bit and 
> then the kernel reset the port.
> 
> Why is this though?  What is the likely root cause?  Should I replace the 
> drive?  Obviously this is not normal and cannot be good at all, the idea 
> is to put these drives in a RAID5 and if one is going to timeout that is 
> going to cause the array to go degraded and thus be worthless in a raid5 
> configuration.
> 
> Can anyone offer any insight here?

It would be interesting to try 2.6.21 or 2.6.22.



* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
  2007-12-06 22:00 ` Andrew Morton
@ 2007-12-06 22:38   ` Justin Piszcz
  2007-12-06 23:05     ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Justin Piszcz @ 2007-12-06 22:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz



On Thu, 6 Dec 2007, Andrew Morton wrote:

> It would be interesting to try 2.6.21 or 2.6.22.
>

This was due to NCQ issues (disabling it fixed the problem).
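
The message does not say how NCQ was disabled. One common runtime method 
with libata is to write 1 to the disk's queue_depth file in sysfs; the 
Python sketch below assumes that method, and the function name and the 
sysfs_root parameter are made up for illustration.

```python
from pathlib import Path

def set_queue_depth(disk: str, depth: int, sysfs_root: str = "/sys") -> int:
    """Write a new queue depth for `disk` (e.g. "sdc") and return the previous one."""
    node = Path(sysfs_root) / "block" / disk / "device" / "queue_depth"
    previous = int(node.read_text().strip())
    node.write_text(f"{depth}\n")
    return previous

# Example (needs root on a real system): a depth of 1 leaves no room for queuing.
# old_depth = set_queue_depth("sdc", 1)
```

Returning the previous depth makes it easy to restore the setting once a 
fixed kernel is installed.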

Justin.


* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
  2007-12-06 22:38   ` Justin Piszcz
@ 2007-12-06 23:05     ` Andrew Morton
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2007-12-06 23:05 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz

On Thu, 6 Dec 2007 17:38:08 -0500 (EST)
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:

> This was due to NCQ issues (disabling it fixed the problem).
> 

I cannot locate any further email discussion on this topic.

Disabling NCQ at either compile time or runtime is not a "fix" and further
work should be done here to make the kernel run acceptably on that
hardware.


[parent not found: <fa.hhS4g1h0uppt8Xx/ZZfNNQfAv1Q@ifi.uio.no>]

* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
       [not found] <fa.hhS4g1h0uppt8Xx/ZZfNNQfAv1Q@ifi.uio.no>
@ 2007-12-01 20:08 ` Robert Hancock
       [not found] ` <fa.YIWyRfjQw18aIH2fKaze37Gwuzo@ifi.uio.no>
  1 sibling, 0 replies; 13+ messages in thread
From: Robert Hancock @ 2007-12-01 20:08 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz

Justin Piszcz wrote:
> I am putting a new machine together and I have dual raptor raid 1 for 
> the root, which works just fine under all stress tests.
> 
> Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on 
> sale now adays):
> 
> I ran the following:
> 
> dd if=/dev/zero of=/dev/sdc
> dd if=/dev/zero of=/dev/sdd
> dd if=/dev/zero of=/dev/sde
> 
> (as it is always a very good idea to do this with any new disk)
> 
> And sometime along the way(?) (i had gone to sleep and let it run), this 
> occurred:
> 
> [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 
> action 0x2 frozen
> [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
> [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 
> cdb 0x0 data 512 in
> [42880.680292]          res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 
> 0x10 (ATA bus error)
> [42881.841899] ata3: soft resetting port
> [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [42915.919042] ata3.00: qc timeout (cmd 0xec)
> [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
> [42915.919149] ata3.00: revalidation failed (errno=-5)
> [42915.919206] ata3: failed to recover some devices, retrying in 5 secs
> [42920.912458] ata3: hard resetting port
> [42926.411363] ata3: port is slow to respond, please be patient (Status 
> 0x80)
> [42930.943080] ata3: COMRESET failed (errno=-16)
> [42930.943130] ata3: hard resetting port
> [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [42931.413523] ata3.00: configured for UDMA/133
> [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
> [42931.413655] ata3: EH complete
> [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors 
> (750156 MB)
> [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
> [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: 
> enabled, doesn't support DPO or FUA
> 
> Usually when I see this sort of thing with another box I have full of 
> raptors, it was due to a bad raptor and I never saw it again after I 
> replaced the disk that it happened on, but that was using the Intel P965 
> chipset.
> 
> For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of 
> the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).
> 
> I am going to do some further testing but does this indicate a bad 
> drive? Bad cable?  Bad connector?

Could be any of the above.

> 
> As you can see above, /dev/sdc stopped responding for a little bit and 
> then the kernel reset the port.

It looks like the first thing that happened is that the controller 
reported it lost the SATA link, and then the drive didn't respond until 
it was bashed with a few hard resets.
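
For reference, a minimal sketch of how the SErr value in the quoted log 
(0x4010000) maps to that reading. The bit names are taken from the SATA 
SError register layout (DIAG field) as an assumption, not from the libata 
source, and decode_serr is a hypothetical helper name used only for 
illustration:

```python
# Sketch only: bit positions assume the SATA SError register DIAG field
# (bits 16-26). The table and helper name are illustrative, not libata code.
SERR_DIAG_BITS = {
    16: "PhyRdy change",
    17: "Phy internal error",
    18: "Comm wake",
    19: "10b to 8b decode error",
    20: "Disparity error",
    21: "CRC error",
    22: "Handshake error",
    23: "Link sequence error",
    24: "Transport state transition error",
    25: "Unknown FIS type",
    26: "Device exchanged",
}


def decode_serr(serr):
    """Return the names of the DIAG bits set in an SErr value."""
    return [name for bit, name in sorted(SERR_DIAG_BITS.items())
            if serr & (1 << bit)]


# SErr value from the ata3.00 exception line in the quoted log.
print(decode_serr(0x4010000))
```

Under those assumptions the two set bits are "PhyRdy change" and "Device 
exchanged", which fits the "connection status changed" message on the 
following log line.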

> 
> Why is this though?  What is the likely root cause?  Should I replace 
> the drive?  Obviously this is not normal and cannot be good at all, the 
> idea is to put these drives in a RAID5 and if one is going to timeout 
> that is going to cause the array to go degraded and thus be worthless in 
> a raid5 configuration.
> 
> Can anyone offer any insight here?
> 
> Thank you,
> 
> Justin.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


* Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
       [not found]         ` <fa.YpQ6xCPOijQOCKsLJr1SDINFURI@ifi.uio.no>
@ 2007-12-05  1:26           ` Robert Hancock
  0 siblings, 0 replies; 13+ messages in thread
From: Robert Hancock @ 2007-12-05  1:26 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, linux-ide, apiszcz

Justin Piszcz wrote:
> The badblocks did not do anything; however, when I built a software raid 
> 5 and then performed a dd:
> 
> /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M
> 
> I saw this somewhere along the way:
> 
> [30189.967531] RAID5 conf printout:
> [30189.967576]  --- rd:3 wd:3
> [30189.967617]  disk 0, o:1, dev:sdc1
> [30189.967660]  disk 1, o:1, dev:sdd1
> [30189.967716]  disk 2, o:1, dev:sde1
> [42332.936615] ata5.00: exception Emask 0x2 SAct 0x7000 SErr 0x0 action 
> 0x2 frozen
> [42332.936706] ata5.00: spurious completions during NCQ issue=0x0 
> SAct=0x7000 FIS=004040a1:00000800
> [42332.936804] ata5.00: cmd 61/08:60:6f:4d:2a/00:00:27:00:00/40 tag 12 
> cdb 0x0 data 4096 out
> [42332.936805]          res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 
> 0x2 (HSM violation)
> [42332.936977] ata5.00: cmd 61/08:68:77:4d:2a/00:00:27:00:00/40 tag 13 
> cdb 0x0 data 4096 out
> [42332.936981]          res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 
> 0x2 (HSM violation)
> [42332.937162] ata5.00: cmd 61/00:70:0f:49:2a/04:00:27:00:00/40 tag 14 
> cdb 0x0 data 524288 out
> [42332.937163]          res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 
> 0x2 (HSM violation)
> [42333.240054] ata5: soft resetting port
> [42333.494462] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [42333.506592] ata5.00: configured for UDMA/133
> [42333.506652] ata5: EH complete
> [42333.506741] sd 4:0:0:0: [sde] 1465149168 512-byte hardware sectors 
> (750156 MB)
> [42333.506834] sd 4:0:0:0: [sde] Write Protect is off
> [42333.506887] sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
> [42333.506905] sd 4:0:0:0: [sde] Write cache: enabled, read cache: 
> enabled, doesn't support DPO or FUA
> 
> Next test, I will turn off NCQ and try to make the problem re-occur.
> If anyone else has any thoughts here..?
> I ran long smart tests on all 3 disks, they all ran successfully.
> 
> Perhaps these drives need to be NCQ BLACKLISTED with the P35 chipset?

The problem won't recur with NCQ off, because spurious completions are 
impossible in that case.
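
As a rough illustration, SAct is a bitmask with one bit per outstanding 
NCQ tag, so the SAct=0x7000 in the quoted log can be expanded into tag 
numbers. The helper name active_tags is hypothetical and this is only a 
sketch of the bitmask arithmetic, not of what the driver does:

```python
def active_tags(sact):
    """Return the NCQ tag numbers whose bit is set in an SAct value.

    Assumes one bit per tag, tags 0..31.
    """
    return [tag for tag in range(32) if sact & (1 << tag)]


# SAct from the ata5.00 exception line in the quoted log.
print(active_tags(0x7000))  # [12, 13, 14]

# With no NCQ commands outstanding the mask is zero and no tags are active.
print(active_tags(0x0))     # []
```

Tags 12, 13 and 14 are the three commands listed as failed in the log. 
With NCQ off no tags are ever outstanding, so there is nothing for a 
completion to be spurious against.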

It was originally thought that these AHCI spurious NCQ completions were 
caused by busted NCQ implementations on the drives, but I think the 
current theory is that it's some other timing problem or some such, 
given the number of drives across all makers which are reported to do 
this. I believe Tejun is investigating?

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


end of thread, other threads:[~2007-12-13 22:10 UTC | newest]

Thread overview: 13+ messages
2007-12-01 11:26 Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port) Justin Piszcz
2007-12-01 12:13 ` Jan Engelhardt
2007-12-01 12:23   ` Justin Piszcz
     [not found]     ` <20071201174733.646a5c35@absurd>
     [not found]       ` <Pine.LNX.4.64.0712011155110.6257@p34.internal.lan>
2007-12-02  9:11         ` Justin Piszcz
2007-12-10  8:23           ` Tejun Heo
2007-12-01 18:44   ` Bill Davidsen
2007-12-10  8:14     ` Tejun Heo
2007-12-13 22:27       ` Bill Davidsen
2007-12-06 22:00 ` Andrew Morton
2007-12-06 22:38   ` Justin Piszcz
2007-12-06 23:05     ` Andrew Morton
     [not found] <fa.hhS4g1h0uppt8Xx/ZZfNNQfAv1Q@ifi.uio.no>
2007-12-01 20:08 ` Robert Hancock
     [not found] ` <fa.YIWyRfjQw18aIH2fKaze37Gwuzo@ifi.uio.no>
     [not found]   ` <fa.ib4H8TQ3raADIWdsEBy+eSL/1RU@ifi.uio.no>
     [not found]     ` <fa.S4u1AwoYnqrSuegcUaP78D3SFXQ@ifi.uio.no>
     [not found]       ` <fa.H1nTe/xQV/oyEMTHAkOjqgqu7jY@ifi.uio.no>
     [not found]         ` <fa.YpQ6xCPOijQOCKsLJr1SDINFURI@ifi.uio.no>
2007-12-05  1:26           ` Robert Hancock
