* Mdadm server eating drives
@ 2013-06-12 13:47 Barrett Lewis
  2013-06-12 13:57 ` David Brown
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Barrett Lewis @ 2013-06-12 13:47 UTC (permalink / raw)
  To: linux-raid

I started about 1 year ago with a 5x2TB RAID 5.  At the beginning of
February, I came home from work and my drives were all making these
crazy beeping noises.  At that point I was on kernel version .34.

I shut down and rebooted the server and the RAID array didn't come
back online.  I noticed one drive was going up and down and determined
that the drive had actual physical damage to the power connector and
was losing and regaining power through vibration.  No problem.  I
bought another hard drive and mdadm started recovering to the new
drive.  I got it back to a RAID 5, backed up my data, then started
growing to a RAID 6, and my computer hung so hard that even REISUB was
ignored.  I restarted and resumed the grow.  Then I started getting
errors like these; they repeat for a minute or two and then the device
gets failed out of the array:

[  193.801507] ata4.00: exception Emask 0x0 SAct 0x40000063 SErr 0x0 action 0x0
[  193.801554] ata4.00: irq_stat 0x40000008
[  193.801581] ata4.00: failed command: READ FPDMA QUEUED
[  193.801616] ata4.00: cmd 60/08:f0:98:c8:2b/00:00:10:00:00/40 tag 30 ncq 4096 in
[  193.801618]          res 51/40:08:98:c8:2b/00:00:10:00:00/40 Emask 0x409 (media error) <F>
[  193.801703] ata4.00: status: { DRDY ERR }
[  193.801728] ata4.00: error: { UNC }
[  193.804479] ata4.00: configured for UDMA/133
[  193.804499] ata4: EH complete

First on one drive, then on another, then on another: as the slow
grow to RAID 6 progressed, these messages kept coming up and taking
drives down.  Eventually (over the course of the week-long grow) the
failures were happening faster than I could recover from them, and I
had to resort to ddrescuing RAID components to keep the array from
dropping below the minimum number of members.  I ended up having to
ddrescue 3 failed drives and force the array assembly to get back to 5
drives, and by that time the array's ext4 file system could no longer
mount (it said something about group descriptors being corrupted).  By
now, every one of the original drives has been replaced, and this has
been ongoing for 5 months.  I didn't even want to run an fsck to
*attempt* to fix the file system until I had a solid RAID 6.

I upgraded my kernel to .40, bought another hard drive, put it in, and
started the grow.  Within an hour the system froze.  I rebooted and
restarted the array (and the grow); 2 hours later the system froze
again.  I rebooted and restarted the array (and the grow) again, and
got those same errors again, this time on a drive that I had bought
last month.  Frustrated (feeling like this will never end), I let it
keep going, hoping to at least get back to RAID 5.  A few hours later
I got these errors AGAIN on ANOTHER drive I got last month (of a
different brand and model).  So now I'm back with a non-functional
array and a pile of 6 dead drives (not counting the ones still in the
computer, components of a now incomplete array).

What is going on here?  If brand new drives bought a month ago from
two different manufacturers are failing, something else is going on.
Is it my motherboard?  I've run memtest for 15 hours so far with no
errors, and I'll let it go for 48 before I stop it, so let's assume
it's not the RAM for now.

Not included in this history are SEVERAL times the machine locked up
too hard for even REISUB, almost always during the heavy I/O of
component recovery.  It seems to stay up for weeks while the array is
inactive (and I'm too busy with other things to deal with it), and
then as soon as I put a new drive in and the recovery starts, it hangs
within an hour, and again every few hours after that, and eventually I
get the "failed command: READ FPDMA QUEUED status: { DRDY ERR } error:
{ UNC }" errors and another drive falls off the array.

I don't mind buying a new motherboard if that's what it is (I've
already spent almost a grand on hard drives); I just want to get this
fixed and stable and put the nightmare behind me.

Here is the dmesg output for my last boot, where two drives failed at
timestamps 193 and 12196: http://paste.ubuntu.com/5753575/

Thanks for any thoughts on the matter


* Re: Mdadm server eating drives
  2013-06-12 13:47 Mdadm server eating drives Barrett Lewis
@ 2013-06-12 13:57 ` David Brown
  2013-06-12 14:44 ` Phil Turmel
  2013-06-12 15:41 ` Adam Goryachev
  2 siblings, 0 replies; 34+ messages in thread
From: David Brown @ 2013-06-12 13:57 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

Hi,

Since you mentioned problems with power, are you sure your power supply
is enough for all these drives?

Regards,

David


On 12/06/13 15:47, Barrett Lewis wrote:
[trim /]

* Re: Mdadm server eating drives
  2013-06-12 13:47 Mdadm server eating drives Barrett Lewis
  2013-06-12 13:57 ` David Brown
@ 2013-06-12 14:44 ` Phil Turmel
  2013-06-12 15:41 ` Adam Goryachev
  2 siblings, 0 replies; 34+ messages in thread
From: Phil Turmel @ 2013-06-12 14:44 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On 06/12/2013 09:47 AM, Barrett Lewis wrote:
> I started about 1 year ago with a 5x2TB RAID 5.  At the beginning of
> February, I came home from work and my drives were all making these
> crazy beeping noises.  At that point I was on kernel version .34.

[trim /]

What you are experiencing is typical of a hobby-level user who bought
non-raid-rated drives and is now suffering timeout-mismatch array
failures due to a lack of error recovery control.

I suggest you search the archives for various combinations of "scterc",
"URE", "timeout", and "error recovery".  In the end, you will almost
certainly need to either use "smartctl -l scterc,70,70" to turn on ERC
in your drives, or use "echo 180 >/sys/block/sdX/device/timeout" to
lengthen Linux's standard driver command timeout.
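
A quick way to see where each member currently stands on both of those
knobs (the sd[a-f] names here are an assumption -- substitute your
actual member devices):

for d in a b c d e f ; do
    echo "=== /dev/sd$d ==="
    smartctl -l scterc /dev/sd$d        # drive-side limit, in deciseconds
    cat /sys/block/sd$d/device/timeout  # kernel driver timeout, in seconds
done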

Anyways, when you check in again, please report the output of the following:

1) "mdadm -E /dev/sdX" for each member device or partition
2) "mdadm -D /dev/mdX" for your array
3) "smartctl -x /dev/sdX" for each member device
4) "cat /proc/mdstat"
5) "for x in /sys/block/sd*/device/timeout ; do echo $x $(< $x) ; done"
6) "dmesg" (trimmed to relevant md and sd* messages)
7) "cat /etc/mdadm.conf"

Phil



* Re: Mdadm server eating drives
  2013-06-12 13:47 Mdadm server eating drives Barrett Lewis
  2013-06-12 13:57 ` David Brown
  2013-06-12 14:44 ` Phil Turmel
@ 2013-06-12 15:41 ` Adam Goryachev
       [not found]   ` <CAPSPcXihHrAi2TB9Fuxb1qOGMc_WzwGoXAA7nHdwe2knkO0LkQ@mail.gmail.com>
  2 siblings, 1 reply; 34+ messages in thread
From: Adam Goryachev @ 2013-06-12 15:41 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On 12/06/13 23:47, Barrett Lewis wrote:
[trim /]

Apart from the previous thought regarding lack of power for the number
of drives, have you considered getting a SATA controller card?  This
would totally rule out the motherboard as being an issue without
forcing you to replace the motherboard.  I'd probably check out the
power supply issue first (quick, cheap, easy) and then follow up with
a well-supported SATA controller card (i.e., not a cheap, crappy SATA
card with poor drivers, etc.).

Hope this helps

Regards,
Adam

-- 
Adam Goryachev
Website Managers
Ph: +61 2 8304 0000                            adam@websitemanagers.com.au
Fax: +61 2 8304 0001                            www.websitemanagers.com.au



* Re: Mdadm server eating drives
       [not found]     ` <CAPSPcXib4YZ9Ah-jLvL_kPwpKHLxaGT0rNaDL4XQcFm=RtjcAQ@mail.gmail.com>
@ 2013-06-14  0:19       ` Barrett Lewis
  2013-06-14  2:08         ` Phil Turmel
  0 siblings, 1 reply; 34+ messages in thread
From: Barrett Lewis @ 2013-06-14  0:19 UTC (permalink / raw)
  To: linux-raid

Sorry for the delay; I wanted to let the memtest run for 48 hours.
It's at 49 hours now with zero errors, so memory is pretty much ruled
out.

As far as power, I would *think* I have enough power.  The power
supply is a 500W Thermaltake TR2.  It's powering an ASRock Z77 mobo
with an i5-3570K, and the only card in it is a dinky little 2-port
SATA card my OS drive is on (the RAID components are plugged into the
mobo).  Eight 7200rpm drives and an SSD.  Tell me if this sounds
insufficient.
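
My rough numbers, assuming a typical ~2 A at 12 V spin-up draw per
3.5" 7200rpm drive (a ballpark figure, not measured):

  8 drives x ~2 A x 12 V             ~ 190 W at spin-up
  i5-3570K (77 W TDP) + board + SSD  ~ 100-120 W
                                     -----------
                                     ~ 300 W peak, mostly on the 12 V rail

So on paper 500 W should be plenty, though I realize the 12 V rail
rating (and the PSU's health) is what actually matters.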

Phil, when you say "what you are experiencing", what do you mean
specifically?  The dmesg errors and drives falling off?  Or did you
mean the beeping noises (since that's the part you trimmed)?


Here is the data you requested

1) mdadm -E /dev/sd[a-f]       http://pastie.org/8040826

2) mdadm -D /dev/md0          http://pastie.org/8040828

3)
smartctl -x /dev/sda                   http://pastie.org/8040847
smartctl -x /dev/sdb                   http://pastie.org/8040848
smartctl -x /dev/sdc                   http://pastie.org/8040850
smartctl -x /dev/sdd                   http://pastie.org/8040851
smartctl -x /dev/sde                   http://pastie.org/8040852
smartctl -x /dev/sdf                   http://pastie.org/8040853

4) cat /proc/mdstat                   http://pastie.org/8040859

5) for x in /sys/block/sd*/device/timeout ; do echo $x $(< $x) ; done
                 http://pastie.org/8040870

6) dmesg | grep -e sd -e md                   http://pastie.org/8040871
(Note that I have rebooted since the last dmesg link I posted (where
two drives failed) because I was running memtest; if I should do dmesg
differently, let me know.)

7) cat /etc/mdadm.conf                   http://pastie.org/8040876


Adam, I wouldn't be opposed to spending the money on a good SATA card,
but I'd like to get opinions from a few people first.  Any suggestions
on a good one for mdadm specifically?

Thanks all!


* Re: Mdadm server eating drives
  2013-06-14  0:19       ` Barrett Lewis
@ 2013-06-14  2:08         ` Phil Turmel
       [not found]           ` <CAPSPcXgMxOF-C2Szu_nf4ZLDC8p+yJFOtvLPu7xy1DTW9VAHjg@mail.gmail.com>
  2013-07-29 22:25           ` Roy Sigurd Karlsbakk
  0 siblings, 2 replies; 34+ messages in thread
From: Phil Turmel @ 2013-06-14  2:08 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

Hi Barrett,

Please interleave your replies, and trim unnecessary quotes.

On 06/13/2013 08:19 PM, Barrett Lewis wrote:
> Sorry for the delay; I wanted to let the memtest run for 48 hours.
> It's at 49 hours now with zero errors, so memory is pretty much ruled
> out.
> 
> As far as power, I would *think* I have enough power.  The power
> supply is a 500W Thermaltake TR2.  It's powering an ASRock Z77 mobo
> with an i5-3570K, and the only card in it is a dinky little 2-port
> SATA card my OS drive is on (the RAID components are plugged into the
> mobo).  Eight 7200rpm drives and an SSD.  Tell me if this sounds
> insufficient.
> 
> Phil, when you say "what you are experiencing", what do you mean
> specifically?  The dmesg errors and drives falling off?  Or did you
> mean the beeping noises (since that's the part you trimmed)?

Drives dropping out when they shouldn't, and smartctl says "PASSED".
This is *unavoidable* when you have mismatched device and driver timeouts.

> Here is the data you requested
> 
> 1) mdadm -E /dev/sd[a-f]       http://pastie.org/8040826

/dev/sdd and /dev/sde have old event counts ...

> 2) mdadm -D /dev/md0          http://pastie.org/8040828

... matching the array report ...

> 3)
> smartctl -x /dev/sda                   http://pastie.org/8040847

Ok, but no error recovery support (typical of green drives).

> smartctl -x /dev/sdb                   http://pastie.org/8040848

Ok, green again.  No ERC.

> smartctl -x /dev/sdc                   http://pastie.org/8040850

Ok, with ERC support, but disabled.  Not a green drive.

> smartctl -x /dev/sdd                   http://pastie.org/8040851

Not Ok.  A few relocations, a couple pending errors.  ERC support
present but disabled.

> smartctl -x /dev/sde                   http://pastie.org/8040852

Not Ok.  No relocations, but several pending errors.  No ERC.

> smartctl -x /dev/sdf                   http://pastie.org/8040853

Ok, but no ERC.

> 4) cat /proc/mdstat                   http://pastie.org/8040859
> 
> 5) for x in /sys/block/sd*/device/timeout ; do echo $x $(< $x) ; done
>                  http://pastie.org/8040870

All timeouts are still the default 30 seconds.  Unless ERC is enabled,
these values must be two to three minutes.  I recommend 180 seconds.
Your array *will not* complete a rebuild without dealing with this
problem.

> 6) dmesg | grep -e sd -e md                   http://pastie.org/8040871
> (note that I have rebooted since the last dmesg link I posted (where
> two drives failed) because I was running memtest, if I should do dmesg
> differently, let me know)
> 
> 7) cat /etc/mdadm.conf                   http://pastie.org/8040876

I generally simplify the ARRAY line to just the device and the UUID, but
it is ok as is.

> Adam, I wouldn't be opposed to spending the money on a good sata card,
> but I'd like to get opinions from a few people first.  Any suggestions
> on a good one for mdadm specifically?

No need.  Just fix your timeouts.  For the two devices that support ERC,
you need to turn it on:

> smartctl -l scterc,70,70 /dev/sdc
> smartctl -l scterc,70,70 /dev/sdd

For the others, you need long timeouts in the linux driver:

> for x in /sys/block/sd[abef]/device/timeout ; do echo 180 >$x ; done

This must be done now, and at every power cycle or reboot.  rc.local or
similar distro config is the appropriate place.  (Enterprise drives
power up with ERC enabled.  As do raid-rated consumer drives like WD Red.)
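
A sketch of what that might look like in /etc/rc.local (device names
as they sit today -- note that sdX names can move around between
boots, so adjust if your drives get renumbered):

#!/bin/sh
# drives with SCT ERC support: cap in-drive error recovery at 7.0 seconds
smartctl -l scterc,70,70 /dev/sdc
smartctl -l scterc,70,70 /dev/sdd
# drives without ERC: give the kernel driver plenty of time instead
for x in /sys/block/sd[abef]/device/timeout ; do echo 180 >$x ; done
exit 0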

Then stop and re-assemble your array.  Use --force to reintegrate your
problem drives.  Fortunately, this is a raid6--with compatible timeouts,
your rebuild will succeed.  A URE on /dev/sdd would have to fall in the
same place as a URE on /dev/sde to kill it.
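
Roughly this, where the member list is an assumption -- use whatever
"mdadm -E" identifies as members of this array:

mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[a-f]
cat /proc/mdstat    # the resync of the stale members should start on its own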

Upon completion, the UREs will either be fixed or relocated.  If any
drive's relocations reach double digits, I'd replace it.

Finally, after your array is recovered, set up a cron job that'll
trigger a "check" scrub of your array on a regular basis.  I use a
weekly scrub.  The scrub keeps UREs that develop on idle parts of your
array from accumulating.  Note, the scrub itself will crash your array
if your timeouts are mismatched and any UREs are lurking.
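
For example, a weekly check from root's crontab (day and time are
arbitrary; md0 per your setup):

# m h dom mon dow   command
0 3 * * 0   echo check > /sys/block/md0/md/sync_action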

I'll let you browse the archives for a more detailed explanation of
*why* this happens.

Phil


* Re: Mdadm server eating drives
       [not found]           ` <CAPSPcXgMxOF-C2Szu_nf4ZLDC8p+yJFOtvLPu7xy1DTW9VAHjg@mail.gmail.com>
@ 2013-06-14 21:18             ` Barrett Lewis
  2013-06-14 21:20               ` Barrett Lewis
  2013-06-14 21:24               ` Phil Turmel
  0 siblings, 2 replies; 34+ messages in thread
From: Barrett Lewis @ 2013-06-14 21:18 UTC (permalink / raw)
  To: linux-raid

On Thu, Jun 13, 2013 at 9:08 PM, Phil Turmel <philip@turmel.org> wrote:
> Please interleave your replies, and trim unnecessary quotes.

No problem.

>> smartctl -l scterc,70,70 /dev/sdc
>> smartctl -l scterc,70,70 /dev/sdd
>> for x in /sys/block/sd[abef]/device/timeout ; do echo 180 >$x ; done
>
> This must be done now, and at every power cycle or reboot.  rc.local or
> similar distro config is the appropriate place.  (Enterprise drives
> power up with ERC enabled.  As do raid-rated consumer drives like WD Red.)

Seems that the drives themselves retained the ERC settings after a
reboot.  But I went ahead and put scterc and the timeouts in rc.local.

>
> Then stop and re-assemble your array.  Use --force to reintegrate your
> problem drives.  Fortunately, this is a raid6--with compatible timeouts,
> your rebuild will succeed.  A URE on /dev/sdd would have to fall in the
> same place as a URE on /dev/sde to kill it.

It worked.  Yer a wizard!  Thank you!

> Finally, after your array is recovered, set up a cron job that'll
> trigger a "check" scrub of your array on a regular basis.  I use a
> weekly scrub.  The scrub keeps UREs that develop on idle parts of your
> array from accumulating.  Note, the scrub itself will crash your array
> if your timeouts are mismatched and any UREs are lurking.

I'll definitely do this.  When you talk about mismatched timeouts, do
you mean matched between each of the components (as in
/sys/block/sdX/device/timeout) or between that driver timeout and some
device timeout per component?  If you mean between components, are my
timeouts matched now, even though I did not raise the 30 seconds on
the two drives with ERC?


* Re: Mdadm server eating drives
  2013-06-14 21:18             ` Barrett Lewis
@ 2013-06-14 21:20               ` Barrett Lewis
  2013-06-14 21:25                 ` Phil Turmel
  2013-06-14 21:24               ` Phil Turmel
  1 sibling, 1 reply; 34+ messages in thread
From: Barrett Lewis @ 2013-06-14 21:20 UTC (permalink / raw)
  To: linux-raid

Oops, again, sorry for the email issues; I'm having trouble getting
Gmail to behave.

So now that I have a synced RAID 6, I'm looking at this problem of the
filesystem having been partially or fully corrupted, which happened
after a few components were ddrescued onto other components and
force-assembled.  Is this something you might expect to happen in that
scenario?

mount /dev/md0 /media/vault               http://pastie.org/8042532

I do have a backup, so if it comes down to it, I can just make a new
filesystem and restore that way.  However, the backup is not 100%
complete, so if possible I'd like to get the file system back, even
with errors, to supplement the missing part of the backup.

Is there any reason not to run "e2fsck -y /dev/md0"?

Thanks


* Re: Mdadm server eating drives
  2013-06-14 21:18             ` Barrett Lewis
  2013-06-14 21:20               ` Barrett Lewis
@ 2013-06-14 21:24               ` Phil Turmel
  1 sibling, 0 replies; 34+ messages in thread
From: Phil Turmel @ 2013-06-14 21:24 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On 06/14/2013 05:18 PM, Barrett Lewis wrote:

> I'll definitely do this.  When you talk about mismatched timeouts, do
> you mean matched between each of the components (as in
> /sys/block/sdX/device/timeout) or between that driver timeout and some
> device timeout per component?  If you mean between components, are my
> timeouts matched now, even though I did not raise the 30 seconds on
> the two drives with ERC?

For each drive, the driver timeout (/sys/block/.../device/timeout) must
be longer than the drive's timeout (smartctl -l scterc).

Note that scterc is in deciseconds, while the driver uses seconds.

Enterprise drives typically power up with 7.0 second timeouts.  The few
SSDs I've been playing with power up with 4.0 second timeouts.  Without
ERC, the drives I've played with will perform error recovery for about
two full minutes, ignoring the world for the duration.
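
To put numbers on it (the no-ERC figure is a ballpark, not a spec):

  ERC 70 (7.0 s)  vs. 30 s driver timeout   -> fine; drive gives up first
  no ERC, ~120 s  vs. 30 s driver timeout   -> kernel resets drive, md fails it
  no ERC, ~120 s  vs. 180 s driver timeout  -> read error reported, md can fix it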

Phil


* Re: Mdadm server eating drives
  2013-06-14 21:20               ` Barrett Lewis
@ 2013-06-14 21:25                 ` Phil Turmel
  2013-06-14 21:30                   ` Phil Turmel
  0 siblings, 1 reply; 34+ messages in thread
From: Phil Turmel @ 2013-06-14 21:25 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On 06/14/2013 05:20 PM, Barrett Lewis wrote:
[trim /]
> Is there any reason not to run "e2fsck -y /dev/md0"?

An fsck is often needed after one of these crises.  So, yes.

Phil


* Re: Mdadm server eating drives
  2013-06-14 21:25                 ` Phil Turmel
@ 2013-06-14 21:30                   ` Phil Turmel
  2013-06-17 21:37                     ` Barrett Lewis
  0 siblings, 1 reply; 34+ messages in thread
From: Phil Turmel @ 2013-06-14 21:30 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On 06/14/2013 05:25 PM, Phil Turmel wrote:
> On 06/14/2013 05:20 PM, Barrett Lewis wrote:

>> Is there any reason not to run "e2fsck -y /dev/md0"?
> 
> An fsck is often needed after one of these crises.  So, yes.

After wrapping my head around the grammar... *No*, no reason to not run
fsck.

:-)

Phil



* Re: Mdadm server eating drives
  2013-06-14 21:30                   ` Phil Turmel
@ 2013-06-17 21:37                     ` Barrett Lewis
  2013-06-18  4:13                       ` Mikael Abrahamsson
  0 siblings, 1 reply; 34+ messages in thread
From: Barrett Lewis @ 2013-06-17 21:37 UTC (permalink / raw)
  To: linux-raid

This is terrific.  fsck found tons and tons of errors and fixed them
all.  Then I ran rsync -avHcn [array] [backup] and found 5 or so files
out of 8TB with some slight corruption.  They can easily be restored
from the backup, but I was curious to do a dry run first and use
vbindiff to see what the corruption looked like at the byte level.
Interesting!
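
For reference, the comparison was along these lines (the paths are
placeholders for my mount points):

rsync -avHcn /media/vault/ /path/to/backup/
#  -a  archive mode (permissions, times, ownership)
#  -v  list what differs
#  -H  preserve hard links
#  -c  compare by checksum rather than size/mtime, which is what
#      catches silently corrupted files
#  -n  dry run: report differences, change nothing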

I did notice that before rsync found one of the differences (corrupt
files) it started spitting out those same "failed command: READ FPDMA
QUEUED status: { DRDY ERR } error: { UNC }" errors as before, but this
time it did not fail the drive.  I take this to mean there are still
some physical problems with the drive, but with the new timeout
settings it is not unnecessarily failing the drive out of the array.
So if I overwrite the corrupted files with the backups (or write any
new data to the array, really), will it avoid those problem areas on
the platter?

I just want to say a big thanks, as this has been causing me
indescribable stress and monetary cost for months since the beginning
of February, and it looks like I am back in business.  I think I will
write some Perl scripts to help monitor some of these things.


* Re: Mdadm server eating drives
  2013-06-17 21:37                     ` Barrett Lewis
@ 2013-06-18  4:13                       ` Mikael Abrahamsson
  2013-06-27  0:23                         ` Barrett Lewis
  0 siblings, 1 reply; 34+ messages in thread
From: Mikael Abrahamsson @ 2013-06-18  4:13 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On Mon, 17 Jun 2013, Barrett Lewis wrote:

> I did notice that before rsync found one of the differences (corrupt 
> files) it started spitting out those same "failed command: READ FPDMA 
> QUEUED status: { DRDY ERR } error: { UNC }" errors as before but this 
> time it did not fail the drive.  I take this to mean there are still some 
> physical problems with the drive, but with the new timeout settings it 
> is not unnecessarily failing the drive out of the array.  So if I 
> overwrite the corrupted files with the backups, (or write any new data 
> to the array really), will it avoid those problem areas on the platter?

What should have happened here is that when md received the read error
it should have read the parity, recalculated what should have been in
those sectors, and written it back; the drive should then have either
succeeded in writing the new data in place or remapped the sectors
elsewhere (reallocation).

If your system is now working well, it might make sense to issue a 
"repair" to the array and let it run through completely:

echo repair > /sys/block/md0/md/sync_action
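
While it runs you can watch the progress and the running mismatch
count with (md0 as above):

cat /proc/mdstat                     # resync/repair progress
cat /sys/block/md0/md/mismatch_cnt   # mismatches found so far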

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: Mdadm server eating drives
  2013-06-18  4:13                       ` Mikael Abrahamsson
@ 2013-06-27  0:23                         ` Barrett Lewis
  2013-06-27 17:13                           ` Nicolas Jungers
  0 siblings, 1 reply; 34+ messages in thread
From: Barrett Lewis @ 2013-06-27  0:23 UTC (permalink / raw)
  To: linux-raid

Everything is going well; I am just trying to replace the parts that
are on the way out.
I ran a 'repair' and it came out with 5477 under
/sys/block/md0/md/mismatch_cnt.  Then a 'check' came out with 0.

Then I went out and bought a couple of WD Reds (I'm done with greens
now that I know they lack ERC).  I replaced one of the two drives Phil
said was not OK, the one with many reallocations (I can see those
myself) in the SMART status.  I then ran another repair to be safe.
It came up with 0 mismatches, but in the process /dev/sda started
giving me tons (and tons and tons, rolling dmesg over) of those
"failed command: READ FPDMA QUEUED status: { DRDY ERR } error:
{ UNC }" errors.  sda hadn't been giving me problems before, but I'll
come back to it.

The second disk Phil said was "not ok" was this one which showed
"several pending errors".
(original smart status) http://pastie.org/8040852
I was going to replace it with my second spare Red, but the errors
seem to have gone away.
(current smart status) http://pastie.org/8084278
Or maybe I am looking in the wrong place to find the pending errors
(looking at "197 Current_Pending_Sector").  Is the drive currently in
need of replacement?  I'm not sure what I'm looking for.

What about this one (sda), after it gave all of those errors during a
repair?  http://pastie.org/8084292
I get the "5 Reallocated_Sector_Ct", but where do you find pending errors?

What does it mean to get all these "failed command: READ FPDMA QUEUED
status: { DRDY ERR } error: { UNC }" errors while the SMART status
seems to be fine, even after a repair?

Thanks everyone, I'm learning a lot.


* Re: Mdadm server eating drives
  2013-06-27  0:23                         ` Barrett Lewis
@ 2013-06-27 17:13                           ` Nicolas Jungers
  2013-07-02  0:17                             ` Barrett Lewis
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolas Jungers @ 2013-06-27 17:13 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On 06/27/2013 02:23 AM, Barrett Lewis wrote:
[trim /]

Have you considered that your SATA cabling may be faulty?  I have had
consistently bad experiences with "cheap" SATA cables, and I now use
only cables with latches.  I say "cheap" because the price is not an
absolute criterion; the quality of the sourcing matters more in my
experience.

Regards,
N.



* Re: Mdadm server eating drives
  2013-06-27 17:13                           ` Nicolas Jungers
@ 2013-07-02  0:17                             ` Barrett Lewis
  2013-07-02  1:57                               ` Stan Hoeppner
  2013-07-02 21:49                               ` Phil Turmel
  0 siblings, 2 replies; 34+ messages in thread
From: Barrett Lewis @ 2013-07-02  0:17 UTC (permalink / raw)
  To: linux-raid

I am very sorry to keep bugging this list, but I am really lost.

After learning about ERC and timeouts, the severity of the problem was
reduced to the point that I could at least get my system back to a
RAID 6.  I ran a repair and fixed 5477 mismatches, and then a check
showed it clean.  Yet drives continue to give me DRDY statuses.  I
replaced the two that were doing it with WD Reds (which are all I
intend to buy from now on).  Then I tried to run a repair again, and
my system crashed, as if the timeouts were mismatched, but I had set
the driver timeouts on all drives to 180, even the ones with ERC, to
be safe.  The repair crashed the system several (3-4) times under
these conditions (usually within a few minutes of starting).  Finally,
instead of a repair, I ran a check, which somehow completed fine and
showed zero mismatches.

I started rsync to verify my data against a backup.  And now 3 drives
are giving me DRDY statuses.  Two of them have REALLY failed out of
the array, giving DRDY DF ERR messages, and don't even show a
superblock present in mdadm --examine, so now I'm back to the bare
minimum for my RAID 6.  One of the two drives that is so bad it lost
its superblock is one of the WD Reds I just bought and installed 5
days ago.

Any thoughts on what is going on?  I have to ask again whether it's
possible my motherboard is frying the hardware in these drives.



cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]

md0 : active raid6 sdd[6](F) sdc[7] sda[9] sdf[8](F) sdb[0] sde[4]
      7813531648 blocks super 1.2 level 6, 512k chunk, algorithm 2
[6/4] [U__UUU]

unused devices: <none>

sudo mdadm -D /dev/md0 | nopaste
http://pastie.org/8101687

sudo mdadm --examine /dev/sd[a-f] 2>&1 | nopaste
http://pastie.org/8101681


sudo smartctl -x /dev/sda | nopaste
http://pastie.org/8101691

sudo smartctl -x /dev/sdb | nopaste
http://pastie.org/8101693

sudo smartctl -x /dev/sdc | nopaste
http://pastie.org/8101694

sudo smartctl -x /dev/sdd | nopaste
http://pastie.org/8101695

sudo smartctl -x /dev/sde | nopaste
http://pastie.org/8101696

sudo smartctl -x /dev/sdf | nopaste
http://pastie.org/8101697

for x in /sys/block/sd[a-f]/device/timeout ; do echo $x $(< $x); done
/sys/block/sda/device/timeout 180
/sys/block/sdb/device/timeout 180
/sys/block/sdc/device/timeout 180
/sys/block/sdd/device/timeout 180
/sys/block/sde/device/timeout 180
/sys/block/sdf/device/timeout 180






On Thu, Jun 27, 2013 at 12:13 PM, Nicolas Jungers <nicolas@jungers.net> wrote:
[trim /]

* Re: Mdadm server eating drives
  2013-07-02  0:17                             ` Barrett Lewis
@ 2013-07-02  1:57                               ` Stan Hoeppner
  2013-07-02 15:48                                 ` Barrett Lewis
  2013-07-02 21:49                               ` Phil Turmel
  1 sibling, 1 reply; 34+ messages in thread
From: Stan Hoeppner @ 2013-07-02  1:57 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On 7/1/2013 7:17 PM, Barrett Lewis wrote:
> I am very sorry to keep bugging this list, but I am really lost.

I apologize as I just noticed this thread.  If I'd jumped in sooner you
might already have it fixed.  I pulled your previous posts from my
archive folder and read with interest.

> I noticed one drive was going up and down and determined that
> the drive had actual physical damage to the power connector and
> was losing and regaining power through vibration.

This intermittent contact could have damaged the PSU.  You've
continued to have drive and lockup problems since replacing this drive
with the bad connector.

The pink elephant in the room is thermal failure due to insufficient
airflow.  The symptoms you describe sound like drives overheating.  What
chassis is this?  Make/model please.  If you've installed individual
drive hot swap cages, etc, it would be helpful if you snapped a photo or
two and made those available.

I've seen many instances of this type of failure over the years and, in
order of prevalence, they are:

1.  Failed cheap backplane
2.  Insufficient airflow
3.  Failed or cheap PSU
4.  Failed HBA (or Southbridge)

-- 
Stan



* Re: Mdadm server eating drives
  2013-07-02  1:57                               ` Stan Hoeppner
@ 2013-07-02 15:48                                 ` Barrett Lewis
  2013-07-02 19:44                                   ` Stan Hoeppner
  0 siblings, 1 reply; 34+ messages in thread
From: Barrett Lewis @ 2013-07-02 15:48 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

After sending the last email I went out and bought 2 new WD Reds and a
new motherboard.  When I came back, I found that in those 2 hours all
but 1 of my drives had failed to the point of being unable to read the
superblock, so it really seems like my array is finished.

On Mon, Jul 1, 2013 at 8:57 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> I noticed one drive was going up and down and determined that
>> the drive had actual physical damage to the power connector and
>> was losing and regaining power through vibration.
>
> This intermittent contact could have damaged the PSU.  You've continued
> to have drive and lockup problems since replacing this drive with the
> bad connector.

I hadn't thought of it until you said so, but I bet you are right about
the iffy connector.  It certainly seemed as if I never had an issue
with the array for 8 months, and then suddenly everything got unstable
at once, and since then I've lost at least 6 hard drives.

>
> The pink elephant in the room is thermal failure due to insufficient
> airflow.  The symptoms you describe sound like drives overheating.  What
> chassis is this?  Make/model please.  If you've installed individual
> drive hot swap cages, etc, it would be helpful if you snapped a photo or
> two and made those available.
>
>

It is also possible that there were cooling issues.  The case is an
NZXT H2.  It has some fans blowing directly on all the hard drives,
but there were a few times I have to admit I took the fans off to work
on things and forgot to put them back on for a few days, coming back
to find them very hot to the touch.  I would have mentioned that
earlier, but a data recovery place told me that it was unlikely that
would be a culprit (after they had my money).

I don't have any drives in special cages but here's a pic anyway.  The
two fanboxes that sit in front of them are taken off.
https://docs.google.com/file/d/0B1w3WvCHlYUWRVhWOVd0Qmt1TUk/edit?usp=sharing


Maybe that's all academic at this point.  I guess I'll have to rebuild
my server from scratch since all my disks seem destroyed and I can't
trust the mobo, CPU, or PSU.  At least I can memtest the RAM.  The PSU
wasn't dirt cheap, a Thermaltake TR2 500W @ $58.  Should I buy all new
everything?  If so, while I'm at it, can you suggest a set of consumer
level hardware ideal for running a personal mdadm server?  Powered but
not overpowered, reliable but not bleeding edge.  If I need 6-8 SATA
ports, should I do onboard or get a controller?

I still have one backup, although I'm very nervous now since it's on a
3 disk RAID0, just asking to implode (created in an emergency).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-02 15:48                                 ` Barrett Lewis
@ 2013-07-02 19:44                                   ` Stan Hoeppner
  2013-07-02 19:54                                     ` Stan Hoeppner
                                                       ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Stan Hoeppner @ 2013-07-02 19:44 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On 7/2/2013 10:48 AM, Barrett Lewis wrote:
> After sending the last email I went out and bought 2 new WD reds, and
> a new motherboard.  I came back and in those 2 hours all but 1 of my
> drives failed to the point of being unable to read the superblock so
> it really seems like my array is ended

The drive may be ok.  They all may be.

> On Mon, Jul 1, 2013 at 8:57 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>> I noticed one drive was going up and down and determined that
>>> the drive had actual physical damage to the power connecter and
>>> was losing and regaining power through vibration.
>>
>> This intermittent contact could have damaged the PSU.  You've continued
>> to have drive and lockup problems since replacing this drive with bad
>> connector.
> 
> I hadn't thought of it until you said so but I bet you are right about
> the iffy connector.  It certainly seemed as if I never had an issue
> with the array for 8 months, and then suddenly everything got unstable
> at once, and since then I've lost atleast 6 hard drives.

Your drives may not be toast.  Don't toss them out, and don't throw up
your hands yet.

>> The pink elephant in the room is thermal failure due to insufficient
>> airflow.  The symptoms you describe sound like drives overheating.  What
>> chassis is this?  Make/model please.  If you've installed individual
>> drive hot swap cages, etc, it would be helpful if you snapped a photo or
>> two and made those available.
>
> It is also possible that there were cooling issues.  The case is an
> NZXT H2.  It has some fans blowing directly on all the hard drives,
> but there were a few times I have to admit I took the fans off to work
> on things and forgot to put them back on for a few days, coming back
> to find them very hot to the touch.  I would have mentioned that
> earlier, but a data recovery place told me that it was unlikely that
> would be a culprit (after they had my money).

I checked out the chassis on the NZXT site.  With the front fans
removed, you have only 2x120mm low rpm, low static pressure, and low CFM
exhaust fans, one in the PSU, one top rear.  With 8 drives packed in
such close proximity and with other lower resistance intake paths (the
perforated chassis bottom), you won't get enough air through the front
drive cage to cool those drives properly over a long period.

However, running with the two front fans removed for a couple of days on
an occasion or two shouldn't have overheated the drives to the point of
permanent damage, assuming ambient air temp was ~75F or lower, and
assuming you were not performing long array operations such as rebuilds
or reshapes--if you did so the drives could get hot enough, long enough,
to be permanently damaged.
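
If you want hard numbers instead of the touch test, the drives report
their own temperature over SMART.  Something along these lines (run as
root; the device list is only an example, adjust it to your members)
prints the current reading for each drive:

  for d in /dev/sd[a-f]; do
      echo -n "$d: "
      smartctl -A $d | awk '/Temperature_Cel/ {print $10; exit}'
  done

Column 10 of the attribute table is the raw value, in degrees C.  As a
rough rule of thumb, anything sitting well above ~45C for long
stretches under load is worth fixing.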

> Maybe thats all academic at this point.  I guess i'll have to rebuild
> my server from scratch since all my disks seem destroyed and I can't
> trust the mobo, cpu, or psu.

Don't start over.  Not just yet.  Leave everything as is for now.
Simply replace the PSU.  Fire it up and see what you can recover.

> The psu wasn't dirt cheap, Thermaltake TR2 500w @ $58.  

The price isn't relevant.  The quality and rail configuration is, and
whether it's been damaged.  I checked the spec on your TR2-500
yesterday.  It has dual +12V rails, one rated at 18A and one at 17A.  I
was unable to locate a wiring diagram for it.  On paper it should have
plenty of juice for your gear when in working order.  My assumption here
is that something internal to it may have failed.

> Should I buy all new
> everything?  

I wouldn't.  Most of your gear is probably fine.  Get the PSU swapped
out and see if that fixes it.  You may still have to wipe the drives and
build a new array.  You should know pretty quickly if the PSU swap fixed
the problem, as drives will not continue to drop, or they will.  You
already have a new mobo in hand, so if the PSU isn't the problem, swap
the mobo.  That's a good chassis design with good airflow assuming you
keep the front fans in it.  Why you'd leave them removed is beyond me.
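
An easy way to keep an eye on it while you test, with nothing beyond
the standard tools (log file name varies by distro, so treat this as a
sketch):

  watch -n 5 cat /proc/mdstat    # an (F) after a member means md failed it out
  tail -f /var/log/kern.log      # or /var/log/messages, to see the ata errors live

If nothing shows up through a day or two of normal use plus a rebuild,
the PSU was very likely your culprit.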

> If so, while I'm at can you suggest a set of consumer
> level hardware ideal running a personal mdadm server.  Powered but not
> overpowered, reliable not bleeding edge.  If I need 6-8 sata ports,
> should I do onboard or get a controller?

A new HBA shouldn't be necessary.  But if you choose to go that route
further down the road I'd recommend an LSI 9211-8i.

> I still have one backup allthough I'm very nervous now since it's on a
> 3 disk RAID0, just asking to implode (created in an emergency).

I assume this resides on a different machine.

Swap the PSU.  Recover the array if possible.  If not, blow it away and
create a new one.  If no drives drop out you're probably golden and the
PSU fixed the problem.  If they do drop, swap in the new mobo.  At that
point you'll have replaced everything that could be the source of the
problem except the remaining original drives.  They can't all be bad,
if any are.  Always run with those front fans installed.

-- 
Stan





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-02 19:44                                   ` Stan Hoeppner
@ 2013-07-02 19:54                                     ` Stan Hoeppner
  2013-07-02 20:07                                     ` Jon Nelson
  2013-07-02 20:58                                     ` Barrett Lewis
  2 siblings, 0 replies; 34+ messages in thread
From: Stan Hoeppner @ 2013-07-02 19:54 UTC (permalink / raw)
  To: stan; +Cc: Barrett Lewis, linux-raid

Forgot to ask previously.  This system is attached to a UPS, isn't it?

-- 
Stan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-02 19:44                                   ` Stan Hoeppner
  2013-07-02 19:54                                     ` Stan Hoeppner
@ 2013-07-02 20:07                                     ` Jon Nelson
  2013-07-02 20:23                                       ` Stan Hoeppner
  2013-07-02 20:58                                     ` Barrett Lewis
  2 siblings, 1 reply; 34+ messages in thread
From: Jon Nelson @ 2013-07-02 20:07 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Barrett Lewis, linux-raid

On Tue, Jul 2, 2013 at 2:44 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 7/2/2013 10:48 AM, Barrett Lewis wrote:
>> After sending the last email I went out and bought 2 new WD reds, and
>> a new motherboard.  I came back and in those 2 hours all but 1 of my
>> drives failed to the point of being unable to read the superblock so
>> it really seems like my array is ended
>
> The drive may be ok.  They all may be.

Indeed. A number of years back, I had an MD RAID array that kept
throwing drives, one after the other, after years of rock-solid
stability. Nothing had changed, the machine hadn't been touched (or
even rebooted!) in months, etc... It turns out that the motherboard
had gone. It "worked" perfectly, except under any drive load at all it
would start throwing I/O errors. I replaced only the motherboard (same
PSU, memory, CPU, etc....) and that machine - built at least 4 years
ago - is still humming along quite nicely.

--
Jon

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-02 20:07                                     ` Jon Nelson
@ 2013-07-02 20:23                                       ` Stan Hoeppner
  0 siblings, 0 replies; 34+ messages in thread
From: Stan Hoeppner @ 2013-07-02 20:23 UTC (permalink / raw)
  To: Jon Nelson; +Cc: Barrett Lewis, linux-raid

On 7/2/2013 3:07 PM, Jon Nelson wrote:
> On Tue, Jul 2, 2013 at 2:44 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 7/2/2013 10:48 AM, Barrett Lewis wrote:
>>> After sending the last email I went out and bought 2 new WD reds, and
>>> a new motherboard.  I came back and in those 2 hours all but 1 of my
>>> drives failed to the point of being unable to read the superblock so
>>> it really seems like my array is ended
>>
>> The drive may be ok.  They all may be.
> 
> Indeed. A number of years back, I had an MD RAID array that kept
> throwing drives, one after the other, after years of rock-solid
> stability. Nothing had changed, the machine hadn't been touched (or
> even rebooted!) in months, etc... It turns out that the motherboard
> had gone. It "worked" perfectly, except under any drive load at all it
> would start throwing I/O errors. I replaced only the motherboard (same
> PSU, memory, CPU, etc....) and that machine - built at least 4 years
> ago - is still humming along quite nicely.

Were the drives attached to the onboard SATA controller or an HBA?

-- 
Stan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-02 19:44                                   ` Stan Hoeppner
  2013-07-02 19:54                                     ` Stan Hoeppner
  2013-07-02 20:07                                     ` Jon Nelson
@ 2013-07-02 20:58                                     ` Barrett Lewis
  2013-07-03  1:50                                       ` Stan Hoeppner
  2 siblings, 1 reply; 34+ messages in thread
From: Barrett Lewis @ 2013-07-02 20:58 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Tue, Jul 2, 2013 at 2:44 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> I checked out the chassis on the NZXT site.  With the front fans
> removed, you have only 2x120mm low rpm, low static pressure, and low CFM
> exhaust fans, one on in the PSU, one top rear.  With 8 drives packed in
> such close proximity and with other lower resistance intake paths (the
> perforated chassis bottom), you won't get enough air through the front
> drive cage to cool those drives properly over a long period.
>
> However, running with the two front fans removed for a couple of days on
> an occasion or two shouldn't have overheated the drives to the point of
> permanent damage, assuming ambient air temp was ~75F or lower, and
> assuming you were not performing long array operations such as rebuilds
> or reshapes--if you did so the drives could get hot enough, long enough,
> to be permanently damaged.
>

Interesting.  Just to be clear, I never intentionally ran it without
the fans on.  The picture I sent was from when I first assembled the
server and hadn't put the fans in or plugged the machine in.  Also, the
few times the fans were left off were, as you said, "for a couple of
days on an occasion or two", but other than that the fans have always
been on.  It is possible, though, that one of those events was during a
resync, since the fans were off because I was swapping out a failed
drive and forgot to put them back on.  That is when I came home to hear
this beeping noise coming from all my drives.
https://docs.google.com/file/d/0B1w3WvCHlYUWSGdBdjh3dWpuUnc/edit?usp=sharing
I don't know what that beeping is, but it is a later recording; the
original event had many drives beeping at once (some with slightly
lower/higher pitches).  I thought maybe it was an overheat alarm or
something similar.  Most of those original drives have been replaced
at this point.  It was also at that time, when I first started pulling
wires (looking for which drives were beeping), that I found the broken
power connector.  This was back when this all started in February.


> I wouldn't.  Most of your gear is probably fine.  Get the PSU swapped
> out and see if that fixes it.  You may still have to wipe the drives and
> build a new array.  You should know pretty quickly if the PSU swap fixed
> the problem, as drives will not continue to drop, or they will.

Good starting point, I'll do that tonight.  Any particular trusty
brands?  Otherwise all I can really go off of is price (like before, I
just tried to pay a little extra for "not the cheapest").

> Forgot to ask previously.  This system is attached to a UPS isn't it?

Yes, the server is plugged in through a dedicated UPS.

> I assume this resides on a different machine.

4 drives in an external USB enclosure.  3 are a RAID0.

> Were the drives were attached to the onboard SATA controller or an HBA?

All 6 drives and my OS SSD are plugged into onboard SATA.


Thanks for your help!
Barrett

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-02  0:17                             ` Barrett Lewis
  2013-07-02  1:57                               ` Stan Hoeppner
@ 2013-07-02 21:49                               ` Phil Turmel
  1 sibling, 0 replies; 34+ messages in thread
From: Phil Turmel @ 2013-07-02 21:49 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid, Stan Hoeppner

On 07/01/2013 08:17 PM, Barrett Lewis wrote:
> I am very sorry to keep bugging this list, but I am really lost.

My apologies...  I was helping you before I disappeared on a 2-week
business trip.  I plain forgot about your case.

Anyways, Stan's on the case.  [Thanks, Stan.]

Phil


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-02 20:58                                     ` Barrett Lewis
@ 2013-07-03  1:50                                       ` Stan Hoeppner
  2013-07-03  5:26                                         ` Barrett Lewis
  0 siblings, 1 reply; 34+ messages in thread
From: Stan Hoeppner @ 2013-07-03  1:50 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On 7/2/2013 3:58 PM, Barrett Lewis wrote:

> https://docs.google.com/file/d/0B1w3WvCHlYUWSGdBdjh3dWpuUnc/edit?usp=sharing

Drives don't beep, they can't.  They don't contain transducers, never
have.  And you don't have a RAID card.  So that beep must be from the
motherboard connected PC speaker, which means you have raidmon or
another md monitoring daemon active.  If this is the case it was simply
giving an audible alert that a drive had been dropped.
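
You can check whether that's the case easily enough.  Roughly (the
config path differs between distros, so adjust):

  ps ax | grep '[m]dadm --monitor'                  # is a monitor daemon running?
  grep -E 'MAILADDR|PROGRAM' /etc/mdadm/mdadm.conf  # where its alerts go

If a PROGRAM line is set, whatever it points at may well be what beeps
the speaker when a member gets failed.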

> Good starting point, I'll do that tonight.  Any particular trusty
> brands?  Otherwise all I can really go off of is price (like before, I
> just tried to pay a little extra for "not the cheapest").

For troubleshooting purposes I'd think any recent 400+ watt ATX PSU you
have lying around should work, assuming there's no high wattage PCIe GPU
card in the box sucking +12V power, and assuming you have all the
necessary y-cables and SATA power adapters, etc.  Try a spare PSU if
possible before plunking cash on a possibly unneeded replacement.

For a permanent replacement, I'll tell ya, they're all of pretty much
similar quality today, except for the fan, after you get off the very
bottom of the barrel.  Cheap units come with cheap sleeve bearing fans
that don't last.  I buy near the bottom of the barrel and replace the
fans on day one.  I buy quality fans in bulk on closeout/overstock/etc
every few years specifically for this purpose.  Most don't have standard
2 pin PC connectors so I cut the one off the stock crap fan and solder
it to the good one.  Currently I'm draining a box of a dozen 80x25mm NMB
-30 series Boxers for PSU duty, and a box of a dozen Nidec BetaV 92x25mm
industrial fans for chassis duty.  All double ball bearing, highest
quality you can get.  Not the quietest, but they're high CFM and high
static pressure.  Others in this class are Sanyo Denki, Pabst, Delta,
Panaflow, etc.  I won't use 120mm fans in PSUs or chassis, but that's a
discussion for another day.

Either of these two should be ok.  I'm not into the goofy lights and
what not on the Apevia, or the triple fan design (more to replace), but
at least it has a fan speed controller.  Both have great reviews, and
plenty of +12V power.  The one thing I -really- like about the Apevia is
the single +12V rail rated at 35 amps.  Single rail is always better,
contrary to popular belief.  Multiple +12V rail PSUs came into existence
because they're cheaper to produce, not because they're any better.
2/3/4 small MOSFETS, one per rail, are cheaper than one big MOSFET.
Take a look at any -real- server PSU design.  They're all single +12V
rail, some rated to 150 amps (1800 watts).

http://www.newegg.com/Product/Product.aspx?Item=N82E16817101021
http://www.newegg.com/Product/Product.aspx?Item=N82E16817148008

>> Forgot to ask previously.  This system is attached to a UPS isn't it?
> 
> Yes, the server is plugged in through a dedicated UPS.

Good, takes care of that.

>> I assume this resides on a different machine.
> 
> 4 drives in an external USB enclosure.  3 are a RAID0.

Ok, so this is your workstation, not a dedicated server?  Does it have a
PCIe GPU?  If so what wattage?  Ok, if you don't know that, what model?

>> Were the drives were attached to the onboard SATA controller or an HBA?
> 
> All 6 drives and my OS SSD are plugged into onboard SATA.

I counted 8 drives in the picture.

-- 
Stan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-03  1:50                                       ` Stan Hoeppner
@ 2013-07-03  5:26                                         ` Barrett Lewis
  2013-07-03 14:03                                           ` Jon Nelson
  2013-07-03 17:05                                           ` Stan Hoeppner
  0 siblings, 2 replies; 34+ messages in thread
From: Barrett Lewis @ 2013-07-03  5:26 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Tue, Jul 2, 2013 at 8:50 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>
> On 7/2/2013 3:58 PM, Barrett Lewis wrote:
> >> I assume this resides on a different machine.
> >
> > 4 drives in an external USB enclosure.  3 are a RAID0.
>
> Ok, so this is your workstation, not a dedicated server?  Does it have a
> PCIe GPU?  If so what wattage?  Ok, if you don't know that, what model?

This is all about my dedicated server.  The external enclosure with
the 4 drives, 3 of which are in a RAID0, is just something I used for
creating an emergency backup, and it was plugged directly into the
server via USB (it has its own power supply too).  The server is using
the onboard video on the ASRock Z77 Extreme4.

> >> Were the drives were attached to the onboard SATA controller or an HBA?
> >
> > All 6 drives and my OS SSD are plugged into onboard SATA.
>
> I counted 8 drives in the picture.

The other 2 drives in the picture are the source drives that had the
original data that the array was initially populated with.  They are
not plugged into power or data.  Just taking up space, really.  I
never took them out because I always intended to grow the array onto
them, but then failures started.

> > https://docs.google.com/file/d/0B1w3WvCHlYUWSGdBdjh3dWpuUnc/edit?usp=sharing
>
> Drives don't beep, they can't.  They don't contain transducers, never
> have.  And you don't have a RAID card.  So that beep must be from the
> motherboard connected PC speaker, which means you have raidmon or
> another md monitoring daemon active.  If this is the case it was simply
> giving an audible alert that a drive had been dropped.

So, I accept that you know this stuff better than I do, but I was
pretty sure that noise was coming out of the drives (and I had never
seen or heard of anything like that before, so I was very surprised).
When I first built the machine I heard it once when a drive was
jarred: the caddy wasn't pushed all the way back, I pushed it until it
clicked while it was running, and something made a quick "beep", which
I thought was odd.  Then, the day these failures started, it sounded
like the same "beeping" noises were coming out of several drives all
at once, out of sync with each other, sometimes the sounds overlapping,
sometimes with pitches offset; it really didn't sound like a single
source at all.  But I guess I could have been mistaken.  I have been
really curious about this "beeping" issue since it is so bizarre.
Anyway, like I said, only 2 of those original 6 (they were Seagate
ST2000DM001) remain.

>
> For troubleshooting purposes I'd think any recent 400+ watt ATX PSU you
> have lying around should work, assuming there's no high wattage PCIe GPU
> card in the box sucking +12V power, and assuming you have all the
> necessary y-cables and SATA power adapters, etc.  Try a spare PSU if
> possible before plunking cash on a possibly unneeded replacement.
>
> For a permanent replacement, I'll tell ya, they're all of pretty much
> similar quality today, except for the fan, after you get off the very
> bottom of the barrel.  Cheap units come with cheap sleeve bearing fans
> that don't last.  I buy near the bottom of the barrel and replace the
> fans on day one.  I buy quality fans in bulk on closeout/overstock/etc
> every few years specifically for this purpose.  Most don't have standard
> 2 pin PC connectors so I cut the one off the stock crap fan and solder
> it to the good one.


A cheap alternate PSU seemed to work OK, so I went to buy a decent
permanent replacement.  I couldn't find either of the two you
suggested at the store (it was closing and I wanted to get this done),
so I ended up going with a 750W Corsair CX750M.  Like magic, with the
new power supply most of the drives seem to be back working, except
the first two that failed out yesterday.  It seems like maybe the
event counters (or something) are too far behind to assemble them
back in.  That said, md0 mounts fine and fsck returned clean, so that
deserves some kinda hooray!
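
(A quick way to compare them, by the way:

  for d in /dev/sd[a-f]; do echo -n "$d: "; sudo mdadm -E $d | grep -i events; done

The two members flagged "possibly out of date" are the ones whose
Events count lags the rest.)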

Here is some data about the two (sdd and sdf) that won't socialize
with the other disks.

sudo mdadm --assemble --force --verbose /dev/md0 /dev/sd[a-f]
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sde is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdf is identified as a member of /dev/md0, slot 2.
mdadm: added /dev/sdd to /dev/md0 as 1 (possibly out of date)
mdadm: added /dev/sdf to /dev/md0 as 2 (possibly out of date)
mdadm: added /dev/sde to /dev/md0 as 3
mdadm: added /dev/sda to /dev/md0 as 4
mdadm: added /dev/sdc to /dev/md0 as 5
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: /dev/md0 has been started with 4 drives (out of 6).


and from dmesg
[ 4481.356723] md: bind<sdd>
[ 4481.356850] md: bind<sdf>
[ 4481.357007] md: bind<sde>
[ 4481.357134] md: bind<sda>
[ 4481.357248] md: bind<sdc>
[ 4481.357365] md: bind<sdb>
[ 4481.357395] md: kicking non-fresh sdf from array!
[ 4481.357400] md: unbind<sdf>
[ 4481.374480] md: export_rdev(sdf)
[ 4481.374484] md: kicking non-fresh sdd from array!
[ 4481.374488] md: unbind<sdd>
[ 4481.394486] md: export_rdev(sdd)
[ 4481.396164] md/raid:md0: device sdb operational as raid disk 0
[ 4481.396168] md/raid:md0: device sdc operational as raid disk 5
[ 4481.396171] md/raid:md0: device sda operational as raid disk 4
[ 4481.396173] md/raid:md0: device sde operational as raid disk 3
[ 4481.396571] md/raid:md0: allocated 6384kB
[ 4481.396805] md/raid:md0: raid level 6 active with 4 out of 6
devices, algorithm 2
[ 4481.396808] RAID conf printout:
[ 4481.396810]  --- level:6 rd:6 wd:4
[ 4481.396812]  disk 0, o:1, dev:sdb
[ 4481.396814]  disk 3, o:1, dev:sde
[ 4481.396815]  disk 4, o:1, dev:sda
[ 4481.396817]  disk 5, o:1, dev:sdc
[ 4481.396848] md0: detected capacity change from 0 to 8001056407552
[ 4481.426011]  md0: unknown partition table

sudo mdadm -E /dev/sd[a-f] | nopaste
http://pastie.org/8105693

sudo smartctl -x /dev/sdd | nopaste
http://pastie.org/8105706

sudo smartctl -x /dev/sdf | nopaste
http://pastie.org/8105707


Are sdd and sdf just too out of sync?  Should I zero the superblocks
and re-add them to the array?  Or I could replace them (I have two
unopened WD reds here, but I'd like to return them if I don't really
need them right now).

Thanks for the advice about the PSU, I would have never dreamed it
would cause behaviour like that.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-03  5:26                                         ` Barrett Lewis
@ 2013-07-03 14:03                                           ` Jon Nelson
  2013-07-03 14:36                                             ` Phil Turmel
  2013-07-03 17:32                                             ` Stan Hoeppner
  2013-07-03 17:05                                           ` Stan Hoeppner
  1 sibling, 2 replies; 34+ messages in thread
From: Jon Nelson @ 2013-07-03 14:03 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: Stan Hoeppner, linux-raid

On Wed, Jul 3, 2013 at 12:26 AM, Barrett Lewis
<barrett.lewis.mitsi@gmail.com> wrote:
>
> didn't sound like a single source at all.  But I guess could have been
> mistaken.  I have been really curious about this "beeping" issue since
> it is so bizarre.  Anyway like I said only 2 of those original 6 (they
> were seagate ST2000DM001) remain.


A quick Google search shows that the ST2000DM001 (which I have 2 of)
*does* make "chirping" or "beeping" noises. Additionally, it seems
there are firmware updates available. Sadly, I bought two of these
drives some months ago (current firmware: CC26), though so far so good.
However, should I be worried about these drives?

smartctl -a
showed me a pair of links that brought me to the firmware update pages.
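
For anyone else wanting to check theirs, the model and firmware
revision are in the identify section, e.g.:

  smartctl -i /dev/sdX | egrep 'Device Model|Firmware Version'

with /dev/sdX being whichever drive you want to query.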


--
Jon

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-03 14:03                                           ` Jon Nelson
@ 2013-07-03 14:36                                             ` Phil Turmel
  2013-07-03 17:32                                             ` Stan Hoeppner
  1 sibling, 0 replies; 34+ messages in thread
From: Phil Turmel @ 2013-07-03 14:36 UTC (permalink / raw)
  To: Jon Nelson; +Cc: Barrett Lewis, Stan Hoeppner, linux-raid

On 07/03/2013 10:03 AM, Jon Nelson wrote:
> On Wed, Jul 3, 2013 at 12:26 AM, Barrett Lewis
> <barrett.lewis.mitsi@gmail.com> wrote:
>>
>> didn't sound like a single source at all.  But I guess could have been
>> mistaken.  I have been really curious about this "beeping" issue since
>> it is so bizarre.  Anyway like I said only 2 of those original 6 (they
>> were seagate ST2000DM001) remain.
> 
> 
> A quick google search shows the ST2000DM001 (which I have 2 of) *do*
> make "chirping" or "beeping" noises. Additionally, it seems there are
> firmware updates available. Sadly, I bought two of these drives some
> months ago (current firmware: CC26) and so far so good. However,
> should I be worried about these drives?

They don't support ERC.  You *must* use a 2-3 minute driver timeout for
these if you use them in a raid array.
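
A sketch of what that means in practice (run as root; the device list
is an example, and the sysfs setting does not survive a reboot, so it
also wants to go in rc.local or a udev rule):

  # check whether a drive supports SCT ERC at all
  smartctl -l scterc /dev/sdX

  # for members that don't, raise the kernel's command timeout
  for x in /sys/block/sd[a-f]/device/timeout ; do echo 180 > $x ; done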

Phil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-03  5:26                                         ` Barrett Lewis
  2013-07-03 14:03                                           ` Jon Nelson
@ 2013-07-03 17:05                                           ` Stan Hoeppner
  1 sibling, 0 replies; 34+ messages in thread
From: Stan Hoeppner @ 2013-07-03 17:05 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid, Phil Turmel

On 7/3/2013 12:26 AM, Barrett Lewis wrote:
...
> This is all about my dedicated server.  The external enclosure with
> the 4 drives, 3 of which in a raid0 is just something I used for
> creating an emergency backup, and was plugged directly into the server
> via USB, (has it's own power supply too).  The server is using the
> onboard video card on the Asrock z77 extreme 4.

Got it.
...
> The other 2 drives in the picture are the source drives that had the
> original data that the array was initially populated with.

Got it.  These questions were simply to get a handle on how much +12V
power you needed before recommending a PSU.

...
> I have been really curious about this "beeping" issue since
> it is so bizarre.  Anyway like I said only 2 of those original 6 (they
> were seagate ST2000DM001) remain.

When power supplies go bad you may witness all kinds of weird things.
If the voltage to the speaker drive circuit fluctuates wildly it can
cause leakage on the output drive, which causes the speaker to make
random noises.

> Cheap alternate PSU seemed to work OK so I went to buy a decent
> permanent replacement.  I couldn't find either of the two you
> suggested at the store (they were closing and I wanted to get this
> done).  So I ended up going with a 750w corsair CX750M.  Like magic,
> with a new power supply most of the drives seem to be back working,
> except the first two that failed out yesterday.  It seems like maybe
> the event counters (or something) are too far behind to assemble them
> back.  That said, md0 mounts fine and fsck returned clean, so that
> deserves some kinda hooray!

The key thing is whether drives keep showing errors in dmesg and
dropping.  If not your problem is likely solved.  :)

> Are sdd and sdf just too out of sync?  Should I zero the superblocks
> and re-add them to the array?  Or I could replace them (I have two
> unopened WD reds here, but I'd like to return them if I don't really
> need them right now).

I'm not an expert on recovery when things go this far South.  Phil and
others are much more knowledgeable with this so I'll pass the thread
back to them now.

> Thanks for the advice about the PSU, I would have never dreamed it
> would cause behaviour like that.

You're welcome.  I've spent just a little time around hardware, as you
might have guessed based on my email address.  Started in 1986, so
that's, what, 26 years now?  Damn I'm getting old...

-- 
Stan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-03 14:03                                           ` Jon Nelson
  2013-07-03 14:36                                             ` Phil Turmel
@ 2013-07-03 17:32                                             ` Stan Hoeppner
  2013-07-03 19:47                                               ` Barrett Lewis
  1 sibling, 1 reply; 34+ messages in thread
From: Stan Hoeppner @ 2013-07-03 17:32 UTC (permalink / raw)
  To: Jon Nelson; +Cc: Barrett Lewis, linux-raid

On 7/3/2013 9:03 AM, Jon Nelson wrote:

> A quick google search shows the ST2000DM001 (which I have 2 of) *do*
> make "chirping" or "beeping" noises. Additionally

Yes, Seagate still makes some relatively noisy drives compared to
others on the market.  I had ST225s and ST251s in the late 80s that
could be heard across a large room.  All drives were noisy back then
due to the use of stepper motors.  Since the early 90s, shortly after
the IDE spec was adopted, drives have used voice coil actuators, which
are an order of magnitude quieter.  These are very old and thus a bit
noisier, but not all that much:

http://www.youtube.com/watch?v=NYEkC7FBXa4
http://www.youtube.com/watch?v=RZMrwdQBVf4

But surely nobody would mistake this random mechanical drive noise for
an audible alarm.  And of course, with dirty power as in the OP's case,
drives will make more noise because the firmware keeps recalibrating
the heads as the spindle drops below the minimum RPM threshold and
comes back up again when the voltage recovers.

I think some people have simply become accustomed to ultra quiet drives
that rarely make a peep, and when they do people get all nervous.

-- 
Stan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-03 17:32                                             ` Stan Hoeppner
@ 2013-07-03 19:47                                               ` Barrett Lewis
  2013-07-03 20:38                                                 ` Jon Nelson
  2013-07-04  2:21                                                 ` Stan Hoeppner
  0 siblings, 2 replies; 34+ messages in thread
From: Barrett Lewis @ 2013-07-03 19:47 UTC (permalink / raw)
  To: linux-raid

I added the two non-fresh drives back to the array and they have been
resyncing.  The first one is almost complete.  No errors so far.
Everything has been very smooth since replacing the power supply.
And I am paying extra close attention to make sure my TLER-capable
drives have it turned on, and the others have the driver timeout set
to 180 seconds.  From now on I will only be buying TLER-capable drives.
Just an update.
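
For the record, in case anyone finds this thread later, this is roughly
what that amounted to (sdX is a stand-in for whichever member applies):

  sudo mdadm /dev/md0 --re-add /dev/sdd
  sudo mdadm /dev/md0 --re-add /dev/sdf   # mdadm may want --add instead if there's no bitmap

  sudo smartctl -l scterc,70,70 /dev/sdX  # 7s ERC on the TLER-capable drives, re-run each boot
  echo 180 | sudo tee /sys/block/sdX/device/timeout   # long timeout for the rest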

Also, this is likely the same drive making the same noise with the top
off: https://www.youtube.com/watch?v=a9i5yixsJbk
I wonder if it's PWM driving the actuator against the barrier.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-03 19:47                                               ` Barrett Lewis
@ 2013-07-03 20:38                                                 ` Jon Nelson
  2013-07-04  2:21                                                 ` Stan Hoeppner
  1 sibling, 0 replies; 34+ messages in thread
From: Jon Nelson @ 2013-07-03 20:38 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On Wed, Jul 3, 2013 at 2:47 PM, Barrett Lewis
<barrett.lewis.mitsi@gmail.com> wrote:
> I added the two non-fresh drives back to the array and they have been
> resyncing.  The first one is almost complete.  No errors so far.
> Everything is been very smooth since replacing the power supply.
> And I am paying extra close attention to make sure my TLER capable
> drives have it turned on, and the others have the driver timeout set
> to 180.  From now on I will only be buying TLER capable drives.
> Just an update.

That's very good news. Data loss can be very frustrating, I know!

As for myself, even though I've got newer drives and firmware
(ST2000DM001-1CH164, firmware CC26), I'm going to be actively looking
to replace these drives.


--
Jon

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-07-03 19:47                                               ` Barrett Lewis
  2013-07-03 20:38                                                 ` Jon Nelson
@ 2013-07-04  2:21                                                 ` Stan Hoeppner
  1 sibling, 0 replies; 34+ messages in thread
From: Stan Hoeppner @ 2013-07-04  2:21 UTC (permalink / raw)
  To: Barrett Lewis; +Cc: linux-raid

On 7/3/2013 2:47 PM, Barrett Lewis wrote:
> I added the two non-fresh drives back to the array and they have been
> resyncing.  The first one is almost complete.  No errors so far.
> Everything is been very smooth since replacing the power supply.
> And I am paying extra close attention to make sure my TLER capable
> drives have it turned on, and the others have the driver timeout set
> to 180.  From now on I will only be buying TLER capable drives.
> Just an update.

Good to hear.  I hope things keep looking up.

> Also, this is likely the same drive making the same noise with the top
> off https://www.youtube.com/watch?v=a9i5yixsJbk
> I wonder if its PWM driving the actuator against the barrier.

Spinning HDDs use voice coil actuators to move the heads.  The voice
coil is driven by direct DC current, not pulse width modulation.  PWM
keeps the voltage and current constant but switches the circuit on and
off hundreds or thousands of times per second.  The lower the duty
cycle (the fraction of time the circuit is on), the less power reaches
the device; the higher the duty cycle, the greater the power.  Thus
PWM is suitable for varying the speed of brushless DC fans and the
brightness of incandescent light bulbs.  It simply won't work for
driving voice coil actuators.

Regarding HDD noises, identifying/diagnosing them is a voodoo science
unless you happen to be an engineer at Seagate, WD, or Toshiba, which
are, sadly, AFAIK, the only 3 HDD vendors left on the planet.  Seagate
and WD have swallowed all the others.

-- 
Stan



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Mdadm server eating drives
  2013-06-14  2:08         ` Phil Turmel
       [not found]           ` <CAPSPcXgMxOF-C2Szu_nf4ZLDC8p+yJFOtvLPu7xy1DTW9VAHjg@mail.gmail.com>
@ 2013-07-29 22:25           ` Roy Sigurd Karlsbakk
  1 sibling, 0 replies; 34+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-07-29 22:25 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid, Barrett Lewis

> > 5) for x in /sys/block/sd*/device/timeout ; do echo $x $(< $x) ;
> > done
> >                  http://pastie.org/8040870
> 
> All timeouts are still the default 30 seconds. With enabled ERC
> support, these values must be two to three minutes. I recommend 180
> seconds. Your array *will not* complete a rebuild without dealing with
> this problem.

With ERC support, those timeouts should be around 7 seconds, not 3 minutes. What he pasted was 180 seconds, as in three minutes, which will bust a RAID rather quickly.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotypic etymology. In most cases adequate and relevant synonyms exist in Norwegian.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2013-07-29 22:25 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-12 13:47 Mdadm server eating drives Barrett Lewis
2013-06-12 13:57 ` David Brown
2013-06-12 14:44 ` Phil Turmel
2013-06-12 15:41 ` Adam Goryachev
     [not found]   ` <CAPSPcXihHrAi2TB9Fuxb1qOGMc_WzwGoXAA7nHdwe2knkO0LkQ@mail.gmail.com>
     [not found]     ` <CAPSPcXib4YZ9Ah-jLvL_kPwpKHLxaGT0rNaDL4XQcFm=RtjcAQ@mail.gmail.com>
2013-06-14  0:19       ` Barrett Lewis
2013-06-14  2:08         ` Phil Turmel
     [not found]           ` <CAPSPcXgMxOF-C2Szu_nf4ZLDC8p+yJFOtvLPu7xy1DTW9VAHjg@mail.gmail.com>
2013-06-14 21:18             ` Barrett Lewis
2013-06-14 21:20               ` Barrett Lewis
2013-06-14 21:25                 ` Phil Turmel
2013-06-14 21:30                   ` Phil Turmel
2013-06-17 21:37                     ` Barrett Lewis
2013-06-18  4:13                       ` Mikael Abrahamsson
2013-06-27  0:23                         ` Barrett Lewis
2013-06-27 17:13                           ` Nicolas Jungers
2013-07-02  0:17                             ` Barrett Lewis
2013-07-02  1:57                               ` Stan Hoeppner
2013-07-02 15:48                                 ` Barrett Lewis
2013-07-02 19:44                                   ` Stan Hoeppner
2013-07-02 19:54                                     ` Stan Hoeppner
2013-07-02 20:07                                     ` Jon Nelson
2013-07-02 20:23                                       ` Stan Hoeppner
2013-07-02 20:58                                     ` Barrett Lewis
2013-07-03  1:50                                       ` Stan Hoeppner
2013-07-03  5:26                                         ` Barrett Lewis
2013-07-03 14:03                                           ` Jon Nelson
2013-07-03 14:36                                             ` Phil Turmel
2013-07-03 17:32                                             ` Stan Hoeppner
2013-07-03 19:47                                               ` Barrett Lewis
2013-07-03 20:38                                                 ` Jon Nelson
2013-07-04  2:21                                                 ` Stan Hoeppner
2013-07-03 17:05                                           ` Stan Hoeppner
2013-07-02 21:49                               ` Phil Turmel
2013-06-14 21:24               ` Phil Turmel
2013-07-29 22:25           ` Roy Sigurd Karlsbakk
