* Mdadm server eating drives @ 2013-06-12 13:47 Barrett Lewis 2013-06-12 13:57 ` David Brown ` (2 more replies) 0 siblings, 3 replies; 34+ messages in thread From: Barrett Lewis @ 2013-06-12 13:47 UTC (permalink / raw) To: linux-raid
I started about 1 year ago with a 5x2TB RAID 5. At the beginning of February, I came home from work and my drives were all making these crazy beeping noises. At that point I was on kernel version .34.
I shut down and rebooted the server and the RAID array didn't come back online. I noticed one drive was going up and down, and determined that the drive had actual physical damage to the power connector and was losing and regaining power through vibration. No problem. I bought another hard drive and mdadm started recovering to the new drive. Got it back to a RAID 5, backed up my data, then started growing to a RAID 6, and my computer hung hard where even REISUB was ignored. I restarted and resumed the grow. Then I started getting errors like these; they repeat for a minute or two and then the device gets failed out of the array:
[  193.801507] ata4.00: exception Emask 0x0 SAct 0x40000063 SErr 0x0 action 0x0
[  193.801554] ata4.00: irq_stat 0x40000008
[  193.801581] ata4.00: failed command: READ FPDMA QUEUED
[  193.801616] ata4.00: cmd 60/08:f0:98:c8:2b/00:00:10:00:00/40 tag 30 ncq 4096 in
[  193.801618]          res 51/40:08:98:c8:2b/00:00:10:00:00/40 Emask 0x409 (media error) <F>
[  193.801703] ata4.00: status: { DRDY ERR }
[  193.801728] ata4.00: error: { UNC }
[  193.804479] ata4.00: configured for UDMA/133
[  193.804499] ata4: EH complete
First on one drive, then on another, then on another; as the slow grow to RAID 6 was happening, these messages kept coming up and taking drives down. Eventually (over the course of the week-long grow) the failures were happening faster than I could recover from them, and I had to resort to ddrescue-ing RAID components to keep the array from going under the minimum number of components.
I ended up having to ddrescue 3 failed drives and force the array assembly to get back to 5 drives, and by that time the array's ext4 file system could no longer mount (it said something about group descriptors being corrupted). By this time, every one of the original drives had been replaced, and this had been ongoing for 5 months. I didn't even want to do an fsck to *attempt* to fix the file system until I had a solid RAID 6. I upgraded my kernel to .40, bought another hard drive, put it in there, and started the grow. Within an hour the system froze. I rebooted and restarted the array (and the grow); 2 hours later the system froze again; I rebooted and restarted the array (and the grow) again, and got those same errors again, this time on a drive that I had bought last month. Frustrated (feeling like this will never end), I let it keep going, hoping to at least get back to RAID 5. A few hours later I got these errors AGAIN on ANOTHER drive I got last month (of a different brand and model). So now I'm back with a non-functional array and a pile of 6 dead drives (not counting the ones still in the computer, components of a now-incomplete array). What is going on here? If brand-new drives from a month ago from two different manufacturers are failing, something else is going on. Is it my motherboard? I've run memtest for 15 hours so far with no errors, and I'll let it go for 48 before I stop it; let's assume it's not the RAM for now. Not included in this history are SEVERAL times the machine locked up harder than a REISUB, almost always during the heavy IO of component recovery. It seems to stay up for weeks when the array is inactive (and I'm too busy with other things to deal with it), and then as soon as I put a new drive in and the recovery starts, it hangs within an hour, and does so every few hours, and eventually I get the "failed command: READ FPDMA QUEUED status: { DRDY ERR } error: { UNC }" errors and another drive falls off the array.
I don't mind buying a new motherboard if that's what it is (I've already spent almost a grand on hard drives); I just want to get this fixed/stable and the nightmare behind me. Here is the dmesg output from my last boot, where two drives failed at 193 and 12196: http://paste.ubuntu.com/5753575/ Thanks for any thoughts on the matter ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-06-12 13:47 Mdadm server eating drives Barrett Lewis @ 2013-06-12 13:57 ` David Brown 2013-06-12 14:44 ` Phil Turmel 2013-06-12 15:41 ` Adam Goryachev 2 siblings, 0 replies; 34+ messages in thread From: David Brown @ 2013-06-12 13:57 UTC (permalink / raw) To: Barrett Lewis; +Cc: linux-raid Hi, Since you mentioned problems with power, are you sure your power supply is enough for all these drives? mvh., David On 12/06/13 15:47, Barrett Lewis wrote: > I started about 1 year ago with a 5x2tb raid 5. [trim /] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-06-12 13:47 Mdadm server eating drives Barrett Lewis 2013-06-12 13:57 ` David Brown @ 2013-06-12 14:44 ` Phil Turmel 2013-06-12 15:41 ` Adam Goryachev 2 siblings, 0 replies; 34+ messages in thread From: Phil Turmel @ 2013-06-12 14:44 UTC (permalink / raw) To: Barrett Lewis; +Cc: linux-raid On 06/12/2013 09:47 AM, Barrett Lewis wrote: > I started about 1 year ago with a 5x2tb raid 5. [trim /] What you are experiencing is typical of a hobby-level user who bought non-raid-rated drives and is now hitting timeout-mismatch array failures due to a lack of error recovery control. I suggest you search the archives for various combinations of "scterc", "URE", "timeout", and "error recovery". In the end, you almost certainly will need to either use "smartctl -l scterc,70,70" to turn on ERC in your drives, or use "echo 180 >/sys/block/sdX/device/timeout" to lengthen linux's standard driver command timeout. Anyway, when you check in again, please report the output of the following:
1) "mdadm -E /dev/sdX" for each member device or partition
2) "mdadm -D /dev/mdX" for your array
3) "smartctl -x /dev/sdX" for each member device
4) "cat /proc/mdstat"
5) "for x in /sys/block/sd*/device/timeout ; do echo $x $(< $x) ; done"
6) "dmesg" (trimmed to relevant md and sd* messages)
7) "cat /etc/mdadm.conf"
Phil ^ permalink raw reply [flat|nested] 34+ messages in thread
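For reference, the seven items Phil asks for can be collected in one pass. This is only a sketch: the device names (sd[a-f], md0) are the ones used in this thread, and the report filename is made up; adjust both for the machine in question.

```shell
#!/bin/sh
# Sketch: gather the seven diagnostics Phil asks for into one file.
# Device names (sd[a-f], md0) come from this thread; the output
# filename is an arbitrary choice. Errors are captured too, so the
# report shows exactly which commands failed.
{
  for d in /dev/sd[a-f]; do mdadm -E "$d"; done      # 1) member metadata
  mdadm -D /dev/md0                                  # 2) array detail
  for d in /dev/sd[a-f]; do smartctl -x "$d"; done   # 3) SMART data, incl. ERC
  cat /proc/mdstat                                   # 4) kernel RAID state
  for x in /sys/block/sd*/device/timeout; do         # 5) driver timeouts
    [ -e "$x" ] && echo "$x $(cat "$x")"
  done
  dmesg | grep -e sd -e md                           # 6) relevant kernel log
  cat /etc/mdadm.conf                                # 7) array config
} > raid-report.txt 2>&1
echo "wrote raid-report.txt"
```

Running it before and after a reboot makes it easy to spot timeout settings that did not persist.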
* Re: Mdadm server eating drives 2013-06-12 13:47 Mdadm server eating drives Barrett Lewis 2013-06-12 13:57 ` David Brown 2013-06-12 14:44 ` Phil Turmel @ 2013-06-12 15:41 ` Adam Goryachev [not found] ` <CAPSPcXihHrAi2TB9Fuxb1qOGMc_WzwGoXAA7nHdwe2knkO0LkQ@mail.gmail.com> 2 siblings, 1 reply; 34+ messages in thread From: Adam Goryachev @ 2013-06-12 15:41 UTC (permalink / raw) To: Barrett Lewis; +Cc: linux-raid On 12/06/13 23:47, Barrett Lewis wrote: > I started about 1 year ago with a 5x2tb raid 5. [trim /] Apart from the previous thought regarding lack of power for the number of drives, have you considered getting a SATA controller card? This would totally rule out the motherboard as being an issue without forcing you to replace the motherboard. I'd probably check out the power supply issue first (quick, cheap, easy) and then follow up with using a well supported SATA controller card.... (ie, not a cheap crappy sata card with poor drivers/etc).
Hope this helps Regards, Adam -- Adam Goryachev Website Managers Ph: +61 2 8304 0000 adam@websitemanagers.com.au Fax: +61 2 8304 0001 www.websitemanagers.com.au ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <CAPSPcXihHrAi2TB9Fuxb1qOGMc_WzwGoXAA7nHdwe2knkO0LkQ@mail.gmail.com>]
[parent not found: <CAPSPcXib4YZ9Ah-jLvL_kPwpKHLxaGT0rNaDL4XQcFm=RtjcAQ@mail.gmail.com>]
* Re: Mdadm server eating drives [not found] ` <CAPSPcXib4YZ9Ah-jLvL_kPwpKHLxaGT0rNaDL4XQcFm=RtjcAQ@mail.gmail.com> @ 2013-06-14 0:19 ` Barrett Lewis 2013-06-14 2:08 ` Phil Turmel 0 siblings, 1 reply; 34+ messages in thread From: Barrett Lewis @ 2013-06-14 0:19 UTC (permalink / raw) To: linux-raid Sorry for the delay, I wanted to let the memtest run for 48 hours. It's at 49 hours now with zero errors, so memory is pretty much ruled out. As far as power, I would *think* I have enough power. The power supply is a 500w Thermaltake TR2. It's powering an Asrock z77 mobo with an i5-3570k, and the only card on it is a dinky little 2 port sata card my OS drive is on (the RAID components are plugged into the mobo). Eight 7200 drives and an SSD. Tell me if this sounds insufficient. Phil, when you say "what you are experiencing", what do you mean specifically? The dmesg errors and drives falling off? Or did you mean the beeping noises (since that's the part you trimmed)? Here is the data you requested
1) mdadm -E /dev/sd[a-f] http://pastie.org/8040826
2) mdadm -D /dev/md0 http://pastie.org/8040828
3) smartctl -x /dev/sda http://pastie.org/8040847
smartctl -x /dev/sdb http://pastie.org/8040848
smartctl -x /dev/sdc http://pastie.org/8040850
smartctl -x /dev/sdd http://pastie.org/8040851
smartctl -x /dev/sde http://pastie.org/8040852
smartctl -x /dev/sdf http://pastie.org/8040853
4) cat /proc/mdstat http://pastie.org/8040859
5) for x in /sys/block/sd*/device/timeout ; do echo $x $(< $x) ; done http://pastie.org/8040870
6) dmesg | grep -e sd -e md http://pastie.org/8040871 (note that I have rebooted since the last dmesg link I posted (where two drives failed) because I was running memtest; if I should do dmesg differently, let me know)
7) cat /etc/mdadm.conf http://pastie.org/8040876
Adam, I wouldn't be opposed to spending the money on a good sata card, but I'd like to get opinions from a few people first. Any suggestions on a good one for mdadm specifically? Thanks all!
[trim: duplicate resends of the above message, quoting earlier messages in full /] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-06-14 0:19 ` Barrett Lewis @ 2013-06-14 2:08 ` Phil Turmel [not found] ` <CAPSPcXgMxOF-C2Szu_nf4ZLDC8p+yJFOtvLPu7xy1DTW9VAHjg@mail.gmail.com> 2013-07-29 22:25 ` Roy Sigurd Karlsbakk 0 siblings, 2 replies; 34+ messages in thread From: Phil Turmel @ 2013-06-14 2:08 UTC (permalink / raw) To: Barrett Lewis; +Cc: linux-raid Hi Barrett, Please interleave your replies, and trim unnecessary quotes. On 06/13/2013 08:19 PM, Barrett Lewis wrote: > Sorry for the delay, I wanted to let the memtest run for 48 hours. > It's at 49 hours now with zero errors, so memory is pretty much ruled > out. > > As far as power, I would *think* I have enough power. The power > supply is a 500w Thermaltake TR2. It's powering an Asrock z77 mobo > with an i5-3570k, and the only card on it is a dinky little 2 port > sata card my OS drive is on (the RAID components are plugged into the > mobo). Eight 7200 drives and an SSD. Tell me if this sounds > insufficient. > > Phil, when you say "what you are experiencing", what do you mean > specifically? The dmesg errors and drives falling off? Or did you > mean the beeping noises (since thats the part you trimmed)? Drives dropping out when they shouldn't, and smartctl says "PASSED". This is *unavoidable* when you have mismatched device and driver timeouts. > Here is the data you requested > > 1) mdadm -E /dev/sd[a-f] http://pastie.org/8040826 /dev/sdd and /dev/sde have old event counts ... > 2) mdadm -D /dev/md0 http://pastie.org/8040828 ... matching the array report ... > 3) > smartctl -x /dev/sda http://pastie.org/8040847 Ok, but no error recovery support (typical of green drives). > smartctl -x /dev/sdb http://pastie.org/8040848 Ok, green again. No ERC. > smartctl -x /dev/sdc http://pastie.org/8040850 Ok, with ERC support, but disabled. Not a green drive. > smartctl -x /dev/sdd http://pastie.org/8040851 Not Ok. A few relocations, a couple pending errors. ERC support present but disabled. 
> smartctl -x /dev/sde http://pastie.org/8040852 Not Ok. No relocations, but several pending errors. No ERC. > smartctl -x /dev/sdf http://pastie.org/8040853 Ok, but no ERC. > 4) cat /proc/mdstat http://pastie.org/8040859 > > 5) for x in /sys/block/sd*/device/timeout ; do echo $x $(< $x) ; done > http://pastie.org/8040870 All timeouts are still the default 30 seconds. Without ERC support enabled, these values must be two to three minutes. I recommend 180 seconds. Your array *will not* complete a rebuild without dealing with this problem. > 6) dmesg | grep -e sd -e md http://pastie.org/8040871 > (note that I have rebooted since the last dmesg link I posted (where > two drives failed) because I was running memtest, if I should do dmesg > differently, let me know) > > 7) cat /etc/mdadm.conf http://pastie.org/8040876 I generally simplify the ARRAY line to just the device and the UUID, but it is ok as is. > Adam, I wouldn't be opposed to spending the money on a good sata card, > but I'd like to get opinions from a few people first. Any suggestions > on a good one for mdadm specifically? No need. Just fix your timeouts. For the two devices that support ERC, you need to turn it on: > smartctl -l scterc,70,70 /dev/sdc > smartctl -l scterc,70,70 /dev/sdd For the others, you need long timeouts in the linux driver: > for x in /sys/block/sd[abef]/device/timeout ; do echo 180 >$x ; done This must be done now, and at every power cycle or reboot. rc.local or similar distro config is the appropriate place. (Enterprise drives power up with ERC enabled. As do raid-rated consumer drives like WD Red.) Then stop and re-assemble your array. Use --force to reintegrate your problem drives. Fortunately, this is a raid6--with compatible timeouts, your rebuild will succeed. A URE on /dev/sdd would have to fall in the same place as a URE on /dev/sde to kill it. Upon completion, the UREs will either be fixed or relocated. If any drive's relocations reach double digits, I'd replace it.
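Phil's fix reduces to one rule per drive. The sketch below only *prints* the command each drive would need rather than running anything; the helper name and the yes/no flag are illustrative, and whether a drive supports ERC comes from reading its smartctl -x output, as Phil did above.

```shell
#!/bin/sh
# Per-drive rule from Phil's advice: if the drive supports SCT Error
# Recovery Control, cap its internal recovery at 7.0 s; otherwise
# stretch the kernel driver's command timeout to 180 s so the driver
# outlasts the drive's own (possibly minutes-long) retries.
# This helper (name and flag format are illustrative) prints the
# command instead of executing it, so the rule can be inspected first.
cmd_for_drive() {
  dev=$1     # e.g. sdc
  has_erc=$2 # "yes" if smartctl -x shows SCT ERC support
  if [ "$has_erc" = "yes" ]; then
    echo "smartctl -l scterc,70,70 /dev/$dev"
  else
    echo "echo 180 > /sys/block/$dev/device/timeout"
  fi
}

cmd_for_drive sdc yes   # -> smartctl -l scterc,70,70 /dev/sdc
cmd_for_drive sda no    # -> echo 180 > /sys/block/sda/device/timeout
```

As Phil says, the real commands must be re-applied at every power cycle or reboot, e.g. from rc.local.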
Finally, after your array is recovered, set up a cron job that'll trigger a "check" scrub of your array on a regular basis. I use a weekly scrub. The scrub keeps UREs that develop on idle parts of your array from accumulating. Note, the scrub itself will crash your array if your timeouts are mismatched and any UREs are lurking. I'll let you browse the archives for a more detailed explanation of *why* this happens. Phil ^ permalink raw reply [flat|nested] 34+ messages in thread
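The regular scrub Phil describes is driven through md's sync_action interface. A sketch of a cron entry follows; the array name md0 is from this thread, while the schedule and file path are arbitrary choices.

```shell
# /etc/cron.d/md-scrub  (sketch; schedule is an arbitrary choice)
# Kick off a read-only "check" scrub of md0 every Sunday at 03:00:
#
#   0 3 * * 0  root  echo check > /sys/block/md0/md/sync_action
#
# Progress shows up in /proc/mdstat; a running scrub can be aborted with:
#
#   echo idle > /sys/block/md0/md/sync_action
#
# Mismatch counts found by the scrub appear in:
#
#   /sys/block/md0/md/mismatch_cnt
```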
[parent not found: <CAPSPcXgMxOF-C2Szu_nf4ZLDC8p+yJFOtvLPu7xy1DTW9VAHjg@mail.gmail.com>]
* Re: Mdadm server eating drives [not found] ` <CAPSPcXgMxOF-C2Szu_nf4ZLDC8p+yJFOtvLPu7xy1DTW9VAHjg@mail.gmail.com> @ 2013-06-14 21:18 ` Barrett Lewis 2013-06-14 21:20 ` Barrett Lewis 2013-06-14 21:24 ` Phil Turmel 0 siblings, 2 replies; 34+ messages in thread From: Barrett Lewis @ 2013-06-14 21:18 UTC (permalink / raw) To: linux-raid On Thu, Jun 13, 2013 at 9:08 PM, Phil Turmel <philip@turmel.org> wrote: > Please interleave your replies, and trim unnecessary quotes. No problem. >> smartctl -l scterc,70,70 /dev/sdc >> smartctl -l scterc,70,70 /dev/sdd >> for x in /sys/block/sd[abef]/device/timeout ; do echo 180 >$x ; done > > This must be done now, and at every power cycle or reboot. rc.local or > similar distro config is the appropriate place. (Enterprise drives > power up with ERC enabled. As do raid-rated consumer drives like WD Red.) Seems that the drives themselves retained the ERC settings after a reboot. But I went ahead and put scterc and the timeouts in rc.local. > > Then stop and re-assemble your array. Use --force to reintegrate your > problem drives. Fortunately, this is a raid6--with compatible timeouts, > your rebuild will succeed. A URE on /dev/sdd would have to fall in the > same place as a URE on /dev/sde to kill it. It worked. Yer a wizard! Thank you! > Finally, after your array is recovered, set up a cron job that'll > trigger a "check" scrub of your array on a regular basis. I use a > weekly scrub. The scrub keeps UREs that develop on idle parts of your > array from accumulating. Note, the scrub itself will crash your array > if your timeouts are mismatched and any UREs are lurking. I'll definitely do this. When you talk about mismatched timeouts, do you mean matched between each of the components (as in /sys/block/sdX/device/timeout) or between that driver timeout and some device timeout per component? If you mean between components, are my timeouts matched now, even though I did not raise the 30 seconds on the two drives with ERC?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives
From: Barrett Lewis @ 2013-06-14 21:20 UTC (permalink / raw)
To: linux-raid

Oops, again, sorry for the email issues; I'm having trouble getting gmail to play right.

So now that I have a synced raid6, I'm looking at this problem of the filesystem having been partially or fully corrupted, which happened after a few components were ddrescued onto other components and force assembled. Is this something you might expect to happen in that scenario?

mount /dev/md0 /media/vault
http://pastie.org/8042532

I do have a backup, so if it comes down to it, I can just make a new filesystem and restore that way. However, the backup is not 100% complete, so if possible I'd like to get the filesystem back, even with errors, to supplement the missing part of the backup.

Is there any reason not to run "e2fsck -y /dev/md0"?

Thanks
* Re: Mdadm server eating drives
From: Phil Turmel @ 2013-06-14 21:25 UTC (permalink / raw)
To: Barrett Lewis; +Cc: linux-raid

On 06/14/2013 05:20 PM, Barrett Lewis wrote:
> Is there any reason not to run "e2fsck -y /dev/md0"?

An fsck is often needed after one of these crises. So, yes.

Phil
* Re: Mdadm server eating drives
From: Phil Turmel @ 2013-06-14 21:30 UTC (permalink / raw)
To: Barrett Lewis; +Cc: linux-raid

On 06/14/2013 05:25 PM, Phil Turmel wrote:
> On 06/14/2013 05:20 PM, Barrett Lewis wrote:
>> Is there any reason not to run "e2fsck -y /dev/md0"?
>
> An fsck is often needed after one of these crises. So, yes.

After wrapping my head around the grammar... *No*, no reason to not run fsck. :-)

Phil
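For anyone wary of pointing e2fsck at a just-recovered array first: the "-y" (answer yes to every fix) flow discussed above can be rehearsed safely on a throwaway loopback image. A sketch, assuming e2fsprogs is installed; the image here is freshly created, so the check comes back clean:

```shell
# Build a tiny ext4 filesystem in a temp file, then fsck it non-interactively.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=8 2>/dev/null
mkfs.ext4 -F -q "$img"              # -F: target is a regular file, not a block device
e2fsck -f -y "$img" >/dev/null 2>&1 # -f: force a full check; -y: fix everything
status=$?
echo "e2fsck exit status: $status"  # 0 = clean, 1 = errors were corrected
rm -f "$img"
```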
* Re: Mdadm server eating drives
From: Barrett Lewis @ 2013-06-17 21:37 UTC (permalink / raw)
To: linux-raid

This is terrific. fsck found tons and tons of errors and fixed them all. Then I ran rsync -avHcn [array] [backup] and found 5 or so files out of 8tb which had some slight corruption, which can easily be restored from the backup. But I was curious to do a dry run first to use vbindiff and see what the corruption looked like at the byte level. Interesting!

I did notice that before rsync found one of the differences (corrupt files), it started spitting out those same "failed command: READ FPDMA QUEUED status: { DRDY ERR } error: { UNC }" errors as before, but this time it did not fail the drive. I take this to mean there are still some physical problems with the drive, but with the new timeout settings it is not unnecessarily failing the drive out of the array. So if I overwrite the corrupted files with the backups (or write any new data to the array, really), will it avoid those problem areas on the platter?

I just want to say a big thanks, as this has been causing me indescribable stress and monetary cost since the beginning of February, and it looks like I am back in business. I think I will write some perl scripts to help monitor some of these things.
* Re: Mdadm server eating drives
From: Mikael Abrahamsson @ 2013-06-18 4:13 UTC (permalink / raw)
To: Barrett Lewis; +Cc: linux-raid

On Mon, 17 Jun 2013, Barrett Lewis wrote:
> I did notice that before rsync found one of the differences (corrupt
> files) it started spitting out those same "failed command: READ FPDMA
> QUEUED status: { DRDY ERR } error: { UNC }" errors as before but this
> time it did not fail the drive. I take this to mean there is still some
> physical problems with the drive, but with the new timeout settings it
> is not unnecessarily failing the drive out of the array. So if I
> overwrite the corrupted files with the backups, (or write any new data
> to the array really), will it avoid those problem areas on the platter?

What should have happened here is that when md received the read error, it should have read parity, recalculated what should have been on those read-error sectors, and written it back; the drive should then have either succeeded in writing the new information, or written it to another place (reallocation).

If your system is now working well, it might make sense to issue a "repair" to the array and let it run through completely:

echo repair > /sys/block/md0/md/sync_action

--
Mikael Abrahamsson    email: swmike@swm.pp.se
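The repair pass Mikael describes, plus waiting for it to finish, might look roughly like this on a live system (an untested ops sketch, not from the thread; needs root, and md0 is the array name used throughout this thread):

```shell
# Ask md to rewrite anything it cannot read cleanly, reconstructing the
# data from parity -- the "repair" sync_action described above.
echo repair > /sys/block/md0/md/sync_action

# Wait for the pass to complete, showing progress along the way.
while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do
    grep -A 2 '^md0' /proc/mdstat   # shows the resync progress bar
    sleep 60
done

# The mismatch count left by the pass (the counter that read 5477 earlier
# in this thread).
cat /sys/block/md0/md/mismatch_cnt
```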
* Re: Mdadm server eating drives
From: Barrett Lewis @ 2013-06-27 0:23 UTC (permalink / raw)
To: linux-raid

Everything is going well, I am just trying to replace the parts that are on the way out. I ran a 'repair' and it came out with 5477 under /sys/block/md0/md/mismatch_cnt. Then a 'check' came out with 0.

Then I went out and bought a couple WD Reds (I'm done with Greens now that I know they lack ERC). I replaced one of the two drives Phil said was not ok, which had many reallocations (I can personally see those) in the smart status. I then ran another repair to be safe. It came up with 0 mismatches, but in the process /dev/sda started giving me tons (and tons and tons, rolled over dmesg) of these "failed command: READ FPDMA QUEUED status: { DRDY ERR } error: { UNC }" errors. sda hadn't been giving me problems before, but I'll come back to it.

The second disk Phil said was "not ok" was this one, which showed "several pending errors":
(original smart status) http://pastie.org/8040852
I was going to replace it with my second spare Red, but the errors seem to have gone away:
(current smart status) http://pastie.org/8084278
Or maybe I am looking in the wrong place to find the pending errors (looking at "197 Current_Pending_Sector"). Is the drive currently in need of replacement? I'm not sure what I'm looking for.

What about this one (sda), after it gave all of those errors during a repair? http://pastie.org/8084292
I get the "5 Reallocated_Sector_Ct", but where do you find pending errors?

What does it mean to get all these "failed command: READ FPDMA QUEUED status: { DRDY ERR } error: { UNC }" errors when the smart status seems to be fine even after a repair?

Thanks everyone, I'm learning a lot.
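To the "where do you find pending errors" question: the rows of `smartctl -A` output worth watching are attributes 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector), and 198 (Offline_Uncorrectable), and the number that matters is the RAW_VALUE in the last column. A sketch of pulling those out, run here against a made-up excerpt rather than a real drive (the values are invented for illustration):

```shell
# Sample lines in `smartctl -A` format (values invented for illustration).
sample='  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       3
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       12
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0'

# Keep attributes 5/197/198; field 2 is the attribute name, the last
# field is the raw count.
out=$(echo "$sample" | awk '$1==5 || $1==197 || $1==198 {print $2"="$NF}')
echo "$out"
```

Against a live drive this becomes `smartctl -A /dev/sda | awk ...`; a nonzero Current_Pending_Sector raw value is most likely what Phil meant by "pending errors" (sectors the drive could not read and has not yet reallocated).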
* Re: Mdadm server eating drives
From: Nicolas Jungers @ 2013-06-27 17:13 UTC (permalink / raw)
To: Barrett Lewis; +Cc: linux-raid

On 06/27/2013 02:23 AM, Barrett Lewis wrote:
> What does it mean to get all these "failed command: READ FPDMA QUEUED
> status: { DRDY ERR } error: { UNC }" errors and the smart status seems
> to be fine even after a repair?

Have you considered that your SATA cables may be faulty? I had consistently bad experiences with "cheap" SATA cables, and I now use exclusively cables with latches. I say "cheap" because the price is not an absolute criterion; quality of sourcing is more important in my experience.

Regards,
N.
* Re: Mdadm server eating drives
From: Barrett Lewis @ 2013-07-02 0:17 UTC (permalink / raw)
To: linux-raid

I am very sorry to keep bugging this list, but I am really lost.

After learning about ERC and timeouts, the severity of the problem was reduced to the point that I could at least get my system back to a raid6. I ran a repair and fixed 5477 mismatches, and then a check showed it clean. Yet drives continue to give me DRDY statuses. I replaced the two that were doing it with WD Reds (which my intent is to only buy from now on). Then I tried to run a repair again, and my system crashed, as if the timeouts were mismatched, but I had set the driver timeouts on all drives to 180, even the ones with ERC, to be safe. This repair crashed several (3-4) times under these conditions (usually within a few minutes of starting). Finally, instead of a repair I ran a check, which somehow completed fine and showed zero mismatches.

I started rsync to verify my data against a backup. And now 3 drives are giving me DRDY statuses. Two of them have REALLY failed out of the array, giving DRDY DF ERR messages, and don't even show a superblock present from mdadm --examine, so now I'm back to the bare minimum of my raid6. One of the two drives that is so bad it lost its superblock is one of the WD Reds I just bought and installed 5 days ago.

Any thoughts on what is going on? I have to ask again if it's possible that my motherboard is frying the hardware in these drives?

cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid6 sdd[6](F) sdc[7] sda[9] sdf[8](F) sdb[0] sde[4]
      7813531648 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/4] [U__UUU]

unused devices: <none>

sudo mdadm -D /dev/md0 | nopaste
http://pastie.org/8101687
sudo mdadm --examine /dev/sd[a-f] 2>&1 | nopaste
http://pastie.org/8101681
sudo smartctl -x /dev/sda | nopaste
http://pastie.org/8101691
sudo smartctl -x /dev/sdb | nopaste
http://pastie.org/8101693
sudo smartctl -x /dev/sdc | nopaste
http://pastie.org/8101694
sudo smartctl -x /dev/sdd | nopaste
http://pastie.org/8101695
sudo smartctl -x /dev/sde | nopaste
http://pastie.org/8101696
sudo smartctl -x /dev/sdf | nopaste
http://pastie.org/8101697

for x in /sys/block/sd[a-f]/device/timeout ; do echo $x $(< $x); done
/sys/block/sda/device/timeout 180
/sys/block/sdb/device/timeout 180
/sys/block/sdc/device/timeout 180
/sys/block/sdd/device/timeout 180
/sys/block/sde/device/timeout 180
/sys/block/sdf/device/timeout 180

On Thu, Jun 27, 2013 at 12:13 PM, Nicolas Jungers <nicolas@jungers.net> wrote:
> Have you considered that your SATA may be faulty? I had consistent bad
> experiences with "cheap" SATA cables. I also use exclusively now cables with
> latches.
* Re: Mdadm server eating drives
From: Stan Hoeppner @ 2013-07-02 1:57 UTC (permalink / raw)
To: Barrett Lewis; +Cc: linux-raid

On 7/1/2013 7:17 PM, Barrett Lewis wrote:
> I am very sorry to keep bugging this list, but I am really lost.

I apologize, as I just noticed this thread. If I'd jumped in sooner you might already have it fixed. I pulled your previous posts from my archive folder and read with interest.

> I noticed one drive was going up and down and determined that
> the drive had actual physical damage to the power connecter and
> was losing and regaining power through vibration.

This intermittent contact could have damaged the PSU. You've continued to have drive and lockup problems since replacing this drive with the bad connector.

The pink elephant in the room is thermal failure due to insufficient airflow. The symptoms you describe sound like drives overheating. What chassis is this? Make/model please. If you've installed individual drive hot swap cages, etc., it would be helpful if you snapped a photo or two and made those available.

I've seen many instances of this type of failure over the years and, in order of prevalence, they are:

1. Failed cheap backplane
2. Insufficient airflow
3. Failed or cheap PSU
4. Failed HBA (or Southbridge)

--
Stan
* Re: Mdadm server eating drives
From: Barrett Lewis @ 2013-07-02 15:48 UTC (permalink / raw)
To: stan; +Cc: linux-raid

After sending the last email I went out and bought 2 new WD Reds, and a new motherboard. I came back, and in those 2 hours all but 1 of my drives had failed to the point of being unable to read the superblock, so it really seems like my array is ended.

On Mon, Jul 1, 2013 at 8:57 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> I noticed one drive was going up and down and determined that
>> the drive had actual physical damage to the power connecter and
>> was losing and regaining power through vibration.
>
> This intermittent contact could have damaged the PSU. You've continued
> to have drive and lockup problems since replacing this drive with bad
> connector.

I hadn't thought of it until you said so, but I bet you are right about the iffy connector. It certainly seemed as if I never had an issue with the array for 8 months, and then suddenly everything got unstable at once, and since then I've lost at least 6 hard drives.

> The pink elephant in the room is thermal failure due to insufficient
> airflow. The symptoms you describe sound like drives overheating. What
> chassis is this? Make/model please. If you've installed individual
> drive hot swap cages, etc, it would be helpful if you snapped a photo or
> two and made those available.

It is also possible that there were cooling issues. The case is an NZXT H2. It has some fans blowing directly on all the hard drives, but there were a few times, I have to admit, I took the fans off to work on things and forgot to put them back on for a few days, coming back to find the drives very hot to the touch. I would have mentioned that earlier, but a data recovery place told me it was unlikely that would be the culprit (after they had my money).

I don't have any drives in special cages, but here's a pic anyway. The two fan boxes that sit in front of them are taken off.
https://docs.google.com/file/d/0B1w3WvCHlYUWRVhWOVd0Qmt1TUk/edit?usp=sharing

Maybe that's all academic at this point. I guess I'll have to rebuild my server from scratch, since all my disks seem destroyed and I can't trust the mobo, cpu, or psu. At least I can memtest the RAM. The PSU wasn't dirt cheap: a Thermaltake TR2 500W @ $58. Should I buy all new everything? If so, while I'm at it, can you suggest a set of consumer-level hardware ideal for running a personal mdadm server? Powered but not overpowered, reliable not bleeding edge. If I need 6-8 SATA ports, should I do onboard or get a controller?

I still have one backup, although I'm very nervous now since it's on a 3-disk RAID0, just asking to implode (created in an emergency).
* Re: Mdadm server eating drives
From: Stan Hoeppner @ 2013-07-02 19:44 UTC (permalink / raw)
To: Barrett Lewis; +Cc: linux-raid

On 7/2/2013 10:48 AM, Barrett Lewis wrote:
> After sending the last email I went out and bought 2 new WD reds, and
> a new motherboard. I came back and in those 2 hours all but 1 of my
> drives failed to the point of being unable to read the superblock so
> it really seems like my array is ended

The drives may be ok. They all may be.

> I hadn't thought of it until you said so but I bet you are right about
> the iffy connector. It certainly seemed as if I never had an issue
> with the array for 8 months, and then suddenly everything got unstable
> at once, and since then I've lost atleast 6 hard drives.

Your drives may not be toast. Don't toss them out, and don't throw up your hands yet.

> It is also possible that there were cooling issues. The case is an
> NZXT H2. It has some fans blowing directly on all the hard drives,
> but there were a few times I have to admit I took the fans off to work
> on things and forgot to put them back on for a few days, coming back
> to find them very hot to the touch.

I checked out the chassis on the NZXT site. With the front fans removed, you have only 2x120mm low-rpm, low static pressure, low-CFM exhaust fans: one in the PSU, one top rear. With 8 drives packed in such close proximity, and with other lower-resistance intake paths (the perforated chassis bottom), you won't get enough air through the front drive cage to cool those drives properly over a long period.

However, running with the two front fans removed for a couple of days on an occasion or two shouldn't have overheated the drives to the point of permanent damage, assuming ambient air temp was ~75F or lower, and assuming you were not performing long array operations such as rebuilds or reshapes--if you did so, the drives could get hot enough, long enough, to be permanently damaged.

> Maybe thats all academic at this point. I guess i'll have to rebuild
> my server from scratch since all my disks seem destroyed and I can't
> trust the mobo, cpu, or psu.

Don't start over. Not just yet. Leave everything as is for now. Simply replace the PSU. Fire it up and see what you can recover.

> The psu wasn't dirt cheap, Thermaltake TR2 500w @ $58.

The price isn't relevant. The quality and rail configuration is, and whether it's been damaged. I checked the spec on your TR2-500 yesterday. It has dual +12V rails, one rated at 18A and one at 17A. I was unable to locate a wiring diagram for it. On paper it should have plenty of juice for your gear when in working order. My assumption here is that something internal to it may have failed.

> Should I buy all new everything?

I wouldn't. Most of your gear is probably fine. Get the PSU swapped out and see if that fixes it. You may still have to wipe the drives and build a new array. You should know pretty quickly whether the PSU swap fixed the problem: either drives will continue to drop, or they won't. You already have a new mobo in hand, so if the PSU isn't the problem, swap the mobo. That's a good chassis design with good airflow, assuming you keep the front fans in it. Why you'd leave them removed is beyond me.

> If so, while I'm at can you suggest a set of consumer
> level hardware ideal running a personal mdadm server. Powered but not
> overpowered, reliable not bleeding edge. If I need 6-8 sata ports,
> should I do onboard or get a controller?

A new HBA shouldn't be necessary. But if you choose to go that route further down the road, I'd recommend an LSI 9211-8i.

> I still have one backup allthough I'm very nervous now since it's on a
> 3 disk RAID0, just asking to implode (created in an emergency).

I assume this resides on a different machine.

Swap the PSU. Recover the array if possible. If not, blow it away and create new. If no drives drop out, you're probably golden and the PSU fixed the problem. If they drop, swap in the new mobo. At that point you'll have replaced everything that could be the source of the problem but for the remaining original drives. They can't all be bad, if any are. Always run with those front fans installed.

--
Stan
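The "recover the array if possible" step would, following the earlier --force advice in this thread, look roughly like this (untested sketch; device names follow the poster's /dev/sd[a-f] layout, and forcing assembly only makes sense once the PSU/cabling suspects are dealt with):

```shell
# Stop any half-assembled remnant of the array, then force assembly from
# the surviving members; mdadm uses the freshest superblocks it can find.
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[a-f]

# Sanity-check event counts and device states before trusting the result.
mdadm --detail /dev/md0
```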
* Re: Mdadm server eating drives
From: Stan Hoeppner @ 2013-07-02 19:54 UTC (permalink / raw)
To: stan; +Cc: Barrett Lewis, linux-raid

Forgot to ask previously: this system is attached to a UPS, isn't it?

--
Stan
* Re: Mdadm server eating drives 2013-07-02 19:44 ` Stan Hoeppner 2013-07-02 19:54 ` Stan Hoeppner @ 2013-07-02 20:07 ` Jon Nelson 2013-07-02 20:23 ` Stan Hoeppner 2013-07-02 20:58 ` Barrett Lewis 2 siblings, 1 reply; 34+ messages in thread From: Jon Nelson @ 2013-07-02 20:07 UTC (permalink / raw) To: Stan Hoeppner; +Cc: Barrett Lewis, linux-raid On Tue, Jul 2, 2013 at 2:44 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 7/2/2013 10:48 AM, Barrett Lewis wrote: >> After sending the last email I went out and bought 2 new WD reds, and >> a new motherboard. I came back and in those 2 hours all but 1 of my >> drives failed to the point of being unable to read the superblock so >> it really seems like my array is ended > > The drive may be ok. They all may be. Indeed. A number of years back, I had an MD RAID array that kept throwing drives, one after the other, after years of rock-solid stability. Nothing had changed, the machine hadn't been touched (or even rebooted!) in months, etc... It turns out that the motherboard had gone. It "worked" perfectly, except under any drive load at all it would start throwing I/O errors. I replaced only the motherboard (same PSU, memory, CPU, etc....) and that machine - built at least 4 years ago - is still humming along quite nicely. -- Jon ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-07-02 20:07 ` Jon Nelson @ 2013-07-02 20:23 ` Stan Hoeppner 0 siblings, 0 replies; 34+ messages in thread From: Stan Hoeppner @ 2013-07-02 20:23 UTC (permalink / raw) To: Jon Nelson; +Cc: Barrett Lewis, linux-raid

On 7/2/2013 3:07 PM, Jon Nelson wrote:
> On Tue, Jul 2, 2013 at 2:44 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 7/2/2013 10:48 AM, Barrett Lewis wrote:
>>> After sending the last email I went out and bought 2 new WD Reds, and
>>> a new motherboard. I came back and in those 2 hours all but 1 of my
>>> drives failed to the point of being unable to read the superblock, so
>>> it really seems like my array is ended
>>
>> The drive may be ok. They all may be.
>
> Indeed. A number of years back, I had an MD RAID array that kept
> throwing drives, one after the other, after years of rock-solid
> stability. Nothing had changed, the machine hadn't been touched (or
> even rebooted!) in months, etc... It turns out that the motherboard
> had gone. It "worked" perfectly, except under any drive load at all it
> would start throwing I/O errors. I replaced only the motherboard (same
> PSU, memory, CPU, etc....) and that machine - built at least 4 years
> ago - is still humming along quite nicely.

Were the drives attached to the onboard SATA controller or an HBA?

-- Stan

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-07-02 19:44 ` Stan Hoeppner 2013-07-02 19:54 ` Stan Hoeppner 2013-07-02 20:07 ` Jon Nelson @ 2013-07-02 20:58 ` Barrett Lewis 2013-07-03 1:50 ` Stan Hoeppner 2 siblings, 1 reply; 34+ messages in thread From: Barrett Lewis @ 2013-07-02 20:58 UTC (permalink / raw) To: stan; +Cc: linux-raid

On Tue, Jul 2, 2013 at 2:44 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> I checked out the chassis on the NZXT site. With the front fans
> removed, you have only 2x120mm low rpm, low static pressure, and low CFM
> exhaust fans, one in the PSU, one top rear. With 8 drives packed in
> such close proximity and with other lower resistance intake paths (the
> perforated chassis bottom), you won't get enough air through the front
> drive cage to cool those drives properly over a long period.
>
> However, running with the two front fans removed for a couple of days on
> an occasion or two shouldn't have overheated the drives to the point of
> permanent damage, assuming ambient air temp was ~75F or lower, and
> assuming you were not performing long array operations such as rebuilds
> or reshapes--if you did so the drives could get hot enough, long enough,
> to be permanently damaged.

Interesting. Just to be clear, I never intentionally ran it without the fans on. The picture I sent was from when I first assembled the server and hadn't yet put the fans in or plugged the machine in. Also, the few times the fans were left off were, as you said, "for a couple of days on an occasion or two"; other than that the fans have always been on. It is possible, though, that one of those events was during a resync, since the fans were off because I was swapping out a failed drive and forgot to put them back on. That was when I came home to hear this beeping noise coming from all my drives.

https://docs.google.com/file/d/0B1w3WvCHlYUWSGdBdjh3dWpuUnc/edit?usp=sharing

I don't know what that beeping is, but it is a later recording; the original event had many drives beeping at once (some with slightly lower/higher pitches). I thought it might have been an overheat alarm or something similar. Most of those original drives have been replaced at this point. But it was also at that time, when I first started pulling wires (looking for which drives were beeping), that I found the broken power connector. This was back when this all started in February.

> I wouldn't. Most of your gear is probably fine. Get the PSU swapped
> out and see if that fixes it. You may still have to wipe the drives and
> build a new array. You should know pretty quickly if the PSU swap fixed
> the problem, as drives will not continue to drop, or they will.

Good starting point, I'll do that tonight. Any particular trusty brands? Otherwise all I can really go off of is price (like before, I just tried to pay a little extra for "not the cheapest").

> Forgot to ask previously. This system is attached to a UPS, isn't it?

Yes, the server is plugged in through a dedicated UPS.

> I assume this resides on a different machine.

4 drives in an external USB enclosure. 3 are a RAID0.

> Were the drives attached to the onboard SATA controller or an HBA?

All 6 drives and my OS SSD are plugged into onboard SATA.

Thanks for your help!
Barrett

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-07-02 20:58 ` Barrett Lewis @ 2013-07-03 1:50 ` Stan Hoeppner 2013-07-03 5:26 ` Barrett Lewis 0 siblings, 1 reply; 34+ messages in thread From: Stan Hoeppner @ 2013-07-03 1:50 UTC (permalink / raw) To: Barrett Lewis; +Cc: linux-raid

On 7/2/2013 3:58 PM, Barrett Lewis wrote:
> https://docs.google.com/file/d/0B1w3WvCHlYUWSGdBdjh3dWpuUnc/edit?usp=sharing

Drives don't beep; they can't. They don't contain transducers, and never have. And you don't have a RAID card. So that beep must be from the motherboard-connected PC speaker, which means you have raidmon or another md monitoring daemon active. If this is the case it was simply giving an audible alert that a drive had been dropped.

> Good starting point, I'll do that tonight. Any particular trusty
> brands? Otherwise all I can really go off of is price (like before, I
> just tried to pay a little extra for "not the cheapest").

For troubleshooting purposes I'd think any recent 400+ watt ATX PSU you have lying around should work, assuming there's no high-wattage PCIe GPU card in the box sucking +12V power, and assuming you have all the necessary Y-cables and SATA power adapters, etc. Try a spare PSU if possible before plunking down cash on a possibly unneeded replacement.

For a permanent replacement, I'll tell ya, they're all of pretty much similar quality today, except for the fan, once you get off the very bottom of the barrel. Cheap units come with cheap sleeve-bearing fans that don't last. I buy near the bottom of the barrel and replace the fans on day one. I buy quality fans in bulk on closeout/overstock/etc. every few years specifically for this purpose. Most don't have standard 2-pin PC connectors, so I cut the one off the stock crap fan and solder it to the good one. Currently I'm draining a box of a dozen 80x25mm NMB -30 series Boxers for PSU duty, and a box of a dozen Nidec BetaV 92x25mm industrial fans for chassis duty. All double ball bearing, the highest quality you can get. Not the quietest, but they're high CFM and high static pressure. Others in this class are Sanyo Denki, Papst, Delta, Panaflo, etc. I won't use 120mm fans in PSUs or chassis, but that's a discussion for another day.

Either of these two should be ok. I'm not into the goofy lights and whatnot on the Apevia, or the triple-fan design (more to replace), but at least it has a fan speed controller. Both have great reviews, and plenty of +12V power. The one thing I -really- like about the Apevia is the single +12V rail rated at 35 amps. Single rail is always better, contrary to popular belief. Multiple +12V rail PSUs came into existence because they're cheaper to produce, not because they're any better. 2/3/4 small MOSFETs, one per rail, are cheaper than one big MOSFET. Take a look at any -real- server PSU design. They're all single +12V rail, some rated to 150 amps (1800 watts).

http://www.newegg.com/Product/Product.aspx?Item=N82E16817101021
http://www.newegg.com/Product/Product.aspx?Item=N82E16817148008

>> Forgot to ask previously. This system is attached to a UPS isn't it?
>
> Yes, the server is plugged in through a dedicated UPS.

Good, takes care of that.

>> I assume this resides on a different machine.
>
> 4 drives in an external USB enclosure. 3 are a RAID0.

Ok, so this is your workstation, not a dedicated server? Does it have a PCIe GPU? If so, what wattage? Ok, if you don't know that, what model?

>> Were the drives attached to the onboard SATA controller or an HBA?
>
> All 6 drives and my OS SSD are plugged into onboard SATA.

I counted 8 drives in the picture.

-- Stan

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-07-03 1:50 ` Stan Hoeppner @ 2013-07-03 5:26 ` Barrett Lewis 2013-07-03 14:03 ` Jon Nelson 2013-07-03 17:05 ` Stan Hoeppner 0 siblings, 2 replies; 34+ messages in thread From: Barrett Lewis @ 2013-07-03 5:26 UTC (permalink / raw) To: stan; +Cc: linux-raid

On Tue, Jul 2, 2013 at 8:50 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 7/2/2013 3:58 PM, Barrett Lewis wrote:
> >> I assume this resides on a different machine.
> >
> > 4 drives in an external USB enclosure. 3 are a RAID0.
>
> Ok, so this is your workstation, not a dedicated server? Does it have a
> PCIe GPU? If so what wattage? Ok, if you don't know that, what model?

This is all about my dedicated server. The external enclosure with the 4 drives, 3 of which are in a RAID0, is just something I used for creating an emergency backup, and it was plugged directly into the server via USB (it has its own power supply too). The server is using the onboard video on the ASRock Z77 Extreme4.

> >> Were the drives attached to the onboard SATA controller or an HBA?
> >
> > All 6 drives and my OS SSD are plugged into onboard SATA.
>
> I counted 8 drives in the picture.

The other 2 drives in the picture are the source drives that held the original data the array was initially populated with. They are not plugged into power or data; just taking up space, really. I never took them out because I always intended to grow the array onto them, but then the failures started.

> > https://docs.google.com/file/d/0B1w3WvCHlYUWSGdBdjh3dWpuUnc/edit?usp=sharing
>
> Drives don't beep, they can't. They don't contain transducers, never
> have. And you don't have a RAID card. So that beep must be from the
> motherboard connected PC speaker, which means you have raidmon or
> another md monitoring daemon active. If this is the case it was simply
> giving an audible alert that a drive had been dropped.

So, I accept that you know this stuff better than me, but I was pretty sure that noise was coming out of the drives (I had never seen or heard anything like it before, so I was very surprised). When I first built the machine I heard it once when a drive was jarred: the caddy wasn't pushed all the way back, I pushed it until it clicked while the machine was running, and something made a quick "beep", which I thought was odd. Then, the day these failures started, it sounded like the same "beeping" noises were coming out of several drives all at once, out of sync with each other, sometimes overlapping, sometimes with the pitches offset; it really didn't sound like a single source at all. But I guess I could have been mistaken. I have been really curious about this "beeping" issue since it is so bizarre. Anyway, like I said, only 2 of those original 6 (they were Seagate ST2000DM001s) remain.

> For troubleshooting purposes I'd think any recent 400+ watt ATX PSU you
> have lying around should work, assuming there's no high wattage PCIe GPU
> card in the box sucking +12V power, and assuming you have all the
> necessary y-cables and SATA power adapters, etc. Try a spare PSU if
> possible before plunking cash on a possibly unneeded replacement.
>
> For a permanent replacement, I'll tell ya, they're all of pretty much
> similar quality today, except for the fan, after you get off the very
> bottom of the barrel. Cheap units come with cheap sleeve bearing fans
> that don't last. I buy near the bottom of the barrel and replace the
> fans on day one. I buy quality fans in bulk on closeout/overstock/etc
> every few years specifically for this purpose. Most don't have standard
> 2 pin PC connectors so I cut the one off the stock crap fan and solder
> it to the good one.

A cheap alternate PSU seemed to work OK, so I went to buy a decent permanent replacement.
I couldn't find either of the two you suggested at the store (they were closing and I wanted to get this done), so I ended up going with a 750W Corsair CX750M. Like magic, with the new power supply most of the drives seem to be back working, except the first two that failed out yesterday. It seems like maybe the event counters (or something) are too far behind to assemble them back. That said, md0 mounts fine and fsck returned clean, so that deserves some kinda hooray! Here is some data about the two (sdd and sdf) that won't socialize with the other disks.

sudo mdadm --assemble --force --verbose /dev/md0 /dev/sd[a-f]
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sde is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdf is identified as a member of /dev/md0, slot 2.
mdadm: added /dev/sdd to /dev/md0 as 1 (possibly out of date)
mdadm: added /dev/sdf to /dev/md0 as 2 (possibly out of date)
mdadm: added /dev/sde to /dev/md0 as 3
mdadm: added /dev/sda to /dev/md0 as 4
mdadm: added /dev/sdc to /dev/md0 as 5
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: /dev/md0 has been started with 4 drives (out of 6).

and from dmesg:

[ 4481.356723] md: bind<sdd>
[ 4481.356850] md: bind<sdf>
[ 4481.357007] md: bind<sde>
[ 4481.357134] md: bind<sda>
[ 4481.357248] md: bind<sdc>
[ 4481.357365] md: bind<sdb>
[ 4481.357395] md: kicking non-fresh sdf from array!
[ 4481.357400] md: unbind<sdf>
[ 4481.374480] md: export_rdev(sdf)
[ 4481.374484] md: kicking non-fresh sdd from array!
[ 4481.374488] md: unbind<sdd>
[ 4481.394486] md: export_rdev(sdd)
[ 4481.396164] md/raid:md0: device sdb operational as raid disk 0
[ 4481.396168] md/raid:md0: device sdc operational as raid disk 5
[ 4481.396171] md/raid:md0: device sda operational as raid disk 4
[ 4481.396173] md/raid:md0: device sde operational as raid disk 3
[ 4481.396571] md/raid:md0: allocated 6384kB
[ 4481.396805] md/raid:md0: raid level 6 active with 4 out of 6 devices, algorithm 2
[ 4481.396808] RAID conf printout:
[ 4481.396810]  --- level:6 rd:6 wd:4
[ 4481.396812]  disk 0, o:1, dev:sdb
[ 4481.396814]  disk 3, o:1, dev:sde
[ 4481.396815]  disk 4, o:1, dev:sda
[ 4481.396817]  disk 5, o:1, dev:sdc
[ 4481.396848] md0: detected capacity change from 0 to 8001056407552
[ 4481.426011] md0: unknown partition table

sudo mdadm -E /dev/sd[a-f] | nopaste
http://pastie.org/8105693

sudo smartctl -x /dev/sdd | nopaste
http://pastie.org/8105706

sudo smartctl -x /dev/sdf | nopaste
http://pastie.org/8105707

Are sdd and sdf just too out of sync? Should I zero the superblocks and re-add them to the array? Or I could replace them (I have two unopened WD Reds here, but I'd like to return them if I don't really need them right now).

Thanks for the advice about the PSU; I would never have dreamed it could cause behaviour like that.

^ permalink raw reply [flat|nested] 34+ messages in thread
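For what it's worth, the "possibly out of date" verdicts above come from mdadm comparing each member's Events counter in the superblock; members that trail the rest of the array get kicked as "non-fresh" at assembly time. A minimal sketch of pulling that number out (the heredoc is an invented fragment in the usual `mdadm -E` layout, standing in for a real drive; `events_of` is a hypothetical helper, not a thread command):

```shell
# Print the Events counter from mdadm -E style text. On a live system
# you would pipe `mdadm -E /dev/sdd` in instead of the sample below.
events_of() {
    awk '/Events :/ {print $3}'
}

# Invented sample fragment in the same format as real -E output:
events_of <<'EOF'
          Magic : a92b4efc
        Version : 1.2
         Events : 1691090
EOF
```

Running the per-member numbers side by side shows at a glance how far behind sdd and sdf fell while the array kept writing without them.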
* Re: Mdadm server eating drives 2013-07-03 5:26 ` Barrett Lewis @ 2013-07-03 14:03 ` Jon Nelson 2013-07-03 14:36 ` Phil Turmel 2013-07-03 17:32 ` Stan Hoeppner 2013-07-03 17:05 ` Stan Hoeppner 1 sibling, 2 replies; 34+ messages in thread From: Jon Nelson @ 2013-07-03 14:03 UTC (permalink / raw) To: Barrett Lewis; +Cc: Stan Hoeppner, linux-raid On Wed, Jul 3, 2013 at 12:26 AM, Barrett Lewis <barrett.lewis.mitsi@gmail.com> wrote: > > didn't sound like a single source at all. But I guess could have been > mistaken. I have been really curious about this "beeping" issue since > it is so bizarre. Anyway like I said only 2 of those original 6 (they > were seagate ST2000DM001) remain. A quick google search shows the ST2000DM001 (which I have 2 of) *do* make "chirping" or "beeping" noises. Additionally, it seems there are firmware updates available. Sadly, I bought two of these drives some months ago (current firmware: CC26) and so far so good. However, should I be worried about these drives? smartctl -a showed me a pair of links that brought me to the firmware update pages. -- Jon ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-07-03 14:03 ` Jon Nelson @ 2013-07-03 14:36 ` Phil Turmel 2013-07-03 17:32 ` Stan Hoeppner 1 sibling, 0 replies; 34+ messages in thread From: Phil Turmel @ 2013-07-03 14:36 UTC (permalink / raw) To: Jon Nelson; +Cc: Barrett Lewis, Stan Hoeppner, linux-raid On 07/03/2013 10:03 AM, Jon Nelson wrote: > On Wed, Jul 3, 2013 at 12:26 AM, Barrett Lewis > <barrett.lewis.mitsi@gmail.com> wrote: >> >> didn't sound like a single source at all. But I guess could have been >> mistaken. I have been really curious about this "beeping" issue since >> it is so bizarre. Anyway like I said only 2 of those original 6 (they >> were seagate ST2000DM001) remain. > > > A quick google search shows the ST2000DM001 (which I have 2 of) *do* > make "chirping" or "beeping" noises. Additionally, it seems there are > firmware updates available. Sadly, I bought two of these drives some > months ago (current firmware: CC26) and so far so good. However, > should I be worried about these drives? They don't support ERC. You *must* use a 2-3 minute driver timeout for these if you use them in a raid array. Phil ^ permalink raw reply [flat|nested] 34+ messages in thread
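Phil's workaround can be sketched as a tiny helper (a sketch, not a command from the thread; the sysfs path is the standard Linux one, writing it needs root, and the value reverts on reboot):

```shell
# Raise the kernel's per-device command timeout so the driver outlasts a
# non-ERC drive's internal error recovery (which can run ~2 minutes).
set_driver_timeout() {
    echo "$2" > "$1"    # $1 = .../device/timeout path, $2 = seconds
}

# On a live system (device letters are an assumption):
#   for t in /sys/block/sd[a-f]/device/timeout; do
#       set_driver_timeout "$t" 180
#   done
```

Since the sysfs setting does not persist, a udev rule or boot script is the usual way to reapply it on every startup.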
* Re: Mdadm server eating drives 2013-07-03 14:03 ` Jon Nelson 2013-07-03 14:36 ` Phil Turmel @ 2013-07-03 17:32 ` Stan Hoeppner 2013-07-03 19:47 ` Barrett Lewis 1 sibling, 1 reply; 34+ messages in thread From: Stan Hoeppner @ 2013-07-03 17:32 UTC (permalink / raw) To: Jon Nelson; +Cc: Barrett Lewis, linux-raid

On 7/3/2013 9:03 AM, Jon Nelson wrote:
> A quick google search shows the ST2000DM001 (which I have 2 of) *do*
> make "chirping" or "beeping" noises. Additionally

Yes, Seagate still makes some relatively noisy drives compared to others on the market. I had ST225s and ST251s in the late 80s that could be heard across a large room. All drives were noisy back then due to the use of stepper motors. Drives have used voice coil actuators since the early 90s, shortly after the IDE spec was adopted, and those are an order of magnitude quieter. These are very old and thus a bit noisier, but not all that much:

http://www.youtube.com/watch?v=NYEkC7FBXa4
http://www.youtube.com/watch?v=RZMrwdQBVf4

But surely nobody would confuse this random mechanical drive noise for an audible alarm. And of course, with dirty power as in the OP's case, drives will make more noise due to the firmware doing constant recalibration of the heads as the spindle repeatedly drops below the minimum RPM threshold and comes back up again when voltage increases. I think some people have simply become accustomed to ultra-quiet drives that rarely make a peep, and when they do, people get nervous.

-- Stan

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-07-03 17:32 ` Stan Hoeppner @ 2013-07-03 19:47 ` Barrett Lewis 2013-07-03 20:38 ` Jon Nelson 2013-07-04 2:21 ` Stan Hoeppner 0 siblings, 2 replies; 34+ messages in thread From: Barrett Lewis @ 2013-07-03 19:47 UTC (permalink / raw) To: linux-raid

I added the two non-fresh drives back to the array and they have been resyncing. The first one is almost complete. No errors so far. Everything has been very smooth since replacing the power supply. And I am paying extra close attention to make sure my TLER-capable drives have it turned on, and the others have the driver timeout set to 180. From now on I will only be buying TLER-capable drives. Just an update.

Also, this is likely the same drive making the same noise with the top off: https://www.youtube.com/watch?v=a9i5yixsJbk

I wonder if it's PWM driving the actuator against the barrier.

^ permalink raw reply [flat|nested] 34+ messages in thread
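The per-drive policy Barrett describes (ERC on where the drive supports it, a long driver timeout everywhere else) can be sketched like this. The `smartctl -l scterc` output layout is assumed from typical smartmontools versions, and `pick_timeout` is a hypothetical helper, not a real tool:

```shell
# Choose a driver timeout from `smartctl -l scterc` output: a numeric
# Read value means ERC is active, so the stock 30 s driver timeout is
# fine; "Disabled" (or unsupported) gets the long 180 s timeout instead.
pick_timeout() {
    if grep -q 'Read:[[:space:]]*[0-9]' ; then
        echo 30
    else
        echo 180
    fi
}

# Live usage (hypothetical device name):
#   smartctl -l scterc /dev/sdX | pick_timeout
```

Enabling ERC itself is done with `smartctl -l scterc,70,70 /dev/sdX` (values in deciseconds), and like the sysfs timeout it generally has to be reapplied after a power cycle.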
* Re: Mdadm server eating drives 2013-07-03 19:47 ` Barrett Lewis @ 2013-07-03 20:38 ` Jon Nelson 2013-07-04 2:21 ` Stan Hoeppner 1 sibling, 0 replies; 34+ messages in thread From: Jon Nelson @ 2013-07-03 20:38 UTC (permalink / raw) To: Barrett Lewis; +Cc: linux-raid

On Wed, Jul 3, 2013 at 2:47 PM, Barrett Lewis <barrett.lewis.mitsi@gmail.com> wrote:
> I added the two non-fresh drives back to the array and they have been
> resyncing. The first one is almost complete. No errors so far.
> Everything has been very smooth since replacing the power supply.
> And I am paying extra close attention to make sure my TLER-capable
> drives have it turned on, and the others have the driver timeout set
> to 180. From now on I will only be buying TLER-capable drives.
> Just an update.

That's very good news. Data loss can be very frustrating, I know! As for myself, even though I've got newer drives and firmware (ST2000DM001-1CH164, firmware CC26), I'm going to be actively looking to replace these drives.

-- Jon

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-07-03 19:47 ` Barrett Lewis 2013-07-03 20:38 ` Jon Nelson @ 2013-07-04 2:21 ` Stan Hoeppner 1 sibling, 0 replies; 34+ messages in thread From: Stan Hoeppner @ 2013-07-04 2:21 UTC (permalink / raw) To: Barrett Lewis; +Cc: linux-raid

On 7/3/2013 2:47 PM, Barrett Lewis wrote:
> I added the two non-fresh drives back to the array and they have been
> resyncing. The first one is almost complete. No errors so far.
> Everything has been very smooth since replacing the power supply.
> And I am paying extra close attention to make sure my TLER-capable
> drives have it turned on, and the others have the driver timeout set
> to 180. From now on I will only be buying TLER-capable drives.
> Just an update.

Good to hear. I hope things keep looking up.

> Also, this is likely the same drive making the same noise with the top
> off https://www.youtube.com/watch?v=a9i5yixsJbk
> I wonder if it's PWM driving the actuator against the barrier.

Spinning HDDs use voice coil actuators to move the heads. The voice coil is driven by direct DC current, not pulse-width modulation. PWM holds voltage and current constant but switches the circuit on and off hundreds or thousands of times per second: the lower the duty cycle, the less power delivered to the device; the higher the duty cycle, the greater the power. Thus PWM is suitable for varying the speed of brushless DC fans and the brightness of incandescent light bulbs. It simply won't work for driving voice coil actuators.

Regarding HDD noises, identifying/diagnosing them is a voodoo science, unless you happen to be an engineer at Seagate, WD, or Toshiba, which are, sadly, AFAIK, the only 3 HDD vendors left on the planet. Seagate and WD have swallowed all the others.

-- Stan

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-07-03 5:26 ` Barrett Lewis 2013-07-03 14:03 ` Jon Nelson @ 2013-07-03 17:05 ` Stan Hoeppner 1 sibling, 0 replies; 34+ messages in thread From: Stan Hoeppner @ 2013-07-03 17:05 UTC (permalink / raw) To: Barrett Lewis; +Cc: linux-raid, Phil Turmel On 7/3/2013 12:26 AM, Barrett Lewis wrote: ... > This is all about my dedicated server. The external enclosure with > the 4 drives, 3 of which in a raid0 is just something I used for > creating an emergency backup, and was plugged directly into the server > via USB, (has it's own power supply too). The server is using the > onboard video card on the Asrock z77 extreme 4. Got it. ... > The other 2 drives in the picture are the source drives that had the > original data that the array was initially populated with. Got it. These questions were simply to get a handle on how much +12V power you needed before recommending a PSU. ... > I have been really curious about this "beeping" issue since > it is so bizarre. Anyway like I said only 2 of those original 6 (they > were seagate ST2000DM001) remain. When power supplies go bad you may witness all kinds of weird things. If the voltage to the speaker drive circuit fluctuates wildly it can cause leakage on the output drive, which causes the speaker to make random noises. > Cheap alternate PSU seemed to work OK so I went to buy a decent > permanent replacement. I couldn't find either of the two you > suggested at the store (they were closing and I wanted to get this > done). So I ended up going with a 750w corsair CX750M. Like magic, > with a new power supply most of the drives seem to be back working, > except the first two that failed out yesterday. It seems like maybe > the event counters (or something) are too far behind to assemble them > back. That said, md0 mounts fine and fsck returned clean, so that > deserves some kinda hooray! The key thing is whether drives keep showing errors in dmesg and dropping. 
If not your problem is likely solved. :) > Here is some data about the two (sdd and sdf) that won't socialize > with the other disks. > > sudo mdadm --assemble --force --verbose /dev/md0 /dev/sd[a-f] > mdadm: looking for devices for /dev/md0 > mdadm: /dev/sda is identified as a member of /dev/md0, slot 4. > mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0. > mdadm: /dev/sdc is identified as a member of /dev/md0, slot 5. > mdadm: /dev/sdd is identified as a member of /dev/md0, slot 1. > mdadm: /dev/sde is identified as a member of /dev/md0, slot 3. > mdadm: /dev/sdf is identified as a member of /dev/md0, slot 2. > mdadm: added /dev/sdd to /dev/md0 as 1 (possibly out of date) > mdadm: added /dev/sdf to /dev/md0 as 2 (possibly out of date) > mdadm: added /dev/sde to /dev/md0 as 3 > mdadm: added /dev/sda to /dev/md0 as 4 > mdadm: added /dev/sdc to /dev/md0 as 5 > mdadm: added /dev/sdb to /dev/md0 as 0 > mdadm: /dev/md0 has been started with 4 drives (out of 6). > > > and from dmesg > [ 4481.356723] md: bind<sdd> > [ 4481.356850] md: bind<sdf> > [ 4481.357007] md: bind<sde> > [ 4481.357134] md: bind<sda> > [ 4481.357248] md: bind<sdc> > [ 4481.357365] md: bind<sdb> > [ 4481.357395] md: kicking non-fresh sdf from array! > [ 4481.357400] md: unbind<sdf> > [ 4481.374480] md: export_rdev(sdf) > [ 4481.374484] md: kicking non-fresh sdd from array! 
> [ 4481.374488] md: unbind<sdd> > [ 4481.394486] md: export_rdev(sdd) > [ 4481.396164] md/raid:md0: device sdb operational as raid disk 0 > [ 4481.396168] md/raid:md0: device sdc operational as raid disk 5 > [ 4481.396171] md/raid:md0: device sda operational as raid disk 4 > [ 4481.396173] md/raid:md0: device sde operational as raid disk 3 > [ 4481.396571] md/raid:md0: allocated 6384kB > [ 4481.396805] md/raid:md0: raid level 6 active with 4 out of 6 > devices, algorithm 2 > [ 4481.396808] RAID conf printout: > [ 4481.396810] --- level:6 rd:6 wd:4 > [ 4481.396812] disk 0, o:1, dev:sdb > [ 4481.396814] disk 3, o:1, dev:sde > [ 4481.396815] disk 4, o:1, dev:sda > [ 4481.396817] disk 5, o:1, dev:sdc > [ 4481.396848] md0: detected capacity change from 0 to 8001056407552 > [ 4481.426011] md0: unknown partition table > > sudo mdadm -E /dev/sd[a-f] | nopaste > http://pastie.org/8105693 > > sudo smartctl -x /dev/sdd | nopaste > http://pastie.org/8105706 > > sudo smartctl -x /dev/sdf | nopaste > http://pastie.org/8105707 > > > Are sdd and sdf just too out of sync? Should I zero the superblocks > and re-add them to the array? Or I could replace them (I have two > unopened WD reds here, but I'd like to return them if I don't really > need them right now). I'm not an expert on recovery when things go this far South. Phil and others are much more knowledgeable with this, so I'll pass the thread back to them now. > Thanks for the advice about the PSU, I would have never dreamed it > would cause behaviour like that. You're welcome. I've spent just a little time around hardware, as you might have guessed based on my email address. Started in 1986, so that's, what, 26 years now? Damn I'm getting old... -- Stan ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-07-02 0:17 ` Barrett Lewis 2013-07-02 1:57 ` Stan Hoeppner @ 2013-07-02 21:49 ` Phil Turmel 1 sibling, 0 replies; 34+ messages in thread From: Phil Turmel @ 2013-07-02 21:49 UTC (permalink / raw) To: Barrett Lewis; +Cc: linux-raid, Stan Hoeppner On 07/01/2013 08:17 PM, Barrett Lewis wrote: > I am very sorry to keep bugging this list, but I am really lost. My apologies... I was helping you before I disappeared on a 2-week business trip. I plain forgot about your case. Anyways, Stan's on the case. [Thanks, Stan.] Phil ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Mdadm server eating drives 2013-06-14 21:18 ` Barrett Lewis 2013-06-14 21:20 ` Barrett Lewis @ 2013-06-14 21:24 ` Phil Turmel 1 sibling, 0 replies; 34+ messages in thread From: Phil Turmel @ 2013-06-14 21:24 UTC (permalink / raw) To: Barrett Lewis; +Cc: linux-raid

On 06/14/2013 05:18 PM, Barrett Lewis wrote:
> I'll definitely do this. When you talk about mismatched timeouts, do
> you mean matched between each of the components (as in
> /sys/block/sdX/device/timeout) or between that driver timeout and some
> device timeout per component? If you mean between components, are my
> timeouts matched now, even though I did not raise the 30 seconds on
> the two drives with ERC?

For each drive, the driver timeout (/sys/block/.../device/timeout) must be longer than the drive's own timeout (smartctl -l scterc). Note that scterc is in deciseconds, while the driver uses seconds. Enterprise drives typically power up with 7.0-second timeouts. The few SSDs I've been playing with power up with 4.0-second timeouts. Without ERC, the drives I've played with will perform error recovery for about two full minutes, ignoring the world for the duration.

Phil

^ permalink raw reply [flat|nested] 34+ messages in thread
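Phil's unit warning reduces to a single comparison, sketched here as a hypothetical check (the helper name is my own invention; scterc values are in deciseconds, the driver timeout in seconds, as he explains):

```shell
# A drive is safe in an array when the kernel driver waits longer than
# the drive's own error recovery: driver seconds > scterc deciseconds / 10.
erc_safe() {
    driver_s=$1
    scterc_ds=$2
    if [ "$driver_s" -gt $(( scterc_ds / 10 )) ]; then
        echo safe
    else
        echo unsafe
    fi
}

erc_safe 30 70      # 30 s driver vs 7.0 s ERC: prints safe
erc_safe 30 1200    # 30 s driver vs ~2 min non-ERC recovery: prints unsafe
```

The second case is exactly the failure mode in this thread: a desktop drive grinding through two minutes of internal recovery while the driver gives up at 30 seconds and md kicks the "failed" member.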
* Re: Mdadm server eating drives 2013-06-14 2:08 ` Phil Turmel [not found] ` <CAPSPcXgMxOF-C2Szu_nf4ZLDC8p+yJFOtvLPu7xy1DTW9VAHjg@mail.gmail.com> @ 2013-07-29 22:25 ` Roy Sigurd Karlsbakk 1 sibling, 0 replies; 34+ messages in thread From: Roy Sigurd Karlsbakk @ 2013-07-29 22:25 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid, Barrett Lewis

> > 5) for x in /sys/block/sd*/device/timeout ; do echo $x $(< $x) ; done
> > http://pastie.org/8040870
>
> All timeouts are still the default 30 seconds. With enabled ERC
> support, these values must be two to three minutes. I recommend 180
> seconds. Your array *will not* complete a rebuild without dealing with
> this problem.

With ERC support, those timeouts should be around 7 seconds, not 3 minutes. What he pasted was 180 seconds, as in three minutes, which will bust a RAID rather quickly.

Vennlige hilsener / Best regards

roy
-- Roy Sigurd Karlsbakk (+47) 98013356 roy@karlsbakk.net http://blogg.karlsbakk.net/ GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med xenotyp etymologi. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2013-07-29 22:25 UTC | newest] Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-06-12 13:47 Mdadm server eating drives Barrett Lewis 2013-06-12 13:57 ` David Brown 2013-06-12 14:44 ` Phil Turmel 2013-06-12 15:41 ` Adam Goryachev [not found] ` <CAPSPcXihHrAi2TB9Fuxb1qOGMc_WzwGoXAA7nHdwe2knkO0LkQ@mail.gmail.com> [not found] ` <CAPSPcXib4YZ9Ah-jLvL_kPwpKHLxaGT0rNaDL4XQcFm=RtjcAQ@mail.gmail.com> 2013-06-14 0:19 ` Barrett Lewis 2013-06-14 2:08 ` Phil Turmel [not found] ` <CAPSPcXgMxOF-C2Szu_nf4ZLDC8p+yJFOtvLPu7xy1DTW9VAHjg@mail.gmail.com> 2013-06-14 21:18 ` Barrett Lewis 2013-06-14 21:20 ` Barrett Lewis 2013-06-14 21:25 ` Phil Turmel 2013-06-14 21:30 ` Phil Turmel 2013-06-17 21:37 ` Barrett Lewis 2013-06-18 4:13 ` Mikael Abrahamsson 2013-06-27 0:23 ` Barrett Lewis 2013-06-27 17:13 ` Nicolas Jungers 2013-07-02 0:17 ` Barrett Lewis 2013-07-02 1:57 ` Stan Hoeppner 2013-07-02 15:48 ` Barrett Lewis 2013-07-02 19:44 ` Stan Hoeppner 2013-07-02 19:54 ` Stan Hoeppner 2013-07-02 20:07 ` Jon Nelson 2013-07-02 20:23 ` Stan Hoeppner 2013-07-02 20:58 ` Barrett Lewis 2013-07-03 1:50 ` Stan Hoeppner 2013-07-03 5:26 ` Barrett Lewis 2013-07-03 14:03 ` Jon Nelson 2013-07-03 14:36 ` Phil Turmel 2013-07-03 17:32 ` Stan Hoeppner 2013-07-03 19:47 ` Barrett Lewis 2013-07-03 20:38 ` Jon Nelson 2013-07-04 2:21 ` Stan Hoeppner 2013-07-03 17:05 ` Stan Hoeppner 2013-07-02 21:49 ` Phil Turmel 2013-06-14 21:24 ` Phil Turmel 2013-07-29 22:25 ` Roy Sigurd Karlsbakk