* RAID6 reshape, 2 disk failures
@ 2012-10-16 22:57 Mathias Burén
  2012-10-17  2:29 ` Stan Hoeppner
  2012-10-17  3:06 ` Chris Murphy
  0 siblings, 2 replies; 18+ messages in thread
From: Mathias Burén @ 2012-10-16 22:57 UTC (permalink / raw)
  To: Linux-RAID

Hi list,

I started a reshape from 64K chunk size to 512K (now default IIRC).
During this time 2 disks failed with some time in between. The first
one was removed by MD, so I shut down and removed the HDD, continued
the reshape. After a while the second HDD failed. This is what it
looks like right now, the second failed HDD still in as you can see:

 $ iostat -m
Linux 3.5.5-1-ck (ion)  10/16/2012      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.93    7.81    5.40   15.57    0.00   62.28

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              38.93         0.00        13.09        939    8134936
sdb              59.37         5.19         2.60    3224158    1613418
sdf              59.37         5.19         2.60    3224136    1613418
sdc              59.37         5.19         2.60    3224134    1613418
sdd              59.37         5.19         2.60    3224151    1613418
sde              42.17         3.68         1.84    2289332    1145595
sdg              59.37         5.19         2.60    3224061    1613418
sdh               0.00         0.00         0.00          9          0
md0               0.06         0.00         0.00       2023          0
dm-0              0.06         0.00         0.00       2022          0

 $ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sde1[0](F) sdg1[8] sdc1[5] sdd1[3] sdb1[4] sdf1[9]
      9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 [7/5] [_UUUUU_]
      [================>....]  reshape = 84.6% (1650786304/1950351360) finish=2089.2min speed=2389K/sec

unused devices: <none>

 $ sudo mdadm -D /dev/md0
[sudo] password for x:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Oct 19 08:58:41 2010
     Raid Level : raid6
     Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
  Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
   Raid Devices : 7
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Tue Oct 16 23:55:28 2012
          State : clean, degraded, reshaping
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

 Reshape Status : 84% complete
  New Chunksize : 512K

           Name : ion:0  (local to host ion)
           UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
         Events : 8386010

    Number   Major   Minor   RaidDevice State
       0       8       65        0      faulty spare rebuilding   /dev/sde1
       9       8       81        1      active sync   /dev/sdf1
       4       8       17        2      active sync   /dev/sdb1
       3       8       49        3      active sync   /dev/sdd1
       5       8       33        4      active sync   /dev/sdc1
       8       8       97        5      active sync   /dev/sdg1
       6       0        0        6      removed


What is confusing to me is that /dev/sde1 (which is failing) is
currently marked as rebuilding. But when I check iostat, it's far
behind the other drives in total I/O since the reshape started, and
the I/O hasn't actually changed for a few hours. This together with _
instead of U leads me to believe that it's not actually being used. So
why does it say rebuilding?

I guess my question is if it's possible for me to remove the drive, or
would I mess the array up? I am not going to do anything until the
reshape finishes though.
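
For reference, the reshape was started with roughly the following command (the backup file lives on the OS SSD; the exact path shown here is only an example):

 $ sudo mdadm --grow /dev/md0 --chunk=512 --backup-file=/root/md0-reshape-backup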

Thanks,
Mathias


* Re: RAID6 reshape, 2 disk failures
  2012-10-16 22:57 RAID6 reshape, 2 disk failures Mathias Burén
@ 2012-10-17  2:29 ` Stan Hoeppner
  2012-10-17  3:06 ` Chris Murphy
  1 sibling, 0 replies; 18+ messages in thread
From: Stan Hoeppner @ 2012-10-17  2:29 UTC (permalink / raw)
  To: Linux-RAID

On 10/16/2012 5:57 PM, Mathias Burén wrote:
> Hi list,
> 
> I started a reshape from 64K chunk size to 512K (now default IIRC).
> During this time 2 disks failed with some time in between. The first
> one was removed by MD, so I shut down and removed the HDD, continued
> the reshape. After a while the second HDD failed. This is what it
> looks like right now, the second failed HDD still in as you can see:

Apparently you don't realize you're going through all of this for the
sake of a senseless change that will gain you nothing, and cost you
performance.  Large chunk sizes are murder for parity RAID due to the
increased IO bandwidth required during RMW cycles.  The new 512KB
default is way too big.  And with many random IO workloads even 64KB is
a bit large.  This was discussed on this list in detail not long ago.

I guess one positive aspect is you've discovered problems with a couple
of drives.  Better now than later I guess.

-- 
Stan


>  $ iostat -m
> Linux 3.5.5-1-ck (ion)  10/16/2012      _x86_64_        (4 CPU)
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            8.93    7.81    5.40   15.57    0.00   62.28
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda              38.93         0.00        13.09        939    8134936
> sdb              59.37         5.19         2.60    3224158    1613418
> sdf              59.37         5.19         2.60    3224136    1613418
> sdc              59.37         5.19         2.60    3224134    1613418
> sdd              59.37         5.19         2.60    3224151    1613418
> sde              42.17         3.68         1.84    2289332    1145595
> sdg              59.37         5.19         2.60    3224061    1613418
> sdh               0.00         0.00         0.00          9          0
> md0               0.06         0.00         0.00       2023          0
> dm-0              0.06         0.00         0.00       2022          0
> 
>  $ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sde1[0](F) sdg1[8] sdc1[5] sdd1[3] sdb1[4] sdf1[9]
>       9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 [7/5] [_UUUUU_]
>       [================>....]  reshape = 84.6% (1650786304/1950351360) finish=2089.2min speed=2389K/sec
> 
> unused devices: <none>
> 
>  $ sudo mdadm -D /dev/md0
> [sudo] password for x:
> /dev/md0:
>         Version : 1.2
>   Creation Time : Tue Oct 19 08:58:41 2010
>      Raid Level : raid6
>      Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
>   Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
>    Raid Devices : 7
>   Total Devices : 6
>     Persistence : Superblock is persistent
> 
>     Update Time : Tue Oct 16 23:55:28 2012
>           State : clean, degraded, reshaping
>  Active Devices : 5
> Working Devices : 5
>  Failed Devices : 1
>   Spare Devices : 0
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>  Reshape Status : 84% complete
>   New Chunksize : 512K
> 
>            Name : ion:0  (local to host ion)
>            UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
>          Events : 8386010
> 
>     Number   Major   Minor   RaidDevice State
>        0       8       65        0      faulty spare rebuilding   /dev/sde1
>        9       8       81        1      active sync   /dev/sdf1
>        4       8       17        2      active sync   /dev/sdb1
>        3       8       49        3      active sync   /dev/sdd1
>        5       8       33        4      active sync   /dev/sdc1
>        8       8       97        5      active sync   /dev/sdg1
>        6       0        0        6      removed
> 
> 
> What is confusing to me is that /dev/sde1 (which is failing) is
> currently marked as rebuilding. But when I check iostat, it's far
> behind the other drives in total I/O since the reshape started, and
> the I/O hasn't actually changed for a few hours. This together with _
> instead of U leads me to believe that it's not actually being used. So
> why does it say rebuilding?
> 
> I guess my question is if it's possible for me to remove the drive, or
> would I mess the array up? I am not going to do anything until the
> reshape finishes though.
> 
> Thanks,
> Mathias


* Re: RAID6 reshape, 2 disk failures
  2012-10-16 22:57 RAID6 reshape, 2 disk failures Mathias Burén
  2012-10-17  2:29 ` Stan Hoeppner
@ 2012-10-17  3:06 ` Chris Murphy
  2012-10-17  8:03   ` Mathias Burén
  1 sibling, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2012-10-17  3:06 UTC (permalink / raw)
  To: linux-raid RAID


On Oct 16, 2012, at 4:57 PM, Mathias Burén wrote:

> I started a reshape from 64K chunk size to 512K

I agree with Stan, not a good idea, and also a waste of time. Do you have check scrubs and extended offline smart tests scheduled for these drives periodically?
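
(For the record, those are typically kicked off with something like the following; md0 and sde are just the device names from your mail:)

 $ echo check | sudo tee /sys/block/md0/md/sync_action   # start an md "check" scrub
 $ sudo smartctl -t long /dev/sde                        # queue an extended offline self-test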

> 
> I guess my question is if it's possible for me to remove the drive, or
> would I mess the array up? I am not going to do anything until the
> reshape finishes though.

I think you should put in a replacement drive for sda (#6) and get it rebuilding, as sde seems rather tenuous, before you decide to remove sde.
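
Getting a replacement into the empty slot would look roughly like this, with a hypothetical device name and after partitioning it like the others:

 $ sudo mdadm --manage /dev/md0 --add /dev/sdX1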

You should find out why it's slow. 'smartctl -A /dev/sde' might reveal this now; you can issue it even while the reshape is occurring - the command just polls the drive's existing SMART attribute values. If it's the same model disk, connected the same way as all the other drives, I'd get rid of it.


Chris Murphy

* Re: RAID6 reshape, 2 disk failures
  2012-10-17  3:06 ` Chris Murphy
@ 2012-10-17  8:03   ` Mathias Burén
  2012-10-17  9:09     ` Chris Murphy
  2012-10-21 22:31     ` NeilBrown
  0 siblings, 2 replies; 18+ messages in thread
From: Mathias Burén @ 2012-10-17  8:03 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-raid RAID

On 17 October 2012 04:06, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Oct 16, 2012, at 4:57 PM, Mathias Burén wrote:
>
>> I started a reshape from 64K chunk size to 512K
>
> I agree with Stan, not a good idea, and also a waste of time. Do you have check scrubs and extended offline smart tests scheduled for these drives periodically?
>

Weekly scrubs and weekly offline self-tests. SMART always looked good,
until 1 drive died completely, the other has 5 uncorrectable sectors.
LCC is under 250K. WD20EARS

There are basically no files under 8GB on the array, so I
thought the new chunk size made sense.

>>
>> I guess my question is if it's possible for me to remove the drive, or
>> would I mess the array up? I am not going to do anything until the
>> reshape finishes though.
>
> I think you should put in a replacement drive for sda (#6) and get it rebuilding, as sde seems rather tenuous, before you decide to remove sde.
>
> You should find out why it's slow. 'smartctl -A /dev/sde' might reveal this now; you can issue it even while the reshape is occurring - the command just polls the drive's existing SMART attribute values. If it's the same model disk, connected the same way as all the other drives, I'd get rid of it.

It's slow because it's broken (see above).

Any idea why it says rebuilding, when it's not? Is it going to attempt
a rebuild after the reshape?

>
>
> Chris Murphy



Regards,
Mathias

* Re: RAID6 reshape, 2 disk failures
  2012-10-17  8:03   ` Mathias Burén
@ 2012-10-17  9:09     ` Chris Murphy
       [not found]       ` <CADNH=7GaGCLdK2Rk_A6vPN+Th0z0QYT7mRV0KJH=CoAffuvb6w@mail.gmail.com>
  2012-10-21 22:31     ` NeilBrown
  1 sibling, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2012-10-17  9:09 UTC (permalink / raw)
  To: linux-raid RAID


On Oct 17, 2012, at 2:03 AM, Mathias Burén wrote:

> Weekly scrubs and weekly offline self-tests. SMART always looked good,
> until 1 drive died completely, the other has 5 uncorrectable sectors.
> LCC is under 250K. WD20EARS.

Color me confused. Uncorrectable sectors should produce a read error on a check, which will cause the data to be reconstructed from parity and written back to those sectors. A write to a bad sector, if the error is persistent, will cause the sector to be relocated. If that isn't possible (out of reserve sectors?), the disk is toast, since it can no longer deal reliably with bad sectors.

The smartmontools page has information on how to clear uncorrectable sectors manually. But I'd think check would do this.
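
Roughly, that manual procedure comes down to forcing a write to the failing LBA so the drive can remap it. A destructive sketch, where <LBA> is a placeholder for the sector number reported by the SMART self-test log or dmesg:

 $ sudo hdparm --read-sector <LBA> /dev/sde                                 # confirm the sector really errors
 $ sudo hdparm --write-sector <LBA> --yes-i-know-what-i-am-doing /dev/sde   # overwrite it so the drive can remap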

> There are basically no files under 8GB on the array, so I
> thought the new chunk size made sense.

Yeah it seems reasonable in that case. But unless it's benchmarked you don't actually know if it matters.
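
Even a crude before/after benchmark would settle it; something along these lines, where the mount point is hypothetical:

 $ sudo hdparm -t /dev/md0                                             # raw sequential read from the array
 $ dd if=/dev/zero of=/mnt/array/ddtest bs=1M count=4096 oflag=direct  # rough sequential write through the filesystem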

> It's slow because it's broken (see above).
> Any idea why it says rebuilding, when it's not? Is it going to attempt
> a rebuild after the reshape?

Not sure. With two drives missing, you're in a very precarious situation. I would not worry about this detail until you have the sda (#6) replaced and rebuilt. Presumably the reshape must finish before the rebuild will start but I'm not sure of this.

What's dmesg reporting while all of this is going on?


Chris Murphy

* Re: RAID6 reshape, 2 disk failures
       [not found]       ` <CADNH=7GaGCLdK2Rk_A6vPN+Th0z0QYT7mRV0KJH=CoAffuvb6w@mail.gmail.com>
@ 2012-10-17 18:46         ` Chris Murphy
  2012-10-17 19:03           ` Mathias Burén
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2012-10-17 18:46 UTC (permalink / raw)
  To: linux-raid RAID


On Oct 17, 2012, at 10:27 AM, Mathias Burén wrote:

> [419246.582409] end_request: I/O error, dev sde, sector 2343892776
> [419246.582492] md/raid:md0: read error not correctable (sector
> 2343890728 on sde1).
> [419246.582502] md/raid:md0: read error not correctable (sector
> 2343890736 on sde1).
> [419246.582511] md/raid:md0: read error not correctable (sector
> 2343890744 on sde1).
> [419246.582519] md/raid:md0: read error not correctable (sector
> 2343890752 on sde1).
> [419246.582527] md/raid:md0: read error not correctable (sector
> 2343890760 on sde1).
> [419246.582535] md/raid:md0: read error not correctable (sector
> 2343890768 on sde1).
> [419246.582543] md/raid:md0: read error not correctable (sector
> 2343890776 on sde1).
> [419246.582552] md/raid:md0: read error not correctable (sector
> 2343890784 on sde1).
> [419246.582560] md/raid:md0: read error not correctable (sector
> 2343890792 on sde1).
> [419246.582568] md/raid:md0: read error not correctable (sector


...
> 
> You can see the first start of the reshape, then sde started freaking out.

A lot more than just 5 sectors. I'd replace the drive and the cable. If it's under warranty, have it replaced. If not, maybe ata secure erase it, extended offline smart test, and use it down the road for something not so important if it passes without further problems.

So basically, replace the dead sda drive ASAP so it can start rebuilding.

I'd consider marking sde faulty so that it's neither being reshaped nor rebuilt, and then replace it once the new sda is rebuilt. You can probably replace them both at the same time and have them both rebuilding; but I'm being a little conservative on how many changes you make until you get yourself back to some level of redundancy.
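
If you go that route once the reshape is done, the manual fail/remove would be something like:

 $ sudo mdadm --manage /dev/md0 --fail /dev/sde1
 $ sudo mdadm --manage /dev/md0 --remove /dev/sde1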


Chris Murphy

* Re: RAID6 reshape, 2 disk failures
  2012-10-17 18:46         ` Chris Murphy
@ 2012-10-17 19:03           ` Mathias Burén
  2012-10-17 19:35             ` Chris Murphy
  2012-10-18 11:56             ` Stan Hoeppner
  0 siblings, 2 replies; 18+ messages in thread
From: Mathias Burén @ 2012-10-17 19:03 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-raid RAID

On 17 October 2012 19:46, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Oct 17, 2012, at 10:27 AM, Mathias Burén wrote:
>
>> [419246.582409] end_request: I/O error, dev sde, sector 2343892776
>> [419246.582492] md/raid:md0: read error not correctable (sector
>> 2343890728 on sde1).
>> [419246.582502] md/raid:md0: read error not correctable (sector
>> 2343890736 on sde1).
>> [419246.582511] md/raid:md0: read error not correctable (sector
>> 2343890744 on sde1).
>> [419246.582519] md/raid:md0: read error not correctable (sector
>> 2343890752 on sde1).
>> [419246.582527] md/raid:md0: read error not correctable (sector
>> 2343890760 on sde1).
>> [419246.582535] md/raid:md0: read error not correctable (sector
>> 2343890768 on sde1).
>> [419246.582543] md/raid:md0: read error not correctable (sector
>> 2343890776 on sde1).
>> [419246.582552] md/raid:md0: read error not correctable (sector
>> 2343890784 on sde1).
>> [419246.582560] md/raid:md0: read error not correctable (sector
>> 2343890792 on sde1).
>> [419246.582568] md/raid:md0: read error not correctable (sector
>
>
> ...
>>
>> You can see the first start of the reshape, then sde started freaking out.
>
> A lot more than just 5 sectors. I'd replace the drive and the cable. If it's under warranty, have it replaced. If not, maybe ata secure erase it, extended offline smart test, and use it down the road for something not so important if it passes without further problems.

There are no CRC errors so I doubt the cable is at fault. In any case,
I've RMA'd drives for less, and an RMA is underway for this drive.
Just need to wait for the reshape to finish so I can get in the
server. Btw, with a few holes drilled this bad boy holds 7 3.5" HDDs
no problem: http://www.antec.com/productPSU.php?id=30&pid=3

>
> So basically, replace the dead sda drive ASAP so it can start rebuilding.
>

Hm where do you get sda from? sda is the OS disk, an old SSD. (it
currently holds the reshape backup file)

> I'd consider marking sde faulty so that it's neither being reshaped nor rebuilt, and then replace it once the new sda is rebuilt. You can probably replace them both at the same time and have them both rebuilding; but I'm being a little conservative on how many changes you make until you get yourself back to some level of redundancy.
>
>
> Chris Murphy--

Mathias

* Re: RAID6 reshape, 2 disk failures
  2012-10-17 19:03           ` Mathias Burén
@ 2012-10-17 19:35             ` Chris Murphy
  2012-10-18 11:56             ` Stan Hoeppner
  1 sibling, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2012-10-17 19:35 UTC (permalink / raw)
  To: linux-raid RAID


On Oct 17, 2012, at 1:03 PM, Mathias Burén wrote:
> 
> Hm where do you get sda from? sda is the OS disk, an old SSD. (it
> currently holds the reshape backup file)

Oh. Misread. So it's sdh that's dead, and sde that's dying/pooping bad sector bullets.


Chris Murphy

* Re: RAID6 reshape, 2 disk failures
  2012-10-17 19:03           ` Mathias Burén
  2012-10-17 19:35             ` Chris Murphy
@ 2012-10-18 11:56             ` Stan Hoeppner
  2012-10-18 12:17               ` Mathias Burén
  1 sibling, 1 reply; 18+ messages in thread
From: Stan Hoeppner @ 2012-10-18 11:56 UTC (permalink / raw)
  To: linux-raid RAID

On 10/17/2012 2:03 PM, Mathias Burén wrote:

> There are no CRC errors so I doubt the cable is at fault. In any case,
> I've RMA'd drives for less, and an RMA is underway for this drive.
> Just need to wait for the reshape to finish so I can get in the
> server. Btw, with a few holes drilled this bad boy holds 7 3.5" HDDs
> no problem: http://www.antec.com/productPSU.php?id=30&pid=3

It would seem you didn't mod the airflow of the case along with the
increased drive count.  The NSK1380 has really poor airflow to begin
with: a single PSU mounted 120mm super low RPM fan.  Antec is currently
shipping the NSK1380 with an additional PCI slot centrifugal fan to help
overcome the limitations of the native design.

You bought crap drives, WD20EARS, then improperly modded a case to house
more than twice the design limit of HDDs.

I'd say you stacked the deck against yourself here Mathias.

-- 
Stan


* Re: RAID6 reshape, 2 disk failures
  2012-10-18 11:56             ` Stan Hoeppner
@ 2012-10-18 12:17               ` Mathias Burén
  2012-10-18 17:11                 ` Mathias Burén
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Burén @ 2012-10-18 12:17 UTC (permalink / raw)
  To: stan; +Cc: linux-raid RAID

On 18 October 2012 12:56, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 10/17/2012 2:03 PM, Mathias Burén wrote:
>
>> There are no CRC errors so I doubt the cable is at fault. In any case,
>> I've RMA'd drives for less, and an RMA is underway for this drive.
>> Just need to wait for the reshape to finish so I can get in the
>> server. Btw, with a few holes drilled this bad boy holds 7 3.5" HDDs
>> no problem: http://www.antec.com/productPSU.php?id=30&pid=3
>
> It would seem you didn't mod the airflow of the case along with the
> increased drive count.  The NSK1380 has really poor airflow to begin
> with: a single PSU mounted 120mm super low RPM fan.  Antec is currently
> shipping the NSK1380 with an additional PCI slot centrifugal fan to help
> overcome the limitations of the native design.
>
> You bought crap drives, WD20EARS, then improperly modded a case to house
> more than twice the design limit of HDDs.
>
> I'd say you stacked the deck against yourself here Mathias.
>
> --
> Stan
>

Now now, the setup is working like a charm. Disk failures happen all
the time. There's an additional 120mm fan at the bottom, blowing up
towards the 7 HDDs. I bought "crap" drives because they were cheap.

In the 2 years a total of 3 drives have failed, but the array has
never failed. I'm very pleased with it (HTPC, with an ION board and 4x
SATA PCI-E controller for E10)

Mathias

* Re: RAID6 reshape, 2 disk failures
  2012-10-18 12:17               ` Mathias Burén
@ 2012-10-18 17:11                 ` Mathias Burén
  2012-10-18 19:54                   ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Burén @ 2012-10-18 17:11 UTC (permalink / raw)
  To: stan; +Cc: linux-raid RAID

On 18 October 2012 13:17, Mathias Burén <mathias.buren@gmail.com> wrote:
> On 18 October 2012 12:56, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 10/17/2012 2:03 PM, Mathias Burén wrote:
>>
>>> There are no CRC errors so I doubt the cable is at fault. In any case,
>>> I've RMA'd drives for less, and an RMA is underway for this drive.
>>> Just need to wait for the reshape to finish so I can get in the
>>> server. Btw, with a few holes drilled this bad boy holds 7 3.5" HDDs
>>> no problem: http://www.antec.com/productPSU.php?id=30&pid=3
>>
>> It would seem you didn't mod the airflow of the case along with the
>> increased drive count.  The NSK1380 has really poor airflow to begin
>> with: a single PSU mounted 120mm super low RPM fan.  Antec is currently
>> shipping the NSK1380 with an additional PCI slot centrifugal fan to help
>> overcome the limitations of the native design.
>>
>> You bought crap drives, WD20EARS, then improperly modded a case to house
>> more than twice the design limit of HDDs.
>>
>> I'd say you stacked the deck against yourself here Mathias.
>>
>> --
>> Stan
>>
>
> Now now, the setup is working like a charm. Disk failures happen all
> the time. There's an additional 120mm fan at the bottom, blowing up
> towards the 7 HDDs. I bought "crap" drives because they were cheap.
>
> In the 2 years a total of 3 drives have failed, but the array has
> never failed. I'm very pleased with it (HTPC, with an ION board and 4x
> SATA PCI-E controller for E10)
>
> Mathias

Just to follow up, the reshape succeeded and I'll now shutdown and RMA
/dev/sde. Thanks all for the answers.

[748891.476091] md: md0: reshape done.
[748891.505225] RAID conf printout:
[748891.505235]  --- level:6 rd:7 wd:5
[748891.505241]  disk 0, o:0, dev:sde1
[748891.505246]  disk 1, o:1, dev:sdf1
[748891.505251]  disk 2, o:1, dev:sdb1
[748891.505257]  disk 3, o:1, dev:sdd1
[748891.505263]  disk 4, o:1, dev:sdc1
[748891.505268]  disk 5, o:1, dev:sdg1
[748891.535219] RAID conf printout:
[748891.535229]  --- level:6 rd:7 wd:5
[748891.535236]  disk 0, o:0, dev:sde1
[748891.535242]  disk 1, o:1, dev:sdf1
[748891.535246]  disk 2, o:1, dev:sdb1
[748891.535251]  disk 3, o:1, dev:sdd1
[748891.535256]  disk 4, o:1, dev:sdc1
[748891.535261]  disk 5, o:1, dev:sdg1
[748891.548477] RAID conf printout:
[748891.548483]  --- level:6 rd:7 wd:5
[748891.548487]  disk 1, o:1, dev:sdf1
[748891.548491]  disk 2, o:1, dev:sdb1
[748891.548494]  disk 3, o:1, dev:sdd1
[748891.548498]  disk 4, o:1, dev:sdc1
[748891.548501]  disk 5, o:1, dev:sdg1
ion ~ $ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sde1[0](F) sdg1[8] sdc1[5] sdd1[3] sdb1[4] sdf1[9]
      9751756800 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/5] [_UUUUU_]

unused devices: <none>
ion ~ $ sudo mdadm -D /dev/md0
[sudo] password for:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Oct 19 08:58:41 2010
     Raid Level : raid6
     Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
  Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
   Raid Devices : 7
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Thu Oct 18 11:19:35 2012
          State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : ion:0  (local to host ion)
           UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
         Events : 8678539

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       9       8       81        1      active sync   /dev/sdf1
       4       8       17        2      active sync   /dev/sdb1
       3       8       49        3      active sync   /dev/sdd1
       5       8       33        4      active sync   /dev/sdc1
       8       8       97        5      active sync   /dev/sdg1
       6       0        0        6      removed

       0       8       65        -      faulty spare   /dev/sde1
ion ~ $ sudo mdadm --manage /dev/md0 --remove /dev/sde1
mdadm: hot removed /dev/sde1 from /dev/md0
ion ~ $ sudo mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Oct 19 08:58:41 2010
     Raid Level : raid6
     Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
  Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
   Raid Devices : 7
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Thu Oct 18 18:09:54 2012
          State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : ion:0  (local to host ion)
           UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
         Events : 8678542

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       9       8       81        1      active sync   /dev/sdf1
       4       8       17        2      active sync   /dev/sdb1
       3       8       49        3      active sync   /dev/sdd1
       5       8       33        4      active sync   /dev/sdc1
       8       8       97        5      active sync   /dev/sdg1
       6       0        0        6      removed
ion ~ $

* Re: RAID6 reshape, 2 disk failures
  2012-10-18 17:11                 ` Mathias Burén
@ 2012-10-18 19:54                   ` Chris Murphy
  2012-10-18 20:17                     ` Mathias Burén
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2012-10-18 19:54 UTC (permalink / raw)
  To: linux-raid RAID


On Oct 18, 2012, at 11:11 AM, Mathias Burén wrote:

> Just to follow up, the reshape succeeded and I'll now shutdown and RMA
> /dev/sde. Thanks all for the answers.

Yeah but two days later and you still are critically degraded without either failed disk replaced and rebuilding. You're one tiny problem away from that whole array collapsing and you're worried about this one fussy disk? I don't understand your delay in immediately getting a replacement drive in this array unless you really don't care about the data at all, in which case why have a RAID6?

Sure, what are the odds of a 3rd drive dying… *shrug* Seems like an unwise risk tempting fate like this.


Chris Murphy


* Re: RAID6 reshape, 2 disk failures
  2012-10-18 19:54                   ` Chris Murphy
@ 2012-10-18 20:17                     ` Mathias Burén
  2012-10-18 20:58                       ` Stan Hoeppner
  2012-10-18 21:28                       ` RAID6 reshape, 2 disk failures Chris Murphy
  0 siblings, 2 replies; 18+ messages in thread
From: Mathias Burén @ 2012-10-18 20:17 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-raid RAID

On 18 October 2012 20:54, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Oct 18, 2012, at 11:11 AM, Mathias Burén wrote:
>
>> Just to follow up, the reshape succeeded and I'll now shutdown and RMA
>> /dev/sde. Thanks all for the answers.
>
> Yeah but two days later and you still are critically degraded without either failed disk replaced and rebuilding. You're one tiny problem away from that whole array collapsing and you're worried about this one fussy disk? I don't understand your delay in immediately getting a replacement drive in this array unless you really don't care about the data at all, in which case why have a RAID6?
>
> Sure, what are the odds of a 3rd drive dying… *shrug* Seems like an unwise risk tempting fate like this.
>
>
> Chris Murphy
>

There's no more dying drives in the array, 2 out of 7 died, they are
on RMA soon. (when I can get to the post office).

I did see 5 pending sectors on 1 HDD after the reshape finished
though. I don't care much about the data (it's not critical), RAID6 is
just so I can have one large volume, some speed increase and a bit of
redundancy. If I had them all as single volumes I'd have to use mhddfs
or something to make it look like 1 logical volume. Or even use some
kind of LVM perhaps.

Mathias

* Re: RAID6 reshape, 2 disk failures
  2012-10-18 20:17                     ` Mathias Burén
@ 2012-10-18 20:58                       ` Stan Hoeppner
  2012-10-19 14:32                         ` Offtopic: on case (was: R: RAID6 reshape, 2 disk failures) Carabetta Giulio
  2012-10-18 21:28                       ` RAID6 reshape, 2 disk failures Chris Murphy
  1 sibling, 1 reply; 18+ messages in thread
From: Stan Hoeppner @ 2012-10-18 20:58 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Linux RAID

Lack of a List-Post header got me again...sorry for the dup Mathias.

On 10/18/2012 3:17 PM, Mathias Burén wrote:

> If I had them all as single volumes I'd have to use mhddfs
> or something to make it look like 1 logical volume. Or even use some
> kind of LVM perhaps.

Or md/RAID --linear.  Given the list you're posting to I'm surprised you
forgot this option.
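
A minimal sketch of that, with hypothetical device names:

 $ sudo mdadm --create /dev/md1 --level=linear --raid-devices=7 /dev/sd[b-h]1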

But running without redundancy of any kind will cause more trouble than
you currently have.

Last words of advice:

In the future, spend a little more per drive and get units that will
live longer.  Also, mounting a 120mm fan in the bottom of that Antec
cube chassis blowing "up" on the drives simply circulates the hot air
already inside the chassis.  It does not increase CFM of cool air intake
nor exhaust of hot air.  So T_case is pretty much the same as before you
put the 2nd 120mm fan in there.  And T_case is the temp that determines
drive life.

"silent chassis" and RAID are mutually exclusive.  You'll rarely, if
ever, properly cool multiple HDDs, of any persuasion, in a "silent"
chassis.  To make it silent the fans must turn at very low RPM, thus
yielding very low CFM, thus yielding high device temps.

-- 
Stan


* Re: RAID6 reshape, 2 disk failures
  2012-10-18 20:17                     ` Mathias Burén
  2012-10-18 20:58                       ` Stan Hoeppner
@ 2012-10-18 21:28                       ` Chris Murphy
  1 sibling, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2012-10-18 21:28 UTC (permalink / raw)
  To: linux-raid RAID


On Oct 18, 2012, at 2:17 PM, Mathias Burén wrote:

> There's no more dying drives in the array, 2 out of 7 died, they are
> on RMA soon. (when I can get to the post office).

Right but 2 of those 7 you're saying gave you no warning they were about to die, which means you have 5 of 7 which could easily do the same thing any moment now. For having such cheap drives you'd think you could have at least one on standby, hotspare or not.

You realize that WDC is within their rights, if they knew you were using raid6 with these drives, to refuse the RMA? These are not raid5/6 drives. They're not 24x7 use drives.

Chris Murphy


* Offtopic: on case (was: R: RAID6 reshape, 2 disk failures)
  2012-10-18 20:58                       ` Stan Hoeppner
@ 2012-10-19 14:32                         ` Carabetta Giulio
  2012-10-19 16:44                           ` Offtopic: on case Stan Hoeppner
  0 siblings, 1 reply; 18+ messages in thread
From: Carabetta Giulio @ 2012-10-19 14:32 UTC (permalink / raw)
  To: 'stan@hardwarefreak.com', 'Mathias Burén'
  Cc: 'Linux RAID'

Sorry for the OT, but...

> Lack of a List-Post header got me again...sorry for the dup Mathias.
>
> On 10/18/2012 3:17 PM, Mathias Burén wrote:
> 
> > If I had them all as single volumes I'd have to use mhddfs or 
> > something to make it look like 1 logical volume. Or even use some kind 
> > of LVM perhaps.
> 
> Or md/RAID --linear.  Given the list you're posting to I'm surprised you forgot this option.
> 
> But running without redundancy of any kind will cause more trouble than you currently have.
> 
> Last words of advice:
> 
> In the future, spend a little more per drive and get units that will live longer.  Also, mounting a 120mm fan in the bottom of that Antec cube chassis blowing "up" on the drives simply circulates the hot air already inside the chassis.  It does not increase CFM of cool air
> intake nor exhaust of hot air.  So T_case is pretty much the same as before you put the 2nd 120mm fan in there.  And T_case is the temp
> that determines drive life.
> 
> "silent chassis" and RAID are mutually exclusive.  You'll rarely, if ever, properly cool multiple HDDs, of any persuasion, in a "silent"
> chassis.  To make it silent the fans must turn at very low RPM, thus yielding very low CFM, thus yielding high device temps.

You are right, I know that very well... 
Also I'm looking for a compromise between temperature and noise: what do you think about this case?
http://www.lian-li.com/v2/en/product/product06.php?pr_index=480&cl_index=1&sc_index=26&ss_index=67&g=f


> 
> --
> Stan
> 

Giulio Carabetta

* Re: Offtopic: on case
  2012-10-19 14:32                         ` Offtopic: on case (was: R: RAID6 reshape, 2 disk failures) Carabetta Giulio
@ 2012-10-19 16:44                           ` Stan Hoeppner
  0 siblings, 0 replies; 18+ messages in thread
From: Stan Hoeppner @ 2012-10-19 16:44 UTC (permalink / raw)
  To: Carabetta Giulio; +Cc: 'Mathias Burén', 'Linux RAID'

On 10/19/2012 9:32 AM, Carabetta Giulio wrote:
> Sorry for the OT, but...
...
> You are right, I know that very well... 
> Also I'm looking for a compromise between temperature and noise: what do you think about this case?
> http://www.lian-li.com/v2/en/product/product06.php?pr_index=480&cl_index=1&sc_index=26&ss_index=67&g=f

It should keep six 3.5" SATA drives within normal operating temp range,
even though the hole punched cage frame is inefficient, along with the
lateral vs longitudinal orientation.  Going lateral saved them 2" on
case depth, which is critical to their aesthetics.  They could have
eliminated the side and bottom intake grilles and the top exhaust fan,
by rotating the PSU 180 degrees, reversing its fan, and adding a small
director vane on the back of the case.  This would decrease total noise
by 3-5 dB without impacting cooling capacity.

There are two reasons I've never been big on Lian Li cases:

1.  You pay a 3-5x premium for aesthetics and the name
2.  Airflow is an engineering afterthought--aesthetics comes first

Point 2 is interesting regarding this case.  I'm surprised to see a huge
blue glowing front intake grille on a Lian Li.  They've heretofore
always been about the Apple clean lines look, brushed aluminum with as
few interruptions as possible, which is why they have typically located
media bays and device connectors on the sides, not the front.

-- 
Stan



* Re: RAID6 reshape, 2 disk failures
  2012-10-17  8:03   ` Mathias Burén
  2012-10-17  9:09     ` Chris Murphy
@ 2012-10-21 22:31     ` NeilBrown
  1 sibling, 0 replies; 18+ messages in thread
From: NeilBrown @ 2012-10-21 22:31 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Chris Murphy, linux-raid RAID


On Wed, 17 Oct 2012 09:03:11 +0100 Mathias Burén <mathias.buren@gmail.com>
wrote:


> Any idea why it says rebuilding, when it's not? Is it going to attempt
> a rebuild after the reshape?

Minor bug in mdadm.
The device is clearly faulty:

    Number   Major   Minor   RaidDevice State
       0       8       65        0      faulty spare rebuilding   /dev/sde1

so mdadm should never suggest that it is also spare and rebuilding. 
I'll fix that.

What is a little odd is that 'RaidDevice' is still '0'.  Normally when a
device fails the RaidDevice gets set to '-1'.
This doesn't happen immediately though - md waits until all pending requests
have completed and then disassociates from the device and sets raid_disk to
-1.
So you seem to have caught it before the device was fully quiesced .... or
some bug has slipped in and devices that get errors don't fully quiesce any
more....

NeilBrown


Thread overview: 18+ messages
2012-10-16 22:57 RAID6 reshape, 2 disk failures Mathias Burén
2012-10-17  2:29 ` Stan Hoeppner
2012-10-17  3:06 ` Chris Murphy
2012-10-17  8:03   ` Mathias Burén
2012-10-17  9:09     ` Chris Murphy
     [not found]       ` <CADNH=7GaGCLdK2Rk_A6vPN+Th0z0QYT7mRV0KJH=CoAffuvb6w@mail.gmail.com>
2012-10-17 18:46         ` Chris Murphy
2012-10-17 19:03           ` Mathias Burén
2012-10-17 19:35             ` Chris Murphy
2012-10-18 11:56             ` Stan Hoeppner
2012-10-18 12:17               ` Mathias Burén
2012-10-18 17:11                 ` Mathias Burén
2012-10-18 19:54                   ` Chris Murphy
2012-10-18 20:17                     ` Mathias Burén
2012-10-18 20:58                       ` Stan Hoeppner
2012-10-19 14:32                         ` Offtopic: on case (was: R: RAID6 reshape, 2 disk failures) Carabetta Giulio
2012-10-19 16:44                           ` Offtopic: on case Stan Hoeppner
2012-10-18 21:28                       ` RAID6 reshape, 2 disk failures Chris Murphy
2012-10-21 22:31     ` NeilBrown
