* Raid6 recovery
From: Glenn Greibesland @ 2020-03-19 19:55 UTC
  To: linux-raid

Hi. I need some help with recovering from multiple disk failure on a
RAID6 array.
I had two failed disks and therefore shut down the server and
connected new disks.
After I powered on the server, another disk got booted out of the
array leaving it with only 15 out of 18 working devices, so it won’t
start.
I ran an offline test with smartctl and the disk that got thrown out
of the array seems totally fine.

Here is where I think I made a mistake: I used the --re-add command on
the disk. Now it is regarded as a spare and the array still won't start.

I’ve been reading on
https://raid.wiki.kernel.org/index.php/RAID_Recovery and I have tried
`–assemble –scan –force –verbose` and manual `–assemble –force` with
specifying each drive. Neither of them works (reporting that 15 out of
18 devices is not enough).
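
For reference, the manual attempt looked roughly like this - device
list abbreviated here, the full list is the same as in the --create
dry run below:

mdadm --assemble --force --verbose /dev/md0 /dev/sdj1 /dev/sdk1 ... /dev/sdc1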

All drives have the same event count and Used Dev Size, but two of the
devices have a lower Avail Dev Size and a different Data Offset.
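
I pulled those values per member with something along these lines
(device name just an example):

mdadm --examine /dev/sdc2 | egrep 'Avail Dev Size|Used Dev Size|Data Offset'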

After a bit of digging in the manual and on different forums I have
concluded that the next step for me is to recreate the array using
--assume-clean and --data-offset=variable.
I have tried a dry run of the command (answering no to “Continue
creating array”), and mdadm accepts the parameters without any errors:


mdadm --create --assume-clean --level=6 --raid-devices=18
--size=3906763776s --chunk=512K --data-offset=variable /dev/md0
/dev/sdj1:262144s /dev/sdk1:262144s /dev/sdi1:262144s
/dev/sdh1:262144s /dev/sdo1:262144s /dev/sdp1:262144s
/dev/sdr1:262144s /dev/sdq1:262144s /dev/sdf1:262144s
/dev/sdb1:262144s /dev/sdg1:262144s /dev/sdd1:262144s
/dev/sdm1:262144s /dev/sdf2:241664s missing missing /dev/sdc2:241664s
/dev/sdc1:262144s
mdadm: /dev/sdj1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdk1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdi1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdh1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdo1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdp1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdr1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdq1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdf1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdb1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: partition table exists on /dev/sdb1 but will be lost or
       meaningless after creating array
mdadm: /dev/sdg1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdm1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdf2 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdc2 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
       level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
Continue creating array? N

My only worries now are the size and data-offset parameters. According
to the man page, the size should be specified in kilobytes; it was
kibibytes previously.
The Used Device Size of all array members is 3906763776 sectors
(1862.89 GiB 2000.26 GB).

Should I convert the sectors into kilobytes, or does mdadm support
using sectors as the unit for --size and --data-offset? It is not
mentioned in the manual, but I've seen it used in different forum
threads and mdadm does not blow up if I try using it.

Any other suggestions?


* Re: Raid6 recovery
From: Wols Lists @ 2020-03-20 19:15 UTC
  To: Glenn Greibesland, linux-raid; +Cc: Phil Turmel, NeilBrown

On 19/03/20 19:55, Glenn Greibesland wrote:
> After a bit of digging in the manual and on different forums I have
> concluded that the next step for me is to recreate the array using
> --assume-clean and --data-offset=variable.
> I have tried a dry run of the command (answering no to “Continue
> creating array”), and mdadm accepts the parameters without any errors:

Oh my god NO!!!

Do NOT use --create unless someone rather more experienced than me tells
you to!!!

The obvious thing is to somehow get the sixteen drives that you know
should be okay re-assembled in a forced manner. The --re-add should not
have done any real damage because, as mdadm keeps complaining, you
didn't have enough drives, so it won't have touched the data on that
drive. Unfortunately, my fu isn't good enough to tell you how to get
that drive back in.

What's wrong with the two failed drives? Can you ddrescue them? They
might be enough to get you going again.
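
Something along these lines per failed drive - source, target and map
file names purely illustrative:

ddrescue -f -n /dev/failing-disk /dev/replacement-disk rescue.map

followed by a second pass without -n to retry the areas it skipped.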

You say you've read the web page "Raid recovery" - which says it's
obsolete and points you at "When things go wrogn" - but you don't appear
to have read that! PLEASE read "asking for help" and in particular you
NEED to run lsdrv and give us that information. Without that, if you DO
run --create, you will be in for a world of hurt.
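
For reference, lsdrv is Phil Turmel's script; if memory serves, you
can fetch and run it with something like:

git clone https://github.com/pturmel/lsdrv
sudo ./lsdrv/lsdrv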

I know you may feel it's asking for loads of information, and the
resulting email will be massive, but trust me - the experts will look at
it and they will probably be able to come up with a plan of action. At
present, they don't have much to go on, and nor will you if you carry on as
you're going ...

Cheers,
Wol


* Re: Raid6 recovery
From: antlists @ 2020-03-21  0:06 UTC
  To: Glenn Greibesland; +Cc: linux-raid, Phil Turmel, NeilBrown

On 20/03/2020 21:05, Glenn Greibesland wrote:
> fre. 20. mar. 2020 kl. 20:15 skrev Wols Lists <antlists@youngman.org.uk>:
>> [snip]
> 
> Thanks for replying to the thread.
> 
> The two failed drives have "unreadable (pending) sectors", and they
> have a lower Event Count than the other disks, so that is why I've
> been trying to get the array up and running with the remaining 16
> disks that have the same Event Count.
> 
> I concluded myself that --create --assume-clean had to be the only
> thing left to try; that's why I didn't provide any logs or info. Sorry
> about that, you are right, I should check if there are any other
> options first. I've been trying to get this array up and running again
> for quite some time, so I'm all ears if someone has some magic to try.
> Yesterday I read some of the source code of mdadm and sort of answered
> my own question. According to the source code, specifying sizes in
> sectors is supported. I'd still like some confirmation though (I'm
> talking about the parse_size function in util.c).
> 
> Here's some additional info:
> 
> mdadm: added /dev/sdj1 to /dev/md/0 as 0
> mdadm: added /dev/sdk1 to /dev/md/0 as 1
> mdadm: added /dev/sdi1 to /dev/md/0 as 2
> mdadm: added /dev/sdh1 to /dev/md/0 as 3
> mdadm: added /dev/sdo1 to /dev/md/0 as 4
> mdadm: added /dev/sdp1 to /dev/md/0 as 5
> mdadm: added /dev/sdr1 to /dev/md/0 as 6
> mdadm: added /dev/sdq1 to /dev/md/0 as 7
> mdadm: added /dev/sdf1 to /dev/md/0 as 8
> mdadm: added /dev/sdb1 to /dev/md/0 as 9
> mdadm: added /dev/sdg1 to /dev/md/0 as -1   <<<< This is the drive
> that is now regarded as spare. It originally had slot 10 in the array
> mdadm: added /dev/sdd1 to /dev/md/0 as 11
> mdadm: added /dev/sdm1 to /dev/md/0 as 12
> mdadm: added /dev/sdf2 to /dev/md/0 as 13
> mdadm: added /dev/sdc2 to /dev/md/0 as 16
> mdadm: added /dev/sdc1 to /dev/md/0 as 17
> 
> 
> 
> mdadm: no uptodate device for slot 10 of /dev/md/0 << sdg1
> mdadm: no uptodate device for slot 14 of /dev/md/0 << drive disconnected
> mdadm: no uptodate device for slot 15 of /dev/md/0 << drive disconnected
> 
> mdadm: /dev/md/0 assembled from 15 drives and 1 spare - not enough to
> start the array.
> 
>   mdadm -D /dev/md0
> /dev/md0:
>             Version : 1.2
>          Raid Level : raid0
>       Total Devices : 16
>         Persistence : Superblock is persistent
> 
>               State : inactive
>     Working Devices : 16
> 
>                Name : vm-test:0
>                UUID : 45ced2f9:947773d4:106077ab:2df799d6
>              Events : 1937517
> 
>      Number   Major   Minor   RaidDevice
> 
>         -       8       17        -        /dev/sdb1
>         -       8       33        -        /dev/sdc1
>         -       8       34        -        /dev/sdc2

What's this? Two partitions in the array on the same physical disk?

>         -       8       49        -        /dev/sdd1
>         -       8       81        -        /dev/sdf1
>         -       8       82        -        /dev/sdf2

And again?

>         -       8       97        -        /dev/sdg1
>         -       8      113        -        /dev/sdh1
>         -       8      129        -        /dev/sdi1
>         -       8      145        -        /dev/sdj1
>         -       8      161        -        /dev/sdk1
>         -       8      193        -        /dev/sdm1
>         -       8      241        -        /dev/sdp1
>         -      65        1        -        /dev/sdq1
>         -      65       17        -        /dev/sdr1
>         -      65       33        -        /dev/sds1
> 


> 
> SMART WRITE LOG does not return COUNT and LBA_LOW register
> SCT (Get) Error Recovery Control command failed

Which disk is this? No error recovery? BAD sign ...
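
For the record, you can query a drive's current ERC setting with
something like this (device name illustrative):

smartctl -l scterc /dev/sdX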
> 
> Device Statistics (GP/SMART Log 0x04) not supported
> 
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x0001  2            0  Command failed due to ICRC error
> 0x0002  2            0  R_ERR response for data FIS
> 0x0003  2            0  R_ERR response for device-to-host data FIS
> 0x0004  2            0  R_ERR response for host-to-device data FIS
> 0x0005  2            0  R_ERR response for non-data FIS
> 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> 0x0008  2            0  Device-to-host non-data FIS retries
> 0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
> 0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
> 0x000b  2            0  CRC errors within host-to-device FIS
> 0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
> 0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
> 0x8000  4      1208382  Vendor specific
> 
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 


> 
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 


> 
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Green

What's this?

> Device Model:     WDC WD20EARX-00PASB0
> Serial Number:    WD-WMAZA9538601
> LU WWN Device Id: 5 0014ee 15a0a4ffa
> Firmware Version: 51.0AB51
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ATA8-ACS (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
> Local Time is:    Fri Mar 20 21:00:38 2020 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is:   Unavailable
> APM feature is:   Unavailable
> Rd look-ahead is: Enabled
> Write cache is:   Enabled
> ATA Security is:  Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
> Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (37200) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: (   2) minutes.
> Extended self-test routine
> recommended polling time: ( 359) minutes.
> Conveyance self-test routine
> recommended polling time: (   5) minutes.
> SCT capabilities:        (0x3035) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.

No mention of ERC - Bad sign ...
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>    1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
>    3 Spin_Up_Time            POS--K   171   171   021    -    6416
>    4 Start_Stop_Count        -O--CK   100   100   000    -    255
>    5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
>    7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
>    9 Power_On_Hours          -O--CK   098   098   000    -    1583
>   10 Spin_Retry_Count        -O--CK   100   100   000    -    0
>   11 Calibration_Retry_Count -O--CK   100   100   000    -    0
>   12 Power_Cycle_Count       -O--CK   100   100   000    -    131
> 192 Power-Off_Retract_Count -O--CK   200   200   000    -    61
> 193 Load_Cycle_Count        -O--CK   191   191   000    -    29372
> 194 Temperature_Celsius     -O---K   122   101   000    -    28
> 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> 197 Current_Pending_Sector  -O--CK   200   200   000    -    0
> 198 Offline_Uncorrectable   ----CK   200   200   000    -    0
> 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
>                              ||||||_ K auto-keep
>                              |||||__ C event count
>                              ||||___ R error rate
>                              |||____ S speed/performance
>                              ||_____ O updated online
>                              |______ P prefailure warning
> 
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> Address    Access  R/W   Size  Description
> 0x00       GPL,SL  R/O      1  Log Directory
> 0x01           SL  R/O      1  Summary SMART error log
> 0x02           SL  R/O      5  Comprehensive SMART error log
> 0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
> 0x06           SL  R/O      1  SMART self-test log
> 0x07       GPL     R/O      1  Extended self-test log
> 0x09           SL  R/W      1  Selective self-test log
> 0x10       GPL     R/O      1  SATA NCQ Queued Error log
> 0x11       GPL     R/O      1  SATA Phy Event Counters log
> 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> 0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
> 0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
> 0xbd       GPL,SL  VS       1  Device vendor specific log
> 0xc0       GPL,SL  VS       1  Device vendor specific log
> 0xc1       GPL     VS      93  Device vendor specific log
> 0xe0       GPL,SL  R/W      1  SCT Command/Status
> 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> 
> SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
> No Errors Logged
> 
> SMART Extended Self-test Log Version: 1 (1 sectors)
> Num  Test_Description    Status                  Remaining
> LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%      1245         -
> 
> SMART Selective self-test log data structure revision number 1
>   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>      1        0        0  Not_testing
>      2        0        0  Not_testing
>      3        0        0  Not_testing
>      4        0        0  Not_testing
>      5        0        0  Not_testing
> Selective self-test flags (0x0):
>    After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 
> SCT Status Version:                  3
> SCT Version (vendor specific):       258 (0x0102)
> SCT Support Level:                   1
> Device State:                        Active (0)
> Current Temperature:                    28 Celsius
> Power Cycle Min/Max Temperature:      8/43 Celsius
> Lifetime    Min/Max Temperature:      0/49 Celsius
> Under/Over Temperature Limit Count:   0/0
> 
> SCT Temperature History Version:     2
> Temperature Sampling Period:         1 minute
> Temperature Logging Interval:        1 minute
> Min/Max recommended Temperature:      0/60 Celsius
> Min/Max Temperature Limit:           -41/85 Celsius
> Temperature History Size (Index):    478 (305)
> 
> Index    Estimated Time   Temperature Celsius
>   306    2020-03-20 13:03    23  ****
>   ...    ..( 33 skipped).    ..  ****
>   340    2020-03-20 13:37    23  ****
>   341    2020-03-20 13:38     ?  -
>   342    2020-03-20 13:39    23  ****
>   343    2020-03-20 13:40    23  ****
>   344    2020-03-20 13:41    24  *****
>   345    2020-03-20 13:42    25  ******
>   346    2020-03-20 13:43    25  ******
>   347    2020-03-20 13:44    25  ******
>   348    2020-03-20 13:45    26  *******
>   ...    ..(  2 skipped).    ..  *******
>   351    2020-03-20 13:48    26  *******
>   352    2020-03-20 13:49    27  ********
>   353    2020-03-20 13:50    27  ********
>   354    2020-03-20 13:51    28  *********
>   355    2020-03-20 13:52    28  *********
>   356    2020-03-20 13:53    22  ***
>   ...    ..(276 skipped).    ..  ***
>   155    2020-03-20 18:30    22  ***
>   156    2020-03-20 18:31    23  ****
>   ...    ..(148 skipped).    ..  ****
>   305    2020-03-20 21:00    23  ****
> 
> SCT Error Recovery Control command not supported

Yup. Ouch!
> 
> Device Statistics (GP/SMART Log 0x04) not supported
> 
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x0001  2            0  Command failed due to ICRC error
> 0x0002  2            0  R_ERR response for data FIS
> 0x0003  2            0  R_ERR response for device-to-host data FIS
> 0x0004  2            0  R_ERR response for host-to-device data FIS
> 0x0005  2            0  R_ERR response for non-data FIS
> 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> 0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
> 0x000b  2            0  CRC errors within host-to-device FIS
> 0x8000  4      1208379  Vendor specific
> 
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Red
> Device Model:     WDC WD20EFRX-68AX9N0
> Serial Number:    WD-WMC300320657
> LU WWN Device Id: 5 0014ee 0ae1ee098
> Firmware Version: 80.00A80
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2 (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is:    Fri Mar 20 21:00:38 2020 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is:   Unavailable
> APM feature is:   Unavailable
> Rd look-ahead is: Enabled
> Write cache is:   Enabled
> ATA Security is:  Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Unknown
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x00) Offline data collection activity
> was never started.
> Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (27120) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: (   2) minutes.
> Extended self-test routine
> recommended polling time: ( 274) minutes.
> Conveyance self-test routine
> recommended polling time: (   5) minutes.
> SCT capabilities:        (0x70bd) SCT Status supported.
> SCT Error Recovery Control supported.
> SCT Feature Control supported.
> SCT Data Table supported.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>    1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
>    3 Spin_Up_Time            POS--K   176   169   021    -    4183
>    4 Start_Stop_Count        -O--CK   100   100   000    -    502
>    5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
>    7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
>    9 Power_On_Hours          -O--CK   061   061   000    -    28588
>   10 Spin_Retry_Count        -O--CK   100   100   000    -    0
>   11 Calibration_Retry_Count -O--CK   100   100   000    -    0
>   12 Power_Cycle_Count       -O--CK   100   100   000    -    490
> 192 Power-Off_Retract_Count -O--CK   200   200   000    -    483
> 193 Load_Cycle_Count        -O--CK   200   200   000    -    18
> 194 Temperature_Celsius     -O---K   120   089   000    -    27
> 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> 197 Current_Pending_Sector  -O--CK   200   200   000    -    0
> 198 Offline_Uncorrectable   ----CK   100   253   000    -    0
> 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> 200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
>                              ||||||_ K auto-keep
>                              |||||__ C event count
>                              ||||___ R error rate
>                              |||____ S speed/performance
>                              ||_____ O updated online
>                              |______ P prefailure warning
> 
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> Address    Access  R/W   Size  Description
> 0x00       GPL,SL  R/O      1  Log Directory
> 0x01           SL  R/O      1  Summary SMART error log
> 0x02           SL  R/O      5  Comprehensive SMART error log
> 0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
> 0x06           SL  R/O      1  SMART self-test log
> 0x07       GPL     R/O      1  Extended self-test log
> 0x09           SL  R/W      1  Selective self-test log
> 0x10       GPL     R/O      1  SATA NCQ Queued Error log
> 0x11       GPL     R/O      1  SATA Phy Event Counters log
> 0x21       GPL     R/O      1  Write stream error log
> 0x22       GPL     R/O      1  Read stream error log
> 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> 0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
> 0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
> 0xbd       GPL,SL  VS       1  Device vendor specific log
> 0xc0       GPL,SL  VS       1  Device vendor specific log
> 0xc1       GPL     VS      93  Device vendor specific log
> 0xe0       GPL,SL  R/W      1  SCT Command/Status
> 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> 
> SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
> No Errors Logged
> 
> SMART Extended Self-test Log Version: 1 (1 sectors)
> Num  Test_Description    Status                  Remaining
> LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%     26024         -
> 
> SMART Selective self-test log data structure revision number 1
>   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>      1        0        0  Not_testing
>      2        0        0  Not_testing
>      3        0        0  Not_testing
>      4        0        0  Not_testing
>      5        0        0  Not_testing
> Selective self-test flags (0x0):
>    After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 
> SCT Status Version:                  3
> SCT Version (vendor specific):       258 (0x0102)
> SCT Support Level:                   1
> Device State:                        Active (0)
> Current Temperature:                    27 Celsius
> Power Cycle Min/Max Temperature:     10/32 Celsius
> Lifetime    Min/Max Temperature:      2/58 Celsius
> Under/Over Temperature Limit Count:   0/0
> Vendor specific:
> 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> SCT Temperature History Version:     2
> Temperature Sampling Period:         1 minute
> Temperature Logging Interval:        1 minute
> Min/Max recommended Temperature:      0/60 Celsius
> Min/Max Temperature Limit:           -41/85 Celsius
> Temperature History Size (Index):    478 (56)
> 
> Index    Estimated Time   Temperature Celsius
>    57    2020-03-20 13:03    24  *****
>   ...    ..(377 skipped).    ..  *****
>   435    2020-03-20 19:21    24  *****
>   436    2020-03-20 19:22     ?  -
>   437    2020-03-20 19:23    24  *****
>   438    2020-03-20 19:24    25  ******
>   ...    ..(  3 skipped).    ..  ******
>   442    2020-03-20 19:28    25  ******
>   443    2020-03-20 19:29    26  *******
>   444    2020-03-20 19:30    26  *******
>   445    2020-03-20 19:31    26  *******
>   446    2020-03-20 19:32    27  ********
>   ...    ..(  3 skipped).    ..  ********
>   450    2020-03-20 19:36    27  ********
>   451    2020-03-20 19:37    24  *****
>   ...    ..( 82 skipped).    ..  *****
>    56    2020-03-20 21:00    24  *****
> 
> SCT Error Recovery Control:
>             Read: Disabled
>            Write: Disabled

What's going on here? We have a RED drive, but ERC isn't working ...
> 
> Device Statistics (GP/SMART Log 0x04) not supported
> 
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x0001  2            0  Command failed due to ICRC error
> 0x0002  2            0  R_ERR response for data FIS
> 0x0003  2            0  R_ERR response for device-to-host data FIS
> 0x0004  2            0  R_ERR response for host-to-device data FIS
> 0x0005  2            0  R_ERR response for non-data FIS
> 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> 0x0008  2            0  Device-to-host non-data FIS retries
> 0x0009  2           33  Transition from drive PhyRdy to drive PhyNRdy
> 0x000a  2           34  Device-to-host register FISes sent due to a COMRESET
> 0x000b  2            0  CRC errors within host-to-device FIS
> 0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
> 0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
> 0x8000  4      1208361  Vendor specific
> 
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Device Model:     ST4000VN008-2DR166
> Serial Number:    ZDH82183
> LU WWN Device Id: 5 000c50 0c37c42c0
> Firmware Version: SC60
> User Capacity:    4,000,787,030,016 bytes [4.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    5980 rpm
> Form Factor:      3.5 inches
> Device is:        Not in smartctl database [for details use: -P showall]
> ATA Version is:   ACS-3 T13/2161-D revision 5
> SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is:    Fri Mar 20 21:00:38 2020 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is:   Unavailable
> APM level is:     254 (maximum performance)
> Rd look-ahead is: Enabled
> Write cache is:   Enabled
> ATA Security is:  Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Unknown
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (  581) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: (   1) minutes.
> Extended self-test routine
> recommended polling time: ( 621) minutes.
> Conveyance self-test routine
> recommended polling time: (   2) minutes.
> SCT capabilities:        (0x50bd) SCT Status supported.
> SCT Error Recovery Control supported.
> SCT Feature Control supported.
> SCT Data Table supported.
> 
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>    1 Raw_Read_Error_Rate     POSR--   070   065   044    -    10856451
>    3 Spin_Up_Time            PO----   094   094   000    -    0
>    4 Start_Stop_Count        -O--CK   100   100   020    -    53
>    5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
>    7 Seek_Error_Rate         POSR--   075   061   045    -    29667756
>    9 Power_On_Hours          -O--CK   100   100   000    -    506 (130 79 0)
>   10 Spin_Retry_Count        PO--C-   100   100   097    -    0
>   12 Power_Cycle_Count       -O--CK   100   100   020    -    5
> 184 End-to-End_Error        -O--CK   100   100   099    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 188 Command_Timeout         -O--CK   098   098   000    -    65538
> 189 High_Fly_Writes         -O-RCK   100   100   000    -    0
> 190 Airflow_Temperature_Cel -O---K   076   070   040    -    24 (Min/Max 9/26)
> 191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    44
> 193 Load_Cycle_Count        -O--CK   100   100   000    -    284
> 194 Temperature_Celsius     -O---K   024   040   000    -    24 (0 9 0 0 0)
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
> 240 Head_Flying_Hours       ------   100   253   000    -    139 (51 45 0)
> 241 Total_LBAs_Written      ------   100   253   000    -    8177237744
> 242 Total_LBAs_Read         ------   100   253   000    -    5818370819
>                              ||||||_ K auto-keep
>                              |||||__ C event count
>                              ||||___ R error rate
>                              |||____ S speed/performance
>                              ||_____ O updated online
>                              |______ P prefailure warning
> 
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> Address    Access  R/W   Size  Description
> 0x00       GPL,SL  R/O      1  Log Directory
> 0x01           SL  R/O      1  Summary SMART error log
> 0x02           SL  R/O      5  Comprehensive SMART error log
> 0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
> 0x04       GPL,SL  R/O      8  Device Statistics log
> 0x06           SL  R/O      1  SMART self-test log
> 0x07       GPL     R/O      1  Extended self-test log
> 0x09           SL  R/W      1  Selective self-test log
> 0x10       GPL     R/O      1  SATA NCQ Queued Error log
> 0x11       GPL     R/O      1  SATA Phy Event Counters log
> 0x13       GPL     R/O      1  SATA NCQ Send and Receive log
> 0x15       GPL     R/W      1  SATA Rebuild Assist log
> 0x21       GPL     R/O      1  Write stream error log
> 0x22       GPL     R/O      1  Read stream error log
> 0x24       GPL     R/O    512  Current Device Internal Status Data log
> 0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
> 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> 0xa1       GPL,SL  VS      24  Device vendor specific log
> 0xa2       GPL     VS    8160  Device vendor specific log
> 0xa6       GPL     VS     192  Device vendor specific log
> 0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
> 0xab       GPL     VS       1  Device vendor specific log
> 0xb0       GPL     VS    9048  Device vendor specific log
> 0xbe-0xbf  GPL     VS   65535  Device vendor specific log
> 0xc1       GPL,SL  VS      16  Device vendor specific log
> 0xd1       GPL     VS     136  Device vendor specific log
> 0xd2       GPL     VS   10000  Device vendor specific log
> 0xd3       GPL     VS    1920  Device vendor specific log
> 0xe0       GPL,SL  R/W      1  SCT Command/Status
> 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> 
> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> No Errors Logged
> 
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> 
> SMART Selective self-test log data structure revision number 1
>   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>      1        0        0  Not_testing
>      2        0        0  Not_testing
>      3        0        0  Not_testing
>      4        0        0  Not_testing
>      5        0        0  Not_testing
> Selective self-test flags (0x0):
>    After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 
> SCT Status Version:                  3
> SCT Version (vendor specific):       522 (0x020a)
> SCT Support Level:                   1
> Device State:                        Active (0)
> Current Temperature:                    23 Celsius
> Power Cycle Min/Max Temperature:      8/26 Celsius
> Lifetime    Min/Max Temperature:      8/30 Celsius
> Under/Over Temperature Limit Count:   0/336
> 
> SCT Temperature History Version:     2
> Temperature Sampling Period:         3 minutes
> Temperature Logging Interval:        59 minutes
> Min/Max recommended Temperature:      0/ 0 Celsius
> Min/Max Temperature Limit:            0/ 0 Celsius
> Temperature History Size (Index):    128 (119)
> 
> Index    Estimated Time   Temperature Celsius
>   120    2020-03-15 16:02    21  **
>   ...    ..(  5 skipped).    ..  **
>   126    2020-03-15 21:56    21  **
>   127    2020-03-15 22:55    22  ***
>   ...    ..( 16 skipped).    ..  ***
>    16    2020-03-16 15:38    22  ***
>    17    2020-03-16 16:37    23  ****
>   ...    ..(  3 skipped).    ..  ****
>    21    2020-03-16 20:33    23  ****
>    22    2020-03-16 21:32    24  *****
>    23    2020-03-16 22:31    23  ****
>    24    2020-03-16 23:30    24  *****
>    25    2020-03-17 00:29    24  *****
>    26    2020-03-17 01:28    24  *****
>    27    2020-03-17 02:27    23  ****
>   ...    ..(  7 skipped).    ..  ****
>    35    2020-03-17 10:19    23  ****
>    36    2020-03-17 11:18    22  ***
>   ...    ..(  3 skipped).    ..  ***
>    40    2020-03-17 15:14    22  ***
>    41    2020-03-17 16:13    23  ****
>   ...    ..( 14 skipped).    ..  ****
>    56    2020-03-18 06:58    23  ****
>    57    2020-03-18 07:57    22  ***
>   ...    ..(  2 skipped).    ..  ***
>    60    2020-03-18 10:54    22  ***
>    61    2020-03-18 11:53    21  **
>    62    2020-03-18 12:52    20  *
>    63    2020-03-18 13:51    21  **
>    64    2020-03-18 14:50    20  *
>    65    2020-03-18 15:49    20  *
>    66    2020-03-18 16:48    21  **
>   ...    ..(  5 skipped).    ..  **
>    72    2020-03-18 22:42    21  **
>    73    2020-03-18 23:41    24  *****
>    74    2020-03-19 00:40    26  *******
>   ...    ..(  2 skipped).    ..  *******
>    77    2020-03-19 03:37    26  *******
>    78    2020-03-19 04:36    22  ***
>   ...    ..(  2 skipped).    ..  ***
>    81    2020-03-19 07:33    22  ***
>    82    2020-03-19 08:32    21  **
>    83    2020-03-19 09:31    22  ***
>    84    2020-03-19 10:30    22  ***
>    85    2020-03-19 11:29    21  **
>   ...    ..(  2 skipped).    ..  **
>    88    2020-03-19 14:26    21  **
>    89    2020-03-19 15:25    25  ******
>    90    2020-03-19 16:24    25  ******
>    91    2020-03-19 17:23    26  *******
>    92    2020-03-19 18:22    25  ******
>    93    2020-03-19 19:21    22  ***
>   ...    ..(  3 skipped).    ..  ***
>    97    2020-03-19 23:17    22  ***
>    98    2020-03-20 00:16    21  **
>   ...    ..(  4 skipped).    ..  **
>   103    2020-03-20 05:11    21  **
>   104    2020-03-20 06:10    20  *
>   ...    ..( 11 skipped).    ..  *
>   116    2020-03-20 17:58    20  *
>   117    2020-03-20 18:57    21  **
>   118    2020-03-20 19:56    21  **
>   119    2020-03-20 20:55    21  **
> 
> SCT Error Recovery Control:
>             Read: Disabled
>            Write: Disabled

OUCH! AGAIN!
> 
> Device Statistics (GP Log 0x04)
> Page  Offset Size        Value Flags Description
> 0x01  =====  =               =  ===  == General Statistics (rev 1) ==
> 0x01  0x008  4               5  ---  Lifetime Power-On Resets
> 0x01  0x010  4             506  ---  Power-on Hours
> 0x01  0x018  6      8177237744  ---  Logical Sectors Written
> 0x01  0x020  6        32254131  ---  Number of Write Commands
> 0x01  0x028  6      5818370805  ---  Logical Sectors Read
> 0x01  0x030  6        24397122  ---  Number of Read Commands
> 0x01  0x038  6               -  ---  Date and Time TimeStamp
> 0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
> 0x03  0x008  4             159  ---  Spindle Motor Power-on Hours
> 0x03  0x010  4              10  ---  Head Flying Hours
> 0x03  0x018  4             284  ---  Head Load Events
> 0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
> 0x03  0x028  4               0  ---  Read Recovery Attempts
> 0x03  0x030  4               0  ---  Number of Mechanical Start Failures
> 0x03  0x038  4               0  ---  Number of Realloc. Candidate
> Logical Sectors
> 0x03  0x040  4              45  ---  Number of High Priority Unload Events
> 0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
> 0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
> 0x04  0x010  4               2  ---  Resets Between Cmd Acceptance and
> Completion
> 0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
> 0x05  0x008  1              23  ---  Current Temperature
> 0x05  0x010  1              20  ---  Average Short Term Temperature
> 0x05  0x018  1               -  ---  Average Long Term Temperature
> 0x05  0x020  1              30  ---  Highest Temperature
> 0x05  0x028  1               0  ---  Lowest Temperature
> 0x05  0x030  1              27  ---  Highest Average Short Term Temperature
> 0x05  0x038  1              14  ---  Lowest Average Short Term Temperature
> 0x05  0x040  1               -  ---  Highest Average Long Term Temperature
> 0x05  0x048  1               -  ---  Lowest Average Long Term Temperature
> 0x05  0x050  4               0  ---  Time in Over-Temperature
> 0x05  0x058  1              70  ---  Specified Maximum Operating Temperature
> 0x05  0x060  4               0  ---  Time in Under-Temperature
> 0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
> 0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
> 0x06  0x008  4             101  ---  Number of Hardware Resets
> 0x06  0x010  4              17  ---  Number of ASR Events
> 0x06  0x018  4               0  ---  Number of Interface CRC Errors
>                                  |||_ C monitored condition met
>                                  ||__ D supports DSN
>                                  |___ N normalized value
> 
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x000a  2           34  Device-to-host register FISes sent due to a COMRESET
> 0x0001  2            0  Command failed due to ICRC error
> 0x0003  2            0  R_ERR response for device-to-host data FIS
> 0x0004  2            0  R_ERR response for host-to-device data FIS
> 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> 
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 
Oh My God.

This array is just asking for disaster. Whoops, you've just had one, sorry.

I'm looking for details of your two failed drives, but I don't seem able 
to find any. But as soon as you can get the array back, you need to fix 
those problems ASAP!!!

Firstly, get rid of that Green!!! Were the two failed drives greens? 
Read the timeout page to find out why.

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

That will hopefully also fix the problem with those Reds with ERC 
disabled. It would not surprise me in the slightest if this is what has 
done the damage to your array.
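
The usual medicine from that page, per drive, is something like this
(device name illustrative) - enable 7-second ERC where the drive
supports it, and raise the kernel's command timeout where it doesn't:

smartctl -l scterc,70,70 /dev/sdX
echo 180 > /sys/block/sdX/device/timeout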

Lastly, those ST4000s. Are they Ironwolves? I guess they're good drives, 
but they've just trashed your raid-6 redundancy - lose just one of them 
and your array is teetering on the edge. You need to get your sdx2 
partitions copied on to new drives ASAP.

What I'd do is get a couple more ST4000s, and use them, creating 4TB
partitions. Then take your existing ST4000s, and convert them to 4TB
partitions. At which point you only need five more ST4000s to move your
array on to new drives.

I'm not sure how you get there - once you've got your 9 4TB drives you
*may* be able to just fail and remove the remaining 2TB drives.
Otherwise, I'd use the freed-up 2TB drives to create 4TB raid-0s. You'd
end up having to buy a couple of spare 4TB drives to move the entire
array on to 4TB "drives", but then you could remove the raid-0 arrays.

Cheers,
Wol


* Re: Raid6 recovery
From: Glenn Greibesland @ 2020-03-21 11:54 UTC
  To: antlists; +Cc: linux-raid, Phil Turmel, NeilBrown

Yes, I am aware of the problems with WD Green and multiple partitions
on a single 4TB disk. I am in the middle of getting rid of old disks
and I have enough new drives to stop having multiple partitions on
single drives, but not enough power and free SATA ports. It is just a
temporary solution. That is also a reason why I did not include much
detail in the original post; I knew it would just distract from the
problem I want to solve right away.

What I need help with now is just getting the array started with the
16 out of 18 disks. Then I can continue migrating data and replacing
old disks as planned.
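
For what it's worth, this is roughly how I've been confirming that the
16 members still agree (glob illustrative):

for d in /dev/sd[b-s][12]; do echo "$d"; mdadm --examine "$d" | grep Events; done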

When I built the array in 2012, I used WD Greens. They turned out to
be horrible disks and I have since replaced some of them with WD Reds.
The newest disks I've bought are Ironwolves.

lør. 21. mar. 2020 kl. 01:06 skrev antlists <antlists@youngman.org.uk>:
>
> [snip]
> > No Errors Logged
> >
> > SMART Extended Self-test Log Version: 1 (1 sectors)
> > Num  Test_Description    Status                  Remaining
> > LifeTime(hours)  LBA_of_first_error
> > # 1  Short offline       Completed without error       00%      1245         -
> >
> > SMART Selective self-test log data structure revision number 1
> >   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> >      1        0        0  Not_testing
> >      2        0        0  Not_testing
> >      3        0        0  Not_testing
> >      4        0        0  Not_testing
> >      5        0        0  Not_testing
> > Selective self-test flags (0x0):
> >    After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > SCT Status Version:                  3
> > SCT Version (vendor specific):       258 (0x0102)
> > SCT Support Level:                   1
> > Device State:                        Active (0)
> > Current Temperature:                    28 Celsius
> > Power Cycle Min/Max Temperature:      8/43 Celsius
> > Lifetime    Min/Max Temperature:      0/49 Celsius
> > Under/Over Temperature Limit Count:   0/0
> >
> > SCT Temperature History Version:     2
> > Temperature Sampling Period:         1 minute
> > Temperature Logging Interval:        1 minute
> > Min/Max recommended Temperature:      0/60 Celsius
> > Min/Max Temperature Limit:           -41/85 Celsius
> > Temperature History Size (Index):    478 (305)
> >
> > Index    Estimated Time   Temperature Celsius
> >   306    2020-03-20 13:03    23  ****
> >   ...    ..( 33 skipped).    ..  ****
> >   340    2020-03-20 13:37    23  ****
> >   341    2020-03-20 13:38     ?  -
> >   342    2020-03-20 13:39    23  ****
> >   343    2020-03-20 13:40    23  ****
> >   344    2020-03-20 13:41    24  *****
> >   345    2020-03-20 13:42    25  ******
> >   346    2020-03-20 13:43    25  ******
> >   347    2020-03-20 13:44    25  ******
> >   348    2020-03-20 13:45    26  *******
> >   ...    ..(  2 skipped).    ..  *******
> >   351    2020-03-20 13:48    26  *******
> >   352    2020-03-20 13:49    27  ********
> >   353    2020-03-20 13:50    27  ********
> >   354    2020-03-20 13:51    28  *********
> >   355    2020-03-20 13:52    28  *********
> >   356    2020-03-20 13:53    22  ***
> >   ...    ..(276 skipped).    ..  ***
> >   155    2020-03-20 18:30    22  ***
> >   156    2020-03-20 18:31    23  ****
> >   ...    ..(148 skipped).    ..  ****
> >   305    2020-03-20 21:00    23  ****
> >
> > SCT Error Recovery Control command not supported
>
> Yup. Ouch!
> >
> > Device Statistics (GP/SMART Log 0x04) not supported
> >
> > SATA Phy Event Counters (GP Log 0x11)
> > ID      Size     Value  Description
> > 0x0001  2            0  Command failed due to ICRC error
> > 0x0002  2            0  R_ERR response for data FIS
> > 0x0003  2            0  R_ERR response for device-to-host data FIS
> > 0x0004  2            0  R_ERR response for host-to-device data FIS
> > 0x0005  2            0  R_ERR response for non-data FIS
> > 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> > 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> > 0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
> > 0x000b  2            0  CRC errors within host-to-device FIS
> > 0x8000  4      1208379  Vendor specific
> >
> > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Model Family:     Western Digital Red
> > Device Model:     WDC WD20EFRX-68AX9N0
> > Serial Number:    WD-WMC300320657
> > LU WWN Device Id: 5 0014ee 0ae1ee098
> > Firmware Version: 80.00A80
> > User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > Device is:        In smartctl database [for details use: -P show]
> > ATA Version is:   ACS-2 (minor revision not indicated)
> > SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> > Local Time is:    Fri Mar 20 21:00:38 2020 CET
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> > AAM feature is:   Unavailable
> > APM feature is:   Unavailable
> > Rd look-ahead is: Enabled
> > Write cache is:   Enabled
> > ATA Security is:  Disabled, NOT FROZEN [SEC1]
> > Wt Cache Reorder: Unknown
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status:  (0x00) Offline data collection activity
> > was never started.
> > Auto Offline Data Collection: Disabled.
> > Self-test execution status:      (   0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: (27120) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities:            (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability:        (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: (   2) minutes.
> > Extended self-test routine
> > recommended polling time: ( 274) minutes.
> > Conveyance self-test routine
> > recommended polling time: (   5) minutes.
> > SCT capabilities:        (0x70bd) SCT Status supported.
> > SCT Error Recovery Control supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> >    1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
> >    3 Spin_Up_Time            POS--K   176   169   021    -    4183
> >    4 Start_Stop_Count        -O--CK   100   100   000    -    502
> >    5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
> >    7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
> >    9 Power_On_Hours          -O--CK   061   061   000    -    28588
> >   10 Spin_Retry_Count        -O--CK   100   100   000    -    0
> >   11 Calibration_Retry_Count -O--CK   100   100   000    -    0
> >   12 Power_Cycle_Count       -O--CK   100   100   000    -    490
> > 192 Power-Off_Retract_Count -O--CK   200   200   000    -    483
> > 193 Load_Cycle_Count        -O--CK   200   200   000    -    18
> > 194 Temperature_Celsius     -O---K   120   089   000    -    27
> > 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> > 197 Current_Pending_Sector  -O--CK   200   200   000    -    0
> > 198 Offline_Uncorrectable   ----CK   100   253   000    -    0
> > 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> > 200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
> >                              ||||||_ K auto-keep
> >                              |||||__ C event count
> >                              ||||___ R error rate
> >                              |||____ S speed/performance
> >                              ||_____ O updated online
> >                              |______ P prefailure warning
> >
> > General Purpose Log Directory Version 1
> > SMART           Log Directory Version 1 [multi-sector log support]
> > Address    Access  R/W   Size  Description
> > 0x00       GPL,SL  R/O      1  Log Directory
> > 0x01           SL  R/O      1  Summary SMART error log
> > 0x02           SL  R/O      5  Comprehensive SMART error log
> > 0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
> > 0x06           SL  R/O      1  SMART self-test log
> > 0x07       GPL     R/O      1  Extended self-test log
> > 0x09           SL  R/W      1  Selective self-test log
> > 0x10       GPL     R/O      1  SATA NCQ Queued Error log
> > 0x11       GPL     R/O      1  SATA Phy Event Counters log
> > 0x21       GPL     R/O      1  Write stream error log
> > 0x22       GPL     R/O      1  Read stream error log
> > 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> > 0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
> > 0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
> > 0xbd       GPL,SL  VS       1  Device vendor specific log
> > 0xc0       GPL,SL  VS       1  Device vendor specific log
> > 0xc1       GPL     VS      93  Device vendor specific log
> > 0xe0       GPL,SL  R/W      1  SCT Command/Status
> > 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> >
> > SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
> > No Errors Logged
> >
> > SMART Extended Self-test Log Version: 1 (1 sectors)
> > Num  Test_Description    Status                  Remaining
> > LifeTime(hours)  LBA_of_first_error
> > # 1  Short offline       Completed without error       00%     26024         -
> >
> > SMART Selective self-test log data structure revision number 1
> >   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> >      1        0        0  Not_testing
> >      2        0        0  Not_testing
> >      3        0        0  Not_testing
> >      4        0        0  Not_testing
> >      5        0        0  Not_testing
> > Selective self-test flags (0x0):
> >    After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > SCT Status Version:                  3
> > SCT Version (vendor specific):       258 (0x0102)
> > SCT Support Level:                   1
> > Device State:                        Active (0)
> > Current Temperature:                    27 Celsius
> > Power Cycle Min/Max Temperature:     10/32 Celsius
> > Lifetime    Min/Max Temperature:      2/58 Celsius
> > Under/Over Temperature Limit Count:   0/0
> > Vendor specific:
> > 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > SCT Temperature History Version:     2
> > Temperature Sampling Period:         1 minute
> > Temperature Logging Interval:        1 minute
> > Min/Max recommended Temperature:      0/60 Celsius
> > Min/Max Temperature Limit:           -41/85 Celsius
> > Temperature History Size (Index):    478 (56)
> >
> > Index    Estimated Time   Temperature Celsius
> >    57    2020-03-20 13:03    24  *****
> >   ...    ..(377 skipped).    ..  *****
> >   435    2020-03-20 19:21    24  *****
> >   436    2020-03-20 19:22     ?  -
> >   437    2020-03-20 19:23    24  *****
> >   438    2020-03-20 19:24    25  ******
> >   ...    ..(  3 skipped).    ..  ******
> >   442    2020-03-20 19:28    25  ******
> >   443    2020-03-20 19:29    26  *******
> >   444    2020-03-20 19:30    26  *******
> >   445    2020-03-20 19:31    26  *******
> >   446    2020-03-20 19:32    27  ********
> >   ...    ..(  3 skipped).    ..  ********
> >   450    2020-03-20 19:36    27  ********
> >   451    2020-03-20 19:37    24  *****
> >   ...    ..( 82 skipped).    ..  *****
> >    56    2020-03-20 21:00    24  *****
> >
> > SCT Error Recovery Control:
> >             Read: Disabled
> >            Write: Disabled
>
> What's going on here? We have a RED drive, but ERC isn't working ...
> >
> > Device Statistics (GP/SMART Log 0x04) not supported
> >
> > SATA Phy Event Counters (GP Log 0x11)
> > ID      Size     Value  Description
> > 0x0001  2            0  Command failed due to ICRC error
> > 0x0002  2            0  R_ERR response for data FIS
> > 0x0003  2            0  R_ERR response for device-to-host data FIS
> > 0x0004  2            0  R_ERR response for host-to-device data FIS
> > 0x0005  2            0  R_ERR response for non-data FIS
> > 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> > 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> > 0x0008  2            0  Device-to-host non-data FIS retries
> > 0x0009  2           33  Transition from drive PhyRdy to drive PhyNRdy
> > 0x000a  2           34  Device-to-host register FISes sent due to a COMRESET
> > 0x000b  2            0  CRC errors within host-to-device FIS
> > 0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
> > 0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
> > 0x8000  4      1208361  Vendor specific
> >
> > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Device Model:     ST4000VN008-2DR166
> > Serial Number:    ZDH82183
> > LU WWN Device Id: 5 000c50 0c37c42c0
> > Firmware Version: SC60
> > User Capacity:    4,000,787,030,016 bytes [4.00 TB]
> > Sector Sizes:     512 bytes logical, 4096 bytes physical
> > Rotation Rate:    5980 rpm
> > Form Factor:      3.5 inches
> > Device is:        Not in smartctl database [for details use: -P showall]
> > ATA Version is:   ACS-3 T13/2161-D revision 5
> > SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
> > Local Time is:    Fri Mar 20 21:00:38 2020 CET
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> > AAM feature is:   Unavailable
> > APM level is:     254 (maximum performance)
> > Rd look-ahead is: Enabled
> > Write cache is:   Enabled
> > ATA Security is:  Disabled, NOT FROZEN [SEC1]
> > Wt Cache Reorder: Unknown
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status:  (0x82) Offline data collection activity
> > was completed without error.
> > Auto Offline Data Collection: Enabled.
> > Self-test execution status:      (   0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: (  581) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities:            (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability:        (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: (   1) minutes.
> > Extended self-test routine
> > recommended polling time: ( 621) minutes.
> > Conveyance self-test routine
> > recommended polling time: (   2) minutes.
> > SCT capabilities:        (0x50bd) SCT Status supported.
> > SCT Error Recovery Control supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 10
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> >    1 Raw_Read_Error_Rate     POSR--   070   065   044    -    10856451
> >    3 Spin_Up_Time            PO----   094   094   000    -    0
> >    4 Start_Stop_Count        -O--CK   100   100   020    -    53
> >    5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
> >    7 Seek_Error_Rate         POSR--   075   061   045    -    29667756
> >    9 Power_On_Hours          -O--CK   100   100   000    -    506 (130 79 0)
> >   10 Spin_Retry_Count        PO--C-   100   100   097    -    0
> >   12 Power_Cycle_Count       -O--CK   100   100   020    -    5
> > 184 End-to-End_Error        -O--CK   100   100   099    -    0
> > 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> > 188 Command_Timeout         -O--CK   098   098   000    -    65538
> > 189 High_Fly_Writes         -O-RCK   100   100   000    -    0
> > 190 Airflow_Temperature_Cel -O---K   076   070   040    -    24 (Min/Max 9/26)
> > 191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
> > 192 Power-Off_Retract_Count -O--CK   100   100   000    -    44
> > 193 Load_Cycle_Count        -O--CK   100   100   000    -    284
> > 194 Temperature_Celsius     -O---K   024   040   000    -    24 (0 9 0 0 0)
> > 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> > 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
> > 199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
> > 240 Head_Flying_Hours       ------   100   253   000    -    139 (51 45 0)
> > 241 Total_LBAs_Written      ------   100   253   000    -    8177237744
> > 242 Total_LBAs_Read         ------   100   253   000    -    5818370819
> >                              ||||||_ K auto-keep
> >                              |||||__ C event count
> >                              ||||___ R error rate
> >                              |||____ S speed/performance
> >                              ||_____ O updated online
> >                              |______ P prefailure warning
> >
> > General Purpose Log Directory Version 1
> > SMART           Log Directory Version 1 [multi-sector log support]
> > Address    Access  R/W   Size  Description
> > 0x00       GPL,SL  R/O      1  Log Directory
> > 0x01           SL  R/O      1  Summary SMART error log
> > 0x02           SL  R/O      5  Comprehensive SMART error log
> > 0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
> > 0x04       GPL,SL  R/O      8  Device Statistics log
> > 0x06           SL  R/O      1  SMART self-test log
> > 0x07       GPL     R/O      1  Extended self-test log
> > 0x09           SL  R/W      1  Selective self-test log
> > 0x10       GPL     R/O      1  SATA NCQ Queued Error log
> > 0x11       GPL     R/O      1  SATA Phy Event Counters log
> > 0x13       GPL     R/O      1  SATA NCQ Send and Receive log
> > 0x15       GPL     R/W      1  SATA Rebuild Assist log
> > 0x21       GPL     R/O      1  Write stream error log
> > 0x22       GPL     R/O      1  Read stream error log
> > 0x24       GPL     R/O    512  Current Device Internal Status Data log
> > 0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
> > 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> > 0xa1       GPL,SL  VS      24  Device vendor specific log
> > 0xa2       GPL     VS    8160  Device vendor specific log
> > 0xa6       GPL     VS     192  Device vendor specific log
> > 0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
> > 0xab       GPL     VS       1  Device vendor specific log
> > 0xb0       GPL     VS    9048  Device vendor specific log
> > 0xbe-0xbf  GPL     VS   65535  Device vendor specific log
> > 0xc1       GPL,SL  VS      16  Device vendor specific log
> > 0xd1       GPL     VS     136  Device vendor specific log
> > 0xd2       GPL     VS   10000  Device vendor specific log
> > 0xd3       GPL     VS    1920  Device vendor specific log
> > 0xe0       GPL,SL  R/W      1  SCT Command/Status
> > 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> >
> > SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> > No Errors Logged
> >
> > SMART Extended Self-test Log Version: 1 (1 sectors)
> > No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> >
> > SMART Selective self-test log data structure revision number 1
> >   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> >      1        0        0  Not_testing
> >      2        0        0  Not_testing
> >      3        0        0  Not_testing
> >      4        0        0  Not_testing
> >      5        0        0  Not_testing
> > Selective self-test flags (0x0):
> >    After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > SCT Status Version:                  3
> > SCT Version (vendor specific):       522 (0x020a)
> > SCT Support Level:                   1
> > Device State:                        Active (0)
> > Current Temperature:                    23 Celsius
> > Power Cycle Min/Max Temperature:      8/26 Celsius
> > Lifetime    Min/Max Temperature:      8/30 Celsius
> > Under/Over Temperature Limit Count:   0/336
> >
> > SCT Temperature History Version:     2
> > Temperature Sampling Period:         3 minutes
> > Temperature Logging Interval:        59 minutes
> > Min/Max recommended Temperature:      0/ 0 Celsius
> > Min/Max Temperature Limit:            0/ 0 Celsius
> > Temperature History Size (Index):    128 (119)
> >
> > Index    Estimated Time   Temperature Celsius
> >   120    2020-03-15 16:02    21  **
> >   ...    ..(  5 skipped).    ..  **
> >   126    2020-03-15 21:56    21  **
> >   127    2020-03-15 22:55    22  ***
> >   ...    ..( 16 skipped).    ..  ***
> >    16    2020-03-16 15:38    22  ***
> >    17    2020-03-16 16:37    23  ****
> >   ...    ..(  3 skipped).    ..  ****
> >    21    2020-03-16 20:33    23  ****
> >    22    2020-03-16 21:32    24  *****
> >    23    2020-03-16 22:31    23  ****
> >    24    2020-03-16 23:30    24  *****
> >    25    2020-03-17 00:29    24  *****
> >    26    2020-03-17 01:28    24  *****
> >    27    2020-03-17 02:27    23  ****
> >   ...    ..(  7 skipped).    ..  ****
> >    35    2020-03-17 10:19    23  ****
> >    36    2020-03-17 11:18    22  ***
> >   ...    ..(  3 skipped).    ..  ***
> >    40    2020-03-17 15:14    22  ***
> >    41    2020-03-17 16:13    23  ****
> >   ...    ..( 14 skipped).    ..  ****
> >    56    2020-03-18 06:58    23  ****
> >    57    2020-03-18 07:57    22  ***
> >   ...    ..(  2 skipped).    ..  ***
> >    60    2020-03-18 10:54    22  ***
> >    61    2020-03-18 11:53    21  **
> >    62    2020-03-18 12:52    20  *
> >    63    2020-03-18 13:51    21  **
> >    64    2020-03-18 14:50    20  *
> >    65    2020-03-18 15:49    20  *
> >    66    2020-03-18 16:48    21  **
> >   ...    ..(  5 skipped).    ..  **
> >    72    2020-03-18 22:42    21  **
> >    73    2020-03-18 23:41    24  *****
> >    74    2020-03-19 00:40    26  *******
> >   ...    ..(  2 skipped).    ..  *******
> >    77    2020-03-19 03:37    26  *******
> >    78    2020-03-19 04:36    22  ***
> >   ...    ..(  2 skipped).    ..  ***
> >    81    2020-03-19 07:33    22  ***
> >    82    2020-03-19 08:32    21  **
> >    83    2020-03-19 09:31    22  ***
> >    84    2020-03-19 10:30    22  ***
> >    85    2020-03-19 11:29    21  **
> >   ...    ..(  2 skipped).    ..  **
> >    88    2020-03-19 14:26    21  **
> >    89    2020-03-19 15:25    25  ******
> >    90    2020-03-19 16:24    25  ******
> >    91    2020-03-19 17:23    26  *******
> >    92    2020-03-19 18:22    25  ******
> >    93    2020-03-19 19:21    22  ***
> >   ...    ..(  3 skipped).    ..  ***
> >    97    2020-03-19 23:17    22  ***
> >    98    2020-03-20 00:16    21  **
> >   ...    ..(  4 skipped).    ..  **
> >   103    2020-03-20 05:11    21  **
> >   104    2020-03-20 06:10    20  *
> >   ...    ..( 11 skipped).    ..  *
> >   116    2020-03-20 17:58    20  *
> >   117    2020-03-20 18:57    21  **
> >   118    2020-03-20 19:56    21  **
> >   119    2020-03-20 20:55    21  **
> >
> > SCT Error Recovery Control:
> >             Read: Disabled
> >            Write: Disabled
>
> OUCH! AGAIN!
> >
> > Device Statistics (GP Log 0x04)
> > Page  Offset Size        Value Flags Description
> > 0x01  =====  =               =  ===  == General Statistics (rev 1) ==
> > 0x01  0x008  4               5  ---  Lifetime Power-On Resets
> > 0x01  0x010  4             506  ---  Power-on Hours
> > 0x01  0x018  6      8177237744  ---  Logical Sectors Written
> > 0x01  0x020  6        32254131  ---  Number of Write Commands
> > 0x01  0x028  6      5818370805  ---  Logical Sectors Read
> > 0x01  0x030  6        24397122  ---  Number of Read Commands
> > 0x01  0x038  6               -  ---  Date and Time TimeStamp
> > 0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
> > 0x03  0x008  4             159  ---  Spindle Motor Power-on Hours
> > 0x03  0x010  4              10  ---  Head Flying Hours
> > 0x03  0x018  4             284  ---  Head Load Events
> > 0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
> > 0x03  0x028  4               0  ---  Read Recovery Attempts
> > 0x03  0x030  4               0  ---  Number of Mechanical Start Failures
> > 0x03  0x038  4               0  ---  Number of Realloc. Candidate
> > Logical Sectors
> > 0x03  0x040  4              45  ---  Number of High Priority Unload Events
> > 0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
> > 0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
> > 0x04  0x010  4               2  ---  Resets Between Cmd Acceptance and
> > Completion
> > 0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
> > 0x05  0x008  1              23  ---  Current Temperature
> > 0x05  0x010  1              20  ---  Average Short Term Temperature
> > 0x05  0x018  1               -  ---  Average Long Term Temperature
> > 0x05  0x020  1              30  ---  Highest Temperature
> > 0x05  0x028  1               0  ---  Lowest Temperature
> > 0x05  0x030  1              27  ---  Highest Average Short Term Temperature
> > 0x05  0x038  1              14  ---  Lowest Average Short Term Temperature
> > 0x05  0x040  1               -  ---  Highest Average Long Term Temperature
> > 0x05  0x048  1               -  ---  Lowest Average Long Term Temperature
> > 0x05  0x050  4               0  ---  Time in Over-Temperature
> > 0x05  0x058  1              70  ---  Specified Maximum Operating Temperature
> > 0x05  0x060  4               0  ---  Time in Under-Temperature
> > 0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
> > 0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
> > 0x06  0x008  4             101  ---  Number of Hardware Resets
> > 0x06  0x010  4              17  ---  Number of ASR Events
> > 0x06  0x018  4               0  ---  Number of Interface CRC Errors
> >                                  |||_ C monitored condition met
> >                                  ||__ D supports DSN
> >                                  |___ N normalized value
> >
> > SATA Phy Event Counters (GP Log 0x11)
> > ID      Size     Value  Description
> > 0x000a  2           34  Device-to-host register FISes sent due to a COMRESET
> > 0x0001  2            0  Command failed due to ICRC error
> > 0x0003  2            0  R_ERR response for device-to-host data FIS
> > 0x0004  2            0  R_ERR response for host-to-device data FIS
> > 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> > 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> >
> Oh My God.
>
> This array is just asking for disaster. Whoops, you've just had one, sorry.
>
> I'm looking for details of your two failed drives, but I can't seem
> to find any. But as soon as you can get the array back, you need to fix
> those problems ASAP!!!
>
> Firstly, get rid of that Green!!! Were the two failed drives greens?
> Read the timeout page to find out why.
>
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>
> That will hopefully also fix the problem with those Reds with ERC
> disabled. It would not surprise me in the slightest if this is what has
> done the damage to your array.
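>
> For the record, the fix from that page is roughly this (untested here,
> adjust device names; the settings don't survive a reboot, so they belong
> in a boot script):
>
> # smartctl -l scterc,70,70 /dev/sdX   # cap error recovery at 7s where ERC works
> # echo 180 > /sys/block/sdX/device/timeout   # for drives with no working ERC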
>
> Lastly, those ST4000s. Are they Ironwolves? I guess they're good drives,
> but they've just trashed your raid-6 redundancy - lose just one of them
> and your array is teetering on the edge. You need to get your sdX2
> partitions copied onto new drives ASAP.
>
> What I'd do is get a couple more ST4000s, and use them, creating 4TB
> partitions. Then take your existing ST4000s, and convert them to 4TB
> partitions. At which point you only need five more ST4000s to move your
> array onto new drives.
>
> I'm not sure how you get there - once you've got your nine 4TB drives you
> *may* be able to just fail and remove the remaining 2TB drives.
> Otherwise, I'd use the freed-up 2TB drives to create 4TB raid-0s. You'd
> end up having to buy a couple of spare 4TB drives to move the entire
> array onto 4TB "drives", but then you could remove the raid-0 arrays.
>
> Cheers,
> Wol

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid6 recovery
  2020-03-21 11:54       ` Glenn Greibesland
@ 2020-03-21 19:24         ` Phil Turmel
  2020-03-21 22:12           ` Glenn Greibesland
  2020-03-22  0:05           ` Wols Lists
  0 siblings, 2 replies; 13+ messages in thread
From: Phil Turmel @ 2020-03-21 19:24 UTC (permalink / raw)
  To: Glenn Greibesland, antlists; +Cc: linux-raid, NeilBrown

Hi Glenn,

{Convention on kernel.org lists is to interleave replies or bottom post, 
and to trim non-relevant quoted material.  Please do so in the future.}

On 3/21/20 7:54 AM, Glenn Greibesland wrote:
> Yes, I am aware of the problems with WD Green and multiple partitions
> on a single 4TB disk. I am in the middle of getting rid of old disks, and
> I have enough new drives to stop having multiple partitions on single
> drives, but not enough power and free SATA ports. It is just a
> temporary solution. That is also the reason why I did not
> include many details in the original post; I knew it would just
> distract from the problem I want to solve right away.
> 
> What I need help with now is just getting the array started with the
> 16 out of 18 disks. Then I can continue migrating data and replacing
> old disks as planned.

I've examined the material posted, and the sequence of events described. 
  The --re-add damaged that one drive's role record and there is no 
programmatic way in mdadm to correct it.

Since you seem comfortable reading source code, you might consider byte 
editing that drive's superblock to restore it to "active device 10". 
That is what I would do.  With that corrected, --assemble --force should 
give you a running array.
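
If you want to see what the --re-add actually wrote before attempting
surgery, a read-only peek is harmless.  A sketch, assuming 1.2 metadata
(superblock 4 KiB into each member; per md_p.h, the 16-bit role table
starts at byte 256 of the superblock, with 0xffff meaning spare):

# mdadm --examine /dev/sdX1 | grep 'Device Role'
# dd if=/dev/sdX1 bs=1 skip=$((4096 + 256)) count=36 2>/dev/null | xxd

Your 18 roles occupy 36 bytes; an active device N shows up as N in
little-endian, a spare as ff ff.  Note that a hand edit also means
recomputing sb_csum at superblock offset 216, or mdadm will reject the
superblock as corrupt.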

In lieu of superblock surgery, you will indeed need to perform a 
--create --assume-clean, as you proposed in your original email.  Since 
you have already constructed a syntactically valid command for that 
purpose, with appropriate data offsets, that might be the fastest way to 
get a running array.

I would double-check the /dev/ name versus array "active device" number 
relationship to ensure strict ordering in your --create operation. 
Incorrect ordering will utterly scramble your content.
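
Something like this (adjust the glob to your actual member list) lines
the two up for inspection:

# for d in /dev/sd?[12]; do printf '%-11s' "$d"; \
      mdadm --examine "$d" | grep 'Device Role'; done

Compare that listing against the device order in your --create command
before you commit.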

> When I built the array in 2012, I used WD Green. They turned out to be
> horrible disks and I have since replaced some of them with WD Red. The
> newest disks I've bought are Ironwolves

I also noted the drives with Error Recovery Control turned off.  That is 
not an issue while your array has no redundancy, but is catastrophic in 
any normal array.  It is as bad as having a drive that doesn't do ERC at 
all.  Don't do that.  Do read the "Timeout Mismatch" documentation that 
Anthony recommended, if you haven't yet.

I also recommend, when you get to a running array, that you prioritize 
the backup of its content--get the critical data copied out ASAP.  Your 
array will be very vulnerable to Unrecoverable Read Errors until you've 
completed your reconfiguration onto new drives.  Do not attempt to scrub 
the array or read every file right away, as any URE may break the array 
again.

If UREs do break your array again, you will need to use an 
error-ignoring copy tool (some flavor of ddrescue) to put the readable 
data onto a new device, remove the old device from the system, and then 
--assemble --force with the replacement.  Repeat as needed.
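
For reference, the GNU ddrescue incantation for that is roughly as
follows (keep the mapfile on a third, healthy disk so interrupted runs
can resume):

# ddrescue -f -n /dev/failing /dev/replacement /root/rescue.map
# ddrescue -f -r3 /dev/failing /dev/replacement /root/rescue.map

The first pass skips the hard sectors, the second retries them; then do
the --assemble --force with the replacement in place of the original.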

Good luck!

Regards,

Phil

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid6 recovery
  2020-03-21 19:24         ` Phil Turmel
@ 2020-03-21 22:12           ` Glenn Greibesland
  2020-03-22  0:32             ` Phil Turmel
  2020-03-22  0:05           ` Wols Lists
  1 sibling, 1 reply; 13+ messages in thread
From: Glenn Greibesland @ 2020-03-21 22:12 UTC (permalink / raw)
  To: Phil Turmel; +Cc: antlists, linux-raid, NeilBrown

lør. 21. mar. 2020 kl. 20:24 skrev Phil Turmel <philip@turmel.org>:
> {Convention on kernel.org lists is to interleave replies or bottom post,
> and to trim non-relevant quoted material.  Please do so in the future.}

Sorry about that.

> Since you seem comfortable reading source code, you might consider byte
> editing that drive's superblock to restore it to "active device 10".
> That is what I would do.  With that corrected, --assemble --force should
> give you a running array.

I did some more digging in the source code, and it looks like the
superblock is replicated onto all drives, so I would probably
have to edit the superblock of every disk - but I'm not sure.
With newfound confidence (thanks) I decided to try the --create
--assume-clean option instead.
It worked fine and I am now copying the data that is not already backed up.

I'll wait until the data is copied onto other drives before I add the
last two disks to the array and start rebuilding.

> I also noted the drives with Error Recovery Control turned off.  That is
> not an issue while your array has no redundancy, but is catastrophic in
> any normal array.  It is as bad as having a drive that doesn't do ERC at
> all.  Don't do that.  Do read the "Timeout Mismatch" documentation that
> Anthony recommended, if you haven't yet.

I'll read up on this documentation to ensure reliable operation in the
future. Thanks Phil and Anthony.

So to summarize what happened and what I've learned:
I had a RAID6 array with only 16 out of 18 working drives.
I received an email from mdadm saying another drive failed.
I ran a full offline smart test that completed successfully.

The drive was in F (failed) state. I used --re-add, and mdadm overwrote
the superblock, turning it into a spare drive instead of putting the
drive back into slot 10.
I should have used --assemble --force.

Am I correct?

Glenn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid6 recovery
  2020-03-21 19:24         ` Phil Turmel
  2020-03-21 22:12           ` Glenn Greibesland
@ 2020-03-22  0:05           ` Wols Lists
  1 sibling, 0 replies; 13+ messages in thread
From: Wols Lists @ 2020-03-22  0:05 UTC (permalink / raw)
  To: Phil Turmel, Glenn Greibesland; +Cc: linux-raid, NeilBrown

On 21/03/20 19:24, Phil Turmel wrote:
> If UREs do break your array again, you will need to use an
> error-ignoring copy tool (some flavor of ddrescue) to put the readable
> data onto a new device, remove the old device from the system, and then
> --assemble --force with the replacement.  Repeat as needed.

Look at dm-integrity - though I would NOT recommend it at the moment, as
it's untested and reputedly breaks raid 5 & 6. If we could trust it, it
would be a wonderful tool alongside ddrescue.

Hopefully I'm about to have a system with 6 or so drives that I can
play with as a raid test-bed, and I intend to do a load of work on this.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid6 recovery
  2020-03-21 22:12           ` Glenn Greibesland
@ 2020-03-22  0:32             ` Phil Turmel
  2020-03-23  9:23               ` Wols Lists
  0 siblings, 1 reply; 13+ messages in thread
From: Phil Turmel @ 2020-03-22  0:32 UTC (permalink / raw)
  To: Glenn Greibesland; +Cc: antlists, linux-raid, NeilBrown

On 3/21/20 6:12 PM, Glenn Greibesland wrote:

[trim /]

> So to summarize what happened and what I've learned:
> I had a RAID6 array with only 16 out of 18 working drives.
> I received an email from mdadm saying another drive failed.
> I ran a full offline smart test that completed successfuly.
> 
> The drive was in F (failed) state. I used --re-add and mdadm overwrote
> the superblock turning it into a spare drive instead of putting the
> drive back into slot 10.
> I should have used --assemble --force.
> 
> Am I correct?

Yes.

However, there have been bugs in --force that would cause it to not
assemble.  Also, I believe the latest --re-add behavior would not have
damaged the metadata.


Phil

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid6 recovery
  2020-03-22  0:32             ` Phil Turmel
@ 2020-03-23  9:23               ` Wols Lists
  2020-03-23 12:35                 ` Glenn Greibesland
  0 siblings, 1 reply; 13+ messages in thread
From: Wols Lists @ 2020-03-23  9:23 UTC (permalink / raw)
  To: Phil Turmel, Glenn Greibesland; +Cc: linux-raid, NeilBrown

On 22/03/20 00:32, Phil Turmel wrote:
> However, there have been bugs in --force that would cause it to not
> assemble.  Also, I believe latest behavior for --re-add would not have
> damaged the metadata.

And note that the website does tell you always to use the latest version
of mdadm when trying to recover an array ... because it's linux-only,
it's pretty easy to build from source if you have to.
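
Roughly, assuming git and a basic build toolchain are present:

$ git clone https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
$ cd mdadm && make
$ ./mdadm --version

You can run the freshly built binary in place - no need to install it
over the distro copy just for a recovery.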

Cheers,
Wol

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Raid6 recovery
  2020-03-23  9:23               ` Wols Lists
@ 2020-03-23 12:35                 ` Glenn Greibesland
  0 siblings, 0 replies; 13+ messages in thread
From: Glenn Greibesland @ 2020-03-23 12:35 UTC (permalink / raw)
  To: Wols Lists; +Cc: Phil Turmel, linux-raid, NeilBrown

man. 23. mar. 2020 kl. 10:23 skrev Wols Lists <antlists@youngman.org.uk>:

> And note that the website does tell you always to use the latest version
> of mdadm when trying to recover an array ... because it's linux-only
> it's pretty easy to build from source if you have to.

I was probably using version 3.3-2 when I ran --re-add and the problem
started. That is the latest version available for the version of
Ubuntu Server I am running.
I then upgraded to v4.0 and later to v4.1-65 by building from source.
Lesson learned.

Question about mdadm documentation:
Should the man page be updated to reflect the support for using
sectors as a unit, in addition to K, M, G and T?

Glenn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid6 recovery
  2011-01-14 16:16 raid6 recovery Björn Englund
@ 2011-01-14 21:52 ` NeilBrown
  0 siblings, 0 replies; 13+ messages in thread
From: NeilBrown @ 2011-01-14 21:52 UTC (permalink / raw)
  To: Björn Englund; +Cc: linux-raid

On Fri, 14 Jan 2011 17:16:26 +0100 Björn Englund <be@smarteye.se> wrote:

> Hi.
> 
> After a loss of communication with a drive in a 10 disk raid6 the disk
> was dropped out of the raid.
> 
> I added it again with
> mdadm /dev/md16 --add /dev/sdbq1
> 
> The array resynced and I used the xfs filesystem on top of the raid.
> 
> After a while I started noticing filesystem errors.
> 
> I did
> echo check > /sys/block/md16/md/sync_action
> 
> I got a lot of errors in /sys/block/md16/md/mismatch_cnt
> 
> I failed and removed the disk I added before from the array.
> 
> Did a check again (on the 9/10 array)
> echo check > /sys/block/md16/md/sync_action
> 
> No errors  /sys/block/md16/md/mismatch_cnt
> 
> Wiped the superblock from /dev/sdbq1 and added it again to the array.
> Let it finish resyncing.
> Did a check and once again a lot of errors.

That is obviously very bad.  After the recovery it may well report a large
number in mismatch_cnt, but if you then do a 'check' the number should go to
zero and stay there.
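
That is, after something like

# echo check > /sys/block/md16/md/sync_action
# cat /sys/block/md16/md/mismatch_cnt

mismatch_cnt should read 0, and keep reading 0 on every later check.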

Did you interrupt the recovery at all, or did it run to completion without
any interference?   What kernel version are you using?

> 
> The drive now has slot 10 instead of slot 3 which it had before the
> first error.

This is normal.  When you wiped the superblock, md thought it was a new device
and gave it a new number in the array.  It still filled the same role though.


> 
> Examining each device (see below) shows 11 slots and one failed?
> (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3) ?

These numbers are confusing, but they are correct and suggest the array is
whole and working.
Newer versions of mdadm are less confusing.

I'm afraid I cannot suggest what the root problem is.  It seems like
something is seriously wrong with IO to the device, but if that were the
case you would expect other errors...

NeilBrown


> 
> 
> Any idea what is going on?
> 
> mdadm --version
> mdadm - v2.6.9 - 10th March 2009
> 
> Centos 5.5
> 
> 
> mdadm -D /dev/md16
> /dev/md16:
>         Version : 1.01
>   Creation Time : Thu Nov 25 09:15:54 2010
>      Raid Level : raid6
>      Array Size : 7809792000 (7448.00 GiB 7997.23 GB)
>   Used Dev Size : 976224000 (931.00 GiB 999.65 GB)
>    Raid Devices : 10
>   Total Devices : 10
> Preferred Minor : 16
>     Persistence : Superblock is persistent
> 
>     Update Time : Fri Jan 14 16:22:10 2011
>           State : clean
>  Active Devices : 10
> Working Devices : 10
>  Failed Devices : 0
>   Spare Devices : 0
> 
>      Chunk Size : 256K
> 
>            Name : 16
>            UUID : fcd585d0:f2918552:7090d8da:532927c8
>          Events : 90
> 
>     Number   Major   Minor   RaidDevice State
>        0       8      145        0      active sync   /dev/sdj1
>        1      65        1        1      active sync   /dev/sdq1
>        2      65       17        2      active sync   /dev/sdr1
>       10      68       65        3      active sync   /dev/sdbq1
>        4      65       49        4      active sync   /dev/sdt1
>        5      65       65        5      active sync   /dev/sdu1
>        6      65      113        6      active sync   /dev/sdx1
>        7      65      129        7      active sync   /dev/sdy1
>        8      65       33        8      active sync   /dev/sds1
>        9      65      145        9      active sync   /dev/sdz1
> 
> 
> 
> mdadm -E /dev/sdj1
> /dev/sdj1:
>           Magic : a92b4efc
>         Version : 1.1
>     Feature Map : 0x0
>      Array UUID : fcd585d0:f2918552:7090d8da:532927c8
>            Name : 16
>   Creation Time : Thu Nov 25 09:15:54 2010
>      Raid Level : raid6
>    Raid Devices : 10
> 
>  Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
>      Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
>   Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
>     Data Offset : 264 sectors
>    Super Offset : 0 sectors
>           State : clean
>     Device UUID : 5db9c8f7:ce5b375e:757c53d0:04e89a06
> 
>     Update Time : Fri Jan 14 16:22:10 2011
>        Checksum : 1f17a675 - correct
>          Events : 90
> 
>      Chunk Size : 256K
> 
>     Array Slot : 0 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
>    Array State : Uuuuuuuuuu 1 failed
> 
> 
> 
> mdadm -E /dev/sdq1
> /dev/sdq1:
>           Magic : a92b4efc
>         Version : 1.1
>     Feature Map : 0x0
>      Array UUID : fcd585d0:f2918552:7090d8da:532927c8
>            Name : 16
>   Creation Time : Thu Nov 25 09:15:54 2010
>      Raid Level : raid6
>    Raid Devices : 10
> 
>  Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
>      Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
>   Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
>     Data Offset : 264 sectors
>    Super Offset : 0 sectors
>           State : clean
>     Device UUID : fb113255:fda391a6:7368a42b:1d6d4655
> 
>     Update Time : Fri Jan 14 16:22:10 2011
>        Checksum : 6ed7b859 - correct
>          Events : 90
> 
>      Chunk Size : 256K
> 
>     Array Slot : 1 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
>    Array State : uUuuuuuuuu 1 failed
> 
> 
>  mdadm -E /dev/sdr1
> /dev/sdr1:
>           Magic : a92b4efc
>         Version : 1.1
>     Feature Map : 0x0
>      Array UUID : fcd585d0:f2918552:7090d8da:532927c8
>            Name : 16
>   Creation Time : Thu Nov 25 09:15:54 2010
>      Raid Level : raid6
>    Raid Devices : 10
> 
>  Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
>      Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
>   Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
>     Data Offset : 264 sectors
>    Super Offset : 0 sectors
>           State : clean
>     Device UUID : afcb4dd8:2aa58944:40a32ed9:eb6178af
> 
>     Update Time : Fri Jan 14 16:22:10 2011
>        Checksum : 97a7a2d7 - correct
>          Events : 90
> 
>      Chunk Size : 256K
> 
>     Array Slot : 2 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
>    Array State : uuUuuuuuuu 1 failed
> 
> 
> mdadm -E /dev/sdbq1
> /dev/sdbq1:
>           Magic : a92b4efc
>         Version : 1.1
>     Feature Map : 0x0
>      Array UUID : fcd585d0:f2918552:7090d8da:532927c8
>            Name : 16
>   Creation Time : Thu Nov 25 09:15:54 2010
>      Raid Level : raid6
>    Raid Devices : 10
> 
>  Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
>      Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
>   Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
>     Data Offset : 264 sectors
>    Super Offset : 0 sectors
>           State : clean
>     Device UUID : 93c6ae7c:d8161356:7ada1043:d0c5a924
> 
>     Update Time : Fri Jan 14 16:22:10 2011
>        Checksum : 2ca5aa8f - correct
>          Events : 90
> 
>      Chunk Size : 256K
> 
>     Array Slot : 10 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
>    Array State : uuuUuuuuuu 1 failed
> 
> 
> and so on for the rest of the drives.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* raid6 recovery
@ 2011-01-14 16:16 Björn Englund
  2011-01-14 21:52 ` NeilBrown
  0 siblings, 1 reply; 13+ messages in thread
From: Björn Englund @ 2011-01-14 16:16 UTC (permalink / raw)
  To: linux-raid

Hi.

After a loss of communication with a drive in a 10 disk raid6 the disk
was dropped out of the raid.

I added it again with
mdadm /dev/md16 --add /dev/sdbq1

The array resynced and I used the xfs filesystem on top of the raid.

After a while I started noticing filesystem errors.

I did
echo check > /sys/block/md16/md/sync_action

I got a lot of errors in /sys/block/md16/md/mismatch_cnt

I failed and removed the disk I added before from the array.

Did a check again (on the 9/10 array)
echo check > /sys/block/md16/md/sync_action

No errors  /sys/block/md16/md/mismatch_cnt

Wiped the superblock from /dev/sdbq1 and added it again to the array.
Let it finish resyncing.
Did a check and once again a lot of errors.

The drive now has slot 10 instead of slot 3 which it had before the
first error.

Examining each device (see below) shows 11 slots and one failed?
(0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3) ?


Any idea what is going on?

mdadm --version
mdadm - v2.6.9 - 10th March 2009

Centos 5.5


mdadm -D /dev/md16
/dev/md16:
        Version : 1.01
  Creation Time : Thu Nov 25 09:15:54 2010
     Raid Level : raid6
     Array Size : 7809792000 (7448.00 GiB 7997.23 GB)
  Used Dev Size : 976224000 (931.00 GiB 999.65 GB)
   Raid Devices : 10
  Total Devices : 10
Preferred Minor : 16
    Persistence : Superblock is persistent

    Update Time : Fri Jan 14 16:22:10 2011
          State : clean
 Active Devices : 10
Working Devices : 10
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 256K

           Name : 16
           UUID : fcd585d0:f2918552:7090d8da:532927c8
         Events : 90

    Number   Major   Minor   RaidDevice State
       0       8      145        0      active sync   /dev/sdj1
       1      65        1        1      active sync   /dev/sdq1
       2      65       17        2      active sync   /dev/sdr1
      10      68       65        3      active sync   /dev/sdbq1
       4      65       49        4      active sync   /dev/sdt1
       5      65       65        5      active sync   /dev/sdu1
       6      65      113        6      active sync   /dev/sdx1
       7      65      129        7      active sync   /dev/sdy1
       8      65       33        8      active sync   /dev/sds1
       9      65      145        9      active sync   /dev/sdz1



mdadm -E /dev/sdj1
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x0
     Array UUID : fcd585d0:f2918552:7090d8da:532927c8
           Name : 16
  Creation Time : Thu Nov 25 09:15:54 2010
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
     Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
  Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
    Data Offset : 264 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : 5db9c8f7:ce5b375e:757c53d0:04e89a06

    Update Time : Fri Jan 14 16:22:10 2011
       Checksum : 1f17a675 - correct
         Events : 90

     Chunk Size : 256K

    Array Slot : 0 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
   Array State : Uuuuuuuuuu 1 failed



mdadm -E /dev/sdq1
/dev/sdq1:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x0
     Array UUID : fcd585d0:f2918552:7090d8da:532927c8
           Name : 16
  Creation Time : Thu Nov 25 09:15:54 2010
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
     Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
  Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
    Data Offset : 264 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : fb113255:fda391a6:7368a42b:1d6d4655

    Update Time : Fri Jan 14 16:22:10 2011
       Checksum : 6ed7b859 - correct
         Events : 90

     Chunk Size : 256K

    Array Slot : 1 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
   Array State : uUuuuuuuuu 1 failed


 mdadm -E /dev/sdr1
/dev/sdr1:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x0
     Array UUID : fcd585d0:f2918552:7090d8da:532927c8
           Name : 16
  Creation Time : Thu Nov 25 09:15:54 2010
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
     Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
  Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
    Data Offset : 264 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : afcb4dd8:2aa58944:40a32ed9:eb6178af

    Update Time : Fri Jan 14 16:22:10 2011
       Checksum : 97a7a2d7 - correct
         Events : 90

     Chunk Size : 256K

    Array Slot : 2 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
   Array State : uuUuuuuuuu 1 failed


mdadm -E /dev/sdbq1
/dev/sdbq1:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x0
     Array UUID : fcd585d0:f2918552:7090d8da:532927c8
           Name : 16
  Creation Time : Thu Nov 25 09:15:54 2010
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
     Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
  Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
    Data Offset : 264 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : 93c6ae7c:d8161356:7ada1043:d0c5a924

    Update Time : Fri Jan 14 16:22:10 2011
       Checksum : 2ca5aa8f - correct
         Events : 90

     Chunk Size : 256K

    Array Slot : 10 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
   Array State : uuuUuuuuuu 1 failed


and so on for the rest of the drives.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* raid6 recovery
@ 2009-01-15 15:24 Jason Weber
  0 siblings, 0 replies; 13+ messages in thread
From: Jason Weber @ 2009-01-15 15:24 UTC (permalink / raw)
  To: linux-raid

Before I cause too much damage, I really need expert help.

Early this morning, the machine locked up and my 4x500Gb raid6 did not
recover on reboot.
A smaller 2x18Gb raid came up as normal.

/var/log/messages has:

Jan 15 01:12:22 wildfire Pid: 6056, comm: mdadm Tainted: P
2.6.19-gentoo-r5 #3

with some codes and a lot of other lines like it when it went down. And then,

Jan 15 01:16:37 wildfire mdadm: DeviceDisappeared event detected on md
device /dev/md1

I tried simple re-adds:

# mdadm /dev/md1 --add /dev/sdd /dev/sde
mdadm: cannot get array info for /dev/md1

Eventually I noticed that the drives had a different UUID than mdadm.conf;
one byte had changed.  I have a backup of mdadm.conf, so I know the file
itself had not changed.

So, I changed mdadm.conf to match the drives and started an assemble

# mdadm --assemble --verbose /dev/md1
mdadm: looking for devices for /dev/md1
mdadm: cannot open device
/dev/disk/by-uuid/d7a08e91-0a49-4e91-91d7-d9d1e9e6cda1: Device or
resource busy
mdadm: /dev/disk/by-uuid/d7a08e91-0a49-4e91-91d7-d9d1e9e6cda1 has wrong uuid.
mdadm: no recogniseable superblock on /dev/sdg1
mdadm: /dev/sdg1 has wrong uuid.
mdadm: no recogniseable superblock on /dev/sdg
mdadm: /dev/sdg has wrong uuid.
mdadm: cannot open device /dev/sdi2: Device or resource busy
mdadm: /dev/sdi2 has wrong uuid.
mdadm: cannot open device /dev/sdi1: Device or resource busy
mdadm: /dev/sdi1 has wrong uuid.
mdadm: cannot open device /dev/sdi: Device or resource busy
mdadm: /dev/sdi has wrong uuid.
mdadm: cannot open device /dev/sdh1: Device or resource busy
mdadm: /dev/sdh1 has wrong uuid.
mdadm: cannot open device /dev/sdh: Device or resource busy
mdadm: /dev/sdh has wrong uuid.
mdadm: /dev/sdc has wrong uuid.
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has wrong uuid.
mdadm: cannot open device /dev/sdb: Device or resource busy
mdadm: /dev/sdb has wrong uuid.
mdadm: cannot open device /dev/sda4: Device or resource busy
mdadm: /dev/sda4 has wrong uuid.
mdadm: cannot open device /dev/sda3: Device or resource busy
mdadm: /dev/sda3 has wrong uuid.
mdadm: cannot open device /dev/sda2: Device or resource busy
mdadm: /dev/sda2 has wrong uuid.
mdadm: cannot open device /dev/sda1: Device or resource busy
mdadm: /dev/sda1 has wrong uuid.
mdadm: cannot open device /dev/sda: Device or resource busy
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sdf is identified as a member of /dev/md1, slot 1.
mdadm: /dev/sde is identified as a member of /dev/md1, slot 0.
mdadm: /dev/sdd is identified as a member of /dev/md1, slot 3.

which has been sitting there for about four hours at full CPU, with, as
far as I can tell, not much drive activity (how can I tell? The drives
aren't very loud relative to the overall machine noise).
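
I suppose one crude way to check is two snapshots of /proc/diskstats a
while apart; if the counters after a member's name move between them,
that disk is doing I/O (the sd[cdef] letters are just my guess at the
members):

# grep -E ' sd[cdef] ' /proc/diskstats; sleep 30; grep -E ' sd[cdef] ' /proc/diskstats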

As for "damage" I've done, first of all, one typo added /dev/sdc, once
of md1, to the md0 array
so now it thinks it is 18Gb according to mdadm -E, but hopefully it
was only set to spare so
maybe it didn't get scrambled:

# mdadm -E /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 96a4204f:7b6211e6:34105f4c:9857a351
  Creation Time : Tue May 17 23:03:53 2005
     Raid Level : raid1
  Used Dev Size : 17952512 (17.12 GiB 18.38 GB)
     Array Size : 17952512 (17.12 GiB 18.38 GB)
   Raid Devices : 2
  Total Devices : 3
Preferred Minor : 0

    Update Time : Thu Jan 15 01:52:42 2009
          State : clean
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1
       Checksum : 195f64d3 - correct
         Events : 0.39649024


      Number   Major   Minor   RaidDevice State
this     2       8       32        2      spare   /dev/sdc

   0     0       8      113        0      active sync   /dev/sdh1
   1     1       8      129        1      active sync   /dev/sdi1
   2     2       8       32        2      spare   /dev/sdc
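
If md0 really did only take it on as a spare, I assume the way back
(once things settle) is to drop it from md0 and wipe the stale
superblock before handing it back to md1, roughly:

# mdadm /dev/md0 --remove /dev/sdc
# mdadm --zero-superblock /dev/sdc

(as far as I can tell a spare can be removed directly, without failing
it first).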

Here's the others:

# mdadm -E /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 00.91.00
           UUID : f92d43a8:5ab3f411:26e606b2:3c378a67
  Creation Time : Sat Oct 13 00:23:51 2007
     Raid Level : raid6
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
     Array Size : 976772992 (931.52 GiB 1000.22 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1

  Reshape pos'n : 9223371671782555647

    Update Time : Thu Jan 15 01:12:21 2009
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : dca29b4 - correct
         Events : 0.79926

     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       48        3      active sync   /dev/sdd

   0     0       8       64        0      active sync   /dev/sde
   1     1       8       80        1      active sync   /dev/sdf
   2     2       8       32        2      active sync   /dev/sdc
   3     3       8       48        3      active sync   /dev/sdd

# mdadm -E /dev/sde
/dev/sde:
          Magic : a92b4efc
        Version : 00.91.00
           UUID : f92d43a8:5ab3f411:26e606b2:3c378a67
  Creation Time : Sat Oct 13 00:23:51 2007
     Raid Level : raid6
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
     Array Size : 976772992 (931.52 GiB 1000.22 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1

  Reshape pos'n : 9223371671782555647

    Update Time : Thu Jan 15 01:12:21 2009
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : dca29be - correct
         Events : 0.79926

     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8       64        0      active sync   /dev/sde

   0     0       8       64        0      active sync   /dev/sde
   1     1       8       80        1      active sync   /dev/sdf
   2     2       8       32        2      active sync   /dev/sdc
   3     3       8       48        3      active sync   /dev/sdd

# mdadm -E /dev/sdf
/dev/sdf:
          Magic : a92b4efc
        Version : 00.91.00
           UUID : f92d43a8:5ab3f411:26e606b2:3c378a67
  Creation Time : Sat Oct 13 00:23:51 2007
     Raid Level : raid6
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
     Array Size : 976772992 (931.52 GiB 1000.22 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1

  Reshape pos'n : 9223371671782555647

    Update Time : Thu Jan 15 01:12:21 2009
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : dca29d0 - correct
         Events : 0.79926

     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       80        1      active sync   /dev/sdf

   0     0       8       64        0      active sync   /dev/sde
   1     1       8       80        1      active sync   /dev/sdf
   2     2       8       32        2      active sync   /dev/sdc
   3     3       8       48        3      active sync   /dev/sdd

/etc/mdadm.conf:
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md1 level=raid6 num-devices=4
UUID=f92d43a8:5ab3f411:26e606b2:3c378a67
ARRAY /dev/md0 level=raid1 num-devices=2
UUID=96a4204f:7b6211e6:34105f4c:9857a351

# This file was auto-generated on Tue, 11 Mar 2008 00:10:35 -0700
# by mkconf $Id: mkconf 324 2007-05-05 18:49:44Z madduck $

It previously said:
UUID=f92d43a8:5ab3f491:26e606b2:3c378a67

with ...491... instead of ...411...
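
In hindsight it probably would have been safer to regenerate the ARRAY
lines straight from the on-disk superblocks instead of hand-editing:

# mdadm --examine --scan

and paste its output into mdadm.conf.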

Is mdadm --assemble supposed to take a long time, or should it return
almost immediately and let me watch the progress in /proc/mdstat, which
currently just says:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sdh1[0] sdi1[1]
      17952512 blocks [2/2] [UU]

unused devices: <none>
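
(For what it's worth, something like

# watch -n 5 cat /proc/mdstat

should make any progress obvious the moment the assemble completes;
right now md1 simply isn't listed there at all.)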

Also, I ran modprobe raid456 manually before the assemble, since
/proc/mdstat was only listing raid1 at the time. Maybe it would have
been loaded automatically at the right moment anyhow.
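
For the record, the loaded modules can be double-checked with:

# lsmod | grep raid

though the Personalities line above already lists raid6/raid5/raid4, so
the modprobe evidently took.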

Should I just wait for the assemble, or is it doing nothing?
Can I recover /dev/sdc as well, or is that unimportant, since I can
clear it and re-add it if the other three (or even two) sync up and
become available?

This md1 has been trouble since its inception a couple of years ago; it
seems I get corrupt files every week or so. My little U320 SCSI md0
raid1 has been nearly uneventful for much longer. Is raid6 less stable,
or is my sata_sil24 card a bad choice? Maybe SATA doesn't measure up to
SCSI. Please point out any obvious foolishness on my part.

I do have a five-day-old partial backup on a single non-raid disk, which
is now the only copy of the data. I'm very nervous about critical loss.
If I absolutely need to start over, I'd like to get some redundancy back
as soon as possible. Perhaps breaking the storage into a pair of raid1
arrays is smarter anyhow.

-- Jason P Weber

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-03-23 12:35 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-19 19:55 Raid6 recovery Glenn Greibesland
2020-03-20 19:15 ` Wols Lists
     [not found]   ` <CA+9eyigMV-E=FwtXDWZszSsV6JOxxFOFVh6WzmeH=OC3heMUHw@mail.gmail.com>
2020-03-21  0:06     ` antlists
2020-03-21 11:54       ` Glenn Greibesland
2020-03-21 19:24         ` Phil Turmel
2020-03-21 22:12           ` Glenn Greibesland
2020-03-22  0:32             ` Phil Turmel
2020-03-23  9:23               ` Wols Lists
2020-03-23 12:35                 ` Glenn Greibesland
2020-03-22  0:05           ` Wols Lists
  -- strict thread matches above, loose matches on Subject: below --
2011-01-14 16:16 raid6 recovery Björn Englund
2011-01-14 21:52 ` NeilBrown
2009-01-15 15:24 Jason Weber
