* Raid6 recovery
@ 2020-03-19 19:55 Glenn Greibesland
2020-03-20 19:15 ` Wols Lists
0 siblings, 1 reply; 13+ messages in thread
From: Glenn Greibesland @ 2020-03-19 19:55 UTC (permalink / raw)
To: linux-raid
Hi. I need some help with recovering from multiple disk failure on a
RAID6 array.
I had two failed disks and therefore shut down the server and
connected new disks.
After I powered on the server, another disk got booted out of the
array, leaving it with only 15 out of 18 working devices, so it won't
start.
I ran an offline test with smartctl and the disk that got thrown out
of the array seems totally fine.
Here is where I think I made a mistake: I used the --re-add command on
the disk. Now it is regarded as a spare, and the array still won't start.
I’ve been reading
https://raid.wiki.kernel.org/index.php/RAID_Recovery and I have tried
`--assemble --scan --force --verbose` and a manual `--assemble --force`
specifying each drive. Neither of them works (both report that 15 out of
18 devices is not enough).
All drives have the same event count and used dev size, but two of the
devices have a lower Avail Dev Size and a different Data Offset.
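For reference, the two Data Offsets that show up in the dry run further down work out to 128 MiB and 118 MiB of metadata reserve, which is why a single fixed offset can't fit all members. A quick sanity check, assuming the 512-byte logical sectors the drives report:

```python
# Convert the two data offsets (given in 512-byte sectors, the "s" suffix)
# to MiB. The 512-byte sector size is an assumption taken from the drives'
# reported logical sector size, not from the mdadm output itself.
SECTOR = 512  # bytes per logical sector

for offset_sectors in (262144, 241664):
    mib = offset_sectors * SECTOR / 2**20
    print(f"{offset_sectors}s = {mib:.0f} MiB")
```

This prints `262144s = 128 MiB` and `241664s = 118 MiB`, matching the two different offsets on the array members.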
After a bit of digging in the manual and on different forums I have
concluded that the next step for me is to recreate the array using
--assume-clean and --data-offset=variable.
I have tried a dry run of the command (answering no to “Continue
creating array”), and mdadm accepts the parameters without any errors:
mdadm --create --assume-clean --level=6 --raid-devices=18
--size=3906763776s --chunk=512K --data-offset=variable /dev/md0
/dev/sdj1:262144s /dev/sdk1:262144s /dev/sdi1:262144s
/dev/sdh1:262144s /dev/sdo1:262144s /dev/sdp1:262144s
/dev/sdr1:262144s /dev/sdq1:262144s /dev/sdf1:262144s
/dev/sdb1:262144s /dev/sdg1:262144s /dev/sdd1:262144s
/dev/sdm1:262144s /dev/sdf2:241664s missing missing /dev/sdc2:241664s
/dev/sdc1:262144s
mdadm: /dev/sdj1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdk1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdi1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdh1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdo1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdp1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdr1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdq1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdf1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdb1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: partition table exists on /dev/sdb1 but will be lost or
meaningless after creating array
mdadm: /dev/sdg1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdm1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdf2 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdc2 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
mdadm: /dev/sdc1 appears to be part of a raid array:
level=raid6 devices=18 ctime=Wed Nov 14 22:53:28 2012
Continue creating array? N
My only worries now are the size and data-offset parameters. According
to the man page, the size should be specified in kilobytes; it was
kibibytes previously.
The Used Device Size of all array members is 3906763776 sectors
(1862.89 GiB 2000.26 GB).
Should I convert the sectors into kilobytes, or does mdadm support
using sectors as the unit for --size and --data-offset? It is not
mentioned in the manual, but I’ve seen it used in different forum
threads, and mdadm does not blow up when I try it.
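If sectors turn out not to be accepted, the conversion is trivial: with 512-byte sectors there are two sectors per KiB. A quick check (assuming 512-byte sectors) that also reproduces the 1862.89 GiB figure above:

```python
# Used Dev Size from the member superblocks, in 512-byte sectors.
used_sectors = 3906763776

kib = used_sectors // 2           # two 512-byte sectors per KiB
gib = used_sectors * 512 / 2**30  # cross-check against mdadm's "1862.89 GiB"

print(kib)             # value to pass to --size if KiB is required
print(round(gib, 2))   # 1862.89
```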
Any other suggestions?
* Re: Raid6 recovery
From: Wols Lists @ 2020-03-20 19:15 UTC (permalink / raw)
To: Glenn Greibesland, linux-raid; +Cc: Phil Turmel, NeilBrown
On 19/03/20 19:55, Glenn Greibesland wrote:
> After a bit of digging in the manual and on different forums I have
> concluded that the next step for me is to recreate the array using
> --assume-clean and --data-offset=variable.
> I have tried a dry run of the command (answering no to “Continue
> creating array”), and mdadm accepts the parameters without any errors:
Oh my god NO!!!
Do NOT use --create unless someone rather more experienced than me tells
you to!!!
The obvious thing is to somehow get the sixteen drives that you know
should be okay reassembled in a forced manner. The --re-add should not
have done any real damage because, as mdadm keeps complaining, you
didn't have enough drives so it won't have touched the data on that
drive. Unfortunately, my fu isn't good enough to tell you how to get
that drive back in.
What's wrong with the two failed drives? Can you ddrescue them? They
might be enough to get you going again.
You say you've read the web page "Raid recovery" - which says it's
obsolete and points you at "When things go wrogn" - but you don't appear
to have read that! PLEASE read "asking for help" and in particular you
NEED to run lsdrv and give us that information. Without that, if you DO
run --create, you will be in for a world of hurt.
I know you may feel it's asking for loads of information, and the
resulting email will be massive, but trust me - the experts will look at
it and they will probably be able to come up with a plan of action. At
present, they don't have much to go on, and nor will you if you carry
on as you're going ...
Cheers,
Wol
* Re: Raid6 recovery
From: antlists @ 2020-03-21 0:06 UTC (permalink / raw)
To: Glenn Greibesland; +Cc: linux-raid, Phil Turmel, NeilBrown
On 20/03/2020 21:05, Glenn Greibesland wrote:
> fre. 20. mar. 2020 kl. 20:15 skrev Wols Lists <antlists@youngman.org.uk>:
>>
>> On 19/03/20 19:55, Glenn Greibesland wrote:
>>> After a bit of digging in the manual and on different forums I have
>>> concluded that the next step for me is to recreate the array using
>>> --assume-clean and --data-offset=variable.
>>> I have tried a dry run of the command (answering no to “Continue
>>> creating array”), and mdadm accepts the parameters without any errors:
>>
>> Oh my god NO!!!
>>
>> Do NOT use --create unless someone rather more experienced than me tells
>> you to!!!
>>
>> The obvious thing is to somehow get the sixteen drives that you know
>> should be okay, re-assembled in a forced manner. The --re-add should not
>> have done any real damage because, as mdadm keeps complaining, you
>> didn't have enough drives so it won't have touched the data on that
>> drive. Unfortunately, my fu isn't good enough to tell you how to get
>> that drive back in.
>>
>> What's wrong with the two failed drives? Can you ddrescue them? They
>> might be enough to get you going again.
>>
>> You say you've read the web page "Raid recovery" - which says it's
>> obsolete and points you at "When things go wrogn" - but you don't appear
>> to have read that! PLEASE read "asking for help" and in particular you
>> NEED to run lsdrv and give us that information. Without that, if you DO
>> run --create, you will be in for a world of hurt.
>>
>> I know you may feel it's asking for loads of information, and the
>> resulting email will be massive, but trust me - the experts will look at
>> it and they will probably be able to come up with a plan of action. At
>> present, they don't have much to go on, and nor will you if you carry
>> on as you're going ...
>> you're going ...
>>
>> Cheers,
>> Wol
>
> Thanks for replying to the thread.
>
> The two failed drives have "unreadable (pending) sectors", and they
> have a lower Event Count than the other disks, so that is why I've
> been trying to get the array up and running with the remaining 16
> disks that have the same Event Count.
>
> I concluded myself that --create --assume-clean had to be the only
> thing left to try; that's why I didn't provide any logs or info. Sorry
> about that, you are right, I should check if there are any other
> options first. I've been trying to get this array up and running again
> for quite some time, so I'm all ears if someone has some magic to try.
> Yesterday I read some of the source code of mdadm and sort of answered
> my own question. According to the source code, specifying sizes in
> sectors is supported. I'd still like some confirmation though (I'm
> talking about the parse_size function in util.c).
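For illustration, the suffix handling being described can be sketched like this. This is a loose Python rendering of the idea only, not a verbatim port of mdadm's util.c, and the exact behaviour is an assumption based on the poster's reading of the source:

```python
# Sketch of mdadm-style size-suffix parsing (assumption, not a verbatim
# port of parse_size() in util.c): a bare number is taken as KiB, the
# suffixes K/M/G scale it, and a trailing "s" means 512-byte sectors.
# The function returns the size in 512-byte sectors.
def parse_size(s: str) -> int:
    multipliers = {"K": 2, "M": 2048, "G": 2 * 1024 * 1024, "s": 1}
    if s and s[-1] in multipliers:
        return int(s[:-1]) * multipliers[s[-1]]
    return int(s) * 2  # no suffix: KiB, i.e. two sectors per unit

print(parse_size("3906763776s"))  # sectors pass through unchanged
print(parse_size("1953381888"))   # the same size expressed in KiB
```

Under these assumptions, both spellings resolve to the same sector count, which is consistent with mdadm accepting the "s" values in the dry run without complaint.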
>
> Here's some additional info:
>
> mdadm: added /dev/sdj1 to /dev/md/0 as 0
> mdadm: added /dev/sdk1 to /dev/md/0 as 1
> mdadm: added /dev/sdi1 to /dev/md/0 as 2
> mdadm: added /dev/sdh1 to /dev/md/0 as 3
> mdadm: added /dev/sdo1 to /dev/md/0 as 4
> mdadm: added /dev/sdp1 to /dev/md/0 as 5
> mdadm: added /dev/sdr1 to /dev/md/0 as 6
> mdadm: added /dev/sdq1 to /dev/md/0 as 7
> mdadm: added /dev/sdf1 to /dev/md/0 as 8
> mdadm: added /dev/sdb1 to /dev/md/0 as 9
> mdadm: added /dev/sdg1 to /dev/md/0 as -1 <<<< This is the drive
> that is now regarded as spare. It originally had slot 10 in the array
> mdadm: added /dev/sdd1 to /dev/md/0 as 11
> mdadm: added /dev/sdm1 to /dev/md/0 as 12
> mdadm: added /dev/sdf2 to /dev/md/0 as 13
> mdadm: added /dev/sdc2 to /dev/md/0 as 16
> mdadm: added /dev/sdc1 to /dev/md/0 as 17
>
>
>
> mdadm: no uptodate device for slot 10 of /dev/md/0 << sdg1
> mdadm: no uptodate device for slot 14 of /dev/md/0 << drive disconnected
> mdadm: no uptodate device for slot 15 of /dev/md/0 << drive disconnected
>
> mdadm: /dev/md/0 assembled from 15 drives and 1 spare - not enough to
> start the array.
>
> mdadm -D /dev/md0
> /dev/md0:
> Version : 1.2
> Raid Level : raid0
> Total Devices : 16
> Persistence : Superblock is persistent
>
> State : inactive
> Working Devices : 16
>
> Name : vm-test:0
> UUID : 45ced2f9:947773d4:106077ab:2df799d6
> Events : 1937517
>
> Number Major Minor RaidDevice
>
> - 8 17 - /dev/sdb1
> - 8 33 - /dev/sdc1
> - 8 34 - /dev/sdc2
What's this? Two partitions in the array on the same physical disk?
> - 8 49 - /dev/sdd1
> - 8 81 - /dev/sdf1
> - 8 82 - /dev/sdf2
And again?
> - 8 97 - /dev/sdg1
> - 8 113 - /dev/sdh1
> - 8 129 - /dev/sdi1
> - 8 145 - /dev/sdj1
> - 8 161 - /dev/sdk1
> - 8 193 - /dev/sdm1
> - 8 241 - /dev/sdp1
> - 65 1 - /dev/sdq1
> - 65 17 - /dev/sdr1
> - 65 33 - /dev/sds1
>
>
> SMART WRITE LOG does not return COUNT and LBA_LOW register
> SCT (Get) Error Recovery Control command failed
Which disk is this? No error recovery? BAD sign ...
>
> Device Statistics (GP/SMART Log 0x04) not supported
>
> SATA Phy Event Counters (GP Log 0x11)
> ID Size Value Description
> 0x0001 2 0 Command failed due to ICRC error
> 0x0002 2 0 R_ERR response for data FIS
> 0x0003 2 0 R_ERR response for device-to-host data FIS
> 0x0004 2 0 R_ERR response for host-to-device data FIS
> 0x0005 2 0 R_ERR response for non-data FIS
> 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> 0x0007 2 0 R_ERR response for host-to-device non-data FIS
> 0x0008 2 0 Device-to-host non-data FIS retries
> 0x0009 2 2 Transition from drive PhyRdy to drive PhyNRdy
> 0x000a 2 2 Device-to-host register FISes sent due to a COMRESET
> 0x000b 2 0 CRC errors within host-to-device FIS
> 0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
> 0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
> 0x8000 4 1208382 Vendor specific
>
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
>
>
>
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital Green
What's this?
> Device Model: WDC WD20EARX-00PASB0
> Serial Number: WD-WMAZA9538601
> LU WWN Device Id: 5 0014ee 15a0a4ffa
> Firmware Version: 51.0AB51
> User Capacity: 2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ATA8-ACS (minor revision not indicated)
> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
> Local Time is: Fri Mar 20 21:00:38 2020 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is: Unavailable
> APM feature is: Unavailable
> Rd look-ahead is: Enabled
> Write cache is: Enabled
> ATA Security is: Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (37200) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 359) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
> SCT capabilities: (0x3035) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
No mention of ERC - Bad sign ...
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
> 3 Spin_Up_Time POS--K 171 171 021 - 6416
> 4 Start_Stop_Count -O--CK 100 100 000 - 255
> 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
> 9 Power_On_Hours -O--CK 098 098 000 - 1583
> 10 Spin_Retry_Count -O--CK 100 100 000 - 0
> 11 Calibration_Retry_Count -O--CK 100 100 000 - 0
> 12 Power_Cycle_Count -O--CK 100 100 000 - 131
> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 61
> 193 Load_Cycle_Count -O--CK 191 191 000 - 29372
> 194 Temperature_Celsius -O---K 122 101 000 - 28
> 196 Reallocated_Event_Count -O--CK 200 200 000 - 0
> 197 Current_Pending_Sector -O--CK 200 200 000 - 0
> 198 Offline_Uncorrectable ----CK 200 200 000 - 0
> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
> 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
> ||||||_ K auto-keep
> |||||__ C event count
> ||||___ R error rate
> |||____ S speed/performance
> ||_____ O updated online
> |______ P prefailure warning
>
> General Purpose Log Directory Version 1
> SMART Log Directory Version 1 [multi-sector log support]
> Address Access R/W Size Description
> 0x00 GPL,SL R/O 1 Log Directory
> 0x01 SL R/O 1 Summary SMART error log
> 0x02 SL R/O 5 Comprehensive SMART error log
> 0x03 GPL R/O 6 Ext. Comprehensive SMART error log
> 0x06 SL R/O 1 SMART self-test log
> 0x07 GPL R/O 1 Extended self-test log
> 0x09 SL R/W 1 Selective self-test log
> 0x10 GPL R/O 1 SATA NCQ Queued Error log
> 0x11 GPL R/O 1 SATA Phy Event Counters log
> 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> 0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
> 0xa8-0xb7 GPL,SL VS 1 Device vendor specific log
> 0xbd GPL,SL VS 1 Device vendor specific log
> 0xc0 GPL,SL VS 1 Device vendor specific log
> 0xc1 GPL VS 93 Device vendor specific log
> 0xe0 GPL,SL R/W 1 SCT Command/Status
> 0xe1 GPL,SL R/W 1 SCT Data Transfer
>
> SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
> No Errors Logged
>
> SMART Extended Self-test Log Version: 1 (1 sectors)
> Num Test_Description Status Remaining
> LifeTime(hours) LBA_of_first_error
> # 1 Short offline Completed without error 00% 1245 -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> SCT Status Version: 3
> SCT Version (vendor specific): 258 (0x0102)
> SCT Support Level: 1
> Device State: Active (0)
> Current Temperature: 28 Celsius
> Power Cycle Min/Max Temperature: 8/43 Celsius
> Lifetime Min/Max Temperature: 0/49 Celsius
> Under/Over Temperature Limit Count: 0/0
>
> SCT Temperature History Version: 2
> Temperature Sampling Period: 1 minute
> Temperature Logging Interval: 1 minute
> Min/Max recommended Temperature: 0/60 Celsius
> Min/Max Temperature Limit: -41/85 Celsius
> Temperature History Size (Index): 478 (305)
>
> Index Estimated Time Temperature Celsius
> 306 2020-03-20 13:03 23 ****
> ... ..( 33 skipped). .. ****
> 340 2020-03-20 13:37 23 ****
> 341 2020-03-20 13:38 ? -
> 342 2020-03-20 13:39 23 ****
> 343 2020-03-20 13:40 23 ****
> 344 2020-03-20 13:41 24 *****
> 345 2020-03-20 13:42 25 ******
> 346 2020-03-20 13:43 25 ******
> 347 2020-03-20 13:44 25 ******
> 348 2020-03-20 13:45 26 *******
> ... ..( 2 skipped). .. *******
> 351 2020-03-20 13:48 26 *******
> 352 2020-03-20 13:49 27 ********
> 353 2020-03-20 13:50 27 ********
> 354 2020-03-20 13:51 28 *********
> 355 2020-03-20 13:52 28 *********
> 356 2020-03-20 13:53 22 ***
> ... ..(276 skipped). .. ***
> 155 2020-03-20 18:30 22 ***
> 156 2020-03-20 18:31 23 ****
> ... ..(148 skipped). .. ****
> 305 2020-03-20 21:00 23 ****
>
> SCT Error Recovery Control command not supported
Yup. Ouch!
>
> Device Statistics (GP/SMART Log 0x04) not supported
>
> SATA Phy Event Counters (GP Log 0x11)
> ID Size Value Description
> 0x0001 2 0 Command failed due to ICRC error
> 0x0002 2 0 R_ERR response for data FIS
> 0x0003 2 0 R_ERR response for device-to-host data FIS
> 0x0004 2 0 R_ERR response for host-to-device data FIS
> 0x0005 2 0 R_ERR response for non-data FIS
> 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> 0x0007 2 0 R_ERR response for host-to-device non-data FIS
> 0x000a 2 5 Device-to-host register FISes sent due to a COMRESET
> 0x000b 2 0 CRC errors within host-to-device FIS
> 0x8000 4 1208379 Vendor specific
>
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital Red
> Device Model: WDC WD20EFRX-68AX9N0
> Serial Number: WD-WMC300320657
> LU WWN Device Id: 5 0014ee 0ae1ee098
> Firmware Version: 80.00A80
> User Capacity: 2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ACS-2 (minor revision not indicated)
> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Fri Mar 20 21:00:38 2020 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is: Unavailable
> APM feature is: Unavailable
> Rd look-ahead is: Enabled
> Write cache is: Enabled
> ATA Security is: Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Unknown
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x00) Offline data collection activity
> was never started.
> Auto Offline Data Collection: Disabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (27120) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 274) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
> SCT capabilities: (0x70bd) SCT Status supported.
> SCT Error Recovery Control supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
> 3 Spin_Up_Time POS--K 176 169 021 - 4183
> 4 Start_Stop_Count -O--CK 100 100 000 - 502
> 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
> 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
> 9 Power_On_Hours -O--CK 061 061 000 - 28588
> 10 Spin_Retry_Count -O--CK 100 100 000 - 0
> 11 Calibration_Retry_Count -O--CK 100 100 000 - 0
> 12 Power_Cycle_Count -O--CK 100 100 000 - 490
> 192 Power-Off_Retract_Count -O--CK 200 200 000 - 483
> 193 Load_Cycle_Count -O--CK 200 200 000 - 18
> 194 Temperature_Celsius -O---K 120 089 000 - 27
> 196 Reallocated_Event_Count -O--CK 200 200 000 - 0
> 197 Current_Pending_Sector -O--CK 200 200 000 - 0
> 198 Offline_Uncorrectable ----CK 100 253 000 - 0
> 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
> 200 Multi_Zone_Error_Rate ---R-- 100 253 000 - 0
> ||||||_ K auto-keep
> |||||__ C event count
> ||||___ R error rate
> |||____ S speed/performance
> ||_____ O updated online
> |______ P prefailure warning
>
> General Purpose Log Directory Version 1
> SMART Log Directory Version 1 [multi-sector log support]
> Address Access R/W Size Description
> 0x00 GPL,SL R/O 1 Log Directory
> 0x01 SL R/O 1 Summary SMART error log
> 0x02 SL R/O 5 Comprehensive SMART error log
> 0x03 GPL R/O 6 Ext. Comprehensive SMART error log
> 0x06 SL R/O 1 SMART self-test log
> 0x07 GPL R/O 1 Extended self-test log
> 0x09 SL R/W 1 Selective self-test log
> 0x10 GPL R/O 1 SATA NCQ Queued Error log
> 0x11 GPL R/O 1 SATA Phy Event Counters log
> 0x21 GPL R/O 1 Write stream error log
> 0x22 GPL R/O 1 Read stream error log
> 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> 0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
> 0xa8-0xb7 GPL,SL VS 1 Device vendor specific log
> 0xbd GPL,SL VS 1 Device vendor specific log
> 0xc0 GPL,SL VS 1 Device vendor specific log
> 0xc1 GPL VS 93 Device vendor specific log
> 0xe0 GPL,SL R/W 1 SCT Command/Status
> 0xe1 GPL,SL R/W 1 SCT Data Transfer
>
> SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
> No Errors Logged
>
> SMART Extended Self-test Log Version: 1 (1 sectors)
> Num Test_Description Status Remaining
> LifeTime(hours) LBA_of_first_error
> # 1 Short offline Completed without error 00% 26024 -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> SCT Status Version: 3
> SCT Version (vendor specific): 258 (0x0102)
> SCT Support Level: 1
> Device State: Active (0)
> Current Temperature: 27 Celsius
> Power Cycle Min/Max Temperature: 10/32 Celsius
> Lifetime Min/Max Temperature: 2/58 Celsius
> Under/Over Temperature Limit Count: 0/0
> Vendor specific:
> 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> SCT Temperature History Version: 2
> Temperature Sampling Period: 1 minute
> Temperature Logging Interval: 1 minute
> Min/Max recommended Temperature: 0/60 Celsius
> Min/Max Temperature Limit: -41/85 Celsius
> Temperature History Size (Index): 478 (56)
>
> Index Estimated Time Temperature Celsius
> 57 2020-03-20 13:03 24 *****
> ... ..(377 skipped). .. *****
> 435 2020-03-20 19:21 24 *****
> 436 2020-03-20 19:22 ? -
> 437 2020-03-20 19:23 24 *****
> 438 2020-03-20 19:24 25 ******
> ... ..( 3 skipped). .. ******
> 442 2020-03-20 19:28 25 ******
> 443 2020-03-20 19:29 26 *******
> 444 2020-03-20 19:30 26 *******
> 445 2020-03-20 19:31 26 *******
> 446 2020-03-20 19:32 27 ********
> ... ..( 3 skipped). .. ********
> 450 2020-03-20 19:36 27 ********
> 451 2020-03-20 19:37 24 *****
> ... ..( 82 skipped). .. *****
> 56 2020-03-20 21:00 24 *****
>
> SCT Error Recovery Control:
> Read: Disabled
> Write: Disabled
What's going on here? We have a RED drive, but ERC isn't working ...
>
> Device Statistics (GP/SMART Log 0x04) not supported
>
> SATA Phy Event Counters (GP Log 0x11)
> ID Size Value Description
> 0x0001 2 0 Command failed due to ICRC error
> 0x0002 2 0 R_ERR response for data FIS
> 0x0003 2 0 R_ERR response for device-to-host data FIS
> 0x0004 2 0 R_ERR response for host-to-device data FIS
> 0x0005 2 0 R_ERR response for non-data FIS
> 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> 0x0007 2 0 R_ERR response for host-to-device non-data FIS
> 0x0008 2 0 Device-to-host non-data FIS retries
> 0x0009 2 33 Transition from drive PhyRdy to drive PhyNRdy
> 0x000a 2 34 Device-to-host register FISes sent due to a COMRESET
> 0x000b 2 0 CRC errors within host-to-device FIS
> 0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
> 0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
> 0x8000 4 1208361 Vendor specific
>
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Device Model: ST4000VN008-2DR166
> Serial Number: ZDH82183
> LU WWN Device Id: 5 000c50 0c37c42c0
> Firmware Version: SC60
> User Capacity: 4,000,787,030,016 bytes [4.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 5980 rpm
> Form Factor: 3.5 inches
> Device is: Not in smartctl database [for details use: -P showall]
> ATA Version is: ACS-3 T13/2161-D revision 5
> SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Fri Mar 20 21:00:38 2020 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is: Unavailable
> APM level is: 254 (maximum performance)
> Rd look-ahead is: Enabled
> Write cache is: Enabled
> ATA Security is: Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Unknown
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: ( 581) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 1) minutes.
> Extended self-test routine
> recommended polling time: ( 621) minutes.
> Conveyance self-test routine
> recommended polling time: ( 2) minutes.
> SCT capabilities: (0x50bd) SCT Status supported.
> SCT Error Recovery Control supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> 1 Raw_Read_Error_Rate POSR-- 070 065 044 - 10856451
> 3 Spin_Up_Time PO---- 094 094 000 - 0
> 4 Start_Stop_Count -O--CK 100 100 020 - 53
> 5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
> 7 Seek_Error_Rate POSR-- 075 061 045 - 29667756
> 9 Power_On_Hours -O--CK 100 100 000 - 506 (130 79 0)
> 10 Spin_Retry_Count PO--C- 100 100 097 - 0
> 12 Power_Cycle_Count -O--CK 100 100 020 - 5
> 184 End-to-End_Error -O--CK 100 100 099 - 0
> 187 Reported_Uncorrect -O--CK 100 100 000 - 0
> 188 Command_Timeout -O--CK 098 098 000 - 65538
> 189 High_Fly_Writes -O-RCK 100 100 000 - 0
> 190 Airflow_Temperature_Cel -O---K 076 070 040 - 24 (Min/Max 9/26)
> 191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
> 192 Power-Off_Retract_Count -O--CK 100 100 000 - 44
> 193 Load_Cycle_Count -O--CK 100 100 000 - 284
> 194 Temperature_Celsius -O---K 024 040 000 - 24 (0 9 0 0 0)
> 197 Current_Pending_Sector -O--C- 100 100 000 - 0
> 198 Offline_Uncorrectable ----C- 100 100 000 - 0
> 199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
> 240 Head_Flying_Hours ------ 100 253 000 - 139 (51 45 0)
> 241 Total_LBAs_Written ------ 100 253 000 - 8177237744
> 242 Total_LBAs_Read ------ 100 253 000 - 5818370819
> ||||||_ K auto-keep
> |||||__ C event count
> ||||___ R error rate
> |||____ S speed/performance
> ||_____ O updated online
> |______ P prefailure warning
>
> General Purpose Log Directory Version 1
> SMART Log Directory Version 1 [multi-sector log support]
> Address Access R/W Size Description
> 0x00 GPL,SL R/O 1 Log Directory
> 0x01 SL R/O 1 Summary SMART error log
> 0x02 SL R/O 5 Comprehensive SMART error log
> 0x03 GPL R/O 5 Ext. Comprehensive SMART error log
> 0x04 GPL,SL R/O 8 Device Statistics log
> 0x06 SL R/O 1 SMART self-test log
> 0x07 GPL R/O 1 Extended self-test log
> 0x09 SL R/W 1 Selective self-test log
> 0x10 GPL R/O 1 SATA NCQ Queued Error log
> 0x11 GPL R/O 1 SATA Phy Event Counters log
> 0x13 GPL R/O 1 SATA NCQ Send and Receive log
> 0x15 GPL R/W 1 SATA Rebuild Assist log
> 0x21 GPL R/O 1 Write stream error log
> 0x22 GPL R/O 1 Read stream error log
> 0x24 GPL R/O 512 Current Device Internal Status Data log
> 0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
> 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> 0xa1 GPL,SL VS 24 Device vendor specific log
> 0xa2 GPL VS 8160 Device vendor specific log
> 0xa6 GPL VS 192 Device vendor specific log
> 0xa8-0xa9 GPL,SL VS 136 Device vendor specific log
> 0xab GPL VS 1 Device vendor specific log
> 0xb0 GPL VS 9048 Device vendor specific log
> 0xbe-0xbf GPL VS 65535 Device vendor specific log
> 0xc1 GPL,SL VS 16 Device vendor specific log
> 0xd1 GPL VS 136 Device vendor specific log
> 0xd2 GPL VS 10000 Device vendor specific log
> 0xd3 GPL VS 1920 Device vendor specific log
> 0xe0 GPL,SL R/W 1 SCT Command/Status
> 0xe1 GPL,SL R/W 1 SCT Data Transfer
>
> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> No Errors Logged
>
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged. [To run self-tests, use: smartctl -t]
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> SCT Status Version: 3
> SCT Version (vendor specific): 522 (0x020a)
> SCT Support Level: 1
> Device State: Active (0)
> Current Temperature: 23 Celsius
> Power Cycle Min/Max Temperature: 8/26 Celsius
> Lifetime Min/Max Temperature: 8/30 Celsius
> Under/Over Temperature Limit Count: 0/336
>
> SCT Temperature History Version: 2
> Temperature Sampling Period: 3 minutes
> Temperature Logging Interval: 59 minutes
> Min/Max recommended Temperature: 0/ 0 Celsius
> Min/Max Temperature Limit: 0/ 0 Celsius
> Temperature History Size (Index): 128 (119)
>
> Index Estimated Time Temperature Celsius
> 120 2020-03-15 16:02 21 **
> ... ..( 5 skipped). .. **
> 126 2020-03-15 21:56 21 **
> 127 2020-03-15 22:55 22 ***
> ... ..( 16 skipped). .. ***
> 16 2020-03-16 15:38 22 ***
> 17 2020-03-16 16:37 23 ****
> ... ..( 3 skipped). .. ****
> 21 2020-03-16 20:33 23 ****
> 22 2020-03-16 21:32 24 *****
> 23 2020-03-16 22:31 23 ****
> 24 2020-03-16 23:30 24 *****
> 25 2020-03-17 00:29 24 *****
> 26 2020-03-17 01:28 24 *****
> 27 2020-03-17 02:27 23 ****
> ... ..( 7 skipped). .. ****
> 35 2020-03-17 10:19 23 ****
> 36 2020-03-17 11:18 22 ***
> ... ..( 3 skipped). .. ***
> 40 2020-03-17 15:14 22 ***
> 41 2020-03-17 16:13 23 ****
> ... ..( 14 skipped). .. ****
> 56 2020-03-18 06:58 23 ****
> 57 2020-03-18 07:57 22 ***
> ... ..( 2 skipped). .. ***
> 60 2020-03-18 10:54 22 ***
> 61 2020-03-18 11:53 21 **
> 62 2020-03-18 12:52 20 *
> 63 2020-03-18 13:51 21 **
> 64 2020-03-18 14:50 20 *
> 65 2020-03-18 15:49 20 *
> 66 2020-03-18 16:48 21 **
> ... ..( 5 skipped). .. **
> 72 2020-03-18 22:42 21 **
> 73 2020-03-18 23:41 24 *****
> 74 2020-03-19 00:40 26 *******
> ... ..( 2 skipped). .. *******
> 77 2020-03-19 03:37 26 *******
> 78 2020-03-19 04:36 22 ***
> ... ..( 2 skipped). .. ***
> 81 2020-03-19 07:33 22 ***
> 82 2020-03-19 08:32 21 **
> 83 2020-03-19 09:31 22 ***
> 84 2020-03-19 10:30 22 ***
> 85 2020-03-19 11:29 21 **
> ... ..( 2 skipped). .. **
> 88 2020-03-19 14:26 21 **
> 89 2020-03-19 15:25 25 ******
> 90 2020-03-19 16:24 25 ******
> 91 2020-03-19 17:23 26 *******
> 92 2020-03-19 18:22 25 ******
> 93 2020-03-19 19:21 22 ***
> ... ..( 3 skipped). .. ***
> 97 2020-03-19 23:17 22 ***
> 98 2020-03-20 00:16 21 **
> ... ..( 4 skipped). .. **
> 103 2020-03-20 05:11 21 **
> 104 2020-03-20 06:10 20 *
> ... ..( 11 skipped). .. *
> 116 2020-03-20 17:58 20 *
> 117 2020-03-20 18:57 21 **
> 118 2020-03-20 19:56 21 **
> 119 2020-03-20 20:55 21 **
>
> SCT Error Recovery Control:
> Read: Disabled
> Write: Disabled
OUCH! AGAIN!
>
> Device Statistics (GP Log 0x04)
> Page Offset Size Value Flags Description
> 0x01 ===== = = === == General Statistics (rev 1) ==
> 0x01 0x008 4 5 --- Lifetime Power-On Resets
> 0x01 0x010 4 506 --- Power-on Hours
> 0x01 0x018 6 8177237744 --- Logical Sectors Written
> 0x01 0x020 6 32254131 --- Number of Write Commands
> 0x01 0x028 6 5818370805 --- Logical Sectors Read
> 0x01 0x030 6 24397122 --- Number of Read Commands
> 0x01 0x038 6 - --- Date and Time TimeStamp
> 0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
> 0x03 0x008 4 159 --- Spindle Motor Power-on Hours
> 0x03 0x010 4 10 --- Head Flying Hours
> 0x03 0x018 4 284 --- Head Load Events
> 0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
> 0x03 0x028 4 0 --- Read Recovery Attempts
> 0x03 0x030 4 0 --- Number of Mechanical Start Failures
> 0x03 0x038 4 0 --- Number of Realloc. Candidate
> Logical Sectors
> 0x03 0x040 4 45 --- Number of High Priority Unload Events
> 0x04 ===== = = === == General Errors Statistics (rev 1) ==
> 0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors
> 0x04 0x010 4 2 --- Resets Between Cmd Acceptance and
> Completion
> 0x05 ===== = = === == Temperature Statistics (rev 1) ==
> 0x05 0x008 1 23 --- Current Temperature
> 0x05 0x010 1 20 --- Average Short Term Temperature
> 0x05 0x018 1 - --- Average Long Term Temperature
> 0x05 0x020 1 30 --- Highest Temperature
> 0x05 0x028 1 0 --- Lowest Temperature
> 0x05 0x030 1 27 --- Highest Average Short Term Temperature
> 0x05 0x038 1 14 --- Lowest Average Short Term Temperature
> 0x05 0x040 1 - --- Highest Average Long Term Temperature
> 0x05 0x048 1 - --- Lowest Average Long Term Temperature
> 0x05 0x050 4 0 --- Time in Over-Temperature
> 0x05 0x058 1 70 --- Specified Maximum Operating Temperature
> 0x05 0x060 4 0 --- Time in Under-Temperature
> 0x05 0x068 1 0 --- Specified Minimum Operating Temperature
> 0x06 ===== = = === == Transport Statistics (rev 1) ==
> 0x06 0x008 4 101 --- Number of Hardware Resets
> 0x06 0x010 4 17 --- Number of ASR Events
> 0x06 0x018 4 0 --- Number of Interface CRC Errors
> |||_ C monitored condition met
> ||__ D supports DSN
> |___ N normalized value
>
> SATA Phy Event Counters (GP Log 0x11)
> ID Size Value Description
> 0x000a 2 34 Device-to-host register FISes sent due to a COMRESET
> 0x0001 2 0 Command failed due to ICRC error
> 0x0003 2 0 R_ERR response for device-to-host data FIS
> 0x0004 2 0 R_ERR response for host-to-device data FIS
> 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> 0x0007 2 0 R_ERR response for host-to-device non-data FIS
>
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
Oh My God.
This array is just asking for disaster. Whoops, you've just had one, sorry.
I'm looking for details of your two failed drives, but I don't seem able
to find any. But as soon as you can get the array back, you need to fix
those problems ASAP!!!
Firstly, get rid of that Green!!! Were the two failed drives greens?
Read the timeout page to find out why.
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
That will hopefully also fix the problem with those Reds with ERC
disabled. It would not surprise me in the slightest if this is what has
done the damage to your array.
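In practical terms, the check the timeout-mismatch page describes boils down to something like the following (a sketch only - the device names are examples, and not every drive supports SCT ERC, as the smartctl dumps in this thread show):

```shell
# Report SCT Error Recovery Control settings for every drive
# (values are in deciseconds: 70 means the drive gives up after 7.0 s)
for d in /dev/sd[a-r]; do
    echo "== $d =="
    smartctl -l scterc "$d"
done

# On drives that support it, cap error recovery at 7 seconds
smartctl -l scterc,70,70 /dev/sdX

# On drives that cannot (e.g. the WD Greens), raise the kernel's
# command timeout instead so md doesn't kick the drive mid-recovery
echo 180 > /sys/block/sdX/device/timeout
```

Neither setting survives a reboot, so they normally go in a boot script or udev rule.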
Lastly, those ST4000s. Are they Ironwolves? I guess they're good drives,
but they've just trashed your raid-6 redundancy - lose just one of them
and your array is teetering on the edge. You need to get your sdX2
partitions copied onto new drives ASAP.
What I'd do is get a couple more ST4000s, and use them, creating 4TB
partitions. Then take your existing ST4000s, and convert them to 4TB
partitions. At which point you only need five more ST4000s to move your
array on to new drives.
I'm not sure how you get there - once you've got your nine 4TB drives you
*may* be able to just fail and remove the remaining 2TB drives.
Otherwise, I'd use the freed-up 2TB drives to create 4TB raid-0s. You'd
end up having to buy a couple of spare 4TB drives to move the entire
array on to 4TB "drives", but then you could remove the raid-0 arrays.
Cheers,
Wol
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Raid6 recovery
2020-03-21 0:06 ` antlists
@ 2020-03-21 11:54 ` Glenn Greibesland
2020-03-21 19:24 ` Phil Turmel
0 siblings, 1 reply; 13+ messages in thread
From: Glenn Greibesland @ 2020-03-21 11:54 UTC (permalink / raw)
To: antlists; +Cc: linux-raid, Phil Turmel, NeilBrown
Yes, I am aware of the problems with WD Green and with having multiple
partitions on a single 4TB disk. I am in the middle of getting rid of
the old disks, and I have enough new drives to stop having multiple
partitions on single drives, but not enough power and free SATA ports.
It is just a temporary solution. That is also why I did not include
many details in the original post; I knew they would just distract from
the problem I want to solve right away.
What I need help with now is just getting the array started with 16 of
the 18 disks. Then I can continue migrating data and replacing old
disks as planned.
When I built the array in 2012, I used WD Greens. They turned out to be
horrible disks and I have since replaced some of them with WD Reds. The
newest disks I've bought are Ironwolves.
lør. 21. mar. 2020 kl. 01:06 skrev antlists <antlists@youngman.org.uk>:
>
> On 20/03/2020 21:05, Glenn Greibesland wrote:
> > fre. 20. mar. 2020 kl. 20:15 skrev Wols Lists <antlists@youngman.org.uk>:
> >>
> >> On 19/03/20 19:55, Glenn Greibesland wrote:
> >>> After a bit of digging in the manual and on different forums I have
> >>> concluded that the next step for me is to recreate the array using
> >>> –assume-clean and –data-offset=variable.
> >>> I have tried a dry run of the command (answering no to “Continue
> >>> creating array”), and mdadm accepts the parameters without any errors:
> >>
> >> Oh my god NO!!!
> >>
> >> Do NOT use --create unless someone rather more experienced than me tells
> >> you to!!!
> >>
> >> The obvious thing is to somehow get the sixteen drives that you know
> >> should be okay, re-assembled in a forced manner. The --re-add should not
> >> have done any real damage because, as mdadm keeps complaining, you
> >> didn't have enough drives so it won't have touched the data on that
> >> drive. Unfortunately, my fu isn't good enough to tell you how to get
> >> that drive back in.
> >>
> >> What's wrong with the two failed drives? Can you ddrescue them? They
> >> might be enough to get you going again.
> >>
> >> You say you've read the web page "Raid recovery" - which says it's
> >> obsolete and points you at "When things go wrogn" - but you don't appear
> >> to have read that! PLEASE read "asking for help" and in particular you
> >> NEED to run lsdrv and give us that information. Without that, if you DO
> >> run --create, you will be in for a world of hurt.
> >>
> >> I know you may feel it's asking for loads of information, and the
> >> resulting email will be massive, but trust me - the experts will look at
> >> it and they will probably be able to come up with a plan of action. At
> >> present, they don't have much to go on, and nor will you if you carry
> >> on as you're going ...
> >>
> >> Cheers,
> >> Wol
> >
> > Thanks for replying to the thread.
> >
> > The two failed drives have "unreadable (pending) sectors", and they
> > have a lower Event Count than the other disks, so that is why I've
> > been trying to get the array up and running with the remaining 16
> > disks that have the same Event Count.
> >
> > I concluded myself that --create --assume-clean had to be the only
> > thing left to try; that's why I didn't provide any logs or info. Sorry
> > about that, you are right, I should check whether there are any other
> > options first. I've been trying to get this array up and running again
> > for quite some time, so I'm all ears if someone has some magic to try.
> > Yesterday I read some of the source code of mdadm and sort of answered
> > my own question. According to the source code, specifying sizes in
> > sectors is supported. I'd still like some confirmation though (I am
> > talking about the parse_size function in util.c).
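The suffix rule being described can be restated as a tiny shell function. This is a hypothetical paraphrase of the behaviour, not mdadm's actual code: a trailing "s" means the value is already in 512-byte sectors, K/M/G are binary units, and a bare number is taken as KiB.

```shell
# Hypothetical paraphrase of mdadm's parse_size(): convert a size
# argument to 512-byte sectors.
to_sectors() {
    case "$1" in
        *s) echo "${1%s}" ;;                 # already in sectors
        *K) echo $(( ${1%K} * 2 )) ;;        # 1 KiB = 2 sectors
        *M) echo $(( ${1%M} * 2048 )) ;;     # 1 MiB = 2048 sectors
        *G) echo $(( ${1%G} * 2097152 )) ;;  # 1 GiB = 2097152 sectors
        *)  echo $(( $1 * 2 )) ;;            # bare number = KiB
    esac
}
```

Under this reading, `262144s` in the dry-run command line is 262144 sectors (128 MiB) of data offset.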
> >
> > Here's some additional info:
> >
> > mdadm: added /dev/sdj1 to /dev/md/0 as 0
> > mdadm: added /dev/sdk1 to /dev/md/0 as 1
> > mdadm: added /dev/sdi1 to /dev/md/0 as 2
> > mdadm: added /dev/sdh1 to /dev/md/0 as 3
> > mdadm: added /dev/sdo1 to /dev/md/0 as 4
> > mdadm: added /dev/sdp1 to /dev/md/0 as 5
> > mdadm: added /dev/sdr1 to /dev/md/0 as 6
> > mdadm: added /dev/sdq1 to /dev/md/0 as 7
> > mdadm: added /dev/sdf1 to /dev/md/0 as 8
> > mdadm: added /dev/sdb1 to /dev/md/0 as 9
> > mdadm: added /dev/sdg1 to /dev/md/0 as -1 <<<< This is the drive
> > that is now regarded as spare. It originally had slot 10 in the array
> > mdadm: added /dev/sdd1 to /dev/md/0 as 11
> > mdadm: added /dev/sdm1 to /dev/md/0 as 12
> > mdadm: added /dev/sdf2 to /dev/md/0 as 13
> > mdadm: added /dev/sdc2 to /dev/md/0 as 16
> > mdadm: added /dev/sdc1 to /dev/md/0 as 17
> >
> >
> >
> > mdadm: no uptodate device for slot 10 of /dev/md/0 << sdg1
> > mdadm: no uptodate device for slot 14 of /dev/md/0 << drive disconnected
> > mdadm: no uptodate device for slot 15 of /dev/md/0 << drive disconnected
> >
> > mdadm: /dev/md/0 assembled from 15 drives and 1 spare - not enough to
> > start the array.
> >
> > mdadm -D /dev/md0
> > /dev/md0:
> > Version : 1.2
> > Raid Level : raid0
> > Total Devices : 16
> > Persistence : Superblock is persistent
> >
> > State : inactive
> > Working Devices : 16
> >
> > Name : vm-test:0
> > UUID : 45ced2f9:947773d4:106077ab:2df799d6
> > Events : 1937517
> >
> > Number Major Minor RaidDevice
> >
> > - 8 17 - /dev/sdb1
> > - 8 33 - /dev/sdc1
> > - 8 34 - /dev/sdc2
>
> What's this? Two partitions in the array on the same physical disk?
>
> > - 8 49 - /dev/sdd1
> > - 8 81 - /dev/sdf1
> > - 8 82 - /dev/sdf2
>
> And again?
>
> > - 8 97 - /dev/sdg1
> > - 8 113 - /dev/sdh1
> > - 8 129 - /dev/sdi1
> > - 8 145 - /dev/sdj1
> > - 8 161 - /dev/sdk1
> > - 8 193 - /dev/sdm1
> > - 8 241 - /dev/sdp1
> > - 65 1 - /dev/sdq1
> > - 65 17 - /dev/sdr1
> > - 65 33 - /dev/sds1
> >
>
>
> >
> > SMART WRITE LOG does not return COUNT and LBA_LOW register
> > SCT (Get) Error Recovery Control command failed
>
> Which disk is this? No error recovery? BAD sign ...
> >
> > Device Statistics (GP/SMART Log 0x04) not supported
> >
> > SATA Phy Event Counters (GP Log 0x11)
> > ID Size Value Description
> > 0x0001 2 0 Command failed due to ICRC error
> > 0x0002 2 0 R_ERR response for data FIS
> > 0x0003 2 0 R_ERR response for device-to-host data FIS
> > 0x0004 2 0 R_ERR response for host-to-device data FIS
> > 0x0005 2 0 R_ERR response for non-data FIS
> > 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> > 0x0007 2 0 R_ERR response for host-to-device non-data FIS
> > 0x0008 2 0 Device-to-host non-data FIS retries
> > 0x0009 2 2 Transition from drive PhyRdy to drive PhyNRdy
> > 0x000a 2 2 Device-to-host register FISes sent due to a COMRESET
> > 0x000b 2 0 CRC errors within host-to-device FIS
> > 0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
> > 0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
> > 0x8000 4 1208382 Vendor specific
> >
> > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> >
>
>
> >
> > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> >
>
>
> >
> > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Model Family: Western Digital Green
>
> What's this?
>
> > Device Model: WDC WD20EARX-00PASB0
> > Serial Number: WD-WMAZA9538601
> > LU WWN Device Id: 5 0014ee 15a0a4ffa
> > Firmware Version: 51.0AB51
> > User Capacity: 2,000,398,934,016 bytes [2.00 TB]
> > Sector Sizes: 512 bytes logical, 4096 bytes physical
> > Device is: In smartctl database [for details use: -P show]
> > ATA Version is: ATA8-ACS (minor revision not indicated)
> > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
> > Local Time is: Fri Mar 20 21:00:38 2020 CET
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> > AAM feature is: Unavailable
> > APM feature is: Unavailable
> > Rd look-ahead is: Enabled
> > Write cache is: Enabled
> > ATA Security is: Disabled, NOT FROZEN [SEC1]
> > Wt Cache Reorder: Enabled
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status: (0x84) Offline data collection activity
> > was suspended by an interrupting command from host.
> > Auto Offline Data Collection: Enabled.
> > Self-test execution status: ( 0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: (37200) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities: (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability: (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: ( 2) minutes.
> > Extended self-test routine
> > recommended polling time: ( 359) minutes.
> > Conveyance self-test routine
> > recommended polling time: ( 5) minutes.
> > SCT capabilities: (0x3035) SCT Status supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
>
> No mention of ERC - Bad sign ...
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> > 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
> > 3 Spin_Up_Time POS--K 171 171 021 - 6416
> > 4 Start_Stop_Count -O--CK 100 100 000 - 255
> > 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
> > 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
> > 9 Power_On_Hours -O--CK 098 098 000 - 1583
> > 10 Spin_Retry_Count -O--CK 100 100 000 - 0
> > 11 Calibration_Retry_Count -O--CK 100 100 000 - 0
> > 12 Power_Cycle_Count -O--CK 100 100 000 - 131
> > 192 Power-Off_Retract_Count -O--CK 200 200 000 - 61
> > 193 Load_Cycle_Count -O--CK 191 191 000 - 29372
> > 194 Temperature_Celsius -O---K 122 101 000 - 28
> > 196 Reallocated_Event_Count -O--CK 200 200 000 - 0
> > 197 Current_Pending_Sector -O--CK 200 200 000 - 0
> > 198 Offline_Uncorrectable ----CK 200 200 000 - 0
> > 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
> > 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
> > ||||||_ K auto-keep
> > |||||__ C event count
> > ||||___ R error rate
> > |||____ S speed/performance
> > ||_____ O updated online
> > |______ P prefailure warning
> >
> > General Purpose Log Directory Version 1
> > SMART Log Directory Version 1 [multi-sector log support]
> > Address Access R/W Size Description
> > 0x00 GPL,SL R/O 1 Log Directory
> > 0x01 SL R/O 1 Summary SMART error log
> > 0x02 SL R/O 5 Comprehensive SMART error log
> > 0x03 GPL R/O 6 Ext. Comprehensive SMART error log
> > 0x06 SL R/O 1 SMART self-test log
> > 0x07 GPL R/O 1 Extended self-test log
> > 0x09 SL R/W 1 Selective self-test log
> > 0x10 GPL R/O 1 SATA NCQ Queued Error log
> > 0x11 GPL R/O 1 SATA Phy Event Counters log
> > 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> > 0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
> > 0xa8-0xb7 GPL,SL VS 1 Device vendor specific log
> > 0xbd GPL,SL VS 1 Device vendor specific log
> > 0xc0 GPL,SL VS 1 Device vendor specific log
> > 0xc1 GPL VS 93 Device vendor specific log
> > 0xe0 GPL,SL R/W 1 SCT Command/Status
> > 0xe1 GPL,SL R/W 1 SCT Data Transfer
> >
> > SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
> > No Errors Logged
> >
> > SMART Extended Self-test Log Version: 1 (1 sectors)
> > Num Test_Description Status Remaining
> > LifeTime(hours) LBA_of_first_error
> > # 1 Short offline Completed without error 00% 1245 -
> >
> > SMART Selective self-test log data structure revision number 1
> > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> > 1 0 0 Not_testing
> > 2 0 0 Not_testing
> > 3 0 0 Not_testing
> > 4 0 0 Not_testing
> > 5 0 0 Not_testing
> > Selective self-test flags (0x0):
> > After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > SCT Status Version: 3
> > SCT Version (vendor specific): 258 (0x0102)
> > SCT Support Level: 1
> > Device State: Active (0)
> > Current Temperature: 28 Celsius
> > Power Cycle Min/Max Temperature: 8/43 Celsius
> > Lifetime Min/Max Temperature: 0/49 Celsius
> > Under/Over Temperature Limit Count: 0/0
> >
> > SCT Temperature History Version: 2
> > Temperature Sampling Period: 1 minute
> > Temperature Logging Interval: 1 minute
> > Min/Max recommended Temperature: 0/60 Celsius
> > Min/Max Temperature Limit: -41/85 Celsius
> > Temperature History Size (Index): 478 (305)
> >
> > Index Estimated Time Temperature Celsius
> > 306 2020-03-20 13:03 23 ****
> > ... ..( 33 skipped). .. ****
> > 340 2020-03-20 13:37 23 ****
> > 341 2020-03-20 13:38 ? -
> > 342 2020-03-20 13:39 23 ****
> > 343 2020-03-20 13:40 23 ****
> > 344 2020-03-20 13:41 24 *****
> > 345 2020-03-20 13:42 25 ******
> > 346 2020-03-20 13:43 25 ******
> > 347 2020-03-20 13:44 25 ******
> > 348 2020-03-20 13:45 26 *******
> > ... ..( 2 skipped). .. *******
> > 351 2020-03-20 13:48 26 *******
> > 352 2020-03-20 13:49 27 ********
> > 353 2020-03-20 13:50 27 ********
> > 354 2020-03-20 13:51 28 *********
> > 355 2020-03-20 13:52 28 *********
> > 356 2020-03-20 13:53 22 ***
> > ... ..(276 skipped). .. ***
> > 155 2020-03-20 18:30 22 ***
> > 156 2020-03-20 18:31 23 ****
> > ... ..(148 skipped). .. ****
> > 305 2020-03-20 21:00 23 ****
> >
> > SCT Error Recovery Control command not supported
>
> Yup. Ouch!
> >
> > Device Statistics (GP/SMART Log 0x04) not supported
> >
> > SATA Phy Event Counters (GP Log 0x11)
> > ID Size Value Description
> > 0x0001 2 0 Command failed due to ICRC error
> > 0x0002 2 0 R_ERR response for data FIS
> > 0x0003 2 0 R_ERR response for device-to-host data FIS
> > 0x0004 2 0 R_ERR response for host-to-device data FIS
> > 0x0005 2 0 R_ERR response for non-data FIS
> > 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> > 0x0007 2 0 R_ERR response for host-to-device non-data FIS
> > 0x000a 2 5 Device-to-host register FISes sent due to a COMRESET
> > 0x000b 2 0 CRC errors within host-to-device FIS
> > 0x8000 4 1208379 Vendor specific
> >
> > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Model Family: Western Digital Red
> > Device Model: WDC WD20EFRX-68AX9N0
> > Serial Number: WD-WMC300320657
> > LU WWN Device Id: 5 0014ee 0ae1ee098
> > Firmware Version: 80.00A80
> > User Capacity: 2,000,398,934,016 bytes [2.00 TB]
> > Sector Sizes: 512 bytes logical, 4096 bytes physical
> > Device is: In smartctl database [for details use: -P show]
> > ATA Version is: ACS-2 (minor revision not indicated)
> > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> > Local Time is: Fri Mar 20 21:00:38 2020 CET
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> > AAM feature is: Unavailable
> > APM feature is: Unavailable
> > Rd look-ahead is: Enabled
> > Write cache is: Enabled
> > ATA Security is: Disabled, NOT FROZEN [SEC1]
> > Wt Cache Reorder: Unknown
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status: (0x00) Offline data collection activity
> > was never started.
> > Auto Offline Data Collection: Disabled.
> > Self-test execution status: ( 0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: (27120) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities: (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability: (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: ( 2) minutes.
> > Extended self-test routine
> > recommended polling time: ( 274) minutes.
> > Conveyance self-test routine
> > recommended polling time: ( 5) minutes.
> > SCT capabilities: (0x70bd) SCT Status supported.
> > SCT Error Recovery Control supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> > 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
> > 3 Spin_Up_Time POS--K 176 169 021 - 4183
> > 4 Start_Stop_Count -O--CK 100 100 000 - 502
> > 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
> > 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
> > 9 Power_On_Hours -O--CK 061 061 000 - 28588
> > 10 Spin_Retry_Count -O--CK 100 100 000 - 0
> > 11 Calibration_Retry_Count -O--CK 100 100 000 - 0
> > 12 Power_Cycle_Count -O--CK 100 100 000 - 490
> > 192 Power-Off_Retract_Count -O--CK 200 200 000 - 483
> > 193 Load_Cycle_Count -O--CK 200 200 000 - 18
> > 194 Temperature_Celsius -O---K 120 089 000 - 27
> > 196 Reallocated_Event_Count -O--CK 200 200 000 - 0
> > 197 Current_Pending_Sector -O--CK 200 200 000 - 0
> > 198 Offline_Uncorrectable ----CK 100 253 000 - 0
> > 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
> > 200 Multi_Zone_Error_Rate ---R-- 100 253 000 - 0
> > ||||||_ K auto-keep
> > |||||__ C event count
> > ||||___ R error rate
> > |||____ S speed/performance
> > ||_____ O updated online
> > |______ P prefailure warning
> >
> > General Purpose Log Directory Version 1
> > SMART Log Directory Version 1 [multi-sector log support]
> > Address Access R/W Size Description
> > 0x00 GPL,SL R/O 1 Log Directory
> > 0x01 SL R/O 1 Summary SMART error log
> > 0x02 SL R/O 5 Comprehensive SMART error log
> > 0x03 GPL R/O 6 Ext. Comprehensive SMART error log
> > 0x06 SL R/O 1 SMART self-test log
> > 0x07 GPL R/O 1 Extended self-test log
> > 0x09 SL R/W 1 Selective self-test log
> > 0x10 GPL R/O 1 SATA NCQ Queued Error log
> > 0x11 GPL R/O 1 SATA Phy Event Counters log
> > 0x21 GPL R/O 1 Write stream error log
> > 0x22 GPL R/O 1 Read stream error log
> > 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> > 0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
> > 0xa8-0xb7 GPL,SL VS 1 Device vendor specific log
> > 0xbd GPL,SL VS 1 Device vendor specific log
> > 0xc0 GPL,SL VS 1 Device vendor specific log
> > 0xc1 GPL VS 93 Device vendor specific log
> > 0xe0 GPL,SL R/W 1 SCT Command/Status
> > 0xe1 GPL,SL R/W 1 SCT Data Transfer
> >
> > SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
> > No Errors Logged
> >
> > SMART Extended Self-test Log Version: 1 (1 sectors)
> > Num Test_Description Status Remaining
> > LifeTime(hours) LBA_of_first_error
> > # 1 Short offline Completed without error 00% 26024 -
> >
> > SMART Selective self-test log data structure revision number 1
> > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> > 1 0 0 Not_testing
> > 2 0 0 Not_testing
> > 3 0 0 Not_testing
> > 4 0 0 Not_testing
> > 5 0 0 Not_testing
> > Selective self-test flags (0x0):
> > After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > SCT Status Version: 3
> > SCT Version (vendor specific): 258 (0x0102)
> > SCT Support Level: 1
> > Device State: Active (0)
> > Current Temperature: 27 Celsius
> > Power Cycle Min/Max Temperature: 10/32 Celsius
> > Lifetime Min/Max Temperature: 2/58 Celsius
> > Under/Over Temperature Limit Count: 0/0
> > Vendor specific:
> > 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > SCT Temperature History Version: 2
> > Temperature Sampling Period: 1 minute
> > Temperature Logging Interval: 1 minute
> > Min/Max recommended Temperature: 0/60 Celsius
> > Min/Max Temperature Limit: -41/85 Celsius
> > Temperature History Size (Index): 478 (56)
> >
> > Index Estimated Time Temperature Celsius
> > 57 2020-03-20 13:03 24 *****
> > ... ..(377 skipped). .. *****
> > 435 2020-03-20 19:21 24 *****
> > 436 2020-03-20 19:22 ? -
> > 437 2020-03-20 19:23 24 *****
> > 438 2020-03-20 19:24 25 ******
> > ... ..( 3 skipped). .. ******
> > 442 2020-03-20 19:28 25 ******
> > 443 2020-03-20 19:29 26 *******
> > 444 2020-03-20 19:30 26 *******
> > 445 2020-03-20 19:31 26 *******
> > 446 2020-03-20 19:32 27 ********
> > ... ..( 3 skipped). .. ********
> > 450 2020-03-20 19:36 27 ********
> > 451 2020-03-20 19:37 24 *****
> > ... ..( 82 skipped). .. *****
> > 56 2020-03-20 21:00 24 *****
> >
> > SCT Error Recovery Control:
> > Read: Disabled
> > Write: Disabled
>
> What's going on here? We have a RED drive, but ERC isn't working ...
> >
> > Device Statistics (GP/SMART Log 0x04) not supported
> >
> > SATA Phy Event Counters (GP Log 0x11)
> > ID Size Value Description
> > 0x0001 2 0 Command failed due to ICRC error
> > 0x0002 2 0 R_ERR response for data FIS
> > 0x0003 2 0 R_ERR response for device-to-host data FIS
> > 0x0004 2 0 R_ERR response for host-to-device data FIS
> > 0x0005 2 0 R_ERR response for non-data FIS
> > 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> > 0x0007 2 0 R_ERR response for host-to-device non-data FIS
> > 0x0008 2 0 Device-to-host non-data FIS retries
> > 0x0009 2 33 Transition from drive PhyRdy to drive PhyNRdy
> > 0x000a 2 34 Device-to-host register FISes sent due to a COMRESET
> > 0x000b 2 0 CRC errors within host-to-device FIS
> > 0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
> > 0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
> > 0x8000 4 1208361 Vendor specific
> >
> > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> > === START OF INFORMATION SECTION ===
> > Device Model: ST4000VN008-2DR166
> > Serial Number: ZDH82183
> > LU WWN Device Id: 5 000c50 0c37c42c0
> > Firmware Version: SC60
> > User Capacity: 4,000,787,030,016 bytes [4.00 TB]
> > Sector Sizes: 512 bytes logical, 4096 bytes physical
> > Rotation Rate: 5980 rpm
> > Form Factor: 3.5 inches
> > Device is: Not in smartctl database [for details use: -P showall]
> > ATA Version is: ACS-3 T13/2161-D revision 5
> > SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
> > Local Time is: Fri Mar 20 21:00:38 2020 CET
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> > AAM feature is: Unavailable
> > APM level is: 254 (maximum performance)
> > Rd look-ahead is: Enabled
> > Write cache is: Enabled
> > ATA Security is: Disabled, NOT FROZEN [SEC1]
> > Wt Cache Reorder: Unknown
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> >
> > General SMART Values:
> > Offline data collection status: (0x82) Offline data collection activity
> > was completed without error.
> > Auto Offline Data Collection: Enabled.
> > Self-test execution status: ( 0) The previous self-test routine completed
> > without error or no self-test has ever
> > been run.
> > Total time to complete Offline
> > data collection: ( 581) seconds.
> > Offline data collection
> > capabilities: (0x7b) SMART execute Offline immediate.
> > Auto Offline data collection on/off support.
> > Suspend Offline collection upon new
> > command.
> > Offline surface scan supported.
> > Self-test supported.
> > Conveyance Self-test supported.
> > Selective Self-test supported.
> > SMART capabilities: (0x0003) Saves SMART data before entering
> > power-saving mode.
> > Supports SMART auto save timer.
> > Error logging capability: (0x01) Error logging supported.
> > General Purpose Logging supported.
> > Short self-test routine
> > recommended polling time: ( 1) minutes.
> > Extended self-test routine
> > recommended polling time: ( 621) minutes.
> > Conveyance self-test routine
> > recommended polling time: ( 2) minutes.
> > SCT capabilities: (0x50bd) SCT Status supported.
> > SCT Error Recovery Control supported.
> > SCT Feature Control supported.
> > SCT Data Table supported.
> >
> > SMART Attributes Data Structure revision number: 10
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> > 1 Raw_Read_Error_Rate POSR-- 070 065 044 - 10856451
> > 3 Spin_Up_Time PO---- 094 094 000 - 0
> > 4 Start_Stop_Count -O--CK 100 100 020 - 53
> > 5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
> > 7 Seek_Error_Rate POSR-- 075 061 045 - 29667756
> > 9 Power_On_Hours -O--CK 100 100 000 - 506 (130 79 0)
> > 10 Spin_Retry_Count PO--C- 100 100 097 - 0
> > 12 Power_Cycle_Count -O--CK 100 100 020 - 5
> > 184 End-to-End_Error -O--CK 100 100 099 - 0
> > 187 Reported_Uncorrect -O--CK 100 100 000 - 0
> > 188 Command_Timeout -O--CK 098 098 000 - 65538
> > 189 High_Fly_Writes -O-RCK 100 100 000 - 0
> > 190 Airflow_Temperature_Cel -O---K 076 070 040 - 24 (Min/Max 9/26)
> > 191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
> > 192 Power-Off_Retract_Count -O--CK 100 100 000 - 44
> > 193 Load_Cycle_Count -O--CK 100 100 000 - 284
> > 194 Temperature_Celsius -O---K 024 040 000 - 24 (0 9 0 0 0)
> > 197 Current_Pending_Sector -O--C- 100 100 000 - 0
> > 198 Offline_Uncorrectable ----C- 100 100 000 - 0
> > 199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
> > 240 Head_Flying_Hours ------ 100 253 000 - 139 (51 45 0)
> > 241 Total_LBAs_Written ------ 100 253 000 - 8177237744
> > 242 Total_LBAs_Read ------ 100 253 000 - 5818370819
> > ||||||_ K auto-keep
> > |||||__ C event count
> > ||||___ R error rate
> > |||____ S speed/performance
> > ||_____ O updated online
> > |______ P prefailure warning
> >
> > General Purpose Log Directory Version 1
> > SMART Log Directory Version 1 [multi-sector log support]
> > Address Access R/W Size Description
> > 0x00 GPL,SL R/O 1 Log Directory
> > 0x01 SL R/O 1 Summary SMART error log
> > 0x02 SL R/O 5 Comprehensive SMART error log
> > 0x03 GPL R/O 5 Ext. Comprehensive SMART error log
> > 0x04 GPL,SL R/O 8 Device Statistics log
> > 0x06 SL R/O 1 SMART self-test log
> > 0x07 GPL R/O 1 Extended self-test log
> > 0x09 SL R/W 1 Selective self-test log
> > 0x10 GPL R/O 1 SATA NCQ Queued Error log
> > 0x11 GPL R/O 1 SATA Phy Event Counters log
> > 0x13 GPL R/O 1 SATA NCQ Send and Receive log
> > 0x15 GPL R/W 1 SATA Rebuild Assist log
> > 0x21 GPL R/O 1 Write stream error log
> > 0x22 GPL R/O 1 Read stream error log
> > 0x24 GPL R/O 512 Current Device Internal Status Data log
> > 0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
> > 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> > 0xa1 GPL,SL VS 24 Device vendor specific log
> > 0xa2 GPL VS 8160 Device vendor specific log
> > 0xa6 GPL VS 192 Device vendor specific log
> > 0xa8-0xa9 GPL,SL VS 136 Device vendor specific log
> > 0xab GPL VS 1 Device vendor specific log
> > 0xb0 GPL VS 9048 Device vendor specific log
> > 0xbe-0xbf GPL VS 65535 Device vendor specific log
> > 0xc1 GPL,SL VS 16 Device vendor specific log
> > 0xd1 GPL VS 136 Device vendor specific log
> > 0xd2 GPL VS 10000 Device vendor specific log
> > 0xd3 GPL VS 1920 Device vendor specific log
> > 0xe0 GPL,SL R/W 1 SCT Command/Status
> > 0xe1 GPL,SL R/W 1 SCT Data Transfer
> >
> > SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> > No Errors Logged
> >
> > SMART Extended Self-test Log Version: 1 (1 sectors)
> > No self-tests have been logged. [To run self-tests, use: smartctl -t]
> >
> > SMART Selective self-test log data structure revision number 1
> > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> > 1 0 0 Not_testing
> > 2 0 0 Not_testing
> > 3 0 0 Not_testing
> > 4 0 0 Not_testing
> > 5 0 0 Not_testing
> > Selective self-test flags (0x0):
> > After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute delay.
> >
> > SCT Status Version: 3
> > SCT Version (vendor specific): 522 (0x020a)
> > SCT Support Level: 1
> > Device State: Active (0)
> > Current Temperature: 23 Celsius
> > Power Cycle Min/Max Temperature: 8/26 Celsius
> > Lifetime Min/Max Temperature: 8/30 Celsius
> > Under/Over Temperature Limit Count: 0/336
> >
> > SCT Temperature History Version: 2
> > Temperature Sampling Period: 3 minutes
> > Temperature Logging Interval: 59 minutes
> > Min/Max recommended Temperature: 0/ 0 Celsius
> > Min/Max Temperature Limit: 0/ 0 Celsius
> > Temperature History Size (Index): 128 (119)
> >
> > Index Estimated Time Temperature Celsius
> > 120 2020-03-15 16:02 21 **
> > ... ..( 5 skipped). .. **
> > 126 2020-03-15 21:56 21 **
> > 127 2020-03-15 22:55 22 ***
> > ... ..( 16 skipped). .. ***
> > 16 2020-03-16 15:38 22 ***
> > 17 2020-03-16 16:37 23 ****
> > ... ..( 3 skipped). .. ****
> > 21 2020-03-16 20:33 23 ****
> > 22 2020-03-16 21:32 24 *****
> > 23 2020-03-16 22:31 23 ****
> > 24 2020-03-16 23:30 24 *****
> > 25 2020-03-17 00:29 24 *****
> > 26 2020-03-17 01:28 24 *****
> > 27 2020-03-17 02:27 23 ****
> > ... ..( 7 skipped). .. ****
> > 35 2020-03-17 10:19 23 ****
> > 36 2020-03-17 11:18 22 ***
> > ... ..( 3 skipped). .. ***
> > 40 2020-03-17 15:14 22 ***
> > 41 2020-03-17 16:13 23 ****
> > ... ..( 14 skipped). .. ****
> > 56 2020-03-18 06:58 23 ****
> > 57 2020-03-18 07:57 22 ***
> > ... ..( 2 skipped). .. ***
> > 60 2020-03-18 10:54 22 ***
> > 61 2020-03-18 11:53 21 **
> > 62 2020-03-18 12:52 20 *
> > 63 2020-03-18 13:51 21 **
> > 64 2020-03-18 14:50 20 *
> > 65 2020-03-18 15:49 20 *
> > 66 2020-03-18 16:48 21 **
> > ... ..( 5 skipped). .. **
> > 72 2020-03-18 22:42 21 **
> > 73 2020-03-18 23:41 24 *****
> > 74 2020-03-19 00:40 26 *******
> > ... ..( 2 skipped). .. *******
> > 77 2020-03-19 03:37 26 *******
> > 78 2020-03-19 04:36 22 ***
> > ... ..( 2 skipped). .. ***
> > 81 2020-03-19 07:33 22 ***
> > 82 2020-03-19 08:32 21 **
> > 83 2020-03-19 09:31 22 ***
> > 84 2020-03-19 10:30 22 ***
> > 85 2020-03-19 11:29 21 **
> > ... ..( 2 skipped). .. **
> > 88 2020-03-19 14:26 21 **
> > 89 2020-03-19 15:25 25 ******
> > 90 2020-03-19 16:24 25 ******
> > 91 2020-03-19 17:23 26 *******
> > 92 2020-03-19 18:22 25 ******
> > 93 2020-03-19 19:21 22 ***
> > ... ..( 3 skipped). .. ***
> > 97 2020-03-19 23:17 22 ***
> > 98 2020-03-20 00:16 21 **
> > ... ..( 4 skipped). .. **
> > 103 2020-03-20 05:11 21 **
> > 104 2020-03-20 06:10 20 *
> > ... ..( 11 skipped). .. *
> > 116 2020-03-20 17:58 20 *
> > 117 2020-03-20 18:57 21 **
> > 118 2020-03-20 19:56 21 **
> > 119 2020-03-20 20:55 21 **
> >
> > SCT Error Recovery Control:
> > Read: Disabled
> > Write: Disabled
>
> OUCH! AGAIN!
> >
> > Device Statistics (GP Log 0x04)
> > Page Offset Size Value Flags Description
> > 0x01 ===== = = === == General Statistics (rev 1) ==
> > 0x01 0x008 4 5 --- Lifetime Power-On Resets
> > 0x01 0x010 4 506 --- Power-on Hours
> > 0x01 0x018 6 8177237744 --- Logical Sectors Written
> > 0x01 0x020 6 32254131 --- Number of Write Commands
> > 0x01 0x028 6 5818370805 --- Logical Sectors Read
> > 0x01 0x030 6 24397122 --- Number of Read Commands
> > 0x01 0x038 6 - --- Date and Time TimeStamp
> > 0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
> > 0x03 0x008 4 159 --- Spindle Motor Power-on Hours
> > 0x03 0x010 4 10 --- Head Flying Hours
> > 0x03 0x018 4 284 --- Head Load Events
> > 0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
> > 0x03 0x028 4 0 --- Read Recovery Attempts
> > 0x03 0x030 4 0 --- Number of Mechanical Start Failures
> > 0x03 0x038 4 0 --- Number of Realloc. Candidate
> > Logical Sectors
> > 0x03 0x040 4 45 --- Number of High Priority Unload Events
> > 0x04 ===== = = === == General Errors Statistics (rev 1) ==
> > 0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors
> > 0x04 0x010 4 2 --- Resets Between Cmd Acceptance and
> > Completion
> > 0x05 ===== = = === == Temperature Statistics (rev 1) ==
> > 0x05 0x008 1 23 --- Current Temperature
> > 0x05 0x010 1 20 --- Average Short Term Temperature
> > 0x05 0x018 1 - --- Average Long Term Temperature
> > 0x05 0x020 1 30 --- Highest Temperature
> > 0x05 0x028 1 0 --- Lowest Temperature
> > 0x05 0x030 1 27 --- Highest Average Short Term Temperature
> > 0x05 0x038 1 14 --- Lowest Average Short Term Temperature
> > 0x05 0x040 1 - --- Highest Average Long Term Temperature
> > 0x05 0x048 1 - --- Lowest Average Long Term Temperature
> > 0x05 0x050 4 0 --- Time in Over-Temperature
> > 0x05 0x058 1 70 --- Specified Maximum Operating Temperature
> > 0x05 0x060 4 0 --- Time in Under-Temperature
> > 0x05 0x068 1 0 --- Specified Minimum Operating Temperature
> > 0x06 ===== = = === == Transport Statistics (rev 1) ==
> > 0x06 0x008 4 101 --- Number of Hardware Resets
> > 0x06 0x010 4 17 --- Number of ASR Events
> > 0x06 0x018 4 0 --- Number of Interface CRC Errors
> > |||_ C monitored condition met
> > ||__ D supports DSN
> > |___ N normalized value
> >
> > SATA Phy Event Counters (GP Log 0x11)
> > ID Size Value Description
> > 0x000a 2 34 Device-to-host register FISes sent due to a COMRESET
> > 0x0001 2 0 Command failed due to ICRC error
> > 0x0003 2 0 R_ERR response for device-to-host data FIS
> > 0x0004 2 0 R_ERR response for host-to-device data FIS
> > 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> > 0x0007 2 0 R_ERR response for host-to-device non-data FIS
> >
> > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-64-generic] (local build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> >
> Oh My God.
>
> This array is just asking for disaster. Whoops, you've just had one, sorry.
>
> I'm looking for details of your two failed drives, but I don't seem able
> to find any. But as soon as you can get the array back, you need to fix
> those problems ASAP!!!
>
> Firstly, get rid of that Green!!! Were the two failed drives greens?
> Read the timeout page to find out why.
>
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>
> That will hopefully also fix the problem with those Reds with ERC
> disabled. It would not surprise me in the slightest if this is what has
> done the damage to your array.
>
> Lastly, those ST4000s. Are they Ironwolves? I guess they're good drives,
> but they've just trashed your raid-6 redundancy - lose just one of them
> and your array is teetering on the edge. You need to get your sdx2
> partitions copied on to new drives ASAP.
>
> What I'd do is get a couple more ST4000s, and use them, creating 4TB
> partitions. Then take your existing ST4000s, and convert them to 4TB
> partitions. At which point you only need five more ST4000s to move your
> array on to new drives.
>
> I'm not sure how you get there - once you've got your 9 4TB drives you
> *may* be able to just fail and remove the remaining 2TB drives.
> Otherwise, I'd use the freed-up 2TB drives to create 4TB raid-0s. You'd
> end up having to buy a couple of spare 4TB drives to move the entire
> array on to 4TB "drives", but then you could remove the raid-0 arrays.
>
> Cheers,
> Wol
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Raid6 recovery
2020-03-21 11:54 ` Glenn Greibesland
@ 2020-03-21 19:24 ` Phil Turmel
2020-03-21 22:12 ` Glenn Greibesland
2020-03-22 0:05 ` Wols Lists
0 siblings, 2 replies; 13+ messages in thread
From: Phil Turmel @ 2020-03-21 19:24 UTC (permalink / raw)
To: Glenn Greibesland, antlists; +Cc: linux-raid, NeilBrown
Hi Glenn,
{Convention on kernel.org lists is to interleave replies or bottom post,
and to trim non-relevant quoted material. Please do so in the future.}
On 3/21/20 7:54 AM, Glenn Greibesland wrote:
> Yes, I am aware of the problems with WD Green and multiple partitions
> on single 4TB disk. I am in the middle of getting rid of old disks and
> I have enough new drives to stop having multiple partitions on single
> drives, but not enough power and free SATA ports. It is just a
> temporary solution. Also a reason why I did not
> include much details in the original post, I knew it would just
> distract from the problem I want to solve right away.
>
> What I need help with now is just getting the array started with the
> 16 out of 18 disks. Then I can continue migrating data and replacing
> old disks as planned.
I've examined the material posted, and the sequence of events described.
The --re-add damaged that one drive's role record and there is no
programmatic way in mdadm to correct it.
Since you seem comfortable reading source code, you might consider byte
editing that drive's superblock to restore it to "active device 10".
That is what I would do. With that corrected, --assemble --force should
give you a running array.
In lieu of superblock surgery, you will indeed need to perform a
--create --assume-clean, as you proposed in your original email. Since
you have already constructed a syntactically valid command for that
purpose, with appropriate data offsets, that might be the fastest way to
get a running array.
I would double-check the /dev/ name versus array "active device" number
relationship to ensure strict ordering in your --create operation.
Incorrect ordering will utterly scramble your content.
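Phil's ordering check can be scripted. A sketch, assuming v1.x metadata whose --examine output includes a "Device Role" line (the glob is a stand-in for the real member list, and the names here are hypothetical):

```shell
# Print each candidate member's role from its superblock, so the device
# order for --create can be written down and double-checked first.
for dev in /dev/sd[a-r]1; do
    role=$(mdadm --examine "$dev" 2>/dev/null | awk -F' : ' '/Device Role/ {print $2}')
    printf '%-12s %s\n' "$dev" "${role:-no superblock found}"
done
```

Sorting that output by role number, and listing the devices in exactly that order on the --create command line, avoids the scrambling Phil warns about.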
> When I built the array in 2012, I used WD Green. They turned out to be
> horrible disks and I have since replaced some of them with WD Red. The
> newest disks I've bought are Ironwolves
I also noted the drives with Error Recovery Control turned off. That is
not an issue while your array has no redundancy, but is catastrophic in
any normal array. It is as bad as having a drive that doesn't do ERC at
all. Don't do that. Do read the "Timeout Mismatch" documentation that
Anthony recommended, if you haven't yet.
I also recommend, when you get to a running array, that you prioritize
the backup of its content--get the critical data copied out ASAP. Your
array will be very vulnerable to Unrecoverable Read Errors until you've
completed your reconfiguration onto new drives. Do not attempt to scrub
the array or read every file right away, as any URE may break the array
again.
If UREs do break your array again, you will need to use an
error-ignoring copy tool (some flavor of ddrescue) to put the readable
data onto a new device, remove the old device from the system, and then
--assemble --force with the replacement. Repeat as needed.
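A minimal sketch of that error-ignoring copy with GNU ddrescue (the device names and map path below are hypothetical; substitute the failing member and its replacement):

```shell
SRC=/dev/sdX1                       # hypothetical failing member
DST=/dev/sdY1                       # hypothetical replacement, at least the same size
MAP=/root/rescue-$(basename "$SRC").map   # progress map; lets interrupted runs resume

if command -v ddrescue >/dev/null 2>&1; then
    ddrescue -n "$SRC" "$DST" "$MAP"      # pass 1: copy the easy data, skip errors fast
    ddrescue -d -r3 "$SRC" "$DST" "$MAP"  # pass 2: direct I/O, retry bad areas up to 3 times
else
    echo "GNU ddrescue not installed (package gddrescue on Debian/Ubuntu)"
fi
```

Once the copy finishes, remove SRC from the system and run --assemble --force with DST in its place, as described above.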
Good luck!
Regards,
Phil
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Raid6 recovery
2020-03-21 19:24 ` Phil Turmel
@ 2020-03-21 22:12 ` Glenn Greibesland
2020-03-22 0:32 ` Phil Turmel
2020-03-22 0:05 ` Wols Lists
1 sibling, 1 reply; 13+ messages in thread
From: Glenn Greibesland @ 2020-03-21 22:12 UTC (permalink / raw)
To: Phil Turmel; +Cc: antlists, linux-raid, NeilBrown
lør. 21. mar. 2020 kl. 20:24 skrev Phil Turmel <philip@turmel.org>:
> {Convention on kernel.org lists is to interleave replies or bottom post,
> and to trim non-relevant quoted material. Please do so in the future.}
Sorry about that.
> Since you seem comfortable reading source code, you might consider byte
> editing that drive's superblock to restore it to "active device 10".
> That is what I would do. With that corrected, --assemble --force should
> give you a running array.
I did some more digging in the source code, but it looks like the
superblock is replicated onto all drives and that I probably would
have to edit the superblock of all disks, but I'm not sure.
With newfound confidence (thanks) I decided to try the --create
--assume-clean option instead.
It worked fine and I am now copying the data that is not already backed up.
I'll wait until the data is copied onto other drives before I add the
last two disks to the array and start rebuilding.
> I also noted the drives with Error Recovery Control turned off. That is
> not an issue while your array has no redundancy, but is catastrophic in
> any normal array. It is as bad as having a drive that doesn't do ERC at
> all. Don't do that. Do read the "Timeout Mismatch" documentation that
> Anthony recommended, if you haven't yet.
I'll read up on this documentation to ensure reliable operation in the
future. Thanks Phil and Anthony.
So to summarize what happened and what I've learned:
I had a RAID6 array with only 16 out of 18 working drives.
I received an email from mdadm saying another drive failed.
I ran a full offline SMART test that completed successfully.
The drive was in F (failed) state. I used --re-add and mdadm overwrote
the superblock turning it into a spare drive instead of putting the
drive back into slot 10.
I should have used --assemble --force.
Am I correct?
Glenn
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Raid6 recovery
2020-03-21 19:24 ` Phil Turmel
2020-03-21 22:12 ` Glenn Greibesland
@ 2020-03-22 0:05 ` Wols Lists
1 sibling, 0 replies; 13+ messages in thread
From: Wols Lists @ 2020-03-22 0:05 UTC (permalink / raw)
To: Phil Turmel, Glenn Greibesland; +Cc: linux-raid, NeilBrown
On 21/03/20 19:24, Phil Turmel wrote:
> If UREs do break your array again, you will need to use an
> error-ignoring copy tool (some flavor of ddrescue) to put the readable
> data onto a new device, remove the old device from the system, and then
> --assemble --force with the replacement. Repeat as needed.
Also look at dm-integrity - though I would NOT recommend it at the
moment, as it's untested and reputedly breaks raid 5 & 6. If we could
trust it, it would be a wonderful tool alongside ddrescue.
Hopefully I'm about to have a wonderful system with 6 or so drives I can
play with as a raid test-bed, and I'm hoping to do a load of work on this.
Cheers,
Wol
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Raid6 recovery
2020-03-21 22:12 ` Glenn Greibesland
@ 2020-03-22 0:32 ` Phil Turmel
2020-03-23 9:23 ` Wols Lists
0 siblings, 1 reply; 13+ messages in thread
From: Phil Turmel @ 2020-03-22 0:32 UTC (permalink / raw)
To: Glenn Greibesland; +Cc: antlists, linux-raid, NeilBrown
On 3/21/20 6:12 PM, Glenn Greibesland wrote:
[trim /]
> So to summarize what happened and what I've learned:
> I had a RAID6 array with only 16 out of 18 working drives.
> I received an email from mdadm saying another drive failed.
> I ran a full offline SMART test that completed successfully.
>
> The drive was in F (failed) state. I used --re-add and mdadm overwrote
> the superblock turning it into a spare drive instead of putting the
> drive back into slot 10.
> I should have used --assemble --force.
>
> Am I correct?
Yes.
However, there have been bugs in --force that would cause it to not
assemble. Also, I believe latest behavior for --re-add would not have
damaged the metadata.
Phil
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Raid6 recovery
2020-03-22 0:32 ` Phil Turmel
@ 2020-03-23 9:23 ` Wols Lists
2020-03-23 12:35 ` Glenn Greibesland
0 siblings, 1 reply; 13+ messages in thread
From: Wols Lists @ 2020-03-23 9:23 UTC (permalink / raw)
To: Phil Turmel, Glenn Greibesland; +Cc: linux-raid, NeilBrown
On 22/03/20 00:32, Phil Turmel wrote:
> However, there have been bugs in --force that would cause it to not
> assemble. Also, I believe latest behavior for --re-add would not have
> damaged the metadata.
And note that the website does tell you always to use the latest version
of mdadm when trying to recover an array ... because it's linux-only
it's pretty easy to build from source if you have to.
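For reference, a from-source build might look like this (the tag name is an example; pick the newest release from the repository, and note the guards make this a no-op where git or the network is unavailable):

```shell
if command -v git >/dev/null 2>&1 && git clone https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git 2>/dev/null; then
    cd mdadm
    git checkout mdadm-4.1     # example tag; check 'git tag' for the latest release
    make
    ./mdadm --version          # run from the build tree, or 'make install' system-wide
else
    echo "git or network unavailable; commands shown for reference"
fi
```

Running the freshly built ./mdadm straight from the build tree is enough for a one-off recovery, without disturbing the distro package.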
Cheers,
Wol
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Raid6 recovery
2020-03-23 9:23 ` Wols Lists
@ 2020-03-23 12:35 ` Glenn Greibesland
0 siblings, 0 replies; 13+ messages in thread
From: Glenn Greibesland @ 2020-03-23 12:35 UTC (permalink / raw)
To: Wols Lists; +Cc: Phil Turmel, linux-raid, NeilBrown
man. 23. mar. 2020 kl. 10:23 skrev Wols Lists <antlists@youngman.org.uk>:
> And note that the website does tell you always to use the latest version
> of mdadm when trying to recover an array ... because it's linux-only
> it's pretty easy to build from source if you have to.
I was probably using version 3.3-2 when I ran --re-add and the problem
started. That is the latest version available for the version of
Ubuntu Server I am running.
I then upgraded to v4.0 and later to v4.1-65 by building from source.
Lesson learned.
Question about mdadm documentation:
Should the man page be updated to reflect the support for using
sectors as units in addition to K M G and T?
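To make the unit in question concrete: a trailing `s` on --size (as in the --create command earlier in this thread) means 512-byte sectors rather than K/M/G/T. A quick shell check of what that works out to:

```shell
# --size=3906763776s from the --create earlier in this thread:
# the trailing 's' means 512-byte sectors, not KiB/MiB/GiB/TiB.
sectors=3906763776
bytes=$((sectors * 512))
echo "$bytes bytes per member"    # 2000263053312 bytes, ~2.0 TB (1.82 TiB)
```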
Glenn
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: raid6 recovery
2011-01-14 16:16 raid6 recovery Björn Englund
@ 2011-01-14 21:52 ` NeilBrown
0 siblings, 0 replies; 13+ messages in thread
From: NeilBrown @ 2011-01-14 21:52 UTC (permalink / raw)
To: Björn Englund; +Cc: linux-raid
On Fri, 14 Jan 2011 17:16:26 +0100 Björn Englund <be@smarteye.se> wrote:
> Hi.
>
> After a loss of communication with a drive in a 10 disk raid6 the disk
> was dropped out of the raid.
>
> I added it again with
> mdadm /dev/md16 --add /dev/sdbq1
>
> The array resynced and I used the xfs filesystem on top of the raid.
>
> After a while I started noticing filesystem errors.
>
> I did
> echo check > /sys/block/md16/md/sync_action
>
> I got a lot of errors in /sys/block/md16/md/mismatch_cnt
>
> I failed and removed the disk I added before from the array.
>
> Did a check again (on the 9/10 array)
> echo check > /sys/block/md16/md/sync_action
>
> No errors /sys/block/md16/md/mismatch_cnt
>
> Wiped the superblock from /dev/sdbq1 and added it again to the array.
> Let it finish resyncing.
> Did a check and once again a lot of errors.
That is obviously very bad. After the recovery it may well report a large
number in mismatch_cnt, but if you then do a 'check' the number should go to
zero and stay there.
Did you interrupt the recovery at all, or did it run to completion without
any interference? What kernel version are you using?
>
> The drive now has slot 10 instead of slot 3 which it had before the
> first error.
This is normal. When you wiped the superblock, md thought it was a new device
and gave it a new number in the array. It still filled the same role, though.
>
> Examining each device (see below) shows 11 slots and one failed?
> (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3) ?
These numbers are confusing, but they are correct and suggest the array is
whole and working.
Newer versions of mdadm are less confusing.
I'm afraid I cannot suggest what the root problem is. It seems like
something is seriously wrong with IO to the device, but if that were the
case you would expect other errors...
NeilBrown
>
>
> Any idea what is going on?
>
> mdadm --version
> mdadm - v2.6.9 - 10th March 2009
>
> Centos 5.5
>
>
> mdadm -D /dev/md16
> /dev/md16:
> Version : 1.01
> Creation Time : Thu Nov 25 09:15:54 2010
> Raid Level : raid6
> Array Size : 7809792000 (7448.00 GiB 7997.23 GB)
> Used Dev Size : 976224000 (931.00 GiB 999.65 GB)
> Raid Devices : 10
> Total Devices : 10
> Preferred Minor : 16
> Persistence : Superblock is persistent
>
> Update Time : Fri Jan 14 16:22:10 2011
> State : clean
> Active Devices : 10
> Working Devices : 10
> Failed Devices : 0
> Spare Devices : 0
>
> Chunk Size : 256K
>
> Name : 16
> UUID : fcd585d0:f2918552:7090d8da:532927c8
> Events : 90
>
> Number Major Minor RaidDevice State
> 0 8 145 0 active sync /dev/sdj1
> 1 65 1 1 active sync /dev/sdq1
> 2 65 17 2 active sync /dev/sdr1
> 10 68 65 3 active sync /dev/sdbq1
> 4 65 49 4 active sync /dev/sdt1
> 5 65 65 5 active sync /dev/sdu1
> 6 65 113 6 active sync /dev/sdx1
> 7 65 129 7 active sync /dev/sdy1
> 8 65 33 8 active sync /dev/sds1
> 9 65 145 9 active sync /dev/sdz1
>
>
>
> mdadm -E /dev/sdj1
> /dev/sdj1:
> Magic : a92b4efc
> Version : 1.1
> Feature Map : 0x0
> Array UUID : fcd585d0:f2918552:7090d8da:532927c8
> Name : 16
> Creation Time : Thu Nov 25 09:15:54 2010
> Raid Level : raid6
> Raid Devices : 10
>
> Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
> Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
> Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
> Data Offset : 264 sectors
> Super Offset : 0 sectors
> State : clean
> Device UUID : 5db9c8f7:ce5b375e:757c53d0:04e89a06
>
> Update Time : Fri Jan 14 16:22:10 2011
> Checksum : 1f17a675 - correct
> Events : 90
>
> Chunk Size : 256K
>
> Array Slot : 0 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
> Array State : Uuuuuuuuuu 1 failed
>
>
>
> mdadm -E /dev/sdq1
> /dev/sdq1:
> Magic : a92b4efc
> Version : 1.1
> Feature Map : 0x0
> Array UUID : fcd585d0:f2918552:7090d8da:532927c8
> Name : 16
> Creation Time : Thu Nov 25 09:15:54 2010
> Raid Level : raid6
> Raid Devices : 10
>
> Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
> Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
> Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
> Data Offset : 264 sectors
> Super Offset : 0 sectors
> State : clean
> Device UUID : fb113255:fda391a6:7368a42b:1d6d4655
>
> Update Time : Fri Jan 14 16:22:10 2011
> Checksum : 6ed7b859 - correct
> Events : 90
>
> Chunk Size : 256K
>
> Array Slot : 1 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
> Array State : uUuuuuuuuu 1 failed
>
>
> mdadm -E /dev/sdr1
> /dev/sdr1:
> Magic : a92b4efc
> Version : 1.1
> Feature Map : 0x0
> Array UUID : fcd585d0:f2918552:7090d8da:532927c8
> Name : 16
> Creation Time : Thu Nov 25 09:15:54 2010
> Raid Level : raid6
> Raid Devices : 10
>
> Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
> Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
> Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
> Data Offset : 264 sectors
> Super Offset : 0 sectors
> State : clean
> Device UUID : afcb4dd8:2aa58944:40a32ed9:eb6178af
>
> Update Time : Fri Jan 14 16:22:10 2011
> Checksum : 97a7a2d7 - correct
> Events : 90
>
> Chunk Size : 256K
>
> Array Slot : 2 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
> Array State : uuUuuuuuuu 1 failed
>
>
> mdadm -E /dev/sdbq1
> /dev/sdbq1:
> Magic : a92b4efc
> Version : 1.1
> Feature Map : 0x0
> Array UUID : fcd585d0:f2918552:7090d8da:532927c8
> Name : 16
> Creation Time : Thu Nov 25 09:15:54 2010
> Raid Level : raid6
> Raid Devices : 10
>
> Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
> Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
> Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
> Data Offset : 264 sectors
> Super Offset : 0 sectors
> State : clean
> Device UUID : 93c6ae7c:d8161356:7ada1043:d0c5a924
>
> Update Time : Fri Jan 14 16:22:10 2011
> Checksum : 2ca5aa8f - correct
> Events : 90
>
> Chunk Size : 256K
>
> Array Slot : 10 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
> Array State : uuuUuuuuuu 1 failed
>
>
> and so on for the rest of the drives.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 13+ messages in thread
* raid6 recovery
@ 2011-01-14 16:16 Björn Englund
2011-01-14 21:52 ` NeilBrown
0 siblings, 1 reply; 13+ messages in thread
From: Björn Englund @ 2011-01-14 16:16 UTC (permalink / raw)
To: linux-raid
Hi.
After a loss of communication with a drive in a 10 disk raid6 the disk
was dropped out of the raid.
I added it again with
mdadm /dev/md16 --add /dev/sdbq1
The array resynced and I used the xfs filesystem on top of the raid.
After a while I started noticing filesystem errors.
I did
echo check > /sys/block/md16/md/sync_action
I got a lot of errors in /sys/block/md16/md/mismatch_cnt
I failed and removed the disk I added before from the array.
Did a check again (on the 9/10 array)
echo check > /sys/block/md16/md/sync_action
No errors /sys/block/md16/md/mismatch_cnt
Wiped the superblock from /dev/sdbq1 and added it again to the array.
Let it finish resyncing.
Did a check and once again a lot of errors.
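For reference, that scrub cycle as a runnable sketch (the md device name is the one from this setup; the guard makes it a no-op on hosts without that array):

```shell
MD=md16
if [ -w /sys/block/$MD/md/sync_action ]; then
    echo check > /sys/block/$MD/md/sync_action   # start a read-only parity check
    cat /proc/mdstat                             # progress appears here while it runs
    cat /sys/block/$MD/md/mismatch_cnt           # nonzero after 'check' = inconsistent parity
else
    echo "no $MD on this host; commands shown for reference"
fi
```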
The drive now has slot 10 instead of slot 3 which it had before the
first error.
Examining each device (see below) shows 11 slots and one failed?
(0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3) ?
Any idea what is going on?
mdadm --version
mdadm - v2.6.9 - 10th March 2009
Centos 5.5
mdadm -D /dev/md16
/dev/md16:
Version : 1.01
Creation Time : Thu Nov 25 09:15:54 2010
Raid Level : raid6
Array Size : 7809792000 (7448.00 GiB 7997.23 GB)
Used Dev Size : 976224000 (931.00 GiB 999.65 GB)
Raid Devices : 10
Total Devices : 10
Preferred Minor : 16
Persistence : Superblock is persistent
Update Time : Fri Jan 14 16:22:10 2011
State : clean
Active Devices : 10
Working Devices : 10
Failed Devices : 0
Spare Devices : 0
Chunk Size : 256K
Name : 16
UUID : fcd585d0:f2918552:7090d8da:532927c8
Events : 90
Number Major Minor RaidDevice State
0 8 145 0 active sync /dev/sdj1
1 65 1 1 active sync /dev/sdq1
2 65 17 2 active sync /dev/sdr1
10 68 65 3 active sync /dev/sdbq1
4 65 49 4 active sync /dev/sdt1
5 65 65 5 active sync /dev/sdu1
6 65 113 6 active sync /dev/sdx1
7 65 129 7 active sync /dev/sdy1
8 65 33 8 active sync /dev/sds1
9 65 145 9 active sync /dev/sdz1
mdadm -E /dev/sdj1
/dev/sdj1:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x0
Array UUID : fcd585d0:f2918552:7090d8da:532927c8
Name : 16
Creation Time : Thu Nov 25 09:15:54 2010
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
Data Offset : 264 sectors
Super Offset : 0 sectors
State : clean
Device UUID : 5db9c8f7:ce5b375e:757c53d0:04e89a06
Update Time : Fri Jan 14 16:22:10 2011
Checksum : 1f17a675 - correct
Events : 90
Chunk Size : 256K
Array Slot : 0 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
Array State : Uuuuuuuuuu 1 failed
mdadm -E /dev/sdq1
/dev/sdq1:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x0
Array UUID : fcd585d0:f2918552:7090d8da:532927c8
Name : 16
Creation Time : Thu Nov 25 09:15:54 2010
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
Data Offset : 264 sectors
Super Offset : 0 sectors
State : clean
Device UUID : fb113255:fda391a6:7368a42b:1d6d4655
Update Time : Fri Jan 14 16:22:10 2011
Checksum : 6ed7b859 - correct
Events : 90
Chunk Size : 256K
Array Slot : 1 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
Array State : uUuuuuuuuu 1 failed
mdadm -E /dev/sdr1
/dev/sdr1:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x0
Array UUID : fcd585d0:f2918552:7090d8da:532927c8
Name : 16
Creation Time : Thu Nov 25 09:15:54 2010
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
Data Offset : 264 sectors
Super Offset : 0 sectors
State : clean
Device UUID : afcb4dd8:2aa58944:40a32ed9:eb6178af
Update Time : Fri Jan 14 16:22:10 2011
Checksum : 97a7a2d7 - correct
Events : 90
Chunk Size : 256K
Array Slot : 2 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
Array State : uuUuuuuuuu 1 failed
mdadm -E /dev/sdbq1
/dev/sdbq1:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x0
Array UUID : fcd585d0:f2918552:7090d8da:532927c8
Name : 16
Creation Time : Thu Nov 25 09:15:54 2010
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1952448248 (931.00 GiB 999.65 GB)
Array Size : 15619584000 (7448.00 GiB 7997.23 GB)
Used Dev Size : 1952448000 (931.00 GiB 999.65 GB)
Data Offset : 264 sectors
Super Offset : 0 sectors
State : clean
Device UUID : 93c6ae7c:d8161356:7ada1043:d0c5a924
Update Time : Fri Jan 14 16:22:10 2011
Checksum : 2ca5aa8f - correct
Events : 90
Chunk Size : 256K
Array Slot : 10 (0, 1, 2, failed, 4, 5, 6, 7, 8, 9, 3)
Array State : uuuUuuuuuu 1 failed
and so on for the rest of the drives.
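[Editorial aside: the per-device superblock fields that matter before a forced assembly (Events, Array State, Data Offset) can be compared mechanically rather than by eye. A minimal sketch, with a hypothetical helper applied to fragments of the `mdadm -E` dumps above:]

```python
import re

def parse_examine(text):
    """Pull the reassembly-relevant fields out of `mdadm -E` output."""
    fields = {}
    for key in ("Events", "Array State", "Data Offset", "Avail Dev Size"):
        m = re.search(rf"^\s*{key} : (.+)$", text, re.MULTILINE)
        if m:
            fields[key] = m.group(1).strip()
    return fields

# Fragments taken from two of the superblock dumps above
sdq = "Data Offset : 264 sectors\nEvents : 90\nArray State : uUuuuuuuuu 1 failed"
sdr = "Data Offset : 264 sectors\nEvents : 90\nArray State : uuUuuuuuuu 1 failed"

same_events = parse_examine(sdq)["Events"] == parse_examine(sdr)["Events"]
print(same_events)  # True: matching event counts make a forced assembly plausible
```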
* raid6 recovery
@ 2009-01-15 15:24 Jason Weber
0 siblings, 0 replies; 13+ messages in thread
From: Jason Weber @ 2009-01-15 15:24 UTC (permalink / raw)
To: linux-raid
Before I cause too much damage, I really need expert help.
Early this morning, machine locked up and my 4x500Gb raid6 did not
recover on reboot.
A smaller 2x18Gb raid came up as normal.
/var/log/messages has:
Jan 15 01:12:22 wildfire Pid: 6056, comm: mdadm Tainted: P
2.6.19-gentoo-r5 #3
with some codes and a lot of others like it when it went down. And then,
Jan 15 01:16:37 wildfire mdadm: DeviceDisappeared event detected on md
device /dev/md1
I tried simple re-adds:
# mdadm /dev/md1 --add /dev/sdd /dev/sde
mdadm: cannot get array info for /dev/md1
Eventually I noticed that the drives had a different UUID than mdadm.conf;
one byte had changed. I have a backup of mdadm.conf so I know that
was the same.
So, I changed mdadm.conf to match the drives and started an assemble
# mdadm --assemble --verbose /dev/md1
mdadm: looking for devices for /dev/md1
mdadm: cannot open device
/dev/disk/by-uuid/d7a08e91-0a49-4e91-91d7-d9d1e9e6cda1: Device or
resource busy
mdadm: /dev/disk/by-uuid/d7a08e91-0a49-4e91-91d7-d9d1e9e6cda1 has wrong uuid.
mdadm: no recogniseable superblock on /dev/sdg1
mdadm: /dev/sdg1 has wrong uuid.
mdadm: no recogniseable superblock on /dev/sdg
mdadm: /dev/sdg has wrong uuid.
mdadm: cannot open device /dev/sdi2: Device or resource busy
mdadm: /dev/sdi2 has wrong uuid.
mdadm: cannot open device /dev/sdi1: Device or resource busy
mdadm: /dev/sdi1 has wrong uuid.
mdadm: cannot open device /dev/sdi: Device or resource busy
mdadm: /dev/sdi has wrong uuid.
mdadm: cannot open device /dev/sdh1: Device or resource busy
mdadm: /dev/sdh1 has wrong uuid.
mdadm: cannot open device /dev/sdh: Device or resource busy
mdadm: /dev/sdh has wrong uuid.
mdadm: /dev/sdc has wrong uuid.
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has wrong uuid.
mdadm: cannot open device /dev/sdb: Device or resource busy
mdadm: /dev/sdb has wrong uuid.
mdadm: cannot open device /dev/sda4: Device or resource busy
mdadm: /dev/sda4 has wrong uuid.
mdadm: cannot open device /dev/sda3: Device or resource busy
mdadm: /dev/sda3 has wrong uuid.
mdadm: cannot open device /dev/sda2: Device or resource busy
mdadm: /dev/sda2 has wrong uuid.
mdadm: cannot open device /dev/sda1: Device or resource busy
mdadm: /dev/sda1 has wrong uuid.
mdadm: cannot open device /dev/sda: Device or resource busy
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sdf is identified as a member of /dev/md1, slot 1.
mdadm: /dev/sde is identified as a member of /dev/md1, slot 0.
mdadm: /dev/sdd is identified as a member of /dev/md1, slot 3.
which has been sitting there for about four hours at full CPU and, as far
as I can tell, with not much drive activity (how can I tell? they're not
very loud relative to the overall machine noise).
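[Editorial aside: on "how can I tell?": rather than listening to the drives, two snapshots of /proc/diskstats can be diffed. A minimal sketch with a hypothetical helper, shown against made-up counter lines; per the kernel's iostats documentation, field 4 is reads completed and field 8 is writes completed:]

```python
def parse_diskstats(text, devices):
    """Return {device: (reads_completed, writes_completed)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        # f[2] is the device name; f[3] reads completed, f[7] writes completed
        if len(f) >= 8 and f[2] in devices:
            stats[f[2]] = (int(f[3]), int(f[7]))
    return stats

# Made-up counter lines standing in for two reads of /proc/diskstats a few seconds apart
before = "8 48 sdd 1000 0 8000 0 500 0 4000 0 0 0 0\n8 64 sde 1000 0 8000 0 500 0 4000 0 0 0 0"
after  = "8 48 sdd 1000 0 8000 0 500 0 4000 0 0 0 0\n8 64 sde 1500 0 12000 0 500 0 4000 0 0 0 0"

snap1 = parse_diskstats(before, {"sdd", "sde"})
snap2 = parse_diskstats(after, {"sdd", "sde"})
busy = [d for d in snap1 if snap2[d] != snap1[d]]
print(busy)  # only the devices whose counters moved between snapshots
```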
As for "damage" I've done: first of all, one typo added /dev/sdc, one of
md1's members, to the md0 array, so now it thinks it is 18Gb according to
mdadm -E. Hopefully it was only set as a spare, so maybe it didn't get
scrambled:
# mdadm -E /dev/sdc
/dev/sdc:
Magic : a92b4efc
Version : 00.90.00
UUID : 96a4204f:7b6211e6:34105f4c:9857a351
Creation Time : Tue May 17 23:03:53 2005
Raid Level : raid1
Used Dev Size : 17952512 (17.12 GiB 18.38 GB)
Array Size : 17952512 (17.12 GiB 18.38 GB)
Raid Devices : 2
Total Devices : 3
Preferred Minor : 0
Update Time : Thu Jan 15 01:52:42 2009
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 0
Spare Devices : 1
Checksum : 195f64d3 - correct
Events : 0.39649024
Number Major Minor RaidDevice State
this 2 8 32 2 spare /dev/sdc
0 0 8 113 0 active sync /dev/sdh1
1 1 8 129 1 active sync /dev/sdi1
2 2 8 32 2 spare /dev/sdc
Here's the others:
# mdadm -E /dev/sdd
/dev/sdd:
Magic : a92b4efc
Version : 00.91.00
UUID : f92d43a8:5ab3f411:26e606b2:3c378a67
Creation Time : Sat Oct 13 00:23:51 2007
Raid Level : raid6
Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
Array Size : 976772992 (931.52 GiB 1000.22 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 1
Reshape pos'n : 9223371671782555647
Update Time : Thu Jan 15 01:12:21 2009
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Checksum : dca29b4 - correct
Events : 0.79926
Chunk Size : 64K
Number Major Minor RaidDevice State
this 3 8 48 3 active sync /dev/sdd
0 0 8 64 0 active sync /dev/sde
1 1 8 80 1 active sync /dev/sdf
2 2 8 32 2 active sync /dev/sdc
3 3 8 48 3 active sync /dev/sdd
# mdadm -E /dev/sde
/dev/sde:
Magic : a92b4efc
Version : 00.91.00
UUID : f92d43a8:5ab3f411:26e606b2:3c378a67
Creation Time : Sat Oct 13 00:23:51 2007
Raid Level : raid6
Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
Array Size : 976772992 (931.52 GiB 1000.22 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 1
Reshape pos'n : 9223371671782555647
Update Time : Thu Jan 15 01:12:21 2009
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Checksum : dca29be - correct
Events : 0.79926
Chunk Size : 64K
Number Major Minor RaidDevice State
this 0 8 64 0 active sync /dev/sde
0 0 8 64 0 active sync /dev/sde
1 1 8 80 1 active sync /dev/sdf
2 2 8 32 2 active sync /dev/sdc
3 3 8 48 3 active sync /dev/sdd
# mdadm -E /dev/sdf
/dev/sdf:
Magic : a92b4efc
Version : 00.91.00
UUID : f92d43a8:5ab3f411:26e606b2:3c378a67
Creation Time : Sat Oct 13 00:23:51 2007
Raid Level : raid6
Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
Array Size : 976772992 (931.52 GiB 1000.22 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 1
Reshape pos'n : 9223371671782555647
Update Time : Thu Jan 15 01:12:21 2009
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Checksum : dca29d0 - correct
Events : 0.79926
Chunk Size : 64K
Number Major Minor RaidDevice State
this 1 8 80 1 active sync /dev/sdf
0 0 8 64 0 active sync /dev/sde
1 1 8 80 1 active sync /dev/sdf
2 2 8 32 2 active sync /dev/sdc
3 3 8 48 3 active sync /dev/sdd
/etc/mdadm.conf:
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#
# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions
# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes
# automatically tag new arrays as belonging to the local system
HOMEHOST <system>
# instruct the monitoring daemon where to send mail alerts
MAILADDR root
# definitions of existing MD arrays
ARRAY /dev/md1 level=raid6 num-devices=4
UUID=f92d43a8:5ab3f411:26e606b2:3c378a67
ARRAY /dev/md0 level=raid1 num-devices=2
UUID=96a4204f:7b6211e6:34105f4c:9857a351
# This file was auto-generated on Tue, 11 Mar 2008 00:10:35 -0700
# by mkconf $Id: mkconf 324 2007-05-05 18:49:44Z madduck $
It previously said:
UUID=f92d43a8:5ab3f491:26e606b2:3c378a67
with a ...491.. instead of ...411...
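[Editorial aside: the single changed character can be pinpointed mechanically by comparing the two UUID strings from the message above, position by position:]

```python
old_conf = "f92d43a8:5ab3f491:26e606b2:3c378a67"  # what mdadm.conf previously said
on_disk  = "f92d43a8:5ab3f411:26e606b2:3c378a67"  # what the drives now report

# Collect (index, old_char, new_char) for every position that differs
diff = [(i, o, n) for i, (o, n) in enumerate(zip(old_conf, on_disk)) if o != n]
print(diff)  # [(15, '9', '1')] -- a single hex digit differs
```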
Is mdadm --assemble supposed to take a long time or should it almost
immediately come back
and let me watch /proc/mdstat, which currently just says:
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sdh1[0] sdi1[1]
17952512 blocks [2/2] [UU]
unused devices: <none>
Also, I did modprobe raid456 manually before the assemble since I
noticed it was only saying raid1.
Maybe it would have been automatic at the right moment anyhow.
Should I just wait for the assemble or is it doing nothing?
Can I recover /dev/sdc as well, or is that unimportant since I can clear
it and re-add it if the other three (or even two) sync up and become
available?
This md1 has been trouble since inception a couple of years ago; I get
corrupt files every week or so, it seems. My little U320 SCSI md0 raid1
has been nearly uneventful for a much longer time.
Is raid6 less stable, or maybe my sata_sil24 card is a bad choice?
Maybe SATA doesn't measure up to SCSI. So please point out any obvious
foolishness on my part.
I do have a five-day-old single non-raid partial backup, which is now the
only copy of the data. I'm very nervous about critical loss. If I
absolutely need to start over, I'd like to get some redundancy in my data
as soon as possible. Perhaps breaking it into a pair of raid1 arrays is
smarter anyhow.
-- Jason P Weber
end of thread
Thread overview: 13+ messages
2020-03-19 19:55 Raid6 recovery Glenn Greibesland
2020-03-20 19:15 ` Wols Lists
[not found] ` <CA+9eyigMV-E=FwtXDWZszSsV6JOxxFOFVh6WzmeH=OC3heMUHw@mail.gmail.com>
2020-03-21 0:06 ` antlists
2020-03-21 11:54 ` Glenn Greibesland
2020-03-21 19:24 ` Phil Turmel
2020-03-21 22:12 ` Glenn Greibesland
2020-03-22 0:32 ` Phil Turmel
2020-03-23 9:23 ` Wols Lists
2020-03-23 12:35 ` Glenn Greibesland
2020-03-22 0:05 ` Wols Lists
-- strict thread matches above, loose matches on Subject: below --
2011-01-14 16:16 raid6 recovery Björn Englund
2011-01-14 21:52 ` NeilBrown
2009-01-15 15:24 Jason Weber