* Re: mdadm RAID6 "active" with spares and failed disks; need help
From: Matt Callaghan @ 2015-01-07 13:34 UTC
To: linux-raid

Just to give a small update (I realize many people may still be on
holidays): I've been working with a few people on IRC and, together with
lots of reading about others' recovery experiences, attempting to recover
the array, but no luck yet. I /hope/ I haven't ruined anything.

The forum post referenced below has full details, but here's a summary of
"what happened". Notice how some drives are "moving" around :( [either due
to a mistake I made, or the server halting/locking up during rebuilds, I'm
not sure]
{{{
---------------------------------------------------------------------------------------------------------------------------
|          |          |                                       Device Role #                                                |
---------------------------------------------------------------------------------------------------------------------------
| DEVICE   | COMMENTS | Dec GOOD | Jan4 6:28AM | 12:10PM | 12:40PM | Jan5 12:30AM | 12:50AM | 8:30AM | 6:34PM | Jan6 6:45AM |
---------------------------------------------------------------------------------------------------------------------------
| /dev/sdi |          | 4        | 4           | 4       | 4       | 4            | 4       | 4      | 4      | 4           |
| /dev/sdj | failing  | 5        | 5 FAIL      | ( )     | 8       | 8            | 8 FAIL  | ( )    | ( )    | ( )         |
| /dev/sdk | failing? | 0        | 0           | 0       | 0       | 0            | 0       | 0      | 0 FAIL | 0 FAIL      |
| /dev/sdl |          | 6        | 6           | 6       | 6       | 6            | 6       | 6      | 6      | 6           |
| /dev/sdm |          | 1        | 1           | 1       | 1       | ( )          | ( )     | ( )    | 8      | 8 SPARE     |
| /dev/sdn |          | 2        | 2           | 2       | 2       | 2            | 2       | 2      | 2      | 2           |
| /dev/sdo |          | 3        | 3           | 3       | 3       | 3            | 3       | 3      | 3      | 3           |
| /dev/sdp |          | 7        | 7           | 7       | 7       | 7            | 7       | 7      | 7      | 7           |
---------------------------------------------------------------------------------------------------------------------------
}}}

Full details from my e-mail notifications of /proc/mdstat (although
unfortunately I don't have FULL mdadm --detail/--examine information per
state transition):
{{{
Dec GOOD
md2000 : active raid6 sdo1[3] sdj1[5] sdk1[0] sdi1[4] sdn1[2] sdm1[1] sdp1[7] sdl1[6]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/8] [UUUUUUUU]

FAIL EVENT on Jan 4th @ 6:28AM
md2000 : active raid6 sdo1[3] sdj1[5](F) sdk1[0] sdi1[4] sdn1[2] sdm1[1] sdp1[7] sdl1[6]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/7] [UUUU_UUU]
      [==============>......]  check = 73.6% (1439539228/1953513408) finish=536.6min speed=15960K/sec

DEGRADED EVENT on Jan 4th @ 6:39AM
md2000 : active raid6 sdo1[3] sdj1[5](F) sdk1[0] sdi1[4] sdn1[2] sdm1[1] sdp1[7] sdl1[6]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/7] [UUUU_UUU]
      [==============>......]  check = 73.6% (1439539228/1953513408) finish=5091.8min speed=1682K/sec

DEGRADED EVENT on Jan 4th @ 12:10PM
md2000 : active raid6 sdo1[3] sdn1[2] sdi1[4] sdm1[1] sdk1[0] sdp1[7] sdl1[6]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/7] [UUUU_UUU]

DEGRADED EVENT on Jan 4th @ 12:21PM
md2000 : active raid6 sdk1[0] sdo1[3] sdm1[1] sdn1[2] sdi1[4] sdp1[7] sdl1[6]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/7] [UUUU_UUU]

DEGRADED EVENT on Jan 4th @ 12:40PM
md2000 : active raid6 sdj1[8] sdm1[1] sdo1[3] sdn1[2] sdk1[0] sdi1[4] sdp1[7] sdl1[6]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/7] [UUUU_UUU]
      [>....................]  recovery = 0.2% (5137892/1953513408) finish=921.7min speed=35227K/sec

DEGRADED EVENT on Jan 5th @ 12:30AM
md2000 : active raid6 sdk1[0] sdo1[3] sdn1[2] sdj1[8] sdi1[4] sdl1[6] sdp1[7]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/6] [U_UU_UUU]
      [============>........]  recovery = 62.9% (1229102028/1953513408) finish=259.8min speed=46466K/sec

FAIL SPARE EVENT on Jan 5th @ 12:50AM
md2000 : active raid6 sdk1[0] sdo1[3] sdn1[2] sdj1[8](F) sdi1[4] sdl1[6] sdp1[7]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/6] [U_UU_UUU]
      [=============>.......]  recovery = 68.1% (1332029020/1953513408) finish=150.3min speed=68897K/sec

DEGRADED EVENT on Jan 5th @ 6:43AM
md2000 : active raid6 sdk1[0] sdo1[3] sdn1[2] sdj1[8](F) sdi1[4] sdl1[6] sdp1[7]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/6] [U_UU_UUU]
      [=============>.......]  recovery = 68.1% (1332029020/1953513408) finish=76028.6min speed=136K/sec

TEST MESSAGE on Jan 5th @ 8:30AM
md2000 : active raid6 sdo1[3] sdi1[4] sdn1[2] sdk1[0] sdl1[6] sdp1[7]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/6] [U_UU_UUU]
}}}

I've tried mdadm --create --assume-clean for several combinations of the
"device role # ordering", but so far none have exposed a usable ext4
partition on /dev/md2000. I was speaking with someone on IRC: the default
data offset for member devices has changed over time in mdadm, so I need to
compile mdadm 3.3.x and attempt it that way. I'll update when I get to
trying that.

~Fermmy

-------- Original Message --------
From: Matt Callaghan <matt_callaghan@sympatico.ca>
Sent: Tue 06 Jan 2015 09:16:52 AM EST
To: linux-raid@vger.kernel.org
Subject: mdadm RAID6 "active" with spares and failed disks; need help

> I think I'm in a really bad state. Could an expert w/ mdadm please
> help?
[trim /]
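[Editor's note: a minimal sketch of the kind of re-create attempt described
above, for reference only. It is an illustration, not the known-good
command: the device order follows the "Dec GOOD" column of the table above
(roles 0..7 = sdk, sdm, sdn, sdo, sdi, sdj, sdl, sdp, with the failed sdj
slot left "missing"), and the --data-offset value (304 sectors = 152K, per
the old superblocks) and its acceptance by a given mdadm build are
assumptions to verify.]
{{{
# Sketch only: rewrite array metadata without resyncing data blocks.
# The device order encodes the role numbers; "missing" holds the dead slot.
mdadm --stop /dev/md2000
mdadm --create /dev/md2000 --assume-clean --run \
      --level=6 --raid-devices=8 --chunk=64 --metadata=1.1 \
      --data-offset=152K \
      /dev/sdk1 /dev/sdm1 /dev/sdn1 /dev/sdo1 \
      /dev/sdi1 missing /dev/sdl1 /dev/sdp1
fsck.ext4 -n /dev/md2000   # read-only sanity check before trusting it
}}}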
* Re: mdadm RAID6 "active" with spares and failed disks; need help
From: Matt Callaghan @ 2015-01-11 20:26 UTC
To: linux-raid

Updating this e-mail thread. I got the latest mdadm version, which supports
data offset variance per device, and attempted to reconstruct the RAID6
according to the previous data, but so far no luck. As far as I can tell
(sadly), all of my data is lost. I've updated the forum thread with the
final details and failures.
http://www.linuxquestions.org/questions/linux-server-73/mdadm-raid6-active-with-spares-and-failed-disks%3B-need-help-4175530127/

I'll leave the drives "in this state" until the end of the month in the
hope that someone has another idea on how to recover.

NOTE: I will pay $$$ if there is a person that helps me to recover the data :)

~Matt

-------- Original Message --------
From: Matt Callaghan <matt_callaghan@sympatico.ca>
Sent: Wed 07 Jan 2015 08:34:01 AM EST
To: linux-raid@vger.kernel.org
Subject: Re: mdadm RAID6 "active" with spares and failed disks; need help

> Just to give a small update (I realize many people may still be on
> holidays): I've been working with a few people on IRC [...]
[trim /]
* Re: mdadm RAID6 "active" with spares and failed disks; need help
From: Valentijn Sessink @ 2015-01-11 23:22 UTC
To: Matt Callaghan, linux-raid

Hi Matt,

I'm by no means a specialist, but I have been saving a few arrays lately,
so here's my 2 cents. From what I see, I'd say you're almost there, but you
didn't use "--bitmap=none" in your create statement; as far as I can see,
there is no bitmap specified in the original raid blocks, but there is one
in the newly created array. I may be wrong, though! Please take my advice
with a grain of salt and at your own risk.

Also, I would not have dared to run all these statements on "live" (or
dead, for that matter ;-) disks. See my posting (which is unfinished, but
I'll add info as I have time) at http://valentijn.sessink.nl/?p=557 where I
use "dmsetup" to create a few virtual disks - all writes are redirected to
another device. The fun thing is that, after that, you can mess up all you
want. You just remove the virtual disk and poof, everything is as it was
before (failed raid and all; isn't that funny? :)

I hope this helps. Best regards,

Valentijn

On 11-01-15 21:26, Matt Callaghan wrote:
> Updating this e-mail thread. I got the latest mdadm version, which
> supports data offset variance per device, and attempted to reconstruct
> the RAID6 according to the previous data, but so far no luck.
> As far as I can tell (sadly), all of my data is lost. I've updated the
> forum thread with the final details and failures.
> http://www.linuxquestions.org/questions/linux-server-73/mdadm-raid6-active-with-spares-and-failed-disks%3B-need-help-4175530127/

-- 
✉ v@lentijn.sess.ink ☏ +31777777713 (31 7x7 13) ⌂ durgerdamstraat 29 zaandam
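[Editor's note: for reference, a minimal sketch of the dmsetup overlay
trick described above, assuming one member partition (/dev/sdb1) and an
illustrative 2G sparse file as the copy-on-write store; all names here are
placeholders.]
{{{
# Writes to the virtual device land in the COW file; the real disk is
# never touched, so a failed experiment can simply be thrown away.
dev=/dev/sdb1
truncate -s 2G /tmp/overlay-sdb1            # sparse copy-on-write store
loop=$(losetup -f --show /tmp/overlay-sdb1)
dmsetup create sdb1-cow --table \
    "0 $(blockdev --getsz $dev) snapshot $dev $loop N 8"
# ...experiment on /dev/mapper/sdb1-cow instead of $dev; afterwards:
#   dmsetup remove sdb1-cow && losetup -d $loop
# and the real disk is exactly as it was before.
}}}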
* Re: mdadm RAID6 "active" with spares and failed disks; need help
From: Wols Lists @ 2015-01-12 16:35 UTC
To: linux-raid

On 11/01/15 23:22, Valentijn Sessink wrote:
> Also, I would not have dared to run all these statements on "live" (or
> dead, for that matter ;-) disks.

I'm no expert either, but looking at the blog, I'm worried he might have
trashed the array with almost the first thing he did :-(

"add" says it will add a drive as a spare if it didn't originally belong to
the array. If it adds a spare to a degraded array, the array will
immediately start to repair itself. OOPS!!!

And it sounds like exactly this could have happened - mdadm didn't
recognise the disk as it added it. Just speculating, but unfortunately this
seems quite likely :-(

Cheers,
Wol
* Re: mdadm RAID6 "active" with spares and failed disks; need help
From: Matt Callaghan @ 2015-01-21 0:34 UTC
To: Wols Lists, linux-raid

Thanks Wols and Valentijn for your input.

I tried again with --bitmap=none; clearly that was a miss on my part.
However, even with that correction, and attempting across varying
combinations of "drive ordering", the filesystem appears corrupt.

I think I have to accept entire data loss here. :(

~Matt

-------- Original Message --------
From: Wols Lists <antlists@youngman.org.uk>
Sent: 1/12/2015, 11:35:57 AM
To: linux-raid@vger.kernel.org
Subject: Re: mdadm RAID6 "active" with spares and failed disks; need help

> I'm no expert either, but looking at the blog, I'm worried he might have
> trashed the array with almost the first thing he did :-(
[trim /]
* Re: mdadm RAID6 "active" with spares and failed disks; need help
From: Valentijn @ 2015-01-22 9:47 UTC
To: Matt Callaghan, Wols Lists, linux-raid

Hi Matt,

As long as your data is still somewhere on these disks, all is not -
necessarily - lost. You could still try using dumpe2fs (and later e2fsck)
with different superblocks. And even if you cannot find your file system by
any means, you could try the "foremost" utility to scrape images, documents
and the like off these disks. So I still don't think all is lost. However,
I do think that will cost more time. You may want to dedicate a spare
machine to this task, because of the resources.

I see that your mdadm says this, somewhere along your odyssey:

mdadm: /dev/sdk1 appears to contain an ext2fs file system
       size=1695282944K  mtime=Tue Apr 12 11:10:24 1977

... which could mean (I'm not sure, I'm just guessing) that due to the
internal bitmap, your fs has been overwritten. Your new array in fact said:

  Internal Bitmap : 8 sectors from superblock
      Update Time : Wed Jan  7 09:46:44 2015
     Bad Block Log : 512 entries available at offset 72 sectors
         Checksum : c7603819 - correct
           Events : 0

... as far as I understand, this means that starting 8 sectors from the
superblock, some - whatever size - sectors were occupied by the internal
bitmap, which, in turn, would mean your filesystem superblock has been
overwritten. The good news is: there is more than one superblock.

BTW, didn't you have the right raid drive ordering from the original disks?
You did have output of "mdadm --examine" after the array broke down, didn't
you? So your "create" statement is, by definition, correct if a new
"--examine" shows the same output - and hence the filesystem should be
intact if that is the case.

So please try whether "dumpe2fs -h -o superblock=32768" does anything. Or
98304, 163840, 229376. Dumpe2fs just dumps the fs header, nothing more.

If dumpe2fs doesn't do anything (but complain that it "Couldn't find valid
filesystem superblock"), then you could still try whether "foremost" finds
anything. It's not that hard to use: you simply dedicate some storage to it
and tell it to scrape your array. It *will* find things, and it's up to you
to see if:

1) documents, images and the like are all 64K or 128K or less - and/or
contain large blocks of rubbish. This probably means you have the wrong
array config, because foremost in this case only finds single "chunks" with
correct data - if a file is longer, it doesn't find it and/or spews out
random data from other images.

2) documents, images etcetera are OK. This means your array is OK. You can
then use foremost to scrape off everything (it may take weeks, but it could
work), or simply try to find where the filesystem superblock hangs out (if
the array is in good order, the fs superblock must be somewhere, right?)

Please, please try to do as little as possible on the real disks. Use
dmsetup to create snapshots. Copy the disks. Use hardware that is in good
state - you don't want to lose the data you just found back because the
memory is flaky, do you? ;-)

I hope it's going to work. Best regards,

Valentijn

On 01/21/15 01:34, Matt Callaghan wrote:
> I tried again with --bitmap=none; clearly that was a miss on my part.
> However, even with that correction, and attempting across varying
> combinations of "drive ordering", the filesystem appears corrupt.
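[Editor's note: a short sketch of the probe sequence suggested above. The
backup superblock locations listed are the common defaults for a 4K-block
ext4 filesystem, as Valentijn's numbers suggest; verify them for the actual
geometry with "mkfs.ext4 -n" on a scratch device of the same size. All
paths are illustrative.]
{{{
# Try the standard backup superblock locations read-only; stop at the
# first one that yields a sane filesystem header.
for sb in 32768 98304 163840 229376 294912; do
    echo "=== backup superblock at block $sb ==="
    dumpe2fs -h -o superblock=$sb /dev/md2000 && break
done
# If one dumps a plausible header, follow up with a report-only fsck:
#   e2fsck -n -b $sb /dev/md2000
# As a last resort, carve files off the raw array with foremost:
#   foremost -i /dev/md2000 -o /scratch/carved
}}}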
* Re: mdadm RAID6 "active" with spares and failed disks; need help
From: Matt Callaghan @ 2015-03-27 23:48 UTC
To: linux-raid

Back at it with a fresh brain and fresh hardware. (Several months ago I got
part-way through Valentijn's ideas but not all the way -- I decided to get
a clean setup before progressing further.)

I have built a new (fresh/clean) server and compiled+installed the latest
mdadm, v3.3.2. The 8x drives from this RAID6 array have also been moved to
the new temporary server.

Now of course, in the new server, the device labels are different. I need
to map the previous "known labels" in the old server (/dev/sdX) to the "new
labels" in order to get the drive ordering for re-assembly right.
http://www.linuxquestions.org/questions/linux-server-73/mdadm-raid6-active-with-spares-and-failed-disks%3B-need-help-4175530127/

e.g. before I had:
{{{
/dev/sd[nmlpiokj]1
}}}
, and now I have:
{{{
/dev/sd[abcdefghi]1
}}}

Unfortunately I don't have any smartctl output saved from the previous
server, and I can't find a way to map device labels to serial numbers. Any
thoughts on how I could do this based on the data I have saved in that
forum post?

~Matt

-------- Original Message --------
From: Valentijn <v@lentijn.sess.ink>
Sent: 1/22/2015, 4:47:38 AM
To: Matt Callaghan <matt_callaghan@sympatico.ca>, Wols Lists <antlists@youngman.org.uk>, linux-raid@vger.kernel.org
Subject: Re: mdadm RAID6 "active" with spares and failed disks; need help

> As long as your data is still somewhere on these disks, all is not -
> necessarily - lost.
[trim /]
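[Editor's note: for reference, a sketch of how a name-to-serial mapping can
be recorded on the new host. This shows the current mapping only; it cannot
reconstruct the old server's /dev/sdX assignments retroactively - that
information has to come from the md superblocks, as the next reply
explains. Device names are illustrative.]
{{{
# udev's by-id links embed vendor, model and serial in the link name.
ls -l /dev/disk/by-id/ | grep -v part
# The same information straight from the drives:
for d in /dev/sd[b-i]; do
    printf '%s: ' "$d"
    smartctl -i "$d" | grep -i 'serial number'
done
}}}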
* Re: mdadm RAID6 "active" with spares and failed disks; need help
From: Phil Turmel @ 2015-03-28 1:59 UTC
To: Matt Callaghan, linux-raid

Hi Matt,

On 03/27/2015 07:48 PM, Matt Callaghan wrote:
> Back at it with a fresh brain and fresh hardware. [...]
>
> I have built a new (fresh/clean) server and compiled+installed the
> latest mdadm, v3.3.2. The 8x drives from this RAID6 array have also
> been moved to the new temporary server.
>
> Now of course, in the new server, the device labels are different. I
> need to map the previous "known labels" in the old server (/dev/sdX) to
> the "new labels" in order to get the drive ordering for re-assembly
> right.
> http://www.linuxquestions.org/questions/linux-server-73/mdadm-raid6-active-with-spares-and-failed-disks%3B-need-help-4175530127/

I read through this. Given all of the destructive actions you took, I am
doubtful you will ever get your data. Like mounting "readonly": that gives
you a readonly filesystem, but it writes to the device - possibly a great
deal, if there's a journal to replay. You also trimmed much useful data
with "grep" that probably would help us save you now.

However, in the hope you might have useful data that can be correlated
with current status, start with lsdrv [1]. Paste the output in your reply
with word wrap turned off. That'll at least give us a correlation between
device name and serial number.

> e.g. before I had:
> {{{
> /dev/sd[nmlpiokj]1
> }}}

FWIW, it is not safe to use square bracket notation when order matters.

> , and now I have:
> {{{
> /dev/sd[abcdefghi]1
> }}}

The linux 'sd' driver has never guaranteed consistent device names. It's
merely an artifact of boot timing that makes it look that way. Which is why
array members have superblocks that record the roles. You absolutely
*must* have accurate role numbers to get your data back. Show complete
'mdadm -E' output for all of your member partitions as they stand now.

> Unfortunately I don't have any smartctl output saved from the previous
> server, and I can't find a way to map device labels to serial numbers.
> Any thoughts on how I could do this based on the data I have saved in
> that forum post?

Please show current 'smartctl -x' output for all of these devices, too.
Just paste it all in your reply (with word wrap turned off).

Phil

[1] https://github.com/pturmel/lsdrv
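[Editor's note: a minimal sketch of the evidence-gathering pass requested
above, assuming the array members now sit at /dev/sdb..sdi as the later
lsdrv output shows; file names are illustrative.]
{{{
# Capture one file per drive so reports can be compared and re-sent later.
mkdir -p /root/raid-evidence && cd /root/raid-evidence
for d in b c d e f g h i; do
    smartctl -x /dev/sd$d  > smart-sd$d.txt
    mdadm -E /dev/sd${d}1  > examine-sd${d}1.txt
done
# Quick overview of whatever role numbers survive in the superblocks:
grep -H 'Device Role' examine-*.txt
}}}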
* Re: mdadm RAID6 "active" with spares and failed disks; need help
From: Roman Mamedov @ 2015-03-28 10:11 UTC
To: Phil Turmel; Cc: Matt Callaghan, linux-raid

On Fri, 27 Mar 2015 21:59:38 -0400 Phil Turmel <philip@turmel.org> wrote:
> I read through this. Given all of the destructive actions you took, I
> am doubtful you will ever get your data. Like mounting "readonly":
> that gives you a readonly filesystem, but it writes to the device -
> possibly a great deal, if there's a journal to replay.

Are you sure? Which FS does that? I remember some discussion on FS lists
(Btrfs?), and IIRC the consensus and the implemented behavior was that the
device shouldn't ever be touched with writes on RO mounts, no matter what.

-- 
With respect,
Roman
* Read-only mounts (was mdadm RAID6 "active" with spares and failed disks; need help)
From: Phil Turmel @ 2015-03-28 15:11 UTC
To: Roman Mamedov; Cc: Matt Callaghan, linux-raid

On 03/28/2015 06:11 AM, Roman Mamedov wrote:
> On Fri, 27 Mar 2015 21:59:38 -0400 Phil Turmel <philip@turmel.org> wrote:
>> I read through this. Given all of the destructive actions you took, I
>> am doubtful you will ever get your data. Like mounting "readonly":
>> that gives you a readonly filesystem, but it writes to the device -
>> possibly a great deal, if there's a journal to replay.
>
> Are you sure? Which FS does that? I remember some discussion on FS
> lists (Btrfs?), and IIRC the consensus and the implemented behavior was
> that the device shouldn't ever be touched with writes on RO mounts, no
> matter what.

I remember people being burned by it with ext3/4 a couple of years ago.
Which is why all of the array disaster recoveries I've helped with called
for fsck -n to verify a reconstruction attempt, not a mount -o ro.

I will set up a small VM and see what recent kernels do.

Phil
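[Editor's note: a short sketch of write-free inspection, for reference.
"noload" is the ext3/4 mount option that skips exactly the journal replay
discussed above; forcing the block device read-only first is
belt-and-braces. Assumes the filesystem is ext4 and the array is at
/dev/md2000.]
{{{
blockdev --setro /dev/md2000          # kernel now refuses writes to the device
fsck.ext4 -n /dev/md2000              # report-only check; modifies nothing
mount -o ro,noload /dev/md2000 /mnt   # ro mount without journal replay
}}}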
* Re: mdadm RAID6 "active" with spares and failed disks; need help
From: Phil Turmel @ 2015-03-28 17:40 UTC
To: Matt Callaghan, linux-raid

Hi Matt,

I didn't see this make it to linux-raid, so I'll quote more than normal.
Oh, and the convention on kernel.org is to reply-to-all, trim unnecessary
quotes, and avoid top-posting. Please.

On 03/27/2015 11:10 PM, Matt Callaghan wrote:
> Just noticed the lsdrv [1] link to git; got it, here's the output
> {{{
> fermulator@fermmy-mdadm:~/downloads/lsdrv/lsdrv$ ./lsdrv
> PCI [ahci] 00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 40)
> ├scsi 0:x:x:x [Empty]
> └scsi 1:0:0:0 ATA Maxtor 6Y160M0
>  └sda 152.67g [8:0] Empty/Unknown
>   ├sda1 512.00m [8:1] Empty/Unknown
>   │└Mounted as /dev/sda1 @ /boot/efi
>   ├sda2 148.71g [8:2] Empty/Unknown
>   │└Mounted as /dev/disk/by-uuid/5549ca2f-758a-4e04-8e36-cf4544bef4fb @ /
>   └sda3 3.46g [8:3] Empty/Unknown
> PCI [mptsas] 05:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
> ├scsi 2:0:0:0 ATA ST2000DL003-9VT1
> │└sdb 1.82t [8:16] Empty/Unknown
> │ └sdb1 1.82t [8:17] Empty/Unknown
> ├scsi 2:0:1:0 ATA ST2000DL003-9VT1
> │└sdc 1.82t [8:32] Empty/Unknown
> │ └sdc1 1.82t [8:33] Empty/Unknown
> ├scsi 2:0:2:0 ATA ST2000DL003-9VT1
> │└sdd 1.82t [8:48] Empty/Unknown
> │ └sdd1 1.82t [8:49] Empty/Unknown
> ├scsi 2:0:3:0 ATA ST2000VN000-1H31
> │└sde 1.82t [8:64] Empty/Unknown
> │ └sde1 1.82t [8:65] Empty/Unknown
> ├scsi 2:0:4:0 ATA ST2000DL003-9VT1
> │└sdf 1.82t [8:80] Empty/Unknown
> │ └sdf1 1.82t [8:81] Empty/Unknown
> ├scsi 2:0:5:0 ATA ST2000DL003-9VT1
> │└sdg 1.82t [8:96] Empty/Unknown
> │ └sdg1 1.82t [8:97] Empty/Unknown
> ├scsi 2:0:6:0 ATA ST2000DL003-9VT1
> │└sdh 1.82t [8:112] Empty/Unknown
> │ └sdh1 1.82t [8:113] Empty/Unknown
> ├scsi 2:0:7:0 ATA ST2000VN000-1H31
> │└sdi 1.82t [8:128] Empty/Unknown
> │ └sdi1 1.82t [8:129] Empty/Unknown
> └scsi 2:x:x:x [Empty]
> Other Block Devices
> ├loop0 0.00k [7:0] Empty/Unknown
> ├ram0 64.00m [1:0] Empty/Unknown
[trim /]
> }}}

Ok. Not that helpful. I suspect you had error messages about missing
utilities. No serial numbers.

[trim /]

> mdadm output as of NOW. But note that the output here is likely useless,
> since the last thing I was trying to do was getting the array back
> together as per the forum posting... (it's definitely not in the
> original state anymore...)

Yep, useless.
[trim /]

> smartctl outputs are:

/dev/sdb:
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda Green (AF)
> Device Model:     ST2000DL003-9VT166
> Serial Number:    5YD0XWHR
> Firmware Version: CC32
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Rotation Rate:    5900 rpm
[trim /]
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
> 187 Reported_Uncorrect      -O--CK   099   099   000    -    1
> 188 Command_Timeout         -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
[trim /]
> SCT Error Recovery Control command not supported

Now we know why your array fell apart. Using green and/or desktop drives
without mitigating the timeout mismatch problem.

/dev/sdc:
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda Green (AF)
> Device Model:     ST2000DL003-9VT166
> Serial Number:    5YD1B1ZJ
> Firmware Version: CC32
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Rotation Rate:    5900 rpm
[trim /]
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   078   078   036    -    14728
> 187 Reported_Uncorrect      -O--CK   001   001   000    -    823
> 188 Command_Timeout         -O--CK   100   099   000    -    65539
> 197 Current_Pending_Sector  -O--C-   089   089   000    -    952
                                                               ^^^^^ Wow!
> 198 Offline_Uncorrectable   ----C-   089   089   000    -    952
[trim /]
> SCT Error Recovery Control command not supported

And again.

/dev/sdd:
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda Green (AF)
> Device Model:     ST2000DL003-9VT166
> Serial Number:    5YD15M4K
> Firmware Version: CC32
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Rotation Rate:    5900 rpm
[trim /]
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
> 187 Reported_Uncorrect      -O--CK   097   097   000    -    3
> 188 Command_Timeout         -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    8

More Pending sectors. These are locations where unrecoverable read errors
occurred that the firmware is waiting for an overwrite to decide if they
are fixable.

> 198 Offline_Uncorrectable   ----C-   100   100   000    -    8
[trim /]
> SCT Error Recovery Control command not supported

Sigh.

/dev/sde:
> === START OF INFORMATION SECTION ===
> Device Model:     ST2000VN000-1H3164
> Serial Number:    W1H25K77
> Firmware Version: SC42
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Rotation Rate:    5900 rpm
> Device is:        Not in smartctl database [for details use: -P showall]
[trim /]
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 188 Command_Timeout         -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
[trim /]
> SCT Error Recovery Control:
>            Read:      1 (0.1 seconds)
>           Write:      1 (0.1 seconds)

Interesting. Is this the device default? The drives I've seen that have a
default have either 4.0s or 7.0s.

/dev/sdf:
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda Green (AF)
> Device Model:     ST2000DL003-9VT166
> Serial Number:    5YD18S73
> Firmware Version: CC32
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Rotation Rate:    5900 rpm
[trim /]
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 188 Command_Timeout         -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
[trim /]
> SCT Error Recovery Control command not supported

And again.

/dev/sdg:
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda Green (AF)
> Device Model:     ST2000DL003-9VT166
> Serial Number:    5YD1ACSD
> Firmware Version: CC32
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Rotation Rate:    5900 rpm
[trim /]
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 188 Command_Timeout         -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
[trim /]
> SCT Error Recovery Control command not supported

And sigh again. Broken record, I know. But this is a big deal.

/dev/sdh:
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda Green (AF)
> Device Model:     ST2000DL003-9VT166
> Serial Number:    5YD18S0M
> Firmware Version: CC32
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Rotation Rate:    5900 rpm
[trim /]
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 188 Command_Timeout         -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
[trim /]
> SCT Error Recovery Control command not supported

/dev/sdi:
> === START OF INFORMATION SECTION ===
> Device Model:     ST2000VN000-1H3164
> Serial Number:    W1H25JXM
> Firmware Version: SC42
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Rotation Rate:    5900 rpm
> Device is:        Not in smartctl database [for details use: -P showall]
[trim /]
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 188 Command_Timeout         -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
[trim /]
> SCT Error Recovery Control:
>            Read:      1 (0.1 seconds)
>           Write:      1 (0.1 seconds)

So. You have eight devices that need to make a raid6, and you have no
order information. You have two devices with pending errors that cannot
help us without role #s.

First, you need to deal with the timeout mismatch problem. Only two of your
devices support ERC, so you will need to set long driver timeouts. Some
reading:

http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142504030927143&w=2

As for the latter link, I haven't tested that. When I needed such features
myself, I just put the appropriate commands into rc.local. Since then, I've
retired all of my non-raid-rated drives.

Next, you need to run numerous "mdadm --create --assume-clean" attempts to
figure out your device role order. You have 8-factorial permutations to try
(40,320). /dev/sdc and /dev/sdd have pending errors, so leave them out (use
"missing" in their places). Your only info from the original post that
shows all of the necessary device characteristics is this:

> /dev/sdj1:
>           Magic : a92b4efc
>         Version : 1.1
>     Feature Map : 0x2
>      Array UUID : 15d2158f:5cf74d95:fd7f5607:0e447573
>            Name : fermmy-server:2000  (local to host fermmy-server)
>   Creation Time : Fri Apr 22 01:12:07 2011
>      Raid Level : raid6
>    Raid Devices : 8
>
>  Avail Dev Size : 3907026816 (1863.02 GiB 2000.40 GB)
>      Array Size : 11721080448 (11178.09 GiB 12002.39 GB)
>     Data Offset : 304 sectors
>    Super Offset : 0 sectors
> Recovery Offset : 2441891840 sectors
>           State : clean
>     Device UUID : eee3ae0e:f594fdba:58e19113:bc196464
>
>     Update Time : Mon Jan  5 00:30:41 2015
>        Checksum : 7a5a498d - correct
>          Events : 42912
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>     Device Role : Active device 4
>     Array State : A.AAAAAA ('A' == active, '.' == missing)

Note that the data offset is 304. Some of your devices reported a data
offset of 264.
None of the reports were from original undisturbed devices, so we really
don't know what offset is correct. "mdadm --add" will use that mdadm
version's offset if it can. I suggest you try to re-establish the distro
you used at the time (April 2011) in a VM and create some test arrays with
its version of mdadm to get the offset to try first.

You then need to create a script that will perform the necessary "mdadm
--create --assume-clean" operations, followed by an "fsck -n" of the device
each time to see how messed up it is. Send each attempt's output to its own
log file, so you can see (by size) which attempts were "cleanest". Inspect
the "best" log files manually to see what was found. With 40k permutations,
you may need to work out some grepping that will help separate bad from
possibly good. If none of them come up relatively clean, try again with
your next best guess at the offset.

Good luck!

Phil
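[Editor's note: a sketch of the brute-force loop described above, under the
assumptions Phil lays out (metadata 1.1, 64k chunk, 8 devices, sdc/sdd left
"missing"). "orders.txt" holds one candidate role ordering per line, e.g.
"/dev/sdb1 missing /dev/sde1 missing /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1",
generated however you like; the --data-offset value is one guess to iterate
over, and all names are illustrative.]
{{{
#!/bin/sh
# Timeout-mismatch mitigation for non-ERC drives (rc.local style):
for t in /sys/block/sd[b-i]/device/timeout; do echo 180 > "$t"; done

n=0
while read -r order; do
    n=$((n + 1)); log=attempt-$n.log
    mdadm --stop /dev/md2000 2>/dev/null
    mdadm --create /dev/md2000 --assume-clean --run \
          --level=6 --raid-devices=8 --chunk=64 \
          --metadata=1.1 --data-offset=152K --bitmap=none \
          $order > "$log" 2>&1
    fsck.ext4 -n /dev/md2000 >> "$log" 2>&1
    echo "$n ($order): $(wc -l < "$log") log lines"
done < orders.txt
# The smallest logs mark the least-broken orderings; inspect those by hand.
}}}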
* mdadm RAID6 "active" with spares and failed disks; need help
From: Matt Callaghan @ 2015-01-06 14:16 UTC
To: linux-raid

I think I'm in a really bad state. Could an expert w/ mdadm please help?

I have a RAID6 mdadm device, and it got really messed up with spares:
{{{
md2000 : active raid6 sdm1[8](S) sdo1[3] sdi1[4] sdn1[2] sdk1[0](F) sdl1[6] sdp1[7]
      11721080448 blocks super 1.1 level 6, 64k chunk, algorithm 2 [8/5] [__UU_UUU]
}}}

And it is now really broken (inactive):
{{{
md2000 : inactive sdn1[2](S) sdm1[8](S) sdl1[6](S) sdp1[7](S) sdi1[4](S) sdo1[3](S) sdk1[0](S)
      13674593976 blocks super 1.1
}}}

I have a forum post going w/ full details:
http://www.linuxquestions.org/questions/linux-server-73/mdadm-raid6-active-with-spares-and-failed-disks%3B-need-help-4175530127/

I /think/ I need to force re-assembly here, but I'd like some review from
the experts before proceeding.

Thank you in advance for your time,
~Matt/Fermulator
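[Editor's note: for reference, a minimal sketch of the forced re-assembly
being considered above. --force lets mdadm bump stale event counters on
members that are otherwise consistent; it does not rewrite data. Assemble
reads the role numbers from the superblocks, so the order of the device
list does not matter here. Check "mdadm -E" on every member first and leave
genuinely failed disks out; the device list below is illustrative.]
{{{
mdadm --stop /dev/md2000
mdadm --assemble --force /dev/md2000 \
      /dev/sdi1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1
cat /proc/mdstat   # verify how many members came up before mounting anything
}}}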