* RAID5 failure and consequent ext4 problems
@ 2022-09-08 14:51 Luigi Fabio
  2022-09-08 17:23 ` Phil Turmel
  0 siblings, 1 reply; 24+ messages in thread

From: Luigi Fabio @ 2022-09-08 14:51 UTC (permalink / raw)
To: linux-raid

[also on linux-ext4]

I am encountering an unusual problem after an mdraid failure. I'll
summarise briefly and can provide further details as required.

First of all, the context. This is happening on a Debian 11 system,
amd64 arch, with current updates (kernel 5.10.136-1, util-linux 2.36.1).
The system has a 12-drive mdraid RAID5 for data, recently migrated to
LSI 2308 HBAs. This is relevant because earlier this week, at around
13:00 local (EST), four drives - an entire HBA channel - decided to
drop from the RAID. Of course, mdraid didn't like that and stopped the
arrays. I reverted to best practice and shut down the system first of
all.

Further context: the filesystem in the array is ancient - I am vaguely
proud of that - dating from 2001. It started as ext2, grew to ext3,
then to ext4, and finally to ext4 with the 64bit feature. Because I am
paranoid, I always mount ext4 with nodelalloc and data=journal. The
journal is external, on a RAID1 of SSDs. I recently (within the last
~3 months) enabled metadata_csum, which is relevant to what follows -
the filesystem had never had metadata_csum enabled before.

Upon reboot, the arrays would not reassemble - this is expected,
because 4 of the 12 drives were marked faulty. So I re-created the
array using the same parameters as were used back when the array was
built. Unfortunately, I had a moment of stupid and didn't specify
metadata 0.90 in the re-create, so it was recreated with metadata
1.2... which writes its superblock at the beginning of the components,
not at the end. I noticed it, stopped the array again and recreated
with the correct 0.90, but the damage was done: the 256 byte + 12 * 20
header was written at the beginning of each of the 12 components.
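To put a rough bound on that damage, here is a back-of-the-envelope sketch in Python. It assumes the commonly documented v1.2 on-disk format - superblock at 4 KiB into the component, 256 fixed bytes plus 2 bytes per entry in the device-roles table - so the exact byte counts are worth re-checking against your mdadm version:

```python
# Sketch: estimate which ext4 blocks a stray mdraid v1.2 superblock
# overwrote on each component. Assumed layout (check your mdadm docs):
# superblock at 4 KiB from the component start, 256 fixed bytes plus
# 2 bytes per device in the roles table.
SB_OFFSET = 4096          # v1.2 superblock offset in bytes
SB_FIXED = 256            # fixed portion of the superblock
ROLE_BYTES = 2            # per-device entry in the dev_roles table
N_DEVICES = 12
FS_BLOCK = 4096           # ext4 block size

sb_len = SB_FIXED + ROLE_BYTES * N_DEVICES            # bytes written
first_block = SB_OFFSET // FS_BLOCK                    # first fs block hit
last_block = (SB_OFFSET + sb_len - 1) // FS_BLOCK      # last fs block hit

print(sb_len, first_block, last_block)  # -> 280 1 1
```

Under those assumptions, only ext4 block 1 of each component (the second 4 KiB block) is touched, which agrees with the "second block of each component" estimate that follows.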
Still, unless I am mistaken, this just means that at worst 12 blocks
(the second block of each component) were damaged, which shouldn't be
too bad. The only further possibility is that mdraid also zeroed out
the 'blank space' that it puts AFTER the header block and BEFORE the
data, but according to the documentation it shouldn't do that.

In any case, I subsequently reassembled the array 'correctly' to match
the previous order and settings, and I believe I got it right. I kept
the array RO and tried fsck -n, which gave me this:

ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
fsck.ext4: Group descriptors look bad... trying backup blocks...

It then warns that it won't attempt journal recovery because it's in
RO mode, and declares the fs clean - with a reasonable-looking number
of files and blocks.

If I try to mount -t ext4 -o ro, I get:

mount: /mnt: mount(2) system call failed: Structure needs cleaning.

So before anything else, I tried fsck -nf to make sure that the REST
of the filesystem is in one logical piece. THAT painted a very
different picture. On pass 1, I get approximately 980k (almost 10^6) of

Inode nnnnn passes checks, but checksum does not match inode

and ~2000 of

Inode nnnnn contains garbage

plus some 'tree not optimised' messages, which are technically not
errors, from what I understand. After ~11 hours, it switches to pass
1B and tells me that inode 12 has a long list of duplicate blocks:

Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 12: 2928004133 [....]

And it ends, after the list of multiply-claimed blocks, with:

e2fsck: aborted
Error while scanning inodes (8193): Inode checksum does not match inode
/dev/md123: ********** WARNING: Filesystem still has errors **********
/dev/md123: ********** WARNING: Filesystem still has errors **********

So, what is my next step?
I realise I should NOT have touched the original drives, and should
have dd-ed images to a separate array to work on, but I believe the
only writes that occurred were the mdraid superblocks. I am, in any
case, grabbing more drives to image the 'faulty' array and work on the
images, leaving the original data alone.

Where do I go from here? I have had similar issues in the past, all
the way back to the early 00s, and I had a near-100% success rate by
re-creating the arrays. What is different this time?
Or is nothing different, and the problem is just in the checksumming?

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread
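For the imaging step, GNU ddrescue (with a mapfile) is the usual tool. Purely to illustrate the idea being described - copy everything, don't abort on bad sectors - here is a minimal Python sketch of a chunked copy that zero-fills unreadable chunks; the paths in the comment are hypothetical:

```python
import os

def image_device(src_path, dst_path, chunk=1 << 20):
    """Copy src to dst in fixed-size chunks; on a read error, skip the
    chunk and zero-fill it in the image rather than aborting."""
    bad_chunks = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            try:
                buf = src.read(chunk)
            except OSError:
                # unreadable region: remember it, pad with zeros, move on
                src.seek(src.tell() + chunk)
                dst.write(b"\0" * chunk)
                bad_chunks += 1
                continue
            if not buf:
                break
            dst.write(buf)
    return bad_chunks

# e.g. image_device("/dev/sdc1", "/mnt/spare/sdc1.img")  # hypothetical paths
```

A real recovery should still use ddrescue itself, which retries bad regions, records them in a mapfile, and can resume an interrupted copy.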
* Re: RAID5 failure and consequent ext4 problems
  2022-09-08 14:51 RAID5 failure and consequent ext4 problems Luigi Fabio
@ 2022-09-08 17:23 ` Phil Turmel
  2022-09-09 20:32   ` Luigi Fabio
  0 siblings, 1 reply; 24+ messages in thread

From: Phil Turmel @ 2022-09-08 17:23 UTC (permalink / raw)
To: Luigi Fabio, linux-raid

Hi Luigi,

On 9/8/22 10:51, Luigi Fabio wrote:

> Upon reboot, the arrays would not reassemble - this is expected,
> because 4/12 drives were marked faulty. So I re-created the array
> using the same parameters as were used back when the array was built.

Oh, no!

> Unfortunately, I had a moment of stupid and didn't specify metadata
> 0.90 in the re-create, so it was recreated with metadata 1.2... which
> writes its superblock at the beginning of the components, not at the
> end. I noticed it, stopped the array again and recreated with the
> correct 0.90, but the damage was done: the 256 byte + 12 * 20 header
> was written at the beginning of each of the 12 components.

No, the moment of stupid was that you re-created the array.
Simultaneous multi-drive failures that stop an array are easily fixed
with --assemble --force. Too late for that now.

It is absurdly easy to screw up device order when re-creating, and if
you didn't specify every allocation and layout detail, the changes in
defaults over the years would also screw up your data. And finally,
omitting --assume-clean would cause all of your parity to be
recalculated immediately, with catastrophic results if any order or
allocation attributes are wrong. ):

> Where do I go from here? I have had similar issues in the past, all
> the way back to the early 00s, and I had a near-100% success rate by
> re-creating the arrays. What is different this time?
> Or is nothing different, and the problem is just in the checksumming?

No, you just got lucky in the past. Probably by using mdadm versions
that hadn't been updated.
You'll need to show us every command you tried from your history, and
full details of all drives/partitions involved.

But I'll be brutally honest: your data is likely toast.

Phil
* Re: RAID5 failure and consequent ext4 problems
  2022-09-08 17:23 ` Phil Turmel
@ 2022-09-09 20:32   ` Luigi Fabio
  2022-09-09 21:01     ` Luigi Fabio
  0 siblings, 1 reply; 24+ messages in thread

From: Luigi Fabio @ 2022-09-09 20:32 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid

Thanks for reaching out, first of all. Apologies for the late reply -
the brilliant (...) spam filter strikes again...

On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@turmel.org> wrote:

> No, the moment of stupid was that you re-created the array.
> Simultaneous multi-drive failures that stop an array are easily fixed
> with --assemble --force. Too late for that now.

Noted for the future, thanks.

> It is absurdly easy to screw up device order when re-creating, and if
> you didn't specify every allocation and layout detail, the changes in
> defaults over the years would also screw up your data. And finally,
> omitting --assume-clean would cause all of your parity to be
> recalculated immediately, with catastrophic results if any order or
> allocation attributes are wrong.

Of course. Which is why I specified everything, and why I checked the
details with --examine and --detail: they match exactly, minus the
metadata version, because, well, I wasn't actually the one typing
(it's a slightly complicated story - I was reassembling by proxy on
the phone) and I made an incorrect assumption about the person typing.
There aren't, in the end, THAT many things to specify: RAID level,
number of drives, order thereof, chunk size, 'layout' and metadata
version. 0.90 doesn't allow before/after gaps, so that should be it, I
believe. Am I missing anything?

> No, you just got lucky in the past. Probably by using mdadm versions
> that hadn't been updated.

That's not quite it: I keep records of how arrays are built and match
them, though it is true that I tend to update things as little as
possible on production machines.
One of the differences, this time, is that this was NOT a production
machine.
The other was that I was driving, dictating on the phone, and was
under a lot of pressure to get the thing back up ASAP.
Nonetheless, I have an --examine of at least two drives from the
previous setup, so there should be enough information there to rebuild
a matching array, I think?

> You'll need to show us every command you tried from your history, and
> full details of all drives/partitions involved.
>
> But I'll be brutally honest: your data is likely toast.

Well, let's hope it isn't. All mdadm commands were -o and
--assume-clean, so in theory the only things which HAVE been written
are the md blocks, unless I am mistaken and/or I read the docs
incorrectly?

That does, of course, leave the problem of the blocks overwritten by
the 1.2 metadata, but as I read the docs that should be a very small
number - let's say one 4096-byte block (a portion thereof, to be
pedantic, but ext4 doesn't really care?) per drive, correct?

Background:
Separate 2x SSD RAID1 root (/dev/sda, /dev/sdb) on the MB (Supermicro
X10 series) chipset SATA ports.
All filesystems are ext4, data=journal, nodelalloc; the 'data' RAIDs
have journals on another SSD RAID1 (one per FS, obviously).
Data drives:
12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers,
each with two four-drive ports (and one of these went DELIGHTFULLY
missing).

This is the layout of each drive:
---
GPT fdisk (gdisk) version 1.0.6
...
Found valid GPT with protective MBR; using GPT.
Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
Model: ST4000NC001-1FS1
Sector size (logical/physical): 512/4096 bytes
...
Total free space is 99949 sectors (48.8 MiB)

Number  Start (sector)    End (sector)  Size      Code  Name
   1            2048      7625195519    3.5 TiB   8300  Linux RAID volume
   2      7625195520      7813939199    90.0 GiB  8300  Linux RAID backup
---

So there were two RAID arrays, both RAID5: a main array called
'archive', which had the 12 x 3.5ish-TiB sdx1 partitions, and a second
array called 'backup', which had 12 x 90 GB.
A little further backstory: right before the event, one drive had been
pulled because it had started failing. What I did was shut down the
machine, put the failing drive on a MB port, and put a new drive on
the LSI controllers. I then brought the machine back online, did the
--replace --with thing, and this worked fine.
At that point the faulty drive (/dev/sdc - MB drives come before the
LSI drives in the count) got deleted via /sys/block... and physically
disconnected from the system, which was then happily running with
/dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as
the 'archive' drives.
It went 96 hours or so like that under moderate load. Then the failure
happened and the machine was rebooted, thus the previous sdd -> sdo
drives became sdc -> sdn. However, the relative order was, to the best
of my knowledge, conserved - AND I still have the 'faulty' drive, so I
could very easily put it back in to have everything match.
Most importantly, this drive has on it, without a doubt, the details
of the array BEFORE everything happened - by definition untouched,
because the drive was stopped and pulled before the event.
I also have a cat of the --examine of two of the faulty drives BEFORE
anything was written to them - thus, unless I am mistaken, these
contain the md block details from 'before the event'.
Here is one of them, taken after the reboot and therefore when the MB
/dev/sdc was no longer there:
---
/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 2457b506:85728e9d:c44c77eb:7ee19756
  Creation Time : Sat Mar 30 18:18:00 2019
     Raid Level : raid5
  Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
     Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
   Raid Devices : 12
  Total Devices : 12
Preferred Minor : 123

    Update Time : Tue Sep  6 11:37:53 2022
          State : clean
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 391e325d - correct
         Events : 52177

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     5       8       49        5      active sync   /dev/sdd1

   0     0       8      225        0      active sync
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       97        2      active sync   /dev/sdg1
   3     3       8      161        3      active sync   /dev/sdk1
   4     4       8      113        4      active sync   /dev/sdh1
   5     5       8       49        5      active sync   /dev/sdd1
   6     6       8      177        6      active sync   /dev/sdl1
   7     7       8      145        7      active sync   /dev/sdj1
   8     8       8      129        8      active sync   /dev/sdi1
   9     9       8       65        9      active sync   /dev/sde1
  10    10       8      209       10      active sync   /dev/sdn1
  11    11       8      193       11      active sync   /dev/sdm1
---
Note that the drives have 'moved' because the old /dev/sdc isn't there
any more, but the relative positions should be the same - correct me
if I am wrong. If you prefer: to get the 'new' drive letter, subtract
16 from the minor of each of the drives.
This is the 'new' --create:
---
/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive)
  Creation Time : Tue Sep  6 15:15:03 2022
     Raid Level : raid5
  Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
     Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
   Raid Devices : 12
  Total Devices : 12
Preferred Minor : 123

    Update Time : Tue Sep  6 15:15:03 2022
          State : clean
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ed12b96a - correct
         Events : 1

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     5       8       33        5      active sync   /dev/sdc1

   0     0       8      209        0      active sync   /dev/sdn1
   1     1       8       65        1      active sync   /dev/sde1
   2     2       8       81        2      active sync   /dev/sdf1
   3     3       8      145        3      active sync   /dev/sdj1
   4     4       8       97        4      active sync   /dev/sdg1
   5     5       8       33        5      active sync   /dev/sdc1
   6     6       8      161        6      active sync   /dev/sdk1
   7     7       8      129        7      active sync   /dev/sdi1
   8     8       8      113        8      active sync   /dev/sdh1
   9     9       8       49        9      active sync   /dev/sdd1
  10    10       8      193       10      active sync   /dev/sdm1
  11    11       8      177       11      active sync   /dev/sdl1
---
If you put the layout lines side by side, it would seem to me that
they match, modulo the '16' difference.
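As a mechanical cross-check of that "modulo 16" claim, the two RaidDevice tables can be compared in a few lines. This sketch transcribes the slot -> minor pairs from the two --examine outputs above; sd minor numbers step by 16 per whole disk, so `minor // 16` gives the letter index (the `1` suffix assumes partition 1, as here):

```python
# Sketch: verify that the old and new --examine tables describe the
# same physical order, just shifted down by one drive letter.
import string

def sd_name(minor):
    """Map an sd partition-1 minor to its device name (sd minors run
    a=0, b=16, c=32, ...; +1 for partition 1 doesn't change minor//16)."""
    return "sd" + string.ascii_lowercase[minor // 16] + "1"

# slot -> minor, transcribed from the two tables above
old = {0: 225, 1: 81, 2: 97, 3: 161, 4: 113, 5: 49,
       6: 177, 7: 145, 8: 129, 9: 65, 10: 209, 11: 193}
new = {0: 209, 1: 65, 2: 81, 3: 145, 4: 97, 5: 33,
       6: 161, 7: 129, 8: 113, 9: 49, 10: 193, 11: 177}

# every raid slot should have shifted by exactly one letter (16 minors)
assert all(old[slot] - new[slot] == 16 for slot in old)
print({slot: (sd_name(old[slot]), sd_name(new[slot])) for slot in sorted(old)})
```

The assertion holds for all twelve slots, i.e. the two tables do agree once the one-letter shift is accounted for.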
This is the list of --create and --assemble commands from the 6th
which involve the sdx1 partitions - the ones we care about right now;
there were others involving /dev/md124 and the /dev/sdx2 partitions,
which however are not relevant here:
--
 9813  mdadm --assemble /dev/md123 missing
 9814  mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1 /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1 /dev/sdm1
 9815  mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1 /dev/sdm1
 9823  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1 /dev/sdm1 /dev/sdl1
 9824  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
       ^^^^ note that these two targeted the WRONG ARRAY - an
       unfortunate miscommunication which caused potential damage.
 9852  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
 9863  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
 9879  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
 9889  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
 9892  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
 9895  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
 9901  mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
 9903  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
---
Note that they were all -o, therefore, if I am not mistaken, no parity
data was written anywhere.
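One detail worth flagging in the history above: lines 9889, 9892 and 9901 appear to list /dev/sdl1 twice (and omit /dev/sdj1), which 9895 and 9903 then correct. A tiny sanity-check sketch that catches exactly that class of mistake on a --create line - a hypothetical helper, not an mdadm feature:

```python
# Sketch: sanity-check an mdadm --create command line for the two
# mistakes easiest to make under pressure: a component count that
# doesn't match -n, and the same component listed twice.
import shlex

def check_create(cmdline):
    toks = shlex.split(cmdline)
    n = int(toks[toks.index("-n") + 1])
    devs = [t for t in toks if t.startswith("/dev/sd") or t == "missing"]
    problems = []
    if len(devs) != n:
        problems.append(f"{len(devs)} components listed, -n says {n}")
    dups = {d for d in devs if d != "missing" and devs.count(d) > 1}
    if dups:
        problems.append("duplicated: " + ", ".join(sorted(dups)))
    return problems

bad = ("mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 "
      "--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 "
      "/dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 "
      "/dev/sdd1 /dev/sdm1 /dev/sdl1")
print(check_create(bad))   # flags the doubled /dev/sdl1
```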
Note further the fact that the first two were the 'mistake' ones,
which did NOT have --assume-clean (but with -o this shouldn't make a
difference, AFAIK), and, most importantly, the metadata was the 1.2
default AND they were the wrong array in the first place.
Note also that the 'final' --create commands had --bitmap=none to
match the original array, though according to the docs the bitmap
space in 0.90 (and 1.2?) sits in a region which does not affect the
data in the first place.

Now, first of all, a question: if I put the 'old' sdc, the one that
was taken out prior to this whole mess, into a different system in
order to examine it, modern mdraid auto-discovery should NOT overwrite
the md data, correct? Thus I should be able to double-check the drive
order on that as well?

Any other pointers, insults etc. are of course welcome.
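For reference on the 'layout' parameter being pinned down above: both --examine outputs show left-symmetric, which is mdraid's default (reported as "algorithm 2" in kernel logs) and determines which component holds parity in each stripe. A minimal sketch of that standard placement rule - my own illustration, not from the thread - makes it clear why component order is fatal to get wrong:

```python
# Sketch of RAID5 "left-symmetric" placement: parity rotates from the
# last component downward, and data chunks continue immediately after
# the parity disk, wrapping around.
def left_symmetric(stripe, n_disks):
    """Return (parity_disk, data_disks_in_chunk_order) for one stripe."""
    parity = (n_disks - 1) - (stripe % n_disks)
    data = [(parity + 1 + i) % n_disks for i in range(n_disks - 1)]
    return parity, data

# parity disk per stripe for a 12-disk array: 11, 10, 9, ..., 0, then wraps
print([left_symmetric(s, 12)[0] for s in range(12)])
```

Swap any two components in the order and every stripe's data chunks land on the wrong disks, which is consistent with the kind of wholesale fsck damage described earlier.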
* Re: RAID5 failure and consequent ext4 problems
  2022-09-09 20:32 ` Luigi Fabio
@ 2022-09-09 21:01   ` Luigi Fabio
  2022-09-09 21:48     ` Phil Turmel
  0 siblings, 1 reply; 24+ messages in thread

From: Luigi Fabio @ 2022-09-09 21:01 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid

Another helpful datapoint: this is the boot *before* sdc got
--replaced with sdo:

[   13.528395] md/raid:md123: device sdd1 operational as raid disk 5
[   13.528396] md/raid:md123: device sde1 operational as raid disk 9
[   13.528397] md/raid:md123: device sdg1 operational as raid disk 2
[   13.528398] md/raid:md123: device sdf1 operational as raid disk 1
[   13.528398] md/raid:md123: device sdh1 operational as raid disk 4
[   13.528399] md/raid:md123: device sdk1 operational as raid disk 3
[   13.528400] md/raid:md123: device sdj1 operational as raid disk 7
[   13.528401] md/raid:md123: device sdn1 operational as raid disk 10
[   13.528402] md/raid:md123: device sdi1 operational as raid disk 8
[   13.528402] md/raid:md123: device sdl1 operational as raid disk 6
[   13.528403] md/raid:md123: device sdm1 operational as raid disk 11
[   13.528403] md/raid:md123: device sdc1 operational as raid disk 0
[   13.531613] md/raid:md123: raid level 5 active with 12 out of 12 devices, algorithm 2
[   13.531644] md123: detected capacity change from 0 to 42945088192512

This gives us, correct me if I am wrong of course, an exact
representation of what the array 'used to look like', with sdc1 then
replaced by sdo1 (8/225).

Just some confirmation that the order should (?) be the one above.

LF

On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@gmail.com> wrote:
> [full quote of the previous message trimmed]
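As a mechanical cross-check, the dmesg listing in that message can be turned into an explicit slot order in a few lines; a short parse sketch, using the lines as quoted:

```python
import re

# Sketch: recover the raid-slot -> device order from the dmesg lines.
dmesg = """\
md/raid:md123: device sdd1 operational as raid disk 5
md/raid:md123: device sde1 operational as raid disk 9
md/raid:md123: device sdg1 operational as raid disk 2
md/raid:md123: device sdf1 operational as raid disk 1
md/raid:md123: device sdh1 operational as raid disk 4
md/raid:md123: device sdk1 operational as raid disk 3
md/raid:md123: device sdj1 operational as raid disk 7
md/raid:md123: device sdn1 operational as raid disk 10
md/raid:md123: device sdi1 operational as raid disk 8
md/raid:md123: device sdl1 operational as raid disk 6
md/raid:md123: device sdm1 operational as raid disk 11
md/raid:md123: device sdc1 operational as raid disk 0
"""

slots = {}
for m in re.finditer(r"device (\S+) operational as raid disk (\d+)", dmesg):
    slots[int(m.group(2))] = m.group(1)

order = [slots[i] for i in range(len(slots))]
print(" ".join(order))
# -> sdc1 sdf1 sdg1 sdk1 sdh1 sdd1 sdl1 sdj1 sdi1 sde1 sdn1 sdm1
```

That recovered order agrees, slot for slot, with the old --examine table earlier in the thread (with pre-replacement sdc1 in slot 0, later sdo1).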
* Re: RAID5 failure and consequent ext4 problems 2022-09-09 21:01 ` Luigi Fabio @ 2022-09-09 21:48 ` Phil Turmel 2022-09-09 22:11 ` David T-G 2022-09-09 22:50 ` Luigi Fabio 0 siblings, 2 replies; 24+ messages in thread From: Phil Turmel @ 2022-09-09 21:48 UTC (permalink / raw) To: Luigi Fabio; +Cc: linux-raid Reasonably likely, but not certain. Devices can be re-ordered by different kernels. That's why lsdrv prints serial numbers in its tree. You haven't mentioned whether your --create operations specified --assume-clean. Also, be aware that shell expansion of something like /dev/sd[dcbaefgh] is sorted to /dev/sd[abcdefgh]. Use curly brace expansion with commas if you are taking shortcuts. On 9/9/22 17:01, Luigi Fabio wrote: > Another helpful datapoint, this is the boot *before* sdc got > --replaced with sdo: > > [ 13.528395] md/raid:md123: device sdd1 operational as raid disk 5 > [ 13.528396] md/raid:md123: device sde1 operational as raid disk 9 > [ 13.528397] md/raid:md123: device sdg1 operational as raid disk 2 > [ 13.528398] md/raid:md123: device sdf1 operational as raid disk 1 > [ 13.528398] md/raid:md123: device sdh1 operational as raid disk 4 > [ 13.528399] md/raid:md123: device sdk1 operational as raid disk 3 > [ 13.528400] md/raid:md123: device sdj1 operational as raid disk 7 > [ 13.528401] md/raid:md123: device sdn1 operational as raid disk 10 > [ 13.528402] md/raid:md123: device sdi1 operational as raid disk 8 > [ 13.528402] md/raid:md123: device sdl1 operational as raid disk 6 > [ 13.528403] md/raid:md123: device sdm1 operational as raid disk 11 > [ 13.528403] md/raid:md123: device sdc1 operational as raid disk 0 > [ 13.531613] md/raid:md123: raid level 5 active with 12 out of 12 > devices, algorithm 2 > [ 13.531644] md123: detected capacity change from 0 to 42945088192512 > > This gives us, correct me if I am wrong of course, an exact > representation of what the array 'used to look like', with sdc1 then > replaced by sdo1 (8/225). 
> > Just some confirmation that the order should (?) be the one above. > > LF > > On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@gmail.com> wrote: >> >> Thanks for reaching out, first of all. Apologies for the late reply, >> the brilliant (...) spam filter strikes again... >> >> On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@turmel.org> wrote: >>> No, the moment of stupid was that you re-created the array. >>> Simultaneous multi-drive failures that stop an array are easily fixed >>> with --assemble --force. Too late for that now. >> Noted for the future, thanks. >> >>> It is absurdly easy to screw up device order when re-creating, and if >>> you didn't specify every allocation and layout detail, the changes in >>> defaults over the years would also screw up your data. And finally, >>> omitting --assume-clean would cause all of your parity to be >>> recalculated immediately, with catastrophic results if any order or >>> allocation attributes are wrong. >> Of course. Which is why I specified everything and why I checked the >> details with --examine and --detail and they match exactly, minus the >> metadata version because, well, I wasn't actually the one typing (it's >> a slightly complicated story.. I was reassembling by proxy on the >> phone) and I made an incorrect assumption about the person typing. >> There aren't, in the end, THAT many things to specify: RAID level, >> number of drives, order thereof, chunk size, 'layout' and metadata >> version. 0.90 doesn't allow before/after gaps so that should be it, I >> believe. >> Am I missing anything? >> >>> No, you just got lucky in the past. Probably by using mdadm versions >>> that hadn't been updated. >> That's not quite it: I keep records of how arrays are built and match >> them, though it is true that I tend to update things as little as >> possible on production machines. >> One of the differences, this time, is that this was NOT a production >> machine. 
The other was that I was driving, dictating on the phone and >> was under a lot of pressure to get the thing back up ASAP. >> Nonetheless, I have an --examine of at least two drives from the >> previous setup so there should be enough information there to rebuild >> a matching array, I think? >> >>> You'll need to show us every command you tried from your history, and >>> full details of all drives/partitions involved. >>> >>> But I'll be brutally honest: your data is likely toast. >> Well, let's hope it isn't. All mdadm commands were -o and >> --assume-clean, so in theory the only thing which HAS been written are >> the md blocks, unless I am mistaken and/or I read the docs >> incorrectly? >> >> That does, of course, leave the problem of the blocks overwritten by >> the 1.2 metadata, but as I read the docs that should be a very small >> number - let's say one 4096byte block (a portion thereof, to be >> pedantic, but ext4 doesn't really care?) per drive, correct? >> >> Background: >> Separate 2x SSD RAID 1 root (/dev/sda, /dev/sdb) on the MB (Supermicro >> X10 series)'s chipset SATA ports. >> All filesystems are ext4, data=journal, nodelalloc, the 'data' RAIDs >> have journals on another SSD RAID1 (one per FS, obviously). >> Data drives: >> 12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers, >> each with two four-drive ports (and one of these went DELIGHTFULLY >> missing) >> >> This is the layout of each drive: >> --- >> GPT fdisk (gdisk) version 1.0.6 >> ... >> Found valid GPT with protective MBR; using GPT. >> Disk /dev/sdc: 7814037168 sectors, 3.6 TiB >> Model: ST4000NC001-1FS1 >> Sector size (logical/physical): 512/4096 bytes >> ... >> Total free space is 99949 sectors (48.8 MiB) >> >> Number Start (sector) End (sector) Size Code Name >> 1 2048 7625195519 3.5 TiB 8300 Linux RAID volume >> 2 7625195520 7813939199 90.0 GiB 8300 Linux RAID backup >> --- >> >> So there were two RAID arrays. 
Both RAID5 - a main RAID called >> 'archive' which had the 12 x 3.5ish partitions sdx1 and a second array >> called backup which had 12 x 90 GB. >> >> A little further backstory: right before the event, one drive had been >> pulled because it had started failing. What I did was shut down the >> machine, put the failing drive on a MB port and put a new drive on the >> LSI controllers. I then brought the machine back online, did the >> --replace --with thing and this worked fine. >> At that point the faulty drive (/dev/sdc, MB drives come before the >> LSI drives in the count) got deleted via /sys/block.... and physically >> disconnected from the system, which was then happily running with >> /dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as >> the 'archive' drives. >> It went 96 hours or so like that under moderate load. Then the failure >> happened, the machine was rebooted thus the previous sdd -> sdo drives >> became sdc -> sdn drives. >> However, the relative order was, to the best of my knowledge, >> conserved - AND I still have the 'faulty' drive, so I could very >> easily put it back in to have everything match. >> Most importantly, this drive has on it, without a doubt, the details >> of the array BEFORE everything happened - by definition untouched >> because the drive was stopped and pulled before the event. >> I also have a cat of the --examine of two of the faulty drives BEFORE >> anything was written to them - thus, unless I am mistaken, these >> contained the md block details from 'before the event'. 
>> >> Here is one of them, taken after the reboot and therefore when the MB >> /dev/sdc was no longer there: >> --- >> /dev/sdc1: >> Magic : a92b4efc >> Version : 0.90.00 >> UUID : 2457b506:85728e9d:c44c77eb:7ee19756 >> Creation Time : Sat Mar 30 18:18:00 2019 >> Raid Level : raid5 >> Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB) >> Array Size : 41938562688 (39995.73 GiB 42945.09 GB) >> Raid Devices : 12 >> Total Devices : 12 >> Preferred Minor : 123 >> >> Update Time : Tue Sep 6 11:37:53 2022 >> State : clean >> Active Devices : 12 >> Working Devices : 12 >> Failed Devices : 0 >> Spare Devices : 0 >> Checksum : 391e325d - correct >> Events : 52177 >> >> Layout : left-symmetric >> Chunk Size : 128K >> >> Number Major Minor RaidDevice State >> this 5 8 49 5 active sync /dev/sdd1 >> >> 0 0 8 225 0 active sync >> 1 1 8 81 1 active sync /dev/sdf1 >> 2 2 8 97 2 active sync /dev/sdg1 >> 3 3 8 161 3 active sync /dev/sdk1 >> 4 4 8 113 4 active sync /dev/sdh1 >> 5 5 8 49 5 active sync /dev/sdd1 >> 6 6 8 177 6 active sync /dev/sdl1 >> 7 7 8 145 7 active sync /dev/sdj1 >> 8 8 8 129 8 active sync /dev/sdi1 >> 9 9 8 65 9 active sync /dev/sde1 >> 10 10 8 209 10 active sync /dev/sdn1 >> 11 11 8 193 11 active sync /dev/sdm1 >> --- >> Note that the drives are 'moved' because the old /dev/sdc isn't there >> any more but the relative position should be the same, correct me if I >> am wrong. If you prefer, what you need to do to get the 'new' drive >> letter is to take 16 out of the minor of each of the drives. 
>> >> This is the 'new' --create >> --- >> /dev/sdc1: >> Magic : a92b4efc >> Version : 0.90.00 >> UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive) >> Creation Time : Tue Sep 6 15:15:03 2022 >> Raid Level : raid5 >> Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB) >> Array Size : 41938562688 (39995.73 GiB 42945.09 GB) >> Raid Devices : 12 >> Total Devices : 12 >> Preferred Minor : 123 >> >> Update Time : Tue Sep 6 15:15:03 2022 >> State : clean >> Active Devices : 12 >> Working Devices : 12 >> Failed Devices : 0 >> Spare Devices : 0 >> Checksum : ed12b96a - correct >> Events : 1 >> >> Layout : left-symmetric >> Chunk Size : 128K >> >> Number Major Minor RaidDevice State >> this 5 8 33 5 active sync /dev/sdc1 >> >> 0 0 8 209 0 active sync /dev/sdn1 >> 1 1 8 65 1 active sync /dev/sde1 >> 2 2 8 81 2 active sync /dev/sdf1 >> 3 3 8 145 3 active sync /dev/sdj1 >> 4 4 8 97 4 active sync /dev/sdg1 >> 5 5 8 33 5 active sync /dev/sdc1 >> 6 6 8 161 6 active sync /dev/sdk1 >> 7 7 8 129 7 active sync /dev/sdi1 >> 8 8 8 113 8 active sync /dev/sdh1 >> 9 9 8 49 9 active sync /dev/sdd1 >> 10 10 8 193 10 active sync /dev/sdm1 >> 11 11 8 177 11 active sync /dev/sdl1 >> --- >> >> If you put the layout lines side by side, it would seem to me that >> they match, modulo the '16' difference. 
>> >> This is the list of --create and --assemble commands from the 6th >> which involve the sdx1 partitions, those we care about right now - >> there were others involving /dev/md124 and the /dev/sdx2 which however >> are not relevant - the data there : >> -- >> 9813 mdadm --assemble /dev/md123 missing >> 9814 mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1 >> /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 >> /dev/sdn1 /dev/sdm1 >> 9815 mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1 >> /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1 >> /dev/sdm1 >> 9823 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1 >> /dev/sdm1 /dev/sdl1 >> 9824 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 >> /dev/sdd1 /dev/sdm1 /dev/sdl1 >> ^^^^ note that these were the WRONG ARRAY - this was an unfortunate >> miscommunication which caused potential damage. 
>> 9852 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1 >> /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1 >> 9863 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1 >> /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 >> /dev/sdl1 >> 9879 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 >> /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 >> /dev/sdk1 /dev/sdl1 >> 9889 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 >> /dev/sdm1 /dev/sdl1 >> 9892 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 >> /dev/sdm1 /dev/sdl1 >> 9895 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 >> /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 >> /dev/sdm1 /dev/sdl1 >> 9901 mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 >> /dev/sdm1 /dev/sdl1 >> 9903 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 >> /dev/sdj1 /dev/sdg1 / dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 >> /dev/sdm1 /dev/sdl1 >> --- >> >> Note that they all were -o, therefore if I am not mistaken no parity >> data was written anywhere. 
Note further the fact that the first two >> were the 'mistake' ones, which did NOT have --assume-clean (but with >> -o this shouldn't make a difference AFAIK) and most importantly the >> metadata was the 1.2 default AND they were the wrong array in the >> first place. >> Note also that the 'final' --create commands also had --bitmap=none to >> match the original array, though according to the docs the bitmap >> space in 0.90 (and 1.2?) is in a space which does not affect the data >> in the first place. >> >> Now, first of all a question: if I get the 'old' sdc, the one that was >> taken out prior to this whole mess, onto a different system in order >> to examine it, the modern mdraid auto discovery shoud NOT overwrite >> the md data, correct? Thus I should be able to double-check the drive >> order on that as well? >> >> Any other pointers, insults etc are of course welcome. ^ permalink raw reply [flat|nested] 24+ messages in thread
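[Editor's note: Phil's warning about shell expansion is easy to demonstrate in a scratch directory. The file names below are stand-ins for the /dev nodes; this assumes bash, since brace expansion is a bashism.]

```shell
#!/usr/bin/env bash
# A glob character class is matched against the directory listing and the
# results come back sorted, so the order typed inside [...] is NOT kept.
# Brace expansion is pure text rewriting and preserves the typed order.
tmp=$(mktemp -d)
cd "$tmp"
touch sda sdb sdc sdd

echo sd[dcba]      # -> sda sdb sdc sdd  (sorted; your order is lost)
echo sd{d,c,b,a}   # -> sdd sdc sdb sda  (your order is preserved)

cd /
rm -r "$tmp"
```

With mdadm --create, where argument order defines RaidDevice slots, only the brace form is safe as a shortcut.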
* Re: RAID5 failure and consequent ext4 problems 2022-09-09 21:48 ` Phil Turmel @ 2022-09-09 22:11 ` David T-G 2022-09-09 22:50 ` Luigi Fabio 1 sibling, 0 replies; 24+ messages in thread From: David T-G @ 2022-09-09 22:11 UTC (permalink / raw) To: linux-raid Phil & Luigi, et al -- ...and then Phil Turmel said... % ... % % You haven't mentioned whether your --create operations specified % --assume-clean. He hasn't? % % On 9/9/22 17:01, Luigi Fabio wrote: ... % > % > On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@gmail.com> wrote: % > > ... % > > > But I'll be brutally honest: your data is likely toast. % > > Well, let's hope it isn't. All mdadm commands were -o and % > > --assume-clean, so in theory the only thing which HAS been written are % > > the md blocks, unless I am mistaken and/or I read the docs % > > incorrectly? ... % > > This is the list of --create and --assemble commands from the 6th ... % > > 9813 mdadm --assemble /dev/md123 missing % > > 9814 mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1 ... % > > 9815 mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1 ... % > > /dev/sdm1 % > > 9823 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 ... % > > 9824 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 ... % > > 9852 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 ... % > > 9863 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 ... % > > 9879 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 ... % > > 9889 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 ... % > > 9892 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 ... % > > 9895 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 ... % > > 9901 mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 ... % > > 9903 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 ... % > > % > > Note that they all were -o, therefore if I am not mistaken no parity % > > data was written anywhere. 
Note further the fact that the first two % > > were the 'mistake' ones, which did NOT have --assume-clean (but with % > > -o this shouldn't make a difference AFAIK) and most importantly the % > > metadata was the 1.2 default AND they were the wrong array in the % > > first place. [snip] I certainly don't know what I'm talking about, so this is all I'll say, but it looked reasonably complete to me ... HTH & HANW :-D -- David T-G See http://justpickone.org/davidtg/email/ See http://justpickone.org/davidtg/tofu.txt ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: RAID5 failure and consequent ext4 problems 2022-09-09 21:48 ` Phil Turmel 2022-09-09 22:11 ` David T-G @ 2022-09-09 22:50 ` Luigi Fabio 2022-09-09 23:04 ` Luigi Fabio 1 sibling, 1 reply; 24+ messages in thread From: Luigi Fabio @ 2022-09-09 22:50 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid By different kernels, maybe - but the kernel has been the same for quite a while (months). I did paste the whole of the command lines in the (very long) email, as David mentions (thanks!) - the first ones, the mistaken ones, did NOT have --assume-clean but they did have -o, so no parity activity should have started according to the docs? A new thought came to mind: one of the HBAs lost a channel, right? What if on the subsequent reboot the devices that were on that channel got 'rediscovered' and shunted to the end of the letter order? That would, I believe, be ordinary operating procedure. That would give us an almost-correct array, which would explain how fsck can get ... some pieces. Also, I am not quite brave enough (...) to use shortcuts when handling mdadm commands. I am reconstructing the port order (scsi targets, if you prefer) from the 20220904 boot log. I should at that point be able to have an exact order of the drives. 
Here it is: --- [ 1.853329] sd 2:0:0:0: [sda] Write Protect is off [ 1.853331] sd 7:0:0:0: [sdc] Write Protect is off [ 1.853382] sd 3:0:0:0: [sdb] Write Protect is off [ 12.531607] sd 10:0:3:0: [sdg] Write Protect is off [ 12.533303] sd 10:0:2:0: [sdf] Write Protect is off [ 12.534606] sd 10:0:0:0: [sdd] Write Protect is off [ 12.570768] sd 10:0:1:0: [sde] Write Protect is off [ 12.959925] sd 11:0:0:0: [sdh] Write Protect is off [ 12.965230] sd 11:0:1:0: [sdi] Write Protect is off [ 12.966145] sd 11:0:4:0: [sdl] Write Protect is off [ 12.966800] sd 11:0:3:0: [sdk] Write Protect is off [ 12.997253] sd 11:0:2:0: [sdj] Write Protect is off [ 13.002395] sd 11:0:7:0: [sdo] Write Protect is off [ 13.012693] sd 11:0:5:0: [sdm] Write Protect is off [ 13.017630] sd 11:0:6:0: [sdn] Write Protect is off --- If we combine this with the previous: --- [ 13.528395] md/raid:md123: device sdd1 operational as raid disk 5 [ 13.528396] md/raid:md123: device sde1 operational as raid disk 9 [ 13.528397] md/raid:md123: device sdg1 operational as raid disk 2 [ 13.528398] md/raid:md123: device sdf1 operational as raid disk 1 [ 13.528398] md/raid:md123: device sdh1 operational as raid disk 4 [ 13.528399] md/raid:md123: device sdk1 operational as raid disk 3 [ 13.528400] md/raid:md123: device sdj1 operational as raid disk 7 [ 13.528401] md/raid:md123: device sdn1 operational as raid disk 10 [ 13.528402] md/raid:md123: device sdi1 operational as raid disk 8 [ 13.528402] md/raid:md123: device sdl1 operational as raid disk 6 [ 13.528403] md/raid:md123: device sdm1 operational as raid disk 11 [ 13.528403] md/raid:md123: device sdc1 operational as raid disk 0 [ 13.531613] md/raid:md123: raid level 5 active with 12 out of 12 devices, algorithm 2 [ 13.531644] md123: detected capacity change from 0 to 42945088192512 --- We have a SCSI target -> raid disk number correspondence. 
As of this boot, the letter -> scsi target correspondences match, shifted by one because as discussed 7:0:0:0 is no longer there (the old, 'faulty' sdc). Thus, having unambiguously determined the prior scsi target -> raid position we can transpose it to the present drive letters, which are shifted by one. Therefore, we can generate, or rather have generated, a --create with the same software versions, the same settings and the same drive order. Is there any reason why, minus the 1.2 metadata overwriting which should have only affected 12 blocks, the fs should 'not' be as before? Genuine question, mind. On Fri, Sep 9, 2022 at 5:48 PM Phil Turmel <philip@turmel.org> wrote: > > Reasonably likely, but not certain. > > Devices can be re-ordered by different kernels. That's why lsdrv prints > serial numbers in its tree. > > You haven't mentioned whether your --create operations specified > --assume-clean. > > Also, be aware that shell expansion of something like /dev/sd[dcbaefgh] > is sorted to /dev/sd[abcdefgh]. Use curly brace expansion with commas > if you are taking shortcuts. 
> > On 9/9/22 17:01, Luigi Fabio wrote: > > Another helpful datapoint, this is the boot *before* sdc got > > --replaced with sdo: > > > > [ 13.528395] md/raid:md123: device sdd1 operational as raid disk 5 > > [ 13.528396] md/raid:md123: device sde1 operational as raid disk 9 > > [ 13.528397] md/raid:md123: device sdg1 operational as raid disk 2 > > [ 13.528398] md/raid:md123: device sdf1 operational as raid disk 1 > > [ 13.528398] md/raid:md123: device sdh1 operational as raid disk 4 > > [ 13.528399] md/raid:md123: device sdk1 operational as raid disk 3 > > [ 13.528400] md/raid:md123: device sdj1 operational as raid disk 7 > > [ 13.528401] md/raid:md123: device sdn1 operational as raid disk 10 > > [ 13.528402] md/raid:md123: device sdi1 operational as raid disk 8 > > [ 13.528402] md/raid:md123: device sdl1 operational as raid disk 6 > > [ 13.528403] md/raid:md123: device sdm1 operational as raid disk 11 > > [ 13.528403] md/raid:md123: device sdc1 operational as raid disk 0 > > [ 13.531613] md/raid:md123: raid level 5 active with 12 out of 12 > > devices, algorithm 2 > > [ 13.531644] md123: detected capacity change from 0 to 42945088192512 > > > > This gives us, correct me if I am wrong of course, an exact > > representation of what the array 'used to look like', with sdc1 then > > replaced by sdo1 (8/225). > > > > Just some confirmation that the order should (?) be the one above. > > > > LF > > > > On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@gmail.com> wrote: > >> > >> Thanks for reaching out, first of all. Apologies for the late reply, > >> the brilliant (...) spam filter strikes again... > >> > >> On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@turmel.org> wrote: > >>> No, the moment of stupid was that you re-created the array. > >>> Simultaneous multi-drive failures that stop an array are easily fixed > >>> with --assemble --force. Too late for that now. > >> Noted for the future, thanks. 
> >> > >>> It is absurdly easy to screw up device order when re-creating, and if > >>> you didn't specify every allocation and layout detail, the changes in > >>> defaults over the years would also screw up your data. And finally, > >>> omitting --assume-clean would cause all of your parity to be > >>> recalculated immediately, with catastrophic results if any order or > >>> allocation attributes are wrong. > >> Of course. Which is why I specified everything and why I checked the > >> details with --examine and --detail and they match exactly, minus the > >> metadata version because, well, I wasn't actually the one typing (it's > >> a slightly complicated story.. I was reassembling by proxy on the > >> phone) and I made an incorrect assumption about the person typing. > >> There aren't, in the end, THAT many things to specify: RAID level, > >> number of drives, order thereof, chunk size, 'layout' and metadata > >> version. 0.90 doesn't allow before/after gaps so that should be it, I > >> believe. > >> Am I missing anything? > >> > >>> No, you just got lucky in the past. Probably by using mdadm versions > >>> that hadn't been updated. > >> That's not quite it: I keep records of how arrays are built and match > >> them, though it is true that I tend to update things as little as > >> possible on production machines. > >> One of the differences, this time, is that this was NOT a production > >> machine. The other was that I was driving, dictating on the phone and > >> was under a lot of pressure to get the thing back up ASAP. > >> Nonetheless, I have an --examine of at least two drives from the > >> previous setup so there should be enough information there to rebuild > >> a matching array, I think? > >> > >>> You'll need to show us every command you tried from your history, and > >>> full details of all drives/partitions involved. > >>> > >>> But I'll be brutally honest: your data is likely toast. > >> Well, let's hope it isn't. 
All mdadm commands were -o and > >> --assume-clean, so in theory the only thing which HAS been written are > >> the md blocks, unless I am mistaken and/or I read the docs > >> incorrectly? > >> > >> That does, of course, leave the problem of the blocks overwritten by > >> the 1.2 metadata, but as I read the docs that should be a very small > >> number - let's say one 4096byte block (a portion thereof, to be > >> pedantic, but ext4 doesn't really care?) per drive, correct? > >> > >> Background: > >> Separate 2x SSD RAID 1 root (/dev/sda. /dev/sdb) on the MB (Supemicro > >> X10 series)'s chipset SATA ports. > >> All filesystems are ext4, data=journal, nodelalloc, the 'data' RAIDs > >> have journals on another SSD RAID1 (one per FS, obviously). > >> Data drives: > >> 12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers, > >> each with two four-drive ports (and one of these went DELIGHTFULLY > >> missing) > >> > >> This is the layout of each drive: > >> --- > >> GPT fdisk (gdisk) version 1.0.6 > >> ... > >> Found valid GPT with protective MBR; using GPT. > >> Disk /dev/sdc: 7814037168 sectors, 3.6 TiB > >> Model: ST4000NC001-1FS1 > >> Sector size (logical/physical): 512/4096 bytes > >> ... > >> Total free space is 99949 sectors (48.8 MiB) > >> > >> Number Start (sector) End (sector) Size Code Name > >> 1 2048 7625195519 3.5 TiB 8300 Linux RAID volume > >> 2 7625195520 7813939199 90.0 GiB 8300 Linux RAID backup > >> --- > >> > >> So there were two RAID arrays. Both RAID5 - a main RAID called > >> 'archive' which had the 12 x 3.5ish partitions sdx1 and a second array > >> called backup which had 12 x 90 GB. > >> > >> A little further backstory: right before the event, one drive had been > >> pulled because it had started failing. What I did was shut down the > >> machine, put the failing drive on a MB port and put a new drive on the > >> LSI controllers. I then brought the machine back online, did the > >> --replace --with thing and this worked fine. 
> >> At that point the faulty drive (/dev/sdc, MB drives come before the > >> LSI drives in the count) got deleted via /sys/block.... and physically > >> disconnected from the system, which was then happily running with > >> /dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as > >> the 'archive' drives. > >> It went 96 hours or so like that under moderate load. Then the failure > >> happened, the machine was rebooted thus the previous sdd -> sdo drives > >> became sdc -> sdn drives. > >> However, the relative order was, to the best of my knowledge, > >> conserved - AND I still have the 'faulty' drive, so I could very > >> easily put it back in to have everything match. > >> Most importantly, this drive has on it, without a doubt, the details > >> of the array BEFORE everything happened - by definition untouched > >> because the drive was stopped and pulled before the event. > >> I also have a cat of the --examine of two of the faulty drives BEFORE > >> anything was written to them - thus, unless I am mistaken, these > >> contained the md block details from 'before the event'. 
> >> > >> Here is one of them, taken after the reboot and therefore when the MB > >> /dev/sdc was no longer there: > >> --- > >> /dev/sdc1: > >> Magic : a92b4efc > >> Version : 0.90.00 > >> UUID : 2457b506:85728e9d:c44c77eb:7ee19756 > >> Creation Time : Sat Mar 30 18:18:00 2019 > >> Raid Level : raid5 > >> Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB) > >> Array Size : 41938562688 (39995.73 GiB 42945.09 GB) > >> Raid Devices : 12 > >> Total Devices : 12 > >> Preferred Minor : 123 > >> > >> Update Time : Tue Sep 6 11:37:53 2022 > >> State : clean > >> Active Devices : 12 > >> Working Devices : 12 > >> Failed Devices : 0 > >> Spare Devices : 0 > >> Checksum : 391e325d - correct > >> Events : 52177 > >> > >> Layout : left-symmetric > >> Chunk Size : 128K > >> > >> Number Major Minor RaidDevice State > >> this 5 8 49 5 active sync /dev/sdd1 > >> > >> 0 0 8 225 0 active sync > >> 1 1 8 81 1 active sync /dev/sdf1 > >> 2 2 8 97 2 active sync /dev/sdg1 > >> 3 3 8 161 3 active sync /dev/sdk1 > >> 4 4 8 113 4 active sync /dev/sdh1 > >> 5 5 8 49 5 active sync /dev/sdd1 > >> 6 6 8 177 6 active sync /dev/sdl1 > >> 7 7 8 145 7 active sync /dev/sdj1 > >> 8 8 8 129 8 active sync /dev/sdi1 > >> 9 9 8 65 9 active sync /dev/sde1 > >> 10 10 8 209 10 active sync /dev/sdn1 > >> 11 11 8 193 11 active sync /dev/sdm1 > >> --- > >> Note that the drives are 'moved' because the old /dev/sdc isn't there > >> any more but the relative position should be the same, correct me if I > >> am wrong. If you prefer, what you need to do to get the 'new' drive > >> letter is to take 16 out of the minor of each of the drives. 
> >> > >> This is the 'new' --create > >> --- > >> /dev/sdc1: > >> Magic : a92b4efc > >> Version : 0.90.00 > >> UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive) > >> Creation Time : Tue Sep 6 15:15:03 2022 > >> Raid Level : raid5 > >> Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB) > >> Array Size : 41938562688 (39995.73 GiB 42945.09 GB) > >> Raid Devices : 12 > >> Total Devices : 12 > >> Preferred Minor : 123 > >> > >> Update Time : Tue Sep 6 15:15:03 2022 > >> State : clean > >> Active Devices : 12 > >> Working Devices : 12 > >> Failed Devices : 0 > >> Spare Devices : 0 > >> Checksum : ed12b96a - correct > >> Events : 1 > >> > >> Layout : left-symmetric > >> Chunk Size : 128K > >> > >> Number Major Minor RaidDevice State > >> this 5 8 33 5 active sync /dev/sdc1 > >> > >> 0 0 8 209 0 active sync /dev/sdn1 > >> 1 1 8 65 1 active sync /dev/sde1 > >> 2 2 8 81 2 active sync /dev/sdf1 > >> 3 3 8 145 3 active sync /dev/sdj1 > >> 4 4 8 97 4 active sync /dev/sdg1 > >> 5 5 8 33 5 active sync /dev/sdc1 > >> 6 6 8 161 6 active sync /dev/sdk1 > >> 7 7 8 129 7 active sync /dev/sdi1 > >> 8 8 8 113 8 active sync /dev/sdh1 > >> 9 9 8 49 9 active sync /dev/sdd1 > >> 10 10 8 193 10 active sync /dev/sdm1 > >> 11 11 8 177 11 active sync /dev/sdl1 > >> --- > >> > >> If you put the layout lines side by side, it would seem to me that > >> they match, modulo the '16' difference. 
> >> > >> This is the list of --create and --assemble commands from the 6th > >> which involve the sdx1 partitions, those we care about right now - > >> there were others involving /dev/md124 and the /dev/sdx2 which however > >> are not relevant - the data there : > >> -- > >> 9813 mdadm --assemble /dev/md123 missing > >> 9814 mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1 > >> /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 > >> /dev/sdn1 /dev/sdm1 > >> 9815 mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1 > >> /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1 > >> /dev/sdm1 > >> 9823 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1 > >> /dev/sdm1 /dev/sdl1 > >> 9824 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 > >> /dev/sdd1 /dev/sdm1 /dev/sdl1 > >> ^^^^ note that these were the WRONG ARRAY - this was an unfortunate > >> miscommunication which caused potential damage. 
> >> 9852 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1 > >> /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1 > >> 9863 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1 > >> /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 > >> /dev/sdl1 > >> 9879 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 > >> /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 > >> /dev/sdk1 /dev/sdl1 > >> 9889 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 > >> /dev/sdm1 /dev/sdl1 > >> 9892 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 > >> /dev/sdm1 /dev/sdl1 > >> 9895 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 > >> /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 > >> /dev/sdm1 /dev/sdl1 > >> 9901 mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 > >> /dev/sdm1 /dev/sdl1 > >> 9903 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 > >> /dev/sdj1 /dev/sdg1 / dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 > >> /dev/sdm1 /dev/sdl1 > >> --- > >> > >> Note that they all were -o, therefore if I am not mistaken no parity > >> data was written anywhere. 
Note further the fact that the first two
> >> were the 'mistake' ones, which did NOT have --assume-clean (but with
> >> -o this shouldn't make a difference AFAIK) and most importantly the
> >> metadata was the 1.2 default AND they were the wrong array in the
> >> first place.
> >> Note also that the 'final' --create commands also had --bitmap=none to
> >> match the original array, though according to the docs the bitmap
> >> space in 0.90 (and 1.2?) is in a space which does not affect the data
> >> in the first place.
> >>
> >> Now, first of all a question: if I get the 'old' sdc, the one that was
> >> taken out prior to this whole mess, onto a different system in order
> >> to examine it, the modern mdraid auto discovery should NOT overwrite
> >> the md data, correct? Thus I should be able to double-check the drive
> >> order on that as well?
> >>
> >> Any other pointers, insults etc are of course welcome.
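On the read-only inspection question, a minimal sketch of the cautious approach (not a tested procedure; the AUTO directive is a standard mdadm.conf keyword, the Debian config path is assumed, and /dev/sdX1 is a placeholder for the pulled drive's partition):

```shell
# Disable mdadm auto-assembly for all arrays on the inspection box
# ('AUTO -all' is an mdadm.conf directive; path shown is Debian's).
echo 'AUTO -all' >> /etc/mdadm/mdadm.conf

# --examine only *reads* the superblock from the component device;
# it writes nothing, so the 0.90 metadata and data area stay untouched.
mdadm --examine /dev/sdX1
```

With auto-assembly suppressed, the only remaining write risk would be a deliberate --assemble or --create, so the drive order recorded in that superblock can be read safely.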
* Re: RAID5 failure and consequent ext4 problems 2022-09-09 22:50 ` Luigi Fabio @ 2022-09-09 23:04 ` Luigi Fabio 2022-09-10 1:29 ` Luigi Fabio 0 siblings, 1 reply; 24+ messages in thread From: Luigi Fabio @ 2022-09-09 23:04 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid A further question, in THIS boot's log I found: [ 9874.709903] md/raid:md123: raid level 5 active with 12 out of 12 devices, algorithm 2 [ 9874.710249] md123: bitmap file is out of date (0 < 1) -- forcing full recovery [ 9874.714178] md123: bitmap file is out of date, doing full recovery [ 9874.881106] md123: detected capacity change from 0 to 42945088192512 From, I think, the second --create of /dev/md123, before I added the bitmap=none. This should, however, not have written anything with -o and --assume-clean, correct? On Fri, Sep 9, 2022 at 6:50 PM Luigi Fabio <luigi.fabio@gmail.com> wrote: > > By different kernels, maybe - but the kernel has been the same for > quite a while (months). > > I did paste the whole of the command lines in the (very long) email, > as David mentions (thanks!) - the first ones, the mistaken ones, did > NOT have --assume-clean but they did have -o, so no parity activity > should have started according to the docs? > A new thought came to mind: one of the HBAs lost a channel, right? > What if on the subsequent reboot the devices that were on that channel > got 'rediscovered' and shunted to the end of the letter order? That > would, I believe, be ordinary operating procedure. > That would give us an almost-correct array, which would explain how > fsck can get ... some pieces. > > Also, I am not quite brave enough (...) to use shortcuts when handling > mdadm commands. > > I am reconstructing the port order (scsi targets, if you prefer) from > the 20220904 boot log. I should at that point be able to have an exact > order of the drives. 
>
> Here it is:
>
> ---
> [ 1.853329] sd 2:0:0:0: [sda] Write Protect is off
> [ 1.853331] sd 7:0:0:0: [sdc] Write Protect is off
> [ 1.853382] sd 3:0:0:0: [sdb] Write Protect is off
> [ 12.531607] sd 10:0:3:0: [sdg] Write Protect is off
> [ 12.533303] sd 10:0:2:0: [sdf] Write Protect is off
> [ 12.534606] sd 10:0:0:0: [sdd] Write Protect is off
> [ 12.570768] sd 10:0:1:0: [sde] Write Protect is off
> [ 12.959925] sd 11:0:0:0: [sdh] Write Protect is off
> [ 12.965230] sd 11:0:1:0: [sdi] Write Protect is off
> [ 12.966145] sd 11:0:4:0: [sdl] Write Protect is off
> [ 12.966800] sd 11:0:3:0: [sdk] Write Protect is off
> [ 12.997253] sd 11:0:2:0: [sdj] Write Protect is off
> [ 13.002395] sd 11:0:7:0: [sdo] Write Protect is off
> [ 13.012693] sd 11:0:5:0: [sdm] Write Protect is off
> [ 13.017630] sd 11:0:6:0: [sdn] Write Protect is off
> ---
> If we combine this with the previous:
> ---
> [ 13.528395] md/raid:md123: device sdd1 operational as raid disk 5
> [ 13.528396] md/raid:md123: device sde1 operational as raid disk 9
> [ 13.528397] md/raid:md123: device sdg1 operational as raid disk 2
> [ 13.528398] md/raid:md123: device sdf1 operational as raid disk 1
> [ 13.528398] md/raid:md123: device sdh1 operational as raid disk 4
> [ 13.528399] md/raid:md123: device sdk1 operational as raid disk 3
> [ 13.528400] md/raid:md123: device sdj1 operational as raid disk 7
> [ 13.528401] md/raid:md123: device sdn1 operational as raid disk 10
> [ 13.528402] md/raid:md123: device sdi1 operational as raid disk 8
> [ 13.528402] md/raid:md123: device sdl1 operational as raid disk 6
> [ 13.528403] md/raid:md123: device sdm1 operational as raid disk 11
> [ 13.528403] md/raid:md123: device sdc1 operational as raid disk 0
> [ 13.531613] md/raid:md123: raid level 5 active with 12 out of 12 devices, algorithm 2
> [ 13.531644] md123: detected capacity change from 0 to 42945088192512
> ---
> We have a SCSI target -> raid disk number correspondence. 
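That correspondence can also be extracted mechanically from the two excerpts; a rough sketch, assuming the kernel lines look exactly as quoted above with timestamps stripped (the file names here are placeholders, not files from the thread):

```shell
# Join "sd H:C:T:L: [sdX] ..." lines with
# "md/raid:md123: device sdX1 operational as raid disk N" lines.
cat > targets.txt <<'EOF'
sd 10:0:0:0: [sdd] Write Protect is off
sd 10:0:1:0: [sde] Write Protect is off
sd 10:0:3:0: [sdg] Write Protect is off
EOF
cat > raiddisks.txt <<'EOF'
md/raid:md123: device sdd1 operational as raid disk 5
md/raid:md123: device sde1 operational as raid disk 9
md/raid:md123: device sdg1 operational as raid disk 2
EOF
awk '
  NR == FNR { gsub(/[][]/, "", $3); tgt[$3] = $2; next }  # [sdd] -> 10:0:0:0:
  { d = substr($3, 1, length($3) - 1)                     # sdd1  -> sdd
    print tgt[d], $3, "-> raid disk", $NF }
' targets.txt raiddisks.txt
# prints, e.g.: 10:0:0:0: sdd1 -> raid disk 5
```

The output is the SCSI-target-to-raid-slot table, which survives drive-letter renumbering across reboots.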
> As of this boot, the letter -> scsi target correspondences match, > shifted by one because as discussed 7:0:0:0 is no longer there (the > old, 'faulty' sdc). > Thus, having univocally determined the prior scsi target -> raid > position we can transpose it to the present drive letters, which are > shifted by one. > Therefore, we can generate, rectius have generated, a --create with > the same software versions, the same settings and the same drive > order. Is there any reason why, minus the 1.2 metadata overwriting > which should have only affected 12 blocks, the fs should 'not' be as > before? > Genuine question, mind. > > On Fri, Sep 9, 2022 at 5:48 PM Phil Turmel <philip@turmel.org> wrote: > > > > Reasonably likely, but not certain. > > > > Devices can be re-ordered by different kernels. That's why lsdrv prints > > serial numbers in its tree. > > > > You haven't mentioned whether your --create operations specified > > --assume-clean. > > > > Also, be aware that shell expansion of something like /dev/sd[dcbaefgh] > > is sorted to /dev/sd[abcdefgh]. Use curly brace expansion with commas > > if you are taking shortcuts. 
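Phil's globbing caveat is easy to demonstrate with plain files rather than devices (a quick sketch; bash or zsh assumed for the brace form):

```shell
cd "$(mktemp -d)"
touch sda1 sdb1 sdc1 sdd1

# Bracket globs are matched against existing files and *sorted* by the shell:
echo sd[dcba]1      # prints: sda1 sdb1 sdc1 sdd1  -- the typed order is lost

# Brace expansion is pure text generation; the typed order is preserved:
echo sd{d,c,b,a}1   # prints: sdd1 sdc1 sdb1 sda1
```

With mdadm --create, where argument order defines the raid-disk order, the sorted glob silently reorders the components, which is exactly the trap being warned about.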
> > > > On 9/9/22 17:01, Luigi Fabio wrote: > > > Another helpful datapoint, this is the boot *before* sdc got > > > --replaced with sdo: > > > > > > [ 13.528395] md/raid:md123: device sdd1 operational as raid disk 5 > > > [ 13.528396] md/raid:md123: device sde1 operational as raid disk 9 > > > [ 13.528397] md/raid:md123: device sdg1 operational as raid disk 2 > > > [ 13.528398] md/raid:md123: device sdf1 operational as raid disk 1 > > > [ 13.528398] md/raid:md123: device sdh1 operational as raid disk 4 > > > [ 13.528399] md/raid:md123: device sdk1 operational as raid disk 3 > > > [ 13.528400] md/raid:md123: device sdj1 operational as raid disk 7 > > > [ 13.528401] md/raid:md123: device sdn1 operational as raid disk 10 > > > [ 13.528402] md/raid:md123: device sdi1 operational as raid disk 8 > > > [ 13.528402] md/raid:md123: device sdl1 operational as raid disk 6 > > > [ 13.528403] md/raid:md123: device sdm1 operational as raid disk 11 > > > [ 13.528403] md/raid:md123: device sdc1 operational as raid disk 0 > > > [ 13.531613] md/raid:md123: raid level 5 active with 12 out of 12 > > > devices, algorithm 2 > > > [ 13.531644] md123: detected capacity change from 0 to 42945088192512 > > > > > > This gives us, correct me if I am wrong of course, an exact > > > representation of what the array 'used to look like', with sdc1 then > > > replaced by sdo1 (8/225). > > > > > > Just some confirmation that the order should (?) be the one above. > > > > > > LF > > > > > > On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@gmail.com> wrote: > > >> > > >> Thanks for reaching out, first of all. Apologies for the late reply, > > >> the brilliant (...) spam filter strikes again... > > >> > > >> On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@turmel.org> wrote: > > >>> No, the moment of stupid was that you re-created the array. > > >>> Simultaneous multi-drive failures that stop an array are easily fixed > > >>> with --assemble --force. Too late for that now. 
> > >> Noted for the future, thanks. > > >> > > >>> It is absurdly easy to screw up device order when re-creating, and if > > >>> you didn't specify every allocation and layout detail, the changes in > > >>> defaults over the years would also screw up your data. And finally, > > >>> omitting --assume-clean would cause all of your parity to be > > >>> recalculated immediately, with catastrophic results if any order or > > >>> allocation attributes are wrong. > > >> Of course. Which is why I specified everything and why I checked the > > >> details with --examine and --detail and they match exactly, minus the > > >> metadata version because, well, I wasn't actually the one typing (it's > > >> a slightly complicated story.. I was reassembling by proxy on the > > >> phone) and I made an incorrect assumption about the person typing. > > >> There aren't, in the end, THAT many things to specify: RAID level, > > >> number of drives, order thereof, chunk size, 'layout' and metadata > > >> version. 0.90 doesn't allow before/after gaps so that should be it, I > > >> believe. > > >> Am I missing anything? > > >> > > >>> No, you just got lucky in the past. Probably by using mdadm versions > > >>> that hadn't been updated. > > >> That's not quite it: I keep records of how arrays are built and match > > >> them, though it is true that I tend to update things as little as > > >> possible on production machines. > > >> One of the differences, this time, is that this was NOT a production > > >> machine. The other was that I was driving, dictating on the phone and > > >> was under a lot of pressure to get the thing back up ASAP. > > >> Nonetheless, I have an --examine of at least two drives from the > > >> previous setup so there should be enough information there to rebuild > > >> a matching array, I think? > > >> > > >>> You'll need to show us every command you tried from your history, and > > >>> full details of all drives/partitions involved. 
> > >>>
> > >>> But I'll be brutally honest: your data is likely toast.
> > >> Well, let's hope it isn't. All mdadm commands were -o and
> > >> --assume-clean, so in theory the only thing which HAS been written are
> > >> the md blocks, unless I am mistaken and/or I read the docs
> > >> incorrectly?
> > >>
> > >> That does, of course, leave the problem of the blocks overwritten by
> > >> the 1.2 metadata, but as I read the docs that should be a very small
> > >> number - let's say one 4096-byte block (a portion thereof, to be
> > >> pedantic, but ext4 doesn't really care?) per drive, correct?
> > >>
> > >> Background:
> > >> Separate 2x SSD RAID 1 root (/dev/sda, /dev/sdb) on the MB (Supermicro
> > >> X10 series)'s chipset SATA ports.
> > >> All filesystems are ext4, data=journal, nodelalloc, the 'data' RAIDs
> > >> have journals on another SSD RAID1 (one per FS, obviously).
> > >> Data drives:
> > >> 12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers,
> > >> each with two four-drive ports (and one of these went DELIGHTFULLY
> > >> missing)
> > >>
> > >> This is the layout of each drive:
> > >> ---
> > >> GPT fdisk (gdisk) version 1.0.6
> > >> ...
> > >> Found valid GPT with protective MBR; using GPT.
> > >> Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
> > >> Model: ST4000NC001-1FS1
> > >> Sector size (logical/physical): 512/4096 bytes
> > >> ...
> > >> Total free space is 99949 sectors (48.8 MiB)
> > >>
> > >> Number Start (sector) End (sector) Size Code Name
> > >> 1 2048 7625195519 3.5 TiB 8300 Linux RAID volume
> > >> 2 7625195520 7813939199 90.0 GiB 8300 Linux RAID backup
> > >> ---
> > >>
> > >> So there were two RAID arrays. Both RAID5 - a main RAID called
> > >> 'archive' which had the 12 x 3.5ish partitions sdx1 and a second array
> > >> called backup which had 12 x 90 GB.
> > >>
> > >> A little further backstory: right before the event, one drive had been
> > >> pulled because it had started failing.
What I did was shut down the > > >> machine, put the failing drive on a MB port and put a new drive on the > > >> LSI controllers. I then brought the machine back online, did the > > >> --replace --with thing and this worked fine. > > >> At that point the faulty drive (/dev/sdc, MB drives come before the > > >> LSI drives in the count) got deleted via /sys/block.... and physically > > >> disconnected from the system, which was then happily running with > > >> /dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as > > >> the 'archive' drives. > > >> It went 96 hours or so like that under moderate load. Then the failure > > >> happened, the machine was rebooted thus the previous sdd -> sdo drives > > >> became sdc -> sdn drives. > > >> However, the relative order was, to the best of my knowledge, > > >> conserved - AND I still have the 'faulty' drive, so I could very > > >> easily put it back in to have everything match. > > >> Most importantly, this drive has on it, without a doubt, the details > > >> of the array BEFORE everything happened - by definition untouched > > >> because the drive was stopped and pulled before the event. > > >> I also have a cat of the --examine of two of the faulty drives BEFORE > > >> anything was written to them - thus, unless I am mistaken, these > > >> contained the md block details from 'before the event'. 
> > >>
> > >> Here is one of them, taken after the reboot and therefore when the MB
> > >> /dev/sdc was no longer there:
> > >> ---
> > >> /dev/sdc1:
> > >> Magic : a92b4efc
> > >> Version : 0.90.00
> > >> UUID : 2457b506:85728e9d:c44c77eb:7ee19756
> > >> Creation Time : Sat Mar 30 18:18:00 2019
> > >> Raid Level : raid5
> > >> Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> > >> Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> > >> Raid Devices : 12
> > >> Total Devices : 12
> > >> Preferred Minor : 123
> > >>
> > >> Update Time : Tue Sep 6 11:37:53 2022
> > >> State : clean
> > >> Active Devices : 12
> > >> Working Devices : 12
> > >> Failed Devices : 0
> > >> Spare Devices : 0
> > >> Checksum : 391e325d - correct
> > >> Events : 52177
> > >>
> > >> Layout : left-symmetric
> > >> Chunk Size : 128K
> > >>
> > >> Number Major Minor RaidDevice State
> > >> this 5 8 49 5 active sync /dev/sdd1
> > >>
> > >> 0 0 8 225 0 active sync
> > >> 1 1 8 81 1 active sync /dev/sdf1
> > >> 2 2 8 97 2 active sync /dev/sdg1
> > >> 3 3 8 161 3 active sync /dev/sdk1
> > >> 4 4 8 113 4 active sync /dev/sdh1
> > >> 5 5 8 49 5 active sync /dev/sdd1
> > >> 6 6 8 177 6 active sync /dev/sdl1
> > >> 7 7 8 145 7 active sync /dev/sdj1
> > >> 8 8 8 129 8 active sync /dev/sdi1
> > >> 9 9 8 65 9 active sync /dev/sde1
> > >> 10 10 8 209 10 active sync /dev/sdn1
> > >> 11 11 8 193 11 active sync /dev/sdm1
> > >> ---
> > >> Note that the drives are 'moved' because the old /dev/sdc isn't there
> > >> any more but the relative position should be the same, correct me if I
> > >> am wrong. If you prefer, what you need to do to get the 'new' drive
> > >> letter is to take 16 out of the minor of each of the drives. 
> > >>
> > >> This is the 'new' --create
> > >> ---
> > >> /dev/sdc1:
> > >> Magic : a92b4efc
> > >> Version : 0.90.00
> > >> UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive)
> > >> Creation Time : Tue Sep 6 15:15:03 2022
> > >> Raid Level : raid5
> > >> Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> > >> Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> > >> Raid Devices : 12
> > >> Total Devices : 12
> > >> Preferred Minor : 123
> > >>
> > >> Update Time : Tue Sep 6 15:15:03 2022
> > >> State : clean
> > >> Active Devices : 12
> > >> Working Devices : 12
> > >> Failed Devices : 0
> > >> Spare Devices : 0
> > >> Checksum : ed12b96a - correct
> > >> Events : 1
> > >>
> > >> Layout : left-symmetric
> > >> Chunk Size : 128K
> > >>
> > >> Number Major Minor RaidDevice State
> > >> this 5 8 33 5 active sync /dev/sdc1
> > >>
> > >> 0 0 8 209 0 active sync /dev/sdn1
> > >> 1 1 8 65 1 active sync /dev/sde1
> > >> 2 2 8 81 2 active sync /dev/sdf1
> > >> 3 3 8 145 3 active sync /dev/sdj1
> > >> 4 4 8 97 4 active sync /dev/sdg1
> > >> 5 5 8 33 5 active sync /dev/sdc1
> > >> 6 6 8 161 6 active sync /dev/sdk1
> > >> 7 7 8 129 7 active sync /dev/sdi1
> > >> 8 8 8 113 8 active sync /dev/sdh1
> > >> 9 9 8 49 9 active sync /dev/sdd1
> > >> 10 10 8 193 10 active sync /dev/sdm1
> > >> 11 11 8 177 11 active sync /dev/sdl1
> > >> ---
> > >>
> > >> If you put the layout lines side by side, it would seem to me that
> > >> they match, modulo the '16' difference. 
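That '16' arithmetic can be checked mechanically: sd devices on major 8 get a block of 16 minors per whole disk, so removing one disk shifts every later minor down by 16. A small bash sketch (the helper name is mine, not from the thread):

```shell
# Translate a major-8 minor number into an sd device name.
# disk index = minor / 16 (sda=0-15, sdb=16-31, ...), partition = minor % 16.
minor_to_dev() {
    local letters=abcdefghijklmnopqrstuvwxyz
    local disk=$(( $1 / 16 ))
    local part=$(( $1 % 16 ))
    echo "sd${letters:$disk:1}${part}"
}

minor_to_dev 225   # -> sdo1 (the 8/225 --replace target mentioned earlier)
minor_to_dev 49    # -> sdd1 (raid disk 5 in the old table)
minor_to_dev 33    # -> sdc1 (49 - 16: the same slot in the new table)
```

Running it over both Minor columns reproduces the side-by-side match described above.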
> > >>
> > >> This is the list of --create and --assemble commands from the 6th
> > >> which involve the sdx1 partitions, those we care about right now -
> > >> there were others involving /dev/md124 and the /dev/sdx2 which however
> > >> are not relevant - the data there :
> > >> --
> > >> 9813 mdadm --assemble /dev/md123 missing
> > >> 9814 mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1 /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1 /dev/sdm1
> > >> 9815 mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1 /dev/sdm1
> > >> 9823 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1 /dev/sdm1 /dev/sdl1
> > >> 9824 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
> > >> ^^^^ note that these were the WRONG ARRAY - this was an unfortunate
> > >> miscommunication which caused potential damage.
> > >> 9852 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
> > >> 9863 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
> > >> 9879 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
> > >> 9889 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
> > >> 9892 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
> > >> 9895 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
> > >> 9901 mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
> > >> 9903 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 /dev/sdm1 /dev/sdl1
> > >> ---
> > >>
> > >> Note that they all were -o, therefore if I am not mistaken no parity
> > >> data was written
anywhere. Note further the fact that the first two
> > >> were the 'mistake' ones, which did NOT have --assume-clean (but with
> > >> -o this shouldn't make a difference AFAIK) and most importantly the
> > >> metadata was the 1.2 default AND they were the wrong array in the
> > >> first place.
> > >> Note also that the 'final' --create commands also had --bitmap=none to
> > >> match the original array, though according to the docs the bitmap
> > >> space in 0.90 (and 1.2?) is in a space which does not affect the data
> > >> in the first place.
> > >>
> > >> Now, first of all a question: if I get the 'old' sdc, the one that was
> > >> taken out prior to this whole mess, onto a different system in order
> > >> to examine it, the modern mdraid auto discovery should NOT overwrite
> > >> the md data, correct? Thus I should be able to double-check the drive
> > >> order on that as well?
> > >>
> > >> Any other pointers, insults etc are of course welcome.
* Re: RAID5 failure and consequent ext4 problems 2022-09-09 23:04 ` Luigi Fabio @ 2022-09-10 1:29 ` Luigi Fabio 2022-09-10 15:18 ` Phil Turmel 0 siblings, 1 reply; 24+ messages in thread From: Luigi Fabio @ 2022-09-10 1:29 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid For completeness' sake, though it should not be relevant, here is the error that caused the mishap: --- Sep 6 11:41:18 beehive kernel: [164700.275878] mpt2sas_cm0: SAS host is non-operational !!!! Sep 6 11:41:19 beehive kernel: [164701.395828] mpt2sas_cm0: SAS host is non-operational !!!! Sep 6 11:41:21 beehive kernel: [164702.515813] mpt2sas_cm0: SAS host is non-operational !!!! Sep 6 11:41:22 beehive kernel: [164703.635801] mpt2sas_cm0: SAS host is non-operational !!!! Sep 6 11:41:23 beehive kernel: [164704.723793] mpt2sas_cm0: SAS host is non-operational !!!! Sep 6 11:41:24 beehive kernel: [164705.811778] mpt2sas_cm0: SAS host is non-operational !!!! Sep 6 11:41:24 beehive kernel: [164705.894616] mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!! 
Sep 6 11:41:24 beehive kernel: [164705.981926] sd 10:0:0:0: [sdd] Synchronizing SCSI cache Sep 6 11:41:24 beehive kernel: [164705.981967] sd 10:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Sep 6 11:41:24 beehive kernel: [164705.987746] sd 10:0:1:0: [sde] tag#2758 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=6s Sep 6 11:41:24 beehive kernel: [164705.987749] sd 10:0:1:0: [sde] tag#2758 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 98 00 00 00 68 00 00 Sep 6 11:41:24 beehive kernel: [164705.987751] blk_update_request: I/O error, dev sde, sector 5779360408 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 Sep 6 11:41:24 beehive kernel: [164706.159887] sd 10:0:1:0: [sde] tag#2759 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=6s Sep 6 11:41:24 beehive kernel: [164706.159897] sd 10:0:1:0: [sde] tag#2759 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 00 00 00 00 98 00 00 Sep 6 11:41:24 beehive kernel: [164706.159903] blk_update_request: I/O error, dev sde, sector 5779360256 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 0 Sep 6 11:41:24 beehive kernel: [164706.160073] sd 10:0:1:0: [sde] tag#2761 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s Sep 6 11:41:24 beehive kernel: [164706.333860] sd 10:0:1:0: [sde] tag#2761 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 98 00 00 00 08 00 00 Sep 6 11:41:24 beehive kernel: [164706.333862] blk_update_request: I/O error, dev sde, sector 5779360408 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0 Sep 6 11:41:24 beehive kernel: [164706.333864] sd 10:0:2:0: [sdf] tag#2760 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=6s Sep 6 11:41:24 beehive kernel: [164706.334010] sd 10:0:1:0: [sde] tag#2774 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s Sep 6 11:41:24 beehive kernel: [164706.334012] sd 10:0:1:0: [sde] tag#2774 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 00 00 00 00 08 00 00 Sep 6 
11:41:24 beehive kernel: [164706.334014] blk_update_request: I/O error, dev sde, sector 5779360256 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0 Sep 6 11:41:24 beehive kernel: [164706.334021] sd 10:0:1:0: [sde] tag#2775 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s Sep 6 11:41:24 beehive kernel: [164706.334022] sd 10:0:1:0: [sde] tag#2775 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 08 00 00 00 08 00 00 Sep 6 11:41:24 beehive kernel: [164706.334024] blk_update_request: I/O error, dev sde, sector 5779360264 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0 Sep 6 11:41:24 beehive kernel: [164706.334026] sd 10:0:1:0: [sde] tag#2776 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s Sep 6 11:41:24 beehive kernel: [164706.334028] sd 10:0:1:0: [sde] tag#2776 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 10 00 00 00 08 00 00 Sep 6 11:41:24 beehive kernel: [164706.334029] blk_update_request: I/O error, dev sde, sector 5779360272 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0 Sep 6 11:41:24 beehive kernel: [164706.334031] sd 10:0:1:0: [sde] tag#2777 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s Sep 6 11:41:24 beehive kernel: [164706.334033] sd 10:0:1:0: [sde] tag#2777 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 18 00 00 00 08 00 00 Sep 6 11:41:24 beehive kernel: [164706.334034] blk_update_request: I/O error, dev sde, sector 5779360280 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0 Sep 6 11:41:24 beehive kernel: [164706.334036] sd 10:0:1:0: [sde] tag#2778 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s Sep 6 11:41:24 beehive kernel: [164706.334037] sd 10:0:1:0: [sde] tag#2778 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 20 00 00 00 08 00 00 Sep 6 11:41:24 beehive kernel: [164706.334038] blk_update_request: I/O error, dev sde, sector 5779360288 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0 Sep 6 11:41:24 beehive kernel: [164706.334039] sd 10:0:1:0: [sde] tag#2779 FAILED Result: 
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s Sep 6 11:41:24 beehive kernel: [164706.334041] sd 10:0:1:0: [sde] tag#2779 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 28 00 00 00 08 00 00 Sep 6 11:41:24 beehive kernel: [164706.334041] blk_update_request: I/O error, dev sde, sector 5779360296 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0 Sep 6 11:41:24 beehive kernel: [164706.334043] blk_update_request: I/O error, dev sde, sector 5779360304 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0 Sep 6 11:41:24 beehive kernel: [164706.346000] md/raid:md123: Disk failure on sde1, disabling device. Sep 6 11:41:24 beehive kernel: [164706.346002] md/raid:md123: Operation continuing on 11 devices. Sep 6 11:41:24 beehive kernel: [164706.346008] md/raid:md123: Disk failure on sdd1, disabling device. Sep 6 11:41:24 beehive kernel: [164706.346009] md/raid:md123: Cannot continue operation (2/12 failed). Sep 6 11:41:24 beehive kernel: [164706.346011] md/raid:md123: Disk failure on sdg1, disabling device. Sep 6 11:41:24 beehive kernel: [164706.346012] md/raid:md123: Cannot continue operation (3/12 failed). Sep 6 11:41:24 beehive kernel: [164706.346013] md/raid:md123: Disk failure on sdf1, disabling device. Sep 6 11:41:24 beehive kernel: [164706.346014] md/raid:md123: Cannot continue operation (4/12 failed). ---- Note that port 0 of the 10: controller just... lost it. That controller only uses port 0, so we don't know if it was the whole controller or just the port, but that is what happened. Of course, it now works. sdd , sde, sdf and sdg decided they were going on holiday, sdc had already been removed at this point as mentioned, controller 11: with the other eight drives was just fine, apparently. The *really odd* thing is that it failed... gracefully. I cannot understand what damaged the filesystem. 
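For what it's worth, a quick per-device tally of the I/O errors in a saved log makes the single-channel pattern above stand out at a glance; a sketch assuming the exact blk_update_request wording quoted here (the sample file is a placeholder, not a file from the thread):

```shell
# Count "blk_update_request: I/O error" lines per device in a saved kernel log.
cat > kern.sample <<'EOF'
Sep 6 11:41:24 beehive kernel: [164705.987751] blk_update_request: I/O error, dev sde, sector 5779360408 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Sep 6 11:41:24 beehive kernel: [164706.159903] blk_update_request: I/O error, dev sde, sector 5779360256 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 0
Sep 6 11:41:24 beehive kernel: [164706.333862] blk_update_request: I/O error, dev sdf, sector 5779360123 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
EOF
sed -n 's/.*I\/O error, dev \([a-z]*\),.*/\1/p' kern.sample | sort | uniq -c
```

Run against the real log for the failure window, every affected device should land on the 10: controller's port, consistent with the HBA channel dropping rather than the drives themselves failing.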
On Fri, Sep 9, 2022 at 7:04 PM Luigi Fabio <luigi.fabio@gmail.com> wrote: > > A further question, in THIS boot's log I found: > [ 9874.709903] md/raid:md123: raid level 5 active with 12 out of 12 > devices, algorithm 2 > [ 9874.710249] md123: bitmap file is out of date (0 < 1) -- forcing > full recovery > [ 9874.714178] md123: bitmap file is out of date, doing full recovery > [ 9874.881106] md123: detected capacity change from 0 to 42945088192512 > From, I think, the second --create of /dev/md123, before I added the > bitmap=none. This should, however, not have written anything with -o > and --assume-clean, correct? > > On Fri, Sep 9, 2022 at 6:50 PM Luigi Fabio <luigi.fabio@gmail.com> wrote: > > > > By different kernels, maybe - but the kernel has been the same for > > quite a while (months). > > > > I did paste the whole of the command lines in the (very long) email, > > as David mentions (thanks!) - the first ones, the mistaken ones, did > > NOT have --assume-clean but they did have -o, so no parity activity > > should have started according to the docs? > > A new thought came to mind: one of the HBAs lost a channel, right? > > What if on the subsequent reboot the devices that were on that channel > > got 'rediscovered' and shunted to the end of the letter order? That > > would, I believe, be ordinary operating procedure. > > That would give us an almost-correct array, which would explain how > > fsck can get ... some pieces. > > > > Also, I am not quite brave enough (...) to use shortcuts when handling > > mdadm commands. > > > > I am reconstructing the port order (scsi targets, if you prefer) from > > the 20220904 boot log. I should at that point be able to have an exact > > order of the drives. 
> > > > Here it is: > > > > --- > > [ 1.853329] sd 2:0:0:0: [sda] Write Protect is off > > [ 1.853331] sd 7:0:0:0: [sdc] Write Protect is off > > [ 1.853382] sd 3:0:0:0: [sdb] Write Protect is off > > [ 12.531607] sd 10:0:3:0: [sdg] Write Protect is off > > [ 12.533303] sd 10:0:2:0: [sdf] Write Protect is off > > [ 12.534606] sd 10:0:0:0: [sdd] Write Protect is off > > [ 12.570768] sd 10:0:1:0: [sde] Write Protect is off > > [ 12.959925] sd 11:0:0:0: [sdh] Write Protect is off > > [ 12.965230] sd 11:0:1:0: [sdi] Write Protect is off > > [ 12.966145] sd 11:0:4:0: [sdl] Write Protect is off > > [ 12.966800] sd 11:0:3:0: [sdk] Write Protect is off > > [ 12.997253] sd 11:0:2:0: [sdj] Write Protect is off > > [ 13.002395] sd 11:0:7:0: [sdo] Write Protect is off > > [ 13.012693] sd 11:0:5:0: [sdm] Write Protect is off > > [ 13.017630] sd 11:0:6:0: [sdn] Write Protect is off > > --- > > If we combine this with the previous: > > --- > > [ 13.528395] md/raid:md123: device sdd1 operational as raid disk 5 > > [ 13.528396] md/raid:md123: device sde1 operational as raid disk 9 > > [ 13.528397] md/raid:md123: device sdg1 operational as raid disk 2 > > [ 13.528398] md/raid:md123: device sdf1 operational as raid disk 1 > > [ 13.528398] md/raid:md123: device sdh1 operational as raid disk 4 > > [ 13.528399] md/raid:md123: device sdk1 operational as raid disk 3 > > [ 13.528400] md/raid:md123: device sdj1 operational as raid disk 7 > > [ 13.528401] md/raid:md123: device sdn1 operational as raid disk 10 > > [ 13.528402] md/raid:md123: device sdi1 operational as raid disk 8 > > [ 13.528402] md/raid:md123: device sdl1 operational as raid disk 6 > > [ 13.528403] md/raid:md123: device sdm1 operational as raid disk 11 > > [ 13.528403] md/raid:md123: device sdc1 operational as raid disk 0 > > [ 13.531613] md/raid:md123: raid level 5 active with 12 out of 12 > > devices, algorithm 2 > > [ 13.531644] md123: detected capacity change from 0 to 42945088192512 > > --- > > We have a SCSI target -> 
raid disk number correspondence. > > As of this boot, the letter -> scsi target correspondences match, > > shifted by one because as discussed 7:0:0:0 is no longer there (the > > old, 'faulty' sdc). > > Thus, having univocally determined the prior scsi target -> raid > > position we can transpose it to the present drive letters, which are > > shifted by one. > > Therefore, we can generate, rectius have generated, a --create with > > the same software versions, the same settings and the same drive > > order. Is there any reason why, minus the 1.2 metadata overwriting > > which should have only affected 12 blocks, the fs should 'not' be as > > before? > > Genuine question, mind. > > > > On Fri, Sep 9, 2022 at 5:48 PM Phil Turmel <philip@turmel.org> wrote: > > > > > > Reasonably likely, but not certain. > > > > > > Devices can be re-ordered by different kernels. That's why lsdrv prints > > > serial numbers in its tree. > > > > > > You haven't mentioned whether your --create operations specified > > > --assume-clean. > > > > > > Also, be aware that shell expansion of something like /dev/sd[dcbaefgh] > > > is sorted to /dev/sd[abcdefgh]. Use curly brace expansion with commas > > > if you are taking shortcuts. 
> > > > > > On 9/9/22 17:01, Luigi Fabio wrote: > > > > Another helpful datapoint, this is the boot *before* sdc got > > > > --replaced with sdo: > > > > > > > > [ 13.528395] md/raid:md123: device sdd1 operational as raid disk 5 > > > > [ 13.528396] md/raid:md123: device sde1 operational as raid disk 9 > > > > [ 13.528397] md/raid:md123: device sdg1 operational as raid disk 2 > > > > [ 13.528398] md/raid:md123: device sdf1 operational as raid disk 1 > > > > [ 13.528398] md/raid:md123: device sdh1 operational as raid disk 4 > > > > [ 13.528399] md/raid:md123: device sdk1 operational as raid disk 3 > > > > [ 13.528400] md/raid:md123: device sdj1 operational as raid disk 7 > > > > [ 13.528401] md/raid:md123: device sdn1 operational as raid disk 10 > > > > [ 13.528402] md/raid:md123: device sdi1 operational as raid disk 8 > > > > [ 13.528402] md/raid:md123: device sdl1 operational as raid disk 6 > > > > [ 13.528403] md/raid:md123: device sdm1 operational as raid disk 11 > > > > [ 13.528403] md/raid:md123: device sdc1 operational as raid disk 0 > > > > [ 13.531613] md/raid:md123: raid level 5 active with 12 out of 12 > > > > devices, algorithm 2 > > > > [ 13.531644] md123: detected capacity change from 0 to 42945088192512 > > > > > > > > This gives us, correct me if I am wrong of course, an exact > > > > representation of what the array 'used to look like', with sdc1 then > > > > replaced by sdo1 (8/225). > > > > > > > > Just some confirmation that the order should (?) be the one above. > > > > > > > > LF > > > > > > > > On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@gmail.com> wrote: > > > >> > > > >> Thanks for reaching out, first of all. Apologies for the late reply, > > > >> the brilliant (...) spam filter strikes again... > > > >> > > > >> On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@turmel.org> wrote: > > > >>> No, the moment of stupid was that you re-created the array. 
> > > >>> Simultaneous multi-drive failures that stop an array are easily fixed > > > >>> with --assemble --force. Too late for that now. > > > >> Noted for the future, thanks. > > > >> > > > >>> It is absurdly easy to screw up device order when re-creating, and if > > > >>> you didn't specify every allocation and layout detail, the changes in > > > >>> defaults over the years would also screw up your data. And finally, > > > >>> omitting --assume-clean would cause all of your parity to be > > > >>> recalculated immediately, with catastrophic results if any order or > > > >>> allocation attributes are wrong. > > > >> Of course. Which is why I specified everything and why I checked the > > > >> details with --examine and --detail and they match exactly, minus the > > > >> metadata version because, well, I wasn't actually the one typing (it's > > > >> a slightly complicated story.. I was reassembling by proxy on the > > > >> phone) and I made an incorrect assumption about the person typing. > > > >> There aren't, in the end, THAT many things to specify: RAID level, > > > >> number of drives, order thereof, chunk size, 'layout' and metadata > > > >> version. 0.90 doesn't allow before/after gaps so that should be it, I > > > >> believe. > > > >> Am I missing anything? > > > >> > > > >>> No, you just got lucky in the past. Probably by using mdadm versions > > > >>> that hadn't been updated. > > > >> That's not quite it: I keep records of how arrays are built and match > > > >> them, though it is true that I tend to update things as little as > > > >> possible on production machines. > > > >> One of the differences, this time, is that this was NOT a production > > > >> machine. The other was that I was driving, dictating on the phone and > > > >> was under a lot of pressure to get the thing back up ASAP. 
> > > >> Nonetheless, I have an --examine of at least two drives from the > > > >> previous setup so there should be enough information there to rebuild > > > >> a matching array, I think? > > > >> > > > >>> You'll need to show us every command you tried from your history, and > > > >>> full details of all drives/partitions involved. > > > >>> > > > >>> But I'll be brutally honest: your data is likely toast. > > > >> Well, let's hope it isn't. All mdadm commands were -o and > > > >> --assume-clean, so in theory the only thing which HAS been written are > > > >> the md blocks, unless I am mistaken and/or I read the docs > > > >> incorrectly? > > > >> > > > >> That does, of course, leave the problem of the blocks overwritten by > > > >> the 1.2 metadata, but as I read the docs that should be a very small > > > >> number - let's say one 4096-byte block (a portion thereof, to be > > > >> pedantic, but ext4 doesn't really care?) per drive, correct? > > > >> > > > >> Background: > > > >> Separate 2x SSD RAID 1 root (/dev/sda, /dev/sdb) on the MB (Supermicro > > > >> X10 series)'s chipset SATA ports. > > > >> All filesystems are ext4, data=journal, nodelalloc, the 'data' RAIDs > > > >> have journals on another SSD RAID1 (one per FS, obviously). > > > >> Data drives: > > > >> 12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers, > > > >> each with two four-drive ports (and one of these went DELIGHTFULLY > > > >> missing) > > > >> > > > >> This is the layout of each drive: > > > >> --- > > > >> GPT fdisk (gdisk) version 1.0.6 > > > >> ... > > > >> Found valid GPT with protective MBR; using GPT. > > > >> Disk /dev/sdc: 7814037168 sectors, 3.6 TiB > > > >> Model: ST4000NC001-1FS1 > > > >> Sector size (logical/physical): 512/4096 bytes > > > >> ...
> > > >> Total free space is 99949 sectors (48.8 MiB) > > > >> > > > >> Number Start (sector) End (sector) Size Code Name > > > >> 1 2048 7625195519 3.5 TiB 8300 Linux RAID volume > > > >> 2 7625195520 7813939199 90.0 GiB 8300 Linux RAID backup > > > >> --- > > > >> > > > >> So there were two RAID arrays. Both RAID5 - a main RAID called > > > >> 'archive' which had the 12 x 3.5ish partitions sdx1 and a second array > > > >> called backup which had 12 x 90 GB. > > > >> > > > >> A little further backstory: right before the event, one drive had been > > > >> pulled because it had started failing. What I did was shut down the > > > >> machine, put the failing drive on a MB port and put a new drive on the > > > >> LSI controllers. I then brought the machine back online, did the > > > >> --replace --with thing and this worked fine. > > > >> At that point the faulty drive (/dev/sdc, MB drives come before the > > > >> LSI drives in the count) got deleted via /sys/block.... and physically > > > >> disconnected from the system, which was then happily running with > > > >> /dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as > > > >> the 'archive' drives. > > > >> It went 96 hours or so like that under moderate load. Then the failure > > > >> happened, the machine was rebooted thus the previous sdd -> sdo drives > > > >> became sdc -> sdn drives. > > > >> However, the relative order was, to the best of my knowledge, > > > >> conserved - AND I still have the 'faulty' drive, so I could very > > > >> easily put it back in to have everything match. > > > >> Most importantly, this drive has on it, without a doubt, the details > > > >> of the array BEFORE everything happened - by definition untouched > > > >> because the drive was stopped and pulled before the event. 
> > > >> I also have a cat of the --examine of two of the faulty drives BEFORE > > > >> anything was written to them - thus, unless I am mistaken, these > > > >> contained the md block details from 'before the event'. > > > >> > > > >> Here is one of them, taken after the reboot and therefore when the MB > > > >> /dev/sdc was no longer there: > > > >> --- > > > >> /dev/sdc1: > > > >> Magic : a92b4efc > > > >> Version : 0.90.00 > > > >> UUID : 2457b506:85728e9d:c44c77eb:7ee19756 > > > >> Creation Time : Sat Mar 30 18:18:00 2019 > > > >> Raid Level : raid5 > > > >> Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB) > > > >> Array Size : 41938562688 (39995.73 GiB 42945.09 GB) > > > >> Raid Devices : 12 > > > >> Total Devices : 12 > > > >> Preferred Minor : 123 > > > >> > > > >> Update Time : Tue Sep 6 11:37:53 2022 > > > >> State : clean > > > >> Active Devices : 12 > > > >> Working Devices : 12 > > > >> Failed Devices : 0 > > > >> Spare Devices : 0 > > > >> Checksum : 391e325d - correct > > > >> Events : 52177 > > > >> > > > >> Layout : left-symmetric > > > >> Chunk Size : 128K > > > >> > > > >> Number Major Minor RaidDevice State > > > >> this 5 8 49 5 active sync /dev/sdd1 > > > >> > > > >> 0 0 8 225 0 active sync > > > >> 1 1 8 81 1 active sync /dev/sdf1 > > > >> 2 2 8 97 2 active sync /dev/sdg1 > > > >> 3 3 8 161 3 active sync /dev/sdk1 > > > >> 4 4 8 113 4 active sync /dev/sdh1 > > > >> 5 5 8 49 5 active sync /dev/sdd1 > > > >> 6 6 8 177 6 active sync /dev/sdl1 > > > >> 7 7 8 145 7 active sync /dev/sdj1 > > > >> 8 8 8 129 8 active sync /dev/sdi1 > > > >> 9 9 8 65 9 active sync /dev/sde1 > > > >> 10 10 8 209 10 active sync /dev/sdn1 > > > >> 11 11 8 193 11 active sync /dev/sdm1 > > > >> --- > > > >> Note that the drives are 'moved' because the old /dev/sdc isn't there > > > >> any more but the relative position should be the same, correct me if I > > > >> am wrong. 
If you prefer, what you need to do to get the 'new' drive > > > >> letter is to take 16 out of the minor of each of the drives. > > > >> > > > >> This is the 'new' --create > > > >> --- > > > >> /dev/sdc1: > > > >> Magic : a92b4efc > > > >> Version : 0.90.00 > > > >> UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive) > > > >> Creation Time : Tue Sep 6 15:15:03 2022 > > > >> Raid Level : raid5 > > > >> Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB) > > > >> Array Size : 41938562688 (39995.73 GiB 42945.09 GB) > > > >> Raid Devices : 12 > > > >> Total Devices : 12 > > > >> Preferred Minor : 123 > > > >> > > > >> Update Time : Tue Sep 6 15:15:03 2022 > > > >> State : clean > > > >> Active Devices : 12 > > > >> Working Devices : 12 > > > >> Failed Devices : 0 > > > >> Spare Devices : 0 > > > >> Checksum : ed12b96a - correct > > > >> Events : 1 > > > >> > > > >> Layout : left-symmetric > > > >> Chunk Size : 128K > > > >> > > > >> Number Major Minor RaidDevice State > > > >> this 5 8 33 5 active sync /dev/sdc1 > > > >> > > > >> 0 0 8 209 0 active sync /dev/sdn1 > > > >> 1 1 8 65 1 active sync /dev/sde1 > > > >> 2 2 8 81 2 active sync /dev/sdf1 > > > >> 3 3 8 145 3 active sync /dev/sdj1 > > > >> 4 4 8 97 4 active sync /dev/sdg1 > > > >> 5 5 8 33 5 active sync /dev/sdc1 > > > >> 6 6 8 161 6 active sync /dev/sdk1 > > > >> 7 7 8 129 7 active sync /dev/sdi1 > > > >> 8 8 8 113 8 active sync /dev/sdh1 > > > >> 9 9 8 49 9 active sync /dev/sdd1 > > > >> 10 10 8 193 10 active sync /dev/sdm1 > > > >> 11 11 8 177 11 active sync /dev/sdl1 > > > >> --- > > > >> > > > >> If you put the layout lines side by side, it would seem to me that > > > >> they match, modulo the '16' difference. 
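The 'subtract 16' reading of those two tables can be mechanized: for whole `sd` disks on major 8, minors come in blocks of 16 (sda = 0, sdb = 16, ...), so `minor / 16` picks the letter and `minor % 16` the partition. A small sketch, valid only for the first 26 disks (sda..sdz) and for partition minors:

```shell
minor_to_name() {
    # Map an (8, minor) block device number to its sdXN name.
    local minor=$1
    local idx=$(( minor / 16 ))    # disk index: 0 -> a, 1 -> b, ...
    local part=$(( minor % 16 ))   # partition number (0 would mean the whole disk)
    local letter
    letter=$(printf "\\$(printf '%03o' $(( 97 + idx )))")
    printf 'sd%s%s\n' "$letter" "$part"
}

minor_to_name 49    # 8:49 in the old table -> sdd1
minor_to_name 33    # 8:33 in the new table -> sdc1 (same slot, shifted by 16)
minor_to_name 225   # 8:225 -> sdo1, the drive that went in via --replace
```

Running it over both --examine listings confirms that each RaidDevice slot maps to the same physical position, modulo the one-letter shift.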
> > > >> > > > >> This is the list of --create and --assemble commands from the 6th > > > >> which involve the sdx1 partitions, those we care about right now - > > > >> there were others involving /dev/md124 and the /dev/sdx2 which however > > > >> are not relevant - the data there : > > > >> -- > > > >> 9813 mdadm --assemble /dev/md123 missing > > > >> 9814 mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1 > > > >> /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 > > > >> /dev/sdn1 /dev/sdm1 > > > >> 9815 mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1 > > > >> /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1 > > > >> /dev/sdm1 > > > >> 9823 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 > > > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1 > > > >> /dev/sdm1 /dev/sdl1 > > > >> 9824 mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1 > > > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 > > > >> /dev/sdd1 /dev/sdm1 /dev/sdl1 > > > >> ^^^^ note that these were the WRONG ARRAY - this was an unfortunate > > > >> miscommunication which caused potential damage. 
> > > >> 9852 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > > > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1 > > > >> /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1 > > > >> 9863 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > > > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1 > > > >> /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 > > > >> /dev/sdl1 > > > >> 9879 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > > > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 > > > >> /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 > > > >> /dev/sdk1 /dev/sdl1 > > > >> 9889 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > > > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 > > > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 > > > >> /dev/sdm1 /dev/sdl1 > > > >> 9892 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > > > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 > > > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 > > > >> /dev/sdm1 /dev/sdl1 > > > >> 9895 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > > > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 > > > >> /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 > > > >> /dev/sdm1 /dev/sdl1 > > > >> 9901 mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 > > > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 > > > >> /dev/sdm1 /dev/sdl1 > > > >> 9903 mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90 > > > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1 > > > >> /dev/sdj1 /dev/sdg1 / dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1 > > > >> /dev/sdm1 /dev/sdl1 > > > >> --- > > > >> > > > >> Note that they all were -o, 
therefore if I am not mistaken no parity > > > >> data was written anywhere. Note further the fact that the first two > > > >> were the 'mistake' ones, which did NOT have --assume-clean (but with > > > >> -o this shouldn't make a difference AFAIK) and most importantly the > > > >> metadata was the 1.2 default AND they were the wrong array in the > > > >> first place. > > > >> Note also that the 'final' --create commands also had --bitmap=none to > > > >> match the original array, though according to the docs the bitmap > > > >> space in 0.90 (and 1.2?) is in a space which does not affect the data > > > >> in the first place. > > > >> > > > >> Now, first of all a question: if I get the 'old' sdc, the one that was > > > >> taken out prior to this whole mess, onto a different system in order > > > >> to examine it, the modern mdraid auto discovery should NOT overwrite > > > >> the md data, correct? Thus I should be able to double-check the drive > > > >> order on that as well? > > > >> > > > >> Any other pointers, insults etc are of course welcome. > > > ^ permalink raw reply [flat|nested] 24+ messages in thread
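On the question of examining the pulled drive safely: `--examine` itself only reads, but a belt-and-braces approach is to mark the disk read-only at the kernel level and disable md auto-assembly before plugging it in. A dry-run sketch (the device name is hypothetical; the `run` wrapper prints each command instead of executing it, so drop it to run for real):

```shell
run() { printf '+ %s\n' "$*"; }    # print instead of execute

# Kernel-level write protection for the whole disk (name is a placeholder):
run blockdev --setro /dev/sdX
# Prevent mdadm incremental auto-assembly from touching md members:
run sh -c 'echo "AUTO -all" >> /etc/mdadm/mdadm.conf'
# --examine only reads the superblock:
run mdadm --examine /dev/sdX1
```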
* Re: RAID5 failure and consequent ext4 problems 2022-09-10 1:29 ` Luigi Fabio @ 2022-09-10 15:18 ` Phil Turmel 2022-09-10 19:30 ` Luigi Fabio 2022-09-12 19:06 ` Phillip Susi 0 siblings, 2 replies; 24+ messages in thread From: Phil Turmel @ 2022-09-10 15:18 UTC (permalink / raw) To: Luigi Fabio; +Cc: linux-raid Hi Luigi, Mixed in responses (and trimmed): On 9/9/22 18:50, Luigi Fabio wrote: > By different kernels, maybe - but the kernel has been the same for > quite a while (months). Yes. Same kernels are pretty repeatable for device order on bootup as long as all are present. Anything missing will shift the letter assignments. > I did paste the whole of the command lines in the (very long) email, > as David mentions (thanks!) - the first ones, the mistaken ones, did > NOT have --assume-clean but they did have -o, so no parity activity > should have started according to the docs? Okay, that should have saved you. Except, I think it still writes all the meta-data. With v1.2, that would sparsely trash up to 1/4 gig at the beginning of each device. > A new thought came to mind: one of the HBAs lost a channel, right? > What if on the subsequent reboot the devices that were on that channel > got 'rediscovered' and shunted to the end of the letter order? That > would, I believe, be ordinary operating procedure. Well, yes. But it doesn't matter for assembly attempts, which always go by the meta-data. Device order only ever matters for --create when recreating. > That would give us an almost-correct array, which would explain how > fsck can get ... some pieces. If you consistently used -o or --assume-clean, then everything beyond ~3G should be untouched, if you can get the order right. Have fsck try backup superblocks way out. > Also, I am not quite brave enough (...) to use shortcuts when handling > mdadm commands. That's good. But curly braces are safe. > I am reconstructing the port order (scsi targets, if you prefer) from > the 20220904 boot log.
I should at that point be able to have an exact > order of the drives. Please use lsdrv to capture names versus serial numbers. Re-run it before any --create operation to ensure the current names really do match the expected serial numbers. Keep track of ordering information by serial number. Note that lsdrv will reliably line up PHYs on SAS controllers, so that can be trusted, too. > Here it is: [trim /] > We have a SCSI target -> raid disk number correspondence. > As of this boot, the letter -> scsi target correspondences match, > shifted by one because as discussed 7:0:0:0 is no longer there (the > old, 'faulty' sdc). OK. > Thus, having univocally determined the prior scsi target -> raid > position we can transpose it to the present drive letters, which are > shifted by one. > Therefore, we can generate, rectius have generated, a --create with > the same software versions, the same settings and the same drive > order. Is there any reason why, minus the 1.2 metadata overwriting > which should have only affected 12 blocks, the fs should 'not' be as > before? > Genuine question, mind. Superblocks other than 0.9x and 1.0 place a bad block log and a written block bitmap between the superblock and the data area. I'm not sure if any of the remaining space is wiped. These would be written regardless of -o or --assume-clean. Those flags "protect" the *data area* of the array, not the array's own metadata. On 9/9/22 19:04, Luigi Fabio wrote: > A further question, in THIS boot's log I found: > [ 9874.709903] md/raid:md123: raid level 5 active with 12 out of 12 > devices, algorithm 2 > [ 9874.710249] md123: bitmap file is out of date (0 < 1) -- forcing > full recovery > [ 9874.714178] md123: bitmap file is out of date, doing full recovery > [ 9874.881106] md123: detected capacity change from 0 to 42945088192512 > From, I think, the second --create of /dev/123, before I added the > bitmap=none.
This should, however, not have written anything with -o > and --assume-clean, correct? False assumption. As described above. On 9/9/22 21:29, Luigi Fabio wrote: > For completeness' sake, though it should not be relevant, here is the > error that caused the mishap: [trim /] Noted, and helpful for correlating device names to PHYs. Okay. To date, you've only done create with -o or --assume-clean? If so, it is likely your 0.90 superblocks are still present at the ends of the disks. You will need to zero the v1.2 superblocks that have been placed on your partitions. Then attempt an --assemble and see if mdadm will deliver the same message as before, identifying all of the members, but refusing to proceed due to event counts. If so, repeat with --force. This procedure is safe to do without overlays, and will likely yield a running array. Then you will have to fsck to fix up the borked beginning of your filesystem. Phil ^ permalink raw reply [flat|nested] 24+ messages in thread
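Phil's recovery sequence can be sketched as a dry run. Everything below is an assumption to be re-verified first (member names are the ones from this thread and must be checked against serial numbers; the `run` wrapper prints each command instead of executing it, so drop it to run for real). One way to remove only the stray v1.2 superblock is wipefs with an explicit offset, since v1.2 lives 4096 bytes into the partition while 0.90 lives near the end:

```shell
run() { printf '+ %s\n' "$*"; }    # print instead of execute

members=(/dev/sd{c,d,e,f,g,h,i,j,k,l,m,n}1)   # brace expansion keeps this order

# 1. Confirm which superblock version(s) each member currently exposes:
for d in "${members[@]}"; do
    run mdadm --examine "$d"
done

# 2. List signatures without writing, then erase only the one at offset
#    4096 (the v1.2 location) -- dangerous if step 1 showed something else:
for d in "${members[@]}"; do
    run wipefs --no-act "$d"
    run wipefs --offset 4096 "$d"
done

# 3. Try a plain read-only assemble, then escalate to --force if it
#    refuses on event counts:
run mdadm --assemble --readonly /dev/md123 "${members[@]}"
run mdadm --assemble --force --readonly /dev/md123 "${members[@]}"
```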
* Re: RAID5 failure and consequent ext4 problems 2022-09-10 15:18 ` Phil Turmel @ 2022-09-10 19:30 ` Luigi Fabio 2022-09-10 19:55 ` Luigi Fabio 2022-09-12 19:06 ` Phillip Susi 1 sibling, 1 reply; 24+ messages in thread From: Luigi Fabio @ 2022-09-10 19:30 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid Hello Phil, thank you BTW for your continued assistance. Here goes: On Sat, Sep 10, 2022 at 11:18 AM Phil Turmel <philip@turmel.org> wrote: > Yes. Same kernels are pretty repeatable for device order on bootup as > long as all are present. Anything missing will shift the letter > assignments. We need to keep this in mind, though the described boot log scsi target -> letter assignment seems to indicate that we're clear as discussed. This is relevant since I have re--created the array. > Okay, that should have saved you. Except, I think it still writes all > the meta-data. With v1.2, that would sparsely trash up to 1/4 gig at > the beginning of each device. I dug into the docs and the wiki and ran some experiments on another machine. Apparently, what 1.2 does with my kernel and my mdadm is use sectors 9 to 80 of each device. Thus, it borked 72 512-byte sectors -> 36 kB -> 9 ext4 blocks per device, sparsely as you say. This is 'fine' even with a 128kB chunk, the first one doesn't really matter because yes, fsck detects that it nuked the block group descriptors but the superblock before them is fine (indeed, tune2fs and dumpe2fs work 'as expected') and then goes to a backup and is happy, even declaring the fs clean. Therefore out of the 12 'affected' areas, one doesn't matter for practical purposes and we have to wonder about the others. Arguably, one of those should also be managed by parity but I have no idea how that will work out - it may be very important actually at the time of any future resync.
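The arithmetic above can be checked in one place (sector range 9-80 inclusive as observed in the experiments; 4 KiB filesystem blocks; 128 KiB chunk and 12 drives, so 11 data chunks per RAID5 stripe):

```shell
sectors=$(( 80 - 9 + 1 ))            # v1.2 superblock + bad-block log area, per member
bytes=$(( sectors * 512 ))
blocks=$(( bytes / 4096 ))           # 4 KiB ext4 blocks damaged per member
echo "$sectors sectors = $bytes bytes = $blocks fs blocks per member"

chunk_kb=128
data_members=$(( 12 - 1 ))           # RAID5: one chunk per stripe holds parity
stripe_kb=$(( data_members * chunk_kb ))
echo "one full stripe covers the first $stripe_kb kB of the filesystem"
```

This reproduces the figures used below: 72 sectors = 36 kB = 9 blocks per member, and 1408 kB for the first stripe.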
Now, these are all in the first block of each device, which would form the first 1408 kB of the filesystem (128kB chunk, remember the original creation is *old*), since I believe mdraid preserves sequence, therefore the chunks are in order. We know the following from dumpe2fs: --- Group 0: (Blocks 0-32767) csum 0x45ff [ITABLE_ZEROED] Primary superblock at 0, Group descriptors at 1-2096 Block bitmap at 2260 (+2260), csum 0x824f8d47 Inode bitmap at 2261 (+2261), csum 0xdadef5ad Inode table at 2262-2773 (+2262) 0 free blocks, 8179 free inodes, 2 directories, 8179 unused inodes --- So the first 2097 blocks are backed up group descriptors - this is *way* more than the 1408 kB therefore with restored BGDs (e2fsck -b 32768, say) we should be... fine? Now, if OTOH I do an -nf, all sorts of weird stuff happens but I have to wonder whether that's because the BGDs are not happy. I am tempted to run an overlay *for the fsck*, what do you think? > Well, yes. But doesn't matter for assembly attempts, with always go by > the meta-data. Device order only ever matters for --create when recreating. Sure, but keep in mind, my --create commands nuked the original 0.90 metadata as well, so we need to be sure that the order is correct or we'll have a real jumble. Now, the cables have not been moved and the boot logs confirm that the scsi targets correspond, so we should have the order correct and the parameters are correct from the previous logs. Therefore, we 'should' have the same data space. > If you consistently used -o or --assume-clean, then everything beyond > ~3G should be untouched, if you can get the order right. Have fsck try > backup superblocks way out. fsck grabs a backup 'magically' and seems to be happy, unless I -nf it then ... all sorts of bad stuff happens. > Please use lsdrv to capture names versus serial numbers. Re-run it > before any --create operation to ensure the current names really do > match the expected serial numbers.
Keep track of ordering information > by serial number. Note that lsdrv will reliably line up PHYs on SAS > controllers, so that can be trusted, too. Thing is... I can't find lsdrv. As in: there is no lsdrv binary, apparently, in Debian stable or in Debian testing. Where do I look for it? > Superblocks other than 0.9x and 1.0 place a bad block log and a written > block bitmap between the superblock and the data area. I'm not sure if > any of the remain space is wiped. These would be written regardless of > -o or --assume-clean. Those flags "protect" the *data area* of the > array, not the array's own metadata. Yes - this is the damage I'm talking about above. From the logs, the 'area' is 4096 sectors of which 4016 remain 'unused'. Therefore 80 sectors, with the first 8 not being touched (and the proof is that the superblock is 'happy', though interestingly this should not be the case because the gr0 superblock is offset by 1024 bytes -> the last 1024 bytes of the superblock should be borked too. From this, my math above. > > From, I think, the second --create of /dev/123, before I added the > > bitmap=none. This should, however, not have written anything with -o > > and --assume-clean, correct? > False assumption. As described above. Two different things: what I meant was that even with that bitmap message, the only thing that would have been written is the metadata. linux raid documentation states repeatedly that with -o no resyncing or parity reconstruction would be performed. Yes, agreed, the 1.2 metadata got written, but it's the only thing that got written from when the array was stopped by the error, if I am reading the docs correctly? > Okay. To date, you've only done create with -o or --assume-clean? > > If so, it is likely your 0.90 superblocks are still present at the ends > of the disks. Problem is, if you look at my previous email, as I mentioned above I have ALSO done --create with --metadata=0.90, which overwrote the original blocks. 
HOWEVER, I do have the logs of the original parameters and I have at least one drive - the old sdc - which was spit out before this whole thing, which becomes relevant to confirm that the parameter log is correct (multiple things seem to coincide, so I think we're OK there). Given all the above, however, if we get the parameters to match we should get a filesystem that corresponds to before the event after the first 1408kB - and those don't matter insofar as we have redundant backups in ext4 for at least the first 2060 blocks >> 1408 kB. The thing that I do NOT understand is that if this is the case, e2fsck with -b <high> should render a FS without any errors... therefore why am I getting inode metadata checksum errors? This is why I had originally posted in linux-ext4 ... Thanks, L ^ permalink raw reply [flat|nested] 24+ messages in thread
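For the backup-superblock checks discussed above: with the sparse_super feature, backups sit in group 1 and in groups that are powers of 3, 5 and 7, and with 4 KiB blocks each group is 32768 blocks, so the candidate locations can be computed rather than guessed (or listed with `mke2fs -n`, which only simulates a mkfs). A sketch, assuming 4 KiB blocks as in this filesystem:

```shell
# Sparse_super superblock backups live at the start of groups
# 1, 3^n, 5^n, 7^n; each group is 32768 blocks at 4 KiB block size.
bpg=32768
for g in 1 3 5 7 9 25 27 49 81; do
    echo $(( g * bpg ))
done
# A distant backup can then be fed to a read-only check, e.g.:
#   e2fsck -n -b 819200 -B 4096 /dev/md123
```

The first value, 32768, is the canonical first backup; 819200 (group 25) is already well past the region any v1.2 superblock could have touched.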
* Re: RAID5 failure and consequent ext4 problems 2022-09-10 19:30 ` Luigi Fabio @ 2022-09-10 19:55 ` Luigi Fabio 2022-09-10 20:12 ` Luigi Fabio ` (2 more replies) 0 siblings, 3 replies; 24+ messages in thread From: Luigi Fabio @ 2022-09-10 19:55 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid Well, I found SOMETHING of decided interest: when I run dumpe2fs with any backup superblock, this happens: --- Filesystem created: Tue Nov 4 08:56:08 2008 Last mount time: Thu Aug 18 21:04:22 2022 Last write time: Thu Aug 18 21:04:22 2022 --- So the backups have not been updated since boot-before-last? That would explain why, when fsck tries to use those backups, it comes up with funny results. Is this ...as intended, I wonder? Does it also imply that any file that was written to > aug 18th will be in an indeterminate state? That would seem to be the implication. On Sat, Sep 10, 2022 at 3:30 PM Luigi Fabio <luigi.fabio@gmail.com> wrote: > > Hello Phil, > thank you BTW for your continued assistance. Here goes: > > On Sat, Sep 10, 2022 at 11:18 AM Phil Turmel <philip@turmel.org> wrote: > > Yes. Same kernels are pretty repeatable for device order on bootup as > > long as all are present. Anything missing will shift the letter > > assignments. > We need to keep this in mind, though the described boot log scsi > target -> letter assignment seem to indicate that we're clear as > discussed. This is relevant since I have re--created the array. > > > Okay, that should have saved you. Except, I think it still writes all > > the meta-data. With v1.2, that would sparsely trash up to 1/4 gig at > > tbe beginning of each device. > I dug into the docs and the wiki and ran some experiments on another > machine. Apparently, what 1.2 does with my kernel and my mdadm is use > sectors 9 to 80 of each device. Thus, it borked 72 512-byte sectors -> > 36 kB -> 9 ext3 blocks per device, sparsely as you say. 
> This is 'fine' even with a 128kB chunk, the first one doesn't really > matter because yes, fsck detects that it nuked the block group > descriptors but the superblock before them is fine (indeed, tune2fs > and dumpe2fs work 'as expected') and then goes to a backup and is > happy, even declaring the fs clean. > Therefore out of the 12 'affected' areas, one doesn't matter for > practical purposes and we have to wonder about the others. Arguably, > one of those should also be managed by parity but I have no idea how > that will work out - it may be very important actually at the time of > any future resync. > Now, these are all in the first block of each device, which would form > the first 1408 kB of the filesystem (128kB chunk, remember the > original creation is *old*), since I believe mdraid preserves > sequence, therefore the chunks are in order. > We know the following from dumpe2fs: > --- > Group 0: (Blocks 0-32767) csum 0x45ff [ITABLE_ZEROED] > Primary superblock at 0, Group descriptors at 1-2096 > Block bitmap at 2260 (+2260), csum 0x824f8d47 > Inode bitmap at 2261 (+2261), csum 0xdadef5ad > Inode table at 2262-2773 (+2262) > 0 free blocks, 8179 free inodes, 2 directories, 8179 unused inodes > --- > So the first 2097 blocks are backed up group descriptors - this is > *way* more than the 1408 kB therefore with restored BGDs (fsck -s > 32768, say) we should be... fine? > > Now, if OTOH I do an -nf, all sorts of weird stuff happens but I have > to wonder whether that's because the BGDs are not happy. I am tempted > to run an overlay *for the fsck*, what do you think? > > > Well, yes. But doesn't matter for assembly attempts, with always go by > > the meta-data. Device order only ever matters for --create when recreating. 
> Sure, but keep in mind, my --create commands nuked the original 0.90 > metadata as well, so we need to be sure that the order is correct or > we'll have a real jumble, > Now, the cables have not been moved and the boot logs confirm that the > scsi targets correspond, so we should have the order correct and the > parameters are correct from the previous logs. Therefore, we 'should' > have the same dataspa > > > If you consistently used -o or --assume-clean, then everything beyond > > ~3G should be untouched, if you can get the order right. Have fsck try > > backup superblocks way out. > fsck grabs a backup 'magically' and seems to be happy, unless I -nf it > then ... all sorts of bad stuff happens. > > > Please use lsdrv to capture names versus serial numbers. Re-run it > > before any --create operation to ensure the current names really do > > match the expected serial numbers. Keep track of ordering information > > by serial number. Note that lsdrv will reliably line up PHYs on SAS > > controllers, so that can be trusted, too. > Thing is... I can't find lsdrv. As in: there is no lsdrv binary, > apparently, in Debian stable or in Debian testing. Where do I look for > it? > > > Superblocks other than 0.9x and 1.0 place a bad block log and a written > > block bitmap between the superblock and the data area. I'm not sure if > > any of the remain space is wiped. These would be written regardless of > > -o or --assume-clean. Those flags "protect" the *data area* of the > > array, not the array's own metadata. > Yes - this is the damage I'm talking about above. From the logs, the > 'area' is 4096 sectors of which 4016 remain 'unused'. Therefore 80 > sectors, with the first 8 not being touched (and the proof is that the > superblock is 'happy', though interestingly this should not be the > case because the gr0 superblock is offset by 1024 bytes -> the last > 1024 bytes of the superblock should be borked too. > From this, my math above. 
> >
> > > From, I think, the second --create of /dev/md123, before I added the
> > > bitmap=none. This should, however, not have written anything with -o
> > > and --assume-clean, correct?
> > False assumption. As described above.
> Two different things: what I meant was that even with that bitmap
> message, the only thing that would have been written is the metadata.
> The Linux raid documentation states repeatedly that with -o no
> resyncing or parity reconstruction is performed. Yes, agreed, the 1.2
> metadata got written, but it's the only thing that got written from
> when the array was stopped by the error, if I am reading the docs
> correctly?
>
> > Okay. To date, you've only done create with -o or --assume-clean?
> >
> > If so, it is likely your 0.90 superblocks are still present at the ends
> > of the disks.
> Problem is, if you look at my previous email, as I mentioned above, I
> have ALSO done --create with --metadata=0.90, which overwrote the
> original blocks.
> HOWEVER, I do have the logs of the original parameters, and I have at
> least one drive - the old sdc - which was spit out before this whole
> thing, which becomes relevant to confirm that the parameter log is
> correct (multiple things seem to coincide, so I think we're OK there).
>
> Given all the above, however, if we get the parameters to match, we
> should get a filesystem that corresponds to before the event after the
> first 1408 kB - and those don't matter insofar as we have redundant
> backups in ext4 for at least the first 2060 blocks >> 1408 kB.
>
> The thing that I do NOT understand is that if this is the case, fsck
> with -b <high> should render a FS without any errors... therefore why
> am I getting inode metadata checksum errors? This is why I had
> originally posted on linux-ext4...
>
> Thanks,
> L

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 19:55 ` Luigi Fabio
@ 2022-09-10 20:12   ` Luigi Fabio
  2022-09-10 20:15     ` Phil Turmel
  2022-09-10 20:14   ` Phil Turmel
  2022-09-12 19:09   ` Phillip Susi
  2 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-10 20:12 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Following up, I found:

>> The backup ext4 superblocks are never updated by the kernel, only after
>> a successful e2fsck, tune2fs, resize2fs, or other userspace operation.
>>
>> This avoids clobbering the backups with bad data if the kernel has a bug
>> or device error (e.g. bad cable, HBA, etc).

So, therefore, if we restore a backup superblock (and its attendant
data), what happens to any FS structure that was written to *after*
that time? That is to say, in this case, after Aug 18th?
Is the system 'smart enough' to do something, or will I have a big fat
mess? I mean, reversion to 08/18 would be great, but I can't imagine
that the FS can do that; it would have to have copies of every inode.

This does explain how I get so many errors when fsck grabs the backup
superblock... the RAID part we solved just fine, it's the rest we have
to deal with.

Ideas are welcome.

On Sat, Sep 10, 2022 at 3:55 PM Luigi Fabio <luigi.fabio@gmail.com> wrote:
>
> Well, I found SOMETHING of decided interest: when I run dumpe2fs with
> any backup superblock, this happens:
>
> ---
> Filesystem created:  Tue Nov  4 08:56:08 2008
> Last mount time:     Thu Aug 18 21:04:22 2022
> Last write time:     Thu Aug 18 21:04:22 2022
> ---
>
> So the backups have not been updated since boot-before-last? That
> would explain why, when fsck tries to use those backups, it comes up
> with funny results.
>
> Is this... as intended, I wonder? Does it also imply that any file
> that was written to > Aug 18th will be in an indeterminate state? That
> would seem to be the implication.
>
> On Sat, Sep 10, 2022 at 3:30 PM Luigi Fabio <luigi.fabio@gmail.com> wrote:
> >
> > Hello Phil,
> > thank you BTW for your continued assistance. Here goes:
> >
> > On Sat, Sep 10, 2022 at 11:18 AM Phil Turmel <philip@turmel.org> wrote:
> > > Yes. Same kernels are pretty repeatable for device order on bootup as
> > > long as all are present. Anything missing will shift the letter
> > > assignments.
> > We need to keep this in mind, though the described boot log scsi
> > target -> letter assignments seem to indicate that we're clear, as
> > discussed. This is relevant since I have re-created the array.
> >
> > > Okay, that should have saved you. Except, I think it still writes all
> > > the meta-data. With v1.2, that would sparsely trash up to 1/4 gig at
> > > the beginning of each device.
> > I dug into the docs and the wiki and ran some experiments on another
> > machine. Apparently, what 1.2 does with my kernel and my mdadm is use
> > sectors 9 to 80 of each device. Thus, it borked 72 512-byte sectors ->
> > 36 kB -> 9 ext4 blocks per device, sparsely as you say.
> > This is 'fine' even with a 128kB chunk. The first one doesn't really
> > matter because, yes, fsck detects that it nuked the block group
> > descriptors, but the superblock before them is fine (indeed, tune2fs
> > and dumpe2fs work 'as expected'); it then goes to a backup and is
> > happy, even declaring the fs clean.
> > Therefore, out of the 12 'affected' areas, one doesn't matter for
> > practical purposes and we have to wonder about the others. Arguably,
> > one of those should also be covered by parity, but I have no idea how
> > that will work out - it may actually be very important at the time of
> > any future resync.
> > Now, these are all in the first block of each device, which would form
> > the first 1408 kB of the filesystem (128kB chunk, remember the
> > original creation is *old*), since I believe mdraid preserves
> > sequence, therefore the chunks are in order.
> > [...]
> >
> > Thanks,
> > L
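[Editorial note: the sector and chunk arithmetic running through this exchange - 72 damaged sectors per component, 9 filesystem blocks per device, and an 11-data-chunk first stripe of 1408 kB - can be checked in a few lines. The constants are the ones stated in the thread (512-byte sectors, 4 kB ext4 blocks, 128 kB chunks, 12-drive RAID5); this is back-of-envelope verification, not anything read from the actual array:]

```python
# Back-of-envelope check of the damage figures discussed above.

SECTOR = 512
CHUNK = 128 * 1024          # 128 kB md chunk
EXT4_BLOCK = 4096

damaged_sectors = 72        # the author's observed sectors 9..80 per component
damaged_bytes = damaged_sectors * SECTOR
damaged_fs_blocks = damaged_bytes // EXT4_BLOCK

data_disks = 12 - 1         # RAID5: one chunk per stripe holds parity
first_stripe_bytes = data_disks * CHUNK

print(damaged_bytes)                # 36864 -> 36 kB per device
print(damaged_fs_blocks)            # 9 ext4 blocks per device
print(first_stripe_bytes // 1024)   # 1408 kB of filesystem in the first stripe
```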
* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 20:12   ` Luigi Fabio
@ 2022-09-10 20:15     ` Phil Turmel
  0 siblings, 0 replies; 24+ messages in thread
From: Phil Turmel @ 2022-09-10 20:15 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: linux-raid

Do the fsck with an overlay in place. I suspect the data in the inodes
will provide corroboration for newer data in the various structures.

I think your odds are good, now.

On 9/10/22 16:12, Luigi Fabio wrote:
> Following up, I found:
> [...]
* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 19:55 ` Luigi Fabio
  2022-09-10 20:12   ` Luigi Fabio
@ 2022-09-10 20:14   ` Phil Turmel
  2022-09-10 20:17     ` Phil Turmel
  2022-09-12 19:09   ` Phillip Susi
  2 siblings, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2022-09-10 20:14 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: linux-raid

Hi Luigi,

On 9/10/22 15:55, Luigi Fabio wrote:
> Well, I found SOMETHING of decided interest: when I run dumpe2fs with
> any backup superblock, this happens:
>
> ---
> Filesystem created:  Tue Nov  4 08:56:08 2008
> Last mount time:     Thu Aug 18 21:04:22 2022
> Last write time:     Thu Aug 18 21:04:22 2022
> ---
>
> So the backups have not been updated since boot-before-last? That
> would explain why, when fsck tries to use those backups, it comes up
> with funny results.

Interesting.

> Is this... as intended, I wonder? Does it also imply that any file
> that was written to > Aug 18th will be in an indeterminate state? That
> would seem to be the implication.

Hmm. I wouldn't have thought so, but maybe the backup blocks don't get
updated as often?

{ I think you are about as far as I would have gotten myself, if I
allowed myself to get there. }

Phil
* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 20:14   ` Phil Turmel
@ 2022-09-10 20:17     ` Phil Turmel
  2022-09-10 20:24       ` Luigi Fabio
  0 siblings, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2022-09-10 20:17 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: linux-raid

Oh, one more thing:

If you had followed any of the advice on the linux-raid wiki, you'd have
been pointed to my lsdrv project on github:

https://github.com/pturmel/lsdrv

(Still just python2, sorry.)

On 9/10/22 16:14, Phil Turmel wrote:
> Hi Luigi,
> [...]
* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 20:17     ` Phil Turmel
@ 2022-09-10 20:24       ` Luigi Fabio
  2022-09-10 20:54         ` Luigi Fabio
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-10 20:24 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Phil,
I did indeed go there, but, stupidly, after the fact, and I had missed
the reference to your tool. Not an excuse, but the initial part of the
process was, as I mentioned, complicated and done while driving....

I'll download lsdrv and snapshot the situation in any case, generate
the overlay, run the fsck and see what happens.

I'll report back when it's done, which is probably going to be
tomorrow (fsck times for this filesystem historically have been in the
9+ hr range - and the overlay will probably do us no favours
performance-wise).

Back as soon as I have further data. Thank you again for the help.

L

On Sat, Sep 10, 2022 at 4:17 PM Phil Turmel <philip@turmel.org> wrote:
> [...]
* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 20:24       ` Luigi Fabio
@ 2022-09-10 20:54         ` Luigi Fabio
  0 siblings, 0 replies; 24+ messages in thread
From: Luigi Fabio @ 2022-09-10 20:54 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

I will be very, very brief: it works.

I put on the overlay, did the first fsck, which said it would try the
backup blocks and then complained about not being able to set
superblock flags, and stopped. At that point, since it said that the FS
was modified, I assumed that it had overwritten the block group
descriptors that were damaged.

I tried mounting the filesystem -o ro without touching it further
and... it works. The files are there, including the newest ones, and
directory connectivity is correct as far as several tests can tell....

Of course, I will treat it as a 'damaged' fs, get the files off of
there onto a new array, then try the further fsck and see what happens
for curiosity's sake, but as far as I am concerned this array is no
longer going to be live. Which is just fine: I got, I believe, what I
wanted.

Thank you very much for all your help - I plan to provide a final
update once the copy is done etc. Let me know where to send scotch.
Much deserved.

L

On Sat, Sep 10, 2022 at 4:24 PM Luigi Fabio <luigi.fabio@gmail.com> wrote:
> [...]
* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 19:55 ` Luigi Fabio
  2022-09-10 20:12   ` Luigi Fabio
  2022-09-10 20:14   ` Phil Turmel
@ 2022-09-12 19:09   ` Phillip Susi
  2022-09-13  3:58     ` Luigi Fabio
  2 siblings, 1 reply; 24+ messages in thread
From: Phillip Susi @ 2022-09-12 19:09 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: Phil Turmel, linux-raid

Luigi Fabio <luigi.fabio@gmail.com> writes:

> Well, I found SOMETHING of decided interest: when I run dumpe2fs with
> any backup superblock, this happens:
> [...]
> So the backups have not been updated since boot-before-last? That
> would explain why, when fsck tries to use those backups, it comes up
> with funny results.

That's funny. IIRC, the backups virtually never get updated. The only
thing e2fsck needs to get from them is the location of the inode tables
and block groups, and that does not change during the life of the
filesystem.

I might have something tickling the back of my memory that when e2fsck
is run, it updates the first backup superblock, but the others never get
updated.
* Re: RAID5 failure and consequent ext4 problems
  2022-09-12 19:09   ` Phillip Susi
@ 2022-09-13  3:58     ` Luigi Fabio
  2022-09-13 12:47       ` Phillip Susi
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-13  3:58 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Phil Turmel, linux-raid

On Mon, Sep 12, 2022 at 3:12 PM Phillip Susi <phill@thesusis.net> wrote:
> That's funny. IIRC, the backups virtually never get updated. The only
> thing e2fsck needs to get from them is the location of the inode tables
> and block groups, and that does not change during the life of the
> filesystem.
>
> I might have something tickling the back of my memory that when e2fsck
> is run, it updates the first backup superblock, but the others never get
> updated.

The way I have found it explained in multiple places is that the
backups only get updated as a consequence of an actual userspace
interaction. So you have to run fsck, or at least change settings with
tune2fs, for instance, or resize2fs... then all the backups get
updated.

The jury is still out on whether automated fscks - for those lunatics
who haven't disabled them - update or not. There is conflicting
information.

LF
* Re: RAID5 failure and consequent ext4 problems
  2022-09-13  3:58     ` Luigi Fabio
@ 2022-09-13 12:47       ` Phillip Susi
  0 siblings, 0 replies; 24+ messages in thread
From: Phillip Susi @ 2022-09-13 12:47 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: Phil Turmel, linux-raid

Luigi Fabio <luigi.fabio@gmail.com> writes:

> The way I have found it explained in multiple places is that the
> backups only get updated as a consequence of an actual userspace
> interaction. So you have to run fsck, or at least change settings with
> tune2fs, for instance, or resize2fs... then all the backups get
> updated.

Exactly. Changing the filesystem with tune2fs or resize2fs requires
that all of the backups be updated.

> The jury is still out on whether automated fscks - for those lunatics
> who haven't disabled them - update or not. There is conflicting
> information.

IIRC, a preen (the automatic fsck at boot) normally just sees that the
dirty flag is not set (since the filesystem was cleanly unmounted,
right?), and doesn't do anything else. If there was an unclean shutdown
though, and a real fsck is run, then it updates the first backup.
* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 15:18 ` Phil Turmel
  2022-09-10 19:30   ` Luigi Fabio
@ 2022-09-12 19:06   ` Phillip Susi
  2022-09-13  4:02     ` Luigi Fabio
  1 sibling, 1 reply; 24+ messages in thread
From: Phillip Susi @ 2022-09-12 19:06 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Luigi Fabio, linux-raid

Phil Turmel <philip@turmel.org> writes:

> Yes. Same kernels are pretty repeatable for device order on bootup as
> long as all are present. Anything missing will shift the letter
> assignments.

Every time I think about this I find myself amazed that it does seem to
be so stable, and wonder how that can be. The drives are all enumerated
in parallel these days, so the order they get assigned in should be a
total crap shoot, shouldn't it?
* Re: RAID5 failure and consequent ext4 problems
  2022-09-12 19:06   ` Phillip Susi
@ 2022-09-13  4:02     ` Luigi Fabio
  2022-09-13 12:51       ` Phillip Susi
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-13  4:02 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Phil Turmel, linux-raid

On Mon, Sep 12, 2022 at 3:09 PM Phillip Susi <phill@thesusis.net> wrote:
> Every time I think about this I find myself amazed that it does seem to
> be so stable, and wonder how that can be. The drives are all enumerated
> in parallel these days, so the order they get assigned in should be a
> total crap shoot, shouldn't it?

Well, there are several possible explanations, but persistence is
desirable - so evidently enumeration occurs according to controller
order in a repeatable way until something changes in the configuration
- or until you change kernel, or someone does something funny with a
driver, and the order changes. In 28 years of using Linux, however,
this has happened.. rarely, save for before things were sensible WAY
back when.

LF
* Re: RAID5 failure and consequent ext4 problems
  2022-09-13  4:02     ` Luigi Fabio
@ 2022-09-13 12:51       ` Phillip Susi
  0 siblings, 0 replies; 24+ messages in thread
From: Phillip Susi @ 2022-09-13 12:51 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: Phil Turmel, linux-raid

Luigi Fabio <luigi.fabio@gmail.com> writes:

> Well, there are several possible explanations, but persistence is
> desirable - so evidently enumeration occurs according to controller
> order in a repeatable way until something changes in the configuration
> - or until you change kernel, or someone does something funny with a
> driver, and the order changes. In 28 years of using Linux, however,
> this has happened.. rarely, save for before things were sensible WAY
> back when.

I *think* it is only because the probes are all *started* in the
natural order, so as long as the drives all respond in the same, short
amount of time, you get no surprises. If one drive decides to take a
little longer to answer today though, it can throw things off.
end of thread, other threads:[~2022-09-13 12:57 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-08 14:51 RAID5 failure and consequent ext4 problems Luigi Fabio
2022-09-08 17:23 ` Phil Turmel
2022-09-09 20:32   ` Luigi Fabio
2022-09-09 21:01     ` Luigi Fabio
2022-09-09 21:48       ` Phil Turmel
2022-09-09 22:11         ` David T-G
2022-09-09 22:50           ` Luigi Fabio
2022-09-09 23:04             ` Luigi Fabio
2022-09-10  1:29       ` Luigi Fabio
2022-09-10 15:18         ` Phil Turmel
2022-09-10 19:30           ` Luigi Fabio
2022-09-10 19:55             ` Luigi Fabio
2022-09-10 20:12               ` Luigi Fabio
2022-09-10 20:15                 ` Phil Turmel
2022-09-10 20:14               ` Phil Turmel
2022-09-10 20:17                 ` Phil Turmel
2022-09-10 20:24                   ` Luigi Fabio
2022-09-10 20:54                     ` Luigi Fabio
2022-09-12 19:09               ` Phillip Susi
2022-09-13  3:58                 ` Luigi Fabio
2022-09-13 12:47                   ` Phillip Susi
2022-09-12 19:06           ` Phillip Susi
2022-09-13  4:02             ` Luigi Fabio
2022-09-13 12:51               ` Phillip Susi