* RAID5 failure and consequent ext4 problems
@ 2022-09-08 14:51 Luigi Fabio
  2022-09-08 17:23 ` Phil Turmel
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-08 14:51 UTC (permalink / raw)
  To: linux-raid

[also on linux-ext4]

I am encountering an unusual problem after an mdraid failure. I'll
summarise briefly and can provide further details as required.

First of all, the context. This is happening on a Debian 11 system,
amd64 arch, with current updates (kernel 5.10.136-1, util-linux
2.36.1).

The system has a 12-drive mdraid RAID5 for data, recently migrated to
LSI 2308 HBAs.
This is relevant because earlier this week, at around 13:00 local
(EST), four drives - an entire HBA channel - dropped from the RAID.

Of course, mdraid didn't like that and stopped the arrays. I reverted
to best practice and shut down the system first of all.

Further context: the filesystem in the array is ancient - I am vaguely
proud of that - dating from 2001.
It started as ext2, grew to ext3, then to ext4 and finally to ext4
with the 64bit feature.
Because I am paranoid, I always mount ext4 with nodelalloc and data=journal.
The journal is external, on a RAID1 of SSDs.
I recently (within the last ~3 months) enabled metadata_csum, which is
relevant to the following - the filesystem had never had metadata_csum
enabled before.
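
(For reference, that configuration boils down to roughly the following
- the mount options as an fstab-style line, and the metadata_csum
switch as the usual tune2fs incantation; device name and mountpoint
here are illustrative, not the real ones:)
---
# fstab-style mount options in use on the data filesystem
/dev/md123  /archive  ext4  nodelalloc,data=journal  0  2

# enabling metadata checksums on the existing (unmounted) filesystem
tune2fs -O metadata_csum /dev/md123
---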

Upon reboot, the arrays would not reassemble - this is expected,
because 4/12 drives were marked faulty. So I re-created the array
using the same parameters as were used back when the array was built.
Unfortunately, I had a moment of stupid and didn't specify metadata
0.90 in the re-create, so it was recreated with metadata 1.2... which
writes its superblock at the beginning of the components, not at the
end. I noticed it, stopped the array again and recreated with the
correct 0.90, but the damage was done: the 256 byte + 12 * 20 header
was written at the beginning of each of the 12 components.
Still, unless I am mistaken, this just means that at worst 12 blocks
(the second block of each component) were damaged, which shouldn't be
too bad. The only further possibility is that mdraid also zeroed out
the 'blank space' that it puts AFTER the header block and BEFORE the
data, but according to the documentation it shouldn't do that.
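
(For what it's worth, a hedged way to see exactly what that stray 1.2
create touched - assuming, as the md docs describe, that the v1.2
superblock sits 4 KiB from the start of each component:)
---
# read-only peek at the first 16 KiB of one member; the 1.2 superblock
# would normally sit in the 4-8 KiB range of the component
dd if=/dev/sdc1 bs=4096 count=4 status=none | hexdump -C | less
---
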
In any case, I subsequently reassembled the array 'correctly' to match
the previous order and settings and I believe I got it right. I kept
the array RO and tried fsck -n, which gave me this:

ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
fsck.ext4: Group descriptors look bad... trying backup blocks...

It then warns that it won't attempt journal recovery because it's in
RO mode and declares the fs clean - with a reasonable looking number
of files and blocks.
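
(In case it helps anyone reading later, the explicit way to point
e2fsck at a backup superblock is the following; 32768 is the usual
location of the first backup on a 4 KiB-block filesystem:)
---
# check against the first backup superblock, still read-only
e2fsck -n -b 32768 -B 4096 /dev/md123
---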

If I try to mount -t ext4 -o ro, I get:

mount: /mnt: mount(2) system call failed: Structure needs cleaning.

so before anything else, I tried fsck -nf to make sure that the REST
of the filesystem is in one logical piece.
THAT painted a very different picture.
On pass 1, I get approximately 980k (almost 10^6) instances of

Inode nnnnn passes checks, but checksum does not match inode

and ~2000 of

Inode nnnnn contains garbage

plus some 'tree not optimised' messages, which are technically not
errors, from what I understand.
After ~11 hours, it switches to pass 1B and tells me that inode 12 has
a long list of duplicate blocks:

Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 12: 2928004133 [....]

And ends after the list of multiply claimed blocks with:

e2fsck: aborted
Error while scanning inodes (8193): Inode checksum does not match inode

/dev/md123: ********** WARNING: Filesystem still has errors **********


/dev/md123: ********** WARNING: Filesystem still has errors **********

So, what is my next step? I realise I should NOT have touched the
original drives, and should instead have dd-ed images to a separate
array to work on, but I believe the only writes that occurred were the
mdraid superblocks. I am, in any case, grabbing more drives to image
the 'faulty' array and work on the images, leaving the original data
alone.
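
(For the imaging step, GNU ddrescue is probably the better tool since
it keeps a map of any unreadable sectors; destination paths below are
just placeholders:)
---
ddrescue /dev/sdc1 /mnt/scratch/sdc1.img /mnt/scratch/sdc1.map
---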

Where do I go from here? I have had similar issues in the past, all
the way back to the early 00s, and I had a near-100% success rate by
re-creating the arrays. What is different this time?
Or is nothing different, and is the problem just in the checksumming?

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-08 14:51 RAID5 failure and consequent ext4 problems Luigi Fabio
@ 2022-09-08 17:23 ` Phil Turmel
  2022-09-09 20:32   ` Luigi Fabio
  0 siblings, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2022-09-08 17:23 UTC (permalink / raw)
  To: Luigi Fabio, linux-raid

Hi Luigi,

On 9/8/22 10:51, Luigi Fabio wrote:
> Upon reboot, the arrays would not reassemble - this is expected,
> because 4/12 drives were marked faulty. So I re-created the array
> using the same parameters as were used back when the array was built.

Oh, No!

> Unfortunately, I had a moment of stupid and didn't specify metadata
> 0.90 in the re-create, so it was recreated with metadata 1.2... which
> writes its superblock at the beginning of the components, not at the
> end. I noticed it, stopped the array again and recreated with the
> correct 0.90, but the damage was done: the 256 byte + 12 * 20 header
> was written at the beginning of each of the 12 components.

No, the moment of stupid was that you re-created the array. 
Simultaneous multi-drive failures that stop an array are easily fixed 
with --assemble --force.  Too late for that now.
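
For the archives, the shape of that recovery is roughly the following
(member list illustrative; mdadm works out slot order from the
superblocks, so it doesn't matter for --assemble):
---
mdadm --stop /dev/md123
mdadm --assemble --force /dev/md123 /dev/sd[c-n]1
---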

It is absurdly easy to screw up device order when re-creating, and if 
you didn't specify every allocation and layout detail, the changes in 
defaults over the years would also screw up your data.  And finally, 
omitting --assume-clean would cause all of your parity to be 
recalculated immediately, with catastrophic results if any order or 
allocation attributes are wrong.

):

> Where do I go from here? I have had similar issues in the past, all
> the way back to the early 00s, and I had a near-100% success rate by
> re-creating the arrays. What is different this time?
> Or is nothing different, and is the problem just in the checksumming?

No, you just got lucky in the past.  Probably by using mdadm versions 
that hadn't been updated.

You'll need to show us every command you tried from your history, and 
full details of all drives/partitions involved.

But I'll be brutally honest:  your data is likely toast.

Phil

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-08 17:23 ` Phil Turmel
@ 2022-09-09 20:32   ` Luigi Fabio
  2022-09-09 21:01     ` Luigi Fabio
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-09 20:32 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Thanks for reaching out, first of all. Apologies for the late reply;
the brilliant (...) spam filter strikes again...

On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@turmel.org> wrote:
> No, the moment of stupid was that you re-created the array.
> Simultaneous multi-drive failures that stop an array are easily fixed
> with --assemble --force.  Too late for that now.
Noted for the future, thanks.

> It is absurdly easy to screw up device order when re-creating, and if
> you didn't specify every allocation and layout detail, the changes in
> defaults over the years would also screw up your data.  And finally,
> omitting --assume-clean would cause all of your parity to be
> recalculated immediately, with catastrophic results if any order or
> allocation attributes are wrong.
Of course. Which is why I specified everything, and why I checked the
details with --examine and --detail - and they match exactly, minus
the metadata version, because, well, I wasn't actually the one typing
(it's a slightly complicated story... I was reassembling by proxy over
the phone) and I made an incorrect assumption about the person typing.
There aren't, in the end, THAT many things to specify: RAID level,
number of drives, order thereof, chunk size, 'layout' and metadata
version. 0.90 doesn't allow before/after gaps, so that should be it, I
believe.
Am I missing anything?
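
(Spelled out as a single command, that parameter set is roughly the
following - the member order shown is just illustrative, and getting
it right is of course the whole problem:)
---
mdadm --create /dev/md123 -o --assume-clean \
      --metadata=0.90 --level=5 --raid-devices=12 \
      --chunk=128 --layout=left-symmetric --bitmap=none \
      /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sde1 /dev/sdg1 \
      /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
---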

> No, you just got lucky in the past.  Probably by using mdadm versions
> that hadn't been updated.
That's not quite it: I keep records of how arrays are built and match
them, though it is true that I tend to update things as little as
possible on production machines.
One of the differences, this time, is that this was NOT a production
machine. The other was that I was driving, dictating on the phone and
was under a lot of pressure to get the thing back up ASAP.
Nonetheless, I have an --examine of at least two drives from the
previous setup so there should be enough information there to rebuild
a matching array, I think?

> You'll need to show us every command you tried from your history, and
> full details of all drives/partitions involved.
>
> But I'll be brutally honest:  your data is likely toast.
Well, let's hope it isn't. All mdadm commands were -o and
--assume-clean, so in theory the only things which HAVE been written
are the md blocks, unless I am mistaken and/or I read the docs
incorrectly?

That does, of course, leave the problem of the blocks overwritten by
the 1.2 metadata, but as I read the docs that should be a very small
number - let's say one 4096-byte block (a portion thereof, to be
pedantic, but ext4 doesn't really care?) per drive, correct?

Background:
Separate 2x SSD RAID1 root (/dev/sda, /dev/sdb) on the MB (Supermicro
X10 series)'s chipset SATA ports.
All filesystems are ext4, data=journal, nodelalloc; the 'data' RAIDs
have journals on another SSD RAID1 (one per FS, obviously).
Data drives:
12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers,
each with two four-drive ports (and one of these went DELIGHTFULLY
missing).

This is the layout of each drive:
---
GPT fdisk (gdisk) version 1.0.6
...
Found valid GPT with protective MBR; using GPT.
Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
Model: ST4000NC001-1FS1
Sector size (logical/physical): 512/4096 bytes
...
Total free space is 99949 sectors (48.8 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      7625195519   3.5 TiB     8300  Linux RAID volume
   2      7625195520      7813939199   90.0 GiB    8300  Linux RAID backup
---

So there were two RAID arrays, both RAID5: a main RAID called
'archive' which had the 12 x ~3.5 TiB sdX1 partitions, and a second
array called 'backup' which had the 12 x 90 GiB sdX2 partitions.

A little further backstory: right before the event, one drive had been
pulled because it had started failing. What I did was shut down the
machine, put the failing drive on a MB port and put a new drive on the
LSI controllers. I then brought the machine back online, did the
--replace --with thing and this worked fine.
At that point the faulty drive (/dev/sdc, MB drives come before the
LSI drives in the count) got deleted via /sys/block.... and physically
disconnected from the system, which was then happily running with
/dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as
the 'archive' drives.
It went 96 hours or so like that under moderate load. Then the
failure happened and the machine was rebooted, so the previous sdd ->
sdo drives became sdc -> sdn.
However, the relative order was, to the best of my knowledge,
conserved - AND I still have the 'faulty' drive, so I could very
easily put it back in to have everything match.
Most importantly, this drive has on it, without a doubt, the details
of the array BEFORE everything happened - by definition untouched
because the drive was stopped and pulled before the event.
I also have a cat of the --examine of two of the faulty drives BEFORE
anything was written to them - thus, unless I am mistaken, these
contained the md block details from 'before the event'.

Here is one of them, taken after the reboot and therefore when the MB
/dev/sdc was no longer there:
---
/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 2457b506:85728e9d:c44c77eb:7ee19756
  Creation Time : Sat Mar 30 18:18:00 2019
     Raid Level : raid5
  Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
     Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
   Raid Devices : 12
  Total Devices : 12
Preferred Minor : 123

    Update Time : Tue Sep  6 11:37:53 2022
          State : clean
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 391e325d - correct
         Events : 52177

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     5       8       49        5      active sync   /dev/sdd1

   0     0       8      225        0      active sync
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       97        2      active sync   /dev/sdg1
   3     3       8      161        3      active sync   /dev/sdk1
   4     4       8      113        4      active sync   /dev/sdh1
   5     5       8       49        5      active sync   /dev/sdd1
   6     6       8      177        6      active sync   /dev/sdl1
   7     7       8      145        7      active sync   /dev/sdj1
   8     8       8      129        8      active sync   /dev/sdi1
   9     9       8       65        9      active sync   /dev/sde1
  10    10       8      209       10      active sync   /dev/sdn1
  11    11       8      193       11      active sync   /dev/sdm1
---
Note that the drives have 'moved' because the old /dev/sdc isn't
there any more, but the relative position should be the same - correct
me if I am wrong. If you prefer: to get the 'new' drive letter,
subtract 16 from the minor of each of the drives.

This is the 'new' --create:
---
/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive)
  Creation Time : Tue Sep  6 15:15:03 2022
     Raid Level : raid5
  Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
     Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
   Raid Devices : 12
  Total Devices : 12
Preferred Minor : 123

    Update Time : Tue Sep  6 15:15:03 2022
          State : clean
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ed12b96a - correct
         Events : 1

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     5       8       33        5      active sync   /dev/sdc1

   0     0       8      209        0      active sync   /dev/sdn1
   1     1       8       65        1      active sync   /dev/sde1
   2     2       8       81        2      active sync   /dev/sdf1
   3     3       8      145        3      active sync   /dev/sdj1
   4     4       8       97        4      active sync   /dev/sdg1
   5     5       8       33        5      active sync   /dev/sdc1
   6     6       8      161        6      active sync   /dev/sdk1
   7     7       8      129        7      active sync   /dev/sdi1
   8     8       8      113        8      active sync   /dev/sdh1
   9     9       8       49        9      active sync   /dev/sdd1
  10    10       8      193       10      active sync   /dev/sdm1
  11    11       8      177       11      active sync   /dev/sdl1
---

If you put the layout lines side by side, it would seem to me that
they match, modulo the '16' difference.

This is the list of --create and --assemble commands from the 6th
which involve the sdX1 partitions, the ones we care about right now -
there were others involving /dev/md124 and the /dev/sdX2 partitions,
which are not relevant here:
--
 9813  mdadm --assemble /dev/md123 missing
 9814  mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1
/dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1
/dev/sdn1 /dev/sdm1
 9815  mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1
/dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1
/dev/sdm1
 9823  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
/dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1
/dev/sdm1 /dev/sdl1
 9824  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
/dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1
/dev/sdd1 /dev/sdm1 /dev/sdl1
^^^^ note that these were the WRONG ARRAY - this was an unfortunate
miscommunication which caused potential damage.
 9852  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1
/dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
 9863  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1
/dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1
/dev/sdl1
 9879  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1
/dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1
/dev/sdk1 /dev/sdl1
 9889  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
/dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
/dev/sdm1 /dev/sdl1
 9892  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
/dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
/dev/sdm1 /dev/sdl1
 9895  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
/dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
/dev/sdm1 /dev/sdl1
 9901  mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
/dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
/dev/sdm1 /dev/sdl1
 9903  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
--chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
/dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
/dev/sdm1 /dev/sdl1
---

Note that they all were -o, therefore if I am not mistaken no parity
data was written anywhere. Note further the fact that the first two
were the 'mistake' ones, which did NOT have --assume-clean (but with
-o this shouldn't make a difference AFAIK) and most importantly the
metadata was the 1.2 default AND they were the wrong array in the
first place.
Note also that the 'final' --create commands also had --bitmap=none to
match the original array, though according to the docs the bitmap
space in 0.90 (and 1.2?) is in a space which does not affect the data
in the first place.

Now, first of all, a question: if I get the 'old' sdc, the one that
was taken out prior to this whole mess, onto a different system in
order to examine it, the modern mdraid auto discovery should NOT
overwrite the md data, correct? Thus I should be able to double-check
the drive order on that as well?
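
(My plan for doing that safely, sketched out - the AUTO line is, as I
understand it, the standard knob for keeping mdadm from auto-assembling
anything on that box:)
---
# /etc/mdadm/mdadm.conf on the inspection machine, before plugging in
AUTO -all

# the superblock inspection itself is read-only
mdadm --examine /dev/sdX1
---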

Any other pointers, insults etc are of course welcome.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-09 20:32   ` Luigi Fabio
@ 2022-09-09 21:01     ` Luigi Fabio
  2022-09-09 21:48       ` Phil Turmel
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-09 21:01 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Another helpful datapoint, this is the boot *before* sdc got
--replaced with sdo:

[   13.528395] md/raid:md123: device sdd1 operational as raid disk 5
[   13.528396] md/raid:md123: device sde1 operational as raid disk 9
[   13.528397] md/raid:md123: device sdg1 operational as raid disk 2
[   13.528398] md/raid:md123: device sdf1 operational as raid disk 1
[   13.528398] md/raid:md123: device sdh1 operational as raid disk 4
[   13.528399] md/raid:md123: device sdk1 operational as raid disk 3
[   13.528400] md/raid:md123: device sdj1 operational as raid disk 7
[   13.528401] md/raid:md123: device sdn1 operational as raid disk 10
[   13.528402] md/raid:md123: device sdi1 operational as raid disk 8
[   13.528402] md/raid:md123: device sdl1 operational as raid disk 6
[   13.528403] md/raid:md123: device sdm1 operational as raid disk 11
[   13.528403] md/raid:md123: device sdc1 operational as raid disk 0
[   13.531613] md/raid:md123: raid level 5 active with 12 out of 12
devices, algorithm 2
[   13.531644] md123: detected capacity change from 0 to 42945088192512

This gives us, correct me if I am wrong of course, an exact
representation of what the array 'used to look like', with sdc1 then
replaced by sdo1 (8/225).

Just some confirmation that the order should (?) be the one above.

LF


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-09 21:01     ` Luigi Fabio
@ 2022-09-09 21:48       ` Phil Turmel
  2022-09-09 22:11         ` David T-G
  2022-09-09 22:50         ` Luigi Fabio
  0 siblings, 2 replies; 24+ messages in thread
From: Phil Turmel @ 2022-09-09 21:48 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: linux-raid

Reasonably likely, but not certain.

Devices can be re-ordered by different kernels.  That's why lsdrv prints 
serial numbers in its tree.

You haven't mentioned whether your --create operations specified 
--assume-clean.

Also, be aware that shell expansion of something like /dev/sd[dcbaefgh] 
is sorted to /dev/sd[abcdefgh].  Use curly brace expansion with commas 
if you are taking shortcuts.
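
A quick illustration, assuming both devices exist:
---
$ echo /dev/sd[dc]1     # pathname expansion sorts its matches
/dev/sdc1 /dev/sdd1
$ echo /dev/sd{d,c}1    # brace expansion keeps the order you typed
/dev/sdd1 /dev/sdc1
---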


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-09 21:48       ` Phil Turmel
@ 2022-09-09 22:11         ` David T-G
  2022-09-09 22:50         ` Luigi Fabio
  1 sibling, 0 replies; 24+ messages in thread
From: David T-G @ 2022-09-09 22:11 UTC (permalink / raw)
  To: linux-raid

Phil & Luigi, et al --

...and then Phil Turmel said...
% 
...

% 
% You haven't mentioned whether your --create operations specified
% --assume-clean.

He hasn't?


% 
% On 9/9/22 17:01, Luigi Fabio wrote:
...
% > 
% > On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@gmail.com> wrote:
% > > 
...
% > > > But I'll be brutally honest:  your data is likely toast.
% > > Well, let's hope it isn't. All mdadm commands were -o and
% > > --assume-clean, so in theory the only things which HAVE been written
% > > are the md blocks, unless I am mistaken and/or I read the docs
% > > incorrectly?
...
% > > This is the list of --create and --assemble commands from the 6th
...
% > >   9813  mdadm --assemble /dev/md123 missing
% > >   9814  mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1
...
% > >   9815  mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1
...
% > > /dev/sdm1
% > >   9823  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
...
% > >   9824  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
...
% > >   9852  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
...
% > >   9863  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
...
% > >   9879  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
...
% > >   9889  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
...
% > >   9892  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
...
% > >   9895  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
...
% > >   9901  mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
...
% > >   9903  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
...
% > > 
% > > Note that they all were -o, therefore if I am not mistaken no parity
% > > data was written anywhere. Note further the fact that the first two
% > > were the 'mistake' ones, which did NOT have --assume-clean (but with
% > > -o this shouldn't make a difference AFAIK) and most importantly the
% > > metadata was the 1.2 default AND they were the wrong array in the
% > > first place.
[snip]

I certainly don't know what I'm talking about, so this is all I'll say,
but it looked reasonably complete to me ...


HTH & HANW

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-09 21:48       ` Phil Turmel
  2022-09-09 22:11         ` David T-G
@ 2022-09-09 22:50         ` Luigi Fabio
  2022-09-09 23:04           ` Luigi Fabio
  1 sibling, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-09 22:50 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

By different kernels, maybe - but the kernel has been the same for
quite a while (months).

I did paste the full command lines in the (very long) email, as David
mentions (thanks!). The first ones, the mistaken ones, did NOT have
--assume-clean but they did have -o, so no parity activity should have
started, according to the docs?
A new thought came to mind: one of the HBAs lost a channel, right?
What if on the subsequent reboot the devices that were on that channel
got 'rediscovered' and shunted to the end of the letter order? That
would, I believe, be ordinary operating procedure.
That would give us an almost-correct array, which would explain how
fsck can get ... some pieces.
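
(One way to pin letters to physical drives regardless of discovery
order is to go by serial number - a rough sketch, with the device
range as a placeholder:)
---
for d in /dev/sd[c-n]; do
    printf '%s  %s\n' "$d" "$(smartctl -i "$d" | awk -F': *' '/Serial Number/ {print $2}')"
done
---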

Also, I am not quite brave enough (...) to use shortcuts when handling
mdadm commands.

I am reconstructing the port order (scsi targets, if you prefer) from
the 20220904 boot log. I should at that point be able to have an exact
order of the drives.

Here it is:

---
[    1.853329] sd 2:0:0:0: [sda] Write Protect is off
[    1.853331] sd 7:0:0:0: [sdc] Write Protect is off
[    1.853382] sd 3:0:0:0: [sdb] Write Protect is off
[   12.531607] sd 10:0:3:0: [sdg] Write Protect is off
[   12.533303] sd 10:0:2:0: [sdf] Write Protect is off
[   12.534606] sd 10:0:0:0: [sdd] Write Protect is off
[   12.570768] sd 10:0:1:0: [sde] Write Protect is off
[   12.959925] sd 11:0:0:0: [sdh] Write Protect is off
[   12.965230] sd 11:0:1:0: [sdi] Write Protect is off
[   12.966145] sd 11:0:4:0: [sdl] Write Protect is off
[   12.966800] sd 11:0:3:0: [sdk] Write Protect is off
[   12.997253] sd 11:0:2:0: [sdj] Write Protect is off
[   13.002395] sd 11:0:7:0: [sdo] Write Protect is off
[   13.012693] sd 11:0:5:0: [sdm] Write Protect is off
[   13.017630] sd 11:0:6:0: [sdn] Write Protect is off
---
If we combine this with the previous:
---
[   13.528395] md/raid:md123: device sdd1 operational as raid disk 5
[   13.528396] md/raid:md123: device sde1 operational as raid disk 9
[   13.528397] md/raid:md123: device sdg1 operational as raid disk 2
[   13.528398] md/raid:md123: device sdf1 operational as raid disk 1
[   13.528398] md/raid:md123: device sdh1 operational as raid disk 4
[   13.528399] md/raid:md123: device sdk1 operational as raid disk 3
[   13.528400] md/raid:md123: device sdj1 operational as raid disk 7
[   13.528401] md/raid:md123: device sdn1 operational as raid disk 10
[   13.528402] md/raid:md123: device sdi1 operational as raid disk 8
[   13.528402] md/raid:md123: device sdl1 operational as raid disk 6
[   13.528403] md/raid:md123: device sdm1 operational as raid disk 11
[   13.528403] md/raid:md123: device sdc1 operational as raid disk 0
[   13.531613] md/raid:md123: raid level 5 active with 12 out of 12
devices, algorithm 2
[   13.531644] md123: detected capacity change from 0 to 42945088192512
---
We have a SCSI target -> raid disk number correspondence.
As of this boot, the letter -> SCSI target correspondences match,
shifted by one because, as discussed, 7:0:0:0 is no longer there (the
old, 'faulty' sdc).
Thus, having unambiguously determined the prior SCSI target -> raid
position, we can transpose it to the present drive letters, which are
shifted by one.
Therefore, we can generate - or rather, have generated - a --create
with the same software versions, the same settings and the same drive
order. Is there any reason why, minus the 1.2 metadata overwriting
(which should only have affected 12 blocks), the fs should 'not' be as
before?
Genuine question, mind.
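
(For completeness, pulling the letter -> raid slot mapping back out of
a saved dmesg is a one-liner of roughly this shape - field positions
assume log lines exactly like the ones above, and the filename is a
placeholder:)
---
awk '/operational as raid disk/ { dev = $5; sub(/1$/, "", dev); print $NF, dev }' \
    dmesg-20220904.log | sort -n
---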

> >> incorrectly?
> >>
> >> That does, of course, leave the problem of the blocks overwritten by
> >> the 1.2 metadata, but as I read the docs that should be a very small
> >> number - let's say one 4096byte block (a portion thereof, to be
> >> pedantic, but ext4 doesn't really care?) per drive, correct?
> >>
> >> Background:
> >> Separate 2x SSD RAID 1 root (/dev/sda. /dev/sdb) on the MB (Supemicro
> >> X10 series)'s chipset SATA ports.
> >> All filesystems are ext4, data=journal, nodelalloc, the 'data' RAIDs
> >> have journals on another SSD RAID1 (one per FS, obviously).
> >> Data drives:
> >> 12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers,
> >> each with two four-drive ports (and one of these went DELIGHTFULLY
> >> missing)
> >>
> >> This is the layout of each drive:
> >> ---
> >> GPT fdisk (gdisk) version 1.0.6
> >> ...
> >> Found valid GPT with protective MBR; using GPT.
> >> Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
> >> Model: ST4000NC001-1FS1
> >> Sector size (logical/physical): 512/4096 bytes
> >> ...
> >> Total free space is 99949 sectors (48.8 MiB)
> >>
> >> Number  Start (sector)    End (sector)  Size       Code  Name
> >>     1            2048      7625195519   3.5 TiB     8300  Linux RAID volume
> >>     2      7625195520      7813939199   90.0 GiB    8300  Linux RAID backup
> >> ---
> >>
> >> So there were two RAID arrays. Both RAID5 - a main RAID called
> >> 'archive' which had the 12 x 3.5ish partitions sdx1 and a second array
> >> called backup which had 12 x 90 GB.
> >>
> >> A little further backstory: right before the event, one drive had been
> >> pulled because it had started failing. What I did was shut down the
> >> machine, put the failing drive on a MB port and put a new drive on the
> >> LSI controllers. I then brought the machine back online, did the
> >> --replace --with thing and this worked fine.
> >> At that point the faulty drive (/dev/sdc, MB drives come before the
> >> LSI drives in the count) got deleted via /sys/block.... and physically
> >> disconnected from the system, which was then happily running with
> >> /dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as
> >> the 'archive' drives.
> >> It went 96 hours or so like that under moderate load. Then the failure
> >> happened, the machine was rebooted thus the previous sdd -> sdo drives
> >> became sdc -> sdn drives.
> >> However, the relative order was, to the best of my knowledge,
> >> conserved - AND I still have the 'faulty' drive, so I could very
> >> easily put it back in to have everything match.
> >> Most importantly, this drive has on it, without a doubt, the details
> >> of the array BEFORE everything happened - by definition untouched
> >> because the drive was stopped and pulled before the event.
> >> I also have a cat of the --examine of two of the faulty drives BEFORE
> >> anything was written to them - thus, unless I am mistaken, these
> >> contained the md block details from 'before the event'.
> >>
> >> Here is one of them, taken after the reboot and therefore when the MB
> >> /dev/sdc was no longer there:
> >> ---
> >> /dev/sdc1:
> >>            Magic : a92b4efc
> >>          Version : 0.90.00
> >>             UUID : 2457b506:85728e9d:c44c77eb:7ee19756
> >>    Creation Time : Sat Mar 30 18:18:00 2019
> >>       Raid Level : raid5
> >>    Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> >>       Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> >>     Raid Devices : 12
> >>    Total Devices : 12
> >> Preferred Minor : 123
> >>
> >>      Update Time : Tue Sep  6 11:37:53 2022
> >>            State : clean
> >>   Active Devices : 12
> >> Working Devices : 12
> >>   Failed Devices : 0
> >>    Spare Devices : 0
> >>         Checksum : 391e325d - correct
> >>           Events : 52177
> >>
> >>           Layout : left-symmetric
> >>       Chunk Size : 128K
> >>
> >>        Number   Major   Minor   RaidDevice State
> >> this     5       8       49        5      active sync   /dev/sdd1
> >>
> >>     0     0       8      225        0      active sync
> >>     1     1       8       81        1      active sync   /dev/sdf1
> >>     2     2       8       97        2      active sync   /dev/sdg1
> >>     3     3       8      161        3      active sync   /dev/sdk1
> >>     4     4       8      113        4      active sync   /dev/sdh1
> >>     5     5       8       49        5      active sync   /dev/sdd1
> >>     6     6       8      177        6      active sync   /dev/sdl1
> >>     7     7       8      145        7      active sync   /dev/sdj1
> >>     8     8       8      129        8      active sync   /dev/sdi1
> >>     9     9       8       65        9      active sync   /dev/sde1
> >>    10    10       8      209       10      active sync   /dev/sdn1
> >>    11    11       8      193       11      active sync   /dev/sdm1
> >> ---
> >> Note that the drives are 'moved' because the old /dev/sdc isn't there
> >> any more but the relative position should be the same, correct me if I
> >> am wrong. If you prefer, what you need to do to get the 'new' drive
> >> letter is to take 16 out of the minor of each of the drives.
> >>
> >> This is the 'new' --create
> >> ---
> >> /dev/sdc1:
> >>            Magic : a92b4efc
> >>          Version : 0.90.00
> >>             UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive)
> >>    Creation Time : Tue Sep  6 15:15:03 2022
> >>       Raid Level : raid5
> >>    Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> >>       Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> >>     Raid Devices : 12
> >>    Total Devices : 12
> >> Preferred Minor : 123
> >>
> >>      Update Time : Tue Sep  6 15:15:03 2022
> >>            State : clean
> >>   Active Devices : 12
> >> Working Devices : 12
> >>   Failed Devices : 0
> >>    Spare Devices : 0
> >>         Checksum : ed12b96a - correct
> >>           Events : 1
> >>
> >>           Layout : left-symmetric
> >>       Chunk Size : 128K
> >>
> >>        Number   Major   Minor   RaidDevice State
> >> this     5       8       33        5      active sync   /dev/sdc1
> >>
> >>     0     0       8      209        0      active sync   /dev/sdn1
> >>     1     1       8       65        1      active sync   /dev/sde1
> >>     2     2       8       81        2      active sync   /dev/sdf1
> >>     3     3       8      145        3      active sync   /dev/sdj1
> >>     4     4       8       97        4      active sync   /dev/sdg1
> >>     5     5       8       33        5      active sync   /dev/sdc1
> >>     6     6       8      161        6      active sync   /dev/sdk1
> >>     7     7       8      129        7      active sync   /dev/sdi1
> >>     8     8       8      113        8      active sync   /dev/sdh1
> >>     9     9       8       49        9      active sync   /dev/sdd1
> >>    10    10       8      193       10      active sync   /dev/sdm1
> >>    11    11       8      177       11      active sync   /dev/sdl1
> >> ---
> >>
> >> If you put the layout lines side by side, it would seem to me that
> >> they match, modulo the '16' difference.
> >>
> >> This is the list of --create and --assemble commands from the 6th
> >> which involve the sdx1 partitions, those we care about right now -
> >> there were others involving /dev/md124 and the /dev/sdx2 which however
> >> are not relevant - the data there :
> >> --
> >>   9813  mdadm --assemble /dev/md123 missing
> >>   9814  mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1
> >> /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1
> >> /dev/sdn1 /dev/sdm1
> >>   9815  mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1
> >> /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1
> >> /dev/sdm1
> >>   9823  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
> >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1
> >> /dev/sdm1 /dev/sdl1
> >>   9824  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
> >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1
> >> /dev/sdd1 /dev/sdm1 /dev/sdl1
> >> ^^^^ note that these were the WRONG ARRAY - this was an unfortunate
> >> miscommunication which caused potential damage.
> >>   9852  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1
> >> /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
> >>   9863  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1
> >> /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1
> >> /dev/sdl1
> >>   9879  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1
> >> /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1
> >> /dev/sdk1 /dev/sdl1
> >>   9889  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> >> /dev/sdm1 /dev/sdl1
> >>   9892  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> >> /dev/sdm1 /dev/sdl1
> >>   9895  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> >> /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> >> /dev/sdm1 /dev/sdl1
> >>   9901  mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> >> /dev/sdm1 /dev/sdl1
> >>   9903  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> >> /dev/sdj1 /dev/sdg1 / dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> >> /dev/sdm1 /dev/sdl1
> >> ---
> >>
> >> Note that they all were -o, therefore if I am not mistaken no parity
> >> data was written anywhere. Note further the fact that the first two
> >> were the 'mistake' ones, which did NOT have --assume-clean (but with
> >> -o this shouldn't make a difference AFAIK) and most importantly the
> >> metadata was the 1.2 default AND they were the wrong array in the
> >> first place.
> >> Note also that the 'final' --create commands also had --bitmap=none to
> >> match the original array, though according to the docs the bitmap
> >> space in 0.90 (and 1.2?) is in a space which does not affect the data
> >> in the first place.
> >>
> >> Now, first of all a question: if I get the 'old' sdc, the one that was
> >> taken out prior to this whole mess, onto a different system in order
> >> to examine it, the modern mdraid auto discovery shoud NOT overwrite
> >> the md data, correct? Thus I should be able to double-check the drive
> >> order on that as well?
> >>
> >> Any other pointers, insults etc are of course welcome.
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-09 22:50         ` Luigi Fabio
@ 2022-09-09 23:04           ` Luigi Fabio
  2022-09-10  1:29             ` Luigi Fabio
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-09 23:04 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

A further question, in THIS boot's log I found:
[ 9874.709903] md/raid:md123: raid level 5 active with 12 out of 12
devices, algorithm 2
[ 9874.710249] md123: bitmap file is out of date (0 < 1) -- forcing
full recovery
[ 9874.714178] md123: bitmap file is out of date, doing full recovery
[ 9874.881106] md123: detected capacity change from 0 to 42945088192512
This is from, I think, the second --create of /dev/md123, before I
added --bitmap=none. It should, however, not have written anything
with -o and --assume-clean, correct?
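(For what it is worth, the way I have been checking that nothing is
quietly rewriting parity is nothing clever, just the standard knobs -
a sketch, with the names as they are on this box:
---
cat /proc/mdstat                      # any resync/recovery shows up as a progress bar
cat /sys/block/md123/md/sync_action   # should say 'idle'
mdadm --detail /dev/md123 | grep -E 'State :|Events'  # fresh --assume-clean create: Events stays tiny
---
i.e. if a recovery pass had actually run, it should show up there.)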

On Fri, Sep 9, 2022 at 6:50 PM Luigi Fabio <luigi.fabio@gmail.com> wrote:
>
> By different kernels, maybe - but the kernel has been the same for
> quite a while (months).
>
> I did paste the whole of the command lines in the (very long) email,
> as David mentions (thanks!) - the first ones, the mistaken ones, did
> NOT have --assume-clean but they did have -o, so no parity activity
> should have started according to the docs?
> A new thought came to mind: one of the HBAs lost a channel, right?
> What if on the subsequent reboot the devices that were on that channel
> got 'rediscovered' and shunted to the end of the letter order? That
> would, I believe, be ordinary operating procedure.
> That would give us an almost-correct array, which would explain how
> fsck can get ... some pieces.
>
> Also, I am not quite brave enough (...) to use shortcuts when handling
> mdadm commands.
>
> I am reconstructing the port order (scsi targets, if you prefer) from
> the 20220904 boot log. I should at that point be able to have an exact
> order of the drives.
>
> Here it is:
>
> ---
> [    1.853329] sd 2:0:0:0: [sda] Write Protect is off
> [    1.853331] sd 7:0:0:0: [sdc] Write Protect is off
> [    1.853382] sd 3:0:0:0: [sdb] Write Protect is off
> [   12.531607] sd 10:0:3:0: [sdg] Write Protect is off
> [   12.533303] sd 10:0:2:0: [sdf] Write Protect is off
> [   12.534606] sd 10:0:0:0: [sdd] Write Protect is off
> [   12.570768] sd 10:0:1:0: [sde] Write Protect is off
> [   12.959925] sd 11:0:0:0: [sdh] Write Protect is off
> [   12.965230] sd 11:0:1:0: [sdi] Write Protect is off
> [   12.966145] sd 11:0:4:0: [sdl] Write Protect is off
> [   12.966800] sd 11:0:3:0: [sdk] Write Protect is off
> [   12.997253] sd 11:0:2:0: [sdj] Write Protect is off
> [   13.002395] sd 11:0:7:0: [sdo] Write Protect is off
> [   13.012693] sd 11:0:5:0: [sdm] Write Protect is off
> [   13.017630] sd 11:0:6:0: [sdn] Write Protect is off
> ---
> If we combine this with the previous:
> ---
> [   13.528395] md/raid:md123: device sdd1 operational as raid disk 5
> [   13.528396] md/raid:md123: device sde1 operational as raid disk 9
> [   13.528397] md/raid:md123: device sdg1 operational as raid disk 2
> [   13.528398] md/raid:md123: device sdf1 operational as raid disk 1
> [   13.528398] md/raid:md123: device sdh1 operational as raid disk 4
> [   13.528399] md/raid:md123: device sdk1 operational as raid disk 3
> [   13.528400] md/raid:md123: device sdj1 operational as raid disk 7
> [   13.528401] md/raid:md123: device sdn1 operational as raid disk 10
> [   13.528402] md/raid:md123: device sdi1 operational as raid disk 8
> [   13.528402] md/raid:md123: device sdl1 operational as raid disk 6
> [   13.528403] md/raid:md123: device sdm1 operational as raid disk 11
> [   13.528403] md/raid:md123: device sdc1 operational as raid disk 0
> [   13.531613] md/raid:md123: raid level 5 active with 12 out of 12
> devices, algorithm 2
> [   13.531644] md123: detected capacity change from 0 to 42945088192512
> ---
> We have a SCSI target -> raid disk number correspondence.
> As of this boot, the letter -> scsi target correspondences match,
> shifted by one because as discussed 7:0:0:0 is no longer there (the
> old, 'faulty' sdc).
> Thus, having univocally determined the prior scsi target -> raid
> position we can transpose it to the present drive letters, which are
> shifted by one.
> Therefore, we can generate, rectius have generated, a --create with
> the same software versions, the same settings and the same drive
> order. Is there any reason why, minus the 1.2 metadata overwriting
> which should have only affected 12 blocks, the fs should 'not' be as
> before?
> Genuine question, mind.
>
> On Fri, Sep 9, 2022 at 5:48 PM Phil Turmel <philip@turmel.org> wrote:
> >
> > Reasonably likely, but not certain.
> >
> > Devices can be re-ordered by different kernels.  That's why lsdrv prints
> > serial numbers in its tree.
> >
> > You haven't mentioned whether your --create operations specified
> > --assume-clean.
> >
> > Also, be aware that shell expansion of something like /dev/sd[dcbaefgh]
> > is sorted to /dev/sd[abcdefgh].  Use curly brace expansion with commas
> > if you are taking shortcuts.
> >
> > On 9/9/22 17:01, Luigi Fabio wrote:
> > > Another helpful datapoint, this is the boot *before* sdc got
> > > --replaced with sdo:
> > >
> > > [   13.528395] md/raid:md123: device sdd1 operational as raid disk 5
> > > [   13.528396] md/raid:md123: device sde1 operational as raid disk 9
> > > [   13.528397] md/raid:md123: device sdg1 operational as raid disk 2
> > > [   13.528398] md/raid:md123: device sdf1 operational as raid disk 1
> > > [   13.528398] md/raid:md123: device sdh1 operational as raid disk 4
> > > [   13.528399] md/raid:md123: device sdk1 operational as raid disk 3
> > > [   13.528400] md/raid:md123: device sdj1 operational as raid disk 7
> > > [   13.528401] md/raid:md123: device sdn1 operational as raid disk 10
> > > [   13.528402] md/raid:md123: device sdi1 operational as raid disk 8
> > > [   13.528402] md/raid:md123: device sdl1 operational as raid disk 6
> > > [   13.528403] md/raid:md123: device sdm1 operational as raid disk 11
> > > [   13.528403] md/raid:md123: device sdc1 operational as raid disk 0
> > > [   13.531613] md/raid:md123: raid level 5 active with 12 out of 12
> > > devices, algorithm 2
> > > [   13.531644] md123: detected capacity change from 0 to 42945088192512
> > >
> > > This gives us, correct me if I am wrong of course, an exact
> > > representation of what the array 'used to look like', with sdc1 then
> > > replaced by sdo1 (8/225).
> > >
> > > Just some confirmation that the order should (?) be the one above.
> > >
> > > LF
> > >
> > > On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@gmail.com> wrote:
> > >>
> > >> Thanks for reaching out, first of all. Apologies for the late reply,
> > >> the brilliant (...) spam filter strikes again...
> > >>
> > >> On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@turmel.org> wrote:
> > >>> No, the moment of stupid was that you re-created the array.
> > >>> Simultaneous multi-drive failures that stop an array are easily fixed
> > >>> with --assemble --force.  Too late for that now.
> > >> Noted for the future, thanks.
> > >>
> > >>> It is absurdly easy to screw up device order when re-creating, and if
> > >>> you didn't specify every allocation and layout detail, the changes in
> > >>> defaults over the years would also screw up your data.  And finally,
> > >>> omitting --assume-clean would cause all of your parity to be
> > >>> recalculated immediately, with catastrophic results if any order or
> > >>> allocation attributes are wrong.
> > >> Of course. Which is why I specified everything and why I checked the
> > >> details with --examine and --detail and they match exactly, minus the
> > >> metadata version because, well, I wasn't actually the one typing (it's
> > >> a slightly complicated story.. I was reassembling by proxy on the
> > >> phone) and I made an incorrect assumption about the person typing.
> > >> There aren't, in the end, THAT many things to specify: RAID level,
> > >> number of drives, order thereof, chunk size, 'layout' and metadata
> > >> version. 0.90 doesn't allow before/after gaps so that should be it, I
> > >> believe.
> > >> Am I missing anything?
> > >>
> > >>> No, you just got lucky in the past.  Probably by using mdadm versions
> > >>> that hadn't been updated.
> > >> That's not quite it: I keep records of how arrays are built and match
> > >> them, though it is true that I tend to update things as little as
> > >> possible on production machines.
> > >> One of the differences, this time, is that this was NOT a production
> > >> machine. The other was that I was driving, dictating on the phone and
> > >> was under a lot of pressure to get the thing back up ASAP.
> > >> Nonetheless, I have an --examine of at least two drives from the
> > >> previous setup so there should be enough information there to rebuild
> > >> a matching array, I think?
> > >>
> > >>> You'll need to show us every command you tried from your history, and
> > >>> full details of all drives/partitions involved.
> > >>>
> > >>> But I'll be brutally honest:  your data is likely toast.
> > >> Well, let's hope it isn't. All mdadm commands were -o and
> > >> --assume-clean, so in theory the only thing which HAS been written are
> > >> the md blocks, unless I am mistaken and/or I read the docs
> > >> incorrectly?
> > >>
> > >> That does, of course, leave the problem of the blocks overwritten by
> > >> the 1.2 metadata, but as I read the docs that should be a very small
> > >> number - let's say one 4096byte block (a portion thereof, to be
> > >> pedantic, but ext4 doesn't really care?) per drive, correct?
> > >>
> > >> Background:
> > >> Separate 2x SSD RAID 1 root (/dev/sda. /dev/sdb) on the MB (Supemicro
> > >> X10 series)'s chipset SATA ports.
> > >> All filesystems are ext4, data=journal, nodelalloc, the 'data' RAIDs
> > >> have journals on another SSD RAID1 (one per FS, obviously).
> > >> Data drives:
> > >> 12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers,
> > >> each with two four-drive ports (and one of these went DELIGHTFULLY
> > >> missing)
> > >>
> > >> This is the layout of each drive:
> > >> ---
> > >> GPT fdisk (gdisk) version 1.0.6
> > >> ...
> > >> Found valid GPT with protective MBR; using GPT.
> > >> Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
> > >> Model: ST4000NC001-1FS1
> > >> Sector size (logical/physical): 512/4096 bytes
> > >> ...
> > >> Total free space is 99949 sectors (48.8 MiB)
> > >>
> > >> Number  Start (sector)    End (sector)  Size       Code  Name
> > >>     1            2048      7625195519   3.5 TiB     8300  Linux RAID volume
> > >>     2      7625195520      7813939199   90.0 GiB    8300  Linux RAID backup
> > >> ---
> > >>
> > >> So there were two RAID arrays. Both RAID5 - a main RAID called
> > >> 'archive' which had the 12 x 3.5ish partitions sdx1 and a second array
> > >> called backup which had 12 x 90 GB.
> > >>
> > >> A little further backstory: right before the event, one drive had been
> > >> pulled because it had started failing. What I did was shut down the
> > >> machine, put the failing drive on a MB port and put a new drive on the
> > >> LSI controllers. I then brought the machine back online, did the
> > >> --replace --with thing and this worked fine.
> > >> At that point the faulty drive (/dev/sdc, MB drives come before the
> > >> LSI drives in the count) got deleted via /sys/block.... and physically
> > >> disconnected from the system, which was then happily running with
> > >> /dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as
> > >> the 'archive' drives.
> > >> It went 96 hours or so like that under moderate load. Then the failure
> > >> happened, the machine was rebooted thus the previous sdd -> sdo drives
> > >> became sdc -> sdn drives.
> > >> However, the relative order was, to the best of my knowledge,
> > >> conserved - AND I still have the 'faulty' drive, so I could very
> > >> easily put it back in to have everything match.
> > >> Most importantly, this drive has on it, without a doubt, the details
> > >> of the array BEFORE everything happened - by definition untouched
> > >> because the drive was stopped and pulled before the event.
> > >> I also have a cat of the --examine of two of the faulty drives BEFORE
> > >> anything was written to them - thus, unless I am mistaken, these
> > >> contained the md block details from 'before the event'.
> > >>
> > >> Here is one of them, taken after the reboot and therefore when the MB
> > >> /dev/sdc was no longer there:
> > >> ---
> > >> /dev/sdc1:
> > >>            Magic : a92b4efc
> > >>          Version : 0.90.00
> > >>             UUID : 2457b506:85728e9d:c44c77eb:7ee19756
> > >>    Creation Time : Sat Mar 30 18:18:00 2019
> > >>       Raid Level : raid5
> > >>    Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> > >>       Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> > >>     Raid Devices : 12
> > >>    Total Devices : 12
> > >> Preferred Minor : 123
> > >>
> > >>      Update Time : Tue Sep  6 11:37:53 2022
> > >>            State : clean
> > >>   Active Devices : 12
> > >> Working Devices : 12
> > >>   Failed Devices : 0
> > >>    Spare Devices : 0
> > >>         Checksum : 391e325d - correct
> > >>           Events : 52177
> > >>
> > >>           Layout : left-symmetric
> > >>       Chunk Size : 128K
> > >>
> > >>        Number   Major   Minor   RaidDevice State
> > >> this     5       8       49        5      active sync   /dev/sdd1
> > >>
> > >>     0     0       8      225        0      active sync
> > >>     1     1       8       81        1      active sync   /dev/sdf1
> > >>     2     2       8       97        2      active sync   /dev/sdg1
> > >>     3     3       8      161        3      active sync   /dev/sdk1
> > >>     4     4       8      113        4      active sync   /dev/sdh1
> > >>     5     5       8       49        5      active sync   /dev/sdd1
> > >>     6     6       8      177        6      active sync   /dev/sdl1
> > >>     7     7       8      145        7      active sync   /dev/sdj1
> > >>     8     8       8      129        8      active sync   /dev/sdi1
> > >>     9     9       8       65        9      active sync   /dev/sde1
> > >>    10    10       8      209       10      active sync   /dev/sdn1
> > >>    11    11       8      193       11      active sync   /dev/sdm1
> > >> ---
> > >> Note that the drives are 'moved' because the old /dev/sdc isn't there
> > >> any more but the relative position should be the same, correct me if I
> > >> am wrong. If you prefer, what you need to do to get the 'new' drive
> > >> letter is to take 16 out of the minor of each of the drives.
> > >>
> > >> This is the 'new' --create
> > >> ---
> > >> /dev/sdc1:
> > >>            Magic : a92b4efc
> > >>          Version : 0.90.00
> > >>             UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive)
> > >>    Creation Time : Tue Sep  6 15:15:03 2022
> > >>       Raid Level : raid5
> > >>    Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> > >>       Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> > >>     Raid Devices : 12
> > >>    Total Devices : 12
> > >> Preferred Minor : 123
> > >>
> > >>      Update Time : Tue Sep  6 15:15:03 2022
> > >>            State : clean
> > >>   Active Devices : 12
> > >> Working Devices : 12
> > >>   Failed Devices : 0
> > >>    Spare Devices : 0
> > >>         Checksum : ed12b96a - correct
> > >>           Events : 1
> > >>
> > >>           Layout : left-symmetric
> > >>       Chunk Size : 128K
> > >>
> > >>        Number   Major   Minor   RaidDevice State
> > >> this     5       8       33        5      active sync   /dev/sdc1
> > >>
> > >>     0     0       8      209        0      active sync   /dev/sdn1
> > >>     1     1       8       65        1      active sync   /dev/sde1
> > >>     2     2       8       81        2      active sync   /dev/sdf1
> > >>     3     3       8      145        3      active sync   /dev/sdj1
> > >>     4     4       8       97        4      active sync   /dev/sdg1
> > >>     5     5       8       33        5      active sync   /dev/sdc1
> > >>     6     6       8      161        6      active sync   /dev/sdk1
> > >>     7     7       8      129        7      active sync   /dev/sdi1
> > >>     8     8       8      113        8      active sync   /dev/sdh1
> > >>     9     9       8       49        9      active sync   /dev/sdd1
> > >>    10    10       8      193       10      active sync   /dev/sdm1
> > >>    11    11       8      177       11      active sync   /dev/sdl1
> > >> ---
> > >>
> > >> If you put the layout lines side by side, it would seem to me that
> > >> they match, modulo the '16' difference.
> > >>
> > >> This is the list of --create and --assemble commands from the 6th
> > >> which involve the sdx1 partitions, those we care about right now -
> > >> there were others involving /dev/md124 and the /dev/sdx2 which however
> > >> are not relevant - the data there :
> > >> --
> > >>   9813  mdadm --assemble /dev/md123 missing
> > >>   9814  mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1
> > >> /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1
> > >> /dev/sdn1 /dev/sdm1
> > >>   9815  mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1
> > >> /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1
> > >> /dev/sdm1
> > >>   9823  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
> > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >>   9824  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
> > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1
> > >> /dev/sdd1 /dev/sdm1 /dev/sdl1
> > >> ^^^^ note that these were the WRONG ARRAY - this was an unfortunate
> > >> miscommunication which caused potential damage.
> > >>   9852  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1
> > >> /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
> > >>   9863  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1
> > >> /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1
> > >> /dev/sdl1
> > >>   9879  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1
> > >> /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1
> > >> /dev/sdk1 /dev/sdl1
> > >>   9889  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >>   9892  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >>   9895  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >>   9901  mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >>   9903  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > >> /dev/sdj1 /dev/sdg1 / dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > >> /dev/sdm1 /dev/sdl1
> > >> ---
> > >>
> > >> Note that they all were -o, therefore if I am not mistaken no parity
> > >> data was written anywhere. Note further the fact that the first two
> > >> were the 'mistake' ones, which did NOT have --assume-clean (but with
> > >> -o this shouldn't make a difference AFAIK) and most importantly the
> > >> metadata was the 1.2 default AND they were the wrong array in the
> > >> first place.
> > >> Note also that the 'final' --create commands also had --bitmap=none to
> > >> match the original array, though according to the docs the bitmap
> > >> space in 0.90 (and 1.2?) is in a space which does not affect the data
> > >> in the first place.
> > >>
> > >> Now, first of all a question: if I get the 'old' sdc, the one that was
> > >> taken out prior to this whole mess, onto a different system in order
> > >> to examine it, the modern mdraid auto discovery shoud NOT overwrite
> > >> the md data, correct? Thus I should be able to double-check the drive
> > >> order on that as well?
> > >>
> > >> Any other pointers, insults etc are of course welcome.
> >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-09 23:04           ` Luigi Fabio
@ 2022-09-10  1:29             ` Luigi Fabio
  2022-09-10 15:18               ` Phil Turmel
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-10  1:29 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

For completeness' sake, though it should not be relevant, here is the
error that caused the mishap:
---
Sep  6 11:41:18 beehive kernel: [164700.275878] mpt2sas_cm0: SAS host
is non-operational !!!!
Sep  6 11:41:19 beehive kernel: [164701.395828] mpt2sas_cm0: SAS host
is non-operational !!!!
Sep  6 11:41:21 beehive kernel: [164702.515813] mpt2sas_cm0: SAS host
is non-operational !!!!
Sep  6 11:41:22 beehive kernel: [164703.635801] mpt2sas_cm0: SAS host
is non-operational !!!!
Sep  6 11:41:23 beehive kernel: [164704.723793] mpt2sas_cm0: SAS host
is non-operational !!!!
Sep  6 11:41:24 beehive kernel: [164705.811778] mpt2sas_cm0: SAS host
is non-operational !!!!
Sep  6 11:41:24 beehive kernel: [164705.894616] mpt2sas_cm0:
_base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
Sep  6 11:41:24 beehive kernel: [164705.981926] sd 10:0:0:0: [sdd]
Synchronizing SCSI cache
Sep  6 11:41:24 beehive kernel: [164705.981967] sd 10:0:0:0: [sdd]
Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK
Sep  6 11:41:24 beehive kernel: [164705.987746] sd 10:0:1:0: [sde]
tag#2758 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
cmd_age=6s
Sep  6 11:41:24 beehive kernel: [164705.987749] sd 10:0:1:0: [sde]
tag#2758 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 98 00 00 00 68 00 00
Sep  6 11:41:24 beehive kernel: [164705.987751] blk_update_request:
I/O error, dev sde, sector 5779360408 op 0x0:(READ) flags 0x80700
phys_seg 1 prio class 0
Sep  6 11:41:24 beehive kernel: [164706.159887] sd 10:0:1:0: [sde]
tag#2759 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
cmd_age=6s
Sep  6 11:41:24 beehive kernel: [164706.159897] sd 10:0:1:0: [sde]
tag#2759 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 00 00 00 00 98 00 00
Sep  6 11:41:24 beehive kernel: [164706.159903] blk_update_request:
I/O error, dev sde, sector 5779360256 op 0x0:(READ) flags 0x80700
phys_seg 3 prio class 0
Sep  6 11:41:24 beehive kernel: [164706.160073] sd 10:0:1:0: [sde]
tag#2761 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
cmd_age=0s
Sep  6 11:41:24 beehive kernel: [164706.333860] sd 10:0:1:0: [sde]
tag#2761 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 98 00 00 00 08 00 00
Sep  6 11:41:24 beehive kernel: [164706.333862] blk_update_request:
I/O error, dev sde, sector 5779360408 op 0x0:(READ) flags 0x4000
phys_seg 1 prio class 0
Sep  6 11:41:24 beehive kernel: [164706.333864] sd 10:0:2:0: [sdf]
tag#2760 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
cmd_age=6s
Sep  6 11:41:24 beehive kernel: [164706.334010] sd 10:0:1:0: [sde]
tag#2774 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
cmd_age=0s
Sep  6 11:41:24 beehive kernel: [164706.334012] sd 10:0:1:0: [sde]
tag#2774 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 00 00 00 00 08 00 00
Sep  6 11:41:24 beehive kernel: [164706.334014] blk_update_request:
I/O error, dev sde, sector 5779360256 op 0x0:(READ) flags 0x4000
phys_seg 1 prio class 0
Sep  6 11:41:24 beehive kernel: [164706.334021] sd 10:0:1:0: [sde]
tag#2775 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
cmd_age=0s
Sep  6 11:41:24 beehive kernel: [164706.334022] sd 10:0:1:0: [sde]
tag#2775 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 08 00 00 00 08 00 00
Sep  6 11:41:24 beehive kernel: [164706.334024] blk_update_request:
I/O error, dev sde, sector 5779360264 op 0x0:(READ) flags 0x4000
phys_seg 1 prio class 0
Sep  6 11:41:24 beehive kernel: [164706.334026] sd 10:0:1:0: [sde]
tag#2776 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
cmd_age=0s
Sep  6 11:41:24 beehive kernel: [164706.334028] sd 10:0:1:0: [sde]
tag#2776 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 10 00 00 00 08 00 00
Sep  6 11:41:24 beehive kernel: [164706.334029] blk_update_request:
I/O error, dev sde, sector 5779360272 op 0x0:(READ) flags 0x4000
phys_seg 1 prio class 0
Sep  6 11:41:24 beehive kernel: [164706.334031] sd 10:0:1:0: [sde]
tag#2777 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
cmd_age=0s
Sep  6 11:41:24 beehive kernel: [164706.334033] sd 10:0:1:0: [sde]
tag#2777 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 18 00 00 00 08 00 00
Sep  6 11:41:24 beehive kernel: [164706.334034] blk_update_request:
I/O error, dev sde, sector 5779360280 op 0x0:(READ) flags 0x4000
phys_seg 1 prio class 0
Sep  6 11:41:24 beehive kernel: [164706.334036] sd 10:0:1:0: [sde]
tag#2778 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
cmd_age=0s
Sep  6 11:41:24 beehive kernel: [164706.334037] sd 10:0:1:0: [sde]
tag#2778 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 20 00 00 00 08 00 00
Sep  6 11:41:24 beehive kernel: [164706.334038] blk_update_request:
I/O error, dev sde, sector 5779360288 op 0x0:(READ) flags 0x4000
phys_seg 1 prio class 0
Sep  6 11:41:24 beehive kernel: [164706.334039] sd 10:0:1:0: [sde]
tag#2779 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
cmd_age=0s
Sep  6 11:41:24 beehive kernel: [164706.334041] sd 10:0:1:0: [sde]
tag#2779 CDB: Read(16) 88 00 00 00 00 01 58 7a 0a 28 00 00 00 08 00 00
Sep  6 11:41:24 beehive kernel: [164706.334041] blk_update_request:
I/O error, dev sde, sector 5779360296 op 0x0:(READ) flags 0x4000
phys_seg 1 prio class 0
Sep  6 11:41:24 beehive kernel: [164706.334043] blk_update_request:
I/O error, dev sde, sector 5779360304 op 0x0:(READ) flags 0x4000
phys_seg 1 prio class 0
Sep  6 11:41:24 beehive kernel: [164706.346000] md/raid:md123: Disk
failure on sde1, disabling device.
Sep  6 11:41:24 beehive kernel: [164706.346002] md/raid:md123:
Operation continuing on 11 devices.
Sep  6 11:41:24 beehive kernel: [164706.346008] md/raid:md123: Disk
failure on sdd1, disabling device.
Sep  6 11:41:24 beehive kernel: [164706.346009] md/raid:md123: Cannot
continue operation (2/12 failed).
Sep  6 11:41:24 beehive kernel: [164706.346011] md/raid:md123: Disk
failure on sdg1, disabling device.
Sep  6 11:41:24 beehive kernel: [164706.346012] md/raid:md123: Cannot
continue operation (3/12 failed).
Sep  6 11:41:24 beehive kernel: [164706.346013] md/raid:md123: Disk
failure on sdf1, disabling device.
Sep  6 11:41:24 beehive kernel: [164706.346014] md/raid:md123: Cannot
continue operation (4/12 failed).
----
Note that port 0 of controller 10: just... lost it. That controller
only uses port 0, so we don't know whether it was the whole controller
or just the port, but that is what happened. Of course, it works now.
sdd, sde, sdf and sdg decided they were going on holiday; sdc had
already been removed at this point, as mentioned; controller 11:, with
the other eight drives, was apparently just fine.

The *really odd* thing is that it failed... gracefully. I cannot
understand what damaged the filesystem.
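
If it helps narrow that down: the one write I know of is the stray 1.2
--create, and a v1.2 superblock sits 4 KiB from the start of each
member partition. Assuming it wrote nothing else below its data offset
(which is how I read the docs), dumping that block with the array
stopped should show what is there now - leftover md magic, zeros, or
filesystem data:
---
# 4 KiB block at offset 4 KiB into one member, array stopped
dd if=/dev/sdc1 bs=4096 skip=1 count=1 2>/dev/null | hexdump -C | head -n 20
---
I can post that output if it is useful.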

On Fri, Sep 9, 2022 at 7:04 PM Luigi Fabio <luigi.fabio@gmail.com> wrote:
>
> A further question, in THIS boot's log I found:
> [ 9874.709903] md/raid:md123: raid level 5 active with 12 out of 12
> devices, algorithm 2
> [ 9874.710249] md123: bitmap file is out of date (0 < 1) -- forcing
> full recovery
> [ 9874.714178] md123: bitmap file is out of date, doing full recovery
> [ 9874.881106] md123: detected capacity change from 0 to 42945088192512
> From, I think, the second --create of /dev/123, before I added the
> bitmap=none. This should, however, not have written anything with -o
> and --assume-clean, correct?
>
> On Fri, Sep 9, 2022 at 6:50 PM Luigi Fabio <luigi.fabio@gmail.com> wrote:
> >
> > By different kernels, maybe - but the kernel has been the same for
> > quite a while (months).
> >
> > I did paste the whole of the command lines in the (very long) email,
> > as David mentions (thanks!) - the first ones, the mistaken ones, did
> > NOT have --assume-clean but they did have -o, so no parity activity
> > should have started according to the docs?
> > A new thought came to mind: one of the HBAs lost a channel, right?
> > What if on the subsequent reboot the devices that were on that channel
> > got 'rediscovered' and shunted to the end of the letter order? That
> > would, I believe, be ordinary operating procedure.
> > That would give us an almost-correct array, which would explain how
> > fsck can get ... some pieces.
> >
> > Also, I am not quite brave enough (...) to use shortcuts when handling
> > mdadm commands.
> >
> > I am reconstructing the port order (scsi targets, if you prefer) from
> > the 20220904 boot log. I should at that point be able to have an exact
> > order of the drives.
> >
> > Here it is:
> >
> > ---
> > [    1.853329] sd 2:0:0:0: [sda] Write Protect is off
> > [    1.853331] sd 7:0:0:0: [sdc] Write Protect is off
> > [    1.853382] sd 3:0:0:0: [sdb] Write Protect is off
> > [   12.531607] sd 10:0:3:0: [sdg] Write Protect is off
> > [   12.533303] sd 10:0:2:0: [sdf] Write Protect is off
> > [   12.534606] sd 10:0:0:0: [sdd] Write Protect is off
> > [   12.570768] sd 10:0:1:0: [sde] Write Protect is off
> > [   12.959925] sd 11:0:0:0: [sdh] Write Protect is off
> > [   12.965230] sd 11:0:1:0: [sdi] Write Protect is off
> > [   12.966145] sd 11:0:4:0: [sdl] Write Protect is off
> > [   12.966800] sd 11:0:3:0: [sdk] Write Protect is off
> > [   12.997253] sd 11:0:2:0: [sdj] Write Protect is off
> > [   13.002395] sd 11:0:7:0: [sdo] Write Protect is off
> > [   13.012693] sd 11:0:5:0: [sdm] Write Protect is off
> > [   13.017630] sd 11:0:6:0: [sdn] Write Protect is off
> > ---
> > If we combine this with the previous:
> > ---
> > [   13.528395] md/raid:md123: device sdd1 operational as raid disk 5
> > [   13.528396] md/raid:md123: device sde1 operational as raid disk 9
> > [   13.528397] md/raid:md123: device sdg1 operational as raid disk 2
> > [   13.528398] md/raid:md123: device sdf1 operational as raid disk 1
> > [   13.528398] md/raid:md123: device sdh1 operational as raid disk 4
> > [   13.528399] md/raid:md123: device sdk1 operational as raid disk 3
> > [   13.528400] md/raid:md123: device sdj1 operational as raid disk 7
> > [   13.528401] md/raid:md123: device sdn1 operational as raid disk 10
> > [   13.528402] md/raid:md123: device sdi1 operational as raid disk 8
> > [   13.528402] md/raid:md123: device sdl1 operational as raid disk 6
> > [   13.528403] md/raid:md123: device sdm1 operational as raid disk 11
> > [   13.528403] md/raid:md123: device sdc1 operational as raid disk 0
> > [   13.531613] md/raid:md123: raid level 5 active with 12 out of 12
> > devices, algorithm 2
> > [   13.531644] md123: detected capacity change from 0 to 42945088192512
> > ---
> > We have a SCSI target -> raid disk number correspondence.
> > As of this boot, the letter -> scsi target correspondences match,
> > shifted by one because as discussed 7:0:0:0 is no longer there (the
> > old, 'faulty' sdc).
> > Thus, having univocally determined the prior scsi target -> raid
> > position we can transpose it to the present drive letters, which are
> > shifted by one.
> > Therefore, we can generate, rectius have generated, a --create with
> > the same software versions, the same settings and the same drive
> > order. Is there any reason why, minus the 1.2 metadata overwriting
> > which should have only affected 12 blocks, the fs should 'not' be as
> > before?
> > Genuine question, mind.
> >
> > On Fri, Sep 9, 2022 at 5:48 PM Phil Turmel <philip@turmel.org> wrote:
> > >
> > > Reasonably likely, but not certain.
> > >
> > > Devices can be re-ordered by different kernels.  That's why lsdrv prints
> > > serial numbers in its tree.
> > >
> > > You haven't mentioned whether your --create operations specified
> > > --assume-clean.
> > >
> > > Also, be aware that shell expansion of something like /dev/sd[dcbaefgh]
> > > is sorted to /dev/sd[abcdefgh].  Use curly brace expansion with commas
> > > if you are taking shortcuts.
> > >
> > > On 9/9/22 17:01, Luigi Fabio wrote:
> > > > Another helpful datapoint, this is the boot *before* sdc got
> > > > --replaced with sdo:
> > > >
> > > > [   13.528395] md/raid:md123: device sdd1 operational as raid disk 5
> > > > [   13.528396] md/raid:md123: device sde1 operational as raid disk 9
> > > > [   13.528397] md/raid:md123: device sdg1 operational as raid disk 2
> > > > [   13.528398] md/raid:md123: device sdf1 operational as raid disk 1
> > > > [   13.528398] md/raid:md123: device sdh1 operational as raid disk 4
> > > > [   13.528399] md/raid:md123: device sdk1 operational as raid disk 3
> > > > [   13.528400] md/raid:md123: device sdj1 operational as raid disk 7
> > > > [   13.528401] md/raid:md123: device sdn1 operational as raid disk 10
> > > > [   13.528402] md/raid:md123: device sdi1 operational as raid disk 8
> > > > [   13.528402] md/raid:md123: device sdl1 operational as raid disk 6
> > > > [   13.528403] md/raid:md123: device sdm1 operational as raid disk 11
> > > > [   13.528403] md/raid:md123: device sdc1 operational as raid disk 0
> > > > [   13.531613] md/raid:md123: raid level 5 active with 12 out of 12
> > > > devices, algorithm 2
> > > > [   13.531644] md123: detected capacity change from 0 to 42945088192512
> > > >
> > > > This gives us, correct me if I am wrong of course, an exact
> > > > representation of what the array 'used to look like', with sdc1 then
> > > > replaced by sdo1 (8/225).
> > > >
> > > > Just some confirmation that the order should (?) be the one above.
> > > >
> > > > LF
> > > >
> > > > On Fri, Sep 9, 2022 at 4:32 PM Luigi Fabio <luigi.fabio@gmail.com> wrote:
> > > >>
> > > >> Thanks for reaching out, first of all. Apologies for the late reply,
> > > >> the brilliant (...) spam filter strikes again...
> > > >>
> > > >> On Thu, Sep 8, 2022 at 1:23 PM Phil Turmel <philip@turmel.org> wrote:
> > > >>> No, the moment of stupid was that you re-created the array.
> > > >>> Simultaneous multi-drive failures that stop an array are easily fixed
> > > >>> with --assemble --force.  Too late for that now.
> > > >> Noted for the future, thanks.
> > > >>
> > > >>> It is absurdly easy to screw up device order when re-creating, and if
> > > >>> you didn't specify every allocation and layout detail, the changes in
> > > >>> defaults over the years would also screw up your data.  And finally,
> > > >>> omitting --assume-clean would cause all of your parity to be
> > > >>> recalculated immediately, with catastrophic results if any order or
> > > >>> allocation attributes are wrong.
> > > >> Of course. Which is why I specified everything and why I checked the
> > > >> details with --examine and --detail and they match exactly, minus the
> > > >> metadata version because, well, I wasn't actually the one typing (it's
> > > >> a slightly complicated story.. I was reassembling by proxy on the
> > > >> phone) and I made an incorrect assumption about the person typing.
> > > >> There aren't, in the end, THAT many things to specify: RAID level,
> > > >> number of drives, order thereof, chunk size, 'layout' and metadata
> > > >> version. 0.90 doesn't allow before/after gaps so that should be it, I
> > > >> believe.
> > > >> Am I missing anything?
> > > >>
> > > >>> No, you just got lucky in the past.  Probably by using mdadm versions
> > > >>> that hadn't been updated.
> > > >> That's not quite it: I keep records of how arrays are built and match
> > > >> them, though it is true that I tend to update things as little as
> > > >> possible on production machines.
> > > >> One of the differences, this time, is that this was NOT a production
> > > >> machine. The other was that I was driving, dictating on the phone and
> > > >> was under a lot of pressure to get the thing back up ASAP.
> > > >> Nonetheless, I have an --examine of at least two drives from the
> > > >> previous setup so there should be enough information there to rebuild
> > > >> a matching array, I think?
> > > >>
> > > >>> You'll need to show us every command you tried from your history, and
> > > >>> full details of all drives/partitions involved.
> > > >>>
> > > >>> But I'll be brutally honest:  your data is likely toast.
> > > >> Well, let's hope it isn't. All mdadm commands were -o and
> > > >> --assume-clean, so in theory the only thing which HAS been written are
> > > >> the md blocks, unless I am mistaken and/or I read the docs
> > > >> incorrectly?
> > > >>
> > > >> That does, of course, leave the problem of the blocks overwritten by
> > > >> the 1.2 metadata, but as I read the docs that should be a very small
> > > >> number - let's say one 4096byte block (a portion thereof, to be
> > > >> pedantic, but ext4 doesn't really care?) per drive, correct?
> > > >>
> > > >> Background:
> > > >> Separate 2x SSD RAID 1 root (/dev/sda. /dev/sdb) on the MB (Supemicro
> > > >> X10 series)'s chipset SATA ports.
> > > >> All filesystems are ext4, data=journal, nodelalloc, the 'data' RAIDs
> > > >> have journals on another SSD RAID1 (one per FS, obviously).
> > > >> Data drives:
> > > >> 12 x 4'TB' Seagate drives, NC000n variety, on 2x LSI 2308 controllers,
> > > >> each with two four-drive ports (and one of these went DELIGHTFULLY
> > > >> missing)
> > > >>
> > > >> This is the layout of each drive:
> > > >> ---
> > > >> GPT fdisk (gdisk) version 1.0.6
> > > >> ...
> > > >> Found valid GPT with protective MBR; using GPT.
> > > >> Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
> > > >> Model: ST4000NC001-1FS1
> > > >> Sector size (logical/physical): 512/4096 bytes
> > > >> ...
> > > >> Total free space is 99949 sectors (48.8 MiB)
> > > >>
> > > >> Number  Start (sector)    End (sector)  Size       Code  Name
> > > >>     1            2048      7625195519   3.5 TiB     8300  Linux RAID volume
> > > >>     2      7625195520      7813939199   90.0 GiB    8300  Linux RAID backup
> > > >> ---
> > > >>
> > > >> So there were two RAID arrays. Both RAID5 - a main RAID called
> > > >> 'archive' which had the 12 x 3.5ish partitions sdx1 and a second array
> > > >> called backup which had 12 x 90 GB.
> > > >>
> > > >> A little further backstory: right before the event, one drive had been
> > > >> pulled because it had started failing. What I did was shut down the
> > > >> machine, put the failing drive on a MB port and put a new drive on the
> > > >> LSI controllers. I then brought the machine back online, did the
> > > >> --replace --with thing and this worked fine.
> > > >> At that point the faulty drive (/dev/sdc, MB drives come before the
> > > >> LSI drives in the count) got deleted via /sys/block.... and physically
> > > >> disconnected from the system, which was then happily running with
> > > >> /dev/sda and /dev/sdb as the root RAID SSDs and drives sdd -> sdo as
> > > >> the 'archive' drives.
> > > >> It went 96 hours or so like that under moderate load. Then the failure
> > > >> happened, the machine was rebooted thus the previous sdd -> sdo drives
> > > >> became sdc -> sdn drives.
> > > >> However, the relative order was, to the best of my knowledge,
> > > >> conserved - AND I still have the 'faulty' drive, so I could very
> > > >> easily put it back in to have everything match.
> > > >> Most importantly, this drive has on it, without a doubt, the details
> > > >> of the array BEFORE everything happened - by definition untouched
> > > >> because the drive was stopped and pulled before the event.
> > > >> I also have a cat of the --examine of two of the faulty drives BEFORE
> > > >> anything was written to them - thus, unless I am mistaken, these
> > > >> contained the md block details from 'before the event'.
> > > >>
> > > >> Here is one of them, taken after the reboot and therefore when the MB
> > > >> /dev/sdc was no longer there:
> > > >> ---
> > > >> /dev/sdc1:
> > > >>            Magic : a92b4efc
> > > >>          Version : 0.90.00
> > > >>             UUID : 2457b506:85728e9d:c44c77eb:7ee19756
> > > >>    Creation Time : Sat Mar 30 18:18:00 2019
> > > >>       Raid Level : raid5
> > > >>    Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> > > >>       Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> > > >>     Raid Devices : 12
> > > >>    Total Devices : 12
> > > >> Preferred Minor : 123
> > > >>
> > > >>      Update Time : Tue Sep  6 11:37:53 2022
> > > >>            State : clean
> > > >>   Active Devices : 12
> > > >> Working Devices : 12
> > > >>   Failed Devices : 0
> > > >>    Spare Devices : 0
> > > >>         Checksum : 391e325d - correct
> > > >>           Events : 52177
> > > >>
> > > >>           Layout : left-symmetric
> > > >>       Chunk Size : 128K
> > > >>
> > > >>        Number   Major   Minor   RaidDevice State
> > > >> this     5       8       49        5      active sync   /dev/sdd1
> > > >>
> > > >>     0     0       8      225        0      active sync
> > > >>     1     1       8       81        1      active sync   /dev/sdf1
> > > >>     2     2       8       97        2      active sync   /dev/sdg1
> > > >>     3     3       8      161        3      active sync   /dev/sdk1
> > > >>     4     4       8      113        4      active sync   /dev/sdh1
> > > >>     5     5       8       49        5      active sync   /dev/sdd1
> > > >>     6     6       8      177        6      active sync   /dev/sdl1
> > > >>     7     7       8      145        7      active sync   /dev/sdj1
> > > >>     8     8       8      129        8      active sync   /dev/sdi1
> > > >>     9     9       8       65        9      active sync   /dev/sde1
> > > >>    10    10       8      209       10      active sync   /dev/sdn1
> > > >>    11    11       8      193       11      active sync   /dev/sdm1
> > > >> ---
> > > >> Note that the drives are 'moved' because the old /dev/sdc isn't there
> > > >> any more, but the relative position should be the same - correct me if I
> > > >> am wrong. If you prefer, to get the 'new' drive letter you just subtract
> > > >> 16 from the minor of each of the drives.
> > > >>
> > > >> This is the 'new' --create
> > > >> ---
> > > >> /dev/sdc1:
> > > >>            Magic : a92b4efc
> > > >>          Version : 0.90.00
> > > >>             UUID : 79990944:0bb9420b:97d5a417:7d4e9ef8 (local to host beehive)
> > > >>    Creation Time : Tue Sep  6 15:15:03 2022
> > > >>       Raid Level : raid5
> > > >>    Used Dev Size : -482370688 (3635.98 GiB 3904.10 GB)
> > > >>       Array Size : 41938562688 (39995.73 GiB 42945.09 GB)
> > > >>     Raid Devices : 12
> > > >>    Total Devices : 12
> > > >> Preferred Minor : 123
> > > >>
> > > >>      Update Time : Tue Sep  6 15:15:03 2022
> > > >>            State : clean
> > > >>   Active Devices : 12
> > > >> Working Devices : 12
> > > >>   Failed Devices : 0
> > > >>    Spare Devices : 0
> > > >>         Checksum : ed12b96a - correct
> > > >>           Events : 1
> > > >>
> > > >>           Layout : left-symmetric
> > > >>       Chunk Size : 128K
> > > >>
> > > >>        Number   Major   Minor   RaidDevice State
> > > >> this     5       8       33        5      active sync   /dev/sdc1
> > > >>
> > > >>     0     0       8      209        0      active sync   /dev/sdn1
> > > >>     1     1       8       65        1      active sync   /dev/sde1
> > > >>     2     2       8       81        2      active sync   /dev/sdf1
> > > >>     3     3       8      145        3      active sync   /dev/sdj1
> > > >>     4     4       8       97        4      active sync   /dev/sdg1
> > > >>     5     5       8       33        5      active sync   /dev/sdc1
> > > >>     6     6       8      161        6      active sync   /dev/sdk1
> > > >>     7     7       8      129        7      active sync   /dev/sdi1
> > > >>     8     8       8      113        8      active sync   /dev/sdh1
> > > >>     9     9       8       49        9      active sync   /dev/sdd1
> > > >>    10    10       8      193       10      active sync   /dev/sdm1
> > > >>    11    11       8      177       11      active sync   /dev/sdl1
> > > >> ---
> > > >>
> > > >> If you put the layout lines side by side, it would seem to me that
> > > >> they match, modulo the '16' difference.
> > > >>
> > > >> This is the list of --create and --assemble commands from the 6th
> > > >> which involve the sdx1 partitions, those we care about right now -
> > > >> there were others involving /dev/md124 and the /dev/sdx2 partitions,
> > > >> which however are not relevant here:
> > > >> --
> > > >>   9813  mdadm --assemble /dev/md123 missing
> > > >>   9814  mdadm --assemble /dev/md123 missing /dev/sdf1 /dev/sdg1
> > > >> /dev/sdk1 /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1
> > > >> /dev/sdn1 /dev/sdm1
> > > >>   9815  mdadm --assemble /dev/md123 /dev/sdf1 /dev/sdg1 /dev/sdk1
> > > >> /dev/sdh1 /dev/sdd1 /dev/sdl1 /dev/sdj1 /dev/sdi1 /dev/sde1 /dev/sdn1
> > > >> /dev/sdm1
> > > >>   9823  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
> > > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdd1
> > > >> /dev/sdm1 /dev/sdl1
> > > >>   9824  mdadm --create -o -n 12 -l 5 /dev/md124 missing /dev/sde1
> > > >> /dev/sdf1 /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1
> > > >> /dev/sdd1 /dev/sdm1 /dev/sdl1
> > > >> ^^^^ note that these were the WRONG ARRAY - this was an unfortunate
> > > >> miscommunication which caused potential damage.
> > > >>   9852  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdd1 /dev/sdf1 /dev/sde1
> > > >> /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1 /dev/sdl1
> > > >>   9863  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > > >> --chunk=128 /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1 /dev/sdf1
> > > >> /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1 /dev/sdk1
> > > >> /dev/sdl1
> > > >>   9879  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sdc1 /dev/sdd1
> > > >> /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdj1 /dev/sdi1 /dev/sdm1 /dev/sdh1
> > > >> /dev/sdk1 /dev/sdl1
> > > >>   9889  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > > >> /dev/sdm1 /dev/sdl1
> > > >>   9892  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > > >> /dev/sdm1 /dev/sdl1
> > > >>   9895  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > > >> /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > > >> /dev/sdm1 /dev/sdl1
> > > >>   9901  mdadm --assemble /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > > >> /dev/sdl1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > > >> /dev/sdm1 /dev/sdl1
> > > >>   9903  mdadm --create -o --assume-clean -n 12 -l 5 --metadata=0.90
> > > >> --chunk=128 --bitmap=none /dev/md123 /dev/sdn1 /dev/sde1 /dev/sdf1
> > > >> /dev/sdj1 /dev/sdg1 /dev/sdc1 /dev/sdk1 /dev/sdi1 /dev/sdh1 /dev/sdd1
> > > >> /dev/sdm1 /dev/sdl1
> > > >> ---
> > > >>
> > > >> Note that they all were -o, therefore if I am not mistaken no parity
> > > >> data was written anywhere. Note further the fact that the first two
> > > >> were the 'mistake' ones, which did NOT have --assume-clean (but with
> > > >> -o this shouldn't make a difference AFAIK) and most importantly the
> > > >> metadata was the 1.2 default AND they were the wrong array in the
> > > >> first place.
> > > >> Note also that the 'final' --create commands also had --bitmap=none to
> > > >> match the original array, though according to the docs the bitmap
> > > >> space in 0.90 (and 1.2?) is in a space which does not affect the data
> > > >> in the first place.
> > > >>
> > > >> Now, first of all a question: if I get the 'old' sdc, the one that was
> > > >> taken out prior to this whole mess, onto a different system in order
> > > >> to examine it, the modern mdraid auto discovery should NOT overwrite
> > > >> the md data, correct? Thus I should be able to double-check the drive
> > > >> order on that as well?
> > > >>
> > > >> Any other pointers, insults etc are of course welcome.
> > >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10  1:29             ` Luigi Fabio
@ 2022-09-10 15:18               ` Phil Turmel
  2022-09-10 19:30                 ` Luigi Fabio
  2022-09-12 19:06                 ` Phillip Susi
  0 siblings, 2 replies; 24+ messages in thread
From: Phil Turmel @ 2022-09-10 15:18 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: linux-raid

Hi Luigi,

Mixed in responses (and trimmed):

On 9/9/22 18:50, Luigi Fabio wrote:
 > By different kernels, maybe - but the kernel has been the same for
 > quite a while (months).

Yes.  Same kernels are pretty repeatable for device order on bootup as 
long as all are present.  Anything missing will shift the letter 
assignments.

 > I did paste the whole of the command lines in the (very long) email,
 > as David mentions (thanks!) - the first ones, the mistaken ones, did
 > NOT have --assume-clean but they did have -o, so no parity activity
 > should have started according to the docs?

Okay, that should have saved you.  Except, I think it still writes all 
the meta-data.  With v1.2, that would sparsely trash up to 1/4 gig at 
the beginning of each device.

 > A new thought came to mind: one of the HBAs lost a channel, right?
 > What if on the subsequent reboot the devices that were on that channel
 > got 'rediscovered' and shunted to the end of the letter order? That
 > would, I believe, be ordinary operating procedure.

Well, yes.  But it doesn't matter for assembly attempts, which always go by 
the meta-data.  Device order only ever matters for --create when recreating.

 > That would give us an almost-correct array, which would explain how
 > fsck can get ... some pieces.

If you consistently used -o or --assume-clean, then everything beyond 
~3G should be untouched, if you can get the order right.  Have fsck try 
backup superblocks way out.

 > Also, I am not quite brave enough (...) to use shortcuts when handling
 > mdadm commands.

That's good.  But curly braces are safe.

 > I am reconstructing the port order (scsi targets, if you prefer) from
 > the 20220904 boot log. I should at that point be able to have an exact
 > order of the drives.

Please use lsdrv to capture names versus serial numbers.  Re-run it 
before any --create operation to ensure the current names really do 
match the expected serial numbers.  Keep track of ordering information 
by serial number.  Note that lsdrv will reliably line up PHYs on SAS 
controllers, so that can be trusted, too.
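
(If lsdrv isn't to hand for a moment, something along these lines captures 
a name <-> serial snapshot too - a rough sketch only, and it won't give you 
the PHY mapping that lsdrv does; sdX is a placeholder:)

---
# record the current name <-> serial mapping before touching anything
lsblk -d -o NAME,SERIAL,MODEL,SIZE,WWN > drives-$(date +%Y%m%d-%H%M).txt
# cross-check a single drive the long way if in doubt
udevadm info --query=property --name=/dev/sdX | grep -E 'ID_SERIAL|ID_WWN'
---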

 > Here it is:

[trim /]

 > We have a SCSI target -> raid disk number correspondence.
 > As of this boot, the letter -> scsi target correspondences match,
 > shifted by one because as discussed 7:0:0:0 is no longer there (the
 > old, 'faulty' sdc).

OK.

 > Thus, having univocally determined the prior scsi target -> raid
 > position we can transpose it to the present drive letters, which are
 > shifted by one.
 > Therefore, we can generate, rectius have generated, a --create with
 > the same software versions, the same settings and the same drive
 > order. Is there any reason why, minus the 1.2 metadata overwriting
 > which should have only affected 12 blocks, the fs should 'not' be as
 > before?
 > Genuine question, mind.

Superblocks other than 0.9x and 1.0 place a bad block log and a 
write-intent bitmap between the superblock and the data area.  I'm not 
sure if any of the remaining space is wiped.  These would be written 
regardless of -o or --assume-clean.  Those flags "protect" the *data 
area* of the array, not the array's own metadata.
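
(For reference, on a v1.2 member those regions show up directly in the 
examine output - field names as printed by reasonably recent mdadm, with 
sdX1 a placeholder:)

---
mdadm --examine /dev/sdX1 | grep -Ei 'data offset|super offset|unused space|bad block log|bitmap'
# Data Offset / Super Offset show where the data area starts relative to
# the superblock; Unused Space and Bad Block Log cover the region written
# at --create time.
---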

On 9/9/22 19:04, Luigi Fabio wrote:
 > A further question, in THIS boot's log I found:
 > [ 9874.709903] md/raid:md123: raid level 5 active with 12 out of 12
 > devices, algorithm 2
 > [ 9874.710249] md123: bitmap file is out of date (0 < 1) -- forcing
 > full recovery
 > [ 9874.714178] md123: bitmap file is out of date, doing full recovery
 > [ 9874.881106] md123: detected capacity change from 0 to 42945088192512
 > From, I think, the second --create of /dev/123, before I added the
 > bitmap=none. This should, however, not have written anything with -o
 > and --assume-clean, correct?

False assumption.  As described above.

On 9/9/22 21:29, Luigi Fabio wrote:
 > For completeness' sake, though it should not be relevant, here is the
 > error that caused the mishap:

[trim /]

Noted, and helpful for correlating device names to PHYs.

Okay.  To date, you've only done create with -o or --assume-clean?

If so, it is likely your 0.90 superblocks are still present at the ends 
of the disks.

You will need to zero the v1.2 superblocks that have been placed on your 
partitions.  Then attempt an --assemble and see if mdadm will deliver 
the same message as before, identifying all of the members, but refusing 
to proceed due to event counts.

If so, repeat with --force.

This procedure is safe to do without overlays, and will likely yield a 
running array.
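
Roughly this shape - just a sketch, not a recipe, with device names carried 
over from your earlier command history as placeholders; verify every member 
against --examine and serial numbers before wiping anything:

---
wipefs --no-act /dev/sdc1                   # list md signatures and byte offsets, touches nothing
# the v1.2 superblock magic sits 4096 bytes in; a 0.90 one lives near the end of the partition
# once the listed offsets confirm that, erase only the v1.2 signature, one member at a time:
# wipefs --offset 4096 /dev/sdc1
mdadm --assemble /dev/md123 /dev/sd[c-n]1   # then try a plain assemble
# if it names all members but stops on event counts:
# mdadm --assemble --force /dev/md123 /dev/sd[c-n]1
---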

Then you will have to fsck to fix up the borked beginning of your 
filesystem.

Phil


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 15:18               ` Phil Turmel
@ 2022-09-10 19:30                 ` Luigi Fabio
  2022-09-10 19:55                   ` Luigi Fabio
  2022-09-12 19:06                 ` Phillip Susi
  1 sibling, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-10 19:30 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Hello Phil,
thank you BTW for your continued assistance. Here goes:

On Sat, Sep 10, 2022 at 11:18 AM Phil Turmel <philip@turmel.org> wrote:
> Yes.  Same kernels are pretty repeatable for device order on bootup as
> long as all are present.  Anything missing will shift the letter
> assignments.
We need to keep this in mind, though the described boot log scsi
target -> letter assignments seem to indicate that we're clear, as
discussed. This is relevant since I have re-created the array.

> Okay, that should have saved you.  Except, I think it still writes all
> the meta-data.  With v1.2, that would sparsely trash up to 1/4 gig at
> tbe beginning of each device.
I dug into the docs and the wiki and ran some experiments on another
machine. Apparently, what 1.2 does with my kernel and my mdadm is use
sectors 9 to 80 of each device. Thus, it borked 72 512-byte sectors ->
36 kB -> 9 ext4 blocks (4 KiB each) per device, sparsely as you say.
This is 'fine' even with a 128 kB chunk: the first affected area doesn't
really matter because, yes, fsck detects that it nuked the block group
descriptors, but the superblock before them is fine (indeed, tune2fs
and dumpe2fs work 'as expected'), so it goes to a backup and is
happy, even declaring the fs clean.
Therefore, out of the 12 'affected' areas, one doesn't matter for
practical purposes and we have to wonder about the others.  Arguably,
one of those should also be covered by parity, but I have no idea how
that will work out - it may actually matter a great deal at the time of
any future resync.
Now, these are all in the first chunk of each device, which together
form the first 1408 kB of the filesystem (128 kB chunk - remember the
original creation is *old*), since I believe mdraid preserves
sequence and the chunks are therefore in order.
We know the following from dumpe2fs:
---
Group 0: (Blocks 0-32767) csum 0x45ff [ITABLE_ZEROED]
  Primary superblock at 0, Group descriptors at 1-2096
  Block bitmap at 2260 (+2260), csum 0x824f8d47
  Inode bitmap at 2261 (+2261), csum 0xdadef5ad
  Inode table at 2262-2773 (+2262)
  0 free blocks, 8179 free inodes, 2 directories, 8179 unused inodes
---
So the first 2097 blocks are the superblock plus group descriptors,
all of which have backups elsewhere - this is *way* more than the
1408 kB, therefore with restored BGDs (e2fsck -b 32768, say) we should
be... fine?
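
(Concretely, the read-only check I have in mind is roughly this - 32768
being the first backup superblock for a 4 KiB-block filesystem; I'd
confirm the real locations first:)

---
dumpe2fs /dev/md123 | grep -i 'backup superblock'   # where the backups and their BGD copies live
e2fsck -n -b 32768 -B 4096 /dev/md123               # read-only pass against the first backup
---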

Now, if OTOH I do an -nf, all sorts of weird stuff happens but I have
to wonder whether that's because the BGDs are not happy. I am tempted
to run an overlay *for the fsck*, what do you think?

> Well, yes.  But doesn't matter for assembly attempts, with always go by
> the meta-data.  Device order only ever matters for --create when recreating.
Sure, but keep in mind that my --create commands nuked the original 0.90
metadata as well, so we need to be sure the order is correct or
we'll have a real jumble.
Now, the cables have not been moved and the boot logs confirm that the
scsi targets correspond, so we should have the order correct, and the
parameters are correct from the previous logs. Therefore, we 'should'
have the same data space as before.

> If you consistently used -o or --assume-clean, then everything beyond
> ~3G should be untouched, if you can get the order right.  Have fsck try
> backup superblocks way out.
fsck grabs a backup 'magically' and seems to be happy - unless I -nf it,
and then... all sorts of bad stuff happens.

> Please use lsdrv to capture names versus serial numbers.  Re-run it
> before any --create operation to ensure the current names really do
> match the expected serial numbers.  Keep track of ordering information
> by serial number.  Note that lsdrv will reliably line up PHYs on SAS
> controllers, so that can be trusted, too.
Thing is... I can't find lsdrv. As in: there is no lsdrv binary,
apparently, in Debian stable or in Debian testing. Where do I look for
it?

> Superblocks other than 0.9x and 1.0 place a bad block log and a written
> block bitmap between the superblock and the data area.  I'm not sure if
> any of the remain space is wiped.  These would be written regardless of
> -o or --assume-clean.  Those flags "protect" the *data area* of the
> array, not the array's own metadata.
Yes - this is the damage I'm talking about above. From the logs, the
'area' is 4096 sectors, of which 4016 remain 'unused'. Therefore 80
sectors, with the first 8 not being touched (and the proof is that the
superblock is 'happy', though interestingly this should not be the
case, because the group 0 superblock is offset by 1024 bytes -> the last
1024 bytes of the superblock should be borked too).
From this, my math above.


>  > From, I think, the second --create of /dev/123, before I added the
>  > bitmap=none. This should, however, not have written anything with -o
>  > and --assume-clean, correct?
> False assumption.  As described above.
Two different things: what I meant was that even with that bitmap
message, the only thing that would have been written is the metadata.
The linux-raid documentation states repeatedly that with -o no resyncing
or parity reconstruction is performed. Yes, agreed, the 1.2
metadata got written, but it's the only thing that got written from
when the array was stopped by the error - if I am reading the docs
correctly?

> Okay.  To date, you've only done create with -o or --assume-clean?
>
> If so, it is likely your 0.90 superblocks are still present at the ends
> of the disks.
Problem is, if you look at my previous email, as I mentioned above I
have ALSO done --create with --metadata=0.90, which overwrote the
original blocks.
HOWEVER, I do have the logs of the original parameters and I have at
least one drive - the old sdc - which was spit out before this whole
thing, which becomes relevant to confirm that the parameter log is
correct (multiple things seem to coincide, so I think we're OK there).

Given all the above, however, if we get the parameters to match we
should get a filesystem that corresponds to the pre-event state beyond
the first 1408 kB - and those first 1408 kB don't matter insofar as ext4
keeps redundant backups for at least the first 2097 blocks >> 1408 kB.

The thing that I do NOT understand is that if this is the case, fsck
with -b <high> should render a FS without any errors... therefore why
am I getting inode metadata checksum errors? This is why I had
originally posted in linux-ext4 ...

Thanks,
L

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 19:30                 ` Luigi Fabio
@ 2022-09-10 19:55                   ` Luigi Fabio
  2022-09-10 20:12                     ` Luigi Fabio
                                       ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Luigi Fabio @ 2022-09-10 19:55 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Well, I found SOMETHING of decided interest: when I run dumpe2fs with
any backup superblock, this happens:

---
Filesystem created:       Tue Nov  4 08:56:08 2008
Last mount time:          Thu Aug 18 21:04:22 2022
Last write time:          Thu Aug 18 21:04:22 2022
---
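
(That is with something along the lines of the following, repeated for
each backup location:)

---
dumpe2fs -o superblock=32768 -o blocksize=4096 /dev/md123 | grep -Ei 'created|mount time|write time'
---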

So the backups have not been updated since boot-before-last? That
would explain why, when fsck tries to use those backups, it comes up
with funny results.

Is this... as intended, I wonder? Does it also imply that any file
that was written to after Aug 18th will be in an indeterminate state?
That would seem to be the implication.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 19:55                   ` Luigi Fabio
@ 2022-09-10 20:12                     ` Luigi Fabio
  2022-09-10 20:15                       ` Phil Turmel
  2022-09-10 20:14                     ` Phil Turmel
  2022-09-12 19:09                     ` Phillip Susi
  2 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-10 20:12 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Following up, I found:

>> The backup ext4 superblocks are never updated by the kernel, only after
>> a successful e2fsck, tune2fs, resize2fs, or other userspace operation.
>>
>> This avoids clobbering the backups with bad data if the kernel has a bug
>> or device error (e.g. bad cable, HBA, etc).
So therefore, if we restore a backup superblock (and its attendant
data) what happens to any FS structure that was written to *after*
that time? That is to say, in this case, after Aug 18th?
Is the system 'smart enough' to do something or will I have a big fat
mess? I mean, reversion to 08/18 would be great, but I can't imagine
that the FS can do that, it would have to have copies of every inode.

This does explain how I get so many errors when fsck grabs the backup
superblock... the RAID part we solved just fine, it's the rest we have
to deal with.

Ideas are welcome.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 19:55                   ` Luigi Fabio
  2022-09-10 20:12                     ` Luigi Fabio
@ 2022-09-10 20:14                     ` Phil Turmel
  2022-09-10 20:17                       ` Phil Turmel
  2022-09-12 19:09                     ` Phillip Susi
  2 siblings, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2022-09-10 20:14 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: linux-raid

Hi Luigi,


On 9/10/22 15:55, Luigi Fabio wrote:
> Well, I found SOMETHING of decided interest: when I run dumpe2fs with
> any backup superblock, this happens:
> 
> ---
> Filesystem created:       Tue Nov  4 08:56:08 2008
> Last mount time:          Thu Aug 18 21:04:22 2022
> Last write time:          Thu Aug 18 21:04:22 2022
> ---
> 
> So the backups have not been updated since boot-before-last? That
> would explain why, when fsck tries to use those backups, it comes up
> with funny results.

Interesting.

> Is this ...as intended, I wonder? Does it also imply that any file
> that was written to > aug 18th will be in an indeterminate state? That
> would seem to be the implication.

Hmm.  I wouldn't have thought so, but maybe the backup blocks don't get 
updated as often?

{ I think you are about as far as I would have gotten myself, if I 
allowed myself to get there. }

Phil

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 20:12                     ` Luigi Fabio
@ 2022-09-10 20:15                       ` Phil Turmel
  0 siblings, 0 replies; 24+ messages in thread
From: Phil Turmel @ 2022-09-10 20:15 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: linux-raid

Do the fsck with an overlay in place.

I suspect the data in the inodes will provide corroboration for newer 
data in the various structures.  I think your odds are good, now.
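
Something like this is the usual shape of it (sizes and names are 
placeholders; the overlay file just needs to be big enough to absorb 
whatever fsck decides to write):

---
truncate -s 50G /tmp/md123-overlay.img            # sparse file that will catch the writes
loop=$(losetup -f --show /tmp/md123-overlay.img)
size=$(blockdev --getsz /dev/md123)               # array size in 512-byte sectors
dmsetup create md123_cow --table "0 $size snapshot /dev/md123 $loop N 8"
e2fsck -f /dev/mapper/md123_cow                   # every fix lands in the overlay, not the array
# tear down afterwards:  dmsetup remove md123_cow && losetup -d "$loop"
---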

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 20:14                     ` Phil Turmel
@ 2022-09-10 20:17                       ` Phil Turmel
  2022-09-10 20:24                         ` Luigi Fabio
  0 siblings, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2022-09-10 20:17 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: linux-raid

Oh, one more thing:

If you had followed any of the advice on the linux-raid wiki, you'd have 
been pointed to my lsdrv project on github:

https://github.com/pturmel/lsdrv

(Still just python2, sorry.)



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 20:17                       ` Phil Turmel
@ 2022-09-10 20:24                         ` Luigi Fabio
  2022-09-10 20:54                           ` Luigi Fabio
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-10 20:24 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Phil,
I did indeed go there, but, stupidly, after the fact and I had missed
the reference to your tool. Not an excuse, but the initial part of the
process was, as I mentioned, complicated and done while driving....

I'll download lsdrv and snapshot the situation in any case, generate
the overlay, run the fsck and see what happens.

I'll report back when it's done, which is probably going to be
tomorrow (fsck times for this filesystem historically have been in the
9+ hr range - and the overlay will probably do us no favours
performancewise).

Back as soon as I have further data. Thank you again for the help.

L

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 20:24                         ` Luigi Fabio
@ 2022-09-10 20:54                           ` Luigi Fabio
  0 siblings, 0 replies; 24+ messages in thread
From: Luigi Fabio @ 2022-09-10 20:54 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

I will be very, very brief: it works.

I put on the overlay, did the first fsck, which said it would try the
backup blocks and then complained about not being able to set
superblock flags and stopped.

At that point, since it said that the FS was modified, I assumed that
it had rewritten the block group descriptors that were damaged. I tried
mounting the filesystem -o ro without touching it further and... it
works. The files are there, including the newest ones, and directory
connectivity is correct as far as several tests can tell....

Of course, I will treat it as a 'damaged' fs, get the files off of
there onto a new array, then try the further fsck and see what happens
for curiosity's sake, but as far as I am concerned this array is no
longer going to be live.

Which is just fine, I got, I believe, what I wanted.

Thank you very much for all your help - I plan to provide a final
update once the copy is done etc.

Let me know where to send scotch. Much deserved.

L

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 15:18               ` Phil Turmel
  2022-09-10 19:30                 ` Luigi Fabio
@ 2022-09-12 19:06                 ` Phillip Susi
  2022-09-13  4:02                   ` Luigi Fabio
  1 sibling, 1 reply; 24+ messages in thread
From: Phillip Susi @ 2022-09-12 19:06 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Luigi Fabio, linux-raid


Phil Turmel <philip@turmel.org> writes:

> Yes.  Same kernels are pretty repeatable for device order on bootup as
> long as all are present.  Anything missing will shift the letter 
> assignments.

Every time I think about this I find myself amazed that it does seem to
be so stable, and wonder how that can be.  The drives are all enumerated
in parallel these days, so the order they get assigned in should be a
total crap shoot, shouldn't it?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-10 19:55                   ` Luigi Fabio
  2022-09-10 20:12                     ` Luigi Fabio
  2022-09-10 20:14                     ` Phil Turmel
@ 2022-09-12 19:09                     ` Phillip Susi
  2022-09-13  3:58                       ` Luigi Fabio
  2 siblings, 1 reply; 24+ messages in thread
From: Phillip Susi @ 2022-09-12 19:09 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: Phil Turmel, linux-raid


Luigi Fabio <luigi.fabio@gmail.com> writes:

> Well, I found SOMETHING of decided interest: when I run dumpe2fs with
> any backup superblock, this happens:
>
> ---
> Filesystem created:       Tue Nov  4 08:56:08 2008
> Last mount time:          Thu Aug 18 21:04:22 2022
> Last write time:          Thu Aug 18 21:04:22 2022
> ---
>
> So the backups have not been updated since boot-before-last? That
> would explain why, when fsck tries to use those backups, it comes up
> with funny results.


That's funny.  IIRC, the backups virtually never get updated.  The only
thing e2fsck needs to get from them is the location of the inode tables
and block groups, and that does not change during the life of the
filesystem.

I might have something tickling the back of my memory that when e2fsck
is run, it updates the first backup superblock, but the others never get
updated.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-12 19:09                     ` Phillip Susi
@ 2022-09-13  3:58                       ` Luigi Fabio
  2022-09-13 12:47                         ` Phillip Susi
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-13  3:58 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Phil Turmel, linux-raid

On Mon, Sep 12, 2022 at 3:12 PM Phillip Susi <phill@thesusis.net> wrote:
> That's funny.  IIRC, the backups virtually never get updated.  The only
> thing e2fsck needs to get from them is the location of the inode tables
> and block groups, and that does not change during the life of the
> filesystem.
>
> I might have something tickling the back of my memory that when e2fsck
> is run, it updates the first backup superblock, but the others never got
> updated.
The way I have found it explained in multiple places is that the
backups only get updated as a consequence of an actual userspace
interaction. So you have to run fsck or at least change settings in
tune2fs, for instance, or resize2fs ... then all the backups get
updated.
The jury is still out on whether automated fscks - for those lunatics
who haven't disabled them - update or not. There is conflicting
information.

LF

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-12 19:06                 ` Phillip Susi
@ 2022-09-13  4:02                   ` Luigi Fabio
  2022-09-13 12:51                     ` Phillip Susi
  0 siblings, 1 reply; 24+ messages in thread
From: Luigi Fabio @ 2022-09-13  4:02 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Phil Turmel, linux-raid

On Mon, Sep 12, 2022 at 3:09 PM Phillip Susi <phill@thesusis.net> wrote:
> Every time I think about this I find myself amayzed that it does seem to
> be so stable, and wonder how that can be.  The drives are all enumerated
> in paralell these days so the order they get assigned in should be a
> total crap shoot, shouldn't it?
Well, there are several possible explanations, but persistence is
desirable - so evidently enumeration occurs according to controller
order in a repeatable way until something changes in the configuration
- or until you change kernels, or someone does something funny with a
driver and the order changes. In 28 years of using Linux, however,
this has happened... rarely, save for before things were sensible WAY
back when.

LF

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-13  3:58                       ` Luigi Fabio
@ 2022-09-13 12:47                         ` Phillip Susi
  0 siblings, 0 replies; 24+ messages in thread
From: Phillip Susi @ 2022-09-13 12:47 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: Phil Turmel, linux-raid


Luigi Fabio <luigi.fabio@gmail.com> writes:

> The way I have found it explained in multiple places is that the
> backups only get updated as a consequence of an actual userspace
> interaction. So you have to run fsck or at least change settings in
> tune2fs, for instance, or resize2fs ... then all the backups get
> updated.

Exactly.  Changing the filesystem with tune2fs or resize2fs requires
that all of the backups be updated.

> The jury is still out on whether automated fscks - for those lunatics
> who haven't disabled them - update or not. There is conflicting
> information.

IIRC, a preen ( the automatic fsck at boot ) normally just sees that the
dirty flag is not set ( since the filesystem was cleanly unmounted,
right? ), and doesn't do anything else.  If there was an unclean
shutdown though, and a real fsck is run, then it updates the first
backup.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RAID5 failure and consequent ext4 problems
  2022-09-13  4:02                   ` Luigi Fabio
@ 2022-09-13 12:51                     ` Phillip Susi
  0 siblings, 0 replies; 24+ messages in thread
From: Phillip Susi @ 2022-09-13 12:51 UTC (permalink / raw)
  To: Luigi Fabio; +Cc: Phil Turmel, linux-raid


Luigi Fabio <luigi.fabio@gmail.com> writes:

> Well, there are several possible explanations, but persistence is
> desireable - so evidently enumeration occurs according to controller
> order in a repeatable way until something changes in the configuration
> - or until you change kernel, someone does something funny with a
> driver and the order changes. In 28 years of using Linux, however,
> this has happened.. rarely, save for before things were sensible WAY
> back when.

I *think* it is only because the probes are all *started* in the natural
order, so as long as the drives all respond in the same, short amount of
time, you get no surprises.  If one drive decides to take a little
longer to answer today though, it can throw things off.


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-09-13 12:57 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-08 14:51 RAID5 failure and consequent ext4 problems Luigi Fabio
2022-09-08 17:23 ` Phil Turmel
2022-09-09 20:32   ` Luigi Fabio
2022-09-09 21:01     ` Luigi Fabio
2022-09-09 21:48       ` Phil Turmel
2022-09-09 22:11         ` David T-G
2022-09-09 22:50         ` Luigi Fabio
2022-09-09 23:04           ` Luigi Fabio
2022-09-10  1:29             ` Luigi Fabio
2022-09-10 15:18               ` Phil Turmel
2022-09-10 19:30                 ` Luigi Fabio
2022-09-10 19:55                   ` Luigi Fabio
2022-09-10 20:12                     ` Luigi Fabio
2022-09-10 20:15                       ` Phil Turmel
2022-09-10 20:14                     ` Phil Turmel
2022-09-10 20:17                       ` Phil Turmel
2022-09-10 20:24                         ` Luigi Fabio
2022-09-10 20:54                           ` Luigi Fabio
2022-09-12 19:09                     ` Phillip Susi
2022-09-13  3:58                       ` Luigi Fabio
2022-09-13 12:47                         ` Phillip Susi
2022-09-12 19:06                 ` Phillip Susi
2022-09-13  4:02                   ` Luigi Fabio
2022-09-13 12:51                     ` Phillip Susi
