* Failed mdadm RAID array after aborted Grow operation
@ 2022-05-08 13:18 Bob Brand
  2022-05-08 15:32 ` Wols Lists
  0 siblings, 1 reply; 23+ messages in thread
From: Bob Brand @ 2022-05-08 13:18 UTC (permalink / raw)
  To: linux-raid

Hi,

I’m somewhat new to Linux and mdadm although I’ve certainly learnt a lot
over the last 24 hours.

I have a SuperMicro server running CentOS 7 (3.10.0-1160.11.1.el7.x86_64)
with mdadm version 4.1 (2018-10-01) that was happily running 30 8 TB
disks in a RAID6 configuration.  (It also has boot and root on a RAID1
array; the RAID6 array is solely for data.)  It was, however, starting
to run out of space, so I investigated adding more drives to the array
(the chassis can hold a total of 45 drives).

Since this device is no longer under support, obtaining the same drives as
it already contained wasn’t an option and the supplier couldn’t guarantee
that they could supply compatible drives.  We did come to an arrangement
where I would try one drive and, if it didn’t work, I could return any
unopened units.

I spent ages ensuring that the drives he'd suggested were as compatible
as possible, and I based the specs of the existing drives on the invoice
for the entire system.  This turned out to be a mistake: the invoice
stated they were 512e drives but, as I discovered after the new drives
had arrived and I was doing a final check, the existing drives were
actually 4096-byte-sector (4Kn) drives.  Of course the new drives were
512e.  Bother!  After a lot more reading I found that it might be
possible to reformat the new drives from 512e to 4Kn using sg_format.
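
For reference, one way to check a drive's current logical and physical
sector sizes before and after such a conversion (the device name is a
placeholder; sg_readcap comes from the same sg3_utils package as
sg_format):

     lsblk -d -o NAME,LOG-SEC,PHY-SEC,SIZE,MODEL /dev/sd<x>
     sg_readcap --long /dev/sd<x>   # reports the logical block length, among other fields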

I installed the test drive and proceeded to see whether it was possible
to format it to 4096-byte sectors using the command
sg_format --size=4096 /dev/sd<x>.  All was proceeding smoothly until my
ssh session terminated when a faulty docking station killed my Ethernet
connection.
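
For a long-running format like this, running it inside a detachable
session (or at least under nohup) avoids exactly this failure mode.  A
sketch, assuming tmux or nohup is available on the server and with the
device name as a placeholder:

     # Option 1: a detachable session (screen works the same way)
     tmux new -s fmt                              # then, inside the session:
     sg_format --format --size=4096 /dev/sd<x>    # --format actually performs the format; see sg_format(8)
     # detach with Ctrl-b d, reattach later with: tmux attach -t fmt

     # Option 2: no tmux/screen available
     nohup sg_format --format --size=4096 /dev/sd<x> > /root/sg_format.log 2>&1 &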

So I logged onto the console and restarted the sg_format, which
completed OK, sort of: it did convert the disk to 4096-byte sectors, but
it threw an I/O error or two.  They didn't seem too concerning and I
figured that, if there was a problem, it would show up in the next
couple of steps.  I've since discovered the dmesg log, which indicated
that there were significantly more I/O errors than I had thought.
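
When a format throws I/O errors, it is usually worth checking the
drive's own view of its health before trusting it with array data.  A
sketch, assuming smartmontools is installed and with the device name as
a placeholder:

     smartctl -a /dev/sd<x>        # check reallocated/pending sector counts and the error log
     dmesg -T | grep -i 'sd<x>'    # kernel-side I/O errors for that device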

Anyway, since sg_format appeared to complete OK, I moved on to the next
stage, which was to partition the disk with the following commands:

     parted -a optimal /dev/sd<x>
     (parted) mklabel msdos
     (parted) mkpart primary 2048s 100% (need to check that the start is
correct)
     (parted) align-check optimal 1 (verify alignment of partition 1)
     (parted) set 1 raid on (set the FLAG to RAID)
     (parted) print
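
For reference, roughly the same sequence can also be run
non-interactively in a single parted invocation; the device name is a
placeholder and the label, start and flag values simply mirror the
interactive steps above:

     parted -s -a optimal /dev/sd<x> mklabel msdos mkpart primary 2048s 100% set 1 raid on
     parted -s /dev/sd<x> print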


Unfortunately, I don't have the results of the print command, as my
laptop unexpectedly shut down overnight (it hasn't been a good weekend),
but the partitioning appeared to complete without incident.

I then added the new disk to the array:

     mdadm --add /dev/md125 /dev/sd<x>


And it completed without any problems.

I then proceeded to grow the array:

     mdadm --grow --raid-devices=31 --backup-file=/grow_md125.bak
/dev/md125


I monitored this with cat /proc/mdstat; it showed that the array was
reshaping, but the speed was 0K/sec and the reshape never progressed
past 0%.

# cat /proc/mdstat produced:

     Personalities : [raid1] [raid6] [raid5] [raid4]
     md125 : active raid6 sdab1[30] sdw1[26] sdc1[6] sdm1[16] sdi1[12]
sdz1[29] sdh1[11] sdg1[10] sds1[22] sdf1[9] sdq1[20] sdaa1[1] sdo1[18]
sdu1[24] sdb1[5] sdae1[4] sdl1[15] sdj1[13] sdn1[17] sdp1[19] sdv1[25]
sde1[8]    sdd1[7] sdr1[21] sdt1[23] sdx1[27] sdad1[3] sdac1[2] sdy1[28]
sda1[0] sdk1[14]
          218789036032 blocks super 1.2 level 6, 512k chunk, algorithm 2
[31/31] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU]
           [>....................]  reshape =  0.0% (1/7813894144)
finish=328606806584.3min speed=0K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk

     md126 : active raid1 sdaf1[0] sdag1[1]
           100554752 blocks super 1.2 [2/2] [UU]
           bitmap: 1/1 pages [4KB], 65536KB chunk

     md127 : active raid1 sdaf3[0] sdag2[1]
           976832 blocks super 1.0 [2/2] [UU]
           bitmap: 0/1 pages [0KB], 65536KB chunk

     unused devices: <none>


# mdadm --detail /dev/md125 produced:
/dev/md125:
           Version : 1.2
     Creation Time : Wed Sep 13 15:09:40 2017
        Raid Level : raid6
        Array Size : 218789036032 (203.76 TiB 224.04 TB)
     Used Dev Size : 7813894144 (7.28 TiB 8.00 TB)
      Raid Devices : 31
     Total Devices : 31
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun May  8 00:47:35 2022
             State : clean, reshaping
    Active Devices : 31
   Working Devices : 31
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

    Reshape Status : 0% complete
     Delta Devices : 1, (30->31)

              Name : localhost.localdomain:SW-RAID6
              UUID : f9b65f55:5f257add:1140ccc0:46ca6c19
            Events : 1053617

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1      65      161        1      active sync   /dev/sdaa1
       2      65      193        2      active sync   /dev/sdac1
       3      65      209        3      active sync   /dev/sdad1
       4      65      225        4      active sync   /dev/sdae1
       5       8       17        5      active sync   /dev/sdb1
       6       8       33        6      active sync   /dev/sdc1
       7       8       49        7      active sync   /dev/sdd1
       8       8       65        8      active sync   /dev/sde1
       9       8       81        9      active sync   /dev/sdf1
      10       8       97       10      active sync   /dev/sdg1
      11       8      113       11      active sync   /dev/sdh1
      12       8      129       12      active sync   /dev/sdi1
      13       8      145       13      active sync   /dev/sdj1
      14       8      161       14      active sync   /dev/sdk1
      15       8      177       15      active sync   /dev/sdl1
      16       8      193       16      active sync   /dev/sdm1
      17       8      209       17      active sync   /dev/sdn1
      18       8      225       18      active sync   /dev/sdo1
      19       8      241       19      active sync   /dev/sdp1
      20      65        1       20      active sync   /dev/sdq1
      21      65       17       21      active sync   /dev/sdr1
      22      65       33       22      active sync   /dev/sds1
      23      65       49       23      active sync   /dev/sdt1
      24      65       65       24      active sync   /dev/sdu1
      25      65       81       25      active sync   /dev/sdv1
      26      65       97       26      active sync   /dev/sdw1
      27      65      113       27      active sync   /dev/sdx1
      28      65      129       28      active sync   /dev/sdy1
      29      65      145       29      active sync   /dev/sdz1
      30      65      177       30      active sync   /dev/sdab1


NOTE: the new disk is /dev/sdab
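
For a reshape that appears stuck at 0%, the md sysfs files are the usual
place to look before resorting to a reboot.  A sketch of the sort of
checks that apply here (array name taken from above; the 8192 value is
just an example):

     cat /sys/block/md125/md/sync_action       # should read "reshape"
     cat /sys/block/md125/md/sync_completed    # sectors completed / total
     cat /sys/block/md125/md/sync_max          # "max" unless progress has been capped
     cat /sys/block/md125/md/stripe_cache_size # often raised (e.g. to 8192) to speed up RAID6 reshapes
     cat /proc/sys/dev/raid/speed_limit_min    # kernel-wide resync/reshape speed limits
     cat /proc/sys/dev/raid/speed_limit_max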


About 12 hours later, as the reshape hadn't progressed from 0%, I
looked at ways of aborting it, such as mdadm --stop /dev/md125, which
didn't work, so I ended up rebooting the server, and this is where
things really went pear-shaped.

The server came up in emergency mode, which I found odd given that the
boot and root should have been OK.

I was able to log on as root OK, but the RAID6 array was stuck in the
reshape state.

I then tried:

     mdadm --assemble --update=revert-reshape --backup-file=/grow_md125.bak \
           --verbose --uuid=f9b65f55:5f257add:1140ccc0:46ca6c19 /dev/md125

and this produced:

     mdadm: No super block found on /dev/sde (Expected magic a92b4efc,
     got <varying numbers>)
     mdadm: No RAID super block on /dev/sde
     .
     .
     mdadm: /dev/sde1 is identified as a member of /dev/md125, slot 6
     .
     .
     mdadm: /dev/md125 has an active reshape - checking if critical
section needs to be restored
     mdadm: No backup metadata on /grow_md125.back
     mdadm: Failed to find backup of critical section
     mdadm: Failed to restore critical section for reshape, sorry.


I've tried different variations on this, including mdadm --assemble
--invalid-backup --force, but I won't include all the different commands
here because I'm having to type all this by hand; I can't copy anything
off the server while it's in emergency mode.

I have also removed the suspect disk but this hasn't made any difference.

But the closest I've come to fixing this is running mdadm /dev/md125
--assemble --invalid-backup --backup-file=/grow_md125.bak --verbose
/dev/sdc1 /dev/sdd1 ....... /dev/sdaf1 and this produces:
     .
     .
     .
     mdadm: /dev/sdaf1 is identified as a member of /dev/md125, slot 4.
     mdadm: /dev/md125 has an active reshape - checking if critical
section needs to be restored
     mdadm: No backup metadata on /grow_md125.back
     mdadm: Failed to find backup of critical section
     mdadm: continuing without restoring backup
     mdadm: added /dev/sdac1 to /dev/md125 as 1
     .
     .
     .
     mdadm: failed to RUN_ARRAY /dev/md125: Invalid argument


dmesg has this information:

     md: md125 stopped.
     md/raid:md125: reshape_position too early for auto-recovery -
aborting.
     md: pers->run() failed ...
     md: md125 stopped.
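
When assembly fails like this, the per-member superblocks are usually
the next thing to inspect, since each one records that device's view of
the reshape position and event count.  A rough sketch (the globs only
approximate the member names used above):

     for d in /dev/sd?1 /dev/sda?1; do
         echo "== $d =="
         mdadm --examine "$d" | grep -E 'Events|Reshape|Array State|Device Role'
     done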


If you’ve stuck with me and read all this way, thank you and I hope you
can help me.

Regards,
Bob Brand


* Re: Failed mdadm RAID array after aborted Grow operation
  2022-05-08 13:18 Failed mdadm RAID array after aborted Grow operation Bob Brand
@ 2022-05-08 15:32 ` Wols Lists
  2022-05-08 22:04   ` Bob Brand
  0 siblings, 1 reply; 23+ messages in thread
From: Wols Lists @ 2022-05-08 15:32 UTC (permalink / raw)
  To: Bob Brand, linux-raid; +Cc: Phil Turmel

On 08/05/2022 14:18, Bob Brand wrote:
> If you’ve stuck with me and read all this way, thank you and I hope you
> can help me.

https://raid.wiki.kernel.org/index.php/Linux_Raid

Especially 
https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn

What you need to do is revert the reshape. I know what may have 
happened, and what bothers me is your kernel version, 3.10.

The first thing to try is to boot from up-to-date rescue media and see 
if an mdadm --revert works from there. If it does, your Centos should 
then bring everything back no problem.
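
For reference, the sort of invocation meant here is roughly the
following, run from the rescue environment; the member list is
abbreviated, and --invalid-backup / --force are only appropriate when
the backup file really is unusable (see mdadm(8)):

     mdadm --assemble /dev/md125 --update=revert-reshape \
           --invalid-backup --force --verbose /dev/sda1 /dev/sdb1 ... /dev/sdab1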

(You've currently got what I call a Frankensetup: a very old kernel, a
pretty new mdadm, and a whole bunch of patches that do who knows what.
You really need a matching kernel and mdadm, and your frankenkernel
won't match anything ...)

Let us know how that goes ...

Cheers,
Wol


* RE: Failed mdadm RAID array after aborted Grow operation
  2022-05-08 15:32 ` Wols Lists
@ 2022-05-08 22:04   ` Bob Brand
  2022-05-08 22:15     ` Wol
  0 siblings, 1 reply; 23+ messages in thread
From: Bob Brand @ 2022-05-08 22:04 UTC (permalink / raw)
  To: Wols Lists, linux-raid; +Cc: Phil Turmel

Thanks, Wol.

Should I use a CentOS 7 disk or a newer CentOS disk?

Thanks


* Re: Failed mdadm RAID array after aborted Grow operation
  2022-05-08 22:04   ` Bob Brand
@ 2022-05-08 22:15     ` Wol
  2022-05-08 22:19       ` Bob Brand
  0 siblings, 1 reply; 23+ messages in thread
From: Wol @ 2022-05-08 22:15 UTC (permalink / raw)
  To: Bob Brand, linux-raid; +Cc: Phil Turmel

How old is CentOS 7? With that kernel I guess it's quite old?

Try and get a CentOS 8.5 disk. At the end of the day, the version of 
linux doesn't matter. What you need is an up-to-date rescue disk. 
Distro/whatever is unimportant - what IS important is that you are using 
the latest mdadm, and a kernel that matches.

The problem you have sounds like a long-standing but now-fixed bug. An 
original CentOS disk might be okay (with matched kernel and mdadm), but 
almost certainly has what I consider to be a "dodgy" version of mdadm.

If you can afford the downtime, after you've reverted the reshape, I'd 
try starting it again with the rescue disk. It'll probably run fine. Let 
it complete and then your old CentOS 7 will be fine with it.

Cheers,
Wol


* RE: Failed mdadm RAID array after aborted Grow operation
  2022-05-08 22:15     ` Wol
@ 2022-05-08 22:19       ` Bob Brand
  2022-05-08 23:02         ` Bob Brand
  0 siblings, 1 reply; 23+ messages in thread
From: Bob Brand @ 2022-05-08 22:19 UTC (permalink / raw)
  To: Wol, linux-raid; +Cc: Phil Turmel

OK.  I've downloaded a CentOS 7 (2009) ISO from centos.org - that seems
to be the most recent they have.



* RE: Failed mdadm RAID array after aborted Grow operation
  2022-05-08 22:19       ` Bob Brand
@ 2022-05-08 23:02         ` Bob Brand
  2022-05-08 23:32           ` Bob Brand
  0 siblings, 1 reply; 23+ messages in thread
From: Bob Brand @ 2022-05-08 23:02 UTC (permalink / raw)
  To: Bob Brand, Wol, linux-raid; +Cc: Phil Turmel

Hi Wol,

I've booted to the installation media and I've run the following command:

     mdadm /dev/md125 --assemble --update=revert-reshape \
           --backup-file=/mnt/sysimage/grow_md125.bak \
           --verbose --uuid=f9b65f55:5f257add:1140ccc0:46ca6c19

But I'm still getting the error:

mdadm: /dev/md125 has an active reshape - checking if critical section needs 
to be restored
mdadm: No backup metadata on /mnt/sysimage/grow_md125.back
mdadm: Failed to find backup of critical section
mdadm: Failed to restore critical section for reshape, sorry.


Should I try the --invalid-backup switch or --force?

Thanks,
Bob



* RE: Failed mdadm RAID array after aborted Grow operation
  2022-05-08 23:02         ` Bob Brand
@ 2022-05-08 23:32           ` Bob Brand
  2022-05-09  0:09             ` Bob Brand
  0 siblings, 1 reply; 23+ messages in thread
From: Bob Brand @ 2022-05-08 23:32 UTC (permalink / raw)
  To: Bob Brand, Wol, linux-raid; +Cc: Phil Turmel

I just tried it again with the --invalid-backup switch and it's now
showing the State as "clean, degraded", and it's showing all the disks
except for the suspect one that I removed.

I'm unable to mount it and see the contents. I get the error "mount: 
/dev/md125: can't read superblock."

Is there more that I need to do?

Thanks



* RE: Failed mdadm RAID array after aborted Grow operation
  2022-05-08 23:32           ` Bob Brand
@ 2022-05-09  0:09             ` Bob Brand
  2022-05-09  6:52               ` Wols Lists
  0 siblings, 1 reply; 23+ messages in thread
From: Bob Brand @ 2022-05-09  0:09 UTC (permalink / raw)
  To: Bob Brand, Wol, linux-raid; +Cc: Phil Turmel

Hi Wol,

My apologies for continually bothering you but I have a couple of questions:

1. How do I overcome the error message "mount: /dev/md125: can't read
superblock."?  Do I use fsck?

2. The removed disk is showing as "   -   0   0   30   removed". Is it safe 
to use "mdadm /dev/md2 -r detached" or "mdadm /dev/md2 -r failed" to 
overcome this?

Thank you!



* Re: Failed mdadm RAID array after aborted Grow operation
  2022-05-09  0:09             ` Bob Brand
@ 2022-05-09  6:52               ` Wols Lists
  2022-05-09 13:07                 ` Bob Brand
       [not found]                 ` <CAAMCDecTb69YY+jGzq9HVqx4xZmdVGiRa54BD55Amcz5yaZo1Q@mail.gmail.com>
  0 siblings, 2 replies; 23+ messages in thread
From: Wols Lists @ 2022-05-09  6:52 UTC (permalink / raw)
  To: Bob Brand, linux-raid; +Cc: Phil Turmel, NeilBrown

On 09/05/2022 01:09, Bob Brand wrote:
> Hi Wol,
> 
> My apologies for continually bothering you but I have a couple of questions:

Did you read the links I sent you?
> 
> 1. How do I overcome the error message "mount: /dev/md125: can't read
> superblock."?  Do I use fsck?
> 
> 2. The removed disk is showing as "   -   0   0   30   removed". Is it safe
> to use "mdadm /dev/md2 -r detached" or "mdadm /dev/md2 -r failed" to
> overcome this?

I don't know :-( This is getting a bit out of my depth. But I'm 
SERIOUSLY concerned you're still futzing about with CentOS 7!!!

Why didn't you download CentOS 8.5? Why didn't you download RHEL 8.5, or 
the latest Fedora? Why didn't you download SUSE SLES 15?

Any and all CentOS 7 will come with either an out-of-date mdadm, or a 
Frankenkernel. NEITHER are a good idea.

Go back to the links I gave you, download and run lsdrv, and post the 
output here. Hopefully somebody will tell you the next steps. I will do 
my best.
> 
> Thank you!
> 
Cheers,
Wol

* RE: Failed mdadm RAID array after aborted Grow operation
  2022-05-09  6:52               ` Wols Lists
@ 2022-05-09 13:07                 ` Bob Brand
       [not found]                 ` <CAAMCDecTb69YY+jGzq9HVqx4xZmdVGiRa54BD55Amcz5yaZo1Q@mail.gmail.com>
  1 sibling, 0 replies; 23+ messages in thread
From: Bob Brand @ 2022-05-09 13:07 UTC (permalink / raw)
  To: Wols Lists, linux-raid; +Cc: Phil Turmel, NeilBrown

Hi Wol,

I did read the links you sent; actually, I'd already trawled through
them prior to subscribing to the mailing list - they're how I learned
about the list in the first place.

It seems that the conventional version of CentOS 8.5 is no longer
available; there's just CentOS Stream 8, and I wasn't sure how that
would go with the old style of CentOS.  To be honest, it didn't occur
to me to go with another flavour of Linux - I just figured that I'd use
CentOS to repair CentOS.

Anyway, I did try using "mdadm /dev/md2 -r detached" and "mdadm
/dev/md2 -r failed" to remove the removed disk, to no avail.  I ended
up using:

     mdadm --grow /dev/md125 --array-size 218789036032 \
           --backup-file=/mnt/sysimage/grow_md125_size_grow.bak --verbose

followed by:

     mdadm --grow /dev/md125 --raid-devices=30 \
           --backup-file=/mnt/sysimage/grow_md125_grow_disks.bak --verbose

and it seems to be working, in that it is reshaping the array, although
it is apparently going to take around 16,000 minutes (would that be
because we've got about 200TB of data?).

My concern now is whether or not I'll still have the mount issue once it 
finally completes the reshape.  If it does mount OK, does that mean I'm good 
to reboot it?

With regard to your comment about downloading lsdrv, I'll try to do
that, although I'm having trouble configuring my DNS servers in the
running rescue-disk OS.  I could run lsblk but, from what I can see of
lsdrv, lsblk doesn't have the detail that lsdrv has.  I'll keep working
on that and let you know what I get - it looks like I'll have to edit
it to use the older version of Python that this installation has.
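
For reference, something along these lines reports at least part of
what lsdrv would, using only tools already on a rescue image (the globs
only approximate the device names in this thread):

     lsblk -o NAME,KNAME,SIZE,TYPE,FSTYPE,MODEL,SERIAL,STATE,MOUNTPOINT
     mdadm --detail /dev/md125
     mdadm --examine /dev/sd?1 /dev/sda?1 | less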

Cheers,
Bob




* RE: Failed mdadm RAID array after aborted Grow operation
       [not found]                 ` <CAAMCDecTb69YY+jGzq9HVqx4xZmdVGiRa54BD55Amcz5yaZo1Q@mail.gmail.com>
@ 2022-05-11  5:39                   ` Bob Brand
  2022-05-11 12:35                     ` Reindl Harald
  2022-05-20 15:13                   ` Bob Brand
  1 sibling, 1 reply; 23+ messages in thread
From: Bob Brand @ 2022-05-11  5:39 UTC (permalink / raw)
  To: Roger Heflin, Wols Lists; +Cc: Linux RAID, Phil Turmel, NeilBrown

Thanks, Roger.

My apologies for not replying earlier.  By the time I read this I
already had a reshape underway to reduce the size of the array back to
the original 30 disks.  So far it seems to be progressing OK, although
the ETA is around 10 days, which is why I didn't respond sooner; I've
been busy dealing with the fallout from this.

Do I understand that you would recommend upgrading our installation of
Linux once the repair is complete, or are you advising downloading and
compiling a new kernel as part of the repair?  Or are you suggesting
that it was the fact that we're on such an old version of CentOS that
caused this mess?  I ask because, once this is repaired (assuming it
does complete successfully), I would like to extend the array to the
full 45 drives of which this server is capable.

Thanks,
Bob

From: Roger Heflin <rogerheflin@gmail.com>
Sent: Monday, 9 May 2022 9:05 PM
To: Wols Lists <antlists@youngman.org.uk>
Cc: Bob Brand <brand@wmawater.com.au>; Linux RAID 
<linux-raid@vger.kernel.org>; Phil Turmel <philip@turmel.org>; NeilBrown 
<neilb@suse.com>
Subject: Re: Failed mdadm RAID array after aborted Grow operation

The short-term easiest way to get a new kernel might be this.

Download a Fedora 35 live CD and boot from it.  It will allow you to
turn on the RAID and/or reshape the RAID and/or abort the reshape using
the Fedora 35 kernel and mdadm tools.  All of this will need to be done
manually from either the GUI or the command line, though, so it will be
somewhat of a pain.

The other choice is to download, compile and install a current
http://kernel.org kernel.  This takes some time (you have to install
compiler and header rpms); follow the instructions at
https://docs.rockylinux.org/guides/custom-linux-kernel/ (Rocky Linux is
a Red Hat clone, so they apply here).  How long it takes will depend on
the number of CPUs your machine has and the value you give to
-j<cpustouse>.  The biggest issue will likely be dealing with compile
errors for missing dependencies, where this or that tool or devel
package is missing.  And then you would still need to download the
newest mdadm and compile and install it.  These steps take longer, but
doing this will get your system onto a new kernel and new tools, and
the process of compiling and installing a kernel has for the most part
not changed in a long time.  I have been doing this on and off for 20+
years; a newer kernel on older userspace is widely used by a lot of the
kernel developers, so it is generally well tested and in my experience
just works to get you onto a new kernel with minimal trouble.
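
A rough outline of the build route described above, assuming the kernel
source has already been downloaded and unpacked; package names and
paths are illustrative rather than exact:

     # install a toolchain first, e.g.:
     #   yum groupinstall "Development Tools"
     #   yum install ncurses-devel openssl-devel elfutils-libelf-devel bc
     cd /usr/src/linux-<version>
     cp /boot/config-"$(uname -r)" .config
     make olddefconfig              # take defaults for options added since the old kernel
     make -j"$(nproc)"
     make modules_install
     make install                   # installs the kernel and initramfs entries
     # mdadm is built the same way from its own source tree: make && make install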



On Mon, May 9, 2022 at 5:24 AM Wols Lists <mailto:antlists@youngman.org.uk> 
wrote:
On 09/05/2022 01:09, Bob Brand wrote:
> Hi Wol,
>
> My apologies for continually bothering you but I have a couple of 
> questions:

Did you read the links I sent you?
>
> 1. How do I overcome the error message "mount: /dev/md125: can't read
> superblock."  Do it use fsck?
>
> 2. The removed disk is showing as "   -   0   0   30   removed". Is it 
> safe
> to use "mdadm /dev/md2 -r detached" or "mdadm /dev/md2 -r failed" to
> overcome this?

I don't know :-( This is getting a bit out of my depth. But I'm
SERIOUSLY concerned you're still futzing about with CentOS 7!!!

Why didn't you download CentOS 8.5? Why didn't you download RHEL 8.5, or
the latest Fedora? Why didn't you download SUSE SLES 15?

Any and all CentOS 7 will come with either an out-of-date mdadm, or a
Frankenkernel. NEITHER are a good idea.

Go back to the links I gave you, download and run lsdrv, and post the
output here. Hopefully somebody will tell you the next steps. I will do
my best.
>
> Thank you!
>
Cheers,
Wol
>
> -----Original Message-----
> From: Bob Brand <mailto:brand@wmawater.com.au>
> Sent: Monday, 9 May 2022 9:33 AM
> To: Bob Brand <mailto:brand@wmawater.com.au>; Wol 
> <mailto:antlists@youngman.org.uk>;
> mailto:linux-raid@vger.kernel.org
> Cc: Phil Turmel <mailto:philip@turmel.org>
> Subject: RE: Failed adadm RAID array after aborted Grown operation
>
> I just tried it again with the --invalid_backup switch and it's now 
> showing
> the State as "clean, degraded".and it's showing all the disks except for 
> the
> suspect one that I removed.
>
> I'm unable to mount it and see the contents. I get the error "mount:
> /dev/md125: can't read superblock."
>
> Is there more that I need to do?
>
> Thanks
>
>
> -----Original Message-----
> From: Bob Brand <mailto:brand@wmawater.com.au>
> Sent: Monday, 9 May 2022 9:02 AM
> To: Bob Brand <mailto:brand@wmawater.com.au>; Wol 
> <mailto:antlists@youngman.org.uk>;
> mailto:linux-raid@vger.kernel.org
> Cc: Phil Turmel <mailto:philip@turmel.org>
> Subject: RE: Failed adadm RAID array after aborted Grown operation
>
> Hi Wol,
>
> I've booted to the installation media and I've run the following command:
>
> mdadm
> /dev/md125 --assemble --update=revert-reshape --backup-file=/mnt/sysimage/grow_md125.bak
>   --verbose --uuid= f9b65f55:5f257add:1140ccc0:46ca6c19
> /dev/md125mdadm --assemble --update=revert-reshape --backup-file=/grow_md125.bak
>    --verbose --uuid=f9b65f55:5f257add:1140ccc0:46ca6c19
>
> But I'm still getting the error:
>
> mdadm: /dev/md125 has an active reshape - checking if critical section 
> needs
> to be restored
> mdadm: No backup metadata on /mnt/sysimage/grow_md125.back
> mdadm: Failed to find backup of critical section
> mdadm: Failed to restore critical section for reshape, sorry.
>
>
> Should I try the --invalid_backup switch or --force?
>
> Thanks,
> Bob
>
>
> -----Original Message-----
> From: Bob Brand <mailto:brand@wmawater.com.au>
> Sent: Monday, 9 May 2022 8:19 AM
> To: Wol <mailto:antlists@youngman.org.uk>; 
> mailto:linux-raid@vger.kernel.org
> Cc: Phil Turmel <mailto:philip@turmel.org>
> Subject: RE: Failed adadm RAID array after aborted Grown operation
>
> OK.  I've downloaded a Centos 7 - 2009 ISO from http://centos.org - that 
> seems to
> be the most recent they have.
>
>
> -----Original Message-----
> From: Wol <mailto:antlists@youngman.org.uk>
> Sent: Monday, 9 May 2022 8:16 AM
> To: Bob Brand <mailto:brand@wmawater.com.au>; 
> mailto:linux-raid@vger.kernel.org
> Cc: Phil Turmel <mailto:philip@turmel.org>
> Subject: Re: Failed adadm RAID array after aborted Grown operation
>
> How old is CentOS 7? With that kernel I guess it's quite old?
>
> Try and get a CentOS 8.5 disk. At the end of the day, the version of linux
> doesn't matter. What you need is an up-to-date rescue disk.
> Distro/whatever is unimportant - what IS important is that you are using 
> the
> latest mdadm, and a kernel that matches.
>
> The problem you have sounds like a long-standing but now-fixed bug. An
> original CentOS disk might be okay (with matched kernel and mdadm), but
> almost certainly has what I consider to be a "dodgy" version of mdadm.
>
> If you can afford the downtime, after you've reverted the reshape, I'd try
> starting it again with the rescue disk. It'll probably run fine. Let it
> complete and then your old CentOS 7 will be fine with it.
>
> Cheers,
> Wol
>
> On 08/05/2022 23:04, Bob Brand wrote:
>> Thank Wol.
>>
>> Should I use a CentOS 7 disk or a CentOS disk?
>>
>> Thanks
>>
>> -----Original Message-----
>> From: Wols Lists <mailto:antlists@youngman.org.uk>
>> Sent: Monday, 9 May 2022 1:32 AM
>> To: Bob Brand <mailto:brand@wmawater.com.au>; 
>> mailto:linux-raid@vger.kernel.org
>> Cc: Phil Turmel <mailto:philip@turmel.org>
>> Subject: Re: Failed adadm RAID array after aborted Grown operation
>>
>> On 08/05/2022 14:18, Bob Brand wrote:
>>> If you’ve stuck with me and read all this way, thank you and I hope
>>> you can help me.
>>
>> https://raid.wiki.kernel.org/index.php/Linux_Raid
>>
>> Especially
>> https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
>>
>> What you need to do is revert the reshape. I know what may have
>> happened, and what bothers me is your kernel version, 3.10.
>>
>> The first thing to try is to boot from up-to-date rescue media and see
>> if an mdadm --revert works from there. If it does, your Centos should
>> then bring everything back no problem.
>>
>> (You've currently got what I call a Frankensetup, a very old kernel, a
>> pretty new mdadm, and a whole bunch of patches that does who knows what.
>> You really need a matching kernel and mdadm, and your frankenkernel
>> won't match anything ...)
>>
>> Let us know how that goes ...
>>
>> Cheers,
>> Wol
>>
>>
>>
>>
>>
>
>
>
>
>
>
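
For anyone reading this thread later: the command shape being debated in the
quote above is mdadm's assemble-time reshape revert. A minimal sketch, using
the array name and UUID quoted in this thread; note that current mdadm spells
the option --invalid-backup (with a hyphen), and that assembling this way can
leave the stripes in the interrupted critical section corrupt:

     # stop the half-assembled array first, if it is active
     mdadm --stop /dev/md125

     # revert the interrupted reshape; --invalid-backup tells mdadm to
     # proceed even though the backup file is missing or unusable
     mdadm --assemble /dev/md125 --update=revert-reshape --invalid-backup \
           --verbose --uuid=f9b65f55:5f257add:1140ccc0:46ca6c19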

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed adadm RAID array after aborted Grown operation
  2022-05-11  5:39                   ` Bob Brand
@ 2022-05-11 12:35                     ` Reindl Harald
  2022-05-11 13:22                       ` Bob Brand
  0 siblings, 1 reply; 23+ messages in thread
From: Reindl Harald @ 2022-05-11 12:35 UTC (permalink / raw)
  To: Bob Brand, Roger Heflin, Wols Lists; +Cc: Linux RAID, Phil Turmel, NeilBrown



On 11.05.22 at 07:39, Bob Brand wrote:
> Do I understand that you would recommend upgrading our installation of Linux
> once the repair is complete or are advising downloading and compiling a new
> kernel as part of the repair?  Or are you suggesting that it was the fact
> that we’re on such an old version of CentOS that caused this mess?  I ask
> because once this is repaired (assuming it does complete successfully), I
> would like to extend the array to the full 45 drives of which this server is
> capable

you were advised to do that with a live ISO of whatever distribution
with a recent kernel and recent mdadm, and to leave your installed OS alone

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Failed adadm RAID array after aborted Grown operation
  2022-05-11 12:35                     ` Reindl Harald
@ 2022-05-11 13:22                       ` Bob Brand
  2022-05-11 14:56                         ` Reindl Harald
  0 siblings, 1 reply; 23+ messages in thread
From: Bob Brand @ 2022-05-11 13:22 UTC (permalink / raw)
  To: Reindl Harald, Roger Heflin, Wols Lists
  Cc: Linux RAID, Phil Turmel, NeilBrown

Sorry Reindl.  I'm not sure I understand. Are you saying I did or didn't do 
the right thing in booting from a CentOS rescue disk? At the moment it's 
running from the rescue disk and, be it the best distro to have used (or 
not), I would imagine that I need to keep running from the rescue disk until 
the reshape is complete as rebooting in the middle of a reshape is what got 
me in this mess.

Thanks

-----Original Message-----
From: Reindl Harald <h.reindl@thelounge.net>
Sent: Wednesday, 11 May 2022 10:36 PM
To: Bob Brand <brand@wmawater.com.au>; Roger Heflin <rogerheflin@gmail.com>; 
Wols Lists <antlists@youngman.org.uk>
Cc: Linux RAID <linux-raid@vger.kernel.org>; Phil Turmel 
<philip@turmel.org>; NeilBrown <neilb@suse.com>
Subject: Re: Failed adadm RAID array after aborted Grown operation



On 11.05.22 at 07:39, Bob Brand wrote:
> Do I understand that you would recommend upgrading our installation of
> Linux once the repair is complete or are advising downloading and
> compiling a new kernel as part of the repair?  Or are you suggesting
> that it was the fact that we’re on such an old version of CentOS that
> caused this mess?  I ask because once this is repaired (assuming it
> does complete successfully), I would like to extend the array to the
> full 45 drives of which this server is capable

you were advised to do that with a live ISO of whatever distribution with
a recent kernel and recent mdadm, and to leave your installed OS alone



CAUTION!!! This E-mail originated from outside of WMA Water. Do not click 
links or open attachments unless you recognize the sender and know the 
content is safe.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed adadm RAID array after aborted Grown operation
  2022-05-11 13:22                       ` Bob Brand
@ 2022-05-11 14:56                         ` Reindl Harald
  2022-05-11 14:59                           ` Reindl Harald
  0 siblings, 1 reply; 23+ messages in thread
From: Reindl Harald @ 2022-05-11 14:56 UTC (permalink / raw)
  To: Bob Brand, Roger Heflin, Wols Lists; +Cc: Linux RAID, Phil Turmel, NeilBrown



On 11.05.22 at 15:22, Bob Brand wrote:
> Sorry Reindl.  I'm not sure I understand. Are you saying I did or didn't do
> the right thing in booting from a CentOS rescue disk? At the moment it's
> running from the rescue disk and, be it the best distro to have used (or
> not), I would imagine that I need to keep running from the rescue disk until
> the reshape is complete as rebooting in the middle of a reshape is what got
> me in this mess.

and I don't understand what you did not understand in the clear response
below, which you got days ago!

for the reshape you were advised to use whatever rescue/live system with a
recent kernel and mdadm, no more and no less

just to avoid probably long-fixed bugs in your old kernel

---------------------

Try and get a CentOS 8.5 disk. At the end of the day, the version of 
linux doesn't matter. What you need is an up-to-date rescue disk. 
Distro/whatever is unimportant - what IS important is that you are using 
the latest mdadm, and a kernel that matches.

The problem you have sounds like a long-standing but now-fixed bug. An 
original CentOS disk might be okay (with matched kernel and mdadm), but 
almost certainly has what I consider to be a "dodgy" version of mdadm.

If you can afford the downtime, after you've reverted the reshape, I'd 
try starting it again with the rescue disk. It'll probably run fine. Let 
it complete and then your old CentOS 7 will be fine with it.
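
In practice the check being described here takes only a few commands from the
booted rescue/live environment; a minimal sketch (nothing here writes to the
array):

     uname -r           # kernel of the live environment, not the installed CentOS 7
     mdadm --version    # should report a reasonably recent mdadm (4.x era)
     cat /proc/mdstat   # see which arrays, if any, the live system has assembled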

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed adadm RAID array after aborted Grown operation
  2022-05-11 14:56                         ` Reindl Harald
@ 2022-05-11 14:59                           ` Reindl Harald
  2022-05-13  5:32                             ` Bob Brand
  0 siblings, 1 reply; 23+ messages in thread
From: Reindl Harald @ 2022-05-11 14:59 UTC (permalink / raw)
  To: Bob Brand, Roger Heflin, Wols Lists; +Cc: Linux RAID, Phil Turmel, NeilBrown



On 11.05.22 at 16:56, Reindl Harald wrote:
> 
> 
> On 11.05.22 at 15:22, Bob Brand wrote:
>> Sorry Reindl.  I'm not sure I understand. Are you saying I did or 
>> didn't do
>> the right thing in booting from a CentOS rescue disk? At the moment it's
>> running from the rescue disk and, be it the best distro to have used (or
>> not), I would imagine that I need to keep running from the rescue disk 
>> until
>> the reshape is complete as rebooting in the middle of a reshape is 
>> what got
>> me in this mess.

and nowhere did I say reboot now

and I only responded to your "Do I understand that you would recommend
upgrading our installation of Linux once the repair is complete or are
advising downloading and compiling a new kernel as part of the repair?"

nobody said that - the only point was to use as recent a kernel as possible
for all grow/reshape operations

> and I don't understand what you did not understand in the clear response
> below, which you got days ago!
> 
> for the reshape you were advised to use whatever rescue/live system with a
> recent kernel and mdadm, no more and no less
> 
> just to avoid probably long-fixed bugs in your old kernel
> 
> ---------------------
> 
> Try and get a CentOS 8.5 disk. At the end of the day, the version of 
> linux doesn't matter. What you need is an up-to-date rescue disk. 
> Distro/whatever is unimportant - what IS important is that you are using 
> the latest mdadm, and a kernel that matches.
> 
> The problem you have sounds like a long-standing but now-fixed bug. An 
> original CentOS disk might be okay (with matched kernel and mdadm), but 
> almost certainly has what I consider to be a "dodgy" version of mdadm.
> 
> If you can afford the downtime, after you've reverted the reshape, I'd 
> try starting it again with the rescue disk. It'll probably run fine. Let 
> it complete and then your old CentOS 7 will be fine with it

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Failed adadm RAID array after aborted Grown operation
  2022-05-11 14:59                           ` Reindl Harald
@ 2022-05-13  5:32                             ` Bob Brand
  2022-05-13  8:18                               ` Reindl Harald
  0 siblings, 1 reply; 23+ messages in thread
From: Bob Brand @ 2022-05-13  5:32 UTC (permalink / raw)
  To: Reindl Harald, Roger Heflin, Wols Lists
  Cc: Linux RAID, Phil Turmel, NeilBrown

This may not be the forum to ask this, but what exactly is "compiling the
kernel"? From what I've been reading, it sounds like a somewhat involved and
complex process - is it? Is compiling a new kernel the same as upgrading the
OS? I'm getting the impression that it sort of is but sort of isn't. Is it
possible to compile a kernel for a rescue CD (from the comments I've read,
it is possible)? If I were to compile a new kernel, would I expect the
version numbers of the kernel and mdadm to be the same? Sorry for all the
questions but, as I said at the outset, a lot of this is very new to me.

Thank you,
Bob

-----Original Message-----
From: Reindl Harald <h.reindl@thelounge.net>
Sent: Thursday, 12 May 2022 12:59 AM
To: Bob Brand <brand@wmawater.com.au>; Roger Heflin <rogerheflin@gmail.com>; 
Wols Lists <antlists@youngman.org.uk>
Cc: Linux RAID <linux-raid@vger.kernel.org>; Phil Turmel 
<philip@turmel.org>; NeilBrown <neilb@suse.com>
Subject: Re: Failed adadm RAID array after aborted Grown operation



On 11.05.22 at 16:56, Reindl Harald wrote:
>
>
> On 11.05.22 at 15:22, Bob Brand wrote:
>> Sorry Reindl.  I'm not sure I understand. Are you saying I did or
>> didn't do the right thing in booting from a CentOS rescue disk? At
>> the moment it's running from the rescue disk and, be it the best
>> distro to have used (or not), I would imagine that I need to keep
>> running from the rescue disk until the reshape is complete as
>> rebooting in the middle of a reshape is what got me in this mess.

and nowhere did I say reboot now

and I only responded to your "Do I understand that you would recommend
upgrading our installation of Linux once the repair is complete or are
advising downloading and compiling a new kernel as part of the repair?"

nobody said that - the only point was to use as recent a kernel as possible
for all grow/reshape operations

> and I don't understand what you did not understand in the clear
> response below, which you got days ago!
>
> for the reshape you were advised to use whatever rescue/live system with a
> recent kernel and mdadm, no more and no less
>
> just to avoid probably long-fixed bugs in your old kernel
>
> ---------------------
>
> Try and get a CentOS 8.5 disk. At the end of the day, the version of
> linux doesn't matter. What you need is an up-to-date rescue disk.
> Distro/whatever is unimportant - what IS important is that you are
> using the latest mdadm, and a kernel that matches.
>
> The problem you have sounds like a long-standing but now-fixed bug. An
> original CentOS disk might be okay (with matched kernel and mdadm),
> but almost certainly has what I consider to be a "dodgy" version of mdadm.
>
> If you can afford the downtime, after you've reverted the reshape, I'd
> try starting it again with the rescue disk. It'll probably run fine.
> Let it complete and then your old CentOS 7 will be fine with it






^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed adadm RAID array after aborted Grown operation
  2022-05-13  5:32                             ` Bob Brand
@ 2022-05-13  8:18                               ` Reindl Harald
  0 siblings, 0 replies; 23+ messages in thread
From: Reindl Harald @ 2022-05-13  8:18 UTC (permalink / raw)
  To: Bob Brand, Roger Heflin, Wols Lists; +Cc: Linux RAID, Phil Turmel, NeilBrown



On 13.05.22 at 07:32, Bob Brand wrote:
> This may not be the forum to ask 

it is, because you can just type that sort of question into Google, and
there isn't a good reason to build your own kernel in 2022 for most use cases

> this but what exactly is "compiling the
> kernel". From what I've been reading, it sounds like a somewhat involved and
> complex process - is it? Is compiling a new kernel the same as upgrading the
> OS? I'm getting the impression that it sort of is but sort of isn't. Is it
> possible to compile a kernel for a rescue CD (from the comments I've read,
> it is possible)? If I were to compile a new kernel, would I expect the
> version number for the kernel and mdadm to be the same? Sorry for all the
> question but, as I said at the outset, a lot of this is all very new to me.

don't get me wrong, but "Is compiling a new kernel the same as upgrading
the OS" and "what exactly is compiling the kernel" suggest you should just
use a binary distribution, since it sounds like you don't yet know what
compiling software from source means

> -----Original Message-----
> From: Reindl Harald <h.reindl@thelounge.net>
> Sent: Thursday, 12 May 2022 12:59 AM
> To: Bob Brand <brand@wmawater.com.au>; Roger Heflin <rogerheflin@gmail.com>;
> Wols Lists <antlists@youngman.org.uk>
> Cc: Linux RAID <linux-raid@vger.kernel.org>; Phil Turmel
> <philip@turmel.org>; NeilBrown <neilb@suse.com>
> Subject: Re: Failed adadm RAID array after aborted Grown operation
> 
> 
> 
On 11.05.22 at 16:56, Reindl Harald wrote:
>
>
> On 11.05.22 at 15:22, Bob Brand wrote:
>>> Sorry Reindl.  I'm not sure I understand. Are you saying I did or
>>> didn't do the right thing in booting from a CentOS rescue disk? At
>>> the moment it's running from the rescue disk and, be it the best
>>> distro to have used (or not), I would imagine that I need to keep
>>> running from the rescue disk until the reshape is complete as
>>> rebooting in the middle of a reshape is what got me in this mess.
> 
> and nowhere did I say reboot now
> 
> and I only responded to your "Do I understand that you would recommend
> upgrading our installation of Linux once the repair is complete or are
> advising downloading and compiling a new kernel as part of the repair?"
> 
> nobody said that - the only point was to use as recent a kernel as possible
> for all grow/reshape operations
> 
>> and I don't understand what you did not understand in the clear
>> response below, which you got days ago!
>>
>> for the reshape you were advised to use whatever rescue/live system with a
>> recent kernel and mdadm, no more and no less
>>
>> just to avoid probably long-fixed bugs in your old kernel
>>
>> ---------------------
>>
>> Try and get a CentOS 8.5 disk. At the end of the day, the version of
>> linux doesn't matter. What you need is an up-to-date rescue disk.
>> Distro/whatever is unimportant - what IS important is that you are
>> using the latest mdadm, and a kernel that matches.
>>
>> The problem you have sounds like a long-standing but now-fixed bug. An
>> original CentOS disk might be okay (with matched kernel and mdadm),
>> but almost certainly has what I consider to be a "dodgy" version of mdadm.
>>
>> If you can afford the downtime, after you've reverted the reshape, I'd
>> try starting it again with the rescue disk. It'll probably run fine.
>> Let it complete and then your old CentOS 7 will be fine with it
> 
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Failed adadm RAID array after aborted Grown operation
       [not found]                 ` <CAAMCDecTb69YY+jGzq9HVqx4xZmdVGiRa54BD55Amcz5yaZo1Q@mail.gmail.com>
  2022-05-11  5:39                   ` Bob Brand
@ 2022-05-20 15:13                   ` Bob Brand
  2022-05-20 15:41                     ` Reindl Harald
  1 sibling, 1 reply; 23+ messages in thread
From: Bob Brand @ 2022-05-20 15:13 UTC (permalink / raw)
  To: Roger Heflin, Wols Lists; +Cc: Linux RAID, Phil Turmel, NeilBrown

UPDATE:

The array finally finished the reshape process (after almost two weeks!) and 
I now have an array that's showing as clean with the original 30 disks. 
However, when I try to mount it, I get the message "mount: /dev/md125: can't 
read superblock".

Any suggestions as to what my next step should be? Note: it's still running 
from the rescue disk.

Thank you,
Bob
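
For later readers: the usual first step at this point is a set of read-only
checks before attempting any repair. A minimal sketch, assuming the array is
/dev/md125 as above - none of these commands write to the array; the
filesystem-level dry run is covered further down the thread:

     mdadm --detail /dev/md125   # confirm state, size and member count
     cat /proc/mdstat            # cross-check what the kernel thinks
     dmesg | tail -n 50          # look for the errors logged by the failed mount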

From: Roger Heflin <rogerheflin@gmail.com>
Sent: Monday, 9 May 2022 9:05 PM
To: Wols Lists <antlists@youngman.org.uk>
Cc: Bob Brand <brand@wmawater.com.au>; Linux RAID 
<linux-raid@vger.kernel.org>; Phil Turmel <philip@turmel.org>; NeilBrown 
<neilb@suse.com>
Subject: Re: Failed adadm RAID array after aborted Grown operation

The easiest short-term way to get a new kernel might be this.

Download a Fedora 35 live CD and boot from it.  It will allow you to start
the raid and/or reshape the raid and/or abort the reshape using the Fedora
35 kernel and mdadm tools.  All of this will need to be done manually from
either the GUI or the command line, so it will be somewhat of a pain.

The other choice is to download, compile and install a current kernel from
http://kernel.org.  This takes some time (you have to install the compiler
and header rpms); follow these instructions
(https://docs.rockylinux.org/guides/custom-linux-kernel/) -- Rocky Linux, so
a Red Hat clone.  How long it takes will depend on the number of CPUs your
machine has and the value after -j<cpustouse>.  The biggest issue will
likely be dealing with compile errors for this or that missing tool and/or
devel package.  And then you would still need to download the newest mdadm
and compile and install it.  These steps take longer, but they will get your
system onto a new kernel and new tools, and once you know how to do it, the
process of compiling and installing a kernel has for the most part not
changed in a long time.  I have been doing this on and off for 20+ years; a
newer kernel on older userspace is widely used by a lot of the kernel
developers, so it is generally well tested and in my experience just works
to get you onto a new kernel with minimal trouble.
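
A rough outline of the steps that guide walks through - the package set, the
kernel version and the mdadm version below are only illustrative, and the
exact list varies by system:

     # build prerequisites on an EL7-style system (illustrative set)
     yum groupinstall "Development Tools"
     yum install ncurses-devel bc bison flex openssl-devel elfutils-libelf-devel

     # kernel: start from the running config, then build and install
     tar xf linux-5.x.y.tar.xz && cd linux-5.x.y
     cp /boot/config-$(uname -r) .config
     make olddefconfig
     make -j$(nproc)
     make modules_install && make install

     # mdadm from source
     tar xf mdadm-4.x.tar.xz && cd mdadm-4.x
     make && make install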



On Mon, May 9, 2022 at 5:24 AM Wols Lists <antlists@youngman.org.uk> wrote:
On 09/05/2022 01:09, Bob Brand wrote:
> Hi Wol,
>
> My apologies for continually bothering you but I have a couple of 
> questions:

Did you read the links I sent you?
>
> 1. How do I overcome the error message "mount: /dev/md125: can't read
> superblock."  Do it use fsck?
>
> 2. The removed disk is showing as "   -   0   0   30   removed". Is it 
> safe
> to use "mdadm /dev/md2 -r detached" or "mdadm /dev/md2 -r failed" to
> overcome this?

I don't know :-( This is getting a bit out of my depth. But I'm
SERIOUSLY concerned you're still futzing about with CentOS 7!!!

Why didn't you download CentOS 8.5? Why didn't you download RHEL 8.5, or
the latest Fedora? Why didn't you download SUSE SLES 15?

Any and all CentOS 7 will come with either an out-of-date mdadm, or a
Frankenkernel. NEITHER are a good idea.

Go back to the links I gave you, download and run lsdrv, and post the
output here. Hopefully somebody will tell you the next steps. I will do
my best.
>
> Thank you!
>
Cheers,
Wol
>
> -----Original Message-----
> From: Bob Brand <brand@wmawater.com.au>
> Sent: Monday, 9 May 2022 9:33 AM
> To: Bob Brand <brand@wmawater.com.au>; Wol <antlists@youngman.org.uk>;
> linux-raid@vger.kernel.org
> Cc: Phil Turmel <philip@turmel.org>
> Subject: RE: Failed adadm RAID array after aborted Grown operation
>
> I just tried it again with the --invalid_backup switch and it's now 
> showing
> the State as "clean, degraded".and it's showing all the disks except for 
> the
> suspect one that I removed.
>
> I'm unable to mount it and see the contents. I get the error "mount:
> /dev/md125: can't read superblock."
>
> Is there more that I need to do?
>
> Thanks
>
>
> -----Original Message-----
> From: Bob Brand <brand@wmawater.com.au>
> Sent: Monday, 9 May 2022 9:02 AM
> To: Bob Brand <brand@wmawater.com.au>; Wol <antlists@youngman.org.uk>;
> linux-raid@vger.kernel.org
> Cc: Phil Turmel <philip@turmel.org>
> Subject: RE: Failed adadm RAID array after aborted Grown operation
>
> Hi Wol,
>
> I've booted to the installation media and I've run the following command:
>
> mdadm
> /dev/md125 --assemble --update=revert-reshape --backup-file=/mnt/sysimage/grow_md125.bak
>   --verbose --uuid=f9b65f55:5f257add:1140ccc0:46ca6c19 /dev/md125
>
> mdadm --assemble --update=revert-reshape --backup-file=/grow_md125.bak
>    --verbose --uuid=f9b65f55:5f257add:1140ccc0:46ca6c19
>
> But I'm still getting the error:
>
> mdadm: /dev/md125 has an active reshape - checking if critical section
> needs to be restored
> mdadm: No backup metadata on /mnt/sysimage/grow_md125.back
> mdadm: Failed to find backup of critical section
> mdadm: Failed to restore critical section for reshape, sorry.
>
>
> Should I try the --invalid_backup switch or --force?
>
> Thanks,
> Bob
>
>
> -----Original Message-----
> From: Bob Brand <brand@wmawater.com.au>
> Sent: Monday, 9 May 2022 8:19 AM
> To: Wol <antlists@youngman.org.uk>; linux-raid@vger.kernel.org
> Cc: Phil Turmel <philip@turmel.org>
> Subject: RE: Failed adadm RAID array after aborted Grown operation
>
> OK.  I've downloaded a Centos 7 - 2009 ISO from http://centos.org - that 
> seems to
> be the most recent they have.
>
>
> -----Original Message-----
> From: Wol <antlists@youngman.org.uk>
> Sent: Monday, 9 May 2022 8:16 AM
> To: Bob Brand <brand@wmawater.com.au>; linux-raid@vger.kernel.org
> Cc: Phil Turmel <philip@turmel.org>
> Subject: Re: Failed adadm RAID array after aborted Grown operation
>
> How old is CentOS 7? With that kernel I guess it's quite old?
>
> Try and get a CentOS 8.5 disk. At the end of the day, the version of linux
> doesn't matter. What you need is an up-to-date rescue disk.
> Distro/whatever is unimportant - what IS important is that you are using 
> the
> latest mdadm, and a kernel that matches.
>
> The problem you have sounds like a long-standing but now-fixed bug. An
> original CentOS disk might be okay (with matched kernel and mdadm), but
> almost certainly has what I consider to be a "dodgy" version of mdadm.
>
> If you can afford the downtime, after you've reverted the reshape, I'd try
> starting it again with the rescue disk. It'll probably run fine. Let it
> complete and then your old CentOS 7 will be fine with it.
>
> Cheers,
> Wol
>
> On 08/05/2022 23:04, Bob Brand wrote:
>> Thank Wol.
>>
>> Should I use a CentOS 7 disk or a CentOS disk?
>>
>> Thanks
>>
>> -----Original Message-----
>> From: Wols Lists <antlists@youngman.org.uk>
>> Sent: Monday, 9 May 2022 1:32 AM
>> To: Bob Brand <brand@wmawater.com.au>; linux-raid@vger.kernel.org
>> Cc: Phil Turmel <philip@turmel.org>
>> Subject: Re: Failed adadm RAID array after aborted Grown operation
>>
>> On 08/05/2022 14:18, Bob Brand wrote:
>>> If you’ve stuck with me and read all this way, thank you and I hope
>>> you can help me.
>>
>> https://raid.wiki.kernel.org/index.php/Linux_Raid
>>
>> Especially
>> https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
>>
>> What you need to do is revert the reshape. I know what may have
>> happened, and what bothers me is your kernel version, 3.10.
>>
>> The first thing to try is to boot from up-to-date rescue media and see
>> if an mdadm --revert works from there. If it does, your Centos should
>> then bring everything back no problem.
>>
>> (You've currently got what I call a Frankensetup, a very old kernel, a
>> pretty new mdadm, and a whole bunch of patches that does who knows what.
>> You really need a matching kernel and mdadm, and your frankenkernel
>> won't match anything ...)
>>
>> Let us know how that goes ...
>>
>> Cheers,
>> Wol
>>
>>
>>
>>
>>
>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed adadm RAID array after aborted Grown operation
  2022-05-20 15:13                   ` Bob Brand
@ 2022-05-20 15:41                     ` Reindl Harald
  2022-05-22  4:13                       ` Bob Brand
  0 siblings, 1 reply; 23+ messages in thread
From: Reindl Harald @ 2022-05-20 15:41 UTC (permalink / raw)
  To: Bob Brand, Roger Heflin, Wols Lists; +Cc: Linux RAID, Phil Turmel, NeilBrown



On 20.05.22 at 17:13, Bob Brand wrote:
> UPDATE:
> 
> The array finally finished the reshape process (after almost two weeks!) and
> I now have an array that's showing as clean with the original 30 disks.
> However, when I try to mount it, I get the message "mount: /dev/md125: can't
> read superblock".
> 
> Any suggestions as to what my next step should be? Note: it's still running
> from the rescue disk

restore from a backup - the array is one thing, the filesystem is a
different layer, and it seems to be heavily damaged after all the things
which happened

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Failed adadm RAID array after aborted Grown operation
  2022-05-20 15:41                     ` Reindl Harald
@ 2022-05-22  4:13                       ` Bob Brand
  2022-05-22 11:25                         ` Reindl Harald
  2022-05-22 13:31                         ` Wols Lists
  0 siblings, 2 replies; 23+ messages in thread
From: Bob Brand @ 2022-05-22  4:13 UTC (permalink / raw)
  To: Reindl Harald, Roger Heflin, Wols Lists
  Cc: Linux RAID, Phil Turmel, NeilBrown

Thanks Reindl.

Is xfs_repair an option? And, if it is, do I run it on md125 or the 
individual sd devices?

Unfortunately, restore from back up isn't an option - after all to where do 
you back up 200TB of data? This storage was originally set up with the 
understanding that it wasn't backed up and so no valuable data was supposed 
to have been stored on it. Unfortunately, people being what they are, 
valuable data has been stored there and I'm the mug now trying to get it 
back - it's a system that I've inherited.

So, any help or constructive advice would be appreciated.

Thanks,
Bob

-----Original Message-----
From: Reindl Harald <h.reindl@thelounge.net>
Sent: Saturday, 21 May 2022 1:41 AM
To: Bob Brand <brand@wmawater.com.au>; Roger Heflin <rogerheflin@gmail.com>; 
Wols Lists <antlists@youngman.org.uk>
Cc: Linux RAID <linux-raid@vger.kernel.org>; Phil Turmel 
<philip@turmel.org>; NeilBrown <neilb@suse.com>
Subject: Re: Failed adadm RAID array after aborted Grown operation



On 20.05.22 at 17:13, Bob Brand wrote:
> UPDATE:
>
> The array finally finished the reshape process (after almost two
> weeks!) and I now have an array that's showing as clean with the original 
> 30 disks.
> However, when I try to mount it, I get the message "mount: /dev/md125:
> can't read superblock".
>
> Any suggestions as to what my next step should be? Note: it's still
> running from the rescue disk

restore from a backup - the array is one thing, the filesystem is a
different layer, and it seems to be heavily damaged after all the things
which happened






^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed adadm RAID array after aborted Grown operation
  2022-05-22  4:13                       ` Bob Brand
@ 2022-05-22 11:25                         ` Reindl Harald
  2022-05-22 13:31                         ` Wols Lists
  1 sibling, 0 replies; 23+ messages in thread
From: Reindl Harald @ 2022-05-22 11:25 UTC (permalink / raw)
  To: Bob Brand, Roger Heflin, Wols Lists; +Cc: Linux RAID, Phil Turmel, NeilBrown



On 22.05.22 at 06:13, Bob Brand wrote:
> Is xfs_repair an option? 

unlikely, given that the underlying device has all sorts of damage - think
of the RAID as a single disk and consider how the filesystem reacts if you
shoot holes in it

> And, if it is, do I run it on md125 or the
> individual sd devices?

if you think about it for two seconds it's obvious - is the filesystem on
top of the single disks or on top of the whole RAID? the filesystem doesn't
even know about the individual devices
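
Concretely, if it gets that far, the cautious way to start - assuming the
filesystem is XFS sitting directly on /dev/md125, as described in this
thread - is a dry run that only reports what it would change:

     xfs_repair -n /dev/md125    # -n = no-modify: scan and report, write nothing

Only after reviewing that output (and ideally after taking an image, or at
least saving mdadm --examine output for every member drive) would a real
xfs_repair pass be worth considering.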

> Unfortunately, restore from back up isn't an option - after all to where do
> you back up 200TB of data? 

on a second machine in a different building - the initial sync is done
locally, and after that rsync is enough no matter how large the data is -
the daily delta doesn't grow just because the whole dataset is huge
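
As a sketch of that sort of arrangement (the host name and paths here are
made up for illustration):

     # one-off initial copy while both machines share a fast local link
     rsync -aHAX --info=progress2 /data/ backuphost:/backup/data/

     # nightly delta afterwards; only changed files are transferred
     rsync -aHAX --delete /data/ backuphost:/backup/data/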

and don't get me wrong, but who starts a reshape on 200 TB of storage when
they know there is no backup?

> This storage was originally set up with the
> understanding that it wasn't backed up and so no valuable data was supposed
> to have been stored on it. 

well, then I wouldn't store it there at all

> Unfortunately, people being what they are,
> valuable data has been stored there and I'm the mug now trying to get it
> back - it's a system that I've inherited.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed adadm RAID array after aborted Grown operation
  2022-05-22  4:13                       ` Bob Brand
  2022-05-22 11:25                         ` Reindl Harald
@ 2022-05-22 13:31                         ` Wols Lists
  2022-05-22 22:54                           ` Bob Brand
  1 sibling, 1 reply; 23+ messages in thread
From: Wols Lists @ 2022-05-22 13:31 UTC (permalink / raw)
  To: Bob Brand, Reindl Harald, Roger Heflin; +Cc: Linux RAID, Phil Turmel, NeilBrown

On 22/05/2022 05:13, Bob Brand wrote:
> Unfortunately, restore from back up isn't an option - after all to where do
> you back up 200TB of data? This storage was originally set up with the
> understanding that it wasn't backed up and so no valuable data was supposed
> to have been stored on it. Unfortunately, people being what they are,
> valuable data has been stored there and I'm the mug now trying to get it
> back - it's a system that I've inherited.
> 
> So, any help or constructive advice would be appreciated.

Unfortunately, about the only constructive advice I can give you is 
"live and learn". I made a similar massive cock-up at the start of my 
career, and I've always been excessively cautious about disks and data 
ever since.

What your employer needs to take away from this - and no disrespect to 
yourself - is that if they run a system that was probably supported for 
about five years, then has been running on duck tape and baling wire for 
a further ten years, DON'T give it to someone with pretty much NO 
sysadmin or computer ops experience to carry out a potentially 
disastrous operation like messing about with a raid array!

This is NOT a simple setup, and it seems clear to me that you have 
little familiarity with the basic concepts. Unfortunately, your employer 
was playing Russian Roulette, and the gun went off.

On a *personal* level, and especially if your employer wants you to 
continue looking after their systems, they need to give you an (old?) 
box with a bunch of disk drives. Go back to the raid website and look at 
the article about building a new system. Take that system they've given 
you, and use that article as a guide to build it from scratch. It's 
actually about the computer being used right now to type this message.

I use(d) gentoo as my distro. It's a great distro, but for a newbie I 
think it takes "throw them in at the deep end" to extremes. Go find 
Slackware and start with that. It's not a "hold their hands and do 
everything for them" distro, but nor is it a "here's the instructions, 
if they don't work for you then you're on your own" distro. Once you've 
got to grips with Slack, have a go at gentoo. And once you've managed to 
get gentoo working, you should have a pretty decent grasp of what's 
going "under the bonnet". CentOS/RedHat/SLES should be a breeze after that.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Failed adadm RAID array after aborted Grown operation
  2022-05-22 13:31                         ` Wols Lists
@ 2022-05-22 22:54                           ` Bob Brand
  0 siblings, 0 replies; 23+ messages in thread
From: Bob Brand @ 2022-05-22 22:54 UTC (permalink / raw)
  To: Wols Lists, Reindl Harald, Roger Heflin
  Cc: Linux RAID, Phil Turmel, NeilBrown

Thanks Wol.

I can't really disagree with anything you've said except to mention that I 
do have a fair bit of experience (20+ years) but it's all been pretty much 
Microsoft/Windows and hardware RAID.

Like I said this device was never meant to be used for critical data - if 
nothing else this has been something of a wake-up call for us.



-----Original Message-----
From: Wols Lists <antlists@youngman.org.uk>
Sent: Sunday, 22 May 2022 11:31 PM
To: Bob Brand <brand@wmawater.com.au>; Reindl Harald 
<h.reindl@thelounge.net>; Roger Heflin <rogerheflin@gmail.com>
Cc: Linux RAID <linux-raid@vger.kernel.org>; Phil Turmel 
<philip@turmel.org>; NeilBrown <neilb@suse.com>
Subject: Re: Failed adadm RAID array after aborted Grown operation

On 22/05/2022 05:13, Bob Brand wrote:
> Unfortunately, restore from back up isn't an option - after all to
> where do you back up 200TB of data? This storage was originally set up
> with the understanding that it wasn't backed up and so no valuable
> data was supposed to have been stored on it. Unfortunately, people
> being what they are, valuable data has been stored there and I'm the
> mug now trying to get it back - it's a system that I've inherited.
>
> So, any help or constructive advice would be appreciated.

Unfortunately, about the only constructive advice I can give you is "live 
and learn". I made a similar massive cock-up at the start of my career, and 
I've always been excessively cautious about disks and data ever since.

What your employer needs to take away from this - and no disrespect to 
yourself - is that if they run a system that was probably supported for 
about five years, then has been running on duck tape and baling wire for a 
further ten years, DON'T give it to someone with pretty much NO sysadmin or 
computer ops experience to carry out a potentially disastrous operation like 
messing about with a raid array!

This is NOT a simple setup, and it seems clear to me that you have little 
familiarity with the basic concepts. Unfortunately, your employer was 
playing Russian Roulette, and the gun went off.

On a *personal* level, and especially if your employer wants you to continue 
looking after their systems, they need to give you an (old?) box with a 
bunch of disk drives. Go back to the raid website and look at the article 
about building a new system. Take that system they've given you, and use 
that article as a guide to build it from scratch. It's actually about the 
computer being used right now to type this message.

I use(d) gentoo as my distro. It's a great distro, but for a newbie I think 
it takes "throw them in at the deep end" to extremes. Go find Slackware and 
start with that. It's not a "hold their hands and do everything for them" 
distro, but nor is it a "here's the instructions, if they don't work for you 
then you're on your own" distro. Once you've got to grips with Slack, have a 
go at gentoo. And once you've managed to get gentoo working, you should have 
a pretty decent grasp of what's going "under the bonnet". CentOS/RedHat/SLES 
should be a breeze after that.

Cheers,
Wol






^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-05-22 22:54 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-08 13:18 Failed adadm RAID array after aborted Grown operation Bob Brand
2022-05-08 15:32 ` Wols Lists
2022-05-08 22:04   ` Bob Brand
2022-05-08 22:15     ` Wol
2022-05-08 22:19       ` Bob Brand
2022-05-08 23:02         ` Bob Brand
2022-05-08 23:32           ` Bob Brand
2022-05-09  0:09             ` Bob Brand
2022-05-09  6:52               ` Wols Lists
2022-05-09 13:07                 ` Bob Brand
     [not found]                 ` <CAAMCDecTb69YY+jGzq9HVqx4xZmdVGiRa54BD55Amcz5yaZo1Q@mail.gmail.com>
2022-05-11  5:39                   ` Bob Brand
2022-05-11 12:35                     ` Reindl Harald
2022-05-11 13:22                       ` Bob Brand
2022-05-11 14:56                         ` Reindl Harald
2022-05-11 14:59                           ` Reindl Harald
2022-05-13  5:32                             ` Bob Brand
2022-05-13  8:18                               ` Reindl Harald
2022-05-20 15:13                   ` Bob Brand
2022-05-20 15:41                     ` Reindl Harald
2022-05-22  4:13                       ` Bob Brand
2022-05-22 11:25                         ` Reindl Harald
2022-05-22 13:31                         ` Wols Lists
2022-05-22 22:54                           ` Bob Brand
