Re: RAID5->RAID6 reshape remains stuck at 0% (does nothing, not even start)

From: David Madore <david+ml@madore.org>
To: Wols Lists <antlists@youngman.org.uk>
Cc: Linux RAID mailing-list <linux-raid@vger.kernel.org>
Subject: Re: RAID5->RAID6 reshape remains stuck at 0% (does nothing, not even start)
Date: Wed, 30 Sep 2020 21:45:10 +0200	[thread overview]
Message-ID: <20200930194510.vki7zixjca6sxvin@achernar.gro-tsen.net> (raw)
In-Reply-To: <5F74D684.8020005@youngman.org.uk>

On Wed, Sep 30, 2020 at 08:03:32PM +0100, Wols Lists wrote:
> On 30/09/20 19:58, David Madore wrote:
> > mdadm - v4.1 - 2018-10-01
> > 
> > - which I think is roughly contemporaneous to the kernel version I'm
> > using.  But the problem still persists (with the exact same symptoms
> > and details).
> 
> Except that mdadm is NOT the problem. The problem is that the kernel and
> mdadm are not matched date-wise, and because the kernel is a
> franken-kernel you need to use a different kernel.

I don't understand what you mean by "matched date-wise".  The kernel
I'm using is a longterm support branch (4.9) which was frozen at the
same approximate date as the mdadm I just installed.  And it was also
the same longterm support branch that was used in the Debian oldstable
(9 aka stretch).  Do you mean that there is no mdadm version which is
compatible with the 4.9 kernels?  How often does the mdadm-kernel
interface break compatibility?

> Use a rescue disk!!! That way, you get a kernel and an mdadm that are
> the same approximate date. As it stands, your frankenkernel is too new
> for mdadm 3.4, but too ancient for a modern kernel.

Using a rescue disk would mean taking the system down for longer than
I can afford (I can afford to have this particular partition down for
a long time, but not the whole system...  which unfortunately resides
on the same disks).  So I'd like to keep this as a very last resort,
or at least, not consider it until I've fully understood what's going
on.  (It's especially problematic that I have absolutely no idea of
the speed at which I can expect the reshape to take place, compared to
an ordinary resync.  If you could give me a ballpark figure, it would
help me decide.  My disks resync at ~120MB/sec, and the RAID array I
wish to reshape is ~900GB in per partition, so it takes a few hours to
do an "ordinary" resync: I assume a reshape will take much longer, but
how much longer are we talking?)

But I made another discovery in the mean time: when I run the --grow
command, something starts a systemd service called
mdadm-grow-continue@<device>.service (so in my case
mdadm-grow-continue@md112.service; I wasn't able to understand exactly
who the caller is), a unit which contains

ExecStart=/sbin/mdadm --grow --continue /dev/%I

so it ran

/sbin/mdadm --grow --continue /dev/md112

which failed with

mdadm: array: Cannot grow - need backup-file
mdadm:  Please provide one with "--backup=..."

Now if I override this service to read

ExecStart=/sbin/mdadm --grow --continue /dev/%I --backup=/run/mdadm/backup_file-%I

then it seems to work correctly, at least on my toy example with
loopback devices (but then I suppose it will break the reshape cases
where no backup file is needed?).

I'm very confused as to what's going on here: was this file supposed
to work in the first place?  Why is it needed?  Whence does it come
from?  Am I permitted to run mdadm --continue myself?  Supposed to?
How did all of this work before systemd came in?

PS: Oh, there's already a Debian bug for this: #884719
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=884719
- but it's not marked as fixed.  Is array reshaping broken on Debian?

Cheers,

-- 
     David A. Madore
   ( http://www.madore.org/~david/ )