* Recovery after failed chunk size change
@ 2016-03-31 19:33 Benjamin Meier
  2016-04-01  5:25 ` NeilBrown
  0 siblings, 1 reply; 4+ messages in thread
From: Benjamin Meier @ 2016-03-31 19:33 UTC (permalink / raw)
  To: linux-raid

Hi there,

I tried to change the chunk size from 4096k to 64k on a 7-disk RAID6 
array. I am using Debian Jessie with kernel 3.16 and mdadm 3.3.2. After 
I initiated the change, the process stalled immediately. I could see in 
/proc/mdstat that there had been no progress at all, and the backup file 
hasn't been touched for days now.
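
This is roughly how I have been checking for progress (/path/to/backup.file 
is a placeholder; the array name /dev/md/TA comes from the superblock name 
shown in the --examine output below):

  # reshape progress (or lack of it)
  cat /proc/mdstat
  # reshape state as mdadm reports it
  mdadm --detail /dev/md/TA | grep -Ei 'state|reshape'
  # the backup file's modification time never changes
  ls -l --time-style=full-iso /path/to/backup.file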

So I decided to back up all data from the device in case it wouldn't 
start after the next reboot. Unfortunately, the system was accidentally 
restarted before the backup was finished. Now the array no longer 
assembles, even with the correct --backup-file. I get "mdadm: Failed to 
restore critical section for reshape, sorry.".
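
For reference, the assemble command I am running is of roughly this form 
(assuming the partitions follow the hyper_TA_1 ... hyper_TA_7 naming seen 
in the --examine output below; the backup file path is a placeholder):

  mdadm --assemble /dev/md/TA --backup-file=/path/to/backup.file \
    /dev/disk/by-partlabel/hyper_TA_[1-7]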

So the first question is: how can I access the data again? I think there 
is no damage at this point; I have appended the output of --examine at the 
end of this message. All seven drives give the same output in all 
relevant fields. In particular, "Chunk Size", "New Chunksize" and "Reshape 
pos'n" are identical across the drives.
What is the best way to proceed without damaging any data?

Second question: Is the problem with the chunk size change a known bug?

Thanks for reading!

--
/dev/disk/by-partlabel/hyper_TA_1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x5
            Name : hyper:TA  (local to host hyper)
      Raid Level : raid6
    Raid Devices : 7

  Avail Dev Size : 3434725376 (1637.80 GiB 1758.58 GB)
      Array Size : 8586813440 (8189.02 GiB 8792.90 GB)
     Data Offset : 147456 sectors
    Super Offset : 8 sectors
    Unused Space : before=147368 sectors, after=0 sectors
           State : clean

Internal Bitmap : 8 sectors from superblock
   Reshape pos'n : 0
   New Chunksize : 64K

     Update Time : Thu Mar 31 17:57:01 2016
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : e7172c1f - correct
          Events : 527046

          Layout : left-symmetric
      Chunk Size : 4096K

    Device Role : Active device 5
    Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing)



* Re: Recovery after failed chunk size change
  2016-03-31 19:33 Recovery after failed chunk size change Benjamin Meier
@ 2016-04-01  5:25 ` NeilBrown
  2016-04-01 20:03   ` Benjamin Meier
  0 siblings, 1 reply; 4+ messages in thread
From: NeilBrown @ 2016-04-01  5:25 UTC (permalink / raw)
  To: Benjamin Meier, linux-raid


On Fri, Apr 01 2016, Benjamin Meier wrote:

> Hi there,
>
> I tried to change the chunk size from 4096k to 64k on a 7-disk RAID6 
> array. I am using Debian Jessie with kernel 3.16 and mdadm 3.3.2. After 
> I initiated the change, the process stalled immediately. I could see in 
> /proc/mdstat that there had been no progress at all, and the backup file 
> hasn't been touched for days now.
>
> So I decided to back up all data from the device in case it wouldn't 
> start after the next reboot. Unfortunately, the system was accidentally 
> restarted before the backup was finished. Now the array no longer 
> assembles, even with the correct --backup-file. I get "mdadm: Failed to 
> restore critical section for reshape, sorry.".
>
> So the first question is: how can I access the data again? I think there 
> is no damage at this point; I have appended the output of --examine at the 
> end of this message. All seven drives give the same output in all 
> relevant fields. In particular, "Chunk Size", "New Chunksize" and "Reshape 
> pos'n" are identical across the drives.
> What is the best way to proceed without damaging any data?

 mdadm --assemble --force --update=revert-reshape --invalid-backup \
   --backup-file=/whatever /dev/md/TA /dev/list-of-devices

using mdadm 3.4.
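
Filled in with the device names from your --examine output it would look 
something like the following (the by-partlabel paths are my guess from the 
hyper_TA_1 naming and the backup file path is a placeholder; double-check 
both, and verify the result before writing anything to the array):

 mdadm --assemble --force --update=revert-reshape --invalid-backup \
   --backup-file=/path/to/backup.file /dev/md/TA \
   /dev/disk/by-partlabel/hyper_TA_[1-7]

 cat /proc/mdstat             # the reshape should be gone
 mdadm --detail /dev/md/TA
 fsck -n /dev/md/TA           # read-only filesystem check, if applicable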

>
> Second question: Is the problem with the chunk size change a known bug?

Yes, this has been happening to a few people.  The reshape doesn't really
start properly.
If someone can provide a recipe for how to reproduce the problem
(e.g. using loop-back devices) I'll happily look into fixing it, or
identifying which kernel it is already fixed in.
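
If anyone wants to poke at a stuck array in the meantime, the kernel's view
of the (non-)progress is visible in sysfs, roughly like this (mdX is a
placeholder for the stuck array):

 cat /sys/block/mdX/md/sync_action        # says "reshape" while one is running
 cat /sys/block/mdX/md/reshape_position   # stays put when the reshape is stuck
 cat /sys/block/mdX/md/sync_completed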

NeilBrown


>
> Thanks for reading!
>
> --
> /dev/disk/by-partlabel/hyper_TA_1:
>            Magic : a92b4efc
>          Version : 1.2
>      Feature Map : 0x5
>             Name : hyper:TA  (local to host hyper)
>       Raid Level : raid6
>     Raid Devices : 7
>
>   Avail Dev Size : 3434725376 (1637.80 GiB 1758.58 GB)
>       Array Size : 8586813440 (8189.02 GiB 8792.90 GB)
>      Data Offset : 147456 sectors
>     Super Offset : 8 sectors
>     Unused Space : before=147368 sectors, after=0 sectors
>            State : clean
>
> Internal Bitmap : 8 sectors from superblock
>    Reshape pos'n : 0
>    New Chunksize : 64K
>
>      Update Time : Thu Mar 31 17:57:01 2016
>    Bad Block Log : 512 entries available at offset 72 sectors
>         Checksum : e7172c1f - correct
>           Events : 527046
>
>           Layout : left-symmetric
>       Chunk Size : 4096K
>
>     Device Role : Active device 5
>     Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing)



* Re: Recovery after failed chunk size change
  2016-04-01  5:25 ` NeilBrown
@ 2016-04-01 20:03   ` Benjamin Meier
  2016-04-11 15:29     ` Benjamin Meier
  0 siblings, 1 reply; 4+ messages in thread
From: Benjamin Meier @ 2016-04-01 20:03 UTC (permalink / raw)
  To: linux-raid

Hi,

On 01.04.2016 at 07:25, NeilBrown wrote:
> mdadm --assemble --force --update=revert-reshape --invalid-backup \
>   --backup-file=/whatever /dev/md/TA /dev/list-of-devices
>
> using mdadm 3.4.
Thanks. Now my array is online again and working.
> If someone can provide a recipe for how to reproduce the problem
> (e.g. using loop-back devices) I'll happily look into fixing it, or
> identifying which kernel it is already fixed in.
>
> NeilBrown
I could reproduce the issue with my current kernel. I also tried the 
Debian unstable kernel 4.4 together with mdadm 3.4, and the bug still 
seems to be there. Most of the time not a single block is reshaped. When 
I executed the commands by hand I saw that sometimes the process stops 
after only one block and sometimes it stops somewhere in the middle. You 
can use the script below to make reproduction easier. It may also work 
with files smaller than 1 GiB.

Good luck with the bug hunting!
--
#!/bin/bash

# Create a seven disk RAID6 array with sparse files (1GiB each)
declare -a LO_DEVICES
for x in 0 1 2 3 4 5 6; do
   dd if=/dev/zero of=sparse$x bs=1G count=0 seek=1
   losetup -f sparse${x}
   LO_DEVICES[$x]=$(losetup -a|grep sparse${x}|cut -f1 -d" "|sed "s/://")
done
mdadm --create /dev/md/TestMD --chunk=4096 --bitmap=internal --level=6 \
   --raid-devices=7 \
   ${LO_DEVICES[*]}
mdadm --wait /dev/md/TestMD

# Provoke the bug.
# Tested with: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u4
#   (2016-02-29) x86_64 GNU/Linux
#   mdadm - v3.3.2 - 21st August 2014
mdadm --grow /dev/md/TestMD --chunk=64 --backup-file=backup.file
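
After the last command you can watch the reshape stall and then tear the
test array down again; roughly like this (not part of the original script,
paths match the setup above):

# Observe: the reshape either never moves or stops after a block or two
cat /proc/mdstat

# Teardown of the test array and its loop devices
mdadm --stop /dev/md/TestMD
for x in 0 1 2 3 4 5 6; do
   losetup -d "$(losetup -j sparse$x | cut -d: -f1)"
done
rm -f sparse[0-6] backup.file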



* Re: Recovery after failed chunk size change
  2016-04-01 20:03   ` Benjamin Meier
@ 2016-04-11 15:29     ` Benjamin Meier
  0 siblings, 0 replies; 4+ messages in thread
From: Benjamin Meier @ 2016-04-11 15:29 UTC (permalink / raw)
  To: linux-raid

Hi,

>
> # Create a seven disk RAID6 array with sparse files (1GiB each)
> declare -a LO_DEVICES
> for x in 0 1 2 3 4 5 6; do
>   dd if=/dev/zero of=sparse$x bs=1G count=0 seek=1
>   losetup -f sparse${x}
>   LO_DEVICES[$x]=$(losetup -a|grep sparse${x}|cut -f1 -d" "|sed "s/://")
> done
> mdadm --create /dev/md/TestMD --chunk=4096 --bitmap=internal --level=6 \
>   --raid-devices=7 \
>   ${LO_DEVICES[*]}
> mdadm --wait /dev/md/TestMD
>
> # Provoke the bug.
> # Tested with: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u4
> #   (2016-02-29) x86_64 GNU/Linux
> #   mdadm - v3.3.2 - 21st August 2014
> mdadm --grow /dev/md/TestMD --chunk=64 --backup-file=backup.file
>

Were you able to reproduce this issue, or do you need any additional 
information?

I have discovered that the problem is not limited to changing the chunk 
size. It also happens with a RAID level change, for example from 6 to 5.
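
For the test setup above, a level-change variant should just need the final
--grow line of the script swapped for something like this (illustrative
only, not the exact command from my real array):

# Instead of the chunk size change, request a level change 6 -> 5
mdadm --grow /dev/md/TestMD --level=5 --backup-file=backup.file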

BR!


end of thread, other threads:[~2016-04-11 15:29 UTC | newest]

Thread overview: 4+ messages
-- links below jump to the message on this page --
2016-03-31 19:33 Recovery after failed chunk size change Benjamin Meier
2016-04-01  5:25 ` NeilBrown
2016-04-01 20:03   ` Benjamin Meier
2016-04-11 15:29     ` Benjamin Meier
