* Raid10 reshape bug
From: Phillip Susi @ 2021-02-19 20:13 UTC (permalink / raw)
To: linux-raid
In the process of upgrading a Xen server I broke the previous raid1
and used the removed disk to create a new raid10 to prepare the new
install. I think I initially created it in the default near
configuration, then reshaped it to offset with a 1M chunk size. I got
the domUs up and running again and was pretty happy with the result,
so I blew away the old system disk, added that disk to the new array,
and let it sync. Then I thought the 1M chunk size was hurting
performance, so I requested a reshape to a 256k chunk size with
mdadm -G /dev/md0 -c 256. It looked like it was proceeding fine, so I
went home for the night.
When I came in this morning, mdadm -D showed that the reshape was
complete, but I began getting ELF errors and the like when running
various programs, and I got the feeling that something had gone
horribly wrong. At one point I tried to run blockdev --getsz and
instead the system somehow ran findmnt. mdadm -E showed a very large
unused section of the disk both before and after the data. This is
probably because I had used --size to restrict the used size of each
device to only 256 GB instead of the full 2 TB so the resync wouldn't
take so long, and since there was plenty of unused space, md decided
to write the new-layout stripes back into unused space further down
the disk.
At this point I rebooted and grub could not recognize the filesystem.
I booted other media and ran e2fsck, but it had so many complaints
(one being that the root directory was not, in fact, a directory, so
it deleted it) that I just gave up, reinstalled, and restored the domU
from backup.
Clearly the reshape process did NOT write the data back to the disk
in the correct place. This was on Debian testing with Linux 5.10.0
and mdadm v4.1.
I will try to reproduce it in a vm at some point.
* Re: Raid10 reshape bug
From: Phillip Susi @ 2021-03-08 16:39 UTC (permalink / raw)
To: linux-raid
So it turns out that all you have to do to trigger the bug is:
mdadm --create -l raid10 -n 2 /dev/md1 /dev/loop0 missing
mdadm -G /dev/md1 -p o2
After changing the layout to offset, attempting to mkfs.ext4 the raid
device results in errors, and this shows up in dmesg:
[1467312.410811] md: md1: reshape done.
[1467386.790079] handle_bad_sector: 446 callbacks suppressed
[1467386.790083] attempt to access beyond end of device
dm-3: rw=0, want=2127992, limit=2097152
[1467386.790096] attempt to access beyond end of device
dm-3: rw=0, want=2127992, limit=2097152
[1467386.790099] buffer_io_error: 2062 callbacks suppressed
[1467386.790101] Buffer I/O error on dev md1, logical block 4238, async page read
[1467386.793270] attempt to access beyond end of device
dm-3: rw=0, want=2127992, limit=2097152
[1467386.793277] Buffer I/O error on dev md1, logical block 4238, async page read
[1467394.422528] attempt to access beyond end of device
dm-3: rw=0, want=4187016, limit=2097152
[1467394.422541] attempt to access beyond end of device
dm-3: rw=0, want=4187016, limit=2097152
[1467394.422545] Buffer I/O error on dev md1, logical block 261616, async page read
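For what it's worth, a quick bit of arithmetic on those numbers (my own
check, assuming the usual 512-byte sectors in these kernel messages):

```python
# Sanity arithmetic on the dmesg numbers above; sectors are 512 bytes
# and "limit" is the size the kernel reports for dm-3, the backing LV.
SECTOR = 512
limit = 2097152  # sectors

# How far past the end of the device each bad read landed, in MiB.
overshoot_mib = {
    want: (want - limit) * SECTOR / 2**20
    for want in (2127992, 4187016)
}

print(limit * SECTOR == 2**30)  # True: the backing device is exactly 1 GiB
print(overshoot_mib)
```

The first bad read lands about 15 MiB past the end of the device, and the
second nearly a whole device-size (about 1020 MiB) past it, so the reads
are not just off by a chunk or two.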
/dev/md1:
Version : 1.2
Creation Time : Mon Mar 8 11:21:23 2021
Raid Level : raid10
Array Size : 1046528 (1022.00 MiB 1071.64 MB)
Used Dev Size : 1046528 (1022.00 MiB 1071.64 MB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Mon Mar 8 11:24:10 2021
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Layout : offset=2
Chunk Size : 512K
Consistency Policy : resync
Name : hyper1:1 (local to host hyper1)
UUID : 69618fc3:c6abd8de:8458d647:1c242e1a
Events : 3409
Number Major Minor RaidDevice State
0 253 3 0 active sync /dev/dm-3
- 0 0 1 removed
/dev/hyper1/leg1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 69618fc3:c6abd8de:8458d647:1c242e1a
Name : hyper1:1 (local to host hyper1)
Creation Time : Mon Mar 8 11:21:23 2021
Raid Level : raid10
Raid Devices : 2
Avail Dev Size : 2095104 (1023.00 MiB 1072.69 MB)
Array Size : 1046528 (1022.00 MiB 1071.64 MB)
Used Dev Size : 2093056 (1022.00 MiB 1071.64 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
Unused Space : before=1968 sectors, after=2048 sectors
State : clean
Device UUID : 476f8e72:76084630:c33c16e4:7c987659
Update Time : Mon Mar 8 11:24:10 2021
Bad Block Log : 512 entries available at offset 16 sectors
Checksum : e5995957 - correct
Events : 3409
Layout : offset=2
Chunk Size : 512K
Device Role : Active device 0
Array State : A. ('A' == active, '.' == missing, 'R' == replacing)
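Note that the geometry mdadm reports is internally consistent (my own
arithmetic below, not mdadm output), so the beyond-end accesses look like
a bad sector mapping after the reshape rather than mis-recorded sizes:

```python
# Cross-checking the mdadm -E geometry above. All figures are in
# 512-byte sectors except Array Size, which mdadm reports in KiB.
avail_dev_size = 2095104   # Avail Dev Size
data_offset    = 2048      # Data Offset
used_dev_size  = 2093056   # Used Dev Size
array_size_kib = 1046528   # Array Size (KiB)

# Data area plus metadata offset equals the whole component device,
# matching the dm-3 limit (2097152 sectors) in the dmesg output.
assert avail_dev_size + data_offset == 2097152

# "Unused Space : after=2048" is simply avail minus used.
assert avail_dev_size - used_dev_size == 2048

# With 2 disks and 2 copies, the array holds one disk's worth of data:
# 1046528 KiB * 2 sectors per KiB == Used Dev Size.
assert array_size_kib * 2 == used_dev_size
```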
Phillip Susi writes:
> In the process of upgrading a Xen server I broke the previous raid1
> and used the removed disk to create a new raid10 to prepare the new
> install. I think I initially created it in the default near
> configuration, then reshaped it to offset with a 1M chunk size. I got
> the domUs up and running again and was pretty happy with the result,
> so I blew away the old system disk, added that disk to the new array,
> and let it sync. Then I thought the 1M chunk size was hurting
> performance, so I requested a reshape to a 256k chunk size with
> mdadm -G /dev/md0 -c 256. It looked like it was proceeding fine, so I
> went home for the night.
>
> When I came in this morning, mdadm -D showed that the reshape was
> complete, but I began getting ELF errors and the like when running
> various programs, and I got the feeling that something had gone
> horribly wrong. At one point I tried to run blockdev --getsz and
> instead the system somehow ran findmnt. mdadm -E showed a very large
> unused section of the disk both before and after the data. This is
> probably because I had used --size to restrict the used size of each
> device to only 256 GB instead of the full 2 TB so the resync wouldn't
> take so long, and since there was plenty of unused space, md decided
> to write the new-layout stripes back into unused space further down
> the disk.
> At this point I rebooted and grub could not recognize the filesystem.
> I booted other media and ran e2fsck, but it had so many complaints
> (one being that the root directory was not, in fact, a directory, so
> it deleted it) that I just gave up, reinstalled, and restored the
> domU from backup.
>
> Clearly the reshape process did NOT write the data back to the disk
> in the correct place. This was on Debian testing with Linux 5.10.0
> and mdadm v4.1.
>
> I will try to reproduce it in a vm at some point.
* Re: Raid10 reshape bug
From: Phillip Susi @ 2021-03-08 17:01 UTC (permalink / raw)
To: linux-raid
Phillip Susi writes:
> So it turns out all you have to do to trigger a bug is:
>
> mdadm --create -l raid10 -n 2 /dev/md1 /dev/loop0 missing
> mdadm -G /dev/md1 -p o2
I tried it again using a second disk instead of starting with a
degraded array, and the reshape claims it worked, but it left the
array degraded with one disk faulty and the data trashed.
After recreating the array with the offset layout initially and
formatting it, the filesystem was also trashed when I reshaped to a
64k chunk size with:
mdadm -G /dev/md1 -c 64
I also tried this with raid5 and raid4 instead of raid10 and they work,
so it seems to be specific to raid10.
I tried to change the chunk size on raid0 and, for some reason, mdadm
wants to convert it to raid4 first, and can't, since that would reduce
the size.
Hrm... I went back and tried reshaping the chunk size on raid10 again,
but in the default near layout rather than offset, and it works fine,
so it appears to be a problem only with the offset layout. I tried the
far layout, but mdadm says it cannot reshape to far.
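In case it helps localize this, my rough mental model of why offset is
the odd one out: in n2 both copies of a chunk live at the same device
offset, while in o2 the second copy is rotated to the next disk one
chunk-row down, so a reshape has to relocate every chunk. This is only
a simplified sketch for the 2-disk, 2-copy case, not the kernel's
actual raid10 mapping code:

```python
def near2(c):
    """'n2' on 2 disks: both copies of array chunk c sit at the same
    device-chunk offset, one copy per disk."""
    return [(0, c), (1, c)]                   # (disk, device-chunk)

def offset2(c):
    """'o2' on 2 disks: each stripe is written twice; the second copy
    is shifted to the next disk, one chunk-row further down."""
    stripe, pos = divmod(c, 2)                # stripe number, position
    return [(pos, 2 * stripe),                # original copy
            ((pos + 1) % 2, 2 * stripe + 1)]  # rotated copy, next row

# The same array chunk lands at different device offsets in the two
# layouts, so chunk-size or layout changes must rewrite the mapping.
print(near2(5))    # [(0, 5), (1, 5)]
print(offset2(5))  # [(1, 4), (0, 5)]
```

If the reshape finishes but reads are then issued with a stale or
mis-scaled offset-layout mapping, they would land far past the shrunk
data region, which is what the want/limit errors above look like.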