Linux-Raid Archives on lore.kernel.org
* PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
@ 2020-12-09  3:46 Matthew Ruffell
  2020-12-09  4:17 ` Song Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Ruffell @ 2020-12-09  3:46 UTC
  To: xni, linux-raid
  Cc: song, linux-kernel, colyli, guoqing.jiang, songliubraving,
	khalid.elmously, jay.vosburgh

Hello,

I recently backported the following patches into the Ubuntu stable kernels:

md: add md_submit_discard_bio() for submitting discard bio
md/raid10: extend r10bio devs to raid disks
md/raid10: pull codes that wait for blocked dev into one function
md/raid10: improve raid10 discard request
md/raid10: improve discard request for far layout
dm raid: fix discard limits for raid1 and raid10
dm raid: remove unnecessary discard limits for raid10

and this morning, a user reported the following downstream bug:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/

Their weekly cron job that runs fstrim had run, and their raid10 array now has
extensive data corruption.

The issue is reproducible on the latest 5.10-rc7 mainline kernel; steps are
below.

I used an m5d.4xlarge instance on AWS to utilise 2x 300GB SSDs that support
block discard. You will want to select small disks to lower the time needed
to reproduce.
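
Before building the array, it may be worth confirming that the member disks
advertise discard support at all. The following is only a sketch: it reads the
queue/discard_max_bytes attribute from sysfs, where a non-zero value means the
device accepts discards. The function name and the SYSFS_ROOT override are
made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Sketch: report whether a block device advertises discard support.
# SYSFS_ROOT is a hypothetical override so the check can be tried against
# a fake directory tree; on a real system it defaults to /sys.
supports_discard() {
    dev=$1
    f="${SYSFS_ROOT:-/sys}/block/$dev/queue/discard_max_bytes"
    [ -r "$f" ] || return 2        # unknown device or attribute missing
    [ "$(cat "$f")" -gt 0 ]        # non-zero: the device accepts discards
}

# Usage (device names taken from the lsblk output in this report):
#   supports_discard nvme1n1 && echo "nvme1n1 supports discard"
#   supports_discard nvme2n1 && echo "nvme2n1 supports discard"
```

A return status of 2 is used for "could not tell", so that a mistyped device
name is not mistaken for a device without discard support.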

$ uname -rv
5.10.0-rc7+ #1 SMP Wed Dec 9 01:15:27 UTC 2020

Create a raid10 array, with LVM:

$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0     8G  0 disk 
└─nvme0n1p1 259:1    0     8G  0 part /
nvme1n1     259:2    0 279.4G  0 disk 
nvme2n1     259:3    0 279.4G  0 disk

$ sudo -s
# mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme1n1 /dev/nvme2n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 292836352K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
# pvcreate -ff -y /dev/md0
  Physical volume "/dev/md0" successfully created.
# vgcreate -f -y VolGroup /dev/md0
  Volume group "VolGroup" successfully created
# lvcreate -n root -L 100G -ay -y VolGroup
  Logical volume "root" created.
# mkfs.ext4 /dev/VolGroup/root
mke2fs 1.44.1 (24-Mar-2018)
Discarding device blocks: done                            
Creating filesystem with 26214400 4k blocks and 6553600 inodes
Filesystem UUID: d7be2e14-fa4d-4489-884b-3bef63b1e1db
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (131072 blocks): done
Writing superblocks and filesystem accounting information: done
# mount /dev/VolGroup/root /mnt

Next, wait for the initial resync to complete; this took about 25 minutes on
the m5d.4xlarge instance.

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid10 nvme2n1[1] nvme1n1[0]
      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
      [==>..................]  resync = 12.0% (35211392/292836352) finish=21.4min speed=200340K/sec
      bitmap: 3/3 pages [12KB], 65536KB chunk

unused devices: <none>
# cat /sys/block/md0/md/mismatch_cnt
76918016

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid10 nvme2n1[1] nvme1n1[0]
      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
      bitmap: 0/3 pages [0KB], 65536KB chunk

unused devices: <none>
# cat /sys/block/md0/md/mismatch_cnt
582330240
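
Rather than re-reading /proc/mdstat by hand, the wait can be scripted. This is
a rough sketch that polls the array's sync_action attribute, which reads
"idle" once no resync or check is running. The function name and the polling
interval argument are assumptions of the sketch.

```shell
#!/bin/sh
# Sketch: block until an md array has finished its current resync/check.
# $1 is the path to the sync_action attribute, $2 an optional poll
# interval in seconds (defaults to 30; chosen arbitrarily).
wait_md_idle() {
    attr=$1
    interval=${2:-30}
    [ -r "$attr" ] || return 2     # avoid looping forever on a bad path
    while [ "$(cat "$attr")" != "idle" ]; do
        sleep "$interval"
    done
}

# Usage:
#   wait_md_idle /sys/block/md0/md/sync_action
#   cat /sys/block/md0/md/mismatch_cnt
```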

Now that the resync is complete, create a file, sync, and delete it:

# dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.95974 s, 1.1 GB/s
# sync
# rm /mnt/data.raw

Perform a check:

# echo check > /sys/block/md0/md/sync_action

Again, wait 25 minutes for it to complete:

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid10 nvme1n1[1] nvme2n1[0]
      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
      [==>..................]  check = 13.7% (40356224/292836352) finish=20.8min speed=201707K/sec
      bitmap: 0/3 pages [0KB], 65536KB chunk

unused devices: <none>
# cat /sys/block/md0/md/mismatch_cnt
1469696

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid10 nvme1n1[1] nvme2n1[0]
      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
      bitmap: 0/3 pages [0KB], 65536KB chunk

unused devices: <none>
# cat /sys/block/md0/md/mismatch_cnt
1469696

Now, perform the fstrim:

# fstrim /mnt --verbose
/mnt: 97.9 GiB (105089236992 bytes) trimmed

Go for another check:

# echo check >/sys/block/md0/md/sync_action
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid10 nvme1n1[1] nvme2n1[0]
      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
      [========>............]  check = 40.3% (118270848/292836352) finish=14.4min speed=200963K/sec
      bitmap: 0/3 pages [0KB], 65536KB chunk

unused devices: <none>
# cat /sys/block/md0/md/mismatch_cnt
205324928

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid10 nvme1n1[1] nvme2n1[0]
      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
      bitmap: 0/3 pages [0KB], 65536KB chunk

unused devices: <none>
# cat /sys/block/md0/md/mismatch_cnt
205324928
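
For a sense of scale, the mismatch counts can be converted to bytes. The md
documentation describes mismatch_cnt as a count of sectors; the sketch below
assumes 512-byte sectors, and the helper names are made up. With that
assumption, 205324928 sectors works out to 105126363136 bytes, or about
97.9 GiB, which is close to the 105089236992 bytes that fstrim reported as
trimmed. That is only an observation, not a diagnosis.

```shell
#!/bin/sh
# Sketch: convert an md mismatch_cnt value (in sectors) to bytes and MiB.
# Assumes 512-byte sectors.
mismatch_to_bytes() {
    echo $(( $1 * 512 ))
}

mismatch_to_mib() {
    echo $(( $1 / 2048 ))          # 2048 sectors of 512 bytes per MiB
}

mismatch_to_bytes 1469696          # count from the check before fstrim
mismatch_to_bytes 205324928        # count from the check after fstrim
mismatch_to_mib 205324928
```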

Now, we need to take the raid10 array down, and perform a fsck on one disk at
a time:

# umount /mnt
# vgchange -a n /dev/VolGroup
  0 logical volume(s) in volume group "VolGroup" now active
# mdadm --stop /dev/md0
mdadm: stopped /dev/md0

Let's do the first disk:

# mdadm --assemble /dev/md127 /dev/nvme1n1 
mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
# mdadm --run /dev/md127
mdadm: started array /dev/md/lv-raid
# vgchange -a y /dev/VolGroup
  1 logical volume(s) in volume group "VolGroup" now active
# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks
# vgchange -a n /dev/VolGroup
  0 logical volume(s) in volume group "VolGroup" now active
# mdadm --stop /dev/md127
mdadm: stopped /dev/md127

The second disk:

# mdadm --assemble /dev/md127 /dev/nvme2n1
mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
# mdadm --run /dev/md127
mdadm: started array /dev/md/lv-raid
# vgchange -a y /dev/VolGroup
  1 logical volume(s) in volume group "VolGroup" now active
# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.44.1 (24-Mar-2018)
Resize inode not valid.  Recreate? no

Pass 1: Checking inodes, blocks, and sizes
Inode 7 has illegal block(s).  Clear? no

Illegal indirect block (1714656753) in inode 7.  IGNORED.
Error while iterating over blocks in inode 7: Illegal indirect block found

/dev/VolGroup/root: ********** WARNING: Filesystem still has errors **********

e2fsck: aborted

/dev/VolGroup/root: ********** WARNING: Filesystem still has errors **********

# vgchange -a n /dev/VolGroup
  0 logical volume(s) in volume group "VolGroup" now active
# mdadm --stop /dev/md127
mdadm: stopped /dev/md127
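
The per-disk procedure above is the same six commands each time, so it can be
summarised as a small helper. This sketch only prints the commands and never
runs them; the function and argument names are illustrative, and the commands
themselves are the ones used in this report.

```shell
#!/bin/sh
# Sketch: print (not execute) the commands used above to fsck a single
# raid10 member in isolation. Review the output before running any of it
# as root.
single_leg_fsck_plan() {
    md=$1 member=$2 vg=$3 lv=$4
    cat <<EOF
mdadm --assemble $md $member
mdadm --run $md
vgchange -a y /dev/$vg
fsck.ext4 -n -f /dev/$vg/$lv
vgchange -a n /dev/$vg
mdadm --stop $md
EOF
}

single_leg_fsck_plan /dev/md127 /dev/nvme1n1 VolGroup root
single_leg_fsck_plan /dev/md127 /dev/nvme2n1 VolGroup root
```

Because fsck.ext4 is given -n, it opens the filesystem read-only and answers
"no" to every prompt, so the check itself does not modify either disk.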

There are no panics or anything similar in dmesg. The directory structure of
the first disk is intact, but the second disk only has lost+found present.

I can confirm the cause is the patches listed at the top of this email, but I
have not yet had an opportunity to bisect to find the exact root cause. I will
do that once we confirm which Ubuntu stable kernels are affected and begin
reverting the patches.

Let me know if you need any more details.

Thanks,
Matthew Ruffell

* Re: PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
  2020-12-09  3:46 PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim Matthew Ruffell
@ 2020-12-09  4:17 ` Song Liu
  2020-12-09 22:04   ` Song Liu
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Song Liu @ 2020-12-09  4:17 UTC
  To: Matthew Ruffell, Xiao Ni
  Cc: linux-raid, Song Liu, lkml, Coly Li, Guoqing Jiang,
	khalid.elmously, Jay Vosburgh

Hi Matthew, 

> On Dec 8, 2020, at 7:46 PM, Matthew Ruffell <matthew.ruffell@canonical.com> wrote:
> 
> Hello,
> 
> I recently backported the following patches into the Ubuntu stable kernels:
> 
> md: add md_submit_discard_bio() for submitting discard bio
> md/raid10: extend r10bio devs to raid disks
> md/raid10: pull codes that wait for blocked dev into one function
> md/raid10: improve raid10 discard request
> md/raid10: improve discard request for far layout
> dm raid: fix discard limits for raid1 and raid10
> dm raid: remove unnecessary discard limits for raid10

Thanks for the report!

Hi Xiao, 

Could you please take a look at this and let me know soon? We need to fix
this before the official 5.10 release.

Thanks,
Song

* Re: PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
  2020-12-09  4:17 ` Song Liu
@ 2020-12-09 22:04   ` Song Liu
  2020-12-10  1:35   ` Xiao Ni
  2020-12-24 10:18   ` Xiao Ni
  2 siblings, 0 replies; 8+ messages in thread
From: Song Liu @ 2020-12-09 22:04 UTC
  To: Matthew Ruffell, Xiao Ni
  Cc: linux-raid, Song Liu, lkml, Coly Li, Guoqing Jiang,
	khalid.elmously, Jay Vosburgh



> On Dec 8, 2020, at 8:17 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> Hi Matthew, 
> 
>> On Dec 8, 2020, at 7:46 PM, Matthew Ruffell <matthew.ruffell@canonical.com> wrote:
>> 
>> Hello,
>> 
>> I recently backported the following patches into the Ubuntu stable kernels:
>> 
>> md: add md_submit_discard_bio() for submitting discard bio
>> md/raid10: extend r10bio devs to raid disks
>> md/raid10: pull codes that wait for blocked dev into one function
>> md/raid10: improve raid10 discard request
>> md/raid10: improve discard request for far layout

I reproduced the issue with 5.10-rc7 on md/raid10. The issue goes away when I
revert the md/raid10 patches.

>> dm raid: fix discard limits for raid1 and raid10
>> dm raid: remove unnecessary discard limits for raid10


Since the official 5.10 release is due this weekend, I am afraid we have to
revert these changes for 5.10.

I just sent a patch to revert 

f0e90b6c663a ("dm raid: remove unnecessary discard limits for raid10")

I will send a pull request to revert the md/raid10 patches.

Thanks,
Song

* Re: PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
  2020-12-09  4:17 ` Song Liu
  2020-12-09 22:04   ` Song Liu
@ 2020-12-10  1:35   ` Xiao Ni
  2020-12-24 10:18   ` Xiao Ni
  2 siblings, 0 replies; 8+ messages in thread
From: Xiao Ni @ 2020-12-10  1:35 UTC
  To: Song Liu, Matthew Ruffell
  Cc: linux-raid, Song Liu, lkml, Coly Li, Guoqing Jiang,
	khalid.elmously, Jay Vosburgh



On 12/09/2020 12:17 PM, Song Liu wrote:
> Hi Matthew,
>
>> On Dec 8, 2020, at 7:46 PM, Matthew Ruffell <matthew.ruffell@canonical.com> wrote:
>>
>> Hello,
>>
>> I recently backported the following patches into the Ubuntu stable kernels:
>>
>> md: add md_submit_discard_bio() for submitting discard bio
>> md/raid10: extend r10bio devs to raid disks
>> md/raid10: pull codes that wait for blocked dev into one function
>> md/raid10: improve raid10 discard request
>> md/raid10: improve discard request for far layout
>> dm raid: fix discard limits for raid1 and raid10
>> dm raid: remove unnecessary discard limits for raid10
> Thanks for the report!
>
> Hi Xiao,
>
> Could you please take a look at this and let me know soon? We need to fix
> this before the official 5.10 release.
>
> Thanks,
> Song

Hi all,

Sorry for the trouble, but I'm on PTO with no test machines. I'll have a look
at this problem next week.
>
>> and this morning, a user reported the following downstream bug:
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/
>>
>> Their weekly cronjob that runs fstrim had run, and their raid10 array has
>> extensive data corruption.
>>
>> The issue is reproducible on the latest 5.10-rc7 mainline kernel, steps are
>> below.
>>
>> I used a m5d.4xlarge instance on AWS to ultilise 2x 300GB SSDs that support
>> block discard. You will want to select small disks to lower the time needed
>> to reproduce.
>>
>> $ uname -rv
>> 5.10.0-rc7+ #1 SMP Wed Dec 9 01:15:27 UTC 2020
>>
>> Create a raid10 array, with LVM:
>>
>> $ lsblk
>> NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
>> nvme0n1     259:0    0     8G  0 disk
>> └─nvme0n1p1 259:1    0     8G  0 part /
>> nvme1n1     259:2    0 279.4G  0 disk
>> nvme2n1     259:3    0 279.4G  0 disk
>>
>> $ sudo -s
>> # mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme1n1 /dev/nvme2n1
>> mdadm: layout defaults to n2
>> mdadm: layout defaults to n2
>> mdadm: chunk size defaults to 512K
>> mdadm: size set to 292836352K
>> mdadm: automatically enabling write-intent bitmap on large array
>> mdadm: Defaulting to version 1.2 metadata
>> mdadm: array /dev/md0 started.
>> # pvcreate -ff -y /dev/md0
>>   Physical volume "/dev/md0" successfully created.
>> # vgcreate -f -y VolGroup /dev/md0
>>   Volume group "VolGroup" successfully created
>> # lvcreate -n root -L 100G -ay -y VolGroup
>>   Logical volume "root" created.
>> # mkfs.ext4 /dev/VolGroup/root
>> mke2fs 1.44.1 (24-Mar-2018)
>> Discarding device blocks: done
>> Creating filesystem with 26214400 4k blocks and 6553600 inodes
>> Filesystem UUID: d7be2e14-fa4d-4489-884b-3bef63b1e1db
>> Superblock backups stored on blocks:
>> 	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>> 	4096000, 7962624, 11239424, 20480000, 23887872
>>
>> Allocating group tables: done
>> Writing inode tables: done
>> Creating journal (131072 blocks): done
>> Writing superblocks and filesystem accounting information: done
>> # mount /dev/VolGroup/root /mnt
>>
>> Next, wait for the disk check to complete, 25 minutes on m5d.4xlarge instance.
>>
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme2n1[1] nvme1n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       [==>..................]  resync = 12.0% (35211392/292836352) finish=21.4min speed=200340K/sec
>>       bitmap: 3/3 pages [12KB], 65536KB chunk
>>
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 76918016
>>
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme2n1[1] nvme1n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>>
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 582330240
>>
>> Now that the check is complete, create a file, sync and delete it:
>>
>> # dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
>> 1048576+0 records in
>> 1048576+0 records out
>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.95974 s, 1.1 GB/s
>> # sync
>> # rm /mnt/data.raw
>>
>> Perform a check:
>>
>> # echo check > /sys/block/md0/md/sync_action
>>
>> Again, wait 25 minutes for it to complete:
>>
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       [==>..................]  check = 13.7% (40356224/292836352) finish=20.8min speed=201707K/sec
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>>
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 1469696
>>
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>>
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 1469696
>>
>> Now, perform the fstrim:
>>
>> # fstrim /mnt --verbose
>> /mnt: 97.9 GiB (105089236992 bytes) trimmed
>>
>> Go for another check:
>>
>> # echo check >/sys/block/md0/md/sync_action
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       [========>............]  check = 40.3% (118270848/292836352) finish=14.4min speed=200963K/sec
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>>
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 205324928
>>
>> # cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>>       292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>>       bitmap: 0/3 pages [0KB], 65536KB chunk
>>
>> unused devices: <none>
>> # cat /sys/block/md0/md/mismatch_cnt
>> 205324928
>>
>> Now, we need to take the raid10 array down, and perform a fsck on one disk at
>> a time:
>>
>> # umount /mnt
>> # vgchange -a n /dev/VolGroup
>>   0 logical volume(s) in volume group "VolGroup" now active
>> # mdadm --stop /dev/md0
>> mdadm: stopped /dev/md0
>>
>> Let's do the first disk:
>>
>> # mdadm --assemble /dev/md127 /dev/nvme1n1
>> mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
>> # mdadm --run /dev/md127
>> mdadm: started array /dev/md/lv-raid
>> # vgchange -a y /dev/VolGroup
>>   1 logical volume(s) in volume group "VolGroup" now active
>> # fsck.ext4 -n -f /dev/VolGroup/root
>> e2fsck 1.44.1 (24-Mar-2018)
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 2: Checking directory structure
>> Pass 3: Checking directory connectivity
>> Pass 4: Checking reference counts
>> Pass 5: Checking group summary information
>> /dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks
>> # vgchange -a n /dev/VolGroup
>>   0 logical volume(s) in volume group "VolGroup" now active
>> # mdadm --stop /dev/md127
>> mdadm: stopped /dev/md127
>>
>> The second disk:
>>
>> # mdadm --assemble /dev/md127 /dev/nvme2n1
>> mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
>> # mdadm --run /dev/md127
>> mdadm: started array /dev/md/lv-raid
>> # vgchange -a y /dev/VolGroup
>>   1 logical volume(s) in volume group "VolGroup" now active
>> # fsck.ext4 -n -f /dev/VolGroup/root
>> e2fsck 1.44.1 (24-Mar-2018)
>> Resize inode not valid.  Recreate? no
>>
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 7 has illegal block(s).  Clear? no
>>
>> Illegal indirect block (1714656753) in inode 7.  IGNORED.
>> Error while iterating over blocks in inode 7: Illegal indirect block found
>>
>> /dev/VolGroup/root: ********** WARNING: Filesystem still has errors **********
>>
>> e2fsck: aborted
>>
>> /dev/VolGroup/root: ********** WARNING: Filesystem still has errors **********
>>
>> # vgchange -a n /dev/VolGroup
>>   0 logical volume(s) in volume group "VolGroup" now active
>> # mdadm --stop /dev/md127
>> mdadm: stopped /dev/md127
>>
>> There are no panics or anything in dmesg. The directory structure of the first
>> disk is intact, but the second disk only has Lost+Found present.
>>
>> I can confirm it is the patches listed at the top of the email, but I have not
>> had an opportunity to bisect to find the exact root cause. I will do that once
>> we confirm what Ubuntu stable kernels are affected and begin reverting the
>> patches.
>>
>> Let me know if you need any more details.
>>
>> Thanks,
>> Matthew Ruffell



* Re: PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
  2020-12-09  4:17 ` Song Liu
  2020-12-09 22:04   ` Song Liu
  2020-12-10  1:35   ` Xiao Ni
@ 2020-12-24 10:18   ` Xiao Ni
  2020-12-27 21:57     ` Song Liu
  2021-02-02  3:42     ` Matthew Ruffell
  2 siblings, 2 replies; 8+ messages in thread
From: Xiao Ni @ 2020-12-24 10:18 UTC (permalink / raw)
  To: Song Liu, Matthew Ruffell
  Cc: linux-raid, Song Liu, lkml, Coly Li, Guoqing Jiang,
	khalid.elmously, Jay Vosburgh


[-- Attachment #1: Type: text/plain, Size: 2209 bytes --]



On 12/09/2020 12:17 PM, Song Liu wrote:
> Hi Matthew,
>
>> On Dec 8, 2020, at 7:46 PM, Matthew Ruffell <matthew.ruffell@canonical.com> wrote:
>>
>> Hello,
>>
>> I recently backported the following patches into the Ubuntu stable kernels:
>>
>> md: add md_submit_discard_bio() for submitting discard bio
>> md/raid10: extend r10bio devs to raid disks
>> md/raid10: pull codes that wait for blocked dev into one function
>> md/raid10: improve raid10 discard request
>> md/raid10: improve discard request for far layout
>> dm raid: fix discard limits for raid1 and raid10
>> dm raid: remove unnecessary discard limits for raid10
> Thanks for the report!
>
> Hi Xiao,
>
> Could you please take a look at this and let me know soon? We need to fix
> this before 5.10 official release.
>
> Thanks,
> Song
>
Hi all,

The root cause has been found. raid10 now handles discard requests in a
way similar to raid0: because the discard region is very large, we
calculate the start/end address for each disk and then submit a discard
request to each disk. But raid10, unlike raid0, has copies. For the near
layout, if the discard request is not aligned with the chunk size, we
calculate a start_disk_offset. Currently start_disk_offset is applied
only to the first disk, but it should be applied to the disks holding
the near copies too.

[  789.709501] discard bio start : 70968, size : 191176
[  789.709507] first stripe index 69, start disk index 0, start disk offset 70968
[  789.709509] last stripe index 256, end disk index 0, end disk offset 262144
[  789.709511] disk 0, dev start : 70968, dev end : 262144
[  789.709515] disk 1, dev start : 70656, dev end : 262144

For example, this test case has 2 near copies. The start_disk_offset for
the first disk is 70968, and the second disk should use the same offset.
Instead, it uses the start address of the chunk (70656), so it discards
a larger region than was requested. The patch in the attachment fixes
this problem by splitting off the parts of the region that are not
aligned with the chunk size.
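
To make the numbers in the log above concrete, here is a small userspace
sketch of the alignment arithmetic the attached patch performs before
computing the per-disk offsets. It is not the kernel code: sector_t is a
stand-in typedef, struct aligned_range, align_to_chunk() and
overdiscard_sectors() are hypothetical names, and chunk_shift = 10 is an
assumption (a 512K chunk expressed in 512-byte sectors).

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;	/* stand-in for the kernel type */

/* Hypothetical helper type: a discard range trimmed to chunk boundaries. */
struct aligned_range {
	sector_t head_split;	/* sectors split off the front (0 if aligned) */
	sector_t tail_split;	/* sectors split off the back (0 if aligned) */
	sector_t start;		/* first chunk-aligned sector */
	sector_t end;		/* chunk-aligned end sector (exclusive) */
};

/*
 * Mirrors the two "bio_start & chunk_mask" / "bio_end & chunk_mask"
 * checks in the patch. Assumes the range covers at least two chunks,
 * so the aligned start never passes the aligned end.
 */
struct aligned_range align_to_chunk(sector_t bio_start, sector_t bio_end,
				    unsigned int chunk_shift)
{
	sector_t chunk_size = (sector_t)1 << chunk_shift;
	sector_t chunk_mask = chunk_size - 1;
	struct aligned_range r = { 0, 0, bio_start, bio_end };

	assert(bio_end >= bio_start);
	assert(bio_end - bio_start >= 2 * chunk_size);

	if (bio_start & chunk_mask) {
		/* Unaligned head: everything up to the next chunk boundary. */
		r.head_split = chunk_size - (bio_start & chunk_mask);
		r.start = bio_start + r.head_split;
	}
	if (bio_end & chunk_mask) {
		/* Unaligned tail: everything past the last chunk boundary. */
		r.tail_split = bio_end & chunk_mask;
		r.end = bio_end - r.tail_split;
	}
	return r;
}

/*
 * Extra sectors discarded on the second near copy in the logged case,
 * where that copy starts at the chunk boundary instead of bio_start.
 */
sector_t overdiscard_sectors(sector_t bio_start, unsigned int chunk_shift)
{
	sector_t chunk_mask = ((sector_t)1 << chunk_shift) - 1;

	return bio_start & chunk_mask;
}
```

With the logged values (start 70968, size 191176, so end 262144),
align_to_chunk(70968, 262144, 10) gives a head split of 712 sectors, an
aligned start of 71680 and no tail split, while
overdiscard_sectors(70968, 10) gives 312, which is the gap between the
two "dev start" values (70968 and 70656) in the log.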

There is another problem. The stripe size should be calculated
differently for near layout and far layout.

@Song, do you want me to use a separate patch for this fix, or fix this
in the original patch?

Merry Christmas
Xiao


[-- Attachment #2: fix-raid10-discard-patch --]
[-- Type: text/plain, Size: 2665 bytes --]

commit 0d74ac66ed0ec5af70296545e26044723a14657c
Author: Xiao Ni <xni@redhat.com>
Date:   Thu Dec 24 17:58:43 2020 +0800

    fix

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3153183b7772..92182cf40d22 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1604,6 +1604,7 @@ static int raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 	sector_t chunk;
 	unsigned int stripe_size;
 	sector_t split_size;
+	sector_t chunk_size = 1 << geo->chunk_shift;
 
 	sector_t bio_start, bio_end;
 	sector_t first_stripe_index, last_stripe_index;
@@ -1624,7 +1625,8 @@ static int raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
 		goto out;
 
-	stripe_size = geo->raid_disks << geo->chunk_shift;
+	stripe_size = geo->near_copies ? geo->near_copies << geo->chunk_shift:
+				geo->raid_disks << geo->chunk_shift;
 	bio_start = bio->bi_iter.bi_sector;
 	bio_end = bio_end_sector(bio);
 
@@ -1637,6 +1639,18 @@ static int raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 	if (bio_sectors(bio) < stripe_size*2)
 		goto out;
 
+	/* Keep the discard start/end address aligned with chunk size */
+	if (bio_start & geo->chunk_mask) {
+		split_size = (chunk_size - (bio_start & geo->chunk_mask));
+		bio = raid10_split_bio(conf, bio, split_size, false);
+	}
+	if (bio_end & geo->chunk_mask) {
+		split_size = bio_end & geo->chunk_mask;
+		bio = raid10_split_bio(conf, bio, split_size, true);
+	}
+	bio_start = bio->bi_iter.bi_sector;
+	bio_end = bio_end_sector(bio);
+
 	/* For far and far offset layout, if bio is not aligned with stripe size,
 	 * it splits the part that is not aligned with strip size.
 	 */
@@ -1664,8 +1678,8 @@ static int raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 	start_disk_index = sector_div(first_stripe_index, geo->raid_disks);
 	if (geo->far_offset)
 		first_stripe_index *= geo->far_copies;
-	start_disk_offset = (bio_start & geo->chunk_mask) +
-				(first_stripe_index << geo->chunk_shift);
+	/* Now the bio is aligned with chunk size */
+	start_disk_offset = first_stripe_index << geo->chunk_shift;
 
 	chunk = bio_end >> geo->chunk_shift;
 	chunk *= geo->near_copies;
@@ -1673,8 +1687,7 @@ static int raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 	end_disk_index = sector_div(last_stripe_index, geo->raid_disks);
 	if (geo->far_offset)
 		last_stripe_index *= geo->far_copies;
-	end_disk_offset = (bio_end & geo->chunk_mask) +
-				(last_stripe_index << geo->chunk_shift);
+	end_disk_offset = last_stripe_index << geo->chunk_shift;
 
 retry_discard:
 	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);


* Re: PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
  2020-12-24 10:18   ` Xiao Ni
@ 2020-12-27 21:57     ` Song Liu
  2021-02-02  3:42     ` Matthew Ruffell
  1 sibling, 0 replies; 8+ messages in thread
From: Song Liu @ 2020-12-27 21:57 UTC (permalink / raw)
  To: Xiao Ni
  Cc: Song Liu, Matthew Ruffell, linux-raid, lkml, Coly Li,
	Guoqing Jiang, khalid.elmously, Jay Vosburgh

Hi Xiao,

On Thu, Dec 24, 2020 at 2:18 AM Xiao Ni <xni@redhat.com> wrote:
>
>
>
[...]
>
> [  789.709501] discard bio start : 70968, size : 191176
> [  789.709507] first stripe index 69, start disk index 0, start disk offset 70968
> [  789.709509] last stripe index 256, end disk index 0, end disk offset 262144
> [  789.709511] disk 0, dev start : 70968, dev end : 262144
> [  789.709515] disk 1, dev start : 70656, dev end : 262144
>
> For example, in this test case, it has 2 near copies. The
> start_disk_offset for the first disk is 70968.
> It should use the same offset address for second disk. But it uses the
> start address of this chunk.
> It discard more region. The patch in the attachment can fix this
> problem. It split the region that
> doesn't align with chunk size.
>
> There is another problem. The stripe size should be calculated
> differently for near layout and far layout.
>
> @Song, do you want me to use a separate patch for this fix, or fix this
> in the original patch?

Please fold in the changes in the original patches and resend the whole
set.

Thanks,
Song

>
> Merry Christmas
> Xiao
>


* Re: PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
  2020-12-24 10:18   ` Xiao Ni
  2020-12-27 21:57     ` Song Liu
@ 2021-02-02  3:42     ` Matthew Ruffell
  2021-02-03  1:43       ` Xiao Ni
  1 sibling, 1 reply; 8+ messages in thread
From: Matthew Ruffell @ 2021-02-02  3:42 UTC (permalink / raw)
  To: Xiao Ni, Song Liu
  Cc: linux-raid, Song Liu, lkml, Coly Li, Guoqing Jiang,
	khalid.elmously, Jay Vosburgh

Hi Xiao,

On 24/12/20 11:18 pm, Xiao Ni wrote:
> The root cause is found. Now we use a similar way with raid0 to handle discard request
> for raid10. Because the discard region is very big, we can calculate the start/end address
> for each disk. Then we can submit the discard request to each disk. But for raid10, it has
> copies. For near layout, if the discard request doesn't align with chunk size, we calculate
> a start_disk_offset. Now we only use start_disk_offset for the first disk, but it should be
> used for the near copies disks too.

Thanks for finding the root cause and making a patch that corrects the offset
addresses for multiple disks!

> 
> [  789.709501] discard bio start : 70968, size : 191176
> [  789.709507] first stripe index 69, start disk index 0, start disk offset 70968
> [  789.709509] last stripe index 256, end disk index 0, end disk offset 262144
> [  789.709511] disk 0, dev start : 70968, dev end : 262144
> [  789.709515] disk 1, dev start : 70656, dev end : 262144
> 
> For example, in this test case, it has 2 near copies. The start_disk_offset for the first disk is 70968.
> It should use the same offset address for second disk. But it uses the start address of this chunk.
> It discard more region. The patch in the attachment can fix this problem. It split the region that
> doesn't align with chunk size.

Just wondering, what is the current status of the patchset? Is there anything
that I can do to help? 

> 
> There is another problem. The stripe size should be calculated differently for near layout and far layout.
> 

I can help review the patch and help test the patches anytime. Do you need help
with making a patch to calculate the stripe size for near and far layouts?

Let me know how you are going with this patchset, and if there is anything I
can do for you.

Thanks,
Matthew

> @Song, do you want me to use a separate patch for this fix, or fix this in the original patch?
> 
> Merry Christmas
> Xiao
> 


* Re: PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
  2021-02-02  3:42     ` Matthew Ruffell
@ 2021-02-03  1:43       ` Xiao Ni
  0 siblings, 0 replies; 8+ messages in thread
From: Xiao Ni @ 2021-02-03  1:43 UTC (permalink / raw)
  To: Matthew Ruffell, Song Liu
  Cc: linux-raid, Song Liu, lkml, Coly Li, Guoqing Jiang,
	khalid.elmously, Jay Vosburgh



On 02/02/2021 11:42 AM, Matthew Ruffell wrote:
> Hi Xiao,
>
> On 24/12/20 11:18 pm, Xiao Ni wrote:
>> The root cause is found. Now we use a similar way with raid0 to handle discard request
>> for raid10. Because the discard region is very big, we can calculate the start/end address
>> for each disk. Then we can submit the discard request to each disk. But for raid10, it has
>> copies. For near layout, if the discard request doesn't align with chunk size, we calculate
>> a start_disk_offset. Now we only use start_disk_offset for the first disk, but it should be
>> used for the near copies disks too.
> Thanks for finding the root cause and making a patch that corrects the offset
> addresses for multiple disks!
>
>> [  789.709501] discard bio start : 70968, size : 191176
>> [  789.709507] first stripe index 69, start disk index 0, start disk offset 70968
>> [  789.709509] last stripe index 256, end disk index 0, end disk offset 262144
>> [  789.709511] disk 0, dev start : 70968, dev end : 262144
>> [  789.709515] disk 1, dev start : 70656, dev end : 262144
>>
>> For example, in this test case, it has 2 near copies. The start_disk_offset for the first disk is 70968.
>> It should use the same offset address for second disk. But it uses the start address of this chunk.
>> It discard more region. The patch in the attachment can fix this problem. It split the region that
>> doesn't align with chunk size.
> Just wondering, what is the current status of the patchset? Is there anything
> that I can do to help?
>
>> There is another problem. The stripe size should be calculated differently for near layout and far layout.
>>
> I can help review the patch and help test the patches anytime. Do you need help
> with making a patch to calculate the stripe size for near and far layouts?
>
> Let me know how you are going with this patchset, and if there is anything I
> can do for you.
>
> Thanks,
> Matthew
>
Hi Matthew

I'm testing the new patch set now and will resend it soon. Thanks for
the help.

Regards
Xiao



