From: Song Liu <songliubraving@fb.com>
To: Matthew Ruffell <matthew.ruffell@canonical.com>,
	Xiao Ni <xni@redhat.com>
Cc: linux-raid <linux-raid@vger.kernel.org>,
	Song Liu <song@kernel.org>, lkml <linux-kernel@vger.kernel.org>,
	Coly Li <colyli@suse.de>,
	Guoqing Jiang <guoqing.jiang@cloud.ionos.com>,
	"khalid.elmously@canonical.com" <khalid.elmously@canonical.com>,
	Jay Vosburgh <jay.vosburgh@canonical.com>
Subject: Re: PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
Date: Wed, 9 Dec 2020 04:17:42 +0000	[thread overview]
Message-ID: <EA47EF7A-06D8-4B37-BED7-F04753D70DF5@fb.com> (raw)
In-Reply-To: <dbd2761e-cd7d-d60a-f769-ecc8c6335814@canonical.com>

Hi Matthew, 

> On Dec 8, 2020, at 7:46 PM, Matthew Ruffell <matthew.ruffell@canonical.com> wrote:
> 
> Hello,
> 
> I recently backported the following patches into the Ubuntu stable kernels:
> 
> md: add md_submit_discard_bio() for submitting discard bio
> md/raid10: extend r10bio devs to raid disks
> md/raid10: pull codes that wait for blocked dev into one function
> md/raid10: improve raid10 discard request
> md/raid10: improve discard request for far layout
> dm raid: fix discard limits for raid1 and raid10
> dm raid: remove unnecessary discard limits for raid10
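> 
> For reference, one way to confirm whether a given tree already carries this
> series (a sketch; adjust the range to the tree being checked):
> 
> $ git log --oneline v5.9.. --grep='raid10: improve raid10 discard request'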

Thanks for the report!

Hi Xiao, 

Could you please take a look at this and let me know soon? We need to fix 
this before the official 5.10 release. 

Thanks,
Song

> 
> and this morning, a user reported the following downstream bug:
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ 
> 
> Their weekly fstrim cronjob had run, and their raid10 array now has extensive
> data corruption.
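> 
> For reference, their exact cronjob was not shared; a typical weekly fstrim
> job looks something like this (hypothetical):
> 
> # cat /etc/cron.weekly/fstrim
> #!/bin/sh
> # Discard unused blocks on all mounted filesystems that support it.
> fstrim --all --verbose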
> 
> The issue is reproducible on the latest 5.10-rc7 mainline kernel; the steps
> are below.
> 
> I used an m5d.4xlarge instance on AWS to utilise 2x 300GB SSDs that support
> block discard. You will want to select small disks to lower the time needed
> to reproduce.
> 
> $ uname -rv
> 5.10.0-rc7+ #1 SMP Wed Dec 9 01:15:27 UTC 2020
> 
> Create a raid10 array, with LVM:
> 
> $ lsblk
> NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> nvme0n1     259:0    0     8G  0 disk 
> └─nvme0n1p1 259:1    0     8G  0 part /
> nvme1n1     259:2    0 279.4G  0 disk 
> nvme2n1     259:3    0 279.4G  0 disk
> 
> $ sudo -s
> # mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme1n1 /dev/nvme2n1
> mdadm: layout defaults to n2
> mdadm: layout defaults to n2
> mdadm: chunk size defaults to 512K
> mdadm: size set to 292836352K
> mdadm: automatically enabling write-intent bitmap on large array
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md0 started.
> # pvcreate -ff -y /dev/md0
>  Physical volume "/dev/md0" successfully created.
> # vgcreate -f -y VolGroup /dev/md0
>  Volume group "VolGroup" successfully created
> # lvcreate -n root -L 100G -ay -y VolGroup
>  Logical volume "root" created.
> # mkfs.ext4 /dev/VolGroup/root
> mke2fs 1.44.1 (24-Mar-2018)
> Discarding device blocks: done                            
> Creating filesystem with 26214400 4k blocks and 6553600 inodes
> Filesystem UUID: d7be2e14-fa4d-4489-884b-3bef63b1e1db
> Superblock backups stored on blocks: 
> 	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
> 	4096000, 7962624, 11239424, 20480000, 23887872
> 
> Allocating group tables: done                            
> Writing inode tables: done                            
> Creating journal (131072 blocks): done
> Writing superblocks and filesystem accounting information: done
> # mount /dev/VolGroup/root /mnt
> 
> Next, wait for the initial resync to complete, which takes about 25 minutes
> on an m5d.4xlarge instance.
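> 
> Rather than polling /proc/mdstat by hand, you can block until the resync
> finishes (a convenience, not part of the original steps):
> 
> # mdadm --wait /dev/md0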
> 
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme2n1[1] nvme1n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      [==>..................]  resync = 12.0% (35211392/292836352) finish=21.4min speed=200340K/sec
>      bitmap: 3/3 pages [12KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 76918016
> 
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme2n1[1] nvme1n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      bitmap: 0/3 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 582330240
> 
> Now that the resync is complete, create a file, sync it, and delete it:
> 
> # dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
> 1048576+0 records in
> 1048576+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.95974 s, 1.1 GB/s
> # sync
> # rm /mnt/data.raw
> 
> Perform a check:
> 
> # echo check > /sys/block/md0/md/sync_action
> 
> Again, wait 25 minutes for it to complete:
> 
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      [==>..................]  check = 13.7% (40356224/292836352) finish=20.8min speed=201707K/sec
>      bitmap: 0/3 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 1469696
> 
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      bitmap: 0/3 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 1469696
> 
> Now, perform the fstrim:
> 
> # fstrim /mnt --verbose
> /mnt: 97.9 GiB (105089236992 bytes) trimmed
> 
> Go for another check:
> 
> # echo check >/sys/block/md0/md/sync_action
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      [========>............]  check = 40.3% (118270848/292836352) finish=14.4min speed=200963K/sec
>      bitmap: 0/3 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 205324928
> 
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      bitmap: 0/3 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 205324928
> 
> Now, we need to take the raid10 array down and run fsck against each disk
> individually:
> 
> # umount /mnt
> # vgchange -a n /dev/VolGroup
>  0 logical volume(s) in volume group "VolGroup" now active
> # mdadm --stop /dev/md0
> mdadm: stopped /dev/md0
> 
> Let's do the first disk:
> 
> # mdadm --assemble /dev/md127 /dev/nvme1n1 
> mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
> # mdadm --run /dev/md127
> mdadm: started array /dev/md/lv-raid
> # vgchange -a y /dev/VolGroup
>  1 logical volume(s) in volume group "VolGroup" now active
> # fsck.ext4 -n -f /dev/VolGroup/root
> e2fsck 1.44.1 (24-Mar-2018)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks
> # vgchange -a n /dev/VolGroup
>  0 logical volume(s) in volume group "VolGroup" now active
> # mdadm --stop /dev/md127
> mdadm: stopped /dev/md127
> 
> The second disk:
> 
> # mdadm --assemble /dev/md127 /dev/nvme2n1
> mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
> # mdadm --run /dev/md127
> mdadm: started array /dev/md/lv-raid
> # vgchange -a y /dev/VolGroup
>  1 logical volume(s) in volume group "VolGroup" now active
> # fsck.ext4 -n -f /dev/VolGroup/root
> e2fsck 1.44.1 (24-Mar-2018)
> Resize inode not valid.  Recreate? no
> 
> Pass 1: Checking inodes, blocks, and sizes
> Inode 7 has illegal block(s).  Clear? no
> 
> Illegal indirect block (1714656753) in inode 7.  IGNORED.
> Error while iterating over blocks in inode 7: Illegal indirect block found
> 
> /dev/VolGroup/root: ********** WARNING: Filesystem still has errors **********
> 
> e2fsck: aborted
> 
> /dev/VolGroup/root: ********** WARNING: Filesystem still has errors **********
> 
> # vgchange -a n /dev/VolGroup
>  0 logical volume(s) in volume group "VolGroup" now active
> # mdadm --stop /dev/md127
> mdadm: stopped /dev/md127
> 
> There are no panics or other errors in dmesg. The directory structure of the
> first disk is intact, but the second disk only has lost+found present.
> 
> I can confirm the corruption is caused by the patches listed at the top of
> this email, but I have not yet had an opportunity to bisect down to the exact
> commit. I will do that once we confirm which Ubuntu stable kernels are
> affected and begin reverting the patches.
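> 
> When I do bisect, it will look roughly like this (a sketch; v5.8 as the
> last known good tag is only an assumption and still needs to be verified):
> 
> $ git bisect start
> $ git bisect bad v5.10-rc7
> $ git bisect good v5.8
> 
> then build and boot each kernel git checks out, run the fstrim reproducer
> above, and mark the result with "git bisect good" or "git bisect bad".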
> 
> Let me know if you need any more details.
> 
> Thanks,
> Matthew Ruffell


