From: Song Liu <songliubraving@fb.com>
To: Matthew Ruffell <matthew.ruffell@canonical.com>,
	Xiao Ni <xni@redhat.com>
Cc: linux-raid <linux-raid@vger.kernel.org>,
	Song Liu <song@kernel.org>, lkml <linux-kernel@vger.kernel.org>,
	Coly Li <colyli@suse.de>,
	Guoqing Jiang <guoqing.jiang@cloud.ionos.com>,
	"khalid.elmously@canonical.com" <khalid.elmously@canonical.com>,
	Jay Vosburgh <jay.vosburgh@canonical.com>
Subject: Re: PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
Date: Wed, 9 Dec 2020 04:17:42 +0000
Message-ID: <EA47EF7A-06D8-4B37-BED7-F04753D70DF5@fb.com>
In-Reply-To: <dbd2761e-cd7d-d60a-f769-ecc8c6335814@canonical.com>

Hi Matthew, 

> On Dec 8, 2020, at 7:46 PM, Matthew Ruffell <matthew.ruffell@canonical.com> wrote:
> 
> Hello,
> 
> I recently backported the following patches into the Ubuntu stable kernels:
> 
> md: add md_submit_discard_bio() for submitting discard bio
> md/raid10: extend r10bio devs to raid disks
> md/raid10: pull codes that wait for blocked dev into one function
> md/raid10: improve raid10 discard request
> md/raid10: improve discard request for far layout
> dm raid: fix discard limits for raid1 and raid10
> dm raid: remove unnecessary discard limits for raid10

Thanks for the report!

Hi Xiao, 

Could you please take a look at this and let me know soon? We need to fix 
this before the 5.10 official release. 

Thanks,
Song

> 
> and this morning, a user reported the following downstream bug:
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ 
> 
> Their weekly cron job that runs fstrim had run, and their raid10 array now has
> extensive data corruption.
> 
> The issue is reproducible on the latest 5.10-rc7 mainline kernel; steps to
> reproduce are below.
> 
> I used an m5d.4xlarge instance on AWS to utilise 2x 300GB SSDs that support
> block discard. You will want to select small disks to lower the time needed
> to reproduce.
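
Before starting, it is also worth confirming that the chosen disks actually
advertise discard support; a quick check, assuming the same nvme devices shown
in the listing below (non-zero DISC-GRAN and DISC-MAX columns indicate that
discard is supported):

  # lsblk --discard /dev/nvme1n1 /dev/nvme2n1
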
> 
> $ uname -rv
> 5.10.0-rc7+ #1 SMP Wed Dec 9 01:15:27 UTC 2020
> 
> Create a raid10 array with LVM on top:
> 
> $ lsblk
> NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> nvme0n1     259:0    0     8G  0 disk 
> └─nvme0n1p1 259:1    0     8G  0 part /
> nvme1n1     259:2    0 279.4G  0 disk 
> nvme2n1     259:3    0 279.4G  0 disk
> 
> $ sudo -s
> # mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme1n1 /dev/nvme2n1
> mdadm: layout defaults to n2
> mdadm: layout defaults to n2
> mdadm: chunk size defaults to 512K
> mdadm: size set to 292836352K
> mdadm: automatically enabling write-intent bitmap on large array
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md0 started.
> # pvcreate -ff -y /dev/md0
>  Physical volume "/dev/md0" successfully created.
> # vgcreate -f -y VolGroup /dev/md0
>  Volume group "VolGroup" successfully created
> # lvcreate -n root -L 100G -ay -y VolGroup
>  Logical volume "root" created.
> # mkfs.ext4 /dev/VolGroup/root
> mke2fs 1.44.1 (24-Mar-2018)
> Discarding device blocks: done                            
> Creating filesystem with 26214400 4k blocks and 6553600 inodes
> Filesystem UUID: d7be2e14-fa4d-4489-884b-3bef63b1e1db
> Superblock backups stored on blocks: 
> 	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
> 	4096000, 7962624, 11239424, 20480000, 23887872
> 
> Allocating group tables: done                            
> Writing inode tables: done                            
> Creating journal (131072 blocks): done
> Writing superblocks and filesystem accounting information: done
> # mount /dev/VolGroup/root /mnt
> 
> Next, wait for the disk check to complete; this takes about 25 minutes on an
> m5d.4xlarge instance.
> 
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme2n1[1] nvme1n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      [==>..................]  resync = 12.0% (35211392/292836352) finish=21.4min speed=200340K/sec
>      bitmap: 3/3 pages [12KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 76918016
> 
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme2n1[1] nvme1n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      bitmap: 0/3 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 582330240
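
Rather than watching /proc/mdstat by hand, the wait for a resync or check can
be scripted; a rough sketch, assuming the same /dev/md0 array as above
(sync_action reads "idle" once the activity has finished):

  # while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do sleep 60; done
  # cat /sys/block/md0/md/mismatch_cnt
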
> 
> Now that the check is complete, create a file, sync and delete it:
> 
> # dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
> 1048576+0 records in
> 1048576+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.95974 s, 1.1 GB/s
> # sync
> # rm /mnt/data.raw
> 
> Perform a check:
> 
> # echo check > /sys/block/md0/md/sync_action
> 
> Again, wait 25 minutes for it to complete:
> 
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      [==>..................]  check = 13.7% (40356224/292836352) finish=20.8min speed=201707K/sec
>      bitmap: 0/3 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 1469696
> 
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      bitmap: 0/3 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 1469696
> 
> Now, perform the fstrim:
> 
> # fstrim /mnt --verbose
> /mnt: 97.9 GiB (105089236992 bytes) trimmed
> 
> Go for another check:
> 
> # echo check >/sys/block/md0/md/sync_action
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      [========>............]  check = 40.3% (118270848/292836352) finish=14.4min speed=200963K/sec
>      bitmap: 0/3 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 205324928
> 
> # cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md0 : active raid10 nvme1n1[1] nvme2n1[0]
>      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
>      bitmap: 0/3 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> # cat /sys/block/md0/md/mismatch_cnt
> 205324928
> 
> Now, we need to take the raid10 array down and perform an fsck on one disk at
> a time:
> 
> # umount /mnt
> # vgchange -a n /dev/VolGroup
>  0 logical volume(s) in volume group "VolGroup" now active
> # mdadm --stop /dev/md0
> mdadm: stopped /dev/md0
> 
> Let's do the first disk:
> 
> # mdadm --assemble /dev/md127 /dev/nvme1n1 
> mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
> # mdadm --run /dev/md127
> mdadm: started array /dev/md/lv-raid
> # vgchange -a y /dev/VolGroup
>  1 logical volume(s) in volume group "VolGroup" now active
> # fsck.ext4 -n -f /dev/VolGroup/root
> e2fsck 1.44.1 (24-Mar-2018)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks
> # vgchange -a n /dev/VolGroup
>  0 logical volume(s) in volume group "VolGroup" now active
> # mdadm --stop /dev/md127
> mdadm: stopped /dev/md127
> 
> The second disk:
> 
> # mdadm --assemble /dev/md127 /dev/nvme2n1
> mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
> # mdadm --run /dev/md127
> mdadm: started array /dev/md/lv-raid
> # vgchange -a y /dev/VolGroup
>  1 logical volume(s) in volume group "VolGroup" now active
> # fsck.ext4 -n -f /dev/VolGroup/root
> e2fsck 1.44.1 (24-Mar-2018)
> Resize inode not valid.  Recreate? no
> 
> Pass 1: Checking inodes, blocks, and sizes
> Inode 7 has illegal block(s).  Clear? no
> 
> Illegal indirect block (1714656753) in inode 7.  IGNORED.
> Error while iterating over blocks in inode 7: Illegal indirect block found
> 
> /dev/VolGroup/root: ********** WARNING: Filesystem still has errors **********
> 
> e2fsck: aborted
> 
> /dev/VolGroup/root: ********** WARNING: Filesystem still has errors **********
> 
> # vgchange -a n /dev/VolGroup
>  0 logical volume(s) in volume group "VolGroup" now active
> # mdadm --stop /dev/md127
> mdadm: stopped /dev/md127
> 
> There are no panics or anything in dmesg. The directory structure of the first
> disk is intact, but the second disk only has Lost+Found present.
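
For repeated runs, the per-disk assemble/fsck sequence above can be collapsed
into a loop; a rough sketch, assuming the same device and volume group names as
in the transcript above (fsck.ext4 is run read-only with -n, so neither member
is modified):

  for dev in /dev/nvme1n1 /dev/nvme2n1; do
      mdadm --assemble /dev/md127 "$dev"   # creates the array but does not start it
      mdadm --run /dev/md127               # start the array with only this member
      vgchange -a y /dev/VolGroup
      fsck.ext4 -n -f /dev/VolGroup/root
      vgchange -a n /dev/VolGroup
      mdadm --stop /dev/md127
  done
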
> 
> I can confirm the corruption is caused by the patches listed at the top of this
> email, but I have not had an opportunity to bisect to find the exact root cause.
> I will do that once we confirm which Ubuntu stable kernels are affected and
> begin reverting the patches.
> 
> Let me know if you need any more details.
> 
> Thanks,
> Matthew Ruffell


Thread overview: 8+ messages
2020-12-09  3:46 PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim Matthew Ruffell
2020-12-09  4:17 ` Song Liu [this message]
2020-12-09 22:04   ` Song Liu
2020-12-10  1:35   ` Xiao Ni
2020-12-24 10:18   ` Xiao Ni
2020-12-27 21:57     ` Song Liu
2021-02-02  3:42     ` Matthew Ruffell
2021-02-03  1:43       ` Xiao Ni
