All of lore.kernel.org
 help / color / mirror / Atom feed
From: Boris Burkov <boris@bur.io>
To: Anand Jain <anand.jain@oracle.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com, fstests@vger.kernel.org
Subject: Re: [PATCH v3] btrfs: new test for devt change between mounts
Date: Mon, 11 Mar 2024 12:06:17 -0700	[thread overview]
Message-ID: <20240311190617.GA2738765@zen.localdomain> (raw)
In-Reply-To: <5997ae6b-1214-4476-a75a-183e485b50b7@oracle.com>

On Mon, Mar 11, 2024 at 10:14:15PM +0530, Anand Jain wrote:
> On 3/8/24 23:13, Boris Burkov wrote:
> > It is possible to confuse the btrfs device cache (fs_devices) by
> > starting with a multi-device filesystem, then removing and re-adding a
> > device in a way which changes its dev_t while the filesystem is
> > unmounted. After this procedure, if we remount, then we are in a funny
> > state where struct btrfs_device's "devt" field does not match the bd_dev
> > of the "bdev" field. I would say this is bad enough, as we have violated
> > a pretty clear invariant.
> > 
> > But for style points, we can then remove the extra device from the fs,
> > making it a single device fs, which enables the "temp_fsid" feature,
> > which permits multiple separate mounts of different devices with the
> > same fsid. Since btrfs is confused and *thinks* there are different
> > devices (based on device->devt), it allows a second redundant mount of
> > the same device (not a bind mount!). This then allows us to corrupt the
> > original mount by doing stuff to the one that should be a bind mount.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > Changelog:
> > v3:
> > - fstests convention improvements (helpers, output, comments, etc...)
> > v2:
> > - fix numerous fundamental issues, v1 wasn't really ready
> > 
> >   common/config       |   1 +
> >   tests/btrfs/311     | 105 ++++++++++++++++++++++++++++++++++++++++++++
> >   tests/btrfs/311.out |   2 +
> >   3 files changed, 108 insertions(+)
> >   create mode 100755 tests/btrfs/311
> >   create mode 100644 tests/btrfs/311.out
> > 
> > diff --git a/common/config b/common/config
> > index a3b15b96f..43b517fda 100644
> > --- a/common/config
> > +++ b/common/config
> > @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)"
> >   export GZIP_PROG="$(type -P gzip)"
> >   export BTRFS_IMAGE_PROG="$(type -P btrfs-image)"
> >   export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
> > +export PARTED_PROG="$(type -P parted)"
> >   # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled.
> >   # newer systems have udevadm command but older systems like RHEL5 don't.
> > diff --git a/tests/btrfs/311 b/tests/btrfs/311
> > new file mode 100755
> > index 000000000..a7fa541c4
> > --- /dev/null
> > +++ b/tests/btrfs/311
> > @@ -0,0 +1,105 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (C) 2024 Meta, Inc. All Rights Reserved.
> > +#
> > +# FS QA Test 311
> > +#
> > +# Test an edge case of multi device volume management in btrfs.
> > +# If a device changes devt between mounts of a multi device fs, we can trick
> > +# btrfs into mounting the same device twice fully (not as a bind mount). From
> > +# there, it is trivial to induce corruption.
> > +#
> > +. ./common/preamble
> > +_begin_fstest auto quick volume scrub
> 
> Please add tmpfsid as well, because this test is about when not to activate
> the tmpfsid.
> 
> > +
> > +# real QA test starts here
> > +_supported_fs btrfs
> > +_require_test
> > +_require_command "$PARTED_PROG" parted
> > +_require_batched_discard "$TEST_DIR"
> > +
> 
> _fixed_by_kernel_commit XXXXXXXXXXXX \
> 		btrfs: validate device maj:min during open
> 
> > +_cleanup() {
> > +	cd /
> > +	$UMOUNT_PROG $MNT
> > +	$UMOUNT_PROG $BIND
> > +	losetup -d $DEV0
> > +	losetup -d $DEV1
> > +	losetup -d $DEV2
> > +	rm $IMG0
> > +	rm $IMG1
> > +	rm $IMG2
> > +}
> > +
> > +IMG0=$TEST_DIR/$$.img0
> > +IMG1=$TEST_DIR/$$.img1
> > +IMG2=$TEST_DIR/$$.img2
> > +truncate -s 1G $IMG0
> > +truncate -s 1G $IMG1
> > +truncate -s 1G $IMG2
> > +DEV0=$(losetup -f $IMG0 --show)
> > +DEV1=$(losetup -f $IMG1 --show)
> > +DEV2=$(losetup -f $IMG2 --show)
> > +D0P1=$DEV0"p1"
> > +D1P1=$DEV1"p1"
> > +MNT=$TEST_DIR/mnt
> > +BIND=$TEST_DIR/bind
> > +
> > +# Setup partition table with one partition on each device.
> > +$PARTED_PROG $DEV0 'mktable gpt' --script
> > +$PARTED_PROG $DEV1 'mktable gpt' --script
> > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> > +
> > +# mkfs with two devices to avoid clearing devices on close
> > +# single raid to allow removing DEV2.
> > +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 >>$seqres.full 2>&1 || _fail "failed to mkfs.btrfs"
> 
> Error out is already sufficient here.
> 

I don't understand, but I think Filipe asked for this one.

> > +
> > +# Cycle mount the two device fs to populate both devices into the
> > +# stale device cache.
> > +mkdir -p $MNT
> > +_mount $D0P1 $MNT
> > +$UMOUNT_PROG $MNT
> > +
> > +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date.
> > +$PARTED_PROG $DEV0 'rm 1' --script
> > +$PARTED_PROG $DEV1 'rm 1' --script
> > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script
> > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script
> > +
> > +# Mount with mismatched dev_t!
> > +_mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point"
> > +
> 
> On a system where the kernel bug is fixed, the mount is expected to pass.
> 
> On a system without the kernel bug fix, the mount is still expected to pass.
> 
> However, the failure message indicates that it is advisable to fail the
> mount in this scenario.
> 
> I believe this will be the case once the btrfs-progs patch below is
> integrated:
> 
> [PATCH 0/2] btrfs-progs: forget removed devices
> 
> So, we need to update the test case logic based on whether the above
> btrfs-progs patch is integrated.
> 
> 

The point of this was just to do our best to avoid doing a bunch of
writes/scrubbing/trimming on the fs under all the loopdevs and stuff in
an earlier version outside fstests. It would stink to run the script
without set -e and blow up your root drive or whatever.

With that said, this is running in TEST_DIR, so I think we can forget
about the precaution now.

> > +# Remove the extra device to bring temp-fsid back in the fray.
> > +$BTRFS_UTIL_PROG device remove $DEV2 $MNT
> > +
> > +# Create the would be bind mount.
> > +mkdir -p $BIND
> > +_mount $D0P1 $BIND
> > +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT)
> > +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND)
> > +# If they're different, we are in trouble.
> > +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show"
> > +
> 
> 
> > +# Now really prove it by corrupting the first mount with the second.
> > +for i in $(seq 20); do
> > +	$XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >>$seqres.full 2>&1
> > +done
> > +for i in $(seq 20); do
> > +	$XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >>$seqres.full 2>&1
> > +done
> > +
> 
> > +# sync so that we really write the large file data out to the shared device
> > +sync
> > +
> > +# now delete from one view of the shared device
> > +find $BIND -type f -delete
> > +# sync so that fstrim definitely has deleted data to trim
> > +sync
> > +# This should blow up both mounts, if the writes somehow didn't overlap at all.
> > +$FSTRIM_PROG $BIND
> > +# drop caches to improve the odds we read from the corrupted device while scrubbing.
> > +echo 3 > /proc/sys/vm/drop_caches
> > +$BTRFS_UTIL_PROG scrub start -B $MNT | grep "Error summary:"
> > +
> 
> 
> The rest appears to be fine.
> 
> Question: Why didn't you choose a cp --reflink=always across $MNT and $BIND
> to prove how kernel think about $MNT and $BIND.

Didn't occur to me. Who knows, maybe one day cross-fs reflink will work ;)

Demonstrating a corruption felt "elemental" to me.

> 
> Thanks, Anand
> 
> > +status=0
> > +exit
> > diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out
> > new file mode 100644
> > index 000000000..70a6db809
> > --- /dev/null
> > +++ b/tests/btrfs/311.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 311
> > +Error summary:    no errors found
> 

      reply	other threads:[~2024-03-11 19:05 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-08 17:43 [PATCH v3] btrfs: new test for devt change between mounts Boris Burkov
2024-03-10 17:16 ` Filipe Manana
2024-03-11 16:44 ` Anand Jain
2024-03-11 19:06   ` Boris Burkov [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240311190617.GA2738765@zen.localdomain \
    --to=boris@bur.io \
    --cc=anand.jain@oracle.com \
    --cc=fstests@vger.kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.