linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] btrfs: fix filesystem corruption after a device replace
@ 2020-09-23 14:30 fdmanana
  2020-09-23 14:43 ` Nikolay Borisov
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: fdmanana @ 2020-09-23 14:30 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We use a device's allocation state tree to track ranges in a device used
for allocated chunks, and we set ranges in this tree when allocating a new
chunk. However after a device replace operation, we were not setting the
allocated ranges in the new device's allocation state tree, so that tree
is empty after a device replace.

This means that a fitrim operation after a device replace will trim the
device ranges that have allocated chunks and extents, as we trim every
range for which there is not a range marked in the device's allocation
state tree. It is also important during chunk allocation, since the
device's allocation state is used to determine if a range is already
allocated when allocating a new chunk.

This is trivial to reproduce and the following script triggers the bug:

  $ cat reproducer.sh
  #!/bin/bash

  DEV1="/dev/sdg"
  DEV2="/dev/sdh"
  DEV3="/dev/sdi"

  wipefs -a $DEV1 $DEV2 $DEV3 &> /dev/null

  # Create a raid1 test fs on 2 devices.
  mkfs.btrfs -f -m raid1 -d raid1 $DEV1 $DEV2 > /dev/null
  mount $DEV1 /mnt/btrfs

  xfs_io -f -c "pwrite -S 0xab 0 10M" /mnt/btrfs/foo

  echo "Starting to replace $DEV1 with $DEV3"
  btrfs replace start -B $DEV1 $DEV3 /mnt/btrfs
  echo

  echo "Running fstrim"
  fstrim /mnt/btrfs
  echo

  echo "Unmounting filesystem"
  umount /mnt/btrfs

  echo "Mounting filesystem in degraded mode using $DEV3 only"
  wipefs -a $DEV1 $DEV2 &> /dev/null
  mount -o degraded $DEV3 /mnt/btrfs
  if [ $? -ne 0 ]; then
          dmesg | tail
          echo
          echo "Failed to mount in degraded mode"
          exit 1
  fi

  echo
  echo "File foo data (expected all bytes = 0xab):"
  od -A d -t x1 /mnt/btrfs/foo

  umount /mnt/btrfs

When running the reproducer:

  $ ./replace-test.sh
  wrote 10485760/10485760 bytes at offset 0
  10 MiB, 2560 ops; 0.0901 sec (110.877 MiB/sec and 28384.5216 ops/sec)
  Starting to replace /dev/sdg with /dev/sdi

  Running fstrim

  Unmounting filesystem
  Mounting filesystem in degraded mode using /dev/sdi only
  mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/sdi, missing codepage or helper program, or other error.
  [19581.748641] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi started
  [19581.803842] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi finished
  [19582.208293] BTRFS info (device sdi): allowing degraded mounts
  [19582.208298] BTRFS info (device sdi): disk space caching is enabled
  [19582.208301] BTRFS info (device sdi): has skinny extents
  [19582.212853] BTRFS warning (device sdi): devid 2 uuid 1f731f47-e1bb-4f00-bfbb-9e5a0cb4ba9f is missing
  [19582.213904] btree_readpage_end_io_hook: 25839 callbacks suppressed
  [19582.213907] BTRFS error (device sdi): bad tree block start, want 30490624 have 0
  [19582.214780] BTRFS warning (device sdi): failed to read root (objectid=7): -5
  [19582.231576] BTRFS error (device sdi): open_ctree failed

  Failed to mount in degraded mode

So fix by setting all allocated ranges in the replace target device when
the replace operation is finishing, when we are holding the chunk mutex
and we can not race with new chunk allocations.

A test case for fstests follows soon.

Fixes: 1c11b63eff2a67 ("btrfs: replace pending/pinned chunks lists with io tree")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/dev-replace.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 119721eeecf6..20ce1970015f 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -619,6 +619,37 @@ static void btrfs_dev_replace_update_device_in_mapping_tree(
 	write_unlock(&em_tree->lock);
 }
 
+/*
+ * When finishing the device replace, before swapping the source device with the
+ * target device we must update the chunk allocation state in the target device,
+ * as it is empty because replace works by directly copying the chunks and not
+ * through the normal chunk allocation path.
+ */
+static int btrfs_set_target_alloc_state(struct btrfs_device *srcdev,
+					struct btrfs_device *tgtdev)
+{
+	struct extent_state *cached_state = NULL;
+	u64 start = 0;
+	u64 found_start;
+	u64 found_end;
+	int ret = 0;
+
+	lockdep_assert_held(&srcdev->fs_info->chunk_mutex);
+
+	while (!find_first_extent_bit(&srcdev->alloc_state, start,
+				      &found_start, &found_end,
+				      CHUNK_ALLOCATED, &cached_state)) {
+		ret = set_extent_bits(&tgtdev->alloc_state, found_start,
+				      found_end, CHUNK_ALLOCATED);
+		if (ret)
+			break;
+		start = found_end + 1;
+	}
+
+	free_extent_state(cached_state);
+	return ret;
+}
+
 static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 				       int scrub_ret)
 {
@@ -693,8 +724,14 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	dev_replace->time_stopped = ktime_get_real_seconds();
 	dev_replace->item_needs_writeback = 1;
 
-	/* replace old device with new one in mapping tree */
+	/*
+	 * Update allocation state in the new device and replace the old device
+	 * with the new one in the mapping tree.
+	 */
 	if (!scrub_ret) {
+		scrub_ret = btrfs_set_target_alloc_state(src_device, tgt_device);
+		if (scrub_ret)
+			goto error;
 		btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
 								src_device,
 								tgt_device);
@@ -705,6 +742,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 				 btrfs_dev_name(src_device),
 				 src_device->devid,
 				 rcu_str_deref(tgt_device->name), scrub_ret);
+error:
 		up_write(&dev_replace->rwsem);
 		mutex_unlock(&fs_info->chunk_mutex);
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: fix filesystem corruption after a device replace
  2020-09-23 14:30 [PATCH] btrfs: fix filesystem corruption after a device replace fdmanana
@ 2020-09-23 14:43 ` Nikolay Borisov
  2020-09-23 17:56 ` Johannes Thumshirn
  2020-09-30 16:31 ` David Sterba
  2 siblings, 0 replies; 4+ messages in thread
From: Nikolay Borisov @ 2020-09-23 14:43 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 23.09.20 г. 17:30 ч., fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We use a device's allocation state tree to track ranges in a device used
> for allocated chunks, and we set ranges in this tree when allocating a new
> chunk. However after a device replace operation, we were not setting the
> allocated ranges in the new device's allocation state tree, so that tree
> is empty after a device replace.
> 
> This means that a fitrim operation after a device replace will trim the
> device ranges that have allocated chunks and extents, as we trim every
> range for which there is not a range marked in the device's allocation
> state tree. It is also important during chunk allocation, since the
> device's allocation state is used to determine if a range is already
> allocated when allocating a new chunk.
> 
> This is trivial to reproduce and the following script triggers the bug:
> 
>   $ cat reproducer.sh
>   #!/bin/bash
> 
>   DEV1="/dev/sdg"
>   DEV2="/dev/sdh"
>   DEV3="/dev/sdi"
> 
>   wipefs -a $DEV1 $DEV2 $DEV3 &> /dev/null
> 
>   # Create a raid1 test fs on 2 devices.
>   mkfs.btrfs -f -m raid1 -d raid1 $DEV1 $DEV2 > /dev/null
>   mount $DEV1 /mnt/btrfs
> 
>   xfs_io -f -c "pwrite -S 0xab 0 10M" /mnt/btrfs/foo
> 
>   echo "Starting to replace $DEV1 with $DEV3"
>   btrfs replace start -B $DEV1 $DEV3 /mnt/btrfs
>   echo
> 
>   echo "Running fstrim"
>   fstrim /mnt/btrfs
>   echo
> 
>   echo "Unmounting filesystem"
>   umount /mnt/btrfs
> 
>   echo "Mounting filesystem in degraded mode using $DEV3 only"
>   wipefs -a $DEV1 $DEV2 &> /dev/null
>   mount -o degraded $DEV3 /mnt/btrfs
>   if [ $? -ne 0 ]; then
>           dmesg | tail
>           echo
>           echo "Failed to mount in degraded mode"
>           exit 1
>   fi
> 
>   echo
>   echo "File foo data (expected all bytes = 0xab):"
>   od -A d -t x1 /mnt/btrfs/foo
> 
>   umount /mnt/btrfs
> 
> When running the reproducer:
> 
>   $ ./replace-test.sh
>   wrote 10485760/10485760 bytes at offset 0
>   10 MiB, 2560 ops; 0.0901 sec (110.877 MiB/sec and 28384.5216 ops/sec)
>   Starting to replace /dev/sdg with /dev/sdi
> 
>   Running fstrim
> 
>   Unmounting filesystem
>   Mounting filesystem in degraded mode using /dev/sdi only
>   mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/sdi, missing codepage or helper program, or other error.
>   [19581.748641] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi started
>   [19581.803842] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi finished
>   [19582.208293] BTRFS info (device sdi): allowing degraded mounts
>   [19582.208298] BTRFS info (device sdi): disk space caching is enabled
>   [19582.208301] BTRFS info (device sdi): has skinny extents
>   [19582.212853] BTRFS warning (device sdi): devid 2 uuid 1f731f47-e1bb-4f00-bfbb-9e5a0cb4ba9f is missing
>   [19582.213904] btree_readpage_end_io_hook: 25839 callbacks suppressed
>   [19582.213907] BTRFS error (device sdi): bad tree block start, want 30490624 have 0
>   [19582.214780] BTRFS warning (device sdi): failed to read root (objectid=7): -5
>   [19582.231576] BTRFS error (device sdi): open_ctree failed
> 
>   Failed to mount in degraded mode
> 
> So fix by setting all allocated ranges in the replace target device when
> the replace operation is finishing, when we are holding the chunk mutex
> and we can not race with new chunk allocations.
> 
> A test case for fstests follows soon.
> 
> Fixes: 1c11b63eff2a67 ("btrfs: replace pending/pinned chunks lists with io tree")
> CC: stable@vger.kernel.org # 5.2+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---


Good catch!

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: fix filesystem corruption after a device replace
  2020-09-23 14:30 [PATCH] btrfs: fix filesystem corruption after a device replace fdmanana
  2020-09-23 14:43 ` Nikolay Borisov
@ 2020-09-23 17:56 ` Johannes Thumshirn
  2020-09-30 16:31 ` David Sterba
  2 siblings, 0 replies; 4+ messages in thread
From: Johannes Thumshirn @ 2020-09-23 17:56 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: fix filesystem corruption after a device replace
  2020-09-23 14:30 [PATCH] btrfs: fix filesystem corruption after a device replace fdmanana
  2020-09-23 14:43 ` Nikolay Borisov
  2020-09-23 17:56 ` Johannes Thumshirn
@ 2020-09-30 16:31 ` David Sterba
  2 siblings, 0 replies; 4+ messages in thread
From: David Sterba @ 2020-09-30 16:31 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Sep 23, 2020 at 03:30:16PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We use a device's allocation state tree to track ranges in a device used
> for allocated chunks, and we set ranges in this tree when allocating a new
> chunk. However after a device replace operation, we were not setting the
> allocated ranges in the new device's allocation state tree, so that tree
> is empty after a device replace.
> 
> This means that a fitrim operation after a device replace will trim the
> device ranges that have allocated chunks and extents, as we trim every
> range for which there is not a range marked in the device's allocation
> state tree. It is also important during chunk allocation, since the
> device's allocation state is used to determine if a range is already
> allocated when allocating a new chunk.
> 
> This is trivial to reproduce and the following script triggers the bug:
> 
>   $ cat reproducer.sh
>   #!/bin/bash
> 
>   DEV1="/dev/sdg"
>   DEV2="/dev/sdh"
>   DEV3="/dev/sdi"
> 
>   wipefs -a $DEV1 $DEV2 $DEV3 &> /dev/null
> 
>   # Create a raid1 test fs on 2 devices.
>   mkfs.btrfs -f -m raid1 -d raid1 $DEV1 $DEV2 > /dev/null
>   mount $DEV1 /mnt/btrfs
> 
>   xfs_io -f -c "pwrite -S 0xab 0 10M" /mnt/btrfs/foo
> 
>   echo "Starting to replace $DEV1 with $DEV3"
>   btrfs replace start -B $DEV1 $DEV3 /mnt/btrfs
>   echo
> 
>   echo "Running fstrim"
>   fstrim /mnt/btrfs
>   echo
> 
>   echo "Unmounting filesystem"
>   umount /mnt/btrfs
> 
>   echo "Mounting filesystem in degraded mode using $DEV3 only"
>   wipefs -a $DEV1 $DEV2 &> /dev/null
>   mount -o degraded $DEV3 /mnt/btrfs
>   if [ $? -ne 0 ]; then
>           dmesg | tail
>           echo
>           echo "Failed to mount in degraded mode"
>           exit 1
>   fi
> 
>   echo
>   echo "File foo data (expected all bytes = 0xab):"
>   od -A d -t x1 /mnt/btrfs/foo
> 
>   umount /mnt/btrfs
> 
> When running the reproducer:
> 
>   $ ./replace-test.sh
>   wrote 10485760/10485760 bytes at offset 0
>   10 MiB, 2560 ops; 0.0901 sec (110.877 MiB/sec and 28384.5216 ops/sec)
>   Starting to replace /dev/sdg with /dev/sdi
> 
>   Running fstrim
> 
>   Unmounting filesystem
>   Mounting filesystem in degraded mode using /dev/sdi only
>   mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/sdi, missing codepage or helper program, or other error.
>   [19581.748641] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi started
>   [19581.803842] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi finished
>   [19582.208293] BTRFS info (device sdi): allowing degraded mounts
>   [19582.208298] BTRFS info (device sdi): disk space caching is enabled
>   [19582.208301] BTRFS info (device sdi): has skinny extents
>   [19582.212853] BTRFS warning (device sdi): devid 2 uuid 1f731f47-e1bb-4f00-bfbb-9e5a0cb4ba9f is missing
>   [19582.213904] btree_readpage_end_io_hook: 25839 callbacks suppressed
>   [19582.213907] BTRFS error (device sdi): bad tree block start, want 30490624 have 0
>   [19582.214780] BTRFS warning (device sdi): failed to read root (objectid=7): -5
>   [19582.231576] BTRFS error (device sdi): open_ctree failed
> 
>   Failed to mount in degraded mode
> 
> So fix by setting all allocated ranges in the replace target device when
> the replace operation is finishing, when we are holding the chunk mutex
> and we can not race with new chunk allocations.
> 
> A test case for fstests follows soon.
> 
> Fixes: 1c11b63eff2a67 ("btrfs: replace pending/pinned chunks lists with io tree")
> CC: stable@vger.kernel.org # 5.2+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Added to misc-next (a few days ago actually), thanks.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-09-30 16:32 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-23 14:30 [PATCH] btrfs: fix filesystem corruption after a device replace fdmanana
2020-09-23 14:43 ` Nikolay Borisov
2020-09-23 17:56 ` Johannes Thumshirn
2020-09-30 16:31 ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).