linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@gmail.com>
To: Kyle Ambroff-Kao <kyle@ambroffkao.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 1/1] btrfs: Allow replacing device with a smaller one if possible
Date: Mon, 9 Dec 2019 10:26:52 +0000	[thread overview]
Message-ID: <CAL3q7H60gNBC_zzU8gjZ_s=7MnN23yFzQqYxanhvzMO50qtXJg@mail.gmail.com> (raw)
In-Reply-To: <20191208093045.43433-2-kyle@ambroffkao.com>

On Sun, Dec 8, 2019 at 9:32 AM Kyle Ambroff-Kao <kyle@ambroffkao.com> wrote:
>
> As long as the target device has enough capacity for the total bytes
> of the source device, allow the replacement to occur.
>
> Just changing the size validation isn't enough though, since the
> rest of the replacement code just assumes that the source device is
> identical to the target device. The current code just blindly
> copies the total disk size from the source to the target.
>
> A btrfs resize <devid>:max could be performed, but we might as well
> just set the disk size for the new target device correctly in the
> first place before initiating a scrub, which is what this patch does.
>
> When initializing the target device, the size in bytes is calculated
> in the same way that btrfs_init_new_device does it.
>
> When the replace operation completes, btrfs_dev_replace_finishing no
> longer clobbers the target device size with the source device size.
>
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112741
> Signed-off-by: Kyle Ambroff-Kao <kyle@ambroffkao.com>
> ---
>  fs/btrfs/dev-replace.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index f639dde2a679..6a7a83ccab56 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -216,7 +216,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>
>
>         if (i_size_read(bdev->bd_inode) <
> -           btrfs_device_get_total_bytes(srcdev)) {
> +           btrfs_device_get_bytes_used(srcdev)) {

Hi,

Unfortunately things are not that simple...

The new device may have enough space, but that's not enough to guarantee
that the replace operation will succeed.
Consider the following example:

#!/bin/bash -x

losetup -d /dev/loop1 &> /dev/null
losetup -d /dev/loop2 &> /dev/null
losetup -d /dev/loop3 &> /dev/null

rm -f /mnt/disk1 /mnt/disk2 /mnt/disk3
fallocate -l 8G /mnt/disk1
fallocate -l 8G /mnt/disk2
fallocate -l 4G /mnt/disk3
losetup /dev/loop1 /mnt/disk1
losetup /dev/loop2 /mnt/disk2
losetup /dev/loop3 /mnt/disk3

mkdir /mnt/test &> /dev/null
umount /mnt/test &> /dev/null
mkfs.btrfs -f /dev/loop1 /dev/loop2
mount /dev/loop1 /mnt/test

fallocate -l 1G /mnt/test/foo1
fallocate -l 1G /mnt/test/foo2
fallocate -l 1G /mnt/test/foo3
fallocate -l 1G /mnt/test/foo4
fallocate -l 1G /mnt/test/foo5
fallocate -l 1G /mnt/test/foo6
fallocate -l 1G /mnt/test/foo7
fallocate -l 1G /mnt/test/foo8
sync
rm -f /mnt/test/foo{2,3,4,5,6,7}

umount /mnt/test
mount /dev/loop1 /mnt/test
sync # ensure empty block groups are removed

btrfs filesystem du /mnt/test
btrfs filesystem df /mnt/test

btrfs replace start -B /dev/loop2 /dev/loop3 /mnt/test

umount /mnt/test

# end of script


This will fail, despite the new device having a size of 4Gb which is more
than enough as there's little more than 2Gb used by the filesystem:

(...)
+ btrfs filesystem du /mnt/test
     Total   Exclusive  Set shared  Filename
   1.00GiB     1.00GiB           -  /mnt/test/foo1
   1.00GiB     1.00GiB           -  /mnt/test/foo8
   2.00GiB     2.00GiB       0.00B  /mnt/test

+ btrfs filesystem df /mnt/test
Data, RAID0: total=4.85GiB, used=2.00GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=256.00MiB, used=112.00KiB
GlobalReserve, single: total=3.25MiB, used=0.00B

+ btrfs replace start -B /dev/loop2 /dev/loop3 /mnt/test
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/test": Input/output error
(...)


If you look at dmesg/syslog you'll also see a lot of error messages regarding
writes beyond the target device's size, which is the replace operation fails
with error -EIO (which is -5):

$ dmesg
(...)
[242142.682576] loop3: rw=1, want=8904704, limit=8388608
[242142.682623] attempt to access beyond end of device
[242142.682625] loop3: rw=1, want=8904832, limit=8388608
[242142.682671] attempt to access beyond end of device
[242142.682672] loop3: rw=1, want=8904960, limit=8388608
[242142.682716] attempt to access beyond end of device
[242142.682718] loop3: rw=1, want=8905088, limit=8388608
[242148.882448] BTRFS error (device loop1):
btrfs_scrub_dev(/dev/loop2, 2, /dev/loop3) failed -5
[242148.903248] ------------[ cut here ]------------
[242148.903375] WARNING: CPU: 1 PID: 22989 at
fs/btrfs/dev-replace.c:508 btrfs_dev_replace_by_ioctl+0x711/0x8b0
[btrfs]
[242148.903383] Modules linked in: btrfs xor raid6_pq libcrc32c loop
intel_rapl_msr intel_rapl_common kvm_intel kvm bochs_drm
drm_vram_helper ttm irqbypass drm_kms_helper crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel drm aesni_intel crypto_simd cryptd
joydev glue_helper evdev sg pcspkr parport_pc serio_raw button
qemu_fw_cfg ppdev lp parport ip_tables x_tables autofs4 ext4
crc32c_generic crc16 mbcache jbd2 sd_mod virtio_scsi ata_generic
crc32c_intel virtio_pci virtio_ring psmouse virtio e1000 ata_piix
libata scsi_mod i2c_piix4
[242148.903502] CPU: 1 PID: 22989 Comm: btrfs Tainted: G        W
   5.4.0-rc8-btrfs-next-65 #1
[242148.903509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[242148.903618] RIP: 0010:btrfs_dev_replace_by_ioctl+0x711/0x8b0 [btrfs]
[242148.903630] Code: 89 f7 e8 82 e7 fa ff 8b 45 b8 e9 8a fe ff ff 48
c7 c2 83 ff ff ff 48 8b 4d c0 48 89 51 08 e9 25 fe ff ff 0f 0b e9 cd
fd ff ff <0f> 0b e9 68 fe ff ff 48 c7 c6 c0 31 9c c0 48 89 df e8 43 9f
02 00
[242148.903637] RSP: 0018:ffffa9c740803cb0 EFLAGS: 00010282
[242148.903648] RAX: 00000000fffffffb RBX: ffff99d87ab7c000 RCX:
0000000000000000
[242148.903655] RDX: 0000000000000000 RSI: ffff99d87ab7ee68 RDI:
0000000000000246
[242148.903663] RBP: ffffa9c740803d10 R08: 0000000000000000 R09:
0000000000000000
[242148.903669] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff99d8b35ef000
[242148.903677] R13: 0000000000000000 R14: ffff99d87b38d800 R15:
ffff99d87ab7ee88
[242148.903686] FS:  00007f1fc75cd8c0(0000) GS:ffff99d8b6a80000(0000)
knlGS:0000000000000000
[242148.903694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[242148.903701] CR2: 00007ffe01b4bec8 CR3: 00000001fca5a005 CR4:
00000000003606e0
[242148.903717] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[242148.903724] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[242148.903730] Call Trace:
[242148.903864]  btrfs_ioctl+0x2928/0x37e0 [btrfs]
[242148.903894]  ? __lock_acquire+0x37a/0x1e10
[242148.903910]  ? __lock_acquire+0x37a/0x1e10
[242148.903961]  ? do_sigaction+0xf3/0x240
[242148.903992]  ? do_vfs_ioctl+0xa2/0x700
[242148.904003]  do_vfs_ioctl+0xa2/0x700
[242148.904017]  ? do_sigaction+0xf3/0x240
[242148.904051]  ksys_ioctl+0x70/0x80
[242148.904063]  ? trace_hardirqs_off_thunk+0x1a/0x20
[242148.904086]  __x64_sys_ioctl+0x16/0x20
[242148.904097]  do_syscall_64+0x5c/0x250
[242148.904114]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[242148.904125] RIP: 0033:0x7f1fc63d5dd7
[242148.904136] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00
00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89
01 48
[242148.904145] RSP: 002b:00007fffd8716078 EFLAGS: 00000202 ORIG_RAX:
0000000000000010
[242148.904154] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX:
00007f1fc63d5dd7
[242148.904162] RDX: 00007fffd8716ee0 RSI: 00000000ca289435 RDI:
0000000000000003
[242148.904167] RBP: 0000000000000004 R08: 0000000000000000 R09:
0000000000000000
[242148.904173] R10: 0000000000000008 R11: 0000000000000202 R12:
0000000000000000
[242148.904181] R13: 0000000000000003 R14: 00007fffd87186a6 R15:
00005555cb872050
[242148.904234] irq event stamp: 1705866
[242148.904246] hardirqs last  enabled at (1705865):
[<ffffffff896c65d9>] _raw_spin_unlock_irqrestore+0x59/0x70
[242148.904258] hardirqs last disabled at (1705866):
[<ffffffff88c048ea>] trace_hardirqs_off_thunk+0x1a/0x20
[242148.904317] softirqs last  enabled at (1705764):
[<ffffffff8949359e>] peernet2id+0x4e/0x70
[242148.904328] softirqs last disabled at (1705762):
[<ffffffff8949357f>] peernet2id+0x2f/0x70
[242148.904334] ---[ end trace f2fa14238388d729 ]---


So device replace fails with -EIO and lots of errors in dmesg/syslog,
something that will mislead a user into thinking there's some problem
with the hardware.
Each device is using 2880634880 bytes (it's what btrfs_device_get_bytes_used()
returns) right before the replace operation, a value clearly smaller
then 4Gb (the size
of the new device).

So, why does this happens?
Because device replace simply copies extents (both data and metadata) directly,
if something is at offset 7Gb in the source device, it tries to copy that extent
also to the offset 7Gb of the target device. This is very simple, reliable and
predictable, it actually decreases the chances for data/metadata loss.

For extents already on disk before the device replace starts, the process
consists of iterating all chunks and then copy them from the source to
the target device. Once a chunk is processed we don't process it anymore.

If any extent (data or metadata) is written during the replace procedure,
after we processed the corresponding chunk, the respective extent is still
copied to the target device - this is because during the initialization
of device replace we set up a mechanism that results in once a write is
performed to the source device, it is automatically forwarded to the target
device as well (using the same offset as the source device) - the key places
to see where this happens are: btrfs_dev_replace_start() and
__btrfs_map_block().

So, that's why we don't allow the target device to be smaller then the source
device - a device may have only 2Gb of space allocated to chunks for example,
but one 1Gb chunk may be at device offset 0 and the other 1Gb chunk at device
offset 6Gb, so replacing with a 4Gb device will never work.

In order to allow a smaller device for device replace, I see two approaches:

1) The more complex one:  have a mechanism to remap chunk offsets. This implies
   building some mapping (in memory), so that for the above case for example,
   if the first data chunk is at offset 1GB it can be mapped to 1GB in
the target
   device as well, but for the second data chunk, we have to remap the offset
   (something > 4Gb) to something <= 3Gb on the target device. I think
this may actually
   be harder then it may seem. It's not only that copy part (and
dealing with incoming
   writes while replace is in progress), because when replace
finishes, you also have
   to update the mappings in the chunk tree for every chunk with the
new start offset,
   which can take a while and require a lot of metadata space to be
reserved. In other
   words, it will require significant changes to how device replace
copies extents just
   for one use case.

2) A simple solution, but often less efficient: before starting the actual
   replace operation, shrink the source device to the size of the
target device -
   just use the existing btrfs_shrink_device(), which will relocate chunks
   beyond the new size, and if there's not enough space it just returns -ENOSPC.
   This means no changes to the actual way replace copies data - it
does extra IO,
   due to the relocation but keeps things simple, and it should still
be significantly
   more efficient then doing a device remove + device add operation,
maybe except if
   all or most of the allocated chunks (in the device to be replaced)
cross or start
   beyond an offset matching the new device's size.

   Also, since the shrink can take some time due to relocation of
chunks, we would need
   to teach btrfs_shrink_device() to check for device replace cancel
requests as well.
   And such request is detected, restore the device's size to the
original value.

I think option 2 may actually be acceptable for an initial version. Option 1 is
complex and increases the risk for data loss. Also, for option 2, there's the
possible downside of requiring writes to the source device - one might
be replacing
it because the device is not healthy, writes into some regions are
failing, which
can prevent the shrink/relocation process from suceeding, in that case only
a device remove followed by a device add operation would work.

Either way, we will also need to have test cases in fstests to cover
all possible
scenarios we can think of, including stress tests where we replace the
device with
a smaller one and write/delete/create to files while replace is in
progress. Then
for each scenario scrub the devices, run fsck, etc, to verify
everything is fine.
If you decide to go for option 2, also compare performance against a
device remove
+ device add operation.

Thanks.


>                 btrfs_err(fs_info,
>                           "target device is smaller than source device!");
>                 ret = -EINVAL;
> @@ -243,8 +243,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>         device->io_width = fs_info->sectorsize;
>         device->io_align = fs_info->sectorsize;
>         device->sector_size = fs_info->sectorsize;
> -       device->total_bytes = btrfs_device_get_total_bytes(srcdev);
> -       device->disk_total_bytes = btrfs_device_get_disk_total_bytes(srcdev);
> +       device->total_bytes = round_down(
> +               i_size_read(bdev->bd_inode),
> +               fs_info->sectorsize);
> +       device->disk_total_bytes = device->total_bytes;
>         device->bytes_used = btrfs_device_get_bytes_used(srcdev);
>         device->commit_total_bytes = srcdev->commit_total_bytes;
>         device->commit_bytes_used = device->bytes_used;
> @@ -671,9 +673,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>         memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
>         memcpy(tgt_device->uuid, src_device->uuid, sizeof(tgt_device->uuid));
>         memcpy(src_device->uuid, uuid_tmp, sizeof(src_device->uuid));
> -       btrfs_device_set_total_bytes(tgt_device, src_device->total_bytes);
> -       btrfs_device_set_disk_total_bytes(tgt_device,
> -                                         src_device->disk_total_bytes);
>         btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used);
>         tgt_device->commit_bytes_used = src_device->bytes_used;
>
> --
> 2.20.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

  reply	other threads:[~2019-12-09 10:27 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-08  9:30 [PATCH 0/1] Support replacing a device with smaller one Kyle Ambroff-Kao
2019-12-08  9:30 ` [PATCH 1/1] btrfs: Allow replacing device with a smaller one if possible Kyle Ambroff-Kao
2019-12-09 10:26   ` Filipe Manana [this message]
2019-12-09 19:22     ` Kyle Ambroff-Kao
2019-12-13 15:48     ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAL3q7H60gNBC_zzU8gjZ_s=7MnN23yFzQqYxanhvzMO50qtXJg@mail.gmail.com' \
    --to=fdmanana@gmail.com \
    --cc=kyle@ambroffkao.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).