Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Cedric.dewijs@eclipso.eu, Linux BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
Date: Sat, 9 Jan 2021 16:40:32 -0500
Message-ID: <20210109214032.GC31381@hungrycats.org>
In-Reply-To: <CAK-xaQbQPSS7=cH1qmb9S51CL34VRfyE_=eNwb-GhSL1b8Yz2g@mail.gmail.com>

On Fri, Jan 08, 2021 at 08:29:45PM +0100, Andrea Gelmini wrote:
> On Fri, 8 Jan 2021 at 09:36, <Cedric.dewijs@eclipso.eu> wrote:
> > What happens when I poison one of the drives in the mdadm array using this command? Will all data come out OK?
> > dd if=/dev/urandom of=/dev/sdb1 bs=1M count=100
> 
> <smiling>
> Well, the same thing that happens when your laptop is stolen or you read
> "open_ctree failed"... You restore from backup...
> </smiling>
> 
> I have a few ideas, but it's much quicker to try it. Let's see:
> 
> truncate -s 5G dev1
> truncate -s 5G dev2
> losetup /dev/loop31 dev1
> losetup /dev/loop32 dev2
> mdadm --create --verbose --assume-clean /dev/md0 --level=1
> --raid-devices=2 /dev/loop31 --write-mostly /dev/loop32

Note that with --write-mostly here, total filesystem loss is no longer
random: mdadm will always pick loop31 over loop32 while loop31 exists.
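
To confirm which member carries the flag, here is a minimal sketch that
assumes the /proc/mdstat layout quoted further down, where write-mostly
members are tagged with (W).  The function name list_write_mostly and its
optional file argument are made up for illustration; they are not part
of mdadm.

```shell
#!/bin/sh
# list_write_mostly ARRAY [MDSTAT_FILE]
# Print the members of ARRAY that mdstat marks with (W), one per line.
# Hypothetical helper: MDSTAT_FILE defaults to /proc/mdstat and exists
# only so the parsing can be tried against a saved copy.
list_write_mostly() {
    array=$1
    mdstat=${2:-/proc/mdstat}
    # Member tokens look like "loop32[1](W)".  Keep only those flagged
    # (W) and strip everything from the first "[" onwards.
    awk -v a="$array" '$1 == a && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(W\)/) { sub(/\[.*/, "", $i); print $i }
    }' "$mdstat"
}
```

With the array created above, "list_write_mostly md0" should name loop32
only, which is the device md avoids reading from.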

> mkfs.btrfs /dev/md0
> mount -o compress=lzo /dev/md0 /mnt/sg10/
> cd /mnt/sg10/
> cp -af /home/gelma/dev/kernel/ .
> root@glet:/mnt/sg10# dmesg -T
> [Fri Jan  8 19:51:33 2021] md/raid1:md0: active with 2 out of 2 mirrors
> [Fri Jan  8 19:51:33 2021] md0: detected capacity change from 0 to 5363466240
> [Fri Jan  8 19:51:53 2021] BTRFS: device fsid
> 2fe43610-20e5-48de-873d-d1a6c2db2a6a devid 1 transid 5 /dev/md0
> scanned by mkfs.btrfs (512004)
> [Fri Jan  8 19:51:53 2021] md: data-check of RAID array md0
> [Fri Jan  8 19:52:19 2021] md: md0: data-check done.
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): setting incompat
> feature flag for COMPRESS_LZO (0x8)
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): use lzo compression, level 0
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): disk space caching
> is enabled
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): has skinny extents
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): flagging fs with
> big metadata feature
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): enabling ssd optimizations
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): checking UUID tree
> 
> root@glet:/mnt/sg10# btrfs scrub start -B .
> scrub done for 2fe43610-20e5-48de-873d-d1a6c2db2a6a
> Scrub started:    Fri Jan  8 20:01:59 2021
> Status:           finished
> Duration:         0:00:04
> Total to scrub:   4.99GiB
> Rate:             1.23GiB/s
> Error summary:    no errors found
> 
> We check the array is in sync:
> 
> root@glet:/mnt/sg10# cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
> [raid4] [raid10]
> md0 : active raid1 loop32[1](W) loop31[0]
>      5237760 blocks super 1.2 [2/2] [UU]

You used --assume-clean and haven't told mdadm otherwise since, so this
test doesn't provide any information.

On real disks an mdadm integrity check at this point would fail very hard,
since the devices have never been synced (unless they are both blank devices
filled with the same formatting test pattern or zeros).
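
For reference, a sketch of how such an integrity check can be requested,
assuming the md sysfs files sync_action and mismatch_cnt under
/sys/block/<array>/md.  The three function names and the SYSFS argument
are hypothetical; SYSFS exists only so the helpers can be pointed at a
scratch directory instead of the real /sys.

```shell
#!/bin/sh
# Hypothetical helpers around the md sysfs interface.
# Each takes ARRAY (e.g. md0) and an optional SYSFS root (default /sys).

# Ask md to read every mirror and compare them, without repairing anything.
md_start_check() {
    echo check > "${2:-/sys}/block/$1/md/sync_action"
}

# Succeed once no check, resync or recovery is running on the array.
md_check_done() {
    [ "$(cat "${2:-/sys}/block/$1/md/sync_action")" = idle ]
}

# Print the mismatch counter left behind by the last check.
md_mismatch_cnt() {
    cat "${2:-/sys}/block/$1/md/mismatch_cnt"
}
```

As root, something like "md_start_check md0; until md_check_done md0; do
sleep 5; done; md_mismatch_cnt md0" runs the check and prints the result.
A non-zero count means the mirrors differ somewhere; it does not say
which copy is the good one.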

> unused devices: <none>
> 
> Now we wipe the storage:
> root@glet:/mnt/sg10# dd if=/dev/urandom of=/dev/loop32 bs=1M count=100

With --write-mostly, the above deterministically works, and

	dd if=/dev/urandom of=/dev/loop31 bs=1M count=100

deterministically damages or destroys the filesystem.

With real disk failures you don't get to pick which drive is corrupted
or when.  If it's the remote drive, you have no backup and have no way
to _know_ you have no backup.  If it's the local drive, you can recover
it if you read from the backup in time; otherwise, you lose your data
permanently on the next mdadm resync.

> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB, 100 MiB) copied, 0.919025 s, 114 MB/s
> 
> sync
> 
> echo 3 > /proc/sys/vm/drop_caches
> 
> I do rm to force write i/o:
> 
> root@glet:/mnt/sg10# rm kernel/v5.11/ -rf
> 
> root@glet:/mnt/sg10# btrfs scrub start -B .
> scrub done for 2fe43610-20e5-48de-873d-d1a6c2db2a6a
> Scrub started:    Fri Jan  8 20:11:21 2021
> Status:           finished
> Duration:         0:00:03
> Total to scrub:   4.77GiB
> Rate:             1.54GiB/s
> Error summary:    no errors found

This scrub will never detect corruption on the remote filesystem because
of --write-mostly, so you have no way to know whether it has bitrotted
away (or is just missing a whole lot of updates).
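
One way to make a scrub exercise the other mirror is to move the
write-mostly flag temporarily, so that md serves reads from the device
it normally avoids.  A rough sketch, assuming the md sysfs interface
where writing "writemostly" or "-writemostly" to a member's state file
sets or clears the flag; md_set_writemostly and its SYSFS argument are
hypothetical names used for illustration.

```shell
#!/bin/sh
# md_set_writemostly ARRAY MEMBER on|off [SYSFS]
# Set or clear the write-mostly flag on one member of a running array.
# Hypothetical helper: SYSFS defaults to /sys and is overridable so the
# function can be exercised against a scratch directory.
md_set_writemostly() {
    case $3 in
        on)  flag=writemostly ;;
        off) flag=-writemostly ;;
        *)   echo "usage: md_set_writemostly ARRAY MEMBER on|off [SYSFS]" >&2
             return 2 ;;
    esac
    # printf rather than echo, because the value may start with "-".
    printf '%s\n' "$flag" > "${4:-/sys}/block/$1/md/dev-$2/state"
}
```

As root, "md_set_writemostly md0 loop31 on; md_set_writemostly md0
loop32 off" followed by dropping caches and rerunning the scrub should
read from loop32; swap the flags back afterwards.  Any checksum errors
this turns up are only reported: btrfs sees a single device here, so it
cannot ask md for the other copy.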

> Now, I stop the array and re-assemble it:
> mdadm -Ss
> 
> root@glet:/# mdadm --assemble /dev/md0 /dev/loop31 /dev/loop32
> mdadm: /dev/md0 has been started with 2 drives.
> 
> root@glet:/# mount /dev/md0 /mnt/sg10/
> root@glet:/# btrfs scrub start -B  /mnt/sg10/
> scrub done for 2fe43610-20e5-48de-873d-d1a6c2db2a6a
> Scrub started:    Fri Jan  8 20:15:16 2021
> Status:           finished
> Duration:         0:00:03
> Total to scrub:   4.77GiB
> Rate:             1.54GiB/s
> Error summary:    no errors found
> 
> Ciao,
> Gelma