linux-btrfs.vger.kernel.org archive mirror
* Raid1 of a slow hdd and a fast(er) SSD,  howto to prioritize the SSD?
@ 2021-01-05  6:39  
  2021-01-05  6:53 ` Qu Wenruo
  2021-01-08  8:16 ` Andrea Gelmini
  0 siblings, 2 replies; 15+ messages in thread
From:   @ 2021-01-05  6:39 UTC (permalink / raw)
  To: linux-btrfs

I have put an SSD and a slow laptop HDD in a btrfs raid1. This was a bad idea; my system does not feel responsive. When I load a program, dstat shows that half of the program is loaded from the SSD and the rest from the slow hard drive.

I was expecting btrfs to do almost all reads from the fast SSD, as both the data and the metadata are on that drive, so the slow HDD is only really needed when there's a bitflip on the SSD and the data has to be reconstructed.

Writing has to be done to both drives of course, but I don't expect slowdowns from that, as the system RAM should cache that. 

Is there a way to tell btrfs to leave the slow hdd alone, and to prioritize the SSD?

In detail:

# mkfs.btrfs -f -d raid1 -m raid1 /dev/sda1 /dev/sdb1

# btrfs filesystem show /
Label: none  uuid: 485952f9-0cfc-499a-b5c2-xxxxxxxxx
	Total devices 2 FS bytes used 63.62GiB
	devid    2 size 223.57GiB used 65.03GiB path /dev/sda1
	devid    3 size 149.05GiB used 65.03GiB path /dev/sdb1


 $ dstat -tdD sda,sdb --nocolor
----system---- --dsk/sda-- --dsk/sdb--
     time     | read  writ: read  writ
05-01 08:19:39|   0     0 :   0     0 
05-01 08:19:40|   0  4372k:   0  4096k
05-01 08:19:41|  61M 4404k:  16k 4680k
05-01 08:19:42|  52M    0 :6904k    0 
05-01 08:19:43|4556k   76k:  31M   76k
05-01 08:19:44|2640k    0 :  38M    0 
05-01 08:19:45|4064k    0 :  30M    0 
05-01 08:19:46|1252k    0 :  30M    0 
05-01 08:19:47|2572k    0 :  37M    0 
05-01 08:19:48|5840k    0 :  27M    0 
05-01 08:19:49|4480k  492k:  22M  492k
05-01 08:19:50|1284k    0 :  44M    0 
05-01 08:19:51|1184k    0 :  33M    0 
05-01 08:19:52|3592k    0 :  31M    0 
05-01 08:19:53|  14M  156k:8268k  156k
05-01 08:19:54|  22M 1956k:   0  1956k
05-01 08:19:55|   0     0 :   0     0 
05-01 08:19:56|7636k    0 :   0     0 
05-01 08:19:57|  23M  116k:   0   116k
05-01 08:19:58|2296k  552k:   0   552k
05-01 08:19:59| 624k  132k:   0   132k
05-01 08:20:00|   0     0 :   0     0 
05-01 08:20:01|6948k  188k:   0   188k
05-01 08:20:02|   0  1340k:4364k 1340k
05-01 08:20:03|   0     0 :   0     0 
05-01 08:20:04|   0   484k:   0   484k
05-01 08:20:05|   0     0 :   0     0 
05-01 08:20:06|   0     0 :   0     0 
05-01 08:20:07|   0     0 :   0     0 
05-01 08:20:08|   0    84k:   0    84k
05-01 08:20:09|   0   132k:   0   132k
05-01 08:20:10|   0     0 :   0     0 
05-01 08:20:11|   0  7616k:  96k 7584k
05-01 08:20:12|   0  2264k:   0  2296k
05-01 08:20:13|   0     0 :   0     0 
05-01 08:20:14|   0  1956k:   0  1956k
05-01 08:20:15|   0     0 :   0     0 
05-01 08:20:16|   0     0 :   0     0 

# fdisk -l
**This is the SSD**
Disk /dev/sda: 223.57 GiB, 240057409536 bytes, 468862128 sectors
Disk model: CT240BX200SSD1  
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x12cfb9e1

Device     Boot Start       End   Sectors   Size Id Type
/dev/sda1        2048 468862127 468860080 223.6G 83 Linux

**This is the hard drive**
Disk /dev/sdb: 149.05 GiB, 160041885696 bytes, 312581808 sectors
Disk model: Hitachi HTS54321
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x20000000

Device     Boot Start       End   Sectors  Size Id Type
/dev/sdb1        2048 312581807 312579760  149G 83 Linux




* Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-05  6:39 Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?  
@ 2021-01-05  6:53 ` Qu Wenruo
  2021-01-05 18:19   `  
                     ` (2 more replies)
  2021-01-08  8:16 ` Andrea Gelmini
  1 sibling, 3 replies; 15+ messages in thread
From: Qu Wenruo @ 2021-01-05  6:53 UTC (permalink / raw)
  To: Cedric.dewijs, linux-btrfs



On 2021/1/5 下午2:39, Cedric.dewijs@eclipso.eu wrote:
> ­I have put a SSD and a slow laptop HDD in btrfs raid. This was a bad idea, my system does not feel responsive. When i load a program, dstat shows half of the program is loaded from the SSD, and the rest from the slow hard drive.

Btrfs uses the pid to load-balance reads IIRC, thus it sucks for such a workload.

>
> I was expecting btrfs to do almost all reads from the fast SSD, as both the data and the metadata is on that drive, so the slow hdd is only really needed when there's a bitflip on the SSD, and the data has to be reconstructed.
IIRC there will be some read policy feature to do that, but it's not yet
merged, and even when it is merged you will still need to manually specify
the priority, as there is no way for btrfs to know which drive is faster
(except the non-rotational bit, which is not reliable at all).

>
> Writing has to be done to both drives of course, but I don't expect slowdowns from that, as the system RAM should cache that.

Writes can still slow down the system even if you have tons of memory.
Operations like fsync() or sync() will still wait for the writeback,
so in your case they will still be slowed by the HDD no matter what.

In fact, on a real-world desktop, most writes come from (sometimes
unnecessary) fsync() calls.

To get rid of such slowdowns, you have to go the dangerous route of
disabling barriers, which is never a safe idea.

>
> Is there a way to tell btrfs to leave the slow hdd alone, and to prioritize the SSD?

Not in upstream kernel for now.

Thus I guess you need something like bcache to do this.
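
A rough sketch of what that could look like here (untested; /dev/sda2 below
is a hypothetical spare SSD partition, not one from your fdisk output, and
make-bcache wipes whatever it formats):

# the HDD partition becomes the backing device, exposed as /dev/bcache0
make-bcache -B /dev/sdb1
# a spare SSD partition becomes the cache set
make-bcache -C /dev/sda2
# find the cache set uuid and attach it to the backing device
bcache-super-show /dev/sda2 | grep cset.uuid
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode

/dev/bcache0 then takes the place of the raw HDD partition, so most reads
are served from the SSD cache and, in writeback mode, writes are absorbed
by the SSD first.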

Thanks,
Qu

>
> In detail:
>
> # mkfs.btrfs -f -d raid1 -m raid1 /dev/sda1 /dev/sdb1
>
> # btrfs filesystem show /
> Label: none  uuid: 485952f9-0cfc-499a-b5c2-xxxxxxxxx
> 	Total devices 2 FS bytes used 63.62GiB
> 	devid    2 size 223.57GiB used 65.03GiB path /dev/sda1
> 	devid    3 size 149.05GiB used 65.03GiB path /dev/sdb1
>
>
>   $ dstat -tdD sda,sdb --nocolor
> ----system---- --dsk/sda-- --dsk/sdb--
>       time     | read  writ: read  writ
> 05-01 08:19:39|   0     0 :   0     0
> 05-01 08:19:40|   0  4372k:   0  4096k
> 05-01 08:19:41|  61M 4404k:  16k 4680k
> 05-01 08:19:42|  52M    0 :6904k    0
> 05-01 08:19:43|4556k   76k:  31M   76k
> 05-01 08:19:44|2640k    0 :  38M    0
> 05-01 08:19:45|4064k    0 :  30M    0
> 05-01 08:19:46|1252k    0 :  30M    0
> 05-01 08:19:47|2572k    0 :  37M    0
> 05-01 08:19:48|5840k    0 :  27M    0
> 05-01 08:19:49|4480k  492k:  22M  492k
> 05-01 08:19:50|1284k    0 :  44M    0
> 05-01 08:19:51|1184k    0 :  33M    0
> 05-01 08:19:52|3592k    0 :  31M    0
> 05-01 08:19:53|  14M  156k:8268k  156k
> 05-01 08:19:54|  22M 1956k:   0  1956k
> 05-01 08:19:55|   0     0 :   0     0
> 05-01 08:19:56|7636k    0 :   0     0
> 05-01 08:19:57|  23M  116k:   0   116k
> 05-01 08:19:58|2296k  552k:   0   552k
> 05-01 08:19:59| 624k  132k:   0   132k
> 05-01 08:20:00|   0     0 :   0     0
> 05-01 08:20:01|6948k  188k:   0   188k
> 05-01 08:20:02|   0  1340k:4364k 1340k
> 05-01 08:20:03|   0     0 :   0     0
> 05-01 08:20:04|   0   484k:   0   484k
> 05-01 08:20:05|   0     0 :   0     0
> 05-01 08:20:06|   0     0 :   0     0
> 05-01 08:20:07|   0     0 :   0     0
> 05-01 08:20:08|   0    84k:   0    84k
> 05-01 08:20:09|   0   132k:   0   132k
> 05-01 08:20:10|   0     0 :   0     0
> 05-01 08:20:11|   0  7616k:  96k 7584k
> 05-01 08:20:12|   0  2264k:   0  2296k
> 05-01 08:20:13|   0     0 :   0     0
> 05-01 08:20:14|   0  1956k:   0  1956k
> 05-01 08:20:15|   0     0 :   0     0
> 05-01 08:20:16|   0     0 :   0     0
>
> # fdisk -l
> **This is the SSD**
> Disk /dev/sda: 223.57 GiB, 240057409536 bytes, 468862128 sectors
> Disk model: CT240BX200SSD1
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
> Disklabel type: dos
> Disk identifier: 0x12cfb9e1
>
> Device     Boot Start       End   Sectors   Size Id Type
> /dev/sda1        2048 468862127 468860080 223.6G 83 Linux
>
> **This is the hard drive**
> Disk /dev/sdb: 149.05 GiB, 160041885696 bytes, 312581808 sectors
> Disk model: Hitachi HTS54321
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x20000000
>
> Device     Boot Start       End   Sectors  Size Id Type
> /dev/sdb1        2048 312581807 312579760  149G 83 Linux
>
>
>

* Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-05  6:53 ` Qu Wenruo
@ 2021-01-05 18:19   `  
  2021-01-07 22:11     ` Zygo Blaxell
  2021-01-05 19:19   ` Stéphane Lesimple
  2021-01-06  2:55   ` Anand Jain
  2 siblings, 1 reply; 15+ messages in thread
From:   @ 2021-01-05 18:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

>> I was expecting btrfs to do almost all reads from the fast SSD, as both
>> the data and the metadata is on that drive, so the slow hdd is only really
>> needed when there's a bitflip on the SSD, and the data has to be reconstructed.

> IIRC there will be some read policy feature to do that, but not yet
> merged, and even merged, you still need to manually specify the
> priority, as there is no way for btrfs to know which driver is faster
> (except the non-rotational bit, which is not reliable at all).

Manually specifying the priority drive would be a big step in the right direction. Maybe btrfs could get a routine that benchmarks the sequential and random read and write speed of the drives at (for instance) mount time, or triggered by an administrator? This could lead to misleading results if btrfs doesn't get the whole drive to itself.


>> Writing has to be done to both drives of course, but I don't expect
>> slowdowns from that, as the system RAM should cache that.

>Write can still slow down the system even you have tons of memory.
>Operations like fsync() or sync() will still wait for the writeback,
>thus in your case, it will also be slowed by the HDD no matter what.

>In fact, in real world desktop, most of the writes are from sometimes
>unnecessary fsync().

>To get rid of such slow down, you have to go dangerous by disabling
>barrier, which is never a safe idea.

I suggest a middle ground, where btrfs returns from fsync() when one of the copies (instead of all the copies) of the data has been written completely to disk. This poses a small data risk, as it creates moments when there's only one copy of the data on disk, while the software above btrfs thinks all data is written on two disks. One problem I see is if the server is told to shut down while there's still a big backlog of data to be written to the slow drive, while the fast drive is already done. Then the system could cut the power while the slow drive is still being written.

I think this setting should be left to the system administrator; it's not a good idea to just blindly enable this behavior.

>>
>> Is there a way to tell btrfs to leave the slow hdd alone, and to prioritize
>> the SSD?

> Not in upstream kernel for now.

> Thus I guess you need something like bcache to do this.

Agreed. However, one of the problems of bcache is that it can't use 2 SSDs in mirrored mode to form a writeback cache in front of many spindles, so this structure is impossible:
+-----------------------------------------------------------+--------------+--------------+
|                               btrfs raid 1 (2 copies) /mnt                              |
+--------------+--------------+--------------+--------------+--------------+--------------+
| /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 | /dev/bcache4 | /dev/bcache5 |
+--------------+--------------+--------------+--------------+--------------+--------------+
|                          Write Cache (2xSSD in raid 1, mirrored)                        |
|                                 /dev/sda2 and /dev/sda3                                 |
+--------------+--------------+--------------+--------------+--------------+--------------+
| Data         | Data         | Data         | Data         | Data         | Data         |
| /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   | /dev/sda13   | /dev/sda14   |
+--------------+--------------+--------------+--------------+--------------+--------------+

In order to get a system that has no data loss if a drive fails, the user either has to live with only a read cache, or has to put a separate writeback cache in front of each spindle, like this:
+-----------------------------------------------------------+
|                btrfs raid 1 (2 copies) /mnt               |
+--------------+--------------+--------------+--------------+
| /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
+--------------+--------------+--------------+--------------+
| Write Cache  | Write Cache  | Write Cache  | Write Cache  |
|(Flash Drive) |(Flash Drive) |(Flash Drive) |(Flash Drive) |
| /dev/sda5    | /dev/sda6    | /dev/sda7    | /dev/sda8    |
+--------------+--------------+--------------+--------------+
| Data         | Data         | Data         | Data         |
| /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
+--------------+--------------+--------------+--------------+

In the mainline kernel it's impossible to put a bcache on top of a bcache, so a user does not have the option of having 4 small write caches below one fast, big read cache like this:
+-----------------------------------------------------------+
|                btrfs raid 1 (2 copies) /mnt               |
+--------------+--------------+--------------+--------------+
| /dev/bcache4 | /dev/bcache5 | /dev/bcache6 | /dev/bcache7 |
+--------------+--------------+--------------+--------------+
|                      Read Cache (SSD)                     |
|                        /dev/sda4                          |
+--------------+--------------+--------------+--------------+
| /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
+--------------+--------------+--------------+--------------+
| Write Cache  | Write Cache  | Write Cache  | Write Cache  |
|(Flash Drive) |(Flash Drive) |(Flash Drive) |(Flash Drive) |
| /dev/sda5    | /dev/sda6    | /dev/sda7    | /dev/sda8    |
+--------------+--------------+--------------+--------------+
| Data         | Data         | Data         | Data         |
| /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
+--------------+--------------+--------------+--------------+

>Thanks,
>Qu

Thank you,
Cedric



* Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-05  6:53 ` Qu Wenruo
  2021-01-05 18:19   `  
@ 2021-01-05 19:19   ` Stéphane Lesimple
  2021-01-06  2:55   ` Anand Jain
  2 siblings, 0 replies; 15+ messages in thread
From: Stéphane Lesimple @ 2021-01-05 19:19 UTC (permalink / raw)
  To: Cedric.dewijs, Qu Wenruo; +Cc: linux-btrfs

January 5, 2021 7:20 PM, Cedric.dewijs@eclipso.eu wrote:

>>> I was expecting btrfs to do almost all reads from the fast SSD, as both
> 
> the data and the metadata is on that drive, so the slow hdd is only really
> needed when there's a bitflip on the SSD, and the data has to be reconstructed.
> 
>> IIRC there will be some read policy feature to do that, but not yet
>> merged, and even merged, you still need to manually specify the
>> priority, as there is no way for btrfs to know which driver is faster
>> (except the non-rotational bit, which is not reliable at all).
> 
> Manually specifying the priority drive would be a big step in the right direction. Maybe btrfs
> could get a routine that benchmarks the sequential and random read and write speed of the drives at
> (for instance) mount time, or triggered by an administrator? This could lead to misleading results
> if btrfs doesn't get the whole drive to itself.
> 
>>> Writing has to be done to both drives of course, but I don't expect
> 
> slowdowns from that, as the system RAM should cache that.
> 
>> Write can still slow down the system even you have tons of memory.
>> Operations like fsync() or sync() will still wait for the writeback,
>> thus in your case, it will also be slowed by the HDD no matter what.
>> 
>> In fact, in real world desktop, most of the writes are from sometimes
>> unnecessary fsync().
>> 
>> To get rid of such slow down, you have to go dangerous by disabling
>> barrier, which is never a safe idea.
> 
> I suggest a middle ground, where btrfs returns from fsync when one of the copies (instead of all
> the copies) of the data has been written completely to disk. This poses a small data risk, as this
> creates moments that there's only one copy of the data on disk, while the software above btrfs
> thinks all data is written on two disks. one problem I see if the server is told to shut down while
> there's a big backlog of data to be written to the slow drive, while the big drive is already done.
> Then the server could cut the power while the slow drive is still being written.
> 
> i think this setting should be given to the system administrator, it's not a good idea to just
> blindly enable this behavior.
> 
>>> Is there a way to tell btrfs to leave the slow hdd alone, and to prioritize
> 
> the SSD?
> 
>> Not in upstream kernel for now.


I happen to have written a custom patch for my own use for a similar use case:
I have a bunch of slow drives constituting a raid1 FS of dozens of terabytes,
and just one SSD, reserved only for metadata.

My patch adds an entry under sysfs for each FS so that the admin can select the
"metadata_only" devid. This is optional, if it's not done, the usual btrfs behavior
applies. When set, this device is:
- never considered for new data chunks allocation
- preferred for new metadata chunk allocations
- preferred for metadata reads

This way I still have raid1, but the metadata chunks on slow drives are only
there for redundancy and never accessed for reads as long as the SSD metadata
is valid.

This *drastically* improved my snapshot rotation, and even made qgroups usable
again. I think I've been running this for 1-2 years, but obviously I'd love to see
such an option in the vanilla kernel so that I can get rid of my hacky patch :)

>> 
>> Thus I guess you need something like bcache to do this.
> 
> Agreed. However, one of the problems of bcache, it that it can't use 2 SSD's in mirrored mode to
> form a writeback cache in front of many spindles, so this structure is impossible:
> +-----------------------------------------------------------+--------------+--------------+
> | btrfs raid 1 (2 copies) /mnt |
> +--------------+--------------+--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 | /dev/bcache4 | /dev/bcache5 |
> +--------------+--------------+--------------+--------------+--------------+--------------+
> | Write Cache (2xSSD in raid 1, mirrored) |
> | /dev/sda2 and /dev/sda3 |
> +--------------+--------------+--------------+--------------+--------------+--------------+
> | Data | Data | Data | Data | Data | Data |
> | /dev/sda9 | /dev/sda10 | /dev/sda11 | /dev/sda12 | /dev/sda13 | /dev/sda14 |
> +--------------+--------------+--------------+--------------+--------------+--------------+
> 
> In order to get a system that has no data loss if a drive fails, the user either has to live with
> only a read cache, or the user has to put a separate writeback cache in front of each spindle like
> this:
> +-----------------------------------------------------------+
> | btrfs raid 1 (2 copies) /mnt |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> | Write Cache | Write Cache | Write Cache | Write Cache |
> |(Flash Drive) |(Flash Drive) |(Flash Drive) |(Flash Drive) |
> | /dev/sda5 | /dev/sda6 | /dev/sda7 | /dev/sda8 |
> +--------------+--------------+--------------+--------------+
> | Data | Data | Data | Data |
> | /dev/sda9 | /dev/sda10 | /dev/sda11 | /dev/sda12 |
> +--------------+--------------+--------------+--------------+
> 
> In the mainline kernel is's impossible to put a bcache on top of a bcache, so a user does not have
> the option to have 4 small write caches below one fast, big read cache like this:
> +-----------------------------------------------------------+
> | btrfs raid 1 (2 copies) /mnt |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache4 | /dev/bcache5 | /dev/bcache6 | /dev/bcache7 |
> +--------------+--------------+--------------++-------------+
> | Read Cache (SSD) |
> | /dev/sda4 |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> | Write Cache | Write Cache | Write Cache | Write Cache |
> |(Flash Drive) |(Flash Drive) |(Flash Drive) |(Flash Drive) |
> | /dev/sda5 | /dev/sda6 | /dev/sda7 | /dev/sda8 |
> +--------------+--------------+--------------+--------------+
> | Data | Data | Data | Data |
> | /dev/sda9 | /dev/sda10 | /dev/sda11 | /dev/sda12 |
> +--------------+--------------+--------------+--------------+
> 
>> Thanks,
>> Qu
> 
> Thank you,
> Cedric
> 

* Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-05  6:53 ` Qu Wenruo
  2021-01-05 18:19   `  
  2021-01-05 19:19   ` Stéphane Lesimple
@ 2021-01-06  2:55   ` Anand Jain
  2 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2021-01-06  2:55 UTC (permalink / raw)
  To: Qu Wenruo, Cedric.dewijs, linux-btrfs



>> ­I have put a SSD and a slow laptop HDD in btrfs raid. This was a bad 
>> idea, my system does not feel responsive. When i load a program, dstat 
>> shows half of the program is loaded from the SSD, and the rest from 
>> the slow hard drive.

Drive speeds are evolving: NVMe brought lower latency than SATA SSDs. Mixing
drives of different speeds in production is sometimes unavoidable.


> 
> Btrfs uses pid to load balance read IIRC, thus it sucks in such workload.
> 

The PID policy is not good when drives are of mixed speeds. But it balances
well with the block layer's IO queuing, sorting and caching, so it provides
equally good read IO performance when all drives are of the same speed.

>> I was expecting btrfs to do almost all reads from the fast SSD, as 
>> both the data and the metadata is on that drive, so the slow hdd is 
>> only really needed when there's a bitflip on the SSD, and the data has 
>> to be reconstructed.

> IIRC there will be some read policy feature to do that, but not yet
> merged,

Yes. Patches are here [1]. These patches have dependent patches that are
merged in V5.11-rc1. Please give it a try. See below for which policy to
use.

> and even merged, you still need to manually specify the
> priority, as there is no way for btrfs to know which driver is faster
> (except the non-rotational bit, which is not reliable at all).


  Hm. This is wrong.

  There are two types of read policies as of now.

  - Latency
  - Device

  Latency - For each read IO, the latency policy dynamically picks the drive
  with the lowest latency, based on its historical read IO wait.

   Set the policy sysfs:
   echo "latency" > /sys/fs/btrfs/$uuid/read_policy

  Device - a kind of manual configuration; you tell the read IO which
  device to read from when all the stripes are healthy.

   Set it in sysfs:
   First, tell it which device is preferred for reading:
     echo 1 > /sys/fs/btrfs/$uuid/devinfo/$devid/read_preferred

   Set the policy:
   echo "device" > /sys/fs/btrfs/$uuid/read_policy

  The policy type round-robin is still experimental.

  Also, note that read policies are in memory only. You have to set them
  again (using sysfs) after a reboot/remount. I am open to feedback.
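
  As a sketch of how that could be scripted after each mount (this assumes the
  patch set from [1] below is applied; the mount point and devid are
  placeholders, not values from this thread):

   UUID=$(findmnt -no UUID /mnt)
   # prefer the SSD (placeholder devid 1) for reads, then select the policy
   echo 1 > /sys/fs/btrfs/$UUID/devinfo/1/read_preferred
   echo device > /sys/fs/btrfs/$UUID/read_policy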


[1]
V2:
[PATCH v2 0/4] btrfs: read_policy types latency, device and round-robin


Thanks.


* Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-05 18:19   `  
@ 2021-01-07 22:11     ` Zygo Blaxell
  0 siblings, 0 replies; 15+ messages in thread
From: Zygo Blaxell @ 2021-01-07 22:11 UTC (permalink / raw)
  To: Cedric.dewijs; +Cc: Qu Wenruo, linux-btrfs

On Tue, Jan 05, 2021 at 07:19:14PM +0100,   wrote:
> >> I was expecting btrfs to do almost all reads from the fast SSD, as both
> the data and the metadata is on that drive, so the slow hdd is only really
> needed when there's a bitflip on the SSD, and the data has to be reconstructed.
> 
> > IIRC there will be some read policy feature to do that, but not yet
> > merged, and even merged, you still need to manually specify the
> > priority, as there is no way for btrfs to know which driver is faster
> > (except the non-rotational bit, which is not reliable at all).
> 
> Manually specifying the priority drive would be a big step in the
> right direction. Maybe btrfs could get a routine that benchmarks
> the sequential and random read and write speed of the drives at (for
> instance) mount time, or triggered by an administrator? This could lead
> to misleading results if btrfs doesn't get the whole drive to itself.
>
>
> >> Writing has to be done to both drives of course, but I don't expect
> slowdowns from that, as the system RAM should cache that.
> 
> >Write can still slow down the system even you have tons of memory.
> >Operations like fsync() or sync() will still wait for the writeback,
> >thus in your case, it will also be slowed by the HDD no matter what.
> 
> >In fact, in real world desktop, most of the writes are from sometimes
> >unnecessary fsync().
> 
> >To get rid of such slow down, you have to go dangerous by disabling
> >barrier, which is never a safe idea.
> 
> I suggest a middle ground, where btrfs returns from fsync when one of
> the copies (instead of all the copies) of the data has been written
> completely to disk. This poses a small data risk, as this creates
> moments that there's only one copy of the data on disk, while the
> software above btrfs thinks all data is written on two disks. one
> problem I see if the server is told to shut down while there's a big
> backlog of data to be written to the slow drive, while the big drive
> is already done. Then the server could cut the power while the slow
> drive is still being written.

The tricky thing here is that kernel memory management for the filesystem
is tied to transaction commits.  During a commit we have to finish the
flush to _every_ disk before any more writes can proceed, because the
transaction is holding locks that prevent memory from being modified or
freed until all the writes we are going to do are done.

If you have _short_ bursts of writes, you could put the spinning disk
behind a dedicated block-layer write-barrier-preserving RAM cache device.
btrfs would dump its writes into this cache and be able to complete
transaction commit, reducing latency as long as the write cache isn't
full (at that point writes must block).  btrfs could be modified to
implement such a cache layer itself, but the gains would be modest
compared to having an external block device do it (it saves some CPU
copies, but the worst-case memory requirement is the same) and in the
best case the data loss risks are identical.

As long as the block-layer cache strictly preserves write order barriers,
you can recover from most crashes by doing a btrfs scrub with both
disks online--the filesystem would behave as if the spinning disk was
just really unreliable and corrupting data a lot.

The crashes you can't recover from are csum collisions, where old and
new data have the same csum and you can't tell whether you have current
or stale data on the spinning disk.  You'll have silently corrupted data
at the hash collision rate.  If your crashes are infrequent enough you
can probably get away with this for a long time, or you can use blake2b
or sha256 csums which make collisions effectively impossible.
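
For reference, the checksum algorithm can only be chosen at mkfs time; with a
reasonably recent btrfs-progs and kernel (roughly 5.5 or later) that would be
something like this for the original two-device setup (a sketch, untested here):

  mkfs.btrfs -f --csum blake2 -d raid1 -m raid1 /dev/sda1 /dev/sdb1

(sha256 and xxhash are the other supported alternatives to the default crc32c.)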

If the SSD dies and you only have the spinning disk left, the filesystem
will appear to roll back in time some number of transactions, but that
only happens when the SSD fails, so it might be an event rare enough to
be tolerable for some use cases.

If there is a crash which prevents some writes from reaching the spinning
disk, and after the crash, the SSD fails before a scrub can be completed,
then the filesystem on the spinning disk will not be usable.  The spinning
disk will have a discontiguous metadata write history and there is no
longer an intact copy on the SSD to recover from, so the filesystem will
have to be rebuilt by brute force scan of the surviving metadata leaf
pages (i.e. btrfs check --repair --init-extent-tree, with no guarantee
of success).

lvmcache(7) documents something called "dm-writecache" which would
theoretically provide the required write stream caching functionality,
but I've never seen it work.
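
For completeness, the invocation described in lvmcache(7) looks roughly like
this (untested, device names hypothetical):

  vgcreate vg0 /dev/sdb1 /dev/sda2            # slow HDD partition + spare SSD partition
  lvcreate -n slow -l 100%PVS vg0 /dev/sdb1   # data LV on the HDD
  lvcreate -n fast -L 20G vg0 /dev/sda2       # cache volume on the SSD
  lvconvert --type writecache --cachevol fast vg0/slow

vg0/slow would then stand in for the raw spinning disk; whether it preserves
write ordering well enough in practice is exactly the open question above.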

If you have continuous writes then this idea doesn't work--latency will
degrade to at _least_ as slow as the spinning disk once the cache fills
up, since it's not allowed to elide any writes in the stream.

Alternatively, btrfs could be modified to implement transaction pipelining
that might be smart enough to optimize redundant writes away (i.e. don't
bother writing out a file to the spinner that is already deleted on the
SSD).  That's not a trivial change, though--better to think of it as
writing a new filesystem with a btrfs-compatible on-disk format, than
as an extension of the current filesystem code.

> i think this setting should be given to the system administrator,
> it's not a good idea to just blindly enable this behavior.

Definitely not a good idea to do it blindly.  There are novel failure
modes leading to full filesystem loss in this configuration.

> >> Is there a way to tell btrfs to leave the slow hdd alone, and to
> prioritize the SSD?
> 
> > Not in upstream kernel for now.
> 
> > Thus I guess you need something like bcache to do this.
> 
> Agreed. However, one of the problems of bcache, it that it can't use 2 SSD's in mirrored mode to form a writeback cache in front of many spindles, so this structure is impossible:
> +-----------------------------------------------------------+--------------+--------------+
> |                               btrfs raid 1 (2 copies) /mnt                              |
> +--------------+--------------+--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 | /dev/bcache4 | /dev/bcache5 |
> +--------------+--------------+--------------+--------------+--------------+--------------+
> |                          Write Cache (2xSSD in raid 1, mirrored)                        |
> |                                 /dev/sda2 and /dev/sda3                                 |
> +--------------+--------------+--------------+--------------+--------------+--------------+
> | Data         | Data         | Data         | Data         | Data         | Data         |
> | /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   | /dev/sda13   | /dev/sda14   |
> +--------------+--------------+--------------+--------------+--------------+--------------+
> 
> In order to get a system that has no data loss if a drive fails,  the user either has to live with only a read cache, or the user has to put a separate writeback cache in front of each spindle like this:
> +-----------------------------------------------------------+
> |                btrfs raid 1 (2 copies) /mnt               |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> | Write Cache  | Write Cache  | Write Cache  | Write Cache  |
> |(Flash Drive) |(Flash Drive) |(Flash Drive) |(Flash Drive) |
> | /dev/sda5    | /dev/sda6    | /dev/sda7    | /dev/sda8    |
> +--------------+--------------+--------------+--------------+
> | Data         | Data         | Data         | Data         |
> | /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
> +--------------+--------------+--------------+--------------+
> 
> In the mainline kernel is's impossible to put a bcache on top of a bcache, so a user does not have the option to have 4 small write caches below one fast, big read cache like this:
> +-----------------------------------------------------------+
> |                btrfs raid 1 (2 copies) /mnt               |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache4 | /dev/bcache5 | /dev/bcache6 | /dev/bcache7 |
> +--------------+--------------+--------------++-------------+
> |                      Read Cache (SSD)                     |
> |                        /dev/sda4                          |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> | Write Cache  | Write Cache  | Write Cache  | Write Cache  |
> |(Flash Drive) |(Flash Drive) |(Flash Drive) |(Flash Drive) |
> | /dev/sda5    | /dev/sda6    | /dev/sda7    | /dev/sda8    |
> +--------------+--------------+--------------+--------------+
> | Data         | Data         | Data         | Data         |
> | /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
> +--------------+--------------+--------------+--------------+
> 
> >Thanks,
> >Qu
> 
> Thank you,
> Cedric
> 
> 

* Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-05  6:39 Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?  
  2021-01-05  6:53 ` Qu Wenruo
@ 2021-01-08  8:16 ` Andrea Gelmini
  2021-01-08  8:36   `  
  1 sibling, 1 reply; 15+ messages in thread
From: Andrea Gelmini @ 2021-01-08  8:16 UTC (permalink / raw)
  To: Cedric.dewijs; +Cc: Linux BTRFS

On Tue, 5 Jan 2021 at 07:44, <Cedric.dewijs@eclipso.eu> wrote:
>
> Is there a way to tell btrfs to leave the slow hdd alone, and to prioritize the SSD?

You can use mdadm to do this (I've been using this feature for years in
setups where I have to fall back to USB disks for whatever reason).

From manpage:

       -W, --write-mostly
              subsequent devices listed in a --build, --create, or --add
              command will be flagged as 'write-mostly'.  This is valid
              for RAID1 only and means that the 'md' driver will avoid
              reading from these devices if at all possible.  This can
              be useful if mirroring over a slow link.

       --write-behind=
              Specify that write-behind mode should be enabled (valid
              for RAID1 only).  If an argument is specified, it will
              set the maximum number of outstanding writes allowed.
              The default value is 256.  A write-intent bitmap is
              required in order to use write-behind mode, and
              write-behind is only attempted on drives marked as
              write-mostly.

So you can do this (be careful, this wipes your data):

mdadm --create --verbose --assume-clean /dev/md0 --level=1 \
      --raid-devices=2 /dev/sda1 --write-mostly /dev/sdb1

Then you use BTRFS on top of /dev/md0, after mkfs.btrfs, of course.
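
If you also want the write-behind behaviour from the manpage excerpt above,
the write-intent bitmap has to be added as well; an untested variant:

mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
      --bitmap=internal --write-behind=256 \
      /dev/sda1 --write-mostly /dev/sdb1
mkfs.btrfs /dev/md0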

Ciao,
Gelma


* Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-08  8:16 ` Andrea Gelmini
@ 2021-01-08  8:36   `  
  2021-01-08 14:00     ` Zygo Blaxell
  2021-01-08 19:29     ` Andrea Gelmini
  0 siblings, 2 replies; 15+ messages in thread
From:   @ 2021-01-08  8:36 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: linux-btrfs


--- Original message ---
From: Andrea Gelmini <andrea.gelmini@gmail.com>
Date: 08.01.2021 09:16:26
To: Cedric.dewijs@eclipso.eu
Subject: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?

On Tue, 5 Jan 2021 at 07:44, <Cedric.dewijs@eclipso.eu> wrote:
>
> Is there a way to tell btrfs to leave the slow hdd alone, and to prioritize
the SSD?

You can use mdadm to do this (I'm using this feature since years in
setup where I have to fallback on USB disks for any reason).

From manpage:

       -W, --write-mostly
              subsequent  devices  listed in a --build, --create, or
--add command will be flagged as 'write-mostly'.  This is valid for
              RAID1 only and means that the 'md' driver will avoid
reading from these devices if at all possible.  This can be useful if
              mirroring over a slow link.

       --write-behind=
              Specify  that  write-behind  mode  should be enabled
(valid for RAID1 only).  If an argument is specified, it will set the
              maximum number of outstanding writes allowed.  The
default value is 256.  A write-intent bitmap is required  in  order
to
              use write-behind mode, and write-behind is only
attempted on drives marked as write-mostly.

So you can do this:
(be carefull, this wipe your data)

mdadm --create --verbose --assume-clean /dev/md0 --level=1
--raid-devices=2 /dev/sda1 --write-mostly /dev/sdb1

Then you use BTRFS on top of /dev/md0, after mkfs.btrfs, of course.

Ciao,
Gelma

Thanks Gelma.

What happens when I poison one of the drives in the mdadm array using this command? Will all data come out OK?
dd if=/dev/urandom of=/dev/sdb1 bs=1M count=100

When I do this test on a plain btrfs raid 1 with 2 drives, all the data comes out OK (while generating a lot of messages about correcting data in dmesg -w)
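
For reference, a minimal loop-device version of that btrfs-only test looks
roughly like this (paths are hypothetical, and the corruption is offset so
the superblocks at 64KiB and 64MiB survive):

truncate -s 5G img1 img2
losetup /dev/loop1 img1
losetup /dev/loop2 img2
mkfs.btrfs -f -d raid1 -m raid1 /dev/loop1 /dev/loop2
mkdir -p /mnt/test
mount /dev/loop1 /mnt/test
cp -a /usr/share/doc /mnt/test/
sync; echo 3 > /proc/sys/vm/drop_caches
# corrupt 1GiB of one mirror, starting at the 128MiB offset
dd if=/dev/urandom of=/dev/loop2 bs=1M seek=128 count=1024
# scrub repairs any affected blocks from the intact mirror
btrfs scrub start -B /mnt/test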

Cheers,
Cedric


* Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-08  8:36   `  
@ 2021-01-08 14:00     ` Zygo Blaxell
  2021-01-08 19:29     ` Andrea Gelmini
  1 sibling, 0 replies; 15+ messages in thread
From: Zygo Blaxell @ 2021-01-08 14:00 UTC (permalink / raw)
  To: Cedric.dewijs; +Cc: Andrea Gelmini, linux-btrfs

On Fri, Jan 08, 2021 at 09:36:13AM +0100,   wrote:
> 
> --- Original message ---
> From: Andrea Gelmini <andrea.gelmini@gmail.com>
> Date: 08.01.2021 09:16:26
> To: Cedric.dewijs@eclipso.eu
> Subject: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
> 
> On Tue, 5 Jan 2021 at 07:44, <Cedric.dewijs@eclipso.eu> wrote:
> >
> > Is there a way to tell btrfs to leave the slow hdd alone, and to prioritize
> the SSD?
> 
> You can use mdadm to do this (I'm using this feature since years in
> setup where I have to fallback on USB disks for any reason).
> 
> From manpage:
> 
>        -W, --write-mostly
>               subsequent  devices  listed in a --build, --create, or
> --add command will be flagged as 'write-mostly'.  This is valid for
>               RAID1 only and means that the 'md' driver will avoid
> reading from these devices if at all possible.  This can be useful if
>               mirroring over a slow link.
> 
>        --write-behind=
>               Specify  that  write-behind  mode  should be enabled
> (valid for RAID1 only).  If an argument is specified, it will set the
>               maximum number of outstanding writes allowed.  The
> default value is 256.  A write-intent bitmap is required  in  order
> to
>               use write-behind mode, and write-behind is only
> attempted on drives marked as write-mostly.
> 
> So you can do this:
> (be carefull, this wipe your data)
> 
> mdadm --create --verbose --assume-clean /dev/md0 --level=1
> --raid-devices=2 /dev/sda1 --write-mostly /dev/sdb1
> 
> Then you use BTRFS on top of /dev/md0, after mkfs.btrfs, of course.
> 
> Ciao,
> Gelma
> 
> Thanks Gelma.
> 
> What happens when I poison one of the drives in the mdadm array using
> this command? Will all data come out OK?
> dd if=/dev/urandom of=/dev/dev/sdb1 bs=1M count = 100?

mdadm doesn't handle data corruption: (except for a /sys counter) it
reads from mirror devices interchangeably, and silently propagates
data between devices during resync, so the array will almost certainly
be destroyed.
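
That counter is presumably the mismatch count reported after a manual check;
a sketch of how you would look at it (md0 as in the example above):

echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt

A nonzero count only tells you that the mirrors differ, not which copy is
correct.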

> When I do this test on a plain btrfs raid 1 with 2 drives, all the data
> comes out OK (while generating a lot of messages about correcting data
> in dmesg -w)
> 
> Cheers,
> Cedric
> 

* Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-08  8:36   `  
  2021-01-08 14:00     ` Zygo Blaxell
@ 2021-01-08 19:29     ` Andrea Gelmini
  2021-01-09 21:40       ` Zygo Blaxell
  1 sibling, 1 reply; 15+ messages in thread
From: Andrea Gelmini @ 2021-01-08 19:29 UTC (permalink / raw)
  To: Cedric.dewijs; +Cc: Linux BTRFS

On Fri, 8 Jan 2021 at 09:36, <Cedric.dewijs@eclipso.eu> wrote:
> What happens when I poison one of the drives in the mdadm array using this command? Will all data come out OK?
> dd if=/dev/urandom of=/dev/dev/sdb1 bs=1M count = 100?

<smiling>
Well, the same thing that happens when your laptop is stolen or you read
"open_ctree failed"... you restore a backup...
</smiling>

I have a few ideas, but it's much quicker to just try it. Let's see:

truncate -s 5G dev1
truncate -s 5G dev2
losetup /dev/loop31 dev1
losetup /dev/loop32 dev2
mdadm --create --verbose --assume-clean /dev/md0 --level=1 \
      --raid-devices=2 /dev/loop31 --write-mostly /dev/loop32
mkfs.btrfs /dev/md0
mount -o compress=lzo /dev/md0 /mnt/sg10/
cd /mnt/sg10/
cp -af /home/gelma/dev/kernel/ .
root@glet:/mnt/sg10# dmesg -T
[Fri Jan  8 19:51:33 2021] md/raid1:md0: active with 2 out of 2 mirrors
[Fri Jan  8 19:51:33 2021] md0: detected capacity change from 0 to 5363466240
[Fri Jan  8 19:51:53 2021] BTRFS: device fsid
2fe43610-20e5-48de-873d-d1a6c2db2a6a devid 1 transid 5 /dev/md0
scanned by mkfs.btrfs (512004)
[Fri Jan  8 19:51:53 2021] md: data-check of RAID array md0
[Fri Jan  8 19:52:19 2021] md: md0: data-check done.
[Fri Jan  8 19:53:13 2021] BTRFS info (device md0): setting incompat
feature flag for COMPRESS_LZO (0x8)
[Fri Jan  8 19:53:13 2021] BTRFS info (device md0): use lzo compression, level 0
[Fri Jan  8 19:53:13 2021] BTRFS info (device md0): disk space caching
is enabled
[Fri Jan  8 19:53:13 2021] BTRFS info (device md0): has skinny extents
[Fri Jan  8 19:53:13 2021] BTRFS info (device md0): flagging fs with
big metadata feature
[Fri Jan  8 19:53:13 2021] BTRFS info (device md0): enabling ssd optimizations
[Fri Jan  8 19:53:13 2021] BTRFS info (device md0): checking UUID tree

root@glet:/mnt/sg10# btrfs scrub start -B .
scrub done for 2fe43610-20e5-48de-873d-d1a6c2db2a6a
Scrub started:    Fri Jan  8 20:01:59 2021
Status:           finished
Duration:         0:00:04
Total to scrub:   4.99GiB
Rate:             1.23GiB/s
Error summary:    no errors found

We check the array is in sync:

root@glet:/mnt/sg10# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md0 : active raid1 loop32[1](W) loop31[0]
     5237760 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Now we wipe the second device:
root@glet:/mnt/sg10# dd if=/dev/urandom of=/dev/loop32 bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.919025 s, 114 MB/s

sync

echo 3 > /proc/sys/vm/drop_caches

I do rm to force write i/o:

root@glet:/mnt/sg10# rm kernel/v5.11/ -rf

root@glet:/mnt/sg10# btrfs scrub start -B .
scrub done for 2fe43610-20e5-48de-873d-d1a6c2db2a6a
Scrub started:    Fri Jan  8 20:11:21 2021
Status:           finished
Duration:         0:00:03
Total to scrub:   4.77GiB
Rate:             1.54GiB/s
Error summary:    no errors found

Now I stop the array and re-assemble it:
mdadm -Ss

root@glet:/# mdadm --assemble /dev/md0 /dev/loop31 /dev/loop32
mdadm: /dev/md0 has been started with 2 drives.

root@glet:/# mount /dev/md0 /mnt/sg10/
root@glet:/# btrfs scrub start -B  /mnt/sg10/
scrub done for 2fe43610-20e5-48de-873d-d1a6c2db2a6a
Scrub started:    Fri Jan  8 20:15:16 2021
Status:           finished
Duration:         0:00:03
Total to scrub:   4.77GiB
Rate:             1.54GiB/s
Error summary:    no errors found

Ciao,
Gelma


* Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-08 19:29     ` Andrea Gelmini
@ 2021-01-09 21:40       ` Zygo Blaxell
  2021-01-10  9:00         ` Andrea Gelmini
  0 siblings, 1 reply; 15+ messages in thread
From: Zygo Blaxell @ 2021-01-09 21:40 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: Cedric.dewijs, Linux BTRFS

On Fri, Jan 08, 2021 at 08:29:45PM +0100, Andrea Gelmini wrote:
> On Fri, 8 Jan 2021 at 09:36, <Cedric.dewijs@eclipso.eu> wrote:
> > What happens when I poison one of the drives in the mdadm array using this command? Will all data come out OK?
> > dd if=/dev/urandom of=/dev/dev/sdb1 bs=1M count = 100?
> 
> <smiling>
> Well, (happens) the same thing when your laptop is stolen or you read
> "open_ctree failed"...You restore backup...
> </smiling>
> 
> I have a few idea, but it's much more quicker to try it. Let's see:
> 
> truncate -s 5G dev1
> truncate -s 5G dev2
> losetup /dev/loop31 dev1
> losetup /dev/loop32 dev2
> mdadm --create --verbose --assume-clean /dev/md0 --level=1
> --raid-devices=2 /dev/loop31 --write-mostly /dev/loop32

Note that with --write-mostly here, total filesystem loss is no longer
random: mdadm will always pick loop31 over loop32 while loop31 exists.

> mkfs.btrfs /dev/md0
> mount -o compress=lzo /dev/md0 /mnt/sg10/
> cd /mnt/sg10/
> cp -af /home/gelma/dev/kernel/ .
> root@glet:/mnt/sg10# dmesg -T
> [Fri Jan  8 19:51:33 2021] md/raid1:md0: active with 2 out of 2 mirrors
> [Fri Jan  8 19:51:33 2021] md0: detected capacity change from 0 to 5363466240
> [Fri Jan  8 19:51:53 2021] BTRFS: device fsid
> 2fe43610-20e5-48de-873d-d1a6c2db2a6a devid 1 transid 5 /dev/md0
> scanned by mkfs.btrfs (512004)
> [Fri Jan  8 19:51:53 2021] md: data-check of RAID array md0
> [Fri Jan  8 19:52:19 2021] md: md0: data-check done.
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): setting incompat
> feature flag for COMPRESS_LZO (0x8)
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): use lzo compression, level 0
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): disk space caching
> is enabled
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): has skinny extents
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): flagging fs with
> big metadata feature
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): enabling ssd optimizations
> [Fri Jan  8 19:53:13 2021] BTRFS info (device md0): checking UUID tree
> 
> root@glet:/mnt/sg10# btrfs scrub start -B .
> scrub done for 2fe43610-20e5-48de-873d-d1a6c2db2a6a
> Scrub started:    Fri Jan  8 20:01:59 2021
> Status:           finished
> Duration:         0:00:04
> Total to scrub:   4.99GiB
> Rate:             1.23GiB/s
> Error summary:    no errors found
> 
> We check the array is in sync:
> 
> root@glet:/mnt/sg10# cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
> [raid4] [raid10]
> md0 : active raid1 loop32[1](W) loop31[0]
>      5237760 blocks super 1.2 [2/2] [UU]

You used --assume-clean and haven't told mdadm otherwise since, so this
test didn't provide any information.

On real disks an mdadm integrity check at this point would fail very hard,
since the devices have never been synced (unless they are both blank devices
filled with the same formatting test pattern or zeros).

> unused devices: <none>
> 
> Now we wipe the storage;
> root@glet:/mnt/sg10# dd if=/dev/urandom of=/dev/loop32 bs=1M count=100

With --write-mostly, the above deterministically works, and

	dd if=/dev/urandom of=/dev/loop31 bs=1M count=100

deterministically damages or destroys the filesystem.

With real disk failures you don't get to pick which drive is corrupted
or when.  If it's the remote drive, you have no backup and have no way
to _know_ you have no backup.  If it's the local drive, you can recover
it if you read from the backup in time; otherwise, you lose your data
permanently on the next mdadm resync.

> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB, 100 MiB) copied, 0.919025 s, 114 MB/s
> 
> sync
> 
> echo 3 > /proc/sys/vm/drop_caches
> 
> I do rm to force write i/o:
> 
> root@glet:/mnt/sg10# rm kernel/v5.11/ -rf
> 
> root@glet:/mnt/sg10# btrfs scrub start -B .
> scrub done for 2fe43610-20e5-48de-873d-d1a6c2db2a6a
> Scrub started:    Fri Jan  8 20:11:21 2021
> Status:           finished
> Duration:         0:00:03
> Total to scrub:   4.77GiB
> Rate:             1.54GiB/s
> Error summary:    no errors found

This scrub will never detect corruption on the remote filesystem because
of --write-mostly, so you have no way to know whether it has bitrotted
away (or is just missing a whole lot of updates).

> Now, I stop the array and re-assembly:
> mdadm -Ss
> 
> root@glet:/# mdadm --assemble /dev/md0 /dev/loop31 /dev/loop32
> mdadm: /dev/md0 has been started with 2 drives.
> 
> root@glet:/# mount /dev/md0 /mnt/sg10/
> root@glet:/# btrfs scrub start -B  /mnt/sg10/
> scrub done for 2fe43610-20e5-48de-873d-d1a6c2db2a6a
> Scrub started:    Fri Jan  8 20:15:16 2021
> Status:           finished
> Duration:         0:00:03
> Total to scrub:   4.77GiB
> Rate:             1.54GiB/s
> Error summary:    no errors found
> 
> Ciao,
> Gelma


* Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-09 21:40       ` Zygo Blaxell
@ 2021-01-10  9:00         ` Andrea Gelmini
  2021-01-16  1:04           ` Zygo Blaxell
  0 siblings, 1 reply; 15+ messages in thread
From: Andrea Gelmini @ 2021-01-10  9:00 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Cedric.dewijs, Linux BTRFS

On Sat, 9 Jan 2021 at 22:40, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Fri, Jan 08, 2021 at 08:29:45PM +0100, Andrea Gelmini wrote:
> > On Fri, 8 Jan 2021 at 09:36, <Cedric.dewijs@eclipso.eu> wrote:
> > > What happens when I poison one of the drives in the mdadm array using this command? Will all data come out OK?
> > > dd if=/dev/urandom of=/dev/dev/sdb1 bs=1M count = 100?
> You have used --assume-clean and didn't tell mdadm otherwise since,
> so this test didn't provide any information.

I know mdadm; no need for your explanation.

"--assume-clean" is used on purpose because:
a) the two devices are already identical;
b) no need two sync something (even if they were random filled), that
are going to be formatted and data filled, so - more or less - each
block is rewritten.

> On real disks a mdadm integrity check at this point fail very hard since
> the devices have never been synced (unless they are both blank devices
> filled with the same formatting test pattern or zeros).

I disagree. My point is: who cares about blocks never touched by the filesystem?

> > root@glet:/mnt/sg10# dd if=/dev/urandom of=/dev/loop32 bs=1M count=100
>
> With --write-mostly, the above deterministically works, and
>
>         dd if=/dev/urandom of=/dev/loop31 bs=1M count=100
>
> deterministically damages or destroys the filesystem.

My friend, read the question: he asked what happens if you poison the
second device.
Of course, if you poison /dev/md0 or the main device, what else can
happen in such a situation?
Thank god you told us, because we are all so stupid!

My point of view is: you can use mdadm to defend against the real-world
scenario (the first hard drive dies, the second slow one carries on, and you
have all your data up to date; and if you are afraid of bit-rotted data, you
have the btrfs checksums).
Also, even if the second/slow hard drive is a few seconds out of sync, it
would look as if it had been unplugged while working.
All the cool features of BTRFS (transactions, checksums, dup btrees and so
on) will recover the filesystem and do the rest, won't they?

Thinking about "what if I trick my system here and there" is
absolutely fun, but no real use case, for me.

What if I expose BTRFS devices to cosmic rays and everything is wiped out?

(I know, my only hero Qu is already preparing a patch - as usual -
while others start to write poems...)

Don't take it personally and smile,
Gelma


* Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-10  9:00         ` Andrea Gelmini
@ 2021-01-16  1:04           ` Zygo Blaxell
  2021-01-16 15:27             `  
  0 siblings, 1 reply; 15+ messages in thread
From: Zygo Blaxell @ 2021-01-16  1:04 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: Cedric.dewijs, Linux BTRFS

On Sun, Jan 10, 2021 at 10:00:01AM +0100, Andrea Gelmini wrote:
> On Sat, 9 Jan 2021 at 22:40, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Fri, Jan 08, 2021 at 08:29:45PM +0100, Andrea Gelmini wrote:
> > > On Fri, 8 Jan 2021 at 09:36, <Cedric.dewijs@eclipso.eu> wrote:
> > > > What happens when I poison one of the drives in the mdadm array using this command? Will all data come out OK?
> > > > dd if=/dev/urandom of=/dev/dev/sdb1 bs=1M count = 100?
> > You have used --assume-clean and didn't tell mdadm otherwise since,
> > so this test didn't provide any information.
> 
> I know mdadm, no need of your explanation.
> 
> "--assume-clean" is used on purpose because:
> a) the two devices are already identical;
> b) no need two sync something (even if they were random filled), that
> are going to be formatted and data filled, so - more or less - each
> block is rewritten.
> 
> > On real disks a mdadm integrity check at this point fail very hard since
> > the devices have never been synced (unless they are both blank devices
> > filled with the same formatting test pattern or zeros).
> 
> I disagree. My point is: who cares about blocks never touched by the filesystem?

If you don't do the sync, you don't get even mdadm's weak corruption
detection--it will present an overwhelming number of false positives.

Of course mdadm can only tell you the two devices are different--it
can't tell you which one (if any) is correct.

> > > root@glet:/mnt/sg10# dd if=/dev/urandom of=/dev/loop32 bs=1M count=100
> >
> > With --write-mostly, the above deterministically works, and
> >
> >         dd if=/dev/urandom of=/dev/loop31 bs=1M count=100
> >
> > deterministically damages or destroys the filesystem.
> 
> My friend, read the question, he asked about what happens if you
> poison the second device.

Strictly correct, but somewhere between unhelpful and misleading.

The motivation behind the question is clearly to compare mdadm and btrfs
capabilities wrt recovery from data poisoning.  He presented an example
of an experiment performed on btrfs demonstrating that capability.

He neglected to ask separate questions about poisoning both individual
drives, but you also neglected to point this omission out.

> Of course if you poison /dev/md0 or the main device what else can
> happen, in such situation?

What else can happen is that btrfs raid1 reconstructs the data on a
damaged device from the other non-damaged device, regardless of which
device is damaged.  btrfs can verify the contents of each device and
identify which copy (if any) is correct.  mdadm cannot.

> Thanks god you told us, because we are all so much stupid!

Well, the rule of thumb is "never use mdadm for any use case where you can
use btrfs instead," and answering yet another mdadm-vs-btrfs FAQ normally
wouldn't be interesting; however, this setup used mdadm --write-mostly
which is relatively unusual around here (and makes the posting appear
under a different set of search terms), so I wrote a short reply to
Cedric's question as it was not a question I had seen answered before.

Hours later, you presented an alternative analysis that contained the
results of an experiment that demonstrated mdadm success under specific
conditions arising from a very literal interpretation of the question.
I had to reread the mdadm kernel sources to remind myself how it could
work, because it's a corner case outcome that I've never observed or
even heard third-party reports of for mdadm in the wild.  That was a
fun puzzle to solve, so I wanted to write about it.

Your setup has significant data loss risks that you did not mention.
A user who was not very familiar with both btrfs and mdadm, but who
stumbled across this thread during a search of the archives, might not
understand the limited scope of your experimental setup and why it did
not detect or cause any of the expected data loss failures.  This user
might, upon reading your posting, incorrectly conclude that this setup
is a viable way to store data with btrfs self-repair capabilities intact.
Those were significant omissions, a second reason to write about them.

Initially I thought you accidentally omitted these details due to
inexperience or inattention, but as you say, you know mdadm.  If that's
true, your omissions were intentional.  That information is also
potentially not obvious for someone reading this thread in the archives,
so I am writing about it too.

> My point of view is: you can use mdadm to defend against the real-world
> scenario (the first hard drive dies,
> the second slow one carries on, and you have all your data up to date,
> and if you are afraid of
> bit-rotted data, you have btrfs checksums).
> Also, even if the second/slow hard drive is out of sync by seconds, it
> would just look as if it had been unplugged while working.
> All the cool features of BTRFS (transactions, checksums, dup btree and so
> on) will recover the filesystem and do the rest, won't they?

No.  That was the point.  btrfs cannot fix mdadm if mdadm trashes both
copies of the data underneath.  The btrfs self-repair features only work
when btrfs has access to both mirror copies.  On top of mdadm, btrfs
can only help you identify the data you have already lost.  btrfs is
especially picky about its metadata--if even one bit of metadata is
unrecoverably lost, the filesystem can only be repaired by brute force
means (btrfs check --repair --init-extent-tree, or mkfs and start over).

This branch of the thread is a little over-focused on the bitrot case.
The somewhat more significant problem with this setup is that write
ordering rules are violated on sdb during resync with sda.  During resync
there may be no complete and intact copy of btrfs metadata on sdb, so
if sda fails at that time, the filesystem may be severely damaged or
unrecoverable, even if there is no bitrot at all.  Depending on write
workloads and disk speeds, the window of vulnerability may last for
minutes to hours.

That's a pretty common failure mode--a lot of drives don't die all
at once, especially at the low end of the market.  Instead, they have
reliability problems--the kind that will force a reboot and mdadm resync,
but not kick the device out of the array--just before they completely die.

If you are using mobile or desktop drives, the second most common 5-year
outcome from this setup will be filesystem loss.  The most common outcome
is that you don't have a disk failure on sda, so you don't need the second
disk at all.  Failures that mdadm can recover from are 3rd most likely,
followed by failures neither mdadm nor btrfs can recover from.

If you are using NAS or enterprise drives (and they're not relabeled
desktop drives inside), then they are unlikely to throw anything at mdadm
that mdadm cannot handle; however, even the best drives break sometimes,
and this puts you in one of the other probability scenarios.

If you are using cheap SSDs then the probabilities flip because they
do not report errors that mdadm can detect, and they tend to trigger
more mdadm resyncs on average due to firmware issues and general
low reliability.  Partial data loss (that would have been recoverable
with btrfs raid1) is the most likely outcome, unrecoverable filesystem
failure is 2nd.  Keeping your data for 5 years moves down to the 3rd
most likely outcome.

Silent data corruption seems to be far more common on SSDs than HDDs, so
the risk of unrecoverable corruption is highest when mdadm --write-mostly
is used with the HDD as component disk #1 and the SSD as disk #0.

btrfs raid1 recovers from all of these cases except 2-drive same-sector
failures and host RAM failure.

Also, your stated use case is slightly different from Cedric's:

> You can use mdadm to do this (I've been using this feature for years in
> setups where I have to fall back on USB disks for any reason).

Here, you are mirroring a relatively reliable disk onto one or more
removable USB devices.  Depending on the details, this setup can have
a different failure profile than the SSD/HDD use case.  If you rotate
multiple USB devices so that you always have one offline intact mirror of
the filesystem, then if there is a failure during resync and the online
USB disk has a destroyed filesystem, the offline disk will still be
intact and you can still use it to recover, but it will have older data.
In the SSD/HDD use case, there is only one mirror pair, so there is no
isolated copy that could survive a failure during resync.

The corruption propagation between drives is different too--if you can
run a btrfs scrub, and detect corruption on the primary drive in time,
you can retrieve the missing data from the offline mirror.  With only
two online drives, it's a challenge to access the intact mirror (you
have to ensure mdadm never resyncs, and use btrfs restore directly on
the mirror device).
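
(A rough sketch of that last step, assuming the intact mirror is a USB
component at /dev/sdc1 -- names are placeholders -- and assuming the md
metadata format leaves the btrfs superblock readable at the start of the
partition; otherwise you'd need a loop device with the right offset:

  # list what would be recovered, without writing anything
  btrfs restore -D -v /dev/sdc1 /mnt/recovery
  # then actually copy the file data out to a scratch directory
  btrfs restore -v /dev/sdc1 /mnt/recovery

btrfs restore reads the filesystem offline, so it won't trigger an mdadm
resync by itself; the risk is something else assembling the array first.)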

> Thinking about "what if I trick my system here and there" is
> absolutely fun, but no real use case, for me.

It is a real use case for other people.  Disks do silently corrupt data
(especially low end SSDs), so it's reasonable to expect a modern raid1
implementation to be able to recover from damage to either drive, and
reasonable to test that capability by intentionally wiping one of them.
Just remember to test wiping the other one too.
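
(For anyone who wants to repeat that double-ended test without risking real
disks, a sketch on loop devices.  Names, sizes and offsets are placeholders;
the seek offset is chosen so the btrfs superblock copies at 64KiB and 64MiB
stay intact and the devices are still recognized at mount time:

  truncate -s 1G /tmp/img0 /tmp/img1
  losetup /dev/loop0 /tmp/img0
  losetup /dev/loop1 /tmp/img1
  mkdir -p /mnt/test
  mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1
  mount /dev/loop0 /mnt/test
  dd if=/dev/urandom of=/mnt/test/data bs=1M count=600    # fill with test data
  umount /mnt/test

  dd if=/dev/urandom of=/dev/loop1 bs=1M seek=128 count=100   # poison mirror 2
  mount /dev/loop0 /mnt/test
  btrfs scrub start -B /mnt/test     # repairs loop1 from loop0
  btrfs device stats /mnt/test
  umount /mnt/test

  dd if=/dev/urandom of=/dev/loop0 bs=1M seek=128 count=100   # now poison mirror 1
  mount /dev/loop1 /mnt/test
  btrfs scrub start -B /mnt/test     # repairs loop0 from loop1
  btrfs device stats /mnt/test

Each scrub should report and correct csum errors only on the device that
was just poisoned.)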

> What if I expose BTRFS devices to cosmic rays and everything is wiped out?
> 
> (I know, my only hero Qu is already preparing a patch - as usual -
> while others start to write poems...)
> 
> Don't take it personally and smile,
> Gelma

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-16  1:04           ` Zygo Blaxell
@ 2021-01-16 15:27             `  
  2021-01-18  0:45               ` Zygo Blaxell
  0 siblings, 1 reply; 15+ messages in thread
From:   @ 2021-01-16 15:27 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: andrea.gelmini, linux-btrfs


--- Original message ---
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Date: 16.01.2021 02:04:38
To: Andrea Gelmini <andrea.gelmini@gmail.com>
Subject: Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?

> On Sun, Jan 10, 2021 at 10:00:01AM +0100, Andrea Gelmini wrote:
> > On Sat, Jan 9, 2021 at 22:40, Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On Fri, Jan 08, 2021 at 08:29:45PM +0100, Andrea Gelmini wrote:
> > > > On Fri, Jan 8, 2021 at 09:36, <Cedric.dewijs@eclipso.eu> wrote:
> > > > > What happens when I poison one of the drives in the mdadm array using this command? Will all data come out OK?
> > > > > dd if=/dev/urandom of=/dev/sdb1 bs=1M count=100?
> > > You used --assume-clean and haven't told mdadm otherwise since then,
> > > so this test didn't provide any information.
> > 
> > I know mdadm, no need of your explanation.
> > 
> > "--assume-clean" is used on purpose because:
> > a) the two devices are already identical;
> > b) there is no need to sync something (even if they were randomly filled) that
> > is going to be formatted and filled with data, so - more or less - each
> > block is rewritten.
> > 
> > > On real disks an mdadm integrity check at this point fails very hard, since
> > > the devices have never been synced (unless they are both blank devices
> > > filled with the same formatting test pattern or zeros).
> > 
> > I disagree. My point is: who cares about blocks never touched by the filesystem?
> 
> If you don't do the sync, you don't get even mdadm's weak corruption
> detection--it will present an overwhelming number of false positives.
> 
> Of course mdadm can only tell you the two devices are different--it
> can't tell you which one (if any) is correct.

This tells me everything I need to know about mdadm. It can only save my data in case both drives originally had correct data, and one of the drives completely disappears. An mdadm raid-1 (mirror) array ironically increases the likelihood of data corruption. If one drive has a 1% chance of corrupting the data, 2 of those drives in mdadm raid 1 have a 2% chance of corrupting the data, and 100 of these drives in raid 1 will almost certainly corrupt the data. (This is a bit of an oversimplification, and statistically not entirely sound, but it gets my point across). For me this defeats the purpose of a raid system, as raid should increase the redundancy and resilience.

Would it be possible to add a checksum to the data in mdadm in much the same way btrfs is doing that, so it can also detect and even repair corruption on the block level? 


> > > > root@glet:/mnt/sg10# dd if=/dev/urandom of=/dev/loop32 bs=1M count=100
> > >
> > > With --write-mostly, the above deterministically works, and
> > >
> > >         dd if=/dev/urandom of=/dev/loop31 bs=1M count=100
> > >
> > > deterministically damages or destroys the filesystem.
> > 
> > My friend, read the question, he asked about what happens if you
> > poison the second device.
> 
> Strictly correct, but somewhere between unhelpful and misleading.
> 
> The motivation behind the question is clearly to compare mdadm and btrfs
> capabilities wrt recovery from data poisoning.  He presented an example
> of an experiment performed on btrfs demonstrating that capability.
> 
> He neglected to ask separate questions about poisoning both individual
> drives, but you also neglected to point this omission out.

I was trying to figure out if my data could survive if one of the drives (partially) failed. I only know of two ways to simulate this: physically disconnecting the drive, or dd-ing random data to it. I didn't state it in my original question, but I wanted to first poison one drive, then scrub the data, and then poison the next drive, and scrub again, until all drives have been poisoned and scrubbed. I have tested this with 2 sets of a backing drive and a writeback SSD cache. No matter which drive was poisoned, all the data survived although reconstructing the data was about 30x slower than reading correct data.

> > Of course if you poison /dev/md0 or the main device what else can
> > happen, in such situation?
> 
> What else can happen is that btrfs raid1 reconstructs the data on a
> damaged device from the other non-damaged device, regardless of which
> device is damaged.  btrfs can verify the contents of each device and
> identify which copy (if any) is correct.  mdadm cannot.

Yes, but only if btrfs has direct access to each drive separately. mdadm is hiding the individual drives from btrfs.

> > Thank god you told us, because we are all so stupid!

I am stupid, or at least not so deeply informed. I have only worked with multi-drive configurations for a few weeks now.

> Well, the rule of thumb is "never use mdadm for any use case where you can
> use btrfs instead,"

My rule of thumb is "never run mdadm". I don't see a use case where it increases the longevity of my data.

> and answering yet another mdadm-vs-btrfs FAQ normally
> wouldn't be interesting; however, this setup used mdadm --write-mostly
> which is relatively unusual around here (and makes the posting appear
> under a different set of search terms), so I wrote a short reply to
> Cedric's question as it was not a question I had seen answered before.

> Hours later, you presented an alternative analysis that contained the
> results of an experiment that demonstrated mdadm success under specific
> conditions arising from a very literal interpretation of the question.
> I had to reread the mdadm kernel sources to remind myself how it could
> work, because it's a corner case outcome that I've never observed or
> even heard third-party reports of for mdadm in the wild.  That was a
> fun puzzle to solve, so I wanted to write about it.
> 
> Your setup has significant data loss risks that you did not mention.
> A user who was not very familiar with both btrfs and mdadm, but who
> stumbled across this thread during a search of the archives, might not
> understand the limited scope of your experimental setup and why it did
> not detect or cause any of the expected data loss failures.  This user
> might, upon reading your posting, incorrectly conclude that this setup
> is a viable way to store data with btrfs self-repair capabilities intact.

I was looking for a way to give the drives of a btrfs filesystem a write cache, in such a way that a failure of a single drive could not result in data loss. As bcache does not and will not support multiple redundant SSDs as write cache [1], my plan was to put 2 identical SSDs in mdadm raid 1, as host for a bcache writeback cache for all the drives of the btrfs filesystem. See the figure below:
+-----------------------------------------------------------+
|          btrfs raid 1 (2 copies) /mnt                     |
+--------------+--------------+--------------+--------------+
| /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
+--------------+--------------+--------------+--------------+
| Mdadm Raid 1 mirrored Writeback Cache (SSD)               |
| /dev/md/name (containing /dev/sda3 and /dev/sda4)         |
+--------------+--------------+--------------+--------------+
| Data         | Data         | Data         | Data         |
| /dev/sda8    | /dev/sda9    | /dev/sda10   | /dev/sda11   |
+--------------+--------------+--------------+--------------+
This will not protect my data if one of the SSDs starts to return random data, as mdadm can't see which of the two SSDs is correct. This will also defeat the redundancy of btrfs, as all copies of the data that btrfs sees are coming from the SSD pair.
[1] https://lore.kernel.org/linux-bcache/e03dd593-14cb-b4a0-d68a-bd9b4fb8bd20@suse.de/T/#t

The only way I've come up with is to give each hard drive in the btrfs array its own writeback bcache, but that requires double the number of drives.
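
(For completeness, a sketch of that per-drive layout with two data drives
and two separate cache SSDs, so no single SSD failure can take out more
than one btrfs mirror.  All device names here are hypothetical:

  # one backing device plus one cache device per btrfs member, created and
  # attached in a single make-bcache invocation each, in writeback mode
  make-bcache -B /dev/sdc1 -C /dev/sde1 --writeback
  make-bcache -B /dev/sdd1 -C /dev/sdf1 --writeback

  # the resulting bcache devices (typically /dev/bcache0 and /dev/bcache1)
  # then become the btrfs raid1 members
  mkfs.btrfs -d raid1 -m raid1 /dev/bcache0 /dev/bcache1

Losing one SSD then degrades exactly one btrfs member, which btrfs raid1
can repair from the other member.)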


> Those were significant omissions, a second reason to write about them.
> 
> Initially I thought you accidentally omitted these details due to
> inexperience or inattention, but as you say, you know mdadm.  If that's
> true, your omissions were intentional.  That information is also
> potentially not obvious for someone reading this thread in the archives,
> so I am writing about it too.

> > My point of view is: you can use mdadm to defend against the real-world
> > scenario (the first hard drive dies,
> > the second slow one carries on, and you have all your data up to date,
> > and if you are afraid of
> > bit-rotted data, you have btrfs checksums).
> > Also, even if the second/slow hard drive is out of sync by seconds, it
> > would just look as if it had been unplugged while working.
> > All the cool features of BTRFS (transactions, checksums, dup btree and so
> > on) will recover the filesystem and do the rest, won't they?
> 
> No.  That was the point.  btrfs cannot fix mdadm if mdadm trashes both
> copies of the data underneath.  The btrfs self-repair features only work
> when btrfs has access to both mirror copies.  On top of mdadm, btrfs
> can only help you identify the data you have already lost.  btrfs is
> especially picky about its metadata--if even one bit of metadata is
> unrecoverably lost, the filesystem can only be repaired by brute force
> means (btrfs check --repair --init-extent-tree, or mkfs and start over).
> 
> This branch of the thread is a little over-focused on the bitrot case.
> The somewhat more significant problem with this setup is that write
> ordering rules are violated on sdb during resync with sda.  During resync
> there may be no complete and intact copy of btrfs metadata on sdb, so
> if sda fails at that time, the filesystem may be severely damaged or
> unrecoverable, even if there is no bitrot at all.  Depending on write
> workloads and disk speeds, the window of vulnerability may last for
> minutes to hours.
> 
> That's a pretty common failure mode--a lot of drives don't die all
> at once, especially at the low end of the market.  Instead, they have
> reliability problems--the kind that will force a reboot and mdadm resync,
> but not kick the device out of the array--just before they completely die.
> 
> If you are using mobile or desktop drives, the second most common 5-year
> outcome from this setup will be filesystem loss.  The most common outcome
> is that you don't have a disk failure on sda, so you don't need the second
> disk at all.  Failures that mdadm can recover from are 3rd most likely,
> followed by failures neither mdadm nor btrfs can recover from.

> If you are using NAS or enterprise drives (and they're not relabeled
> desktop drives inside), then they are unlikely to throw anything at mdadm
> that mdadm cannot handle; however, even the best drives break sometimes,
> and this puts you in one of the other probability scenarios.

I am misusing consumer drives in my NAS. Most of the hard drives and SSDs have been given to me, and are medium old to very old. That's why I am trying to build a system that can survive a single drive failure. That's also the reason why I built 2 NAS boxes; my primary NAS syncs to my slave NAS once per day.

> If you are using cheap SSDs then the probabilities flip because they
> do not report errors that mdadm can detect, and they tend to trigger
> more mdadm resyncs on average due to firmware issues and general
> low reliability.  Partial data loss (that would have been recoverable
> with btrfs raid1) is the most likely outcome, unrecoverable filesystem
> failure is 2nd.  Keeping your data for 5 years moves down to the 3rd
> most likely outcome.
> 
> Silent data corruption seems to be far more common on SSDs than HDDs, so
> the risk of unrecoverable corruption is highest when mdadm --write-mostly
> is used with the HDD as component disk #1 and the SSD as disk #0.
> 
> btrfs raid1 recovers from all of these cases except 2-drive same-sector
> failures and host RAM failure.

> Also, your stated use case is slightly different from Cedric's:
> 
> > You can use mdadm to do this (I've been using this feature for years in
> > setups where I have to fall back on USB disks for any reason).
> 
> Here, you are mirroring a relatively reliable disk onto one or more
> removable USB devices.  Depending on the details, this setup can have
> a different failure profile than the SSD/HDD use case.  If you rotate
> multiple USB devices so that you always have one offline intact mirror of
> the filesystem, then if there is a failure during resync and the online
> USB disk has a destroyed filesystem, the offline disk will still be
> intact and you can still use it to recover, but it will have older data.
> In the SSD/HDD use case, there is only one mirror pair, so there is no
> isolated copy that could survive a failure during resync.
> 
> The corruption propagation between drives is different too--if you can
> run a btrfs scrub, and detect corruption on the primary drive in time,
> you can retrieve the missing data from the offline mirror.  With only
> two online drives, it's a challenge to access the intact mirror (you
> have to ensure mdadm never resyncs, and use btrfs restore directly on
> the mirror device).

> > Thinking about "what if I trick my system here and there" is
> > absolutely fun, but no real use case, for me.
> 
> It is a real use case for other people.  Disks do silently corrupt data
> (especially low end SSDs), so it's reasonable to expect a modern raid1
> implementation to be able to recover from damage to either drive, and
> reasonable to test that capability by intentionally wiping one of them.

Thanks for confirming that drives do sometimes silently corrupt data. I have not yet seen this happen "in the wild".

> Just remember to test wiping the other one too.

Of course, always test all failure modes for all drives, and document the recovery procedures. After testing, fill the system with real data, and do everything in your power to prevent drives failing.

> > What if I expose BTRFS devices to cosmic rays and everything is wiped out?

Then it's time to restore the backups. This may also be the moment to replace the drives, as the cosmic rays could have degraded the electronics.

> > 
> > (I know, my only hero Qu is already preparing a patch - as usual -
> > while others start to write poems...)
> > 
> > Don't take it personally and smile,
> > Gelma

Thanks everybody in this thread for taking the time to tell me about the inner workings of btrfs and mdadm. This has saved me a lot of time.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
  2021-01-16 15:27             `  
@ 2021-01-18  0:45               ` Zygo Blaxell
  0 siblings, 0 replies; 15+ messages in thread
From: Zygo Blaxell @ 2021-01-18  0:45 UTC (permalink / raw)
  To: Cedric.dewijs; +Cc: andrea.gelmini, linux-btrfs

On Sat, Jan 16, 2021 at 04:27:29PM +0100,   wrote:
> 
> --- Original message ---
> From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
> Date: 16.01.2021 02:04:38
> To: Andrea Gelmini <andrea.gelmini@gmail.com>
> Subject: Re: Re: Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?
> 
> This tells me everything I need to know about mdadm. It can only save
> my data in case both drives originally had correct data, and one of the
> drives completely disappears. An mdadm raid-1 (mirror) array ironically
> increases the likelihood of data corruption. If one drive has a 1%
> chance of corrupting the data, 2 of those drives in mdadm raid 1 have
> a 2% chance of corrupting the data, and 100 of these drives in raid 1
> will almost certainly corrupt the data. (This is a bit of an
> oversimplification, and statistically not entirely sound, but it gets
> my point across). For me this defeats the purpose of a raid system, as
> raid should increase the redundancy and resilience.

It influences architecture and purchasing too.  Say you want to build
two filesystems, and you have 2 instances of 2 device models A and B
with different firmware, and you are worried about firmware bugs.

The mdadm way is:  host1 is modelA, modelA.  host2 is modelB, modelB.
This way, if modelA has a firmware bug, host1's filesystem gets corrupted,
but host2 is not affected, so you can restore backups from host2 back
to host1 after a failure.  Short of detailed forensic investigation,
it's not easy to tell which model is failing, so you just keep buying
both models and never combining them into an array in the same host.

The btrfs way is:  host1 is modelA, modelB.  host2 is modelA, modelB.
This way, if modelA has a firmware bug, btrfs corrects modelA using
modelB's data, so you don't need the backups.  You also know whether
modelA or modelB has the bad firmware, because btrfs identifies the bad
drive, and you can confirm the bug if both host1 and host2 are affected,
so you can stop buying that model until the vendor fixes it.

> Would it be possible to add a checksum to the data in mdadm in much
> the same way btrfs is doing that, so it can also detect and even repair
> corruption on the block level?

dm-integrity would provide the csum error detection that mdadm needs to
be able to recover from bitrot (i.e. use dm-integrity devices as
mdadm component devices).  I wouldn't expect it to perform too well
on spinning disks.
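
(A rough sketch of that stacking, with hypothetical device names.  The
dm-integrity layer adds a per-sector checksum, so a read of a rotted sector
comes back as an I/O error and mdadm then falls back to the other mirror:

  # add a checksum layer to each component device (this wipes them)
  integritysetup format /dev/sdb1
  integritysetup format /dev/sdc1
  integritysetup open /dev/sdb1 int-sdb1
  integritysetup open /dev/sdc1 int-sdc1

  # build the mirror out of the checksummed devices
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/mapper/int-sdb1 /dev/mapper/int-sdc1

The journalled double write that dm-integrity does by default is a large
part of why it's slow on spinning disks.)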

> I was trying to figure out if my data could survive if one of the drives
> (partially) failed. I only know of two ways to simulate this: physically
> disconnecting the drive, or dd-ing random data to it. I didn't state
> it in my original question, but I wanted to first poison one drive,
> then scrub the data, and then poison the next drive, and scrub again,
> until all drives have been poisoned and scrubbed. I have tested this
> with 2 sets of a backing drive and a writeback SSD cache. No matter
> which drive was poisoned, all the data survived although reconstructing
> the data was about 30x slower than reading correct data.

> My rule of thumb is "never run mdadm". I don't see a use case where
> it increases the longevity of my data.

There are still plenty of use cases for mdadm.  btrfs can't do everything,
e.g. mirrors on more than 4 disks, or split-failure-domain raid10, or
just a really convenient way to copy one disk to another from time to
time without taking it offline.

For nodatasum files, mdadm and btrfs have equivalent data integrity,
maybe a little worse on btrfs since btrfs lacks any way for the admin to
manually indicate which device has correct data (at least on mdadm there
are various ways to force a resync from one chosen drive to the other).
On the other hand, using nodatasum implies you don't care about data
integrity issues for that specific file, and btrfs still maintains
better integrity than mdadm for the rest of the filesystem.
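
(One such way, for the record, with /dev/md0 and /dev/sdb1 as placeholder
names: drop the member you distrust and re-add it, so it is rebuilt from
the member you trust.

  mdadm /dev/md0 --fail /dev/sdb1
  mdadm /dev/md0 --remove /dev/sdb1
  mdadm --zero-superblock /dev/sdb1   # force a full copy rather than a bitmap-based partial resync
  mdadm /dev/md0 --add /dev/sdb1      # rebuilds entirely from the remaining member

None of this helps decide which member to trust, of course--that part is
still up to the admin.)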

> I was looking for a way to give the drives of a btrfs filesystem a
> write cache, in such a way that a failure of a single drive could not
> result in data loss. As bcache does not and will not support multiple
> redundant SSDs as write cache [1], my plan was to put 2 identical
> SSDs in mdadm raid 1, as host for a bcache writeback cache for all
> the drives of the btrfs filesystem. See the figure below:
> +-----------------------------------------------------------+
> |          btrfs raid 1 (2 copies) /mnt                     |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> | Mdadm Raid 1 mirrored Writeback Cache (SSD)               |
> | /dev/md/name (containing /dev/sda3 and /dev/sda4)         |
> +--------------+--------------+--------------+--------------+
> | Data         | Data         | Data         | Data         |
> | /dev/sda8    | /dev/sda9    | /dev/sda10   | /dev/sda11   |
> +--------------+--------------+--------------+--------------+
> This will not protect my data if one of the SSDs starts to return
> random data, as mdadm can't see which of the two SSDs is correct. This
> will also defeat the redundancy of btrfs, as all copies of the data
> that btrfs sees are coming from the SSD pair.
> [1] https://lore.kernel.org/linux-bcache/e03dd593-14cb-b4a0-d68a-bd9b4fb8bd20@suse.de/T/#t
> 
> The only way I've come up with is to give each hard drive in the btrfs
> array its own writeback bcache, but that requires double the number
> of drives.

If it's a small enough number of disks, you could bind pairs of spinning
disks together into mdadm linear or single (see?  mdadm is still useful):

  btrfs dev 1 -> cache1 -> md1 linear -> spinner1, spinner2

  btrfs dev 2 -> cache2 -> md2 linear -> spinner3, spinner4
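
(A hypothetical assembly of that stack, using bcache for the cache layer
for concreteness -- dm-cache would work similarly -- and invented device
names throughout:

  # bind pairs of spinners into linear devices
  mdadm --create /dev/md1 --level=linear --raid-devices=2 /dev/sdc1 /dev/sdd1
  mdadm --create /dev/md2 --level=linear --raid-devices=2 /dev/sde1 /dev/sdf1

  # one writeback cache SSD in front of each leg, then btrfs raid1 on top
  make-bcache -B /dev/md1 -C /dev/sdg1 --writeback
  make-bcache -B /dev/md2 -C /dev/sdh1 --writeback
  mkfs.btrfs -d raid1 -m raid1 /dev/bcache0 /dev/bcache1

Each btrfs mirror then lives on its own pair of spinners plus its own cache
device, so no single device failure touches both copies.)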

If one of the spinners fails, you swap it out and rebuild a new mdadm
array on the remaining old disk and the new disk, then build a new
cache device on top.  If the cache dies you build a new cache device.
Then you mount btrfs degraded without the failed device, and use replace
to repopulate it with the replacement device(s), same as a normal disk
failure.
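
(Sketched out, assuming the dead leg was btrfs devid 2 and the rebuilt
stack comes back as /dev/bcache1 -- placeholder names again:

  mount -o degraded /dev/bcache0 /mnt
  btrfs replace start 2 /dev/bcache1 /mnt
  btrfs replace status /mnt

Using the devid instead of a device path is what lets replace work when the
original device is missing entirely.)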

If just the cache fails, you can maybe get away with dropping the
cache and running a btrfs scrub directly on the backing mdadm device,
as opposed to a full btrfs replace.  The missing blocks from the dead
cache will just look like data corruption and btrfs will replace them.

This changes the recovery time and failure probabilities compared to a
4xSSD 4xHDD layout, but not very much.  I wouldn't try this with 20 disks,
but with only 4 disks it's probably still OK as long as you don't have
a model that likes to fail all at the same time.  You'd want to scrub
that array regularly to detect failures as early as possible.

Both dm-cache and bcache will try to bypass the cache for big sequential
reads, so the scrub will mostly touch the backing disks.  There are some
gaps in the scrub coverage (any scrub reads that are serviced by the
cache will not reflect the backing disk state, so bitrot in the backing
disk might not be observed until the cache blocks are evicted) but btrfs
raid1 handles those like any other intermittently unreliable disk.

> I am misusing consumer drives in my NAS. Most of the hard drives and
> SSDs have been given to me, and are medium old to very old. 

OK, "very old" might be too risky to batch them up 2 disks at a time.
After a certain age, just moving a drive to a different slot in the
chassis can break it.

> That's
> why I am trying to build a system that can survive a single drive
> failure. That's also the reason why I built 2 NAS boxes; my primary
> NAS syncs to my slave NAS once per day.

A sound plan.  Host RAM sometimes flips bits, and there can be kernel
bugs.  Having your data on two separate physical hosts provides isolation
for that kind of failure.

> Thanks for confirming that drives do sometimes silently corrupt data. I
> have not yet seen this happen "in the wild".

Cheap SSDs are notorious for this (it's pretty much how they indicate
they are failing, you get btrfs csum errors while drive self-tests pass),
but we've seen one or two name-brand devices do it too.

Any drive will do it if it gets too hot and doesn't throttle itself.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-01-18  0:46 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-05  6:39 Raid1 of a slow hdd and a fast(er) SSD, howto to prioritize the SSD?  
2021-01-05  6:53 ` Qu Wenruo
2021-01-05 18:19   `  
2021-01-07 22:11     ` Zygo Blaxell
2021-01-05 19:19   ` Stéphane Lesimple
2021-01-06  2:55   ` Anand Jain
2021-01-08  8:16 ` Andrea Gelmini
2021-01-08  8:36   `  
2021-01-08 14:00     ` Zygo Blaxell
2021-01-08 19:29     ` Andrea Gelmini
2021-01-09 21:40       ` Zygo Blaxell
2021-01-10  9:00         ` Andrea Gelmini
2021-01-16  1:04           ` Zygo Blaxell
2021-01-16 15:27             `  
2021-01-18  0:45               ` Zygo Blaxell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).