linux-btrfs.vger.kernel.org archive mirror
* Re: Avoiding BTRFS RAID5 write hole
@ 2019-11-13 22:29 Hubert Tonneau
  2019-11-13 22:51 ` waxhead
  2019-11-14 21:25 ` Goffredo Baroncelli
  0 siblings, 2 replies; 13+ messages in thread
From: Hubert Tonneau @ 2019-11-13 22:29 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

Goffredo Baroncelli wrote:
>
> > What I am suggesting is to write it as RAID1 instead of RAID5, so that if it's changed a lot of times, you pay only once.
> I am not sure I understand what you are saying. Could you elaborate?

The safety problem with RAID5 is that between the time you start overwriting a stripe and the time you finish, redundancy is lost, because the parity no longer matches the data.
With RAID1, on the other hand, redundancy is preserved essentially all the time, so overwriting is not an issue.
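
To make the window concrete, here is a toy illustration (plain C, three data devices plus one parity device, one byte each; nothing here is btrfs code and the values are made up):

/* Toy model of the RAID5 write hole. Parity must equal d0 ^ d1 ^ d2
 * at all times for reconstruction to work; an in-place stripe update
 * opens a window where it does not. */
#include <stdio.h>
#include <stdint.h>

static uint8_t parity(uint8_t d0, uint8_t d1, uint8_t d2)
{
    return d0 ^ d1 ^ d2;
}

int main(void)
{
    uint8_t d0 = 0x11, d1 = 0x22, d2 = 0x33;
    uint8_t p = parity(d0, d1, d2);     /* consistent stripe */

    d0 = 0xAA;                          /* step 1: new data hits the disk */
    /* step 2, updating p, never happens: power loss right here */

    /* Later the disk holding d1 dies; rebuild d1 from the stale parity. */
    uint8_t rebuilt_d1 = d0 ^ d2 ^ p;
    printf("real d1 = 0x22, rebuilt d1 = 0x%02x -> %s\n",
           (unsigned)rebuilt_d1,
           rebuilt_d1 == 0x22 ? "ok" : "silent corruption");
    return 0;
}

If the crash lands between the data write and the parity write, a later rebuild that uses the stale parity silently returns wrong data: that is the write hole.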

There are several possible strategies for keeping RAID5 redundancy intact at all times:

1) Use a journal
This is the MDADM solution, because it is the only reasonable one when the RAID layer is separated from the filesystem (you don't want to add another sector-mapping layer).
The problem is that it is I/O expensive.
This is the solution implemented in Liu Bo's 2017 patch, as far as I can understand it (a sketch of the journal write ordering follows this list).

2) Never overwrite the RAID5 stripe
This is stripe COW. The new stripe is written to a different position on the disks.
The problem is that it is even more I/O expensive.
This is the solution you are suggesting, as far as I can understand it (a stripe-COW sketch also follows this list).
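
Here is a minimal sketch of the write ordering such a journal enforces (this is only my reading of the md-style journal, not Liu Bo's actual patch; the types, stripe size and helper names are invented and the device I/O is stubbed out with printf):

/* Write ordering with a RAID5 journal: log first, overwrite second. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define STRIPE_BYTES 4096                   /* hypothetical stripe size */

struct stripe_update {
    uint64_t stripe_nr;
    uint8_t  data[STRIPE_BYTES];            /* new data blocks */
    uint8_t  parity[STRIPE_BYTES];          /* new parity block */
};

/* 1. Append data + parity to the journal and flush it. */
static void journal_append_and_flush(const struct stripe_update *u)
{
    printf("journal: append stripe %llu (data+parity), flush\n",
           (unsigned long long)u->stripe_nr);
}

/* 2. Only now overwrite the stripe in place; a crash here is
 *    recoverable because the journal holds a complete copy. */
static void write_stripe_in_place(const struct stripe_update *u)
{
    printf("raid5: overwrite stripe %llu in place\n",
           (unsigned long long)u->stripe_nr);
}

/* 3. Retire the journal entry once the in-place write is stable. */
static void journal_checkpoint(const struct stripe_update *u)
{
    printf("journal: checkpoint stripe %llu\n",
           (unsigned long long)u->stripe_nr);
}

int main(void)
{
    struct stripe_update u = { .stripe_nr = 42 };
    memset(u.data, 0xAB, sizeof(u.data));
    memset(u.parity, 0xCD, sizeof(u.parity));

    journal_append_and_flush(&u);           /* everything is written twice */
    write_stripe_in_place(&u);
    journal_checkpoint(&u);
    return 0;
}

The I/O cost is visible in main(): every stripe update is written twice, once to the journal and once in place.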
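
And a matching sketch of stripe COW, again with invented types and helpers rather than real btrfs structures:

/* Stripe copy-on-write: never touch the old stripe, write a new one
 * elsewhere and flip the extent mapping afterwards. */
#include <stdio.h>
#include <stdint.h>

struct extent_map {                         /* hypothetical logical->physical map */
    uint64_t logical;
    uint64_t physical_stripe;
};

static uint64_t allocate_new_stripe(void)
{
    static uint64_t next = 1000;            /* pretend allocator */
    return next++;
}

static void write_full_stripe(uint64_t stripe)
{
    printf("write data + parity, full width, to stripe %llu\n",
           (unsigned long long)stripe);
}

static void cow_stripe(struct extent_map *em)
{
    uint64_t old_stripe = em->physical_stripe;
    uint64_t new_stripe = allocate_new_stripe();

    write_full_stripe(new_stripe);
    em->physical_stripe = new_stripe;       /* commit: flip the pointer */
    printf("extent %llu now points at stripe %llu; stripe %llu can be freed\n",
           (unsigned long long)em->logical,
           (unsigned long long)new_stripe,
           (unsigned long long)old_stripe);
}

int main(void)
{
    struct extent_map em = { .logical = 1, .physical_stripe = 7 };
    cow_stripe(&em);                        /* old stripe stays valid until the flip */
    return 0;
}

The old stripe remains valid until the pointer flips, so there is no unsafe window; the price is that even a small change rewrites a full stripe somewhere else.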

What I'm suggesting is to use your COW solution, but also write the new (set of) stripe(s) as RAID1.
Let me call this operation stripe COW RAID5 to RAID1.
The key advantage is that if you have to overwrite it again a few seconds (or hours) later, then it can be fast, because it's already RAID1.

Moreover, new stripes resulting from writing a new file, or from appending, would be created as RAID1, even if the filesystem DATA is configured as RAID5, whenever the stripe is not full or is likely to be modified soon.
This will reduce the number of stripe COW RAID5 to RAID1 operations.
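
A hypothetical decision helper makes the policy concrete (the heat counter, the thresholds and the stripe width below are all invented; this sketches the policy, not btrfs internals):

/* Choose the profile for a new write: RAID1 for partial or hot
 * stripes, RAID5 only for full, cold ones. */
#include <stdio.h>
#include <stdint.h>

enum profile { PROFILE_RAID1, PROFILE_RAID5 };

struct write_hint {
    uint32_t bytes;                 /* bytes being written */
    uint32_t stripe_width;          /* full RAID5 stripe width in bytes */
    uint32_t recent_rewrites;       /* hypothetical heat counter */
};

static enum profile choose_profile(const struct write_hint *h)
{
    if (h->bytes < h->stripe_width)     /* partial stripe: avoid RMW */
        return PROFILE_RAID1;
    if (h->recent_rewrites > 0)         /* hot data: will change again soon */
        return PROFILE_RAID1;
    return PROFILE_RAID5;               /* full, cold stripe: keep it compact */
}

int main(void)
{
    struct write_hint append = { .bytes = 4096,   .stripe_width = 196608 };
    struct write_hint cold   = { .bytes = 196608, .stripe_width = 196608 };

    printf("small append     -> %s\n",
           choose_profile(&append) == PROFILE_RAID1 ? "RAID1" : "RAID5");
    printf("full cold stripe -> %s\n",
           choose_profile(&cold) == PROFILE_RAID1 ? "RAID1" : "RAID5");
    return 0;
}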

The final objective is to have few stripe COW operations, because they are I/O expensive, and many RAID1 stripe overwrite operations.
The price to pay for the reduced number of stripe COW operations is extra disk space, because RAID1 stripes consume more disk space than RAID5 ones. That is why we would have a background process that does stripe COW from RAID1 to RAID5 in order to reclaim disk space, and we could make it more aggressive when disk space runs low.
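
The background process could then look roughly like this, getting more aggressive as free space shrinks (the thresholds and pass sizes are invented, and the selection of which RAID1 stripes are coldest is omitted):

/* One pass of a hypothetical RAID1 -> RAID5 reclaim thread. */
#include <stdio.h>
#include <stdint.h>

static void convert_stripe_raid1_to_raid5(uint64_t stripe)
{
    printf("COW stripe %llu from RAID1 back to RAID5\n",
           (unsigned long long)stripe);
}

/* More aggressive when the filesystem is nearly full. */
static unsigned stripes_per_pass(unsigned free_space_pct)
{
    if (free_space_pct < 10)
        return 256;                 /* reclaim hard */
    if (free_space_pct < 30)
        return 64;
    return 8;                       /* plenty of space: stay lazy */
}

int main(void)
{
    unsigned free_space_pct = 50;   /* pretend measurement */
    unsigned budget = stripes_per_pass(free_space_pct);

    /* Pick the 'budget' coldest RAID1 stripes and convert them. */
    for (uint64_t stripe = 0; stripe < budget; stripe++)
        convert_stripe_raid1_to_raid5(stripe);
    return 0;
}

Tuning stripes_per_pass() is exactly the extra-write-cost versus disk-space compromise described above.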

What I'm trying to convey is that, given BTRFS's flexibility, seeing the DATA as RAID1 or RAID5 is not the right framing. We should rather see it as RAID1 and RAID5, with RAID5 being just a way to reclaim disk space (and the same goes for RAID1C3 and RAID6).
Having METADATA as RAID1 and DATA as RAID5 was a first step, but BTRFS's flexibility probably allows us to do more.

Please note that I understand the BTRFS and RAID principles, but on the other hand I have not read the code, so I can hardly say what is easy to implement.
Sorry about that. I've written a complete new operating system (see www.fullpliant.org), everything but the kernel :-)

* Re: Avoiding BTRFS RAID5 write hole
@ 2019-11-12 22:27 Hubert Tonneau
  2019-11-13 19:34 ` Goffredo Baroncelli
  0 siblings, 1 reply; 13+ messages in thread
From: Hubert Tonneau @ 2019-11-12 22:27 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

Goffredo Baroncelli wrote:
>
> Instead I would like to investigate the idea of COW-ing the stripe: instead of updating the stripe in place, why not write the new stripe in another place and then update the data extent to point to the new data? Of course this would work only for data and not for metadata.

We are saying the same thing.
What I am suggesting is to write it as RAID1 instead of RAID5, so that if it's changed a lot of times, you pay only once.

The background process would then turn it back to RAID5 at a later point.
Adjusting how aggressively this background process works makes it possible to tune the trade-off between extra write cost and saved disk space.

* Avoiding BTRFS RAID5 write hole
@ 2019-11-12 15:13 Hubert Tonneau
  2019-11-12 18:44 ` Chris Murphy
  2019-11-12 19:49 ` Goffredo Baroncelli
  0 siblings, 2 replies; 13+ messages in thread
From: Hubert Tonneau @ 2019-11-12 15:13 UTC (permalink / raw)
  To: linux-btrfs

Hi,

In order to close the RAID5 write hole, I propose adding a mount option that would change RAID5 (and RAID6) behaviour:

. When overwriting a RAID5 stripe, first convert it to RAID1 (convert it to RAID1C3 if it was RAID6)

. Have a background process that converts RAID1 stripes to RAID5 (RAID1C3 to RAID6)

Expected advantages are:
. the low-level feature set basically remains the same
. the filesystem format remains the same
. old kernels and btrfs-progs would not be disturbed

The end result would be a mixed filesystem where the active parts are RAID1 and the archive ones are RAID5.

Regards,
Hubert Tonneau

