Linux-BTRFS Archive on lore.kernel.org
* Re: Avoiding BRTFS RAID5 write hole
@ 2019-11-13 22:29 Hubert Tonneau
  2019-11-13 22:51 ` waxhead
  2019-11-14 21:25 ` Goffredo Baroncelli
  0 siblings, 2 replies; 13+ messages in thread
From: Hubert Tonneau @ 2019-11-13 22:29 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

Goffredo Baroncelli wrote:
>
> > What I am suggesting is to write it as RAID1 instead of RAID5, so that if it's changed a lot of times, you pay only once.
> I am not sure I understand what you are saying. Could you elaborate?

The safety problem with RAID5 is that, between the moment you start overwriting a stripe and the moment you finish, redundancy is lost because the parity no longer matches the data.
With RAID1, on the other hand, redundancy is preserved essentially all the time, so overwriting is not an issue.
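To make this failure window concrete, here is a toy sketch (mine, not from the thread) of a 3-disk RAID5 stripe where a crash lands between the data write and the parity write:

```python
from functools import reduce
from operator import xor

def parity(blocks):
    # RAID5 parity is the byte-wise XOR of all data blocks in the stripe
    return bytes(reduce(xor, t) for t in zip(*blocks))

# two data disks + one parity disk, a single stripe
d0, d1 = b"AAAA", b"BBBB"
p = parity([d0, d1])
assert bytes(a ^ b for a, b in zip(d1, p)) == d0   # d0 is recoverable from d1+p

# partial overwrite: new data reaches disk 0, crash before parity is rewritten
d0 = b"XXXX"   # parity p is now stale: this is the write-hole window

# if disk 1 now dies, reconstruction from d0+p yields garbage,
# i.e. data committed on the *untouched* disk is lost
rebuilt_d1 = bytes(a ^ b for a, b in zip(d0, p))
assert rebuilt_d1 != b"BBBB"
```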

There are several possible strategies to keep RAID5 disk safety all the time:

1) Use a journal
This is the MDADM solution, because it is the only reasonable one when the RAID layer is separated from the filesystem (you don't want to add another sector-mapping layer).
The problem is that it is IO expensive.
This is the solution implemented in Liu Bo's 2017 patch, as far as I can understand it.

2) Never overwrite the RAID5 stripe
This is stripe COW. The new stripe is stored at a different position on the disks.
The problem is that it's even more IO expensive.
This is the solution you are suggesting, as far as I can understand it.

What I'm suggesting is to use your COW solution, but also write the new (set of) stripe(s) as RAID1.
Let me call this operation stripe COW RAID5 to RAID1.
The key advantage is that if you have to overwrite it again a few seconds (or hours) later, then it can be fast, because it's already RAID1.

Moreover, new stripes resulting from writing a new file, or from appending, would be created as RAID1, even if the filesystem DATA profile is configured as RAID5, whenever the stripe is not full or is likely to be modified soon.
This will reduce the number of stripe COW RAID5 to RAID1 operations.

The final objective is to have few stripe COW operations, because they are IO expensive, and many RAID1 stripe overwrite operations, because those are cheap.
The price to pay for the reduced number of stripe COW operations is more disk space consumed, because RAID1 stripes consume more disk space than RAID5 ones. That is why we would also have a background process doing stripe COW from RAID1 to RAID5 in order to reclaim disk space, and we could make it more aggressive when disk space runs low.

What I'm trying to convey is that seeing the DATA as RAID1 *or* RAID5 is not the right model given BTRFS's flexibility. We should rather see it as RAID1 *and* RAID5, with RAID5 being just a way to reclaim disk space (same for RAID1C3 and RAID6).
Having METADATA as RAID1 and DATA as RAID5 was a first step, but BTRFS's flexibility probably allows us to do more.

Please note that while I understand the BTRFS and RAID principles, I have not read the code, so I can hardly say what is easy to implement.
Sorry about that. I've written a complete new operating system (see www.fullpliant.org), but not a kernel :-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Avoiding BRTFS RAID5 write hole
  2019-11-13 22:29 Avoiding BRTFS RAID5 write hole Hubert Tonneau
@ 2019-11-13 22:51 ` waxhead
  2019-11-14 21:25 ` Goffredo Baroncelli
  1 sibling, 0 replies; 13+ messages in thread
From: waxhead @ 2019-11-13 22:51 UTC (permalink / raw)
  To: Hubert Tonneau, Goffredo Baroncelli; +Cc: linux-btrfs

First of all, I am just a regular user and BTRFS enthusiast with no proper
filesystem knowledge.

Regarding the write hole... I was just pondering (and I may be totally
wrong about this, but it is worth a shot):

If RAID5/6 needs to read-modify-write, would the write hole not be
avoided if you first log an XOR cipher value in the metadata, and then
modify the already existing stripe by XOR'ing whatever needs to be
modified onto the rewritten stripe? The journal would only need to know
which stripes are being modified so they can be checked on mount.

If you hit the write hole, the parity would not match, and since the
XOR cipher is in the metadata, you can roll back any failed update
byte by byte until the checksum matches, and you are good to go with
the old data instead of the new.
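A minimal sketch of that rollback idea (my reading of it: the logged "XOR cipher" is the delta old XOR new, so applying it twice is the identity):

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old = b"old data"
new = b"new data"

# journal entry: the XOR delta between old and new stripe contents
delta = xor_bytes(old, new)

# the in-place update is then just XOR-ing the delta onto the stripe
stripe = xor_bytes(old, delta)
assert stripe == new

# after a crash, if the parity/checksum does not match, replaying the
# same delta rolls the stripe back to the committed old contents
rolled_back = xor_bytes(stripe, delta)
assert rolled_back == old
```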

If, on the other hand, you can write a new stripe, the problem goes away.
I personally am willing to trade increased disk IO and reduced performance
for space. After all, RAID5/6 is not performance oriented, but primarily
a space saver.

Once (if ever) BTRFS supports per-subvolume RAID levels, the
performance issue goes away, since you can always use raid1/0 for a
subvolume if you need to sacrifice space for performance.

- Waxhead


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Avoiding BRTFS RAID5 write hole
  2019-11-13 22:29 Avoiding BRTFS RAID5 write hole Hubert Tonneau
  2019-11-13 22:51 ` waxhead
@ 2019-11-14 21:25 ` Goffredo Baroncelli
  2019-11-15 20:41   ` Hubert Tonneau
  1 sibling, 1 reply; 13+ messages in thread
From: Goffredo Baroncelli @ 2019-11-14 21:25 UTC (permalink / raw)
  To: Hubert Tonneau; +Cc: linux-btrfs

On 13/11/2019 23.29, Hubert Tonneau wrote:
> Goffredo Baroncelli wrote:
>>
>>> What I am suggesting is to write it as RAID1 instead of RAID5, so that if it's changed a lot of times, you pay only once.
>> I am not sure I understand what you are saying. Could you elaborate?
> 
> The safety problem with RAID5 is that between the time you start to overwrite a stripe and the time you finish, disk safety is disabled because parity is broken.
> On the other hand, with RAID1, disk safety more or less remains all the time, so overwriting is no issue.
> 
> There are several possible strategies to keep RAID5 disk safety all the time:
> 
> 1) Use a journal
> This is the MDADM solution, because it is the only reasonable one when the RAID layer is separated from the filesystem (you don't want to add another sector-mapping layer).
> The problem is that it's IO expensive.
> This is the solution implemented in Liu Bo 2017 patch, as far as I can understand it.
> 
> 2) Never overwrite the RAID5 stripe
> This is stripe COW. The new stripe is stored at a different position on the disks.
> The problem is that it's even more IO expensive.

Why do you think that this approach is more IO expensive?

Suppose we have n+1 disks, configured as RAID5.

The problem arises only when you have to update a portion of a stripe. So I consider an amount of data in the range 0..n, i.e. n/2 on average (I omit the size of the per-disk stripe element, which is constant).

1) stripe update in place (current BTRFS behavior)
The data to write is ~ n/2 + 1 (on average, half the stripe size plus parity)

2) COW stripe (my idea)
The data to write is ~ n + 1 (the full stripe size + parity)

3) Your solution (cache in RAID1)
The data to write is
	a) write data to raid1: ~ n/2 * 2 = n (data written on two disks)
	b) update the RAID5 stripe: ~ n/2 + 1

Total:	~ 3 * n/2 + 1

(for cases 2 and 3 I don't consider the metadata update, because it is several orders of magnitude smaller)
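The back-of-envelope model above can be restated in a few lines of Python (a sketch of the same figures; units are per-disk stripe elements, with n data disks plus one parity disk):

```python
def in_place(n):
    # 1) update in place: half the stripe on average, plus parity
    return n / 2 + 1

def cow_stripe(n):
    # 2) COW the stripe: the full stripe, plus parity
    return n + 1

def raid1_cache(n):
    # 3) cache in RAID1: data written twice, then the RAID5 update
    return (n / 2) * 2 + (n / 2 + 1)

for n in (2, 4, 8):
    print(n, in_place(n), cow_stripe(n), raid1_cache(n))
```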




> This is the solution you are suggesting, as far as I can understand it.
> 
> What I'm suggesting is to use your COW solution, but also write the new (set of) stripe(s) as RAID1.
> Let me call this operation stripe COW RAID5 to RAID1.
> The key advantage is that if you have to overwrite it again a few seconds (or hours) later, then it can be fast, because it's already RAID1.

On the basis of my simulation, I can't agree: COW-ing a stripe requires writing the full stripe; instead, if you want to write the data to RAID1 before updating the stripe, on average you first have to write an amount of data equal to n, and then you still have to update the RAID5....

You can see this another way: RAID5 is more space friendly than RAID1, which means that with RAID5 you have to write less data....


> 
> Moreover, new stripes resulting from writing a new file, or from appending, would be created as RAID1, even if the filesystem DATA profile is configured as RAID5, whenever the stripe is not full or is likely to be modified soon.
> This will reduce the number of stripe COW RAID5 to RAID1 operations.
> 
> The final objective is to have few stripe COW operations, because they are IO expensive, and many RAID1 stripe overwrite operations.
> The price to pay for the reduced number of stripe COW operations is consuming more disk space, because RAID1 stripes consumes more disk space than RAID5 ones, and that is why we would have a background process that does stripe COW from RAID1 to RAID5 in order to reclaim disk space, and we could make it more aggressive when we lack disk space.
> 
> What I'm trying to convey is that seeing the DATA as RAID1 *or* RAID5 is not the right model given BTRFS's flexibility. We should rather see it as RAID1 *and* RAID5, with RAID5 being just a way to reclaim disk space (same for RAID1C3 and RAID6).
> Having METADATA as RAID1 and DATA as RAID5 was a first step, but BTRFS flexibility probably allows to do more.
> 
> Please notice that I understand the BTRFS and RAID principles, but on the other hand, I have not read the code, so can hardly say what is easy to implement.
> Sorry about that. I've written a full new operating system (see www.fullpliant.org) but the kernel :-)
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Avoiding BRTFS RAID5 write hole
  2019-11-14 21:25 ` Goffredo Baroncelli
@ 2019-11-15 20:41   ` Hubert Tonneau
  2019-11-17  8:53     ` Goffredo Baroncelli
  0 siblings, 1 reply; 13+ messages in thread
From: Hubert Tonneau @ 2019-11-15 20:41 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

Goffredo Baroncelli wrote:
>
> Why do you think that this approach is more IO expensive ?
> 
> Supposing to have n+1 disks, configured as raid5 Raid5
> 
> The problem is only when you have to update a portion of stripe. So I consider of an amount of data in the range 0..n -> n/2 in average (I omit the size of the disk-stripe which is constant)

Your approach is way too optimistic.
If the application is doing random writes on a nocow file, or doing append/sync, append/sync, append/sync on a log, then it will rather be something like n/10 (10 is an arbitrary number, but the divisor will be much larger than 2).

> 1) stripe update in place (current BTRFS behavior)
> The data to write is  ~ n/2 +1 (in average half of the stripe size + parity)

You have forgotten that you have to read the parity before being able to write, and this is what is going to bring the server to a crawl if it does not write full RAID5 stripes.

> 2) COW stripe (my idea)
> The data to write is ~ n + 1 (the full stripe size + parity)

Same here: you have to read before writing, in order to get the unchanged part of the stripe.

> 3) Your solution (cache in RAID1)
> The data to write is
> 	a) write data to raid1: ~ n/2 * 2 = n (data written on two disks)
> 	b) update the RAID5 stripe: ~ n/2 + 1
> 
> Total:	~ 3 * n/2 + 1

Your count shows that I did not express properly what I'm suggesting. Sorry about that.
What I'm suggesting, in something closer to your wording, is:
a) copy the RAID5 stripe to a new RAID1 one at a different place: n * 2 (and you need to read before write)
b) overwrite the new RAID1 stripe in place: n/2 * 2
Of course, you could optimize and do a) and b) in one step, so the overall cost would be n * 2 instead of n * 3.
What is important is that I'm suggesting to NEVER update a RAID5 stripe in place. If you need to update a RAID5 stripe, first turn it into RAID1 stripes at a different place on the disks. This is what a) does.
Also, since all active parts are already RAID1, a) is a fairly rare operation.
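Hypothetical arithmetic for this amortization argument (illustrative only; same units as the earlier cost model: n data disks, k repeated partial overwrites of ~n/2 each):

```python
def cow_every_time(n, k):
    # pure stripe COW: every overwrite rewrites the full stripe + parity
    return k * (n + 1)

def cow_to_raid1_once(n, k):
    # one COW from RAID5 to RAID1 (n written twice), then k cheap
    # in-place RAID1 overwrites of ~n/2 data, each mirrored (= n)
    return n * 2 + k * n

# the RAID1 path breaks even at k = 2n overwrites and wins beyond that
n = 4
for k in (1, 4, 8, 16):
    print(k, cow_every_time(n, k), cow_to_raid1_once(n, k))
```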

---

Let me get back to basics:
With RAID5, if you want good performance and data safety, you must do only full new stripe writes (no in-place overwrites), because that avoids both the read-before-write and the write hole.
This means you need variable-size stripes. This is the ZFS way, as far as I understand it, and it brings a lot of complexity and makes fragmentation harder to deal with, so this is what we would like to avoid for BTRFS.

On the other hand, with RAID1 you can update in place, so everything is simple and fast.
This is why the current BTRFS RAID1 is satisfying.
Only large-file sequential writes are faster on RAID5 than on RAID1, but we can assume that when people switch to RAID5, in all but very special use cases, it is not for faster sequential write speed but for more storage space available to the application.

As a summary, the only problem with the current BTRFS RAID1 implementation is that it provides a poor usable-versus-raw disk space ratio, just like any pure RAID1 implementation.
So I see RAID5 as just a way to improve this ratio (same for RAID6 and RAID1C3).
That is why you need a background process that selects some large files that don't have the nocow flag set and converts them (not in place) from RAID1 to RAID5, in order to consume less disk space.
The same process should convert (not in place) a file from RAID5 back to RAID1 if the nocow flag has been set.
For any new file, use RAID1.
This is also why the costly operation a) will be rare.
In the end, you get RAID1 performance and data safety with a better usable-versus-raw disk space ratio, and that is just what most of us expect from the filesystem.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Avoiding BRTFS RAID5 write hole
  2019-11-15 20:41   ` Hubert Tonneau
@ 2019-11-17  8:53     ` Goffredo Baroncelli
  2019-11-17 19:49       ` Hubert Tonneau
  2019-11-28 11:37       ` Hubert Tonneau
  0 siblings, 2 replies; 13+ messages in thread
From: Goffredo Baroncelli @ 2019-11-17  8:53 UTC (permalink / raw)
  To: Hubert Tonneau; +Cc: linux-btrfs

On 15/11/2019 21.41, Hubert Tonneau wrote:
> Goffredo Baroncelli wrote:
>>
>> Why do you think that this approach is more IO expensive ?
>>
>> Supposing to have n+1 disks, configured as raid5 Raid5
>>
>> The problem is only when you have to update a portion of stripe. So I consider of an amount of data in the range 0..n -> n/2 in average (I omit the size of the disk-stripe which is constant)
> 
> Your approach is way too optimistic.
> If the application is doing random writes on a nocow file, or doing append sync append sync append sync on a log, then it will rather be something like n/10 (10 is arbitrary number, but it will be much more than 2)

We need data to prove which is the better model, n/2 or n/10.... Otherwise we only have assumptions. What is true is that, in the case of small repeated writes within the same stripe, your approach is better. In the case where the writes are spread all over the disks, you have more writes.
> 
>> 1) stripe update in place (current BTRFS behavior)
>> The data to write is  ~ n/2 +1 (in average half of the stripe size + parity)
> 
> You have forgotten that you have to read the parity before being able to write, and this is what is going to bring the server to a crawl if it does not write full RAID5 stripes.

With the exception of a missing device, reading the parity is not needed. However, you have to read the *full* stripe (this is true for every solution) in order to recalculate the parity.
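For reference, the two textbook ways of refreshing parity after a partial write (a sketch, not btrfs code; the reconstruct-write variant is the one described above, while read-modify-write is the one that reads the old parity):

```python
from functools import reduce

def xor(*blocks):
    # byte-wise XOR of an arbitrary number of equal-sized blocks
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

d = [b"\x01" * 4, b"\x02" * 4, b"\x03" * 4]   # three data disks
p = xor(*d)                                   # parity disk

new_d1 = b"\x07" * 4   # we want to rewrite the second data block

# reconstruct-write: read the untouched data blocks, recompute parity
p_rw = xor(d[0], new_d1, d[2])

# read-modify-write: read only the old data block and the old parity
p_rmw = xor(p, d[1], new_d1)

assert p_rw == p_rmw   # both yield the same new parity
```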
> 
>> 2) COW stripe (my idea)
>> The data to write is ~ n + 1 (the full stripe size + parity)
> 
> Same, you have to do read before write, in order to get the unchanged part of the stripe.
> 
>> 3) Your solution (cache in RAID1)
>> The data to write is
>> 	a) write data to raid1: ~ n/2 * 2 = n (data written on two disks)
>> 	b) update the RAID5 stripe: ~ n/2 + 1
>>
>> Total:	~ 3 * n/2 + 1
> 
> Your count shows that I did not express properly what I'm suggesting. Sorry about that.
> What I'm suggesting, with something closer to your wording, is
> a) copy the RAID5 stripe to a new RAID1 one at a different place: n * 2 (and need to read before write)
> b) in place overwrite the new RAID1 stripe: n/2 * 2
> Of course, you could optimize and do a) and b) in one step, so the overall cost would be n * 2 instead of n * 3

I wrote 3/2 * n, not n * 3... Note that this also includes updating the RAID5...

Anyway, I was wrong: you have to cache the full stripe before updating the RAID5 stripe. Why? If you don't update the full stripe (so the parity is not aligned with the data) AND you lose one disk (not involved in the update), you cannot rebuild the missing data.

So the corrected calculation is:
[...]
3) Your solution (cache in RAID1)
The data to write is
	a) write data to raid1: ~ n * 2  (data written on two disks)
	b) update the RAID5 stripe: ~ n/2 + 1

Total:	~ (2+1/2) * n + 1
[...]

> What is important is that what I'm suggesting is to NEVER update in place a RAID5 stripe. If you need to update a RAID5 stripe, first turn it to some RAID stripes at a different place of the disks. This is what a) does.
> Also, since all active parts are already RAID1, a) is a fairly rare operation.
> 
> ---
> 
> Let me get back to the basic:
> With RAID5, if you want good performances and data safety, you have to do only full new stripes writes (no in place overwrite), because it avoids the read before write need, and the write hole.
> It means that you need variable size stripes. This is ZFS way, as far as I understand it, and it brings a lot of complexity and makes it harder to deal with fragmentation, so this is what we would like to avoid it for BTRFS.
> 
> On the other hand, with RAID1, you can update in place, so everything is simple and fast.
> This is why current BTRFS RAID1 is satisfying.
> Only large-file sequential writes are faster on RAID5 than on RAID1, but we can assume that when people switch to RAID5, in all but very special use cases, it is not for faster sequential write speed but for more storage space available to the application.
> 
> As a summary, the only problem with current RAID1 BTRFS implementation, is that it provides a poor usable versus raw disk space ratio, just as any pure RAID1 implementation.
> So, I see RAID5 as just a way to improve this ratio (same for RAID6 and RAID1C3).
> That is why you need a background process that will select some large files that don't have the nocow flag set, and convert (not in place) them from RAID1 to RAID5 in order to consume less disk space.
> The same process should convert (not in place) the file from RAID5 to RAID1 if the nocow flag has been set.
> For any new files, use RAID1.
> This is also why the costly operation a) will be rare.
> In the end, you get RAID1 performances and data safety, but a better usable versus raw disk space ratio, and this is just what most of us expect from the filesystem.

If I understood correctly, ZFS acts like you describe: it has an intent log (the ZIL, optionally on a dedicated SLOG device) where data is written to disk before the filesystem is updated. You are suggesting to keep the "hot" data in a RAID1 block group, and write/update the RAID5 block group only when the data has gone "cold" in the RAID1 cache.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Avoiding BRTFS RAID5 write hole
  2019-11-17  8:53     ` Goffredo Baroncelli
@ 2019-11-17 19:49       ` Hubert Tonneau
  2019-11-28 11:37       ` Hubert Tonneau
  1 sibling, 0 replies; 13+ messages in thread
From: Hubert Tonneau @ 2019-11-17 19:49 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

Goffredo Baroncelli wrote:
>
> You are suggesting to keep the "hot" data in a RAID1 block group, and write/update the RAID5 block group only when the data is "cold" in the RAID1 cache.

That's it.
Just remove the last word, 'cache', because in some cases the data might also remain RAID1 forever, for example if the nocow flag is set and the partition has enough free space.

Regards,
Hubert Tonneau

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Avoiding BRTFS RAID5 write hole
  2019-11-17  8:53     ` Goffredo Baroncelli
  2019-11-17 19:49       ` Hubert Tonneau
@ 2019-11-28 11:37       ` Hubert Tonneau
  1 sibling, 0 replies; 13+ messages in thread
From: Hubert Tonneau @ 2019-11-28 11:37 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

Goffredo Baroncelli wrote:
>
> You are suggesting to keep the "hot" data in a RAID1 block group, and write/update the RAID5 block group only when the data is "cold" in the RAID1 cache.

Following this view, just adding a 'source data profile' (extent filter) and a 'target data profile' (overrides the filesystem data profile) option to the 'btrfs filesystem defragment' utility would be a good starting point.
It would allow one, within a RAID1 btrfs filesystem, to manually convert some subtree to RAID5, and increase free disk space as a result, so you get the best of both worlds.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Avoiding BRTFS RAID5 write hole
  2019-11-12 19:49 ` Goffredo Baroncelli
@ 2019-11-14  4:25   ` Zygo Blaxell
  0 siblings, 0 replies; 13+ messages in thread
From: Zygo Blaxell @ 2019-11-14  4:25 UTC (permalink / raw)
  To: kreijack; +Cc: Hubert Tonneau, linux-btrfs


On Tue, Nov 12, 2019 at 08:49:33PM +0100, Goffredo Baroncelli wrote:
> On 12/11/2019 16.13, Hubert Tonneau wrote:
> > Hi,
> > 
> > In order to close the RAID5 write hole, I prepose the add a mount option that would change RAID5 (and RAID6) behaviour :
> > 
> > . When overwriting a RAID5 stripe, first convert it to RAID1 (convert it to RAID1C3 if it was RAID6)
> 
> You can't overwrite and convert an existing stripe, for two kinds of reasons:
> 1) you still have to protect the stripe overwriting from the write hole
> 2) depending by the layout, a raid1 stripe consumes more space than a raid5 stripe with equal "capacity"
> 
> So you have to write (temporarily) the data on another place. This is something not different from what Qu proposed few years ago:
> 
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg66472.html [Btrfs: Add journal for raid5/6 writes]
> 
> where he added a device for logging the writes.
> 
> Unfortunately, this means doubling the writes; for a COW filesystem (which already suffers from this kind of issue) that would be a big performance penalty....
> 
> Instead I would like to investigate the idea of COW-ing the stripe: instead of updating the stripe in place, why not write the new stripe in another place and then update the data extent to point to the new data? Of course this would work only for the data and not for the metadata.
> Pros: the data is written only once
> Cons: the pressure of the metadata would increase; the fragmentation would increase

The write hole issue is caused by updating a RAID stripe that contains
committed data, and then not being able to finish that update because
of a crash or power loss.  You avoid this using two strategies:

	1.  never modify RAID stripes while they have committed data in
	them, or

	2.  use journalling so that you can never be prevented from
	completing a RAID stripe update due to a crash.

You can even do both, e.g. use strategy #1 for datacow files and strategy
#2 for nodatacow files.  IMHO we don't need to help nodatacow files
survive RAID5/6 failure events because we don't help nodatacow files
survive raid1, raid1c3, raid1c4, raid10, or dup failure events either,
but opinions differ so, fine, there's strategy #2 if you want it.

Other filesystems use strategy #1, but they have different layering
between CoW allocator and the RAID layer:  they put parity blocks in-band
in extents, so every extent is always a complete set of RAID stripes.
That would be a huge on-disk format change for btrfs (as well as rewriting
half the kernel implementation) that nobody wants to do.  The end
result would behave almost but not quite like the way btrfs currently
handles compression.  It's also not fixing the current btrfs raid5/6,
it's deprecating them entirely and doing something better instead.

Back to fixing existing btrfs profiles.  Any time we write to a stripe
that is not occupied by committed data on btrfs, we avoid the conditions
for the write hole.  The existing CoW mechanisms handle this, so nothing
needs to be changed there.  We only need to worry about writes to stripes
that contain data committed in earlier transactions, and we can know we
are doing this by looking at 'gen' fields in neighboring extents whenever
we insert an extent into a RAID5/6 block group.

We can get strategy #1 on btrfs by making two small(ish) changes:

	1.1.  allocate blocks strictly on stripe-aligned boundaries.

	1.2.  add a new balance filter that selects only partially filled
	RAID5/6 stripes for relocation.

The 'ssd' mount option already does 1.1, but it only works for RAID5
arrays with 5 disks and RAID6 arrays with 6 disks because it uses a fixed
allocation boundary, and it only works for metadata because...it's coded
to work only on metadata.  The change would be to have btrfs select an
allocation boundary for each block group based on the number of disks in
the block group (no new behavior for block groups that aren't raid5/6),
and do aligned allocations for both data and metadata.  This creates a
problem with free space fragmentation which we solve with change 1.2.

Implementing 1.2 allows balance to repack partially filled stripes into
complete stripes, which you will have to do fairly often if you are
allocating data strictly on RAID-stripe-aligned boundaries.  "Write 4K
then fsync" uses 256K of disk space, since writes to partially filled
stripes would not be allowed, we have 252K of wasted space and 4K in use.
Balance could later pack 64 such 4K extents into a single RAID5 stripe,
recovering all the wasted space.  Defrag can perform a similar function,
collecting multiple 4K extents into a single 256K or larger extent that
can be written in a single transaction without wasting space.
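The space arithmetic in the paragraph above, sketched under the same assumptions (64K per-disk stripe elements on a 4+1-disk RAID5, so 256K of data per full stripe):

```python
STRIPE_DATA = 256 * 1024   # data capacity of one full stripe (4 x 64K)

def aligned_alloc(write_size):
    # strategy #1: every allocation starts on a stripe boundary, so a
    # small synced write consumes a whole stripe
    used = write_size
    wasted = STRIPE_DATA - write_size
    return used, wasted

used, wasted = aligned_alloc(4 * 1024)
assert (used, wasted) == (4 * 1024, 252 * 1024)   # 4K used, 252K wasted

# balance/defrag can later repack 64 such 4K extents into one full stripe
assert 64 * 4 * 1024 == STRIPE_DATA
```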

Strategy #2 requires some disk format changes:

	2.1.  add a new block group type for metadata that uses simple
	replication (raid1c3/raid1c4, already done)

	2.2.  record all data blocks to be written to partially filled
	RAID5/6 stripes in a journal before modifying any blocks in
	the stripe.

The journal in 2.2. could be some extension of the log tree or a separate
tree.  As long as we can guarantee that any partial RAID5/6 RMW stripe
update will complete all data block updates before we start updating the
committed stripes, we can update any blocks we want.  We don't need to
journal the parity blocks, we can just recompute them from the logged
data block if the updated device goes missing.

After a crash, the journal must be replayed so that there are no
incomplete stripe updates.  Normally there would be at most one partial
stripe update per transaction, unless the filesystem is really full and we
are forced to start filling in old incomplete stripes.  Full stripe writes
don't need any intervention, the existing btrfs CoW mechanisms are fine.
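A hypothetical sketch of the write ordering for strategy #2 (all names invented; parity is recomputed from the logged data blocks, never journalled):

```python
journal = []   # stand-in for a log-tree extension or separate tree

def flush_journal():
    pass       # placeholder for a barrier/FUA write of the journal

def write_partial_stripe(stripe, offset, data):
    # 1. persist the data blocks to the journal *before* touching the
    #    committed stripe
    journal.append((offset, bytes(data)))
    flush_journal()
    # 2. only now modify the stripe in place (RMW); parity would be
    #    recomputed here, and can be rebuilt from the log after a crash
    stripe[offset:offset + len(data)] = data

def replay(stripe):
    # mount-time replay: re-apply logged updates so no partial-stripe
    # update is left incomplete
    for offset, data in journal:
        stripe[offset:offset + len(data)] = data

stripe = bytearray(16)
write_partial_stripe(stripe, 4, b"\xff\xff")

crashed = bytearray(16)   # simulate a stripe left half-written by a crash
replay(crashed)
assert crashed[4:6] == b"\xff\xff"
```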

Strategy #1 requires no disk format changes.  It just changes
the allocator and balance behavior.  Userspace changes would not
be immediately required, though without running balance to clean up
partially filled RAID stripes, performance would degrade after some time.
New kernels will be able to write raid5/6 updates without write hole,
old kernels won't.

Strategy #2 requires multiple disk format changes:  raid1c3/c4 (which we
now have) and raid5/6 data block journalling extensions (which we don't).
A kernel that didn't know to replay the log would not be able to fix
write holes on mount.

Note there are similar numbers of writes between the two strategies.
Everything is written in two places--but strategy #1 allows the user to
choose when the second write happens.  This allows for batch updates,
or maybe the user deletes or overwrites the data before we even have to
bother relocating it.  Strategy #2 always writes every journalled data
block twice (not counting parity and mirroring), but we can keep the
number of journalled blocks to a minimum.

> > . Have a background process that converts RAID1 stripes to RAID5 (RAID1C3 to RAID6)
> > 
> > Expected advantages are :
> > . the low level features set basically remains the same
> > . the filesystem format remains the same
> > . old kernels and btrs-progs would not be disturbed
> > 
> > The end result would be a mixed filesystem where active parts are RAID1 and archives one are RAID5.
> > 
> > Regards,
> > Hubert Tonneau
> > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Avoiding BRTFS RAID5 write hole
  2019-11-12 22:27 Hubert Tonneau
@ 2019-11-13 19:34 ` Goffredo Baroncelli
  0 siblings, 0 replies; 13+ messages in thread
From: Goffredo Baroncelli @ 2019-11-13 19:34 UTC (permalink / raw)
  To: Hubert Tonneau; +Cc: linux-btrfs

On 12/11/2019 23.27, Hubert Tonneau wrote:
> Goffredo Baroncelli wrote:
>>
>> Instead I would like to investigate the idea of COW-ing the stripe: instead of updating the stripe in place, why not write the new stripe in another place and then update the data extent to point to the new data? Of course this would work only for data and not for metadata.
> 
> We are saying the same.

The main difference is that my solution is permanent: the new data in its new place is as valid as the old. In your idea, there is a secondary step of updating the stripe.

> What I am suggesting is to write it as RAID1 instead of RAID5, so that if it is changed many times, you pay the conversion cost only once.
I am not sure I understand what you are saying. Could you elaborate?


> 
> The background process would then turn it back to RAID5 at a later point.
> Tuning how aggressively this background process works adjusts the trade-off between extra write cost and saved disk space.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: Avoiding BTRFS RAID5 write hole
@ 2019-11-12 22:27 Hubert Tonneau
  2019-11-13 19:34 ` Goffredo Baroncelli
  0 siblings, 1 reply; 13+ messages in thread
From: Hubert Tonneau @ 2019-11-12 22:27 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

Goffredo Baroncelli wrote:
>
> Instead I would like to investigate the idea of COW-ing the stripe: instead of updating the stripe in place, why not write the new stripe in another place and then update the data extent to point to the new data? Of course this would work only for data and not for metadata.

We are saying the same.
What I am suggesting is to write it as RAID1 instead of RAID5, so that if it is changed many times, you pay the conversion cost only once.

The background process would then turn it back to RAID5 at a later point.
Tuning how aggressively this background process works adjusts the trade-off between extra write cost and saved disk space.


* Re: Avoiding BTRFS RAID5 write hole
  2019-11-12 15:13 Hubert Tonneau
  2019-11-12 18:44 ` Chris Murphy
@ 2019-11-12 19:49 ` Goffredo Baroncelli
  2019-11-14  4:25   ` Zygo Blaxell
  1 sibling, 1 reply; 13+ messages in thread
From: Goffredo Baroncelli @ 2019-11-12 19:49 UTC (permalink / raw)
  To: Hubert Tonneau; +Cc: linux-btrfs

On 12/11/2019 16.13, Hubert Tonneau wrote:
> Hi,
> 
> In order to close the RAID5 write hole, I propose to add a mount option that would change RAID5 (and RAID6) behaviour:
> 
> . When overwriting a RAID5 stripe, first convert it to RAID1 (convert it to RAID1C3 if it was RAID6)

You can't overwrite and convert an existing stripe, for two kinds of reasons:
1) you still have to protect the stripe overwrite from the write hole
2) depending on the layout, a raid1 stripe consumes more space than a raid5 stripe of equal "capacity"

So you have to write the data (temporarily) in another place. This is not very different from what Qu proposed a few years ago:

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg66472.html [Btrfs: Add journal for raid5/6 writes]

where he added a device for logging the writes.

Unfortunately, this means doubling the writes, which for a COW filesystem (which already suffers from this kind of issue) would be a big performance penalty.

Instead I would like to investigate the idea of COW-ing the stripe: instead of updating the stripe in place, why not write the new stripe in another place and then update the data extent to point to the new data? Of course this would work only for data and not for metadata.
Pros: the data is written only once
Cons: metadata pressure would increase, and so would fragmentation
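The space argument in point 2) above (a raid1 stripe costing more raw space than a raid5 stripe of equal capacity) can be illustrated with a quick sketch; device counts here are illustrative, and btrfs's actual chunk allocation is more involved:

```python
# Usable-capacity fraction per profile, as a function of the number of
# devices a stripe spans.

def raid5_efficiency(n):
    # One parity strip per stripe: n-1 data strips out of n.
    return (n - 1) / n

def raid1_efficiency():
    # Two copies of every block, regardless of device count.
    return 0.5

for n in (3, 4, 6):
    print(n, raid5_efficiency(n), raid1_efficiency())
# e.g. on 4 devices RAID5 stores 0.75 of raw capacity, RAID1 only 0.5
```

So on three or more devices, relocating a raid5 stripe into raid1 always needs extra raw space, which is why the converted copy must live somewhere else rather than in place.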


> 
> . Have a background process that converts RAID1 stripes to RAID5 (RAID1C3 to RAID6)
> 
> Expected advantages are :
> . the low level features set basically remains the same
> . the filesystem format remains the same
> . old kernels and btrfs-progs would not be disturbed
> 
> The end result would be a mixed filesystem where active parts are RAID1 and archive ones are RAID5.
> 
> Regards,
> Hubert Tonneau
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: Avoiding BTRFS RAID5 write hole
  2019-11-12 15:13 Hubert Tonneau
@ 2019-11-12 18:44 ` Chris Murphy
  2019-11-12 19:49 ` Goffredo Baroncelli
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2019-11-12 18:44 UTC (permalink / raw)
  To: Hubert Tonneau, Btrfs BTRFS

On Tue, Nov 12, 2019 at 2:28 PM Hubert Tonneau
<hubert.tonneau@fullpliant.org> wrote:
>
> Hi,
>
> In order to close the RAID5 write hole, I propose to add a mount option that would change RAID5 (and RAID6) behaviour:
>
> . When overwriting a RAID5 stripe, first convert it to RAID1 (convert it to RAID1C3 if it was RAID6)
>
> . Have a background process that converts RAID1 stripes to RAID5 (RAID1C3 to RAID6)
>
> Expected advantages are :
> . the low level features set basically remains the same
> . the filesystem format remains the same
> . old kernels and btrfs-progs would not be disturbed
>
> The end result would be a mixed filesystem where active parts are RAID1 and archive ones are RAID5.
>

Interesting idea. It would be a compat_ro feature at worst; plausibly
it could be a compat feature, just without the guarantees offered by
the feature when using an older kernel.

Thing is, I'm not sure it's possible to convert just one stripe. I
think the minimum conversion unit is the block group. What is the
performance penalty if only full stripe writes were allowed? If
there's never RMW for a stripe, it avoids the write hole in the first
place. For some workloads the performance may be bad, in which case
the option to do COW as raid1 or raid1c3, later converting it to
raid56 would be better. In effect it'd be using raid1 for performance
and safety, and raid5 for efficiency.
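The full-stripe-write condition mentioned above can be sketched as follows (the strip size and device count are illustrative, not btrfs's actual constants):

```python
# A write avoids read-modify-write (and hence the write hole) only if
# it covers whole stripes, so parity can be computed from the new data
# alone without reading back the rest of the stripe.

STRIP_SIZE = 64 * 1024          # bytes of data per device per stripe
DATA_DEVICES = 3                # e.g. 4-disk RAID5: 3 data + 1 parity
STRIPE_SIZE = STRIP_SIZE * DATA_DEVICES

def is_full_stripe_write(offset, length):
    # Full stripe: starts on a stripe boundary and spans whole stripes.
    return offset % STRIPE_SIZE == 0 and length % STRIPE_SIZE == 0 and length > 0

print(is_full_stripe_write(0, STRIPE_SIZE))          # True: no RMW needed
print(is_full_stripe_write(STRIP_SIZE, STRIP_SIZE))  # False: partial stripe
```

Any write failing this test must either read-modify-write the stripe (opening the hole) or be redirected elsewhere, which is where the raid1-first idea comes in.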

I think most people would prefer raid1 (2 or 3 copies) for metadata by
default, in lieu of raid56.


-- 
Chris Murphy


* Avoiding BTRFS RAID5 write hole
@ 2019-11-12 15:13 Hubert Tonneau
  2019-11-12 18:44 ` Chris Murphy
  2019-11-12 19:49 ` Goffredo Baroncelli
  0 siblings, 2 replies; 13+ messages in thread
From: Hubert Tonneau @ 2019-11-12 15:13 UTC (permalink / raw)
  To: linux-btrfs

Hi,

In order to close the RAID5 write hole, I propose to add a mount option that would change RAID5 (and RAID6) behaviour:

. When overwriting a RAID5 stripe, first convert it to RAID1 (convert it to RAID1C3 if it was RAID6)

. Have a background process that converts RAID1 stripes to RAID5 (RAID1C3 to RAID6)

Expected advantages are :
. the low level features set basically remains the same
. the filesystem format remains the same
. old kernels and btrfs-progs would not be disturbed

The end result would be a mixed filesystem where active parts are RAID1 and archive ones are RAID5.
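A minimal sketch of the proposed behaviour, under assumed names (this is illustrative pseudologic for the two rules above, not btrfs code):

```python
# Rule 1: overwrites promote parity profiles to mirrored ones, so a
# parity stripe is never half-updated. Rule 2: a background pass demotes
# cold chunks back to the parity profile to recover space efficiency.

PROMOTE = {"raid5": "raid1", "raid6": "raid1c3"}
DEMOTE = {v: k for k, v in PROMOTE.items()}

class Chunk:
    def __init__(self, profile):
        self.profile = profile
        self.dirty = False      # recently written, not yet cold

def overwrite(chunk):
    # CoW the data into a mirrored chunk instead of touching the stripe.
    if chunk.profile in PROMOTE:
        chunk.profile = PROMOTE[chunk.profile]
    chunk.dirty = True

def background_demote(chunks):
    # Balance-like pass: only cold mirrored chunks go back to parity.
    for c in chunks:
        if not c.dirty and c.profile in DEMOTE:
            c.profile = DEMOTE[c.profile]

c = Chunk("raid5")
overwrite(c)
print(c.profile)   # raid1: active data stays mirrored
c.dirty = False
background_demote([c])
print(c.profile)   # raid5: archived data regains parity efficiency
```

How eagerly `background_demote` runs is the write-cost versus disk-space knob discussed later in the thread.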

Regards,
Hubert Tonneau


end of thread, back to index

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-13 22:29 Avoiding BTRFS RAID5 write hole Hubert Tonneau
2019-11-13 22:51 ` waxhead
2019-11-14 21:25 ` Goffredo Baroncelli
2019-11-15 20:41   ` Hubert Tonneau
2019-11-17  8:53     ` Goffredo Baroncelli
2019-11-17 19:49       ` Hubert Tonneau
2019-11-28 11:37       ` Hubert Tonneau
  -- strict thread matches above, loose matches on Subject: below --
2019-11-12 22:27 Hubert Tonneau
2019-11-13 19:34 ` Goffredo Baroncelli
2019-11-12 15:13 Hubert Tonneau
2019-11-12 18:44 ` Chris Murphy
2019-11-12 19:49 ` Goffredo Baroncelli
2019-11-14  4:25   ` Zygo Blaxell
