* RFC: raid with a variable stripe size
@ 2016-11-18 18:15 Goffredo Baroncelli
  2016-11-18 20:32 ` Janos Toth F.
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2016-11-18 18:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell

Hello,

These are only my thoughts; no code here, but I would like to share them in the hope that they could be useful.

As reported several times by Zygo (and others), one of the problems of raid5/6 is the write hole. Today BTRFS is not capable of addressing it.

The problem is that the stripe size is bigger than the "sector size" (ok, sector is not the correct word, but I am referring to the basic unit of writing on the disk, which is 4k or 16K in btrfs).
So when btrfs writes less data than the stripe, the stripe is not filled; when it is filled by a subsequent write, a RMW (read-modify-write) of the parity is required.

To the best of my understanding (which could be very wrong), ZFS tries to solve this issue using a variable-length stripe.

On BTRFS this could be achieved by using several BGs (== block groups or chunks), one for each stripe size.

For example, if a RAID5 filesystem is composed of 4 disks, the filesystem should have three BGs:
BG #1, composed of two disks (1 data + 1 parity)
BG #2, composed of three disks (2 data + 1 parity)
BG #3, composed of four disks (3 data + 1 parity).

If the data to be written has a size of 4k, it will be allocated to BG #1.
If the data to be written has a size of 8k, it will be allocated to BG #2.
If the data to be written has a size of 12k, it will be allocated to BG #3.
If the data to be written has a size greater than 12k, it will be allocated to BG #3 until the data fills full stripes; then the remainder will be stored in BG #1 or BG #2 (see the sketch below).
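
A minimal Python sketch of this placement policy (illustrative only, not btrfs code; the 4k sector size and the fill-the-widest-BG-first rule are assumptions taken from the example above):

SECTOR = 4096
BGS = {1: 1, 2: 2, 3: 3}   # BG id -> number of data disks per stripe

def place_write(size_bytes):
    """Return (bg_id, sectors) placements for one write, illustrative only."""
    sectors = -(-size_bytes // SECTOR)      # round up to whole sectors
    placements = []
    full, rest = divmod(sectors, BGS[3])    # fill full 3-data stripes in BG #3 first
    if full:
        placements.append((3, full * BGS[3]))
    if rest:
        placements.append((rest, rest))     # a 1- or 2-sector remainder -> BG #1 or #2
    return placements

print(place_write(4 * 1024))    # [(1, 1)]          -> BG #1
print(place_write(8 * 1024))    # [(2, 2)]          -> BG #2
print(place_write(12 * 1024))   # [(3, 3)]          -> BG #3
print(place_write(20 * 1024))   # [(3, 3), (2, 2)]  -> full stripe in BG #3, remainder in BG #2

The same idea generalizes to N disks: fill the widest BG first and send the remainder to the BG whose data width matches the leftover sectors.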


To avoid unbalancing the disk usage, each BG could use all the disks, even if a stripe uses fewer disks, i.e.:

DISK1 DISK2 DISK3 DISK4
S1    S1    S1    S2
S2    S2    S3    S3
S3    S4    S4    S4
[....]

Above is shown a BG which uses all four disks, but whose stripes span only 3 disks.
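
A small sketch of how such a layout could be generated (illustrative only; it simply lays 3-device-wide stripes row-major across the 4 disks, reproducing the table above):

def layout(num_disks, stripe_width, num_stripes):
    """Lay stripe_width-wide stripes row-major across num_disks devices."""
    slots = ["S%d" % (i // stripe_width + 1) for i in range(num_stripes * stripe_width)]
    rows = []
    for r in range(0, len(slots), num_disks):
        rows.append(slots[r:r + num_disks])
    return rows

for row in layout(num_disks=4, stripe_width=3, num_stripes=4):
    print("  ".join(row))
# S1  S1  S1  S2
# S2  S2  S3  S3
# S3  S4  S4  S4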


Pros:
- btrfs is already capable of handling different BGs in the same filesystem; only the allocator has to change
- no more RMW is required (== higher performance)

Cons:
- the data will be more fragmented
- the filesystem will have more BGs; this will require a re-balance from time to time. But this is an issue which we already know about (even if it may not be 100% addressed).


Thoughts?

BR
G.Baroncelli



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: RFC: raid with a variable stripe size
  2016-11-18 18:15 RFC: raid with a variable stripe size Goffredo Baroncelli
@ 2016-11-18 20:32 ` Janos Toth F.
  2016-11-18 20:51   ` Timofey Titovets
  2016-11-19  8:55   ` Goffredo Baroncelli
  2016-11-18 20:34 ` Timofey Titovets
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: Janos Toth F. @ 2016-11-18 20:32 UTC (permalink / raw)
  Cc: linux-btrfs

Based on the comments on this patch, the stripe size could theoretically
go as low as 512 bytes:
https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html
If these very small (0.5k-2k) stripe sizes could really work (i.e. it is
possible to implement such changes and keeping the stripe size so low
does not degrade performance too much - or at all), we could use
RAID-5(/6) on <=9(/10) disks with 512-byte physical sectors (assuming a 4k
filesystem sector size + 4k node size, although I am not sure whether the
node size is really important here) without having to worry about RMW,
extra space waste or additional fragmentation.
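
The arithmetic behind the <=9/<=10 figure, as a quick sketch (illustrative only, assuming a 4k filesystem sector and 512-byte stripe elements):

SECTOR = 4096          # filesystem sector size
STRIPE_ELEMENT = 512   # per-device stripe element size
data_elements = SECTOR // STRIPE_ELEMENT        # one 4k sector spans 8 data elements
print("max RAID5 devices:", data_elements + 1)  # 8 data + 1 parity = 9
print("max RAID6 devices:", data_elements + 2)  # 8 data + 2 parity = 10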

On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli <kreijack@libero.it> wrote:
> Hello,
>
> these are only my thoughts; no code here, but I would like to share it hoping that it could be useful.
>
> As reported several times by Zygo (and others), one of the problem of raid5/6 is the write hole. Today BTRFS is not capable to address it.
>
> The problem is that the stripe size is bigger than the "sector size" (ok sector is not the correct word, but I am referring to the basic unit of writing on the disk, which is 4k or 16K in btrfs).
> So when btrfs writes less data than the stripe, the stripe is not filled; when it is filled by a subsequent write, a RMW of the parity is required.
>
> On the best of my understanding (which could be very wrong) ZFS try to solve this issue using a variable length stripe.
>
> On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
>
> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
> BG #1,composed by two disks (1 data+ 1 parity)
> BG #2 composed by three disks (2 data + 1 parity)
> BG #3 composed by four disks (3 data + 1 parity).
>
> If the data to be written has a size of 4k, it will be allocated to the BG #1.
> If the data to be written has a size of 8k, it will be allocated to the BG #2
> If the data to be written has a size of 12k, it will be allocated to the BG #3
> If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
>
>
> To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
>
> DISK1 DISK2 DISK3 DISK4
> S1    S1    S1    S2
> S2    S2    S3    S3
> S3    S4    S4    S4
> [....]
>
> Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
>
>
> Pro:
> - btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
> - no more RMW are required (== higher performance)
>
> Cons:
> - the data will be more fragmented
> - the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
>
>
> Thoughts ?
>
> BR
> G.Baroncelli
>
>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: RFC: raid with a variable stripe size
  2016-11-18 18:15 RFC: raid with a variable stripe size Goffredo Baroncelli
  2016-11-18 20:32 ` Janos Toth F.
@ 2016-11-18 20:34 ` Timofey Titovets
  2016-11-19  8:59   ` Goffredo Baroncelli
  2016-11-19  8:22 ` Zygo Blaxell
  2016-11-29  0:48 ` Qu Wenruo
  3 siblings, 1 reply; 21+ messages in thread
From: Timofey Titovets @ 2016-11-18 20:34 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs, Zygo Blaxell

2016-11-18 21:15 GMT+03:00 Goffredo Baroncelli <kreijack@libero.it>:
> Hello,
>
> these are only my thoughts; no code here, but I would like to share it hoping that it could be useful.
>
> As reported several times by Zygo (and others), one of the problem of raid5/6 is the write hole. Today BTRFS is not capable to address it.
>
> The problem is that the stripe size is bigger than the "sector size" (ok sector is not the correct word, but I am referring to the basic unit of writing on the disk, which is 4k or 16K in btrfs).
> So when btrfs writes less data than the stripe, the stripe is not filled; when it is filled by a subsequent write, a RMW of the parity is required.
>
> On the best of my understanding (which could be very wrong) ZFS try to solve this issue using a variable length stripe.
>
> On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
>
> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
> BG #1,composed by two disks (1 data+ 1 parity)
> BG #2 composed by three disks (2 data + 1 parity)
> BG #3 composed by four disks (3 data + 1 parity).
>
> If the data to be written has a size of 4k, it will be allocated to the BG #1.
> If the data to be written has a size of 8k, it will be allocated to the BG #2
> If the data to be written has a size of 12k, it will be allocated to the BG #3
> If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
>
>
> To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
>
> DISK1 DISK2 DISK3 DISK4
> S1    S1    S1    S2
> S2    S2    S3    S3
> S3    S4    S4    S4
> [....]
>
> Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
>
>
> Pro:
> - btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
> - no more RMW are required (== higher performance)
>
> Cons:
> - the data will be more fragmented
> - the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
>
>
> Thoughts ?
>
> BR
> G.Baroncelli

AFAIK it's difficult to do such things with btrfs, because btrfs uses
chunk allocation for metadata & data,
whereas AFAIK ZFS works with the storage more directly, so ZFS spans
files across the different disks directly.

Maybe it can be implemented by some chunk allocator rework, I don't know.

Correct me if I'm wrong, thanks.

-- 
Have a nice day,
Timofey.


* Re: RFC: raid with a variable stripe size
  2016-11-18 20:32 ` Janos Toth F.
@ 2016-11-18 20:51   ` Timofey Titovets
  2016-11-18 21:38     ` Janos Toth F.
  2016-11-19  8:55   ` Goffredo Baroncelli
  1 sibling, 1 reply; 21+ messages in thread
From: Timofey Titovets @ 2016-11-18 20:51 UTC (permalink / raw)
  To: Janos Toth F.; +Cc: linux-btrfs

2016-11-18 23:32 GMT+03:00 Janos Toth F. <toth.f.janos@gmail.com>:
> Based on the comments of this patch, stripe size could theoretically
> go as low as 512 byte:
> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html
> If these very small (0.5k-2k) stripe sizes could really work (it's
> possible to implement such changes and it does not degrade performance
> too much - or at all - to keep it so low), we could use RAID-5(/6) on
> <=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem
> sector size + 4k node size, although I am not sure if node size is
> really important here) without having to worry about RMW, extra space
> waste or additional fragmentation.
>
> On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli <kreijack@libero.it> wrote:
>> Hello,
>>
>> these are only my thoughts; no code here, but I would like to share it hoping that it could be useful.
>>
>> As reported several times by Zygo (and others), one of the problem of raid5/6 is the write hole. Today BTRFS is not capable to address it.
>>
>> The problem is that the stripe size is bigger than the "sector size" (ok sector is not the correct word, but I am referring to the basic unit of writing on the disk, which is 4k or 16K in btrfs).
>> So when btrfs writes less data than the stripe, the stripe is not filled; when it is filled by a subsequent write, a RMW of the parity is required.
>>
>> On the best of my understanding (which could be very wrong) ZFS try to solve this issue using a variable length stripe.
>>
>> On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
>>
>> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
>> BG #1,composed by two disks (1 data+ 1 parity)
>> BG #2 composed by three disks (2 data + 1 parity)
>> BG #3 composed by four disks (3 data + 1 parity).
>>
>> If the data to be written has a size of 4k, it will be allocated to the BG #1.
>> If the data to be written has a size of 8k, it will be allocated to the BG #2
>> If the data to be written has a size of 12k, it will be allocated to the BG #3
>> If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
>>
>>
>> To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
>>
>> DISK1 DISK2 DISK3 DISK4
>> S1    S1    S1    S2
>> S2    S2    S3    S3
>> S3    S4    S4    S4
>> [....]
>>
>> Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
>>
>>
>> Pro:
>> - btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
>> - no more RMW are required (== higher performance)
>>
>> Cons:
>> - the data will be more fragmented
>> - the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
>>
>>
>> Thoughts ?
>>
>> BR
>> G.Baroncelli
>>
>>
>>
>> --
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

AFAIK all drives now use a 4k physical sector size and expose 512b only logically.
So this creates another RMW cycle - read 4k -> modify 512b -> write 4k - instead
of just writing 512b.

-- 
Have a nice day,
Timofey.


* Re: RFC: raid with a variable stripe size
  2016-11-18 20:51   ` Timofey Titovets
@ 2016-11-18 21:38     ` Janos Toth F.
  0 siblings, 0 replies; 21+ messages in thread
From: Janos Toth F. @ 2016-11-18 21:38 UTC (permalink / raw)
  Cc: linux-btrfs

Yes, I don't think one could find any NAND-based SSDs with a <4k page
size on the market right now (even =4k is hard to get), and 4k is
becoming the new norm for HDDs. However, some HDD manufacturers
continue to offer drives with 512-byte sectors (I think it's possible
to get new ones in sizable quantities if you need them).

I am aware it wouldn't solve the problem for >=4k-sector devices
unless you are ready to balance frequently. But I think it would still
be a lot better to waste padding space on 4k stripes than on, say, 64k
stripes until you can balance the new block groups. And, if the space
waste ratio is tolerable, this could be an automatic background task
that runs as soon as an individual block group, or the total, reaches a
high waste ratio.

I suggest this as a quick temporary workaround because it could be
cheap in terms of work if the above-mentioned functionalities (stripe
size change, auto-balance) were going to be worked on anyway (regardless
of the RAID-5/6 specific issues) until some better solution is realized
(probably through a lot more work over a much longer development
period). RAID-5 isn't really optimal for a huge number of disks (the URE
during rebuild issue...), so the temporary space waste is probably
<=8x per unbalanced block group (block groups are 1GB, or maybe ~10GB, if
I am not mistaken, so usually <<8x of the whole available space). But
maybe my guesstimates are wrong here.

On Fri, Nov 18, 2016 at 9:51 PM, Timofey Titovets <nefelim4ag@gmail.com> wrote:
> 2016-11-18 23:32 GMT+03:00 Janos Toth F. <toth.f.janos@gmail.com>:
>> Based on the comments of this patch, stripe size could theoretically
>> go as low as 512 byte:
>> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg56011.html
>> If these very small (0.5k-2k) stripe sizes could really work (it's
>> possible to implement such changes and it does not degrade performance
>> too much - or at all - to keep it so low), we could use RAID-5(/6) on
>> <=9(/10) disks with 512 byte physical sectors (assuming 4k filesystem
>> sector size + 4k node size, although I am not sure if node size is
>> really important here) without having to worry about RMW, extra space
>> waste or additional fragmentation.
>>
>> On Fri, Nov 18, 2016 at 7:15 PM, Goffredo Baroncelli <kreijack@libero.it> wrote:
>>> Hello,
>>>
>>> these are only my thoughts; no code here, but I would like to share it hoping that it could be useful.
>>>
>>> As reported several times by Zygo (and others), one of the problem of raid5/6 is the write hole. Today BTRFS is not capable to address it.
>>>
>>> The problem is that the stripe size is bigger than the "sector size" (ok sector is not the correct word, but I am referring to the basic unit of writing on the disk, which is 4k or 16K in btrfs).
>>> So when btrfs writes less data than the stripe, the stripe is not filled; when it is filled by a subsequent write, a RMW of the parity is required.
>>>
>>> On the best of my understanding (which could be very wrong) ZFS try to solve this issue using a variable length stripe.
>>>
>>> On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
>>>
>>> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
>>> BG #1,composed by two disks (1 data+ 1 parity)
>>> BG #2 composed by three disks (2 data + 1 parity)
>>> BG #3 composed by four disks (3 data + 1 parity).
>>>
>>> If the data to be written has a size of 4k, it will be allocated to the BG #1.
>>> If the data to be written has a size of 8k, it will be allocated to the BG #2
>>> If the data to be written has a size of 12k, it will be allocated to the BG #3
>>> If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
>>>
>>>
>>> To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
>>>
>>> DISK1 DISK2 DISK3 DISK4
>>> S1    S1    S1    S2
>>> S2    S2    S3    S3
>>> S3    S4    S4    S4
>>> [....]
>>>
>>> Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
>>>
>>>
>>> Pro:
>>> - btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
>>> - no more RMW are required (== higher performance)
>>>
>>> Cons:
>>> - the data will be more fragmented
>>> - the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
>>>
>>>
>>> Thoughts ?
>>>
>>> BR
>>> G.Baroncelli
>>>
>>>
>>>
>>> --
>>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>
> AFAIK all drives at now use 4k physical sector size, and use 512b only logically
> So it's create another RWM Read 4k -> Modify 512b -> Write 4k, instead
> of just write 512b.
>
> --
> Have a nice day,
> Timofey.


* Re: RFC: raid with a variable stripe size
  2016-11-18 18:15 RFC: raid with a variable stripe size Goffredo Baroncelli
  2016-11-18 20:32 ` Janos Toth F.
  2016-11-18 20:34 ` Timofey Titovets
@ 2016-11-19  8:22 ` Zygo Blaxell
  2016-11-19  9:13   ` Goffredo Baroncelli
  2016-11-29  0:48 ` Qu Wenruo
  3 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2016-11-19  8:22 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs


On Fri, Nov 18, 2016 at 07:15:12PM +0100, Goffredo Baroncelli wrote:
> Hello,
>
> these are only my thoughts; no code here, but I would like to share
> it hoping that it could be useful.
>
> As reported several times by Zygo (and others), one of the problem of
> raid5/6 is the write hole. Today BTRFS is not capable to address it.
>
> The problem is that the stripe size is bigger than the "sector size"
> (ok sector is not the correct word, but I am referring to the basic
> unit of writing on the disk, which is 4k or 16K in btrfs).  So when
> btrfs writes less data than the stripe, the stripe is not filled; when
> it is filled by a subsequent write, a RMW of the parity is required.

The key point in the problem statement is that subsequent writes are
allowed to modify stripes while they contain data.  Proper CoW would
never do that.

Stripes should never contain data from two separate transactions--that
would imply that CoW rules have been violated.

Currently there is no problem for big writes on empty disks because
the data block allocator happens to do the right thing accidentally in
such cases.  It's only when the allocator allocates new data to partially
filled stripes that the problems occur.

For metadata the allocator currently stumbles into RMW writes so badly
that the difference between the current allocator and the worst possible
allocator is only a few percent.

> On the best of my understanding (which could be very wrong) ZFS try
> to solve this issue using a variable length stripe.

ZFS ties the parity blocks to what btrfs would call extents.  It prevents
multiple writes to the same RAID stripe in different transactions by
dynamically defining the RAID stripe boundaries *around* the write
boundaries.  This is very different from btrfs's current on-disk
structure.

e.g. if we were to write:

	extent D, 7 blocks
	extent E, 3 blocks
	extent F, 9 blocks

the disk in btrfs looks something like:

	D1 D2 D3 D4 P1
	D5 D6 D7 P2 E1
	E2 E3 P3 F1 F2
	F3 P4 F4 F5 F6
	P5 F7 F8 F9 xx

	P1 = parity(D1..D4)
	P2 = parity(D5..D7, E1)
	P3 = parity(E2, E3, F1, F2)
	P4 = parity(F3..F6)
	P5 = parity(F7..F9)

If D, E, and F were written in different transactions, it could make P2
and P3 invalid.

The disk in ZFS looks something like:

	D1 D2 D3 D4 P1
	D5 D6 D7 P2 E1
	E2 E3 P3 F1 F2
	F3 F4 P4 F5 F6
	F7 F8 P5 F9 P6

where:

	P1 is parity(D1..D4)
	P2 is parity(D5..D7)
	P3 is parity(E1..E3)
	P4 is parity(F1..F4)
	P5 is parity(F5..F8)
	P6 is parity(F9)

Each parity value contains only data from one extent, which makes it
impossible for any P block to contain data from different transactions.
Every extent is striped across a potentially different number of disks,
so it's less efficient than "pure" raid5 would be with the same quantity
of data.
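
A small sketch comparing the two parity layouts above (illustrative only, not real btrfs or ZFS code; the extent names and the 4-data-disk width are taken from the example):

DATA_DISKS = 4                      # 5 devices: 4 data + 1 parity per full stripe
extents = {"D": 7, "E": 3, "F": 9}  # extent name -> number of data blocks

def blocks(name, n):
    return ["%s%d" % (name, i + 1) for i in range(n)]

def btrfs_style(extents):
    """Fixed stripes filled back to back: a parity block may cover two extents."""
    flat = [b for name, n in extents.items() for b in blocks(name, n)]
    return [flat[i:i + DATA_DISKS] for i in range(0, len(flat), DATA_DISKS)]

def raidz_style(extents):
    """RAID-Z style: stripes never cross an extent boundary, so every parity
    block covers data written in a single transaction."""
    stripes = []
    for name, n in extents.items():
        ext = blocks(name, n)
        stripes += [ext[i:i + DATA_DISKS] for i in range(0, len(ext), DATA_DISKS)]
    return stripes

for label, scheme in (("btrfs", btrfs_style), ("raid-z", raidz_style)):
    for i, s in enumerate(scheme(extents), 1):
        print("%s: P%d = parity(%s)" % (label, i, ", ".join(s)))
# btrfs:  P2 = parity(D5, D6, D7, E1) and P3 = parity(E2, E3, F1, F2) mix extents.
# raid-z: P1..P6 each cover blocks from one extent only.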

This would require pushing the parity allocation all the way up into
the extent layer in btrfs, which would be a massive change that could
introduce regressions into all the other RAID levels; on the other hand,
if it was pushed up to that level, it would be possible to checksum the
parity blocks...

> On BTRFS this could be achieved using several BGs (== block group or
> chunk), one for each stripe size.

Actually it's one per *possibly* failed disk (N^2 - N disks for RAID6).
Block groups are composed of *specific* disks...

> For example, if a filesystem - RAID5 is composed by 4 DISK, the
> filesystem should have three BGs: BG #1,composed by two disks (1
> data+ 1 parity) BG #2 composed by three disks (2 data + 1 parity)
> BG #3 composed by four disks (3 data + 1 parity).

...i.e. you'd need block groups for disks ABCD, ABC, ABD, ACD, and BCD.

Btrfs doesn't allocate block groups that way anyway.  A much simpler
version of this is to make two changes:

	1.  Identify when disks go offline and mark block groups touching
	these disks as 'degraded'.  Currently this only happens at mount
	time, so the btrfs change would be to add the detection of state
	transition at the instant when a disk fails.

	2.  When a block group is degraded (i.e. some of its disks are
	missing), mark it strictly read-only and disable nodatacow.

Btrfs can already do #2 when balancing.  I've used this capability to
repair broken raid5 arrays.  Currently btrfs does *not* do this for
ordinary data writes, and that's the required change.
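
A conceptual model of these two changes (hypothetical names, not kernel code):

class BlockGroup:
    def __init__(self, disks):
        self.disks = set(disks)
        self.read_only = False   # a degraded group takes no new allocations

def on_device_failure(block_groups, failed_disk):
    # Change #1: notice the state transition the moment a disk fails.
    for bg in block_groups:
        if failed_disk in bg.disks:
            # Change #2: degraded groups become strictly read-only
            # (and nodatacow overwrites into them are disabled).
            bg.read_only = True

bgs = [BlockGroup("ABCD"), BlockGroup("ABC"), BlockGroup("BCD")]
on_device_failure(bgs, "D")
print([("".join(sorted(bg.disks)), bg.read_only) for bg in bgs])
# [('ABCD', True), ('ABC', False), ('BCD', True)]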

The trade-off for this approach is that if you didn't have any unallocated
space when a disk failed, you'll get ENOSPC for everything, because
there's no disk you could be allocating new metadata pages on.  That
makes it hard to add or replace disks.

> If the data to be written has a size of 4k, it will be allocated to
> the BG #1.  If the data to be written has a size of 8k, it will be
> allocated to the BG #2 If the data to be written has a size of 12k,
> it will be allocated to the BG #3 If the data to be written has a size
> greater than 12k, it will be allocated to the BG3, until the data fills
> a full stripes; then the remainder will be stored in BG #1 or BG #2.

OK I think I'm beginning to understand this idea better.  Short writes
degenerate to RAID1, and large writes behave more like RAID5.  No disk
format change is required because newer kernels would just allocate
block groups and distribute data differently.

That might be OK on SSD, but on spinning rust (where you're most likely
to find a RAID5 array) it'd be really seeky.  It'd also make 'df' output
even less predictive of actual data capacity.

Going back to the earlier example (but on 5 disks) we now have:

	block groups with 5 disks:
	D1 D2 D3 D4 P1
	F1 F2 F3 P2 F4
	F5 F6 P3 F7 F8

	block groups with 4 disks:
	E1 E2 E3 P4
	D5 D6 P5 D7

	block groups with 3 disks:
	(none)

	block groups with 2 disks:
	F9 P6

Now every parity block contains data from only one transaction, but 
extents D and F are separated by up to 4GB of disk space.

> To avoid unbalancing of the disk usage, each BG could use all the disks,
> even if a stripe uses less disks: i.e
>
> DISK1 DISK2 DISK3 DISK4
> S1    S1    S1    S2
> S2    S2    S3    S3
> S3    S4    S4    S4
> [....]
>
> Above is show a BG which uses all the four disks, but has a stripe
> which spans only 3 disks.

This isn't necessary.  Ordinary btrfs block group allocations will take
care of this if they use the most-free-disk-space-first algorithm (like
raid1 does, spreading out block groups using pairs of disks across all
the available disks).  The only requirement is to have separate block
groups divided into groups with 2, 3, and 4 (up to N) disks in them.
The final decision of _which_ disks to allocate can be done on the fly
as required by space demands.

When the disk does get close to full, this would lead to some nasty
early-ENOSPC issues.  It's bad enough now with just two competing
allocators (metadata and data)...imagine those problems multiplied by
10 on a big RAID5 array.

> Pro: - btrfs already is capable to handle different BG in the
> filesystem, only the allocator has to change - no more RMW are required
> (== higher performance)
>
> Cons: - the data will be more fragmented - the filesystem, will have
> more BGs; this will require time-to time a re-balance. But is is an
> issue which we already know (even if may be not 100% addressed).
>
>
> Thoughts ?

I initially proposed "plug extents" as an off-the-cuff strawman to show
how absurd an idea it was, but the more I think about it, the more I
think it might be the best hope for a short-term fix.

The original "plug extents" statement was:

	we would inject "plug" extents to fill up RAID5 stripes.  This
	lets us keep the 4K block size for allocations, but at commit
	(or delalloc) time we would fill up any gaps in new RAID stripes
	to prevent them from being modified.  As the real data is deleted
	from the RAID stripes, it would be replaced by "plug" extents to
	keep any new data from being allocated in the stripe.  When the
	stripe consists entirely of "plug" extents, the plug extent would
	be deleted, allowing the stripe to be allocated again.	The "plug"
	data would be zero for the purposes of parity reconstruction,
	regardless of what's on the disk.  Balance would just throw the
	plug extents away (no need to relocate them).

With the same three extents again:

	D1 D2 D3 D4 P1
	D5 D6 D7 P2 xx
	E1 E2 P3 E3 xx
	F1 P4 F2 F3 F4
	P5 F5 F6 F7 F8
	F9 xx xx xx P6

This doesn't fragment the data or the free space, but it does waste
some space between extents.  If all three extents were committed
in the same transaction, we don't need the plug extents, so it
looks like

	D1 D2 D3 D4 P1
	D5 D6 D7 P2 E1
	E2 E3 P3 F1 F2
	F3 P4 F4 F5 F6
	P5 F7 F8 F9 xx

which is the same as what btrfs does now *except* we would *only*
allow this when D, E, and F are part of the *same* transaction, and a
new transaction wouldn't allocate anything in the block after F9.

I now realize there's no need for any "plug extent" to physically
exist--the allocator can simply infer their existence on the fly by
noticing where the RAID stripe boundaries are, and remembering which
blocks it had allocated in the current uncommitted transaction.
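
A sketch of that inference (illustrative only; block numbers and the 4-block stripe width are assumptions): the allocator simply refuses free blocks that sit in a stripe which already held committed data when the transaction began.

STRIPE_DATA_BLOCKS = 4   # e.g. 4 data devices in a 5-disk RAID5

def allocatable(free_blocks, committed_blocks):
    """Keep only free blocks whose stripe held no committed data at the
    start of the transaction, so no committed stripe is ever RMW'd."""
    dirty_stripes = {b // STRIPE_DATA_BLOCKS for b in committed_blocks}
    return [b for b in free_blocks if b // STRIPE_DATA_BLOCKS not in dirty_stripes]

# Block 3 is free but shares stripe 0 with committed blocks 0..2, so it is
# implicitly "plugged" until the whole stripe becomes free again.
print(allocatable(free_blocks=[3, 4, 5, 6, 7], committed_blocks=[0, 1, 2]))
# -> [4, 5, 6, 7]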

Plug extents require no disk format change, older and newer kernels would
both read and write the same disk format, but newer kernels would write
with properly working CoW.

The tradeoff is that more balances would be required to avoid free space
fragmentation; on the other hand, typical RAID5 use cases involve storing
a lot of huge files, so the fragmentation won't be a very large percentage
of total space.  A few percent of disk capacity is a fair price to pay for
data integrity.

>
> BR G.Baroncelli
>
>
>
> -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



* Re: RFC: raid with a variable stripe size
  2016-11-18 20:32 ` Janos Toth F.
  2016-11-18 20:51   ` Timofey Titovets
@ 2016-11-19  8:55   ` Goffredo Baroncelli
  1 sibling, 0 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2016-11-19  8:55 UTC (permalink / raw)
  To: Janos Toth F.; +Cc: linux-btrfs

On 2016-11-18 21:32, Janos Toth F. wrote:
> Based on the comments of this patch, stripe size could theoretically
> go as low as 512 byte:

AFAIK the kernel uses a page size of 4k (or greater on some architectures), so it doesn't make sense to use such a small size.

GB

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: RFC: raid with a variable stripe size
  2016-11-18 20:34 ` Timofey Titovets
@ 2016-11-19  8:59   ` Goffredo Baroncelli
  0 siblings, 0 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2016-11-19  8:59 UTC (permalink / raw)
  To: Timofey Titovets; +Cc: linux-btrfs, Zygo Blaxell

On 2016-11-18 21:34, Timofey Titovets wrote:
[...]
>> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
>> BG #1,composed by two disks (1 data+ 1 parity)
>> BG #2 composed by three disks (2 data + 1 parity)
>> BG #3 composed by four disks (3 data + 1 parity).
>>
>> If the data to be written has a size of 4k, it will be allocated to the BG #1.
>> If the data to be written has a size of 8k, it will be allocated to the BG #2
>> If the data to be written has a size of 12k, it will be allocated to the BG #3
>> If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
>>
>>
>> To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
>>
>> DISK1 DISK2 DISK3 DISK4
>> S1    S1    S1    S2
>> S2    S2    S3    S3
>> S3    S4    S4    S4
>> [....]
>>
>> Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
>>
>>
>> Pro:
>> - btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
>> - no more RMW are required (== higher performance)
>>
>> Cons:
>> - the data will be more fragmented
>> - the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
>>
>>
>> Thoughts ?
>>
>> BR
>> G.Baroncelli
> 
> AFAIK, it's difficult to do such things with btrfs, because btrfs use
> chuck allocation for metadata & data,

BTRFS is already capable of using different kinds of chunk in the same filesystem: e.g. if a disk is added and a balance is not performed, the filesystem still has the older chunks, which don't use the newly inserted disk.

It is the same thing here; the only difference is that the allocator should select the chunk to write to on the basis of the size of the data to be written.


> i.e. AFAIK ZFS work with storage more directly, so zfs directly span
> file to the different disks.
> 
> May be it's can be implemented by some chunk allocator rework, i don't know.
> 
> Fix me if i'm wrong, thanks.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: RFC: raid with a variable stripe size
  2016-11-19  8:22 ` Zygo Blaxell
@ 2016-11-19  9:13   ` Goffredo Baroncelli
  0 siblings, 0 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2016-11-19  9:13 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On 2016-11-19 09:22, Zygo Blaxell wrote:
[...]
>> If the data to be written has a size of 4k, it will be allocated to
>> the BG #1.  If the data to be written has a size of 8k, it will be
>> allocated to the BG #2 If the data to be written has a size of 12k,
>> it will be allocated to the BG #3 If the data to be written has a size
>> greater than 12k, it will be allocated to the BG3, until the data fills
>> a full stripes; then the remainder will be stored in BG #1 or BG #2.
> 
> OK I think I'm beginning to understand this idea better.  Short writes
> degenerate to RAID1, and large writes behave more like RAID5.  No disk
> format change is required because newer kernels would just allocate
> block groups and distribute data differently.
> 
> That might be OK on SSD, but on spinning rust (where you're most likely
> to find a RAID5 array) it'd be really seeky.  It'd also make 'df' output
> even less predictive of actual data capacity.
> 
> Going back to the earlier example (but on 5 disks) we now have:
> 
> 	block groups with 5 disks:
> 	D1 D2 D3 D4 P1
> 	F1 F2 F3 P2 F4
> 	F5 F6 P3 F7 F8
> 
> 	block groups with 4 disks:
> 	E1 E2 E3 P4
> 	D5 D6 P5 D7
> 
> 	block groups with 3 disks:
> 	(none)
> 
> 	block groups with 2 disks:
> 	F9 P6
> 
> Now every parity block contains data from only one transaction, but 
> extents D and F are separated by up to 4GB of disk space.
> 
[....]

> 
> When the disk does get close to full, this would lead to some nasty
> early-ENOSPC issues.  It's bad enough now with just two competing
> allocators (metadata and data)...imagine those problems multiplied by
> 10 on a big RAID5 array.

I am inclined to think that some of these problems would be reduced by developing a daemon which starts a balance automatically when needed (on the basis of the fragmentation). In any case, this is an issue which we have to solve anyway.

[...]
> 
> I now realize there's no need for any "plug extent" to physically
> exist--the allocator can simply infer their existence on the fly by
> noticing where the RAID stripe boundaries are, and remembering which
> blocks it had allocated in the current uncommitted transaction.


Even this could be a "simple" solution: when a write starts, the system has to use only empty stripes...
> 
> 
> The tradeoff is that more balances would be required to avoid free space
> fragmentation; on the other hand, typical RAID5 use cases involve storing
> a lot of huge files, so the fragmentation won't be a very large percentage
> of total space.  A few percent of disk capacity is a fair price to pay for
> data integrity.

Both methods would require more aggressive balancing; in this respect they are equal.

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: RFC: raid with a variable stripe size
  2016-11-18 18:15 RFC: raid with a variable stripe size Goffredo Baroncelli
                   ` (2 preceding siblings ...)
  2016-11-19  8:22 ` Zygo Blaxell
@ 2016-11-29  0:48 ` Qu Wenruo
  2016-11-29  3:53   ` Zygo Blaxell
                     ` (2 more replies)
  3 siblings, 3 replies; 21+ messages in thread
From: Qu Wenruo @ 2016-11-29  0:48 UTC (permalink / raw)
  To: kreijack, linux-btrfs; +Cc: Zygo Blaxell



At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> Hello,
>
> these are only my thoughts; no code here, but I would like to share it hoping that it could be useful.
>
> As reported several times by Zygo (and others), one of the problem of raid5/6 is the write hole. Today BTRFS is not capable to address it.

I'd say there is no need to address it yet, since current soft RAID5/6 can't
handle it yet.

Personally speaking, Btrfs should implement RAID56 support just like
Btrfs on mdadm.
See how badly the current RAID56 works?

The marginal benefit of btrfs RAID56 scrubbing data better than
traditional RAID56 is just a joke in the current code base.

>
> The problem is that the stripe size is bigger than the "sector size" (ok sector is not the correct word, but I am referring to the basic unit of writing on the disk, which is 4k or 16K in btrfs).
> So when btrfs writes less data than the stripe, the stripe is not filled; when it is filled by a subsequent write, a RMW of the parity is required.
>
> On the best of my understanding (which could be very wrong) ZFS try to solve this issue using a variable length stripe.

Did you mean the ZFS record size?
IIRC that's the minimum file extent size, and I don't see how that can
handle the write hole problem.

Or does ZFS handle the problem?

Anyway, it should be a low-priority thing, and personally speaking,
any large behavior modification involving both the extent allocator and the
bg allocator will be bug prone.

>
> On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
>
> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
> BG #1,composed by two disks (1 data+ 1 parity)
> BG #2 composed by three disks (2 data + 1 parity)
> BG #3 composed by four disks (3 data + 1 parity).

This is too complicated a bg layout, and it requires further extent allocator modification.

More code means more bugs, and I'm pretty sure it will be bug prone.


Although the idea of a variable stripe size can somewhat reduce the
problem in certain situations.

For example, if the sectorsize is 64K and we make the stripe len 32K on
3-disk RAID5, we can avoid such write hole problems
without any modification to the extent/chunk allocator.
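
The numbers behind that example, as a quick sketch (illustrative only):

sectorsize = 64 * 1024
stripe_len = 32 * 1024
data_devices = 2                              # 3-disk RAID5: 2 data + 1 parity
full_stripe_data = data_devices * stripe_len  # 64 KiB of data per full stripe
# One sector == one full stripe, so even the smallest possible write
# rewrites a whole stripe and never needs RMW of committed parity.
print(full_stripe_data == sectorsize)         # True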

And I'd prefer to make the stripe len an mkfs-time parameter, not
modifiable after mkfs, to keep things easy.

Thanks,
Qu

>
> If the data to be written has a size of 4k, it will be allocated to the BG #1.
> If the data to be written has a size of 8k, it will be allocated to the BG #2
> If the data to be written has a size of 12k, it will be allocated to the BG #3
> If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
>
>
> To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
>
> DISK1 DISK2 DISK3 DISK4
> S1    S1    S1    S2
> S2    S2    S3    S3
> S3    S4    S4    S4
> [....]
>
> Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
>
>
> Pro:
> - btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
> - no more RMW are required (== higher performance)
>
> Cons:
> - the data will be more fragmented
> - the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
>
>
> Thoughts ?
>
> BR
> G.Baroncelli
>
>
>




* Re: RFC: raid with a variable stripe size
  2016-11-29  0:48 ` Qu Wenruo
@ 2016-11-29  3:53   ` Zygo Blaxell
  2016-11-29  4:12     ` Qu Wenruo
  2016-11-29  5:51   ` Chris Murphy
  2016-11-29 18:10   ` Goffredo Baroncelli
  2 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2016-11-29  3:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: kreijack, linux-btrfs


On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:
> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> >Hello,
> >
> >these are only my thoughts; no code here, but I would like to share it hoping that it could be useful.
> >
> >As reported several times by Zygo (and others), one of the problem
> of raid5/6 is the write hole. Today BTRFS is not capable to address it.
> 
> I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> yet.
> 
> Personally speaking, Btrfs should implementing RAID56 support just like
> Btrfs on mdadm.

Even mdadm doesn't implement it the way btrfs does (assuming all bugs
are fixed) any more.

> See how badly the current RAID56 works?

> The marginally benefit of btrfs RAID56 to scrub data better than tradition
> RAID56 is just a joke in current code base.

> >The problem is that the stripe size is bigger than the "sector size"
> (ok sector is not the correct word, but I am referring to the basic
> unit of writing on the disk, which is 4k or 16K in btrfs).  >So when
> btrfs writes less data than the stripe, the stripe is not filled; when
> it is filled by a subsequent write, a RMW of the parity is required.
> >
> >On the best of my understanding (which could be very wrong) ZFS try
> to solve this issue using a variable length stripe.
>
> Did you mean ZFS record size?
> IIRC that's file extent minimum size, and I didn't see how that can handle
> the write hole problem.
> 
> Or did ZFS handle the problem?

ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds the
parity blocks within extents, so it behaves more like btrfs compression
in the sense that the data in a RAID-Z extent is encoded differently
from the data in the file, and the kernel has to transform it on reads
and writes.

No ZFS stripe can contain blocks from multiple different
transactions because the RAID-Z stripes begin and end on extent
(single-transaction-write) boundaries, so there is no write hole on ZFS.

There is some space waste in ZFS because the minimum allocation unit
is two blocks (one data one parity) so any free space that is less
than two blocks long is unusable.  Also the maximum usable stripe width
(number of disks) is the size of the data in the extent plus one parity
block.  It means if you write a lot of discontiguous 4K blocks, you
effectively get 2-disk RAID1 and that may result in disappointing
storage efficiency.

(the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
for additional parity blocks).

One could implement RAID-Z on btrfs, but it's by far the most invasive
proposal for fixing btrfs's write hole so far (and doesn't actually fix
anything, since the existing raid56 format would still be required to
read old data, and it would still be broken).

> Anyway, it should be a low priority thing, and personally speaking,
> any large behavior modification involving  both extent allocator and bg
> allocator will be bug prone.

My proposal requires only a modification to the extent allocator.
The behavior at the block group layer and scrub remains exactly the same.
We just need to adjust the allocator slightly to take the RAID5 CoW
constraints into account.

It's not as efficient as the ZFS approach, but it doesn't require an
incompatible disk format change either.

> >On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
> >
> >For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
> >BG #1,composed by two disks (1 data+ 1 parity)
> >BG #2 composed by three disks (2 data + 1 parity)
> >BG #3 composed by four disks (3 data + 1 parity).
> 
> Too complicated bg layout and further extent allocator modification.
> 
> More code means more bugs, and I'm pretty sure it will be bug prone.
> 
> 
> Although the idea of variable stripe size can somewhat reduce the problem
> under certain situation.
> 
> For example, if sectorsize is 64K, and we make stripe len to 32K, and use 3
> disc RAID5, we can avoid such write hole problem.
> Withouth modification to extent/chunk allocator.
> 
> And I'd prefer to make stripe len mkfs time parameter, not possible to
> modify after mkfs. To make things easy.
> 
> Thanks,
> Qu
> 
> >
> >If the data to be written has a size of 4k, it will be allocated to the BG #1.
> >If the data to be written has a size of 8k, it will be allocated to the BG #2
> >If the data to be written has a size of 12k, it will be allocated to the BG #3
> >If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
> >
> >
> >To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
> >
> >DISK1 DISK2 DISK3 DISK4
> >S1    S1    S1    S2
> >S2    S2    S3    S3
> >S3    S4    S4    S4
> >[....]
> >
> >Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
> >
> >
> >Pro:
> >- btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
> >- no more RMW are required (== higher performance)
> >
> >Cons:
> >- the data will be more fragmented
> >- the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
> >
> >
> >Thoughts ?
> >
> >BR
> >G.Baroncelli
> >
> >
> >
> 
> 



* Re: RFC: raid with a variable stripe size
  2016-11-29  3:53   ` Zygo Blaxell
@ 2016-11-29  4:12     ` Qu Wenruo
  2016-11-29  4:55       ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2016-11-29  4:12 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: kreijack, linux-btrfs



At 11/29/2016 11:53 AM, Zygo Blaxell wrote:
> On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:
>> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
>>> Hello,
>>>
>>> these are only my thoughts; no code here, but I would like to share it hoping that it could be useful.
>>>
>>> As reported several times by Zygo (and others), one of the problem
>> of raid5/6 is the write hole. Today BTRFS is not capable to address it.
>>
>> I'd say, no need to address yet, since current soft RAID5/6 can't handle it
>> yet.
>>
>> Personally speaking, Btrfs should implementing RAID56 support just like
>> Btrfs on mdadm.
>
> Even mdadm doesn't implement it the way btrfs does (assuming all bugs
> are fixed) any more.
>
>> See how badly the current RAID56 works?
>
>> The marginally benefit of btrfs RAID56 to scrub data better than tradition
>> RAID56 is just a joke in current code base.
>
>>> The problem is that the stripe size is bigger than the "sector size"
>> (ok sector is not the correct word, but I am referring to the basic
>> unit of writing on the disk, which is 4k or 16K in btrfs).  >So when
>> btrfs writes less data than the stripe, the stripe is not filled; when
>> it is filled by a subsequent write, a RMW of the parity is required.
>>>
>>> On the best of my understanding (which could be very wrong) ZFS try
>> to solve this issue using a variable length stripe.
>>
>> Did you mean ZFS record size?
>> IIRC that's file extent minimum size, and I didn't see how that can handle
>> the write hole problem.
>>
>> Or did ZFS handle the problem?
>
> ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds the
> parity blocks within extents, so it behaves more like btrfs compression
> in the sense that the data in a RAID-Z extent is encoded differently
> from the data in the file, and the kernel has to transform it on reads
> and writes.
>
> No ZFS stripe can contain blocks from multiple different
> transactions because the RAID-Z stripes begin and end on extent
> (single-transaction-write) boundaries, so there is no write hole on ZFS.
>
> There is some space waste in ZFS because the minimum allocation unit
> is two blocks (one data one parity) so any free space that is less
> than two blocks long is unusable.  Also the maximum usable stripe width
> (number of disks) is the size of the data in the extent plus one parity
> block.  It means if you write a lot of discontiguous 4K blocks, you
> effectively get 2-disk RAID1 and that may result in disappointing
> storage efficiency.
>
> (the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
> for additional parity blocks).
>
> One could implement RAID-Z on btrfs, but it's by far the most invasive
> proposal for fixing btrfs's write hole so far (and doesn't actually fix
> anything, since the existing raid56 format would still be required to
> read old data, and it would still be broken).
>
>> Anyway, it should be a low priority thing, and personally speaking,
>> any large behavior modification involving  both extent allocator and bg
>> allocator will be bug prone.
>
> My proposal requires only a modification to the extent allocator.
> The behavior at the block group layer and scrub remains exactly the same.
> We just need to adjust the allocator slightly to take the RAID5 CoW
> constraints into account.

Then you'd need to allow btrfs to split large buffered/direct writes
into small extents (not 128M anymore).
I'm not sure whether we need to do extra work for DirectIO.

And in fact, you're going to have to support a variable max file extent size.

This makes delalloc more complex (Wang enhanced delalloc support for
variable file extent sizes, to fix the ENOSPC problem for dedupe and compression).

This is already much more complex than you expected.


And this is the *BIGGEST* problem of current btrfs:
there is no good enough (if there is any) *ISOLATION* for such a complex fs.

So even a "small" modification can lead to unexpected bugs.

That's why I want to isolate the fix in the RAID56 layer, not any layer upwards.
If that's not possible, I prefer not to do anything yet, until we are sure the
very basic part of RAID56 is stable.

Thanks,
Qu

>
> It's not as efficient as the ZFS approach, but it doesn't require an
> incompatible disk format change either.
>
>>> On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
>>>
>>> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
>>> BG #1,composed by two disks (1 data+ 1 parity)
>>> BG #2 composed by three disks (2 data + 1 parity)
>>> BG #3 composed by four disks (3 data + 1 parity).
>>
>> Too complicated bg layout and further extent allocator modification.
>>
>> More code means more bugs, and I'm pretty sure it will be bug prone.
>>
>>
>> Although the idea of variable stripe size can somewhat reduce the problem
>> under certain situation.
>>
>> For example, if sectorsize is 64K, and we make stripe len to 32K, and use 3
>> disc RAID5, we can avoid such write hole problem.
>> Withouth modification to extent/chunk allocator.
>>
>> And I'd prefer to make stripe len mkfs time parameter, not possible to
>> modify after mkfs. To make things easy.
>>
>> Thanks,
>> Qu
>>
>>>
>>> If the data to be written has a size of 4k, it will be allocated to the BG #1.
>>> If the data to be written has a size of 8k, it will be allocated to the BG #2
>>> If the data to be written has a size of 12k, it will be allocated to the BG #3
>>> If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
>>>
>>>
>>> To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
>>>
>>> DISK1 DISK2 DISK3 DISK4
>>> S1    S1    S1    S2
>>> S2    S2    S3    S3
>>> S3    S4    S4    S4
>>> [....]
>>>
>>> Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
>>>
>>>
>>> Pro:
>>> - btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
>>> - no more RMW are required (== higher performance)
>>>
>>> Cons:
>>> - the data will be more fragmented
>>> - the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
>>>
>>>
>>> Thoughts ?
>>>
>>> BR
>>> G.Baroncelli
>>>
>>>
>>>
>>
>>




* Re: RFC: raid with a variable stripe size
  2016-11-29  4:12     ` Qu Wenruo
@ 2016-11-29  4:55       ` Zygo Blaxell
  2016-11-29  5:49         ` Qu Wenruo
  0 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2016-11-29  4:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: kreijack, linux-btrfs


On Tue, Nov 29, 2016 at 12:12:03PM +0800, Qu Wenruo wrote:
> 
> 
> At 11/29/2016 11:53 AM, Zygo Blaxell wrote:
> >On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:
> >>At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> >>>Hello,
> >>>
> >>>these are only my thoughts; no code here, but I would like to share it hoping that it could be useful.
> >>>
> >>>As reported several times by Zygo (and others), one of the problem
> >>of raid5/6 is the write hole. Today BTRFS is not capable to address it.
> >>
> >>I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> >>yet.
> >>
> >>Personally speaking, Btrfs should implementing RAID56 support just like
> >>Btrfs on mdadm.
> >
> >Even mdadm doesn't implement it the way btrfs does (assuming all bugs
> >are fixed) any more.
> >
> >>See how badly the current RAID56 works?
> >
> >>The marginally benefit of btrfs RAID56 to scrub data better than tradition
> >>RAID56 is just a joke in current code base.
> >
> >>>The problem is that the stripe size is bigger than the "sector size"
> >>(ok sector is not the correct word, but I am referring to the basic
> >>unit of writing on the disk, which is 4k or 16K in btrfs).  >So when
> >>btrfs writes less data than the stripe, the stripe is not filled; when
> >>it is filled by a subsequent write, a RMW of the parity is required.
> >>>
> >>>On the best of my understanding (which could be very wrong) ZFS try
> >>to solve this issue using a variable length stripe.
> >>
> >>Did you mean ZFS record size?
> >>IIRC that's file extent minimum size, and I didn't see how that can handle
> >>the write hole problem.
> >>
> >>Or did ZFS handle the problem?
> >
> >ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds the
> >parity blocks within extents, so it behaves more like btrfs compression
> >in the sense that the data in a RAID-Z extent is encoded differently
> >from the data in the file, and the kernel has to transform it on reads
> >and writes.
> >
> >No ZFS stripe can contain blocks from multiple different
> >transactions because the RAID-Z stripes begin and end on extent
> >(single-transaction-write) boundaries, so there is no write hole on ZFS.
> >
> >There is some space waste in ZFS because the minimum allocation unit
> >is two blocks (one data one parity) so any free space that is less
> >than two blocks long is unusable.  Also the maximum usable stripe width
> >(number of disks) is the size of the data in the extent plus one parity
> >block.  It means if you write a lot of discontiguous 4K blocks, you
> >effectively get 2-disk RAID1 and that may result in disappointing
> >storage efficiency.
> >
> >(the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
> >for additional parity blocks).
> >
> >One could implement RAID-Z on btrfs, but it's by far the most invasive
> >proposal for fixing btrfs's write hole so far (and doesn't actually fix
> >anything, since the existing raid56 format would still be required to
> >read old data, and it would still be broken).
> >
> >>Anyway, it should be a low priority thing, and personally speaking,
> >>any large behavior modification involving  both extent allocator and bg
> >>allocator will be bug prone.
> >
> >My proposal requires only a modification to the extent allocator.
> >The behavior at the block group layer and scrub remains exactly the same.
> >We just need to adjust the allocator slightly to take the RAID5 CoW
> >constraints into account.
> 
> Then, you'd need to allow btrfs to split large buffered/direct write into
> small extents(not 128M anymore).
> Not sure if we need to do extra work for DirectIO.

Nope, that's not my proposal.  My proposal is to simply ignore free
space whenever it's inside a partially filled raid stripe (optimization:
...which was empty at the start of the current transaction).

That avoids modifying a stripe with committed data and therefore plugs the
write hole.

For nodatacow, prealloc (and maybe directio?) extents, the behavior
wouldn't change (you'd have a write hole, but only on data blocks, not
metadata, and only on files that were already explicitly marked as not
requiring data integrity).

> And in fact, you're going to support variant max file extent size.

The existing extent sizing behavior is not changed *at all* in my proposal,
only the allocator's notion of what space is 'free'.

We can write an extent across multiple RAID5 stripes so long as we
finish writing the entire extent before pointing committed metadata to
it.  btrfs does that already otherwise checksums wouldn't work.

> This makes delalloc more complex (Wang enhanced dealloc support for variant
> file extent size, to fix ENOSPC problem for dedupe and compression).
> 
> This is already much more complex than you expected.

The complexity I anticipate is having to deal with two implementations
of the free space search, one for free space cache and one for free
space tree.

It could be as simple as calling the existing allocation functions and
just filtering out anything that isn't suitably aligned inside a raid56
block group (at least for a proof of concept).
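
For concreteness, a rough standalone sketch of that filtering step is
below. Nothing in it is real btrfs code: candidate_ok() and
stripe_had_committed_data() are made-up names, and the latter stands in
for the free-space-cache/tree query that would be the actual work.

/*
 * Rough userspace sketch of the idea above -- NOT btrfs code.
 * stripe_had_committed_data() is a stand-in for "was this full stripe
 * non-empty at the start of the current transaction?".
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool stripe_had_committed_data(const bool *dirty, uint64_t stripe_nr)
{
    return dirty[stripe_nr];
}

/* Accept a free-space candidate only if every full stripe it touches
 * was empty at the start of the transaction; otherwise writing into it
 * would mean RMW of committed data, i.e. the write hole. */
static bool candidate_ok(uint64_t offset, uint64_t len,
                         uint64_t full_stripe_len, const bool *dirty)
{
    uint64_t first = offset / full_stripe_len;
    uint64_t last = (offset + len - 1) / full_stripe_len;

    for (uint64_t s = first; s <= last; s++)
        if (stripe_had_committed_data(dirty, s))
            return false;
    return true;
}

int main(void)
{
    /* 4-disk RAID5, 64K strips: 3 * 64K of data per full stripe. */
    uint64_t full = 3 * 64 * 1024;
    bool dirty[4] = { true, false, false, false };

    printf("4K hole in stripe 0: %s\n",
           candidate_ok(16 * 1024, 4096, full, dirty) ? "use" : "skip");
    printf("4K hole in stripe 1: %s\n",
           candidate_ok(full + 16 * 1024, 4096, full, dirty) ? "use" : "skip");
    return 0;
}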

> And this is the *BIGGEST* problem of current btrfs:
> No good enough(if there is any) *ISOLATION* for such a complex fs.
> 
> So even "small" modification can lead to unexpected bugs.
> 
> That's why I want to isolate the fix in RAID56 layer, not any layer upwards.

I don't think the write hole is fixable in the current raid56 layer, at
least not without a nasty brute force solution like stripe update journal.

Any of the fixes I'd want to use fix the problem from outside.

> If not possible, I prefer not to do anything yet, until we are sure the very
> basic part of RAID56 is stable.
> 
> Thanks,
> Qu
> 
> >
> >It's not as efficient as the ZFS approach, but it doesn't require an
> >incompatible disk format change either.
> >
> >>>On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
> >>>
> >>>For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
> >>>BG #1,composed by two disks (1 data+ 1 parity)
> >>>BG #2 composed by three disks (2 data + 1 parity)
> >>>BG #3 composed by four disks (3 data + 1 parity).
> >>
> >>Too complicated bg layout and further extent allocator modification.
> >>
> >>More code means more bugs, and I'm pretty sure it will be bug prone.
> >>
> >>
> >>Although the idea of variable stripe size can somewhat reduce the problem
> >>under certain situation.
> >>
> >>For example, if sectorsize is 64K, and we make stripe len to 32K, and use 3
> >>disc RAID5, we can avoid such write hole problem.
> >>Withouth modification to extent/chunk allocator.
> >>
> >>And I'd prefer to make stripe len mkfs time parameter, not possible to
> >>modify after mkfs. To make things easy.
> >>
> >>Thanks,
> >>Qu
> >>
> >>>
> >>>If the data to be written has a size of 4k, it will be allocated to the BG #1.
> >>>If the data to be written has a size of 8k, it will be allocated to the BG #2
> >>>If the data to be written has a size of 12k, it will be allocated to the BG #3
> >>>If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
> >>>
> >>>
> >>>To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
> >>>
> >>>DISK1 DISK2 DISK3 DISK4
> >>>S1    S1    S1    S2
> >>>S2    S2    S3    S3
> >>>S3    S4    S4    S4
> >>>[....]
> >>>
> >>>Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
> >>>
> >>>
> >>>Pro:
> >>>- btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
> >>>- no more RMW are required (== higher performance)
> >>>
> >>>Cons:
> >>>- the data will be more fragmented
> >>>- the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
> >>>
> >>>
> >>>Thoughts ?
> >>>
> >>>BR
> >>>G.Baroncelli
> >>>
> >>>
> >>>
> >>
> >>
> 
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: raid with a variable stripe size
  2016-11-29  4:55       ` Zygo Blaxell
@ 2016-11-29  5:49         ` Qu Wenruo
  2016-11-29 18:47           ` Janos Toth F.
  2016-11-29 22:51           ` Zygo Blaxell
  0 siblings, 2 replies; 21+ messages in thread
From: Qu Wenruo @ 2016-11-29  5:49 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: kreijack, linux-btrfs



At 11/29/2016 12:55 PM, Zygo Blaxell wrote:
> On Tue, Nov 29, 2016 at 12:12:03PM +0800, Qu Wenruo wrote:
>>
>>
>> At 11/29/2016 11:53 AM, Zygo Blaxell wrote:
>>> On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:
>>>> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
>>>>> Hello,
>>>>>
>>>>> these are only my thoughts; no code here, but I would like to share it hoping that it could be useful.
>>>>>
>>>>> As reported several times by Zygo (and others), one of the problem
>>>> of raid5/6 is the write hole. Today BTRFS is not capable to address it.
>>>>
>>>> I'd say, no need to address yet, since current soft RAID5/6 can't handle it
>>>> yet.
>>>>
>>>> Personally speaking, Btrfs should implementing RAID56 support just like
>>>> Btrfs on mdadm.
>>>
>>> Even mdadm doesn't implement it the way btrfs does (assuming all bugs
>>> are fixed) any more.
>>>
>>>> See how badly the current RAID56 works?
>>>
>>>> The marginally benefit of btrfs RAID56 to scrub data better than tradition
>>>> RAID56 is just a joke in current code base.
>>>
>>>>> The problem is that the stripe size is bigger than the "sector size"
>>>> (ok sector is not the correct word, but I am referring to the basic
>>>> unit of writing on the disk, which is 4k or 16K in btrfs).  >So when
>>>> btrfs writes less data than the stripe, the stripe is not filled; when
>>>> it is filled by a subsequent write, a RMW of the parity is required.
>>>>>
>>>>> On the best of my understanding (which could be very wrong) ZFS try
>>>> to solve this issue using a variable length stripe.
>>>>
>>>> Did you mean ZFS record size?
>>>> IIRC that's file extent minimum size, and I didn't see how that can handle
>>>> the write hole problem.
>>>>
>>>> Or did ZFS handle the problem?
>>>
>>> ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds the
>>> parity blocks within extents, so it behaves more like btrfs compression
>>> in the sense that the data in a RAID-Z extent is encoded differently
>> >from the data in the file, and the kernel has to transform it on reads
>>> and writes.
>>>
>>> No ZFS stripe can contain blocks from multiple different
>>> transactions because the RAID-Z stripes begin and end on extent
>>> (single-transaction-write) boundaries, so there is no write hole on ZFS.
>>>
>>> There is some space waste in ZFS because the minimum allocation unit
>>> is two blocks (one data one parity) so any free space that is less
>>> than two blocks long is unusable.  Also the maximum usable stripe width
>>> (number of disks) is the size of the data in the extent plus one parity
>>> block.  It means if you write a lot of discontiguous 4K blocks, you
>>> effectively get 2-disk RAID1 and that may result in disappointing
>>> storage efficiency.
>>>
>>> (the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
>>> for additional parity blocks).
>>>
>>> One could implement RAID-Z on btrfs, but it's by far the most invasive
>>> proposal for fixing btrfs's write hole so far (and doesn't actually fix
>>> anything, since the existing raid56 format would still be required to
>>> read old data, and it would still be broken).
>>>
>>>> Anyway, it should be a low priority thing, and personally speaking,
>>>> any large behavior modification involving  both extent allocator and bg
>>>> allocator will be bug prone.
>>>
>>> My proposal requires only a modification to the extent allocator.
>>> The behavior at the block group layer and scrub remains exactly the same.
>>> We just need to adjust the allocator slightly to take the RAID5 CoW
>>> constraints into account.
>>
>> Then, you'd need to allow btrfs to split large buffered/direct write into
>> small extents(not 128M anymore).
>> Not sure if we need to do extra work for DirectIO.
>
> Nope, that's not my proposal.  My proposal is to simply ignore free
> space whenever it's inside a partially filled raid stripe (optimization:
> ...which was empty at the start of the current transaction).

There are still problems.

The allocator must correctly handle a filesystem that is under device
removal or profile conversion (e.g. from 4-disk raid5 to 5-disk
raid5/6), which already looks complex to me.


Furthermore, for a filesystem with more devices, say a 9-device RAID5,
it would be a disaster for a single 4K write to occupy a whole
8 * 64K stripe.
That would definitely cause huge ENOSPC problems.

If you really think it's easy, make an RFC patch (which should be easy
if it is), then run the fstests auto group on it.

Easy words won't turn emails into a real patch.

>
> That avoids modifying a stripe with committed data and therefore plugs the
> write hole.
>
> For nodatacow, prealloc (and maybe directio?) extents the behavior
> wouldn't change (you'd have write hole, but only on data blocks not
> metadata, and only on files that were already marked as explicitly not
> requiring data integrity).
>
>> And in fact, you're going to support variant max file extent size.
>
> The existing extent sizing behavior is not changed *at all* in my proposal,
> only the allocator's notion of what space is 'free'.
>
> We can write an extent across multiple RAID5 stripes so long as we
> finish writing the entire extent before pointing committed metadata to
> it.  btrfs does that already otherwise checksums wouldn't work.
>
>> This makes delalloc more complex (Wang enhanced dealloc support for variant
>> file extent size, to fix ENOSPC problem for dedupe and compression).
>>
>> This is already much more complex than you expected.
>
> The complexity I anticipate is having to deal with two implementations
> of the free space search, one for free space cache and one for free
> space tree.
>
> It could be as simple as calling the existing allocation functions and
> just filtering out anything that isn't suitably aligned inside a raid56
> block group (at least for a proof of concept).
>
>> And this is the *BIGGEST* problem of current btrfs:
>> No good enough(if there is any) *ISOLATION* for such a complex fs.
>>
>> So even "small" modification can lead to unexpected bugs.
>>
>> That's why I want to isolate the fix in RAID56 layer, not any layer upwards.
>
> I don't think the write hole is fixable in the current raid56 layer, at
> least not without a nasty brute force solution like stripe update journal.
>
> Any of the fixes I'd want to use fix the problem from outside.
>
>> If not possible, I prefer not to do anything yet, until we are sure the very
>> basic part of RAID56 is stable.
>>
>> Thanks,
>> Qu
>>
>>>
>>> It's not as efficient as the ZFS approach, but it doesn't require an
>>> incompatible disk format change either.
>>>
>>>>> On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
>>>>>
>>>>> For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
>>>>> BG #1,composed by two disks (1 data+ 1 parity)
>>>>> BG #2 composed by three disks (2 data + 1 parity)
>>>>> BG #3 composed by four disks (3 data + 1 parity).
>>>>
>>>> Too complicated bg layout and further extent allocator modification.
>>>>
>>>> More code means more bugs, and I'm pretty sure it will be bug prone.
>>>>
>>>>
>>>> Although the idea of variable stripe size can somewhat reduce the problem
>>>> under certain situation.
>>>>
>>>> For example, if sectorsize is 64K, and we make stripe len to 32K, and use 3
>>>> disc RAID5, we can avoid such write hole problem.
>>>> Withouth modification to extent/chunk allocator.
>>>>
>>>> And I'd prefer to make stripe len mkfs time parameter, not possible to
>>>> modify after mkfs. To make things easy.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> If the data to be written has a size of 4k, it will be allocated to the BG #1.
>>>>> If the data to be written has a size of 8k, it will be allocated to the BG #2
>>>>> If the data to be written has a size of 12k, it will be allocated to the BG #3
>>>>> If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
>>>>>
>>>>>
>>>>> To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
>>>>>
>>>>> DISK1 DISK2 DISK3 DISK4
>>>>> S1    S1    S1    S2
>>>>> S2    S2    S3    S3
>>>>> S3    S4    S4    S4
>>>>> [....]
>>>>>
>>>>> Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
>>>>>
>>>>>
>>>>> Pro:
>>>>> - btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
>>>>> - no more RMW are required (== higher performance)
>>>>>
>>>>> Cons:
>>>>> - the data will be more fragmented
>>>>> - the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
>>>>>
>>>>>
>>>>> Thoughts ?
>>>>>
>>>>> BR
>>>>> G.Baroncelli
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: raid with a variable stripe size
  2016-11-29  0:48 ` Qu Wenruo
  2016-11-29  3:53   ` Zygo Blaxell
@ 2016-11-29  5:51   ` Chris Murphy
  2016-11-29  6:03     ` Qu Wenruo
  2016-11-29 18:10   ` Goffredo Baroncelli
  2 siblings, 1 reply; 21+ messages in thread
From: Chris Murphy @ 2016-11-29  5:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Goffredo Baroncelli, linux-btrfs, Zygo Blaxell

On Mon, Nov 28, 2016 at 5:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
>>
>> Hello,
>>
>> these are only my thoughts; no code here, but I would like to share it
>> hoping that it could be useful.
>>
>> As reported several times by Zygo (and others), one of the problem of
>> raid5/6 is the write hole. Today BTRFS is not capable to address it.
>
>
> I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> yet.
>
> Personally speaking, Btrfs should implementing RAID56 support just like
> Btrfs on mdadm.
> See how badly the current RAID56 works?
>
> The marginally benefit of btrfs RAID56 to scrub data better than tradition
> RAID56 is just a joke in current code base.

Btrfs is subject to the write hole problem on disk, but any read or
scrub that needs to reconstruct from parity that is corrupt results in
a checksum error and EIO. So corruption is not passed up to user
space. Recent versions of md/mdadm support a write journal to avoid
the write hole problem on disk in case of a crash.

>> The problem is that the stripe size is bigger than the "sector size" (ok
>> sector is not the correct word, but I am referring to the basic unit of
>> writing on the disk, which is 4k or 16K in btrfs).
>> So when btrfs writes less data than the stripe, the stripe is not filled;
>> when it is filled by a subsequent write, a RMW of the parity is required.
>>
>> On the best of my understanding (which could be very wrong) ZFS try to
>> solve this issue using a variable length stripe.
>
>
> Did you mean ZFS record size?
> IIRC that's file extent minimum size, and I didn't see how that can handle
> the write hole problem.
>
> Or did ZFS handle the problem?

ZFS isn't subject to the write hole. My understanding is that it gets
around this because all writes are CoW, so there is no RMW. The
variable stripe size also means ZFS doesn't have to do the usual
(fixed) full-stripe write for, say, a 4KiB change to a single file.
Btrfs, by contrast, does do RMW in such a case.


> Anyway, it should be a low priority thing, and personally speaking,
> any large behavior modification involving  both extent allocator and bg
> allocator will be bug prone.

I tend to agree. I think the non-scalability of Btrfs raid10, which
makes it behave more like raid 0+1, is a higher priority, because right
now it's misleading to say the least. The longer-term goal for scalable
huge file systems is how Btrfs can shed irreparably damaged parts of
the file system (tree pruning) rather than reconstruct them.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: raid with a variable stripe size
  2016-11-29  5:51   ` Chris Murphy
@ 2016-11-29  6:03     ` Qu Wenruo
  2016-11-29 18:19       ` Goffredo Baroncelli
  2016-11-29 22:54       ` Zygo Blaxell
  0 siblings, 2 replies; 21+ messages in thread
From: Qu Wenruo @ 2016-11-29  6:03 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Goffredo Baroncelli, linux-btrfs, Zygo Blaxell



At 11/29/2016 01:51 PM, Chris Murphy wrote:
> On Mon, Nov 28, 2016 at 5:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
>>>
>>> Hello,
>>>
>>> these are only my thoughts; no code here, but I would like to share it
>>> hoping that it could be useful.
>>>
>>> As reported several times by Zygo (and others), one of the problem of
>>> raid5/6 is the write hole. Today BTRFS is not capable to address it.
>>
>>
>> I'd say, no need to address yet, since current soft RAID5/6 can't handle it
>> yet.
>>
>> Personally speaking, Btrfs should implementing RAID56 support just like
>> Btrfs on mdadm.
>> See how badly the current RAID56 works?
>>
>> The marginally benefit of btrfs RAID56 to scrub data better than tradition
>> RAID56 is just a joke in current code base.
>
> Btrfs is subject to the write hole problem on disk, but any read or
> scrub that needs to reconstruct from parity that is corrupt results in
> a checksum error and EIO. So corruption is not passed up to user
> space. Recent versions of md/mdadm support a write journal to avoid
> the write hole problem on disk in case of a crash.

That's interesting.

So I think it's less worthwhile to support RAID56 in btrfs, especially
considering its stability.

My wildest dream is for btrfs to call device mapper to build a micro
RAID1/5/6/10 device for each chunk, which should save us tons of code
and bugs.

And for better recovery, enhance device mapper to provide an interface
for judging which block is correct.

Although that's just a dream anyway.

Thanks,
Qu
>
>>> The problem is that the stripe size is bigger than the "sector size" (ok
>>> sector is not the correct word, but I am referring to the basic unit of
>>> writing on the disk, which is 4k or 16K in btrfs).
>>> So when btrfs writes less data than the stripe, the stripe is not filled;
>>> when it is filled by a subsequent write, a RMW of the parity is required.
>>>
>>> On the best of my understanding (which could be very wrong) ZFS try to
>>> solve this issue using a variable length stripe.
>>
>>
>> Did you mean ZFS record size?
>> IIRC that's file extent minimum size, and I didn't see how that can handle
>> the write hole problem.
>>
>> Or did ZFS handle the problem?
>
> ZFS isn't subject to the write hole. My understanding is they get
> around this because all writes are COW, there is no RMW.
> But the
> variable stripe size means they don't have to do the usual (fixed)
> full stripe write for just, for example a 4KiB change in data for a
> single file. Conversely Btrfs does do RMW in such a case.
>
>
>> Anyway, it should be a low priority thing, and personally speaking,
>> any large behavior modification involving  both extent allocator and bg
>> allocator will be bug prone.
>
> I tend to agree. I think the non-scalability of Btrfs raid10, which
> makes it behave more like raid 0+1, is a higher priority because right
> now it's misleading to say the least; and then the longer term goal
> for scaleable huge file systems is how Btrfs can shed irreparably
> damaged parts of the file system (tree pruning) rather than
> reconstruction.
>
>
>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: raid with a variable stripe size
  2016-11-29  0:48 ` Qu Wenruo
  2016-11-29  3:53   ` Zygo Blaxell
  2016-11-29  5:51   ` Chris Murphy
@ 2016-11-29 18:10   ` Goffredo Baroncelli
  2 siblings, 0 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2016-11-29 18:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Zygo Blaxell

On 2016-11-29 01:48, Qu Wenruo wrote:
> For example, if sectorsize is 64K, and we make stripe len to 32K, and use 3 disc RAID5, we can avoid such write hole problem.
> Withouth modification to extent/chunk allocator.
> 
> And I'd prefer to make stripe len mkfs time parameter, not possible to modify after mkfs. To make things easy.

This is like Zygo's idea: make sector_size = (ndisk - 1) * stripe_len... If this could be implemented on a per-BG basis, it would answer Zygo's question. Of course, as the number of disks increases the wasted disk space increases too, but for a small RAID5/6 (4/5 disks) it could be an acceptable trade-off.
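
To put rough numbers on that trade-off, here is a trivial sketch (not
btrfs code) that assumes Qu's 32K stripe_len from above and prints the
smallest possible allocation as the disk count grows; the on-disk
column includes the parity strip.

/* Back-of-the-envelope helper, not btrfs code: with
 * sector_size = (ndisk - 1) * stripe_len every write is a full stripe,
 * so the smallest possible allocation grows with the disk count. */
#include <stdio.h>

int main(void)
{
    const unsigned stripe_len = 32 * 1024;  /* Qu's example above */

    for (unsigned ndisk = 3; ndisk <= 9; ndisk++) {
        unsigned sector = (ndisk - 1) * stripe_len;  /* usable data */
        unsigned on_disk = ndisk * stripe_len;       /* data + parity */
        printf("%u disks: sector_size %4u KiB, on-disk %4u KiB\n",
               ndisk, sector / 1024, on_disk / 1024);
    }
    return 0;
}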

Anyway, given that SSDs are the future of storage, I think our thoughts about how to avoid an RMW cycle may not make much sense. The SSD firmware remaps sectors, so what we think of as a "simple write" may hide an RMW, because the erase block is bigger than the disk sector (4k?).

> 
> Thanks,
> Qu


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: raid with a variable stripe size
  2016-11-29  6:03     ` Qu Wenruo
@ 2016-11-29 18:19       ` Goffredo Baroncelli
  2016-11-29 22:54       ` Zygo Blaxell
  1 sibling, 0 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2016-11-29 18:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, linux-btrfs, Zygo Blaxell

On 2016-11-29 07:03, Qu Wenruo wrote:
[...]
>> Btrfs is subject to the write hole problem on disk, but any read or
>> scrub that needs to reconstruct from parity that is corrupt results in
>> a checksum error and EIO. So corruption is not passed up to user
>> space. Recent versions of md/mdadm support a write journal to avoid
>> the write hole problem on disk in case of a crash.
> 
> That's interesting.
> 
> So I think it's less worthy to support RAID56 in btrfs, especially considering the stability.
> 
> My widest dream is, btrfs calls device mapper to build a micro RAID1/5/6/10 device for each chunk.
> Which should save us tons of codes and bugs.
> 
> And for better recovery, enhance device mapper to provide interface to judge which block is correct.
> 
> Although that's just dream anyway.


IIRC in the past this was discussed, although I am not able to find any reference...


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: raid with a variable stripe size
  2016-11-29  5:49         ` Qu Wenruo
@ 2016-11-29 18:47           ` Janos Toth F.
  2016-11-29 22:51           ` Zygo Blaxell
  1 sibling, 0 replies; 21+ messages in thread
From: Janos Toth F. @ 2016-11-29 18:47 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Zygo Blaxell, kreijack, linux-btrfs

I would love to have the stripe element size (the per-disk portion of a
logical "full" stripe) changeable online with balance anyway, starting
from 512 bytes/disk and without placing arbitrary artificial limits on
the low end.
A small stripe element size (for example 4k/disk, or even 512 bytes/disk
if you happen to have HDDs with real 512-byte physical sectors) would
help minimize this temporary space-waste problem a lot: 16-fold if you
go from 64k to 4k, or even completely if you go down to 512 bytes on a
5-disk RAID-5.
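
(Rough arithmetic behind those numbers, assuming a 5-disk RAID-5 where
a full stripe holds 4 data strips: with 64k strips a full stripe
carries 4 * 64k = 256k of data; with 4k strips only 4 * 4k = 16k, hence
the 16-fold reduction in worst-case partial-stripe waste; with 512-byte
strips a full stripe carries 4 * 512 = 2k of data, less than one 4k
btrfs sector, so any single write already fills at least one full
stripe and the temporary waste disappears.)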

And regardless of that, I think having to remember to balance
regularly, or even artificially running out of space from time to time,
is much better than living with a sense of constantly impending doom
(and probably experiencing that disaster for real in your lifetime).

In case you wonder and/or care: ZFS not only allows setting the
parameter closest to the "stripe element size" (the smallest unit that
can be written to a disk at once) to 512 bytes, that is still the
default in many ZFS implementations, with 4k (or more) being only
optional. It's controlled by "ashift" and set statically at pool
creation time, although additional cache/log devices may be added later
with a different ashift. And I like it that way. I never used a bigger
ashift than the one matching the physical sector size of the disks
(usually 512 bytes for HDDs or 4k for SSDs). And I always used the
smallest recordsize (effectively the minimum "full" stripe) I could get
away with before noticeably throttling sustained sequential write
performance. In this regard, I never understood why people tend to
crave huge units like a 1MiB stripe size. Don't they ever store small
files, or read small chunks of big files, or care about latency (and
about minimizing the potential data loss when multiple random sectors
get damaged on multiple disks, or on power failure / kernel panic), at
least as long as benchmarks show it's almost free to go lower...?

On Tue, Nov 29, 2016 at 6:49 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 11/29/2016 12:55 PM, Zygo Blaxell wrote:
>>
>> On Tue, Nov 29, 2016 at 12:12:03PM +0800, Qu Wenruo wrote:
>>>
>>>
>>>
>>> At 11/29/2016 11:53 AM, Zygo Blaxell wrote:
>>>>
>>>> On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:
>>>>>
>>>>> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> these are only my thoughts; no code here, but I would like to share it
>>>>>> hoping that it could be useful.
>>>>>>
>>>>>> As reported several times by Zygo (and others), one of the problem
>>>>>
>>>>> of raid5/6 is the write hole. Today BTRFS is not capable to address it.
>>>>>
>>>>> I'd say, no need to address yet, since current soft RAID5/6 can't
>>>>> handle it
>>>>> yet.
>>>>>
>>>>> Personally speaking, Btrfs should implementing RAID56 support just like
>>>>> Btrfs on mdadm.
>>>>
>>>>
>>>> Even mdadm doesn't implement it the way btrfs does (assuming all bugs
>>>> are fixed) any more.
>>>>
>>>>> See how badly the current RAID56 works?
>>>>
>>>>
>>>>> The marginally benefit of btrfs RAID56 to scrub data better than
>>>>> tradition
>>>>> RAID56 is just a joke in current code base.
>>>>
>>>>
>>>>>> The problem is that the stripe size is bigger than the "sector size"
>>>>>
>>>>> (ok sector is not the correct word, but I am referring to the basic
>>>>> unit of writing on the disk, which is 4k or 16K in btrfs).  >So when
>>>>> btrfs writes less data than the stripe, the stripe is not filled; when
>>>>> it is filled by a subsequent write, a RMW of the parity is required.
>>>>>>
>>>>>>
>>>>>> On the best of my understanding (which could be very wrong) ZFS try
>>>>>
>>>>> to solve this issue using a variable length stripe.
>>>>>
>>>>> Did you mean ZFS record size?
>>>>> IIRC that's file extent minimum size, and I didn't see how that can
>>>>> handle
>>>>> the write hole problem.
>>>>>
>>>>> Or did ZFS handle the problem?
>>>>
>>>>
>>>> ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds
>>>> the
>>>> parity blocks within extents, so it behaves more like btrfs compression
>>>> in the sense that the data in a RAID-Z extent is encoded differently
>>>
>>> >from the data in the file, and the kernel has to transform it on reads
>>>>
>>>> and writes.
>>>>
>>>> No ZFS stripe can contain blocks from multiple different
>>>> transactions because the RAID-Z stripes begin and end on extent
>>>> (single-transaction-write) boundaries, so there is no write hole on ZFS.
>>>>
>>>> There is some space waste in ZFS because the minimum allocation unit
>>>> is two blocks (one data one parity) so any free space that is less
>>>> than two blocks long is unusable.  Also the maximum usable stripe width
>>>> (number of disks) is the size of the data in the extent plus one parity
>>>> block.  It means if you write a lot of discontiguous 4K blocks, you
>>>> effectively get 2-disk RAID1 and that may result in disappointing
>>>> storage efficiency.
>>>>
>>>> (the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
>>>> for additional parity blocks).
>>>>
>>>> One could implement RAID-Z on btrfs, but it's by far the most invasive
>>>> proposal for fixing btrfs's write hole so far (and doesn't actually fix
>>>> anything, since the existing raid56 format would still be required to
>>>> read old data, and it would still be broken).
>>>>
>>>>> Anyway, it should be a low priority thing, and personally speaking,
>>>>> any large behavior modification involving  both extent allocator and bg
>>>>> allocator will be bug prone.
>>>>
>>>>
>>>> My proposal requires only a modification to the extent allocator.
>>>> The behavior at the block group layer and scrub remains exactly the
>>>> same.
>>>> We just need to adjust the allocator slightly to take the RAID5 CoW
>>>> constraints into account.
>>>
>>>
>>> Then, you'd need to allow btrfs to split large buffered/direct write into
>>> small extents(not 128M anymore).
>>> Not sure if we need to do extra work for DirectIO.
>>
>>
>> Nope, that's not my proposal.  My proposal is to simply ignore free
>> space whenever it's inside a partially filled raid stripe (optimization:
>> ...which was empty at the start of the current transaction).
>
>
> Still have problems.
>
> Allocator must handle fs under device remove or profile converting (from 4
> disks raid5 to 5 disk raid5/6) correctly.
> Which already seems complex for me.
>
>
> And further more, for fs with more devices, for example, 9 devices RAID5.
> It will be a disaster to just write a 4K data and take up the whole 8 * 64K
> space.
> It will  definitely cause huge ENOSPC problem.
>
> If you really think it's easy, make a RFC patch, which should be easy if it
> is, then run fstest auto group on it.
>
> Easy words won't turn emails into real patch.
>
>
>>
>> That avoids modifying a stripe with committed data and therefore plugs the
>> write hole.
>>
>> For nodatacow, prealloc (and maybe directio?) extents the behavior
>> wouldn't change (you'd have write hole, but only on data blocks not
>> metadata, and only on files that were already marked as explicitly not
>> requiring data integrity).
>>
>>> And in fact, you're going to support variant max file extent size.
>>
>>
>> The existing extent sizing behavior is not changed *at all* in my
>> proposal,
>> only the allocator's notion of what space is 'free'.
>>
>> We can write an extent across multiple RAID5 stripes so long as we
>> finish writing the entire extent before pointing committed metadata to
>> it.  btrfs does that already otherwise checksums wouldn't work.
>>
>>> This makes delalloc more complex (Wang enhanced dealloc support for
>>> variant
>>> file extent size, to fix ENOSPC problem for dedupe and compression).
>>>
>>> This is already much more complex than you expected.
>>
>>
>> The complexity I anticipate is having to deal with two implementations
>> of the free space search, one for free space cache and one for free
>> space tree.
>>
>> It could be as simple as calling the existing allocation functions and
>> just filtering out anything that isn't suitably aligned inside a raid56
>> block group (at least for a proof of concept).
>>
>>> And this is the *BIGGEST* problem of current btrfs:
>>> No good enough(if there is any) *ISOLATION* for such a complex fs.
>>>
>>> So even "small" modification can lead to unexpected bugs.
>>>
>>> That's why I want to isolate the fix in RAID56 layer, not any layer
>>> upwards.
>>
>>
>> I don't think the write hole is fixable in the current raid56 layer, at
>> least not without a nasty brute force solution like stripe update journal.
>>
>> Any of the fixes I'd want to use fix the problem from outside.
>>
>>> If not possible, I prefer not to do anything yet, until we are sure the
>>> very
>>> basic part of RAID56 is stable.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> It's not as efficient as the ZFS approach, but it doesn't require an
>>>> incompatible disk format change either.
>>>>
>>>>>> On BTRFS this could be achieved using several BGs (== block group or
>>>>>> chunk), one for each stripe size.
>>>>>>
>>>>>> For example, if a filesystem - RAID5 is composed by 4 DISK, the
>>>>>> filesystem should have three BGs:
>>>>>> BG #1,composed by two disks (1 data+ 1 parity)
>>>>>> BG #2 composed by three disks (2 data + 1 parity)
>>>>>> BG #3 composed by four disks (3 data + 1 parity).
>>>>>
>>>>>
>>>>> Too complicated bg layout and further extent allocator modification.
>>>>>
>>>>> More code means more bugs, and I'm pretty sure it will be bug prone.
>>>>>
>>>>>
>>>>> Although the idea of variable stripe size can somewhat reduce the
>>>>> problem
>>>>> under certain situation.
>>>>>
>>>>> For example, if sectorsize is 64K, and we make stripe len to 32K, and
>>>>> use 3
>>>>> disc RAID5, we can avoid such write hole problem.
>>>>> Withouth modification to extent/chunk allocator.
>>>>>
>>>>> And I'd prefer to make stripe len mkfs time parameter, not possible to
>>>>> modify after mkfs. To make things easy.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> If the data to be written has a size of 4k, it will be allocated to
>>>>>> the BG #1.
>>>>>> If the data to be written has a size of 8k, it will be allocated to
>>>>>> the BG #2
>>>>>> If the data to be written has a size of 12k, it will be allocated to
>>>>>> the BG #3
>>>>>> If the data to be written has a size greater than 12k, it will be
>>>>>> allocated to the BG3, until the data fills a full stripes; then the
>>>>>> remainder will be stored in BG #1 or BG #2.
>>>>>>
>>>>>>
>>>>>> To avoid unbalancing of the disk usage, each BG could use all the
>>>>>> disks, even if a stripe uses less disks: i.e
>>>>>>
>>>>>> DISK1 DISK2 DISK3 DISK4
>>>>>> S1    S1    S1    S2
>>>>>> S2    S2    S3    S3
>>>>>> S3    S4    S4    S4
>>>>>> [....]
>>>>>>
>>>>>> Above is show a BG which uses all the four disks, but has a stripe
>>>>>> which spans only 3 disks.
>>>>>>
>>>>>>
>>>>>> Pro:
>>>>>> - btrfs already is capable to handle different BG in the filesystem,
>>>>>> only the allocator has to change
>>>>>> - no more RMW are required (== higher performance)
>>>>>>
>>>>>> Cons:
>>>>>> - the data will be more fragmented
>>>>>> - the filesystem, will have more BGs; this will require time-to time a
>>>>>> re-balance. But is is an issue which we already know (even if may be not
>>>>>> 100% addressed).
>>>>>>
>>>>>>
>>>>>> Thoughts ?
>>>>>>
>>>>>> BR
>>>>>> G.Baroncelli
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: raid with a variable stripe size
  2016-11-29  5:49         ` Qu Wenruo
  2016-11-29 18:47           ` Janos Toth F.
@ 2016-11-29 22:51           ` Zygo Blaxell
  1 sibling, 0 replies; 21+ messages in thread
From: Zygo Blaxell @ 2016-11-29 22:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: kreijack, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 6895 bytes --]

On Tue, Nov 29, 2016 at 01:49:09PM +0800, Qu Wenruo wrote:
> >>>My proposal requires only a modification to the extent allocator.
> >>>The behavior at the block group layer and scrub remains exactly the same.
> >>>We just need to adjust the allocator slightly to take the RAID5 CoW
> >>>constraints into account.
> >>
> >>Then, you'd need to allow btrfs to split large buffered/direct write into
> >>small extents(not 128M anymore).
> >>Not sure if we need to do extra work for DirectIO.
> >
> >Nope, that's not my proposal.  My proposal is to simply ignore free
> >space whenever it's inside a partially filled raid stripe (optimization:
> >...which was empty at the start of the current transaction).
> 
> Still have problems.
> 
> Allocator must handle fs under device remove or profile converting (from 4
> disks raid5 to 5 disk raid5/6) correctly.
> Which already seems complex for me.

Those would be allocations in separate block groups with different stripe
widths.  Already handled in btrfs.

> And further more, for fs with more devices, for example, 9 devices RAID5.
> It will be a disaster to just write a 4K data and take up the whole 8 * 64K
> space.
> It will  definitely cause huge ENOSPC problem.

If you called fsync() after every 4K, yes; otherwise you can just batch
up small writes into full-size stripes.  The worst case isn't common
enough to be a serious problem for a lot of the common RAID5 use cases
(i.e. non-database workloads).  I wouldn't try running a database on
it--I'd use a RAID1 or RAID10 array for that instead, because the other
RAID5 performance issues would be deal-breakers.

On ZFS the same case degenerates into something like btrfs RAID1 over
the 9 disks, which burns over 50% of the space.  More efficient than 
wasting 99% of the space, but still wasteful.
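
(To spell out the worst case on both sides, using the 64K strip size
assumed above: a lone committed 4K write on a 9-device RAID5 would pin
a full stripe holding 8 * 64K = 512K of data space, of which roughly
99% stays unusable until the stripe is filled or rebalanced; the same
lone 4K write on RAID-Z1 allocates one 4K data block plus one 4K parity
block, so at least half of the allocated space goes to parity.)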

> If you really think it's easy, make a RFC patch, which should be easy if it
> is, then run fstest auto group on it.

I plan to when I get time; however, that could be some months in the
future and I don't want to "claim" the task and stop anyone else from
taking a crack at it in the meantime.

> Easy words won't turn emails into real patch.
> 
> >That avoids modifying a stripe with committed data and therefore plugs the
> >write hole.
> >
> >For nodatacow, prealloc (and maybe directio?) extents the behavior
> >wouldn't change (you'd have write hole, but only on data blocks not
> >metadata, and only on files that were already marked as explicitly not
> >requiring data integrity).
> >
> >>And in fact, you're going to support variant max file extent size.
> >
> >The existing extent sizing behavior is not changed *at all* in my proposal,
> >only the allocator's notion of what space is 'free'.
> >
> >We can write an extent across multiple RAID5 stripes so long as we
> >finish writing the entire extent before pointing committed metadata to
> >it.  btrfs does that already otherwise checksums wouldn't work.
> >
> >>This makes delalloc more complex (Wang enhanced dealloc support for variant
> >>file extent size, to fix ENOSPC problem for dedupe and compression).
> >>
> >>This is already much more complex than you expected.
> >
> >The complexity I anticipate is having to deal with two implementations
> >of the free space search, one for free space cache and one for free
> >space tree.
> >
> >It could be as simple as calling the existing allocation functions and
> >just filtering out anything that isn't suitably aligned inside a raid56
> >block group (at least for a proof of concept).
> >
> >>And this is the *BIGGEST* problem of current btrfs:
> >>No good enough(if there is any) *ISOLATION* for such a complex fs.
> >>
> >>So even "small" modification can lead to unexpected bugs.
> >>
> >>That's why I want to isolate the fix in RAID56 layer, not any layer upwards.
> >
> >I don't think the write hole is fixable in the current raid56 layer, at
> >least not without a nasty brute force solution like stripe update journal.
> >
> >Any of the fixes I'd want to use fix the problem from outside.
> >
> >>If not possible, I prefer not to do anything yet, until we are sure the very
> >>basic part of RAID56 is stable.
> >>
> >>Thanks,
> >>Qu
> >>
> >>>
> >>>It's not as efficient as the ZFS approach, but it doesn't require an
> >>>incompatible disk format change either.
> >>>
> >>>>>On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
> >>>>>
> >>>>>For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
> >>>>>BG #1,composed by two disks (1 data+ 1 parity)
> >>>>>BG #2 composed by three disks (2 data + 1 parity)
> >>>>>BG #3 composed by four disks (3 data + 1 parity).
> >>>>
> >>>>Too complicated bg layout and further extent allocator modification.
> >>>>
> >>>>More code means more bugs, and I'm pretty sure it will be bug prone.
> >>>>
> >>>>
> >>>>Although the idea of variable stripe size can somewhat reduce the problem
> >>>>under certain situation.
> >>>>
> >>>>For example, if sectorsize is 64K, and we make stripe len to 32K, and use 3
> >>>>disc RAID5, we can avoid such write hole problem.
> >>>>Withouth modification to extent/chunk allocator.
> >>>>
> >>>>And I'd prefer to make stripe len mkfs time parameter, not possible to
> >>>>modify after mkfs. To make things easy.
> >>>>
> >>>>Thanks,
> >>>>Qu
> >>>>
> >>>>>
> >>>>>If the data to be written has a size of 4k, it will be allocated to the BG #1.
> >>>>>If the data to be written has a size of 8k, it will be allocated to the BG #2
> >>>>>If the data to be written has a size of 12k, it will be allocated to the BG #3
> >>>>>If the data to be written has a size greater than 12k, it will be allocated to the BG3, until the data fills a full stripes; then the remainder will be stored in BG #1 or BG #2.
> >>>>>
> >>>>>
> >>>>>To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
> >>>>>
> >>>>>DISK1 DISK2 DISK3 DISK4
> >>>>>S1    S1    S1    S2
> >>>>>S2    S2    S3    S3
> >>>>>S3    S4    S4    S4
> >>>>>[....]
> >>>>>
> >>>>>Above is show a BG which uses all the four disks, but has a stripe which spans only 3 disks.
> >>>>>
> >>>>>
> >>>>>Pro:
> >>>>>- btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
> >>>>>- no more RMW are required (== higher performance)
> >>>>>
> >>>>>Cons:
> >>>>>- the data will be more fragmented
> >>>>>- the filesystem, will have more BGs; this will require time-to time a re-balance. But is is an issue which we already know (even if may be not 100% addressed).
> >>>>>
> >>>>>
> >>>>>Thoughts ?
> >>>>>
> >>>>>BR
> >>>>>G.Baroncelli
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
> 
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: raid with a variable stripe size
  2016-11-29  6:03     ` Qu Wenruo
  2016-11-29 18:19       ` Goffredo Baroncelli
@ 2016-11-29 22:54       ` Zygo Blaxell
  1 sibling, 0 replies; 21+ messages in thread
From: Zygo Blaxell @ 2016-11-29 22:54 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, Goffredo Baroncelli, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3786 bytes --]

On Tue, Nov 29, 2016 at 02:03:58PM +0800, Qu Wenruo wrote:
> At 11/29/2016 01:51 PM, Chris Murphy wrote:
> >On Mon, Nov 28, 2016 at 5:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> >>
> >>
> >>At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> >>>
> >>>Hello,
> >>>
> >>>these are only my thoughts; no code here, but I would like to share it
> >>>hoping that it could be useful.
> >>>
> >>>As reported several times by Zygo (and others), one of the problem of
> >>>raid5/6 is the write hole. Today BTRFS is not capable to address it.
> >>
> >>
> >>I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> >>yet.
> >>
> >>Personally speaking, Btrfs should implementing RAID56 support just like
> >>Btrfs on mdadm.
> >>See how badly the current RAID56 works?
> >>
> >>The marginally benefit of btrfs RAID56 to scrub data better than tradition
> >>RAID56 is just a joke in current code base.
> >
> >Btrfs is subject to the write hole problem on disk, but any read or
> >scrub that needs to reconstruct from parity that is corrupt results in
> >a checksum error and EIO. So corruption is not passed up to user
> >space. Recent versions of md/mdadm support a write journal to avoid
> >the write hole problem on disk in case of a crash.
> 
> That's interesting.
> 
> So I think it's less worthy to support RAID56 in btrfs, especially
> considering the stability.
> 
> My widest dream is, btrfs calls device mapper to build a micro RAID1/5/6/10
> device for each chunk.
> Which should save us tons of codes and bugs.
> 
> And for better recovery, enhance device mapper to provide interface to judge
> which block is correct.
> 
> Although that's just dream anyway.

It would be nice to do that for balancing.  In many balance cases
(especially device delete and full balance after device add) it's not
necessary to rewrite the data in a block group, only copy it verbatim
to a different physical location (like pvmove does) and update the chunk
tree with the new address when it's done.  No need to rewrite the whole
extent tree.

> Thanks,
> Qu
> >
> >>>The problem is that the stripe size is bigger than the "sector size" (ok
> >>>sector is not the correct word, but I am referring to the basic unit of
> >>>writing on the disk, which is 4k or 16K in btrfs).
> >>>So when btrfs writes less data than the stripe, the stripe is not filled;
> >>>when it is filled by a subsequent write, a RMW of the parity is required.
> >>>
> >>>On the best of my understanding (which could be very wrong) ZFS try to
> >>>solve this issue using a variable length stripe.
> >>
> >>
> >>Did you mean ZFS record size?
> >>IIRC that's file extent minimum size, and I didn't see how that can handle
> >>the write hole problem.
> >>
> >>Or did ZFS handle the problem?
> >
> >ZFS isn't subject to the write hole. My understanding is they get
> >around this because all writes are COW, there is no RMW.
> >But the
> >variable stripe size means they don't have to do the usual (fixed)
> >full stripe write for just, for example a 4KiB change in data for a
> >single file. Conversely Btrfs does do RMW in such a case.
> >
> >
> >>Anyway, it should be a low priority thing, and personally speaking,
> >>any large behavior modification involving  both extent allocator and bg
> >>allocator will be bug prone.
> >
> >I tend to agree. I think the non-scalability of Btrfs raid10, which
> >makes it behave more like raid 0+1, is a higher priority because right
> >now it's misleading to say the least; and then the longer term goal
> >for scaleable huge file systems is how Btrfs can shed irreparably
> >damaged parts of the file system (tree pruning) rather than
> >reconstruction.
> >
> >
> >
> 
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-11-29 22:54 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-18 18:15 RFC: raid with a variable stripe size Goffredo Baroncelli
2016-11-18 20:32 ` Janos Toth F.
2016-11-18 20:51   ` Timofey Titovets
2016-11-18 21:38     ` Janos Toth F.
2016-11-19  8:55   ` Goffredo Baroncelli
2016-11-18 20:34 ` Timofey Titovets
2016-11-19  8:59   ` Goffredo Baroncelli
2016-11-19  8:22 ` Zygo Blaxell
2016-11-19  9:13   ` Goffredo Baroncelli
2016-11-29  0:48 ` Qu Wenruo
2016-11-29  3:53   ` Zygo Blaxell
2016-11-29  4:12     ` Qu Wenruo
2016-11-29  4:55       ` Zygo Blaxell
2016-11-29  5:49         ` Qu Wenruo
2016-11-29 18:47           ` Janos Toth F.
2016-11-29 22:51           ` Zygo Blaxell
2016-11-29  5:51   ` Chris Murphy
2016-11-29  6:03     ` Qu Wenruo
2016-11-29 18:19       ` Goffredo Baroncelli
2016-11-29 22:54       ` Zygo Blaxell
2016-11-29 18:10   ` Goffredo Baroncelli
