linux-btrfs.vger.kernel.org archive mirror
* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
       [not found] <CAL5DHTFA8G5fq=BQ7N7qQimesjygKTiBR2qZF4YBRAsAjB_L5Q@mail.gmail.com>
@ 2019-07-30 16:20 ` Goffredo Baroncelli
  0 siblings, 0 replies; 24+ messages in thread
From: Goffredo Baroncelli @ 2019-07-30 16:20 UTC (permalink / raw)
  To: Torstein Eide; +Cc: linux-btrfs

On 30/07/2019 16.48, Torstein Eide wrote:
> Hi
> Is there any news to implementing journal for raid5/6 writes?
> 
I think that you should ask the ML. I am (or rather was) an occasional contributor rather than an active btrfs developer.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-02 18:47     ` Chris Mason
@ 2018-05-03 19:16       ` Goffredo Baroncelli
  0 siblings, 0 replies; 24+ messages in thread
From: Goffredo Baroncelli @ 2018-05-03 19:16 UTC (permalink / raw)
  To: Chris Mason, Austin S. Hemmelgarn, Roman Mamedov, Liu Bo; +Cc: linux-btrfs

On 08/02/2017 08:47 PM, Chris Mason wrote:
>> I agree, MD pretty much needs a separate device simply because they can't allocate arbitrary space on the other array members.  BTRFS can do that though, and I would actually think that that would be _easier_ to implement than having a separate device.
>>
>> That said, I do think that it would need to be a separate chunk type, because things could get really complicated if the metadata is itself using a parity raid profile.
> 
> Thanks for running with this Liu, I'm reading through all the patches. I do agree that it's better to put the logging into a dedicated chunk type, that way we can have it default to either double or triple mirroring.

Sorry for replying a bit late :-), however it should be sufficient to start the writes at a stripe boundary. For a filesystem this is complicated to guarantee; for a journal, however, it would be much simpler to do.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-23 15:28             ` Chris Murphy
  2017-08-23 15:47               ` Austin S. Hemmelgarn
@ 2017-08-25 13:53               ` Goffredo Baroncelli
  1 sibling, 0 replies; 24+ messages in thread
From: Goffredo Baroncelli @ 2017-08-25 13:53 UTC (permalink / raw)
  To: Chris Murphy, Liu Bo; +Cc: Btrfs BTRFS

On 08/23/2017 05:28 PM, Chris Murphy wrote:
> - dynamically sized stripes, so that writes can always be full stripe
> writes, no overwrites, and atomic
Consider also that a block could be deallocated (e.g. deleting a 4 KiB file). This leaves holes that you cannot fill without an RMW cycle...
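To make the RMW cycle concrete, here is a minimal sketch (illustrative Python, not code from the patch set; all names are invented) of the parity update that a partial-stripe overwrite forces: the parity member must be read and recomputed as P_new = P_old XOR D_old XOR D_new.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

BLOCK = 4096

def rmw_update_parity(p_old: bytes, d_old: bytes, d_new: bytes) -> bytes:
    """Recompute parity after overwriting a single data block in place."""
    return xor_bytes(xor_bytes(p_old, d_old), d_new)

# Two data members and one parity member, one block each, for brevity.
d0 = bytes([1]) * BLOCK
d1 = bytes([2]) * BLOCK
p = xor_bytes(d0, d1)              # parity from an initial full-stripe write

d1_new = bytes([7]) * BLOCK        # partial-stripe overwrite of d1
p = rmw_update_parity(p, d1, d1_new)
assert p == xor_bytes(d0, d1_new)  # parity is consistent again after RMW
```

The read of the old data and parity is exactly the part that a hole left by deallocation cannot avoid.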

BR

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-23 15:28             ` Chris Murphy
@ 2017-08-23 15:47               ` Austin S. Hemmelgarn
  2017-08-25 13:53               ` Goffredo Baroncelli
  1 sibling, 0 replies; 24+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-23 15:47 UTC (permalink / raw)
  To: Chris Murphy, Liu Bo; +Cc: Goffredo Baroncelli, Btrfs BTRFS

On 2017-08-23 11:28, Chris Murphy wrote:
> On Wed, Aug 2, 2017 at 2:27 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
>> On Wed, Aug 02, 2017 at 10:41:30PM +0200, Goffredo Baroncelli wrote:
> 
>>> What I want to understand, is if it is possible to log only the "partial stripe"  RMW cycle.
>>>
>>
>> I think your point is valid if all data is written with datacow.  In
>> case of nodatacow, btrfs does overwrite in place, so a full stripe
>> write may pollute on-disk data after unclean shutdown.  Checksum can
>> detect errors but repair thru raid5 may not recover the correct data.
> 
> What's simpler? raid56 journal for everything (cow, nocow, data,
> metadata), or to apply some limitations to available layouts?
> 
> -  if raid56, then cow only (no such thing as nodatacow)
This will obviously be contentious for certain individuals.
> -  permit raid56 for data bg only, system and metadata can be raid1, or raid10
> 
> I'm hard pressed thinking of a use case where metadata raid56 is
> beneficial over raid10; a metadata heavy workload is not well suited
> for any parity raid. And if it isn't metadata heavy, then chances are
> you don't even need raid10 but raid1 is sufficient.
Until BTRFS gets n-way replication, raid6 remains the only way to 
configure a BTRFS volume to survive more than one device failure.
> 
> Of the more complicated ways to solve it:
> 
> - journal
> - dynamically sized stripes, so that writes can always be full stripe
> writes, no overwrites, and atomic
> - mixed block groups where only sequential full stripe writes use
> raid56 block group; random and smaller writes go in a raid 1 or 10
> block group.
> 



* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-02 20:27           ` Liu Bo
  2017-08-03  4:02             ` Duncan
@ 2017-08-23 15:28             ` Chris Murphy
  2017-08-23 15:47               ` Austin S. Hemmelgarn
  2017-08-25 13:53               ` Goffredo Baroncelli
  1 sibling, 2 replies; 24+ messages in thread
From: Chris Murphy @ 2017-08-23 15:28 UTC (permalink / raw)
  To: Liu Bo; +Cc: Goffredo Baroncelli, Btrfs BTRFS

On Wed, Aug 2, 2017 at 2:27 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Wed, Aug 02, 2017 at 10:41:30PM +0200, Goffredo Baroncelli wrote:

>> What I want to understand, is if it is possible to log only the "partial stripe"  RMW cycle.
>>
>
> I think your point is valid if all data is written with datacow.  In
> case of nodatacow, btrfs does overwrite in place, so a full stripe
> write may pollute on-disk data after unclean shutdown.  Checksum can
> detect errors but repair thru raid5 may not recover the correct data.

What's simpler? raid56 journal for everything (cow, nocow, data,
metadata), or to apply some limitations to available layouts?

-  if raid56, then cow only (no such thing as nodatacow)
-  permit raid56 for data bg only, system and metadata can be raid1, or raid10

I'm hard pressed thinking of a use case where metadata raid56 is
beneficial over raid10; a metadata heavy workload is not well suited
for any parity raid. And if it isn't metadata heavy, then chances are
you don't even need raid10 but raid1 is sufficient.

Of the more complicated ways to solve it:

- journal
- dynamically sized stripes, so that writes can always be full stripe
writes, no overwrites, and atomic
- mixed block groups where only sequential full stripe writes use
raid56 block group; random and smaller writes go in a raid 1 or 10
block group.
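The "dynamically sized stripes" option above can be sketched as follows (a toy illustration under my own assumptions, not anything from the patch set): instead of padding or read-modify-writing the tail of a write, the final stripe simply uses fewer data members, so every stripe written is a full stripe.

```python
def plan_stripes(nr_blocks: int, max_data_disks: int) -> list[int]:
    """Return the data-disk width used by each stripe of a write."""
    widths = []
    remaining = nr_blocks
    while remaining > 0:
        width = min(max_data_disks, remaining)  # shrink the tail stripe
        widths.append(width)
        remaining -= width
    return widths

# A 10-block write on a raid5 with 4 data disks: two full 4-wide stripes,
# then a narrower 2-wide stripe, so no stripe ever needs an RMW cycle.
print(plan_stripes(10, 4))  # [4, 4, 2]
```

The cost, as noted in the follow-ups, is that later deallocations still punch holes that only an RMW cycle (or garbage collection) can reuse.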

-- 
Chris Murphy


* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-03  4:02             ` Duncan
@ 2017-08-03  4:40               ` Goffredo Baroncelli
  0 siblings, 0 replies; 24+ messages in thread
From: Goffredo Baroncelli @ 2017-08-03  4:40 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 2017-08-03 06:02, Duncan wrote:
> Liu Bo posted on Wed, 02 Aug 2017 14:27:21 -0600 as excerpted:
> 
>>>>> It is correct reading this as: all data is written two times ?
> 
> If as is being discussed the log is mirrored by default that'd be three 
> times...

And for raid6 you need to do it 4 times... (!)

> Parity-raid is slow and of course normally has the infamous write hole 
> this patch set is trying to close.  Yes, closing the write hole is 
> possible, but for sure it's going to make the performance bite of parity-
> raid even worse. =:^(

This is the reason for looking for possible optimizations from the beginning: a full stripe write (datacow only) doesn't require logging at all. This could be a big optimization (if you need to write a lot of data, only the head and the tail are NOT full stripes). However, this requires knowing whether the data is [no]cow at the time it is logged, and I think that this is not so simple: possible, but not simple.
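The head/tail observation can be sketched numerically (hedged illustration with invented names, not the patch set's logic): for a large datacow write, only the stripe-misaligned head and tail would need the journal, while the aligned full stripes in the middle could skip it.

```python
STRIPE = 128 * 1024  # 2 data disks x 64 KiB, as in the 3-disk example

def split_write(offset: int, length: int):
    """Return (head, full_stripe_span, tail) byte counts of a write."""
    end = offset + length
    first_full = -(-offset // STRIPE) * STRIPE   # round up to a stripe
    last_full = (end // STRIPE) * STRIPE         # round down to a stripe
    if first_full >= last_full:
        return (length, 0, 0)                    # no full stripe at all
    return (first_full - offset, last_full - first_full, end - last_full)

# A 1 MiB write starting 4 KiB into a stripe: only the head and tail
# (~128 KiB total) would need logging; 7 full stripes would not.
head, full, tail = split_write(4096, 1024 * 1024)
assert full % STRIPE == 0 and head + full + tail == 1024 * 1024
```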

> 
>>>>> Or are logged only the stripes involved by a RMW cycle (i.e. if a
>>>>> stripe is fully written, the log is bypassed )?
>>>>
>>>> For data, only data in bios from high level will be logged, while for
>>>> parity, the whole parity will be logged.
>>>>
>>>> Full stripe write still logs all data and parity, as full stripe
>>>> write may not survive from unclean shutdown.
>>>
>>> Does this matter ? Due to the COW nature of BTRFS if a transaction is
>>> interrupted (by an unclean shutdown) the transaction data are all lost.
>>> Am I missing something ?
>>>
>>> What I want to understand, is if it is possible to log only the
>>> "partial stripe"  RMW cycle.
>>>
>>>
>> I think your point is valid if all data is written with datacow.  In
>> case of nodatacow, btrfs does overwrite in place, so a full stripe write
>> may pollute on-disk data after unclean shutdown.  Checksum can detect
>> errors but repair thru raid5 may not recover the correct data.
> 
> But nodatacow doesn't have checksum...

True, but Liu is correct in stating that a "nocow" write is not protected by a transaction.
The funny part is that in the case of raid5 we need to duplicate the written data for the nocow case, whereas for the cow case it would be possible to avoid it (in the full-stripe case)!


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-02 20:27           ` Liu Bo
@ 2017-08-03  4:02             ` Duncan
  2017-08-03  4:40               ` Goffredo Baroncelli
  2017-08-23 15:28             ` Chris Murphy
  1 sibling, 1 reply; 24+ messages in thread
From: Duncan @ 2017-08-03  4:02 UTC (permalink / raw)
  To: linux-btrfs

Liu Bo posted on Wed, 02 Aug 2017 14:27:21 -0600 as excerpted:

>> >> It is correct reading this as: all data is written two times ?

If as is being discussed the log is mirrored by default that'd be three 
times...

Parity-raid is slow and of course normally has the infamous write hole 
this patch set is trying to close.  Yes, closing the write hole is 
possible, but for sure it's going to make the performance bite of parity-
raid even worse. =:^(

>> >> Or are logged only the stripes involved by a RMW cycle (i.e. if a
>> >> stripe is fully written, the log is bypassed )?
>> > 
>> > For data, only data in bios from high level will be logged, while for
>> > parity, the whole parity will be logged.
>> > 
>> > Full stripe write still logs all data and parity, as full stripe
>> > write may not survive from unclean shutdown.
>> 
>> Does this matter ? Due to the COW nature of BTRFS if a transaction is
>> interrupted (by an unclean shutdown) the transaction data are all lost.
>> Am I missing something ?
>> 
>> What I want to understand, is if it is possible to log only the
>> "partial stripe"  RMW cycle.
>>
>>
> I think your point is valid if all data is written with datacow.  In
> case of nodatacow, btrfs does overwrite in place, so a full stripe write
> may pollute on-disk data after unclean shutdown.  Checksum can detect
> errors but repair thru raid5 may not recover the correct data.

But nodatacow doesn't have checksum...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-02 17:57       ` Liu Bo
@ 2017-08-02 20:41         ` Goffredo Baroncelli
  2017-08-02 20:27           ` Liu Bo
  0 siblings, 1 reply; 24+ messages in thread
From: Goffredo Baroncelli @ 2017-08-02 20:41 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs

Hi Liu,

thanks for your reply, below my comments
On 2017-08-02 19:57, Liu Bo wrote:
> On Wed, Aug 02, 2017 at 12:14:27AM +0200, Goffredo Baroncelli wrote:
>> On 2017-08-01 19:24, Liu Bo wrote:
>>> On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
>>>> Hi Liu,
>>>>
>>>> On 2017-08-01 18:14, Liu Bo wrote:
>>>>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
>>>>> separate disk as a journal (aka raid5/6 log), so that after unclean
>>>>> shutdown we can make sure data and parity are consistent on the raid
>>>>> array by replaying the journal.
>>>>>
>>>>
>>>> it would be possible to have more information ?
>>>> - what is logged ? data, parity or data + parity ?
>>>
>>> Patch 5 has more details(sorry for not making it clear that in the
>>> cover letter).
>>>
>>> So both data and parity are logged so that while replaying the journal
>>> everything is written to whichever disk it should be written to.
>>
>> It is correct reading this as: all data is written two times ? Or are logged only the stripes involved by a RMW cycle (i.e. if a stripe is fully written, the log is bypassed )?
> 
> For data, only data in bios from high level will be logged, while for
> parity, the whole parity will be logged.
> 
> Full stripe write still logs all data and parity, as full stripe write
> may not survive from unclean shutdown.

Does this matter? Due to the COW nature of BTRFS, if a transaction is interrupted (by an unclean shutdown) the transaction's data is all lost anyway. Am I missing something?

What I want to understand is whether it is possible to log only the "partial stripe" RMW cycle.

> 
> Taking a raid5 setup with 3 disks as an example, doing an overwrite
> of 4k will log 4K(data) + 64K(parity).
> 
>>>
>>>> - in the past I thought that it would be sufficient to log only the stripe position involved by a RMW cycle, and then start a scrub on these stripes in case of an unclean shutdown: do you think that it is feasible ?
>>>
>>> An unclean shutdown causes inconsistence between data and parity, so
>>> scrub won't help as it's not able to tell which one (data or parity)
>>> is valid
>> Scrub compares data against its checksum; so it knows if the data is correct. If no disk is lost, a scrub process is sufficient/needed to rebuild the parity/data.
>>
> 
> If no disk is lost, it depends on whether the number of errors caused
> by an unclean shutdown can be tolerated by the raid setup.

see below
> 
>> The problem born when after "an unclean shutdown" a disk failure happens. But these  are *two* distinct failures. These together break the BTRFS raid5 redundancy. But if you run a scrub process between these two failures, the btrfs raid5 redundancy is still effective.
>>
> 
> I wouldn't say that the redundancy is still effective after a scrub
> process, but rather those data which match their checksum can still be
> read out while the mismatched data are lost forever after unclean
> shutdown.


I think that this is the point where we are in disagreement: until now I understood that in BTRFS
a) a transaction is either fully completed or fully not completed;
b) a transaction is completed only after both the data *and* the parity are written.

With these assumptions, due to the COW nature of BTRFS, an unclean shutdown can invalidate only data of the current transaction. Of course the unclean shutdown prevents the transaction from being completed, and this means that all the data of this transaction is lost in any case.

For the parity this is different, because a misalignment between the parity and the data (which might belong to different transactions) is possible.

Let me explain with the help of your example:

> Taking a raid5 setup with 3 disks as an example, doing an overwrite
> of 4k will log 4K(data) + 64K(parity).

If the transaction is aborted, 128K - 4K = 124K are untouched, and these are still valid. The last 4K might be wrong, but in any case this data is not referenced, because the transaction was never completed.
The parity needs to be rebuilt, because we are not able to know whether the transaction was aborted before or after the data and/or parity writes.
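This scenario can be modeled in a few lines (a hedged sketch with invented names, not btrfs code): a torn partial-stripe write leaves the parity stale, so losing a disk before a scrub reconstructs garbage; a scrub between the two failures restores the redundancy.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1 = b"\x01" * 8, b"\x02" * 8
parity = xor_bytes(d0, d1)          # consistent stripe

# Unclean shutdown: the new d1 hit the disk, the parity update did not.
d1 = b"\x07" * 8                    # overwritten in place, parity now stale

# Losing d0 now and rebuilding it from d1 ^ parity yields garbage:
assert xor_bytes(d1, parity) != d0

# Scrub between the two failures: checksums confirm d0 and d1, so the
# parity is recomputed and the raid5 redundancy is effective again.
parity = xor_bytes(d0, d1)
assert xor_bytes(d1, parity) == d0
```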


> 
> Thanks,
> 
> -liubo
>>
>>>
>>> With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
>>> With datacow, we don't do overwrite, but the following situation may happen,
>>> say we have a raid5 setup with 3 disks, the stripe length is 64k, so
>>>
>>> 1) write 64K  --> now the raid layout is
>>> [64K data + 64K random + 64K parity]
>>> 2) write another 64K --> now the raid layout after RMW is
>>> [64K 1)'s data + 64K 2)'s data + 64K new parity]
>>>
>>> If unclean shutdown occurs before 2) finishes, then parity may be
>>> corrupted and then 1)'s data may be recovered wrongly if the disk
>>> which holds 1)'s data is offline.
>>>
>>>> - does this journal disk also host other btrfs log ?
>>>>
>>>
>>> No, purely data/parity and some associated metadata.
>>>
>>> Thanks,
>>>
>>> -liubo
>>>
>>>>> The idea and the code are similar to the write-through mode of md
>>>>> raid5-cache, so ppl(partial parity log) is also feasible to implement.
>>>>> (If you've been familiar with md, you may find this patch set is
>>>>> boring to read...)
>>>>>
>>>>> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
>>>>> the implementation, the rest patches are improvements and bugfixes,
>>>>> eg. readahead for recovery, checksum.
>>>>>
>>>>> Two btrfs-progs patches are required to play with this patch set, one
>>>>> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
>>>>> option '-L', the other is to teach 'btrfs-show-super' to show
>>>>> %journal_tail.
>>>>>
>>>>> This is currently based on 4.12-rc3.
>>>>>
>>>>> The patch set is tagged with RFC, and comments are always welcome,
>>>>> thanks.
>>>>>
>>>>> Known limitations:
>>>>> - Deleting a log device is not implemented yet.
>>>>>
>>>>>
>>>>> Liu Bo (14):
>>>>>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
>>>>>   Btrfs: raid56: do not allocate chunk on raid56 log
>>>>>   Btrfs: raid56: detect raid56 log on mount
>>>>>   Btrfs: raid56: add verbose debug
>>>>>   Btrfs: raid56: add stripe log for raid5/6
>>>>>   Btrfs: raid56: add reclaim support
>>>>>   Btrfs: raid56: load r5log
>>>>>   Btrfs: raid56: log recovery
>>>>>   Btrfs: raid56: add readahead for recovery
>>>>>   Btrfs: raid56: use the readahead helper to get page
>>>>>   Btrfs: raid56: add csum support
>>>>>   Btrfs: raid56: fix error handling while adding a log device
>>>>>   Btrfs: raid56: initialize raid5/6 log after adding it
>>>>>   Btrfs: raid56: maintain IO order on raid5/6 log
>>>>>
>>>>>  fs/btrfs/ctree.h                |   16 +-
>>>>>  fs/btrfs/disk-io.c              |   16 +
>>>>>  fs/btrfs/ioctl.c                |   48 +-
>>>>>  fs/btrfs/raid56.c               | 1429 ++++++++++++++++++++++++++++++++++-----
>>>>>  fs/btrfs/raid56.h               |   82 +++
>>>>>  fs/btrfs/transaction.c          |    2 +
>>>>>  fs/btrfs/volumes.c              |   56 +-
>>>>>  fs/btrfs/volumes.h              |    7 +-
>>>>>  include/uapi/linux/btrfs.h      |    3 +
>>>>>  include/uapi/linux/btrfs_tree.h |    4 +
>>>>>  10 files changed, 1487 insertions(+), 176 deletions(-)
>>>>>
>>>>
>>>>
>>>> -- 
>>>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-02 20:41         ` Goffredo Baroncelli
@ 2017-08-02 20:27           ` Liu Bo
  2017-08-03  4:02             ` Duncan
  2017-08-23 15:28             ` Chris Murphy
  0 siblings, 2 replies; 24+ messages in thread
From: Liu Bo @ 2017-08-02 20:27 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

On Wed, Aug 02, 2017 at 10:41:30PM +0200, Goffredo Baroncelli wrote:
> Hi Liu,
> 
> thanks for your reply, below my comments
> On 2017-08-02 19:57, Liu Bo wrote:
> > On Wed, Aug 02, 2017 at 12:14:27AM +0200, Goffredo Baroncelli wrote:
> >> On 2017-08-01 19:24, Liu Bo wrote:
> >>> On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
> >>>> Hi Liu,
> >>>>
> >>>> On 2017-08-01 18:14, Liu Bo wrote:
> >>>>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> >>>>> separate disk as a journal (aka raid5/6 log), so that after unclean
> >>>>> shutdown we can make sure data and parity are consistent on the raid
> >>>>> array by replaying the journal.
> >>>>>
> >>>>
> >>>> it would be possible to have more information ?
> >>>> - what is logged ? data, parity or data + parity ?
> >>>
> >>> Patch 5 has more details(sorry for not making it clear that in the
> >>> cover letter).
> >>>
> >>> So both data and parity are logged so that while replaying the journal
> >>> everything is written to whichever disk it should be written to.
> >>
> >> It is correct reading this as: all data is written two times ? Or are logged only the stripes involved by a RMW cycle (i.e. if a stripe is fully written, the log is bypassed )?
> > 
> > For data, only data in bios from high level will be logged, while for
> > parity, the whole parity will be logged.
> > 
> > Full stripe write still logs all data and parity, as full stripe write
> > may not survive from unclean shutdown.
> 
> Does this matter ? Due to the COW nature of BTRFS if a transaction is interrupted (by an unclean shutdown) the transaction data are all lost. Am I missing something ?
> 
> What I want to understand, is if it is possible to log only the "partial stripe"  RMW cycle.
>

I think your point is valid if all data is written with datacow.  In
case of nodatacow, btrfs does overwrite in place, so a full stripe
write may pollute on-disk data after unclean shutdown.  Checksum can
detect errors but repair thru raid5 may not recover the correct data.

> > 
> > Taking a raid5 setup with 3 disks as an example, doing an overwrite
> > of 4k will log 4K(data) + 64K(parity).
> > 
> >>>
> >>>> - in the past I thought that it would be sufficient to log only the stripe position involved by a RMW cycle, and then start a scrub on these stripes in case of an unclean shutdown: do you think that it is feasible ?
> >>>
> >>> An unclean shutdown causes inconsistence between data and parity, so
> >>> scrub won't help as it's not able to tell which one (data or parity)
> >>> is valid
> >> Scrub compares data against its checksum; so it knows if the data is correct. If no disk is lost, a scrub process is sufficient/needed to rebuild the parity/data.
> >>
> > 
> > If no disk is lost, it depends on whether the number of errors caused
> > by an unclean shutdown can be tolerated by the raid setup.
> 
> see below
> > 
> >> The problem born when after "an unclean shutdown" a disk failure happens. But these  are *two* distinct failures. These together break the BTRFS raid5 redundancy. But if you run a scrub process between these two failures, the btrfs raid5 redundancy is still effective.
> >>
> > 
> > I wouldn't say that the redundancy is still effective after a scrub
> > process, but rather those data which match their checksum can still be
> > read out while the mismatched data are lost forever after unclean
> > shutdown.
> 
> 
> I think that this is the point where we are in disagreement: until now I understood that in BTRFS
> a) a transaction is fully completed or fully not-completed. 
> b) a transaction is completed after both the data *and* the parity are written.
> 
> With these assumption, due to the COW nature of BTRFS an unclean shutdown might invalidate only data of the current transaction. Of course the unclean shutdown prevent the transaction to be completed, and this means that all the data of this transaction is lost in any case.
> 
> For the parity this is different, because it is possible a misalignment between the parity and the data (which might be of different transactions).
> 
> Let me to explain with the help of your example:
> 
> > Taking a raid5 setup with 3 disks as an example, doing an overwrite
> > of 4k will log 4K(data) + 64K(parity).
> 
> If the transaction is aborted, 128k-4k = 124k are untouched, and these still be valid. The last 4k might be wrong, but in any case this data is not referenced because the transaction was never completed. 
> The parity need to be rebuild because we are not able to know if the transaction was aborted before/after the data and/or parity writing
>

True, the 4K of data is not referenced, but again, until the parity is rebuilt, the remaining 124K together with the 4K holding random data are not consistent with the on-disk parity.

The point is to keep parity and data consistent at any point in time so that the raid5 tolerance remains valid.
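The write-through journal keeps that invariant roughly like this toy model (sketch only; the real patch set's on-disk format and functions differ, and all names here are invented): data and parity reach the log and are flushed before any in-place write, so replaying the log after a crash always leaves the stripe consistent.

```python
journal: list[tuple[str, bytes]] = []   # (target member, payload) records
disks = {"d1": b"\x02" * 8, "p": b"\x03" * 8}

def log_and_write(updates: dict[str, bytes], crash_before_inplace: bool):
    journal.extend(updates.items())     # 1. append to log and flush (modeled)
    if crash_before_inplace:
        return                          # 2. crash: in-place writes never land
    disks.update(updates)               # 3. safe in-place overwrite
    journal.clear()                     # 4. checkpoint: reclaim log space

def replay():                           # crash recovery
    disks.update(dict(journal))
    journal.clear()

# Crash after logging but before the in-place write; replay completes it.
log_and_write({"d1": b"\x07" * 8, "p": b"\x06" * 8}, crash_before_inplace=True)
replay()
assert disks == {"d1": b"\x07" * 8, "p": b"\x06" * 8}
```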

Thanks,

-liubo

> > 
> > Thanks,
> > 
> > -liubo
> >>
> >>>
> >>> With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
> >>> With datacow, we don't do overwrite, but the following situation may happen,
> >>> say we have a raid5 setup with 3 disks, the stripe length is 64k, so
> >>>
> >>> 1) write 64K  --> now the raid layout is
> >>> [64K data + 64K random + 64K parity]
> >>> 2) write another 64K --> now the raid layout after RMW is
> >>> [64K 1)'s data + 64K 2)'s data + 64K new parity]
> >>>
> >>> If unclean shutdown occurs before 2) finishes, then parity may be
> >>> corrupted and then 1)'s data may be recovered wrongly if the disk
> >>> which holds 1)'s data is offline.
> >>>
> >>>> - does this journal disk also host other btrfs log ?
> >>>>
> >>>
> >>> No, purely data/parity and some associated metadata.
> >>>
> >>> Thanks,
> >>>
> >>> -liubo
> >>>
> >>>>> The idea and the code are similar to the write-through mode of md
> >>>>> raid5-cache, so ppl(partial parity log) is also feasible to implement.
> >>>>> (If you've been familiar with md, you may find this patch set is
> >>>>> boring to read...)
> >>>>>
> >>>>> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> >>>>> the implementation, the rest patches are improvements and bugfixes,
> >>>>> eg. readahead for recovery, checksum.
> >>>>>
> >>>>> Two btrfs-progs patches are required to play with this patch set, one
> >>>>> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> >>>>> option '-L', the other is to teach 'btrfs-show-super' to show
> >>>>> %journal_tail.
> >>>>>
> >>>>> This is currently based on 4.12-rc3.
> >>>>>
> >>>>> The patch set is tagged with RFC, and comments are always welcome,
> >>>>> thanks.
> >>>>>
> >>>>> Known limitations:
> >>>>> - Deleting a log device is not implemented yet.
> >>>>>
> >>>>>
> >>>>> Liu Bo (14):
> >>>>>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
> >>>>>   Btrfs: raid56: do not allocate chunk on raid56 log
> >>>>>   Btrfs: raid56: detect raid56 log on mount
> >>>>>   Btrfs: raid56: add verbose debug
> >>>>>   Btrfs: raid56: add stripe log for raid5/6
> >>>>>   Btrfs: raid56: add reclaim support
> >>>>>   Btrfs: raid56: load r5log
> >>>>>   Btrfs: raid56: log recovery
> >>>>>   Btrfs: raid56: add readahead for recovery
> >>>>>   Btrfs: raid56: use the readahead helper to get page
> >>>>>   Btrfs: raid56: add csum support
> >>>>>   Btrfs: raid56: fix error handling while adding a log device
> >>>>>   Btrfs: raid56: initialize raid5/6 log after adding it
> >>>>>   Btrfs: raid56: maintain IO order on raid5/6 log
> >>>>>
> >>>>>  fs/btrfs/ctree.h                |   16 +-
> >>>>>  fs/btrfs/disk-io.c              |   16 +
> >>>>>  fs/btrfs/ioctl.c                |   48 +-
> >>>>>  fs/btrfs/raid56.c               | 1429 ++++++++++++++++++++++++++++++++++-----
> >>>>>  fs/btrfs/raid56.h               |   82 +++
> >>>>>  fs/btrfs/transaction.c          |    2 +
> >>>>>  fs/btrfs/volumes.c              |   56 +-
> >>>>>  fs/btrfs/volumes.h              |    7 +-
> >>>>>  include/uapi/linux/btrfs.h      |    3 +
> >>>>>  include/uapi/linux/btrfs_tree.h |    4 +
> >>>>>  10 files changed, 1487 insertions(+), 176 deletions(-)
> >>>>>
> >>>>
> >>>>
> >>>> -- 
> >>>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> >>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>
> >>
> >> -- 
> >> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> >> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> >>
> > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 17:39   ` Austin S. Hemmelgarn
  2017-08-01 17:07     ` Liu Bo
@ 2017-08-02 18:47     ` Chris Mason
  2018-05-03 19:16       ` Goffredo Baroncelli
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Mason @ 2017-08-02 18:47 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Roman Mamedov, Liu Bo; +Cc: linux-btrfs



On 08/01/2017 01:39 PM, Austin S. Hemmelgarn wrote:
> On 2017-08-01 13:25, Roman Mamedov wrote:
>> On Tue,  1 Aug 2017 10:14:23 -0600
>> Liu Bo <bo.li.liu@oracle.com> wrote:
>>
>>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
>>> separate disk as a journal (aka raid5/6 log), so that after unclean
>>> shutdown we can make sure data and parity are consistent on the raid
>>> array by replaying the journal.
>>
>> Could it be possible to designate areas on the in-array devices to be 
>> used as
>> journal?
>>
>> While md doesn't have much spare room in its metadata for extraneous 
>> things
>> like this, Btrfs could use almost as much as it wants to, adding to 
>> size of the
>> FS metadata areas. Reliability-wise, the log could be stored as RAID1 
>> chunks.
>>
>> It doesn't seem convenient to need having an additional storage device 
>> around
>> just for the log, and also needing to maintain its fault tolerance 
>> yourself (so
>> the log device would better be on a mirror, such as mdadm RAID1? more 
>> expense
>> and maintenance complexity).
>>
> I agree, MD pretty much needs a separate device simply because they 
> can't allocate arbitrary space on the other array members.  BTRFS can do 
> that though, and I would actually think that that would be _easier_ to 
> implement than having a separate device.
> 
> That said, I do think that it would need to be a separate chunk type, 
> because things could get really complicated if the metadata is itself 
> using a parity raid profile.

Thanks for running with this Liu, I'm reading through all the patches. 
I do agree that it's better to put the logging into a dedicated chunk 
type, that way we can have it default to either double or triple mirroring.

-chris



* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 22:14     ` Goffredo Baroncelli
@ 2017-08-02 17:57       ` Liu Bo
  2017-08-02 20:41         ` Goffredo Baroncelli
  0 siblings, 1 reply; 24+ messages in thread
From: Liu Bo @ 2017-08-02 17:57 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

On Wed, Aug 02, 2017 at 12:14:27AM +0200, Goffredo Baroncelli wrote:
> On 2017-08-01 19:24, Liu Bo wrote:
> > On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
> >> Hi Liu,
> >>
> >> On 2017-08-01 18:14, Liu Bo wrote:
> >>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> >>> separate disk as a journal (aka raid5/6 log), so that after unclean
> >>> shutdown we can make sure data and parity are consistent on the raid
> >>> array by replaying the journal.
> >>>
> >>
> >> Would it be possible to have more information?
> >> - what is logged? data, parity, or data + parity?
> > 
> > Patch 5 has more details(sorry for not making it clear that in the
> > cover letter).
> > 
> > So both data and parity are logged so that while replaying the journal
> > everything is written to whichever disk it should be written to.
> 
> Is it correct to read this as: all data is written twice? Or are only the stripes involved in a RMW cycle logged (i.e. if a stripe is fully written, the log is bypassed)?

For data, only the data in bios from the upper layers is logged, while
for parity, the whole parity strip is logged.

A full stripe write still logs all data and parity, as a full stripe
write may not survive an unclean shutdown.

Taking a raid5 setup with 3 disks as an example, an overwrite of 4K
will log 4K (data) + 64K (parity).
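A back-of-the-envelope sketch of the resulting write amplification (my own
illustration, not code from the patch set; the 64K strip size is just the
example value above):

```python
STRIPE_LEN = 64 * 1024  # per-device strip size used in the example above

def logged_bytes(dirty_data_bytes, stripe_len=STRIPE_LEN):
    """Journal payload for one sub-stripe write: the dirtied data pages
    plus the whole recomputed parity strip."""
    return dirty_data_bytes + stripe_len

# A 4K overwrite on a 3-disk raid5 logs 4K of data plus 64K of parity:
print(logged_bytes(4 * 1024) // 1024, "KiB")  # 68 KiB
```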

> > 
> >> - in the past I thought that it would be sufficient to log only the stripe positions involved in a RMW cycle, and then start a scrub on those stripes in case of an unclean shutdown: do you think that is feasible?
> > 
> > An unclean shutdown causes an inconsistency between data and parity,
> > so scrub won't help as it's not able to tell which one (data or
> > parity) is valid.
> Scrub compares data against its checksum, so it knows whether the data is correct. If no disk is lost, a scrub is sufficient (and needed) to rebuild the parity/data.
>

If no disk is lost, it depends on whether the number of errors caused
by an unclean shutdown can be tolerated by the raid setup.

> The problem arises when a disk failure happens after an "unclean shutdown". But these are *two* distinct failures. Together they break the BTRFS raid5 redundancy. But if you run a scrub process between these two failures, the btrfs raid5 redundancy is still effective.
>

I wouldn't say that the redundancy is still effective after a scrub
process, but rather that data which matches its checksum can still be
read out, while the mismatched data is lost forever after an unclean
shutdown.

Thanks,

-liubo
> 
> > 
> > With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
> > With datacow, we don't do overwrite, but the following situation may happen,
> > say we have a raid5 setup with 3 disks, the stripe length is 64k, so
> > 
> > 1) write 64K  --> now the raid layout is
> > [64K data + 64K random + 64K parity]
> > 2) write another 64K --> now the raid layout after RMW is
> > [64K 1)'s data + 64K 2)'s data + 64K new parity]
> > 
> > If an unclean shutdown occurs before 2) finishes, then the parity may
> > be corrupted, and 1)'s data may then be recovered wrongly if the disk
> > which holds 1)'s data is offline.
> > 
> >> - does this journal disk also host other btrfs logs?
> >>
> > 
> > No, purely data/parity and some associated metadata.
> > 
> > Thanks,
> > 
> > -liubo
> > 
> >>> The idea and the code are similar to the write-through mode of md
> >>> raid5-cache, so ppl(partial parity log) is also feasible to implement.
> >>> (If you've been familiar with md, you may find this patch set is
> >>> boring to read...)
> >>>
> >>> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> >>> the implementation, the rest patches are improvements and bugfixes,
> >>> eg. readahead for recovery, checksum.
> >>>
> >>> Two btrfs-progs patches are required to play with this patch set, one
> >>> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> >>> option '-L', the other is to teach 'btrfs-show-super' to show
> >>> %journal_tail.
> >>>
> >>> This is currently based on 4.12-rc3.
> >>>
> >>> The patch set is tagged with RFC, and comments are always welcome,
> >>> thanks.
> >>>
> >>> Known limitations:
> >>> - Deleting a log device is not implemented yet.
> >>>
> >>>
> >>> Liu Bo (14):
> >>>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
> >>>   Btrfs: raid56: do not allocate chunk on raid56 log
> >>>   Btrfs: raid56: detect raid56 log on mount
> >>>   Btrfs: raid56: add verbose debug
> >>>   Btrfs: raid56: add stripe log for raid5/6
> >>>   Btrfs: raid56: add reclaim support
> >>>   Btrfs: raid56: load r5log
> >>>   Btrfs: raid56: log recovery
> >>>   Btrfs: raid56: add readahead for recovery
> >>>   Btrfs: raid56: use the readahead helper to get page
> >>>   Btrfs: raid56: add csum support
> >>>   Btrfs: raid56: fix error handling while adding a log device
> >>>   Btrfs: raid56: initialize raid5/6 log after adding it
> >>>   Btrfs: raid56: maintain IO order on raid5/6 log
> >>>
> >>>  fs/btrfs/ctree.h                |   16 +-
> >>>  fs/btrfs/disk-io.c              |   16 +
> >>>  fs/btrfs/ioctl.c                |   48 +-
> >>>  fs/btrfs/raid56.c               | 1429 ++++++++++++++++++++++++++++++++++-----
> >>>  fs/btrfs/raid56.h               |   82 +++
> >>>  fs/btrfs/transaction.c          |    2 +
> >>>  fs/btrfs/volumes.c              |   56 +-
> >>>  fs/btrfs/volumes.h              |    7 +-
> >>>  include/uapi/linux/btrfs.h      |    3 +
> >>>  include/uapi/linux/btrfs_tree.h |    4 +
> >>>  10 files changed, 1487 insertions(+), 176 deletions(-)
> >>>
> >>
> >>
> >> -- 
> >> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> >> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 21:00 ` Christoph Anton Mitterer
@ 2017-08-01 22:24   ` Goffredo Baroncelli
  0 siblings, 0 replies; 24+ messages in thread
From: Goffredo Baroncelli @ 2017-08-01 22:24 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs

On 2017-08-01 23:00, Christoph Anton Mitterer wrote:
> Hi.
> 
> Stupid question:
> Would the write hole be closed already, if parity was checksummed?

No.
The write hole problem is due to a combination of two things:
a) misalignment between parity and data (i.e. an unclean shutdown)
b) loss of a disk (i.e. a disk failure)

Note 1: the write hole problem happens even if these two events are not consecutive.

After the disk failure, when you need to read data from the broken disk, you need the parity to compute it. But if the parity is misaligned, wrong data is returned.

The data checksums are sufficient to detect that wrong data has been returned. A parity checksum is not needed. In any case, neither can avoid the problem.
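A toy sketch of how a) and b) combine (my own illustration, not part of the
patch set; 4-byte "strips" and sha256 stand in for the real strip size and
the btrfs checksums):

```python
import hashlib
from functools import reduce
from operator import xor as bxor

def xor(*blocks):
    # byte-wise XOR across strips, i.e. raid5 parity
    return bytes(reduce(bxor, t) for t in zip(*blocks))

d0, d1 = b"AAAA", b"BBBB"          # two data strips
parity = xor(d0, d1)               # parity strip
csum_d0 = hashlib.sha256(d0).digest()

# a) unclean shutdown: d1 is rewritten but the parity update never lands
d1 = b"CCCC"                       # parity on disk is now misaligned

# b) the disk holding d0 dies; rebuild d0 from d1 and the stale parity
rebuilt_d0 = xor(d1, parity)

# The data checksum detects that wrong data came back...
print(hashlib.sha256(rebuilt_d0).digest() == csum_d0)  # False
# ...but with d0's disk gone there is nothing left to repair it from.
```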


> Cheers,
> Chris.
> 

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 17:24   ` Liu Bo
@ 2017-08-01 22:14     ` Goffredo Baroncelli
  2017-08-02 17:57       ` Liu Bo
  0 siblings, 1 reply; 24+ messages in thread
From: Goffredo Baroncelli @ 2017-08-01 22:14 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs

On 2017-08-01 19:24, Liu Bo wrote:
> On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
>> Hi Liu,
>>
>> On 2017-08-01 18:14, Liu Bo wrote:
>>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
>>> separate disk as a journal (aka raid5/6 log), so that after unclean
>>> shutdown we can make sure data and parity are consistent on the raid
>>> array by replaying the journal.
>>>
>>
>> Would it be possible to have more information?
>> - what is logged? data, parity, or data + parity?
> 
> Patch 5 has more details(sorry for not making it clear that in the
> cover letter).
> 
> So both data and parity are logged so that while replaying the journal
> everything is written to whichever disk it should be written to.

Is it correct to read this as: all data is written twice? Or are only the stripes involved in a RMW cycle logged (i.e. if a stripe is fully written, the log is bypassed)?
> 
>> - in the past I thought that it would be sufficient to log only the stripe positions involved in a RMW cycle, and then start a scrub on those stripes in case of an unclean shutdown: do you think that is feasible?
> 
> An unclean shutdown causes an inconsistency between data and parity,
> so scrub won't help as it's not able to tell which one (data or
> parity) is valid.
Scrub compares data against its checksum, so it knows whether the data is correct. If no disk is lost, a scrub is sufficient (and needed) to rebuild the parity/data.

The problem arises when a disk failure happens after an "unclean shutdown". But these are *two* distinct failures. Together they break the BTRFS raid5 redundancy. But if you run a scrub process between these two failures, the btrfs raid5 redundancy is still effective.
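A minimal sketch of that ordering (my own illustration; crc32 stands in for
the btrfs data checksums and byte-wise XOR for raid5 parity):

```python
import zlib
from functools import reduce
from operator import xor as bxor

def xor(*blocks):
    # byte-wise XOR across strips, i.e. raid5 parity
    return bytes(reduce(bxor, t) for t in zip(*blocks))

# After an unclean shutdown: the data strips are intact and match their
# checksums, but the parity strip on disk is stale.
d0, d1 = b"AAAA", b"BBBB"
csums = (zlib.crc32(d0), zlib.crc32(d1))
parity = b"\x00\x00\x00\x00"        # stale: not xor(d0, d1)

def scrub(d0, d1, csums):
    # Data verifies against its checksums, so scrub trusts the data
    # and recomputes the parity strip from it.
    assert (zlib.crc32(d0), zlib.crc32(d1)) == csums
    return xor(d0, d1)

parity = scrub(d0, d1, csums)       # first failure repaired...
rebuilt_d0 = xor(d1, parity)        # ...so a later disk loss is survivable
print(rebuilt_d0 == d0)             # True
```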


> 
> With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
> With datacow, we don't do overwrite, but the following situation may happen,
> say we have a raid5 setup with 3 disks, the stripe length is 64k, so
> 
> 1) write 64K  --> now the raid layout is
> [64K data + 64K random + 64K parity]
> 2) write another 64K --> now the raid layout after RMW is
> [64K 1)'s data + 64K 2)'s data + 64K new parity]
> 
> If an unclean shutdown occurs before 2) finishes, then the parity may
> be corrupted, and 1)'s data may then be recovered wrongly if the disk
> which holds 1)'s data is offline.
> 
>> - does this journal disk also host other btrfs logs?
>>
> 
> No, purely data/parity and some associated metadata.
> 
> Thanks,
> 
> -liubo
> 
>>> The idea and the code are similar to the write-through mode of md
>>> raid5-cache, so ppl(partial parity log) is also feasible to implement.
>>> (If you've been familiar with md, you may find this patch set is
>>> boring to read...)
>>>
>>> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
>>> the implementation, the rest patches are improvements and bugfixes,
>>> eg. readahead for recovery, checksum.
>>>
>>> Two btrfs-progs patches are required to play with this patch set, one
>>> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
>>> option '-L', the other is to teach 'btrfs-show-super' to show
>>> %journal_tail.
>>>
>>> This is currently based on 4.12-rc3.
>>>
>>> The patch set is tagged with RFC, and comments are always welcome,
>>> thanks.
>>>
>>> Known limitations:
>>> - Deleting a log device is not implemented yet.
>>>
>>>
>>> Liu Bo (14):
>>>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
>>>   Btrfs: raid56: do not allocate chunk on raid56 log
>>>   Btrfs: raid56: detect raid56 log on mount
>>>   Btrfs: raid56: add verbose debug
>>>   Btrfs: raid56: add stripe log for raid5/6
>>>   Btrfs: raid56: add reclaim support
>>>   Btrfs: raid56: load r5log
>>>   Btrfs: raid56: log recovery
>>>   Btrfs: raid56: add readahead for recovery
>>>   Btrfs: raid56: use the readahead helper to get page
>>>   Btrfs: raid56: add csum support
>>>   Btrfs: raid56: fix error handling while adding a log device
>>>   Btrfs: raid56: initialize raid5/6 log after adding it
>>>   Btrfs: raid56: maintain IO order on raid5/6 log
>>>
>>>  fs/btrfs/ctree.h                |   16 +-
>>>  fs/btrfs/disk-io.c              |   16 +
>>>  fs/btrfs/ioctl.c                |   48 +-
>>>  fs/btrfs/raid56.c               | 1429 ++++++++++++++++++++++++++++++++++-----
>>>  fs/btrfs/raid56.h               |   82 +++
>>>  fs/btrfs/transaction.c          |    2 +
>>>  fs/btrfs/volumes.c              |   56 +-
>>>  fs/btrfs/volumes.h              |    7 +-
>>>  include/uapi/linux/btrfs.h      |    3 +
>>>  include/uapi/linux/btrfs_tree.h |    4 +
>>>  10 files changed, 1487 insertions(+), 176 deletions(-)
>>>
>>
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 16:14 Liu Bo
                   ` (2 preceding siblings ...)
  2017-08-01 17:42 ` Goffredo Baroncelli
@ 2017-08-01 21:00 ` Christoph Anton Mitterer
  2017-08-01 22:24   ` Goffredo Baroncelli
  3 siblings, 1 reply; 24+ messages in thread
From: Christoph Anton Mitterer @ 2017-08-01 21:00 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 110 bytes --]

Hi.

Stupid question:
Would the write hole be closed already, if parity was checksummed?

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 16:56   ` Liu Bo
@ 2017-08-01 18:15     ` Hugo Mills
  0 siblings, 0 replies; 24+ messages in thread
From: Hugo Mills @ 2017-08-01 18:15 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2211 bytes --]

On Tue, Aug 01, 2017 at 10:56:39AM -0600, Liu Bo wrote:
> On Tue, Aug 01, 2017 at 05:28:57PM +0000, Hugo Mills wrote:
> >    Hi,
> > 
> >    Great to see something addressing the write hole at last.
> > 
> > On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote:
> > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > > separate disk as a journal (aka raid5/6 log), so that after unclean
> > > shutdown we can make sure data and parity are consistent on the raid
> > > array by replaying the journal.
> > 
> >    What's the behaviour of the FS if the log device dies during use?
> >
> 
> Error handling on IOs is still under construction (it belongs to the
> known limitations).
> 
> If the log device dies suddenly, I think we could skip the writeback
> to the backend raid arrays and follow the usual btrfs rule, i.e. flip
> the FS to readonly, as it may expose data loss.  What do you think?

   I think the key thing for me is that the overall behaviour of the
redundancy in the FS is not compromised by the logging solution. That
is, the same guarantees still hold: For RAID-5, you can lose up to one
device of the FS (*including* any log devices), and the FS will
continue to operate normally, but degraded. For RAID-6, you can lose
up to two devices without losing any capabilities of the FS. Dropping
to read-only if the (single) log device fails would break those
guarantees.

   I quite like the idea of embedding the log chunks into the
allocated structure of the FS -- although as pointed out, this is
probably going to need a new chunk type, and (to retain the guarantees
of the RAID-6 behaviour above) the ability to do 3-way RAID-1 on those
chunks. You'd also have to be able to balance the log structures while
in flight. It sounds like a lot more work for you, though.

   Hmm... if 3-way RAID-1 (3c) is available, then you could also have
RAID-1*3 on metadata, RAID-6 on data, and have 2-device redundancy
throughout. That's also a very attractive configuration in many
respects. (Analogous to RAID-1 metadata and RAID-5 data).

   Hugo.

-- 
Hugo Mills             | That's not rain, that's a lake with slots in it.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 16:14 Liu Bo
  2017-08-01 17:25 ` Roman Mamedov
  2017-08-01 17:28 ` Hugo Mills
@ 2017-08-01 17:42 ` Goffredo Baroncelli
  2017-08-01 17:24   ` Liu Bo
  2017-08-01 21:00 ` Christoph Anton Mitterer
  3 siblings, 1 reply; 24+ messages in thread
From: Goffredo Baroncelli @ 2017-08-01 17:42 UTC (permalink / raw)
  To: Liu Bo, linux-btrfs

Hi Liu,

On 2017-08-01 18:14, Liu Bo wrote:
> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> separate disk as a journal (aka raid5/6 log), so that after unclean
> shutdown we can make sure data and parity are consistent on the raid
> array by replaying the journal.
> 

Would it be possible to have more information?
- what is logged? data, parity, or data + parity?
- in the past I thought that it would be sufficient to log only the stripe positions involved in a RMW cycle, and then start a scrub on those stripes in case of an unclean shutdown: do you think that is feasible?
- does this journal disk also host other btrfs logs?

> The idea and the code are similar to the write-through mode of md
> raid5-cache, so ppl(partial parity log) is also feasible to implement.
> (If you've been familiar with md, you may find this patch set is
> boring to read...)
> 
> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> the implementation, the rest patches are improvements and bugfixes,
> eg. readahead for recovery, checksum.
> 
> Two btrfs-progs patches are required to play with this patch set, one
> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> option '-L', the other is to teach 'btrfs-show-super' to show
> %journal_tail.
> 
> This is currently based on 4.12-rc3.
> 
> The patch set is tagged with RFC, and comments are always welcome,
> thanks.
> 
> Known limitations:
> - Deleting a log device is not implemented yet.
> 
> 
> Liu Bo (14):
>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
>   Btrfs: raid56: do not allocate chunk on raid56 log
>   Btrfs: raid56: detect raid56 log on mount
>   Btrfs: raid56: add verbose debug
>   Btrfs: raid56: add stripe log for raid5/6
>   Btrfs: raid56: add reclaim support
>   Btrfs: raid56: load r5log
>   Btrfs: raid56: log recovery
>   Btrfs: raid56: add readahead for recovery
>   Btrfs: raid56: use the readahead helper to get page
>   Btrfs: raid56: add csum support
>   Btrfs: raid56: fix error handling while adding a log device
>   Btrfs: raid56: initialize raid5/6 log after adding it
>   Btrfs: raid56: maintain IO order on raid5/6 log
> 
>  fs/btrfs/ctree.h                |   16 +-
>  fs/btrfs/disk-io.c              |   16 +
>  fs/btrfs/ioctl.c                |   48 +-
>  fs/btrfs/raid56.c               | 1429 ++++++++++++++++++++++++++++++++++-----
>  fs/btrfs/raid56.h               |   82 +++
>  fs/btrfs/transaction.c          |    2 +
>  fs/btrfs/volumes.c              |   56 +-
>  fs/btrfs/volumes.h              |    7 +-
>  include/uapi/linux/btrfs.h      |    3 +
>  include/uapi/linux/btrfs_tree.h |    4 +
>  10 files changed, 1487 insertions(+), 176 deletions(-)
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 17:25 ` Roman Mamedov
  2017-08-01 17:03   ` Liu Bo
@ 2017-08-01 17:39   ` Austin S. Hemmelgarn
  2017-08-01 17:07     ` Liu Bo
  2017-08-02 18:47     ` Chris Mason
  1 sibling, 2 replies; 24+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 17:39 UTC (permalink / raw)
  To: Roman Mamedov, Liu Bo; +Cc: linux-btrfs

On 2017-08-01 13:25, Roman Mamedov wrote:
> On Tue,  1 Aug 2017 10:14:23 -0600
> Liu Bo <bo.li.liu@oracle.com> wrote:
> 
>> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
>> separate disk as a journal (aka raid5/6 log), so that after unclean
>> shutdown we can make sure data and parity are consistent on the raid
>> array by replaying the journal.
> 
> Would it be possible to designate areas on the in-array devices to be used
> as the journal?
> 
> While md doesn't have much spare room in its metadata for extraneous things
> like this, Btrfs could use almost as much as it wants to, adding to the size
> of the FS metadata areas. Reliability-wise, the log could be stored as RAID1
> chunks.
> 
> It doesn't seem convenient to need to keep an additional storage device
> around just for the log, and to also have to maintain its fault tolerance
> yourself (so the log device had better be on a mirror, such as mdadm RAID1?
> more expense and maintenance complexity).
> 
I agree, MD pretty much needs a separate device simply because it 
can't allocate arbitrary space on the other array members.  BTRFS can do 
that though, and I would actually think that that would be _easier_ to 
implement than having a separate device.

That said, I do think that it would need to be a separate chunk type, 
because things could get really complicated if the metadata is itself 
using a parity raid profile.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 16:14 Liu Bo
  2017-08-01 17:25 ` Roman Mamedov
@ 2017-08-01 17:28 ` Hugo Mills
  2017-08-01 16:56   ` Liu Bo
  2017-08-01 17:42 ` Goffredo Baroncelli
  2017-08-01 21:00 ` Christoph Anton Mitterer
  3 siblings, 1 reply; 24+ messages in thread
From: Hugo Mills @ 2017-08-01 17:28 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2678 bytes --]

   Hi,

   Great to see something addressing the write hole at last.

On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote:
> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> separate disk as a journal (aka raid5/6 log), so that after unclean
> shutdown we can make sure data and parity are consistent on the raid
> array by replaying the journal.

   What's the behaviour of the FS if the log device dies during use?

   Hugo.

> The idea and the code are similar to the write-through mode of md
> raid5-cache, so ppl(partial parity log) is also feasible to implement.
> (If you've been familiar with md, you may find this patch set is
> boring to read...)
> 
> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> the implementation, the rest patches are improvements and bugfixes,
> eg. readahead for recovery, checksum.
> 
> Two btrfs-progs patches are required to play with this patch set, one
> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> option '-L', the other is to teach 'btrfs-show-super' to show
> %journal_tail.
> 
> This is currently based on 4.12-rc3.
> 
> The patch set is tagged with RFC, and comments are always welcome,
> thanks.
> 
> Known limitations:
> - Deleting a log device is not implemented yet.
> 
> 
> Liu Bo (14):
>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
>   Btrfs: raid56: do not allocate chunk on raid56 log
>   Btrfs: raid56: detect raid56 log on mount
>   Btrfs: raid56: add verbose debug
>   Btrfs: raid56: add stripe log for raid5/6
>   Btrfs: raid56: add reclaim support
>   Btrfs: raid56: load r5log
>   Btrfs: raid56: log recovery
>   Btrfs: raid56: add readahead for recovery
>   Btrfs: raid56: use the readahead helper to get page
>   Btrfs: raid56: add csum support
>   Btrfs: raid56: fix error handling while adding a log device
>   Btrfs: raid56: initialize raid5/6 log after adding it
>   Btrfs: raid56: maintain IO order on raid5/6 log
> 
>  fs/btrfs/ctree.h                |   16 +-
>  fs/btrfs/disk-io.c              |   16 +
>  fs/btrfs/ioctl.c                |   48 +-
>  fs/btrfs/raid56.c               | 1429 ++++++++++++++++++++++++++++++++++-----
>  fs/btrfs/raid56.h               |   82 +++
>  fs/btrfs/transaction.c          |    2 +
>  fs/btrfs/volumes.c              |   56 +-
>  fs/btrfs/volumes.h              |    7 +-
>  include/uapi/linux/btrfs.h      |    3 +
>  include/uapi/linux/btrfs_tree.h |    4 +
>  10 files changed, 1487 insertions(+), 176 deletions(-)
> 

-- 
Hugo Mills             | Some days, it's just not worth gnawing through the
hugo@... carfax.org.uk | straps
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 16:14 Liu Bo
@ 2017-08-01 17:25 ` Roman Mamedov
  2017-08-01 17:03   ` Liu Bo
  2017-08-01 17:39   ` Austin S. Hemmelgarn
  2017-08-01 17:28 ` Hugo Mills
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 24+ messages in thread
From: Roman Mamedov @ 2017-08-01 17:25 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

On Tue,  1 Aug 2017 10:14:23 -0600
Liu Bo <bo.li.liu@oracle.com> wrote:

> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> separate disk as a journal (aka raid5/6 log), so that after unclean
> shutdown we can make sure data and parity are consistent on the raid
> array by replaying the journal.

Would it be possible to designate areas on the in-array devices to be used as
the journal?

While md doesn't have much spare room in its metadata for extraneous things
like this, Btrfs could use almost as much as it wants to, adding to the size
of the FS metadata areas. Reliability-wise, the log could be stored as RAID1
chunks.

It doesn't seem convenient to need to keep an additional storage device around
just for the log, and to also have to maintain its fault tolerance yourself
(so the log device had better be on a mirror, such as mdadm RAID1? more
expense and maintenance complexity).

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 17:42 ` Goffredo Baroncelli
@ 2017-08-01 17:24   ` Liu Bo
  2017-08-01 22:14     ` Goffredo Baroncelli
  0 siblings, 1 reply; 24+ messages in thread
From: Liu Bo @ 2017-08-01 17:24 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs

On Tue, Aug 01, 2017 at 07:42:14PM +0200, Goffredo Baroncelli wrote:
> Hi Liu,
> 
> On 2017-08-01 18:14, Liu Bo wrote:
> > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > separate disk as a journal (aka raid5/6 log), so that after unclean
> > shutdown we can make sure data and parity are consistent on the raid
> > array by replaying the journal.
> > 
> 
> Would it be possible to have more information?
> - what is logged? data, parity, or data + parity?

Patch 5 has more details(sorry for not making it clear that in the
cover letter).

So both data and parity are logged so that while replaying the journal
everything is written to whichever disk it should be written to.

> - in the past I thought that it would be sufficient to log only the stripe positions involved in a RMW cycle, and then start a scrub on those stripes in case of an unclean shutdown: do you think that is feasible?

An unclean shutdown causes an inconsistency between data and parity, so
scrub won't help as it's not able to tell which one (data or parity)
is valid.

With nodatacow, we do overwrite, so RMW during unclean shutdown is not safe.
With datacow, we don't do overwrite, but the following situation may happen,
say we have a raid5 setup with 3 disks, the stripe length is 64k, so

1) write 64K  --> now the raid layout is
[64K data + 64K random + 64K parity]
2) write another 64K --> now the raid layout after RMW is
[64K 1)'s data + 64K 2)'s data + 64K new parity]

If an unclean shutdown occurs before 2) finishes, then the parity may
be corrupted, and 1)'s data may then be recovered wrongly if the disk
which holds 1)'s data is offline.
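The scenario in steps 1) and 2) can be sketched like this (my own toy model,
not code from the patch set; 4-byte strips stand in for the 64K ones):

```python
from functools import reduce
from operator import xor as bxor

def xor(*blocks):
    # byte-wise XOR across strips, i.e. raid5 parity
    return bytes(reduce(bxor, t) for t in zip(*blocks))

random_junk = b"\x99\x99\x99\x99"   # whatever sat on the second strip

# 1) write 64K --> [64K data + 64K random + 64K parity]
data1 = b"AAAA"
parity = xor(data1, random_junk)

# 2) RMW writes the second strip; the crash lands after the data write
#    but before the new parity reaches disk.
data2 = b"BBBB"
# parity = xor(data1, data2)  # <-- never happens: unclean shutdown

# Later, the disk holding 1)'s data goes offline; rebuild from the rest:
rebuilt_data1 = xor(data2, parity)
print(rebuilt_data1 == data1)   # False: 1)'s data is recovered wrongly
```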

> - does this journal disk also host other btrfs logs?
>

No, purely data/parity and some associated metadata.

Thanks,

-liubo

> > The idea and the code are similar to the write-through mode of md
> > raid5-cache, so ppl(partial parity log) is also feasible to implement.
> > (If you've been familiar with md, you may find this patch set is
> > boring to read...)
> > 
> > Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> > the implementation, the rest patches are improvements and bugfixes,
> > eg. readahead for recovery, checksum.
> > 
> > Two btrfs-progs patches are required to play with this patch set, one
> > is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> > option '-L', the other is to teach 'btrfs-show-super' to show
> > %journal_tail.
> > 
> > This is currently based on 4.12-rc3.
> > 
> > The patch set is tagged with RFC, and comments are always welcome,
> > thanks.
> > 
> > Known limitations:
> > - Deleting a log device is not implemented yet.
> > 
> > 
> > Liu Bo (14):
> >   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
> >   Btrfs: raid56: do not allocate chunk on raid56 log
> >   Btrfs: raid56: detect raid56 log on mount
> >   Btrfs: raid56: add verbose debug
> >   Btrfs: raid56: add stripe log for raid5/6
> >   Btrfs: raid56: add reclaim support
> >   Btrfs: raid56: load r5log
> >   Btrfs: raid56: log recovery
> >   Btrfs: raid56: add readahead for recovery
> >   Btrfs: raid56: use the readahead helper to get page
> >   Btrfs: raid56: add csum support
> >   Btrfs: raid56: fix error handling while adding a log device
> >   Btrfs: raid56: initialize raid5/6 log after adding it
> >   Btrfs: raid56: maintain IO order on raid5/6 log
> > 
> >  fs/btrfs/ctree.h                |   16 +-
> >  fs/btrfs/disk-io.c              |   16 +
> >  fs/btrfs/ioctl.c                |   48 +-
> >  fs/btrfs/raid56.c               | 1429 ++++++++++++++++++++++++++++++++++-----
> >  fs/btrfs/raid56.h               |   82 +++
> >  fs/btrfs/transaction.c          |    2 +
> >  fs/btrfs/volumes.c              |   56 +-
> >  fs/btrfs/volumes.h              |    7 +-
> >  include/uapi/linux/btrfs.h      |    3 +
> >  include/uapi/linux/btrfs_tree.h |    4 +
> >  10 files changed, 1487 insertions(+), 176 deletions(-)
> > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 17:39   ` Austin S. Hemmelgarn
@ 2017-08-01 17:07     ` Liu Bo
  2017-08-02 18:47     ` Chris Mason
  1 sibling, 0 replies; 24+ messages in thread
From: Liu Bo @ 2017-08-01 17:07 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Roman Mamedov, linux-btrfs

On Tue, Aug 01, 2017 at 01:39:59PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-08-01 13:25, Roman Mamedov wrote:
> > On Tue,  1 Aug 2017 10:14:23 -0600
> > Liu Bo <bo.li.liu@oracle.com> wrote:
> > 
> > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > > separate disk as a journal (aka raid5/6 log), so that after unclean
> > > shutdown we can make sure data and parity are consistent on the raid
> > > array by replaying the journal.
> > 
> > Could it be possible to designate areas on the in-array devices to be used as
> > journal?
> > 
> > While md doesn't have much spare room in its metadata for extraneous
> > things like this, Btrfs could use almost as much as it wants to, adding
> > to the size of the FS metadata areas.  Reliability-wise, the log could
> > be stored as RAID1 chunks.
> > 
> > It doesn't seem convenient to need to have an additional storage device
> > around just for the log, while also maintaining its fault tolerance
> > yourself (so the log device would better be on a mirror, such as mdadm
> > RAID1? more expense and maintenance complexity).
> > 
> I agree, MD pretty much needs a separate device simply because it can't
> allocate arbitrary space on the other array members.  BTRFS can do that,
> though, and I would actually think that would be _easier_ to implement
> than having a separate device.
>

Yes and no; using chunks may need a new ioctl and diving into the chunk
allocation/(auto-)deletion maze.

> That said, I do think that it would need to be a separate chunk type,
> because things could get really complicated if the metadata is itself using
> a parity raid profile.

Exactly, esp. when balance comes into the picture.

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 17:25 ` Roman Mamedov
@ 2017-08-01 17:03   ` Liu Bo
  2017-08-01 17:39   ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 24+ messages in thread
From: Liu Bo @ 2017-08-01 17:03 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs

On Tue, Aug 01, 2017 at 10:25:47PM +0500, Roman Mamedov wrote:
> On Tue,  1 Aug 2017 10:14:23 -0600
> Liu Bo <bo.li.liu@oracle.com> wrote:
> 
> > This aims to fix the write hole issue on btrfs raid5/6 setups by
> > adding a separate disk as a journal (aka raid5/6 log), so that after
> > an unclean shutdown we can make sure data and parity are consistent
> > on the raid array by replaying the journal.
> 
> Could it be possible to designate areas on the in-array devices to be used as
> journal?
>
> While md doesn't have much spare room in its metadata for extraneous
> things like this, Btrfs could use almost as much as it wants to, adding
> to the size of the FS metadata areas.  Reliability-wise, the log could
> be stored as RAID1 chunks.
> 

Yes, it makes sense, and we could definitely do that; that was actually
the original idea.  I started by adding a new device for the log as that
looked easier to me, but I could try the in-array approach now.

> It doesn't seem convenient to need to have an additional storage device
> around just for the log, while also maintaining its fault tolerance
> yourself (so the log device would better be on a mirror, such as mdadm
> RAID1? more expense and maintenance complexity).
>

That's true.  Thanks for the suggestions.

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
  2017-08-01 17:28 ` Hugo Mills
@ 2017-08-01 16:56   ` Liu Bo
  2017-08-01 18:15     ` Hugo Mills
  0 siblings, 1 reply; 24+ messages in thread
From: Liu Bo @ 2017-08-01 16:56 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs

On Tue, Aug 01, 2017 at 05:28:57PM +0000, Hugo Mills wrote:
>    Hi,
> 
>    Great to see something addressing the write hole at last.
> 
> On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote:
> > This aims to fix the write hole issue on btrfs raid5/6 setups by
> > adding a separate disk as a journal (aka raid5/6 log), so that after
> > an unclean shutdown we can make sure data and parity are consistent
> > on the raid array by replaying the journal.
> 
>    What's the behaviour of the FS if the log device dies during use?
>

Error handling on IOs is still under construction (it belongs with the
known limitations).

If the log device dies suddenly, I think we could skip the writeback to
the backend raid arrays and follow the usual rule in btrfs: flip the FS
to readonly, as continuing could expose data loss.  What do you think?

Thanks,

-liubo

>    Hugo.
> 
> > The idea and the code are similar to the write-through mode of md
> > raid5-cache, so ppl (partial parity log) would also be feasible to
> > implement.  (If you're familiar with md, you may find this patch set
> > boring to read...)
> > 
> > Patches 1-3 add the log disk, patches 5-8 are the main part of the
> > implementation, and the remaining patches are improvements and bug
> > fixes, e.g. readahead for recovery and checksum support.
> > 
> > Two btrfs-progs patches are required to play with this patch set: one
> > enhances 'btrfs device add' to add a disk as a raid5/6 log via the
> > option '-L', and the other teaches 'btrfs-show-super' to show
> > %journal_tail.
> > 
> > This is currently based on 4.12-rc3.
> > 
> > The patch set is tagged with RFC, and comments are always welcome,
> > thanks.
> > 
> > Known limitations:
> > - Deleting a log device is not implemented yet.
> > 
> > 
> > Liu Bo (14):
> >   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
> >   Btrfs: raid56: do not allocate chunk on raid56 log
> >   Btrfs: raid56: detect raid56 log on mount
> >   Btrfs: raid56: add verbose debug
> >   Btrfs: raid56: add stripe log for raid5/6
> >   Btrfs: raid56: add reclaim support
> >   Btrfs: raid56: load r5log
> >   Btrfs: raid56: log recovery
> >   Btrfs: raid56: add readahead for recovery
> >   Btrfs: raid56: use the readahead helper to get page
> >   Btrfs: raid56: add csum support
> >   Btrfs: raid56: fix error handling while adding a log device
> >   Btrfs: raid56: initialize raid5/6 log after adding it
> >   Btrfs: raid56: maintain IO order on raid5/6 log
> > 
> >  fs/btrfs/ctree.h                |   16 +-
> >  fs/btrfs/disk-io.c              |   16 +
> >  fs/btrfs/ioctl.c                |   48 +-
> >  fs/btrfs/raid56.c               | 1429 ++++++++++++++++++++++++++++++++++-----
> >  fs/btrfs/raid56.h               |   82 +++
> >  fs/btrfs/transaction.c          |    2 +
> >  fs/btrfs/volumes.c              |   56 +-
> >  fs/btrfs/volumes.h              |    7 +-
> >  include/uapi/linux/btrfs.h      |    3 +
> >  include/uapi/linux/btrfs_tree.h |    4 +
> >  10 files changed, 1487 insertions(+), 176 deletions(-)
> > 
> 
> -- 
> Hugo Mills             | Some days, it's just not worth gnawing through the
> hugo@... carfax.org.uk | straps
> http://carfax.org.uk/  |
> PGP: E2AB1DE4          |



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes
@ 2017-08-01 16:14 Liu Bo
  2017-08-01 17:25 ` Roman Mamedov
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Liu Bo @ 2017-08-01 16:14 UTC (permalink / raw)
  To: linux-btrfs

This aims to fix the write hole issue on btrfs raid5/6 setups by
adding a separate disk as a journal (aka raid5/6 log), so that after
an unclean shutdown we can make sure data and parity are consistent
on the raid array by replaying the journal.

The idea and the code are similar to the write-through mode of md
raid5-cache, so ppl (partial parity log) would also be feasible to
implement.  (If you're familiar with md, you may find this patch set
boring to read...)

Patches 1-3 add the log disk, patches 5-8 are the main part of the
implementation, and the remaining patches are improvements and bug
fixes, e.g. readahead for recovery and checksum support.

Two btrfs-progs patches are required to play with this patch set: one
enhances 'btrfs device add' to add a disk as a raid5/6 log via the
option '-L', and the other teaches 'btrfs-show-super' to show
%journal_tail.

This is currently based on 4.12-rc3.

The patch set is tagged with RFC, and comments are always welcome,
thanks.

Known limitations:
- Deleting a log device is not implemented yet.


Liu Bo (14):
  Btrfs: raid56: add raid56 log via add_dev v2 ioctl
  Btrfs: raid56: do not allocate chunk on raid56 log
  Btrfs: raid56: detect raid56 log on mount
  Btrfs: raid56: add verbose debug
  Btrfs: raid56: add stripe log for raid5/6
  Btrfs: raid56: add reclaim support
  Btrfs: raid56: load r5log
  Btrfs: raid56: log recovery
  Btrfs: raid56: add readahead for recovery
  Btrfs: raid56: use the readahead helper to get page
  Btrfs: raid56: add csum support
  Btrfs: raid56: fix error handling while adding a log device
  Btrfs: raid56: initialize raid5/6 log after adding it
  Btrfs: raid56: maintain IO order on raid5/6 log

 fs/btrfs/ctree.h                |   16 +-
 fs/btrfs/disk-io.c              |   16 +
 fs/btrfs/ioctl.c                |   48 +-
 fs/btrfs/raid56.c               | 1429 ++++++++++++++++++++++++++++++++++-----
 fs/btrfs/raid56.h               |   82 +++
 fs/btrfs/transaction.c          |    2 +
 fs/btrfs/volumes.c              |   56 +-
 fs/btrfs/volumes.h              |    7 +-
 include/uapi/linux/btrfs.h      |    3 +
 include/uapi/linux/btrfs_tree.h |    4 +
 10 files changed, 1487 insertions(+), 176 deletions(-)

-- 
2.9.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-07-30 16:28 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAL5DHTFA8G5fq=BQ7N7qQimesjygKTiBR2qZF4YBRAsAjB_L5Q@mail.gmail.com>
2019-07-30 16:20 ` [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes Goffredo Baroncelli
2017-08-01 16:14 Liu Bo
2017-08-01 17:25 ` Roman Mamedov
2017-08-01 17:03   ` Liu Bo
2017-08-01 17:39   ` Austin S. Hemmelgarn
2017-08-01 17:07     ` Liu Bo
2017-08-02 18:47     ` Chris Mason
2018-05-03 19:16       ` Goffredo Baroncelli
2017-08-01 17:28 ` Hugo Mills
2017-08-01 16:56   ` Liu Bo
2017-08-01 18:15     ` Hugo Mills
2017-08-01 17:42 ` Goffredo Baroncelli
2017-08-01 17:24   ` Liu Bo
2017-08-01 22:14     ` Goffredo Baroncelli
2017-08-02 17:57       ` Liu Bo
2017-08-02 20:41         ` Goffredo Baroncelli
2017-08-02 20:27           ` Liu Bo
2017-08-03  4:02             ` Duncan
2017-08-03  4:40               ` Goffredo Baroncelli
2017-08-23 15:28             ` Chris Murphy
2017-08-23 15:47               ` Austin S. Hemmelgarn
2017-08-25 13:53               ` Goffredo Baroncelli
2017-08-01 21:00 ` Christoph Anton Mitterer
2017-08-01 22:24   ` Goffredo Baroncelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).