From: Goffredo Baroncelli <kreijack@libero.it>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, Lukas Straub <lukasstraub2@web.de>
Cc: Martin Raiber <martin@urbackup.org>,
	Paul Jones <paul@pauljones.id.au>,
	Wang Yugui <wangyugui@e16-tech.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
Date: Mon, 6 Jun 2022 20:10:58 +0200	[thread overview]
Message-ID: <128e0119-088b-7a10-c874-551196df4c56@libero.it> (raw)
In-Reply-To: <c4c298bf-ca54-1915-c22f-a6d87fc5a78f@gmx.com>

Hi Qu,

On 06/06/2022 13.21, Qu Wenruo wrote:
> 
> 
> On 2022/6/6 16:16, Qu Wenruo wrote:
>>
>>
> [...]
>>>>
>>>> Hello Qu,
>>>>
>>>> If you don't care about the write-hole, you can also use a dirty bitmap
>>>> like mdraid 5/6 does. There, one bit in the bitmap represents for
>>>> example one gigabyte of the disk that _may_ be dirty, and the bit is
>>>> left
>>>> dirty for a while and doesn't need to be set for each write. Or you
>>>> could do a per-block-group dirty bit.
>>>
>>> That would be a pretty good way for auto scrub after dirty close.
>>>
>>> Currently we have quite some different ideas, but some are pretty
>>> similar but at different side of a spectrum:
>>>
>>>      Easier to implement        ..     Harder to implement
>>> |<- More on mount time scrub   ..     More on journal ->|
>>> |                    |    |    \- Full journal
>>> |                    |    \--- Per bg dirty bitmap
>>> |                    \----------- Per bg dirty flag
>>> \--------------------------------------------------- Per sb dirty flag
>>
>> In fact, I have recently been checking the MD code (including their MD-raid5).
>>
>> It turns out they have a write-intent bitmap, which is almost the per-bg
>> dirty bitmap in the above spectrum.
>>
>> In fact, since btrfs has CoW and checksums for all metadata (and part
>> of its data), btrfs scrub can do a much better job than MD at resilvering
>> the range.
>>
>> Furthermore, we have a pretty good reserved space (1MiB), and a pretty
>> reasonable stripe length (1GiB).
>> This means we only need 32KiB of bitmap for each RAID56 stripe, much
>> smaller than the 1MiB we reserved.
>>
>> I think this can be a pretty reasonable middle ground: faster than a full
>> journal, while the amount to scrub should be small enough to be done at
>> mount time.

Raid5 is "single fault proof". This means that it can sustain only one
failure *at a time*, such as:
1) unavailability of a disk (e.g. a data disk failure)
2) a missing write in the stripe (e.g. an unclean shutdown)

a) Until now (i.e. without the write-intent bitmap), even if these failures
happen on different days (i.e. not at the same time), the result may be a
"write hole".

b) With the write-intent bitmap, the write hole requires that 1) and 2) happen
at the same time. But that would no longer be a "single fault", with only one
exception: when the failures have a common cause (e.g. a power failure which
in turn causes the death of a disk). In that case it has to be considered a
single fault.

But with a battery backup (i.e. no power failure), the likelihood of b)
becomes negligible.

This is to say that a write-intent bitmap would provide a huge improvement
in the resilience of btrfs raid5, and in turn raid6.

My only suggestion is to find a way to store the write-intent bitmap not in
the raid5/6 block group, but in a separate block group, with the appropriate
level of redundancy.

This is for two main reasons:
1) in the future BTRFS may gain the ability to allocate this block group on a
dedicated set of disks. I see two main cases:
a) in the case of raid6, we can store the intent bitmap (or the journal) in a
raid1c3 BG allocated on the faster disks. The con is that each block has to be
written 3x2 times. But if you have a hybrid disk set (some SSDs and some HDDs),
you get a noticeable gain in performance.
b) another option is to spread the intent bitmap (or the journal) across *all*
disks, where each disk contains only the bits for its own data (if we update
only disk #1 and disk #2, we have to update the intent bitmap (or the journal)
only on disk #1 and disk #2)


2) having a dedicated bg for the intent bitmap (or the journal) has another
big advantage: you don't need to change the meaning of the raid5/6 bg. This
means that an older kernel can still read/write a raid5/6 filesystem: it is
sufficient to ignore the intent bitmap (or the journal)



> 
> Furthermore, this even allows us to go with something like a bitmap tree
> for such a write-intent bitmap.
> And as long as the user is not using RAID56 for metadata (maybe it's even
> OK to use RAID56 for metadata), it should be pretty safe against most
> write-hole cases (though only for metadata and CoW data; nocow data is
> still affected).
> 
> Thus I believe this can be a valid path to explore, and it may even have a
> higher priority than the full journal.
> 
> Thanks,
> Qu
> 



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

Thread overview: 37+ messages
2022-05-24  6:13 [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft Qu Wenruo
2022-05-24 11:08 ` kernel test robot
2022-05-24 12:19 ` kernel test robot
2022-05-24 17:02 ` David Sterba
2022-05-24 22:31   ` Qu Wenruo
2022-05-25  9:00   ` Christoph Hellwig
2022-05-25  9:13     ` Qu Wenruo
2022-05-25  9:26       ` Christoph Hellwig
2022-05-25  9:35         ` Qu Wenruo
2022-05-26  9:06           ` waxhead
2022-05-26  9:26             ` Qu Wenruo
2022-05-26 15:30               ` Goffredo Baroncelli
2022-05-26 16:10                 ` David Sterba
2022-06-01  2:06 ` Wang Yugui
2022-06-01  2:13   ` Qu Wenruo
2022-06-01  2:25     ` Wang Yugui
2022-06-01  2:55       ` Qu Wenruo
2022-06-01  9:07         ` Wang Yugui
2022-06-01  9:27           ` Qu Wenruo
2022-06-01  9:56             ` Paul Jones
2022-06-01 10:12               ` Qu Wenruo
2022-06-01 18:49                 ` Martin Raiber
2022-06-01 21:37                   ` Qu Wenruo
2022-06-03  9:32                     ` Lukas Straub
2022-06-03  9:59                       ` Qu Wenruo
2022-06-06  8:16                         ` Qu Wenruo
2022-06-06 11:21                           ` Qu Wenruo
2022-06-06 18:10                             ` Goffredo Baroncelli [this message]
2022-06-07  1:27                               ` Qu Wenruo
2022-06-07 17:36                                 ` Goffredo Baroncelli
2022-06-07 22:14                                   ` Qu Wenruo
2022-06-08 17:26                                     ` Goffredo Baroncelli
2022-06-13  2:27                                       ` Qu Wenruo
2022-06-08 15:17                         ` Lukas Straub
2022-06-08 17:32                           ` Goffredo Baroncelli
2022-06-01 12:21               ` Qu Wenruo
2022-06-01 14:55                 ` Robert Krig
