From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Lukas Straub <lukasstraub2@web.de>
Cc: Martin Raiber <martin@urbackup.org>,
	Paul Jones <paul@pauljones.id.au>,
	Wang Yugui <wangyugui@e16-tech.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
Date: Mon, 6 Jun 2022 16:16:17 +0800
Message-ID: <252577ba-1659-62f8-fc44-fea506eb97b7@gmx.com>
In-Reply-To: <8c318892-0d36-51bb-18e0-a762dd75b723@gmx.com>



On 2022/6/3 17:59, Qu Wenruo wrote:
>
>
> On 2022/6/3 17:32, Lukas Straub wrote:
>> On Thu, 2 Jun 2022 05:37:11 +0800
>> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>> On 2022/6/2 02:49, Martin Raiber wrote:
>>>> On 01.06.2022 12:12 Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2022/6/1 17:56, Paul Jones wrote:
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>>>>> Sent: Wednesday, 1 June 2022 7:27 PM
>>>>>>> To: Wang Yugui <wangyugui@e16-tech.com>
>>>>>>> Cc: linux-btrfs@vger.kernel.org
>>>>>>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format
>>>>>>> draft
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>>>>> If we save the journal on every RAID56 HDD, it will always be
>>>>>>>>>> very slow, because the journal data is in a different place
>>>>>>>>>> than the normal data, so HDD seeks will always happen?
>>>>>>>>>>
>>>>>>>>>> If we save the journal on a dedicated device, just like
>>>>>>>>>> 'mke2fs -O journal_dev' or 'mkfs.xfs -l logdev', then this
>>>>>>>>>> device just works like an NVDIMM?  We may not need
>>>>>>>>>> RAID56/RAID1 for the journal data.
>>>>>>>>>
>>>>>>>>> That device is a single point of failure. If you lose that
>>>>>>>>> device, the write hole comes back.
>>>>>>>>
>>>>>>>> The HW RAID card has a 'single point of failure' too, such as
>>>>>>>> the NVDIMM inside the HW RAID card.
>>>>>>>>
>>>>>>>> but power-loss frequency > HDD failure frequency > NVDIMM/SSD
>>>>>>>> failure frequency
>>>>>>>
>>>>>>> It's a completely different level.
>>>>>>>
>>>>>>> For btrfs RAID, we have no special treatment for any disk.
>>>>>>> And our RAID focuses on ensuring tolerance to device failures.
>>>>>>>
>>>>>>> In your RAID card case, the failure rate of the card is indeed
>>>>>>> much lower.
>>>>>>> In the journal device case, how do you ensure that the chance of
>>>>>>> losing the journal device is still way lower than that of all
>>>>>>> the other devices?
>>>>>>>
>>>>>>> So this doesn't make sense unless you put the journal on
>>>>>>> something that is definitely not a regular disk.
>>>>>>>
>>>>>>> I don't believe this benefits most users.
>>>>>>> Just consider how many regular people use a dedicated journal
>>>>>>> device for XFS/EXT4 on top of md/dm RAID56.
>>>>>>
>>>>>> A good solid state drive should be far less error prone than
>>>>>> spinning drives, so it would be a good candidate. Not perfect,
>>>>>> but better.
>>>>>>
>>>>>> As an end user, I think focusing on stability and recovery tools
>>>>>> is a better use of time than fixing the write hole, as I wouldn't
>>>>>> even consider using RAID56 in its current state. The write hole
>>>>>> problem can be alleviated by a UPS and by not using RAID56 for a
>>>>>> busy write load. It's still good to brainstorm the issue though,
>>>>>> as it will need solving eventually.
>>>>>
>>>>> In fact, since the write hole is only a problem for power loss
>>>>> (and explicit degraded writes), another solution is to only record
>>>>> whether the fs was gracefully closed.
>>>>>
>>>>> If the fs was not gracefully closed (indicated by a bit in the
>>>>> superblock), then we just trigger a full scrub on all existing
>>>>> RAID56 block groups.
>>>>>
>>>>> This should solve the problem, at the extra cost of a slow scrub
>>>>> after each unclean shutdown.
>>>>>
>>>>> To be extra safe, during that scrub run we really want the user to
>>>>> wait for the scrub to finish.
>>>>>
>>>>> But on the other hand, I totally understand users won't be happy
>>>>> to wait 10+ hours just because of an unclean shutdown...
>>>> Would it be possible to put the stripe offsets/numbers into a
>>>> journal and commit them before the write? Then, during mount, you
>>>> could scrub only those after an unclean shutdown.
>>>
>>> If we go that path, we can already do a full journal, and just
>>> replay that journal without needing a scrub at all.
>>
>> Hello Qu,
>>
>> If you don't care about the write hole, you can also use a dirty
>> bitmap like mdraid 5/6 does. There, one bit in the bitmap represents,
>> for example, one gigabyte of the disk that _may_ be dirty, and the
>> bit is left dirty for a while and doesn't need to be set for each
>> write. Or you could do a per-block-group dirty bit.
>
> That would be a pretty good way to auto-scrub after a dirty close.
>
> Currently we have quite a few different ideas; some are pretty
> similar, just at different points of a spectrum:
>
>      Easier to implement        ..     Harder to implement
> |<- More on mount time scrub   ..     More on journal ->|
> |                    |    |    \- Full journal
> |                    |    \--- Per bg dirty bitmap
> |                    \----------- Per bg dirty flag
> \--------------------------------------------------- Per sb dirty flag

In fact, I have recently been checking the MD code (including their
MD RAID5).

It turns out they have a write-intent bitmap, which is almost the
per-bg dirty bitmap in the above spectrum.

In fact, since btrfs has CoW and checksums for all of its metadata (and
part of its data), btrfs scrub can do a much better job than MD at
resilvering the range.
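
To illustrate the difference with a minimal sketch (toy code, not real
MD or btrfs kernel APIs): MD has no checksums, so it has to treat every
sector in a dirty region as suspect and rewrite its parity, while a
checksum-aware scrub can verify each sector and rebuild only the ones
that actually fail verification.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in: pretend only sector 3 was half-written before the crash. */
static bool toy_csum_ok(uint64_t sector_nr)
{
	return sector_nr != 3;
}

static void toy_rebuild_from_parity(uint64_t sector_nr)
{
	printf("rebuilding sector %llu from parity\n",
	       (unsigned long long)sector_nr);
}

int main(void)
{
	/* Say the write-intent bitmap marks 8 sectors as possibly dirty. */
	for (uint64_t s = 0; s < 8; s++) {
		if (toy_csum_ok(s))
			continue;	/* checksum verified, keep the data */
		toy_rebuild_from_parity(s);
	}
	return 0;
}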

Furthermore, we have a pretty good reserved space (1MiB) and a pretty
reasonable stripe length (1GiB).
This means we only need 32KiB of bitmap for each RAID56 stripe, much
smaller than the 1MiB we reserved.
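
To double-check that figure, a quick sizing sketch (not btrfs code); the
one-bit-per-4KiB-sector granularity is my assumption, not something
spelled out above, but it is the granularity that yields 32KiB:

#include <stdio.h>

int main(void)
{
	unsigned long long bg_size = 1ULL << 30;	/* 1GiB block group */
	unsigned long long sectorsize = 4096;		/* assumed 4KiB */
	unsigned long long bits = bg_size / sectorsize;	/* 262144 bits */
	unsigned long long bytes = bits / 8;		/* 32768 bytes */

	printf("bitmap: %llu bits = %llu KiB per 1GiB block group\n",
	       bits, bytes / 1024);
	return 0;
}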

I think this can be a pretty reasonable middle ground: faster than a
full journal, while the amount to scrub should be small enough to be
done at mount time.
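
As a rough sketch of that mount-time flow (all names and structures
below are made up for illustration, not existing btrfs code): walk each
RAID56 block group's write-intent bitmap and queue a scrub only for the
sectors whose dirty bit is set.

#include <stdint.h>
#include <stdio.h>

#define BG_SIZE		(1ULL << 30)	/* 1GiB block group (assumed) */
#define SECTORSIZE	4096ULL		/* assumed 4KiB sector size */
#define BITMAP_BYTES	(BG_SIZE / SECTORSIZE / 8)	/* 32KiB */

struct toy_block_group {
	uint64_t start;			/* logical start offset */
	uint8_t dirty[BITMAP_BYTES];	/* write-intent bitmap */
};

/* Stand-in for submitting a real scrub request on one dirty sector. */
static void toy_scrub_sector(uint64_t logical)
{
	printf("scrub sector at logical %llu\n", (unsigned long long)logical);
}

/* Walk one block group's bitmap and scrub only the dirty sectors. */
static void toy_scrub_dirty(const struct toy_block_group *bg)
{
	for (uint64_t bit = 0; bit < BG_SIZE / SECTORSIZE; bit++) {
		if (bg->dirty[bit / 8] & (1 << (bit % 8)))
			toy_scrub_sector(bg->start + bit * SECTORSIZE);
	}
}

int main(void)
{
	static struct toy_block_group bg = { .start = 1ULL << 30 };

	/* Pretend a crash left two sectors marked dirty. */
	bg.dirty[0] |= 0x01;	/* first sector in the block group */
	bg.dirty[0] |= 0x80;	/* eighth sector */

	toy_scrub_dirty(&bg);
	return 0;
}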

Thanks,
Qu
>
> In fact, the dirty bitmap is just a simplified version of a journal
> (it only records the metadata, without the data).
> Unlike md/dm RAID56, with btrfs scrub we should be able to fully
> recover the data without problems.
>
> Even with a per-bg dirty bitmap, we still need some extra location to
> record the bitmap. Thus it needs an on-disk format change anyway.
>
> Currently only the per-sb dirty flag may be backward compatible.
>
> And whether we should wait for the scrub to finish before allowing the
> user to do anything with the fs is another concern.
>
> Even using a bitmap, we may have several GiB of data that needs to be
> scrubbed.
> If we wait for the scrub to finish, that's the best and safest way,
> but users won't be happy at all.
>
> If we go the scrub-resume way, it's faster but still leaves a large
> window in which the write hole can reduce our tolerance.
>
> Thanks,
> Qu
>>
>> And while you're at it, add the same mechanism to all the other raid
>> and dup modes to fix the inconsistency of NOCOW files after a crash.
>>
>> Regards,
>> Lukas Straub
>>
>>> Thanks,
>>> Qu
>>>
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> Paul.
>>>>
>>>>
>>
>>
>>
