From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Wang Yugui <wangyugui@e16-tech.com>, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56
Date: Thu, 7 Jul 2022 09:14:27 +0800
Message-ID: <d80eb281-6aae-0dbe-5ed5-277087c9ce91@gmx.com>
In-Reply-To: <20220707073652.D696.409509F4@e16-tech.com>



On 2022/7/7 07:36, Wang Yugui wrote:
> Hi,
>
>> [BACKGROUND]
>> Unlike md-raid, btrfs RAID56 has no mechanism to re-sync its devices
>> after a power loss.
>>
>> For pure mirror-based profiles this is fine, as btrfs can utilize its
>> csums to find the correct mirror and repair the bad ones.
>>
>> But for RAID56, the repair itself needs data from the other devices,
>> thus any out-of-sync data can reduce the fault tolerance.
>>
>> Even worse, an incorrect RMW can use the stale data to generate P/Q,
>> removing any possibility of recovering the data.
>>
>>
>> md-raid uses a write-intent bitmap for faster resilvering, and a
>> journal (the partial parity log for RAID5) to ensure it can even
>> withstand a power loss plus a device loss.
>>
>> [OBJECTIVE]
>>
>> This patchset introduces a btrfs-specific write-intent bitmap.
>>
>> The bitmap is located at physical offset 1MiB on each device, and its
>> content is the same across all devices.
>>
>> When there is a RAID56 write (currently all RAID56 writes, including
>> full stripe writes), the write-intent bitmap is updated and flushed to
>> all writeable devices before any of the real bios are submitted to
>> disk.
>>
>> So even if a power loss happens, at the next mount we know which full
>> stripes need to be checked, and can start a scrub on the involved
>> logical bytenr ranges.
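>>
>> As a rough sketch of that update-then-write ordering, here is a
>> self-contained toy program (the mark_dirty/writeback names follow the
>> helpers mentioned in the TODO below; the clear helper, the signatures
>> and the stub bodies are simplified assumptions for illustration only):
>>
>>   #include <stdio.h>
>>   #include <inttypes.h>
>>
>>   /* Stub: record the dirty range in the in-memory bitmap. */
>>   static void btrfs_write_intent_mark_dirty(uint64_t start, uint64_t len)
>>   {
>>           printf("mark dirty [%" PRIu64 ", +%" PRIu64 ")\n", start, len);
>>   }
>>
>>   /* Stub: flush the bitmap to all writeable devices. */
>>   static void btrfs_write_intent_writeback(void)
>>   {
>>           printf("flush bitmap to all writeable devices\n");
>>   }
>>
>>   /* Stub: clear the bits once the full stripe write has finished. */
>>   static void btrfs_write_intent_clear(uint64_t start, uint64_t len)
>>   {
>>           printf("clear bits [%" PRIu64 ", +%" PRIu64 ")\n", start, len);
>>   }
>>
>>   int main(void)
>>   {
>>           uint64_t start = 0, len = 64 * 1024;    /* one 64KiB block */
>>
>>           /* 1) Persist the intent before touching the real stripe. */
>>           btrfs_write_intent_mark_dirty(start, len);
>>           btrfs_write_intent_writeback();
>>
>>           /* 2) Only now submit the real data/P/Q bios (simulated). */
>>           printf("submit RAID56 bios\n");
>>
>>           /* 3) After the full stripe write finishes, drop the intent. */
>>           btrfs_write_intent_clear(start, len);
>>           btrfs_write_intent_writeback();
>>           return 0;
>>   }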
>>
>> [NO RECOVERY CODE YET]
>>
>> Unfortunately, this patchset only implements the write-intent bitmap
>> code; the recovery part is still a placeholder, as we need a scrub
>> refactor so that it can scrub just a logical bytenr range.
>>
>> [ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]
>>
>> Since btrfs can utilize csums for its metadata and CoWed data, unlike
>> dm-bitmap, which can only be used for faster resilvering, we can fully
>> rebuild the full stripe, as long as:
>>
>> 1) There is no missing device
>>     For the missing device case, we still need to go with a full
>>     journal.
>>
>> 2) Untouched data stays untouched
>>     This should mostly hold true for sane hardware.
>
> Is there a case where the write-intent bitmap log is rotated?
>
> For Broadcom/LSI hardware RAID, once a disk is unplugged, the whole
> disk will be rebuilt after it is plugged in again.

There is a plan to add a new flag, WRITE_INTENT_TARGET_DEGRADED, in the
future, to record such new writes made during a degraded rw operation.

But that would have lower priority, likely coming after the full journal.

Thanks,
Qu

>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/07/07
>
>
>> And since the btrfs-specific write-intent bitmaps are pretty small
>> (4KiB in size), the overhead is much lower than a full journal.
>>
>> In the future, we may allow users to choose between just the bitmaps
>> or a full journal to meet their requirements.
>>
>> [BITMAPS DESIGN]
>>
>> The bitmaps on-disk format looks like this:
>>
>>   [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>>   |<---------  super::size (4K) ------------->|
>>
>> The super block contains how many entries are in use.
>>
>> Each entry is 128 bits (16 bytes) in size, containing one u64 for the
>> bytenr and one u64 for the bitmap.
>>
>> All utilized entries are sorted by bytenr, and no bits can overlap.
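>>
>> A simplified C view of that layout (struct and field names here are
>> illustrative only; the actual on-disk definitions carry more fields):
>>
>>   #include <linux/types.h>
>>
>>   struct wi_super {
>>           __le32 nr_entries;  /* how many entries are currently in use */
>>           /* ... more members; super::size covers the whole 4KiB area */
>>   } __attribute__((packed));
>>
>>   struct wi_entry {
>>           __le64 bytenr;      /* start of the range this entry covers */
>>           __le64 bitmap;      /* one bit per 64KiB block */
>>   } __attribute__((packed));  /* 16 bytes, kept sorted by bytenr */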
>>
>> The blocksize is currently fixed to BTRFS_STRIPE_LEN (64KiB), so each
>> entry can cover at most 4MiB, and the whole bitmap can contain 224
>> entries.
>>
>> In the worst case it can track 14MiB of dirty ranges
>> (1 bit set per bitmap, which also means 2-disk RAID5 or 3-disk RAID6).
>>
>> In the best case it can track 896MiB of dirty ranges
>> (all bits set in every bitmap).
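>>
>> A quick worked check of those numbers, plus which bits a given dirty
>> range would touch (constants taken from the description above; the
>> helper itself is a simplified illustration, not the real code):
>>
>>   #include <stdio.h>
>>   #include <inttypes.h>
>>
>>   #define WI_BLOCKSIZE   (64ULL * 1024)       /* BTRFS_STRIPE_LEN */
>>   #define WI_ENTRY_COVER (64 * WI_BLOCKSIZE)  /* 64 bits -> 4MiB */
>>
>>   /*
>>    * 224 entries * 64KiB = 14MiB  (worst case, one bit per entry)
>>    * 224 entries * 4MiB  = 896MiB (best case, every bit set)
>>    */
>>
>>   /* Which (entry bytenr, bit index) pairs does a dirty range set? */
>>   static void wi_show_bits(uint64_t start, uint64_t len)
>>   {
>>           uint64_t cur = start & ~(WI_BLOCKSIZE - 1);  /* 64KiB aligned */
>>           uint64_t end = start + len;
>>
>>           while (cur < end) {
>>                   uint64_t entry = cur & ~(WI_ENTRY_COVER - 1);
>>                   int bit = (int)((cur - entry) / WI_BLOCKSIZE);
>>
>>                   printf("entry %" PRIu64 ": bit %d\n", entry, bit);
>>                   cur += WI_BLOCKSIZE;
>>           }
>>   }
>>
>>   int main(void)
>>   {
>>           /* A 128KiB write crossing a 4MiB boundary spans two entries. */
>>           wi_show_bits(4 * 1024 * 1024 - 64 * 1024, 128 * 1024);
>>           return 0;
>>   }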
>>
>> [WHY NOT BTRFS BTREE]
>>
>> The current write-intent structure needs two features:
>>
>> - Its data needs to survive across stripe boundaries
>>    Normally this means a write-intent btree would need to act like a
>>    proper tree root and have METADATA_ITEMs for all its tree blocks.
>>
>> - Its data updates must happen outside of a transaction
>>    Currently only the log tree can do such a thing, but unfortunately
>>    the log tree cannot survive across a transaction boundary.
>>
>> Thus a write-intent btree can only meet one of the two requirements,
>> and is not a suitable solution here.
>>
>> [TESTING AND BENCHMARK]
>>
>> For the performance benchmark, unfortunately I don't have 3 HDDs to
>> test with.  I will do the benchmark after securing enough hardware.
>>
>> For testing, it survives the volume/raid/dev-replace test groups with
>> no write-intent bitmap leakage.
>>
>> Unfortunately there is still a warning triggered in btrfs/070; it is
>> still under investigation, and hopefully it is a false alert in the
>> bitmap clearing path.
>>
>> [TODO]
>> - Scrub refactor to allow us to do proper recovery at mount time
>>    We need to change the scrub interface to scrub based on logical
>>    bytenr.
>>
>>    This can be a huge amount of work, thus for now we will focus only
>>    on a new RAID56 scrub interface for write-intent recovery.
>>
>> - Extra optimizations
>>    * Skip full stripe writes
>>    * Enlarge the window between btrfs_write_intent_mark_dirty() and
>>      btrfs_write_intent_writeback()
>>
>> - Bug hunts and more fstests runs
>>
>> - Proper performance benchmark
>>    Needs hardware/baremetal VMs, since I don't have any physical
>>    machine large enough to contain three 3.5" HDDs.
>>
>>
>> Qu Wenruo (11):
>>    btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
>>    btrfs: introduce a new experimental compat RO flag,
>>      WRITE_INTENT_BITMAP
>>    btrfs: introduce the on-disk format of btrfs write intent bitmaps
>>    btrfs: load/create write-intent bitmaps at mount time
>>    btrfs: write-intent: write the newly created bitmaps to all disks
>>    btrfs: write-intent: introduce an internal helper to set bits for a
>>      range.
>>    btrfs: write-intent: introduce an internal helper to clear bits for a
>>      range.
>>    btrfs: write back write intent bitmap after barrier_all_devices()
>>    btrfs: update and writeback the write-intent bitmap for RAID56 write.
>>    btrfs: raid56: clear write-intent bitmaps when a full stripe finishes.
>>    btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
>>
>>   fs/btrfs/Makefile          |   2 +-
>>   fs/btrfs/ctree.h           |  24 +-
>>   fs/btrfs/disk-io.c         |  54 +++
>>   fs/btrfs/raid56.c          |  16 +
>>   fs/btrfs/sysfs.c           |   2 +
>>   fs/btrfs/volumes.c         |  34 +-
>>   fs/btrfs/write-intent.c    | 962 +++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/write-intent.h    | 288 +++++++++++
>>   fs/btrfs/zoned.c           |   8 +
>>   include/uapi/linux/btrfs.h |  17 +
>>   10 files changed, 1399 insertions(+), 8 deletions(-)
>>   create mode 100644 fs/btrfs/write-intent.c
>>   create mode 100644 fs/btrfs/write-intent.h
>>
>> --
>> 2.36.1
>
>


Thread overview: 20+ messages
2022-07-05  7:39 [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 01/11] btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 02/11] btrfs: introduce a new experimental compat RO flag, WRITE_INTENT_BITMAP Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 03/11] btrfs: introduce the on-disk format of btrfs write intent bitmaps Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 04/11] btrfs: load/create write-intent bitmaps at mount time Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 05/11] btrfs: write-intent: write the newly created bitmaps to all disks Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 06/11] btrfs: write-intent: introduce an internal helper to set bits for a range Qu Wenruo
2022-07-06  6:16   ` Qu Wenruo
2022-07-06  9:00     ` Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 07/11] btrfs: write-intent: introduce an internal helper to clear " Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 08/11] btrfs: write back write intent bitmap after barrier_all_devices() Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 09/11] btrfs: update and writeback the write-intent bitmap for RAID56 write Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 10/11] btrfs: raid56: clear write-intent bitmaps when a full stripe finishes Qu Wenruo
2022-07-05  7:39 ` [PATCH RFC 11/11] btrfs: warn and clear bitmaps if there is dirty bitmap at mount time Qu Wenruo
2022-07-06 23:36 ` [PATCH RFC 00/11] btrfs: introduce write-intent bitmaps for RAID56 Wang Yugui
2022-07-07  1:14   ` Qu Wenruo [this message]
2022-07-07  4:24 ` Wang Yugui
2022-07-07  4:28   ` Qu Wenruo
2022-07-07  4:40     ` Wang Yugui
2022-07-07  5:05       ` Qu Wenruo
