Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Paul Jones <paul@pauljones.id.au>, Wang Yugui <wangyugui@e16-tech.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
Date: Wed, 1 Jun 2022 20:21:16 +0800	[thread overview]
Message-ID: <4ab469dd-eedd-e838-65b7-e3159128e758@gmx.com> (raw)
In-Reply-To: <SYCPR01MB4685030F15634C6C2FEC01369EDF9@SYCPR01MB4685.ausprd01.prod.outlook.com>

On 2022/6/1 17:56, Paul Jones wrote:
>
>> -----Original Message-----
>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>> Sent: Wednesday, 1 June 2022 7:27 PM
>> To: Wang Yugui <wangyugui@e16-tech.com>
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
>>
>>
>
>>>>> If we save journal on every RAID56 HDD, it will always be very slow,
>>>>> because journal data is in a different place than normal data, so
>>>>> HDD seek is always happen?
>>>>>
>>>>> If we save journal on a device just like 'mke2fs -O journal_dev' or
>>>>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
>>>>> may not need
>>>>> RAID56/RAID1 for journal data.
>>>>
>>>> That device is the single point of failure. You lost that device,
>>>> write hole come again.
>>>
>>> The HW RAID card have 'single point of failure'  too, such as the
>>> NVDIMM inside HW RAID card.
>>>
>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
>>> failure frequency
>>
>> It's a completely different level.
>>
>> For btrfs RAID, we have no special treat for any disk.
>> And our RAID is focusing on ensuring device tolerance.
>>
>> In your RAID card case, indeed the failure rate of the card is much lower.
>> In journal device case, how do you ensure it's still true that the journal device
>> missing possibility is way lower than all the other devices?
>>
>> So this doesn't make sense, unless you introduce the journal to something
>> definitely not a regular disk.
>>
>> I don't believe this benefit most users.
>> Just consider how many regular people use dedicated journal device for
>> XFS/EXT4 upon md/dm RAID56.
>
> A good solid state drive should be far less error prone than spinning drives, so would be a good candidate. Not perfect, but better.

After more consideration, it looks like it's indeed better.

Although we break the guarantee on bad devices, if the journal device is
missing, we just fall back to the old RAID56 behavior.

It's not the best situation, but we still have all the content we have.
The problem is for future write, we may degrade the recovery ability
bytes by bytes.

Thanks,
Qu
>
> As an end user I think focusing on stability and recovery tools is a better use of time than fixing the write hole, as I wouldn't even consider using Raid56 in it's current state. The write hole problem can be alleviated by a UPS and not using Raid56 for a busy write load. It's still good to brainstorm the issue though, as it will need solving eventually.
>
> Paul.