Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: kreijack@inwind.it, Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
Date: Wed, 10 Aug 2022 17:14:25 +0800	[thread overview]
Message-ID: <7da6e44b-8e54-5e2f-66ea-b6d20b05521f@gmx.com> (raw)
In-Reply-To: <8935ce6e-3801-b35a-7468-3f6fcbab82f0@inwind.it>

On 2022/8/10 16:45, Goffredo Baroncelli wrote:
[...]
>>
>>>
>>> The key factor is "read data", where we can face three cases:
>>> 1) the data is referenced and has a checksum: we can check against the
>>> checksum and if the checksum doesn't match we should perform a recover
>>> (on the basis of the data stored on the disk)
>>
>> Then let's talk about the detail problems in the btrfs specific areas.
>>
>> Yes, it's possible that we can try to check the csums when doing RMW for
>> sub-stripe writes.
>>
>> But there are problems:
>>
>> === csum tree race ===
>>
>> 1) Data writes (with csum) into some data 1 stripes finished
>>
>> 2) By somehow the just written data is corrupted
>>
>> 3) RAID56 RMW triggered for write into data 2
>>
>> 4) We search csum tree, and found nothing, as the csum insert hasn't
>>     happen yet
>>
> This is the point. Theoretically this is doable, but this would require
> a "strict"
> order not only inside the stripe but also between all the data and
> metadata.
> This would have an impact on the performance; however I think that this
> is not
> the highest problem. The highest problem is increase of the complexity
> of the code.
>
> It seems that the ZFS approach (parity inside the extent) should not
> have this problem.
>
>
>> 5) Csum insert happens for data write in 1)
>>
>> Then we still have the destructive RMW.
>> Yes we can argue that 2) is so rare that we shouldn't really bother, but
>> in theory this can still happen, all just because the csum tree search
>> and csum insert has race window.
>>
>> === performance ===
>>
>> This means, every time we do a RMW, we have to do a full csum tree and
>> extent tree search.
>> And this also mean, we have to read the full stripe, including the range
>> we're going to write data into.
>
> Likely we need to read the checksums in any case, because we need to update
> the page that hosts it for the first write.

Checksum read and full stripe read are completely different.

Checksum read needs to involve several tree block reads by itself, and
the timing is way tricky.

For normal data read, we search csum tree beforehand, but for this
RAID56 RMW, we must search csum tree at the very end stage before we
submit the write.

I'm not 100% confident if the timing is fine.

And for normal write, even for RAID56 RMW, we never trigger read for
csum tree, we just read the full stripe at most.

Thus even the code change may not be that complex, the timing change can
be very tricky.

> Also we need to read the full stripe to compute the parity; only for
> raid 5 we can do:
>
>      new_parity = old_parity ^ new_data
>
>>
>> AKA, this is full scrub level check.
> Is the usual cost that we need to pay when we read a data from the disk.

Not the same. Scrub has extra setup, like marking the whole block group
RO, to avoid a lot of race.

For regular file read, we don't need to bother the race for the range we
want to read, we have page locked and no race can happen.

But this is no longer true for RAID56, at RAID56 level, we have nothing
like page cache to help us.

Thanks,
Qu

>
>>
>> This is definitely not a small cost, which will further slowdown the
>> RAID56 write.
>>
>> Thus even if we're going to implement such scrub level check, I'm not
>> going to make it the default option, but make it opt-in.
>>
>> Although recently since I'm working on RAID56, unless there are some
>> possible dead-lock, I believe it's possible to implement.
>>
>> Maybe in the near future we may see some prototypes to try to address
>> this problem.
>>
>> Thanks,
>> Qu
>>
>>
>>> 2) the data is referenced but doesn't have a checksum (nocow): we cannot
>>> ensure the corruption of the data if checksum is not enabled. We can
>>> only ensure the availability of the data (which may be corrupted)
>>> 3) the data is not referenced: so the data is good.
>>>
>>> So in effect for the case 2) the data may be corrupted and not
>>> recoverable (but this is true in any case); but for the case 1) from a
>>> theoretical point of view it seems recoverable. Of course this has a
>>> cost: you need to read the stripe and their checksum (doing a recovery
>>> if needed) before updating any part of the stripe itself, maintaining a
>>> strict order between the read and the writing.
>>>
>>>
>