From: Goffredo Baroncelli <kreijack@libero.it>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
Date: Wed, 10 Aug 2022 10:08:22 +0200	[thread overview]
Message-ID: <92a7cc01-4ecc-c56a-5ef4-26b28e0b2aae@libero.it> (raw)
In-Reply-To: <9f504e1b-3ee2-9072-51c7-c533c0fb315f@gmx.com>

On 09/08/2022 23.50, Qu Wenruo wrote:
>>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>>> Test case is:
>>>>
>>>>     - start with a -draid5 -mraid1 filesystem on 2 disks
>>>>
>>>>     - run assorted IO with a mix of reads and writes (randomly
>>>>     run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>     replacing one of the disks...)
>>>>
>>>>     - cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>     blkdiscard on the underlying SSD in the VM host, to simulate
>>>>     single-disk data corruption
>>>
>>> One thing to mention is, this is going to cause destructive RMW to happen.
>>>
>>> As currently substripe write will not verify if the on-disk data stripe
>>> matches its csum.
>>>
>>> Thus if the wipeout happens while above workload is still running, it's
>>> going to corrupt data eventually.
>>
>> That would be a btrfs raid5 design bug,
> 
> That's a problem that any RAID5 design would have, not just btrfs.
> 
> Any P/Q based profile will have the problem.


Hi Qu,

I looked at your description of the 'destructive RMW' cycle:

----
Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:

         0       32K     64K
Data1:  | Good  | Good  |
Data2:  | Bad   | Bad   |
Parity: | Good  | Good  |

In the above case, if we trigger any write into Data1, we will use the bad
data in Data2 to re-generate parity, killing the only chance to recover
Data2; thus Data2 is lost forever.
----
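
The destructive RMW cycle above can be sketched with a toy two-data-disk
XOR-parity model (a sketch only, not btrfs code; the buffer sizes and
contents are illustrative assumptions):

```python
def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length buffers, as RAID5 does for its P stripe."""
    return bytes(x ^ y for x, y in zip(a, b))

data1 = b"\xaa" * 4               # good on disk, also cached in RAM
data2_good = b"\xbb" * 4          # what Data2 should contain
parity = xor(data1, data2_good)   # parity written before the corruption

# Single-disk corruption hits Data2 on disk:
data2_on_disk = b"\x00" * 4

# At this point Data2 is still recoverable from Data1 and parity:
assert xor(data1, parity) == data2_good

# Destructive RMW: a write into Data1 recomputes parity from the *bad*
# on-disk Data2, without verifying its csum first.
data1_new = b"\xcc" * 4
parity = xor(data1_new, data2_on_disk)

# Recovery now reproduces the corrupted data; the good copy is gone:
assert xor(data1_new, parity) == data2_on_disk
```

Once the new parity is written, no combination of the surviving stripes
yields data2_good any more: the window for recovery closed at the RMW.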

What I don't understand is whether we have an "implementation problem" or an
intrinsic problem of raid56...

To calculate parity we need to know:
	- data1 (in ram)
	- data2 (not cached, bad on disk)

So, first, we need to read data2, then calculate the parity, and then write data1.

The key factor is "read data2", where we can face three cases:
1) the data is referenced and has a checksum: we can verify it against the checksum, and if the checksum doesn't match we should perform a recovery (on the basis of the data stored on the other disks)
2) the data is referenced but doesn't have a checksum (nocow): we cannot detect corruption of the data without a checksum. We can only ensure the availability of the data (which may be corrupted)
3) the data is not referenced: its content doesn't matter, so we can treat whatever is on disk as good.

So, in effect, for case 2) the data may be corrupted and not recoverable (but this is true in any case); for case 1), from a theoretical point of view, it seems recoverable. Of course this has a cost: you need to read the stripe and its checksums (performing a recovery if needed) before updating any part of the stripe, maintaining a strict ordering between the read and the write.
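
For case 1), the read-verify-recover-then-write ordering could look
something like this sketch (Python, with crc32 standing in for the btrfs
csum; checked_rmw and all names here are hypothetical, for illustration
only):

```python
import zlib


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def checked_rmw(old_data1, new_data1, data2_on_disk, csum2, parity):
    """RMW for case 1: data2 is referenced and checksummed.

    Verify data2 *before* using it to recompute parity; on a csum
    mismatch, rebuild data2 from the old stripe first. This is the
    strict read-before-write ordering described above."""
    if zlib.crc32(data2_on_disk) != csum2:
        # data2 is bad on disk: recover it from old_data1 + parity
        # while the old parity still matches the old data.
        data2 = xor(old_data1, parity)
        assert zlib.crc32(data2) == csum2, "stripe unrecoverable"
    else:
        data2 = data2_on_disk
    new_parity = xor(new_data1, data2)
    return data2, new_parity


# One corrupted-stripe scenario:
data1 = b"\x11" * 4
data2 = b"\x22" * 4
parity = xor(data1, data2)
csum2 = zlib.crc32(data2)

bad_data2 = b"\x00" * 4           # corrupted on-disk copy of data2
d2, new_parity = checked_rmw(data1, b"\x33" * 4, bad_data2, csum2, parity)

assert d2 == data2                # data2 recovered, not lost
assert xor(b"\x33" * 4, new_parity) == data2   # new stripe stays consistent
```

The cost is exactly the one noted above: an extra read and csum lookup
for the untouched part of the stripe on every sub-stripe write.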


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

