From: Goffredo Baroncelli <kreijack@libero.it>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: misc-next and for-next: kernel BUG at fs/btrfs/extent_io.c:2350! during raid5 recovery
Date: Wed, 10 Aug 2022 10:08:22 +0200	[thread overview]
Message-ID: <92a7cc01-4ecc-c56a-5ef4-26b28e0b2aae@libero.it> (raw)
In-Reply-To: <9f504e1b-3ee2-9072-51c7-c533c0fb315f@gmx.com>

On 09/08/2022 23.50, Qu Wenruo wrote:
>>> On 2022/8/9 11:31, Zygo Blaxell wrote:
>>>> Test case is:
>>>>
>>>>     - start with a -draid5 -mraid1 filesystem on 2 disks
>>>>
>>>>     - run assorted IO with a mix of reads and writes (randomly
>>>>     run rsync, bees, snapshot create/delete, balance, scrub, start
>>>>     replacing one of the disks...)
>>>>
>>>>     - cat /dev/zero > /dev/vdb (device 1) in the VM guest, or run
>>>>     blkdiscard on the underlying SSD in the VM host, to simulate
>>>>     single-disk data corruption
>>>
>>> One thing to mention is, this is going to cause destructive RMW to happen.
>>>
>>> As currently substripe write will not verify if the on-disk data stripe
>>> matches its csum.
>>>
>>> Thus if the wipeout happens while above workload is still running, it's
>>> going to corrupt data eventually.
>>
>> That would be a btrfs raid5 design bug,
> 
> That's a problem that any RAID5 design would have, not just btrfs.
> 
> Any P/Q based profile will have the problem.


Hi Qu,

I looked at your description of the 'destructive RMW' cycle:

----
Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:

         0       32K     64K
Data1:  | Good  | Good  |
Data2:  | Bad   | Bad   |
Parity: | Good  | Good  |

In the above case, if we trigger any write into Data1, we will use the bad
data in Data2 to re-generate parity, killing the only chance to recover
Data2; thus Data2 is lost forever.
----
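
The destructive RMW cycle above can be sketched with a toy two-data-disk
XOR-parity model (a sketch only, not btrfs code; the buffer sizes and
contents are illustrative assumptions):

```python
def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length buffers, as RAID5 does for its P stripe."""
    return bytes(x ^ y for x, y in zip(a, b))

data1 = b"\xaa" * 4               # good on disk, also cached in RAM
data2_good = b"\xbb" * 4          # what Data2 should contain
parity = xor(data1, data2_good)   # parity written before the corruption

# Single-disk corruption hits Data2 on disk:
data2_on_disk = b"\x00" * 4

# At this point Data2 is still recoverable from Data1 and parity:
assert xor(data1, parity) == data2_good

# Destructive RMW: a write into Data1 recomputes parity from the *bad*
# on-disk Data2, without verifying its csum first.
data1_new = b"\xcc" * 4
parity = xor(data1_new, data2_on_disk)

# Recovery now reproduces the corrupted data; the good copy is gone:
assert xor(data1_new, parity) == data2_on_disk
```

Once the new parity is written, no combination of the surviving stripes
yields data2_good any more: the window for recovery closed at the RMW.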

What I don't understand is whether we have an "implementation problem" or an
intrinsic problem of raid56...

To calculate parity we need to know:
	- data1 (in ram)
	- data2 (not cached, bad on disk)

So, first, we need to read data2, then calculate the parity, and then write data1.

The key factor is "read data2", where we can face three cases:
1) the data is referenced and has a checksum: we can verify it against the checksum, and if the checksum doesn't match we should perform a recovery (on the basis of the data stored on the other disks)
2) the data is referenced but doesn't have a checksum (nocow): we cannot detect corruption of the data without a checksum. We can only ensure the availability of the data (which may be corrupted)
3) the data is not referenced: its content doesn't matter, so we can treat whatever is on disk as good.

So, in effect, for case 2) the data may be corrupted and not recoverable (but this is true in any case); for case 1), from a theoretical point of view, it seems recoverable. Of course this has a cost: you need to read the stripe and its checksums (performing a recovery if needed) before updating any part of the stripe, maintaining a strict ordering between the read and the write.
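
For case 1), the read-verify-recover-then-write ordering could look
something like this sketch (Python, with crc32 standing in for the btrfs
csum; checked_rmw and all names here are hypothetical, for illustration
only):

```python
import zlib


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def checked_rmw(old_data1, new_data1, data2_on_disk, csum2, parity):
    """RMW for case 1: data2 is referenced and checksummed.

    Verify data2 *before* using it to recompute parity; on a csum
    mismatch, rebuild data2 from the old stripe first. This is the
    strict read-before-write ordering described above."""
    if zlib.crc32(data2_on_disk) != csum2:
        # data2 is bad on disk: recover it from old_data1 + parity
        # while the old parity still matches the old data.
        data2 = xor(old_data1, parity)
        assert zlib.crc32(data2) == csum2, "stripe unrecoverable"
    else:
        data2 = data2_on_disk
    new_parity = xor(new_data1, data2)
    return data2, new_parity


# One corrupted-stripe scenario:
data1 = b"\x11" * 4
data2 = b"\x22" * 4
parity = xor(data1, data2)
csum2 = zlib.crc32(data2)

bad_data2 = b"\x00" * 4           # corrupted on-disk copy of data2
d2, new_parity = checked_rmw(data1, b"\x33" * 4, bad_data2, csum2, parity)

assert d2 == data2                # data2 recovered, not lost
assert xor(b"\x33" * 4, new_parity) == data2   # new stripe stays consistent
```

The cost is exactly the one noted above: an extra read and csum lookup
for the untouched part of the stripe on every sub-stripe write.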


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

