Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Christoph Hellwig <hch@lst.de>, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 4/7] btrfs: introduce new read-repair infrastructure
Date: Thu, 26 May 2022 15:37:47 +0800
Message-ID: <bf92f4ee-811e-35c0-823d-9201f1bceb0e@gmx.com>
In-Reply-To: <20220526073022.GA25511@lst.de>

On 2022/5/26 15:30, Christoph Hellwig wrote:
> On Thu, May 26, 2022 at 11:06:31AM +0800, Qu Wenruo wrote:
>> Hi Christoph, I'm pretty sure the non-continuous bio problem is here for
>> all of our attempts to rework read-repair.
>
> Why is it a problem?  Multiple discontiguous errors in the same bio
> are a very unusual error pattern.  We need to handle it obviously, but
> it doesn't need to be optimized as it is so rare.  The most common error
> pattern is that the entire read will return an error, followed by a single
> corrupted sector.

Being rare doesn't mean it won't happen.

We still need to handle it.

Furthermore, if we can submit one bio to read the whole range from a
mirror, without letting corrupted data overwrite the sectors we have
already repaired, it also means we will issue at most (num_copies - 1)
reads, without having to reset the initial mirror.
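
To make the read-count argument concrete, here is a rough user-space
sketch, not kernel code. repair_range() and the flat mirror_ok[] table
are made up for illustration; the table just stands in for "sector s
read from mirror m passes its checksum", and mirrors are assumed to be
numbered from 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: bad[s] marks sectors that failed on failed_mirror.
 * mirror_ok[(m - 1) * nr_sectors + s] says whether sector s is good on
 * mirror m. Returns the number of whole-range reads issued.
 */
int repair_range(int num_copies, int failed_mirror, size_t nr_sectors,
                 bool *bad, const bool *mirror_ok)
{
    int reads = 0;

    assert(failed_mirror >= 1 && failed_mirror <= num_copies);

    /* Try every other mirror once, in order, wrapping around. */
    for (int i = 1; i < num_copies; i++) {
        int mirror = (failed_mirror - 1 + i) % num_copies + 1;
        size_t remaining = 0;

        for (size_t s = 0; s < nr_sectors; s++)
            if (bad[s])
                remaining++;
        if (remaining == 0)
            break;

        /* One read covering the whole range on this mirror. */
        reads++;
        for (size_t s = 0; s < nr_sectors; s++) {
            /*
             * Only sectors that are still bad take data from this
             * mirror; sectors that are already good are never touched.
             */
            if (bad[s] && mirror_ok[(size_t)(mirror - 1) * nr_sectors + s])
                bad[s] = false;
        }
    }
    return reads;
}
```

The loop body runs at most (num_copies - 1) times, and each iteration
issues a single read no matter how the bad sectors are scattered.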

>
>> I'm wondering if there is some "dummy" page provided from block layer that
>> we can utilize?
>
> For reads nvme (and a few SCSI HBAs) support a bit bucket SGL for reads
> that discard parts of the data.  Right now upstream none of this is
> supported, although Keith has been looking into it (for a rather different
> use case) in nvme.  This does not help with writes, never mind the fact
> that I would not want to use exotic and barely tested code and hardware
> features for a non time critical and rarely used error handling path.

I'm not proposing the SGL method. We would still do a full range read;
the only difference is that the ranges we don't care about would be read
into some dustbin page, and only the ranges we care about would go into
the real pages.

E.g. we allocate a dedicated page per-fs (or even for the whole btrfs
module) as a dustbin page.

When we don't want to read some range, we just add that page to the bio
in its place, then submit the bio. This means the same page may appear in
one bio several times, and may be used by several different bios at the
same time.

I'm not sure the current code base can handle the case though.
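
Roughly what I have in mind, as a user-space model only. struct seg,
build_segs() and fake_read() are invented names for illustration and
are not block layer APIs; the sector size equal to the page size is
also just an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 4096 /* assumed: one sector per page */

/* Hypothetical stand-in for one segment of a bio. */
struct seg {
    unsigned char *buf;
};

/* Shared scratch page; whatever lands here is never looked at. */
static unsigned char dustbin[SECTOR_SIZE];

/*
 * Point each wanted sector at its real page and every other sector at
 * the dustbin. Returns how many segments ended up on the dustbin.
 */
size_t build_segs(struct seg *segs, unsigned char *pages,
                  const bool *want, size_t nr)
{
    size_t nr_dustbin = 0;

    assert(nr > 0);
    for (size_t i = 0; i < nr; i++) {
        if (want[i]) {
            segs[i].buf = pages + i * SECTOR_SIZE;
        } else {
            segs[i].buf = dustbin;
            nr_dustbin++;
        }
    }
    return nr_dustbin;
}

/* Model the device filling every segment of one contiguous read. */
void fake_read(const struct seg *segs, const unsigned char *disk, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        memcpy(segs[i].buf, disk + i * SECTOR_SIZE, SECTOR_SIZE);
}
```

The read stays one contiguous range on disk, but only the wanted
sectors reach the real pages. The dustbin gets overwritten once per
unwanted sector, which is harmless in this model since nothing reads it
back.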

For writes, it's pretty simple: we only write back the range once all of
it is correct. If we didn't recover the full corrupted range, we just
don't do the writeback.

Thanks,
Qu