From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 8/8] btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio
Date: Tue, 24 May 2022 16:21:38 +0800	[thread overview]
Message-ID: <b78b6c09-eb70-68a7-7e69-e8481378b968@gmx.com> (raw)
In-Reply-To: <6047f29e-966d-1bf5-6052-915c1572d07a@gmx.com>



On 2022/5/24 16:04, Qu Wenruo wrote:
>
>
> On 2022/5/24 15:32, Christoph Hellwig wrote:
>> On Mon, May 23, 2022 at 03:46:02PM +0800, Qu Wenruo wrote:
>>>> Because btrfs_repair_io_failure can't handle multiple-page I/O.  It
>>>> is also rather cumbersome because it bypasses the normal bio
>>>> mapping.  As a follow-on I'd rather move it over to btrfs_map_bio
>>>> with a special flag for the single mirror parity write rather than that
>>>> hack.
>>>
>>> In fact, so far all callers of btrfs_repair_io_failure() are always
>>> handling things inside one stripe.
>>>
>>> Thus we can easily enhance that function to handle multi-page ranges.
>>>
>>> Although a dedicated btrfs_map_bio() flag seems more generic and
>>> better.
>>
>> I did in fact think of moving btrfs_repair_io_failure over to my new
>> infrastructure, because it seems inherently possible.  Just not the
>> highest priority right now.
>>
>>>> Because the whole bio at this point is all the bad sectors.  There
>>>> is no point in writing only parts of the bio because that would leave
>>>> corruption on disk.
>>>>
>>>>>     The only reason I can think of is, we're still trying to do some
>>>>>     "optimization".
>>>>>
>>>>>     But all our bio submission is already synchronous, so I doubt such
>>>>>     "optimization" would make much difference.
>>>>
>>>> Can you explain what you mean here?
>>>
>>> We wait for the read bio anyway, so I doubt the batched write part is
>>> that important.
>>
>> I still don't understand the point.  Once we read more than a single
>> page, writing it back as a batch is completely trivial, as shown by
>> this series.  Why would we not do it?
>>
>>>
>>> If you really want, I can try to make the write part asynchronous, while
>>> still keeping the read part synchronous and easier to read.
>>
>> Asynchronous writes get us back into all the I/O completion handler
>> complexities; avoiding those was the whole reason to start on the
>> synchronous repair.
>>
>>> In your current version, the do {} while() loop iterates through all
>>> mirrors.
>>>
>>> But for the following case, we will hit problems thanks to RAID1C3
>>> again:
>>>
>>> Mirror 1    |X|X|X|X|
>>> Mirror 2    |X| |X| |
>>> Mirror 3    | |X| |X|
>>>
>>> We hit mirror 1 initially, thus @initial_mirror is 1.
>>>
>>> Then when we try mirror 2, the first sector is still bad, so we jump
>>> to the next mirror.
>>>
>>> For mirror 3, we can fix the first sector only.  The 2nd sector, still
>>> read from mirror 3, doesn't pass the checksum check.
>>> Now we have no more mirrors and still return -EIO.
>>
>> Can you share a test case?
>
> Unfortunately no real test case can exercise this.
>
> The problem is, the VFS will retry the failed read with a smaller block
> size.  That retry falls back to sector-by-sector repair, so even if our
> repair policy is terribly wrong, the bug stays hidden as long as the
> sector-by-sector behavior is fine.
>
> That's why, for my bitmap version, I had to add extra trace events to
> make sure it was really my read-repair code doing the correct repair,
> not the VFS retry, i.e. that the re-read was never triggered.
>
>>  The code resets initial_mirror as soon as
>> we make any progress, so that should not happen.
>
> Oh, I didn't see that part.
>
> In that case, yes, it will work for the checkerboard pattern, although
> not really any more efficiently.
>
> We will do 4 different reads to fix it, which is no better than
> sector-by-sector repair, and worse than the 2 reads of the bitmap
> version.
>
> But I get your point: it handles contiguous corruption better than
> sector-by-sector, and no worse than the bitmap version in that case.
>
>>
>>> So my point still stands: if we want to do batched handling, either we
>>> go bitmap or we give up.
>>
>> Why?  For the very common case of clustered corruption or entirely
>> failing reads it is significantly faster than a simple synchronous
>> read of each sector, and also much better than the existing code.
>> It is also a lot less code than the existing code base, and (maybe
>> I'm biased) a lot more readable.
>
> The problem here is: yes, you can optimize for the common case, but it
> comes with the burden of the worst cases too.
>
> And as for the readability, I strongly doubt it.
>
> Things like resetting initial_mirror make the name "initial"
> meaningless, and resetting the length partway through is also very
> quirky.

In fact, if you dropped the initial_mirror and length changes (changing
the iterators inside a loop is a readability disaster, at least to me)
and instead relied on the VFS re-read behavior to fall back to
sector-by-sector reads, I would call that better readability...

Thanks,
Qu

>
> Yes, my bitmap version is super complex, no doubt about that, but the
> idea and the code should be very straightforward.
>
> Just loop over all mirrors until the bad bitmap is all zero.  There is
> no need to reset the length or anything else halfway through; the
> bitmap and the mirror number are the only iterators.
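>
> Something like this toy user-space model is all I mean by the core loop
> (this is NOT the real kernel code, just the idea, with made-up data):
>
>   /* bad is the bitmap of still-corrupted sectors; each pass re-reads
>    * only those sectors from the next mirror and clears the bits that
>    * now pass the csum check. */
>   #include <stdio.h>
>
>   #define NR_MIRRORS 3
>   #define NR_SECTORS 4
>
>   /* good[m][s]: does mirror m hold a good copy of sector s?
>    * (the RAID1C3 checkerboard case from above) */
>   static const int good[NR_MIRRORS][NR_SECTORS] = {
>           { 0, 0, 0, 0 },   /* mirror 1: the copy we read first */
>           { 0, 1, 0, 1 },   /* mirror 2 */
>           { 1, 0, 1, 0 },   /* mirror 3 */
>   };
>
>   int main(void)
>   {
>           unsigned int bad = (1U << NR_SECTORS) - 1; /* all sectors bad */
>
>           for (int mirror = 1; mirror < NR_MIRRORS && bad; mirror++) {
>                   for (int s = 0; s < NR_SECTORS; s++)
>                           if ((bad & (1U << s)) && good[mirror][s])
>                                   bad &= ~(1U << s); /* repaired */
>                   printf("after mirror %d: bad bitmap 0x%x\n",
>                          mirror + 1, bad);
>           }
>           return bad ? 1 : 0; /* leftover bits would mean -EIO */
>   }
>
> For the checkerboard case above that is two passes (0x5 left after
> mirror 2, 0x0 after mirror 3), i.e. the two reads I mentioned.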
>
> And even if the bitmap pre-allocation fails, we have the VFS retry to
> bail us out.
> And code-wise, it's not that much simpler if you ignore the bitmap
> pre-allocation part...
>
> And for ultimate readability, the sector-by-sector method cannot be
> beaten.
> Thus I'm not particularly a fan of any middle ground here.
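>
> (To show what I mean by readability: the whole sector-by-sector method
> boils down to a loop like this toy user-space model, again NOT the real
> kernel code, just the shape of it:)
>
>   /* For every corrupted sector, try the other mirrors one by one until
>    * one copy passes the csum check; give up only when every mirror has
>    * been tried for that sector. */
>   #include <stdio.h>
>
>   #define NR_MIRRORS 3
>   #define NR_SECTORS 4
>
>   static const int good[NR_MIRRORS][NR_SECTORS] = {
>           { 0, 0, 0, 0 },   /* mirror 1: the copy we read first */
>           { 0, 1, 0, 1 },   /* mirror 2 */
>           { 1, 0, 1, 0 },   /* mirror 3 */
>   };
>
>   int main(void)
>   {
>           for (int s = 0; s < NR_SECTORS; s++) {
>                   int repaired = 0;
>
>                   for (int m = 1; m < NR_MIRRORS && !repaired; m++) {
>                           if (good[m][s]) { /* one synchronous read */
>                                   printf("sector %d repaired from mirror %d\n",
>                                          s + 1, m + 1);
>                                   repaired = 1;
>                           }
>                   }
>                   if (!repaired) {
>                           printf("sector %d: no good copy, -EIO\n", s + 1);
>                           return 1;
>                   }
>           }
>           return 0;
>   }
>
> One sector, one loop, no state carried between sectors.  That is the
> readability I'm talking about.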
>
>>
>> Bitmaps only help you with randomly splattered corruption, which simply
>> is not how SSDs or hard drives actually fail.
>
> But that's still a case you have to take into consideration.
>
> Even when a real-world SSD corrupts a big range of data, we can still
> submit a read that crosses the corruption boundary.
>
>>
>>> Such a hacky bandage seems to work at first glance and will pass your
>>> new test cases, but it doesn't do any better than synchronous
>>> sector-by-sector repair.
>>> (Forgot to mention: the new RAID1C3 test case may also be flawed, as any
>>> read that hits another mirror will cause read-repair and screw up our
>>> later retry, thus we must check the pid first before doing any read.)
>>
>> The updated version uses the read-from-mirror loop from btrfs/142,
>> which cleverly uses bash internals to not issue the read if it would
>> be done using the wrong mirror.  That also really nicely speeds up the
>> tests, including the existing btrfs/140 and btrfs/141 ones.
>>
> That's wonderful.
>
> However, the smaller problem is still there: we have no way to know
> whether it's the read-repair itself doing its part correctly, or the
> VFS retry saving our day.
>
> But yeah, the btrfs/142 method is much better and should be backported
> to btrfs/140 and btrfs/141.
>
> Thanks,
> Qu

