From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 8/8] btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio
Date: Tue, 24 May 2022 16:21:38 +0800	[thread overview]
Message-ID: <b78b6c09-eb70-68a7-7e69-e8481378b968@gmx.com> (raw)
In-Reply-To: <6047f29e-966d-1bf5-6052-915c1572d07a@gmx.com>



On 2022/5/24 16:04, Qu Wenruo wrote:
>
>
> On 2022/5/24 15:32, Christoph Hellwig wrote:
>> On Mon, May 23, 2022 at 03:46:02PM +0800, Qu Wenruo wrote:
>>>> Because btrfs_repair_io_failure can't handle multiple-page I/O.  It
>>>> is also rather cumbersome because it bypasses the normal bio
>>>> mapping.  As a follow-on I'd rather move it over to btrfs_map_bio
>>>> with a special flag for the single mirror parity write rather than that
>>>> hack.
>>>
>>> In fact, so far all callers of btrfs_repair_io_failure() are always
>>> handling things inside one stripe.
>>>
>>> Thus we can easily enhance that function to handle multi-page ranges.
>>>
>>> Although a dedicated btrfs_map_bio() flag seems more generic and
>>> better.
>>
>> I did in fact think of moving btrfs_repair_io_failure over to my new
>> infrastructure, because it seems inherently possible.  Just not the
>> highest priority right now.
>>
>>>> Because the whole bio at this point is all the bad sectors.  There
>>>> is no point in writing only parts of the bio because that would leave
>>>> corruption on disk.
>>>>
>>>>>     The only reason I can think of is, we're still trying to do some
>>>>>     "optimization".
>>>>>
>>>>>     But all our bio submission is already synchronous, so I doubt such
>>>>>     "optimization" would make much difference.
>>>>
>>>> Can you explain what you mean here?
>>>
>>> We wait for the read bio anyway, so I doubt the batched write part is
>>> that important.
>>
>> I still don't understand the point.  Once we read more than a single
>> page, writing it back as a batch is completely trivial, as shown by
>> this series.  Why would we not do it?
>>
>>>
>>> If you really want, I can try to make the write part asynchronous, while
>>> still keeping the read part synchronous and easier to read.
>>
>> Asynchronous writes get us back into all the I/O completion handler
>> complexities; avoiding those was the whole reason to start on the
>> synchronous repair.
>>
>>> In your current version, the do {} while() loop iterates through all
>>> mirrors.
>>>
>>> But for the following case, we will hit problems thanks to RAID1C3
>>> again:
>>>
>>> Mirror 1    |X|X|X|X|
>>> Mirror 2    |X| |X| |
>>> Mirror 3    | |X| |X|
>>>
>>> We hit mirror 1 initially, thus @initial_mirror is 1.
>>>
>>> Then when we try mirror 2, the first sector is still bad, so we jump
>>> to the next mirror.
>>>
>>> For mirror 3, we can fix the first sector only.  The 2nd sector, still
>>> read from mirror 3, doesn't pass the checksum check.
>>> Now we have no more mirrors and still return -EIO.
>>
>> Can you share a test case?
>
> Unfortunately no real test case can exercise this.
>
> The problem is, the VFS will retry the failed read with a smaller block
> size.  That retry falls back to sector-by-sector repair, so even if our
> repair policy is terribly wrong, the bug stays hidden as long as the
> sector-by-sector behavior is fine.
>
> That's why, for my bitmap version, I had to add extra trace events to
> make sure it was really my read-repair code doing the correct repair,
> not the VFS retry, i.e. that the re-read was never triggered.
>
>>  The code resets initial_mirror as soon as
>> we make any progress, so that should not happen.
>
> Oh, I didn't see that part.
>
> In that case, yes, it will work for the checkerboard pattern, although
> not really any more efficiently.
>
> We will do 4 different reads to fix it, which is no better than
> sector-by-sector repair, and worse than the 2 reads of the bitmap
> version.
>
> But I get your point: it handles contiguous corruption better than
> sector-by-sector, and no worse than the bitmap version in that case.
>
>>
>>> So my point still stands: if we want to do batched handling, either we
>>> go bitmap or we give up.
>>
>> Why?  For the very common case of clustered corruption or entirely
>> failing reads it is significantly faster than a simple synchronous
>> read of each sector, and also much better than the existing code.
>> It is also a lot less code than the existing code base, and (maybe
>> I'm biased) a lot more readable.
>
> The problem here is: yes, you can optimize for the common case, but it
> comes with the burden of the worst cases too.
>
> And as for the readability, I strongly doubt it.
>
> Things like resetting initial_mirror make the name "initial"
> meaningless, and resetting the length partway through is also very
> quirky.

In fact, if you dropped the initial_mirror and length changes (changing
the iterators inside a loop is a readability disaster, at least to me)
and instead relied on the VFS re-read behavior to fall back to
sector-by-sector reads, I would call that better readability...

Thanks,
Qu

>
> Yes, my bitmap version is super complex, no doubt about that, but the
> idea and the code should be very straightforward.
>
> Just loop over all mirrors until the bad bitmap is all zero.  There is
> no need to reset the length or anything else halfway through; the
> bitmap and the mirror number are the only iterators.
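>
> Something like this toy user-space model is all I mean by the core loop
> (this is NOT the real kernel code, just the idea, with made-up data):
>
>   /* bad is the bitmap of still-corrupted sectors; each pass re-reads
>    * only those sectors from the next mirror and clears the bits that
>    * now pass the csum check. */
>   #include <stdio.h>
>
>   #define NR_MIRRORS 3
>   #define NR_SECTORS 4
>
>   /* good[m][s]: does mirror m hold a good copy of sector s?
>    * (the RAID1C3 checkerboard case from above) */
>   static const int good[NR_MIRRORS][NR_SECTORS] = {
>           { 0, 0, 0, 0 },   /* mirror 1: the copy we read first */
>           { 0, 1, 0, 1 },   /* mirror 2 */
>           { 1, 0, 1, 0 },   /* mirror 3 */
>   };
>
>   int main(void)
>   {
>           unsigned int bad = (1U << NR_SECTORS) - 1; /* all sectors bad */
>
>           for (int mirror = 1; mirror < NR_MIRRORS && bad; mirror++) {
>                   for (int s = 0; s < NR_SECTORS; s++)
>                           if ((bad & (1U << s)) && good[mirror][s])
>                                   bad &= ~(1U << s); /* repaired */
>                   printf("after mirror %d: bad bitmap 0x%x\n",
>                          mirror + 1, bad);
>           }
>           return bad ? 1 : 0; /* leftover bits would mean -EIO */
>   }
>
> For the checkerboard case above that is two passes (0x5 left after
> mirror 2, 0x0 after mirror 3), i.e. the two reads I mentioned.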
>
> And even if the bitmap pre-allocation fails, we have the VFS retry to
> bail us out.
> And code-wise, it's not that much simpler if you ignore the bitmap
> pre-allocation part...
>
> And for ultimate readability, the sector-by-sector method cannot be
> beaten.
> Thus I'm not particularly a fan of any middle ground here.
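>
> (To show what I mean by readability: the whole sector-by-sector method
> boils down to a loop like this toy user-space model, again NOT the real
> kernel code, just the shape of it:)
>
>   /* For every corrupted sector, try the other mirrors one by one until
>    * one copy passes the csum check; give up only when every mirror has
>    * been tried for that sector. */
>   #include <stdio.h>
>
>   #define NR_MIRRORS 3
>   #define NR_SECTORS 4
>
>   static const int good[NR_MIRRORS][NR_SECTORS] = {
>           { 0, 0, 0, 0 },   /* mirror 1: the copy we read first */
>           { 0, 1, 0, 1 },   /* mirror 2 */
>           { 1, 0, 1, 0 },   /* mirror 3 */
>   };
>
>   int main(void)
>   {
>           for (int s = 0; s < NR_SECTORS; s++) {
>                   int repaired = 0;
>
>                   for (int m = 1; m < NR_MIRRORS && !repaired; m++) {
>                           if (good[m][s]) { /* one synchronous read */
>                                   printf("sector %d repaired from mirror %d\n",
>                                          s + 1, m + 1);
>                                   repaired = 1;
>                           }
>                   }
>                   if (!repaired) {
>                           printf("sector %d: no good copy, -EIO\n", s + 1);
>                           return 1;
>                   }
>           }
>           return 0;
>   }
>
> One sector, one loop, no state carried between sectors.  That is the
> readability I'm talking about.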
>
>>
>> Bitmaps only help you with randomly splattered corruption, which simply
>> is not how SSDs or hard drives actually fail.
>
> But that's still a case you have to take into consideration.
>
> Even when a real-world SSD corrupts a big range of data, we can still
> submit a read that crosses the corruption boundary.
>
>>
>>> Such a hacky bandage seems to work at first glance and will pass your
>>> new test cases, but it doesn't do any better than synchronous
>>> sector-by-sector repair.
>>> (Forgot to mention: the new RAID1C3 test case may also be flawed, as any
>>> read that hits another mirror will cause read-repair and screw up our
>>> later retry, thus we must check the pid first before doing any read.)
>>
>> The updated version uses the read-from-mirror loop from btrfs/142,
>> which cleverly uses bash internals to not issue the read if it would
>> be done using the wrong mirror.  That also really nicely speeds up the
>> tests, including the existing btrfs/140 and btrfs/141 ones.
>>
> That's wonderful.
>
> However, the smaller problem is still there: we have no way to know
> whether it's the read-repair itself doing its part correctly, or the
> VFS retry saving our day.
>
> But yeah, the btrfs/142 method is much better and should be backported
> to btrfs/140 and btrfs/141.
>
> Thanks,
> Qu

