From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Tavian Barnes <tavianator@tavianator.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: About the weird tree block corruption
Date: Sat, 16 Mar 2024 09:44:20 +1030
Message-ID: <08e2a247-cb8c-4454-8625-f0279682e18f@gmx.com>
In-Reply-To: <CABg4E-mNTRoTqW9Hj=9tO=8=N3PQx7C-BYee1V6QMtGDiqr9eg@mail.gmail.com>



On 2024/3/16 08:45, Tavian Barnes wrote:
> On Fri, Mar 15, 2024 at 4:21 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> On 2024/3/16 06:31, Tavian Barnes wrote:
>>> On Fri, Mar 15, 2024 at 11:23 AM Tavian Barnes
>>> <tavianator@tavianator.com> wrote:
>>>> On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>> [SNIP]
>>>>>
>>>>> The second patch makes the tree-checker BUG_ON() when something
>>>>> goes wrong.
>>>>> This patch should only be applied if you can reliably reproduce the
>>>>> problem inside a VM.
>>>>
>>>> Okay, I have finally reproduced this in a VM.  I had to add this hunk
>>>> to your patch 0002 in order to trigger the BUG_ON:
>>>>
>>>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>>>> index c8fbcae4e88e..4ee7a717642a 100644
>>>> --- a/fs/btrfs/tree-checker.c
>>>> +++ b/fs/btrfs/tree-checker.c
>>>> @@ -2047,6 +2051,7 @@ int btrfs_check_eb_owner(const struct extent_buffer *eb, u64 root_owner)
>>>>                                   btrfs_header_level(eb) == 0 ? "leaf" : "node",
>>>>                                   root_owner, btrfs_header_bytenr(eb), eb_owner,
>>>>                                   root_owner);
>>>> +                       BUG_ON(1);
>>>>                           return -EUCLEAN;
>>>>                   }
>>>>                   return 0;
>>>> @@ -2062,6 +2067,7 @@ int btrfs_check_eb_owner(const struct extent_buffer *eb, u64 root_owner)
>>>>                           btrfs_header_level(eb) == 0 ? "leaf" : "node",
>>>>                           root_owner, btrfs_header_bytenr(eb), eb_owner,
>>>>                           BTRFS_FIRST_FREE_OBJECTID, BTRFS_LAST_FREE_OBJECTID);
>>>> +               BUG_ON(1);
>>>>                   return -EUCLEAN;
>>>>           }
>>>>           return 0;
>>>>
>>>>> When using the 2nd patch, it's strongly recommended to enable the
>>>>> following sysctls:
>>>>>
>>>>>     kernel.ftrace_dump_on_oops = 1
>>>>>     kernel.panic = 5
>>>>>     kernel.panic_on_oops = 1
>>>>
>>>> I also set kernel.oops_all_cpu_backtrace = 1, and ran with nowatchdog;
>>>> otherwise I got watchdog backtraces (due to the slow console)
>>>> interspersed with the traces, which made them hard to read.
>>>>
>>>>> And you need a way to reliably access the VM (either netconsole or a
>>>>> serial console setup).
>>>>> In that case, the whole ftrace buffer would be dumped to the
>>>>> netconsole/serial console.
>>>>>
>>>>> This has the extra benefit of reducing the noise, but it really needs
>>>>> a reliable VM setup and can be a little tricky to set up.
>>>>
>>>> I got this to work, the console logs are attached.  I added
>>>>
>>>>       echo 1 > $tracefs/buffer_size_kb
>>>>
>>>> otherwise it tried to dump 48MiB over the serial console, which I
>>>> didn't have the patience for.  Hopefully that's a big enough buffer;
>>>> I can re-run it if you need more logs.
>>>>
>>>>> Feel free to ask for any extra help setting up the environment, as
>>>>> you're our last hope to properly pin down the bug.
>>>>
>>>> Hopefully this trace helps you debug this.  Let me know whenever you
>>>> have something else for me to test.
>>>>
>>>> I can also try to send you the VM, but I'm not sure how to package it
>>>> up exactly.  It has two (emulated) NVMe drives with LUKS and btrfs
>>>> raid0 on top.
>>>
>>> I added eb->start to the "corrupted node/leaf" message so I could look
>>> for relevant lines in the trace output.  From another run, I see this:
>>>
>>> $ grep 'eb=15302115328' typescript-5
>>> [ 2725.113804] BTRFS critical (device dm-0): corrupted leaf, root=258 block=15302115328 owner mismatch, have 13709377558419367261 expect [256, 18446744073709551360] eb=15302115328
>>> [ 2740.240046] iou-wrk--173727   15..... 2649295481us : alloc_extent_buffer: alloc_extent_buffer: alloc eb=15302115328 len=16384
>>> [ 2740.301767] kworker/-322      15..... 2649295735us : end_bbio_meta_read: read done, eb=15302115328 page_refs=3 eb level=0 fsid=b66a67f0-8273-4158-b7bf-988bb5683000
>>> [ 2740.328424] kworker/-5095     31..... 2649295941us : end_bbio_meta_read: read done, eb=15302115328 page_refs=8 eb level=0 fsid=b66a67f0-8273-4158-b7bf-988bb5683000
>>>
>>> I am surprised to see two end_bbio_meta_read lines with only one
>>> matching alloc_extent_buffer.  That made me check the locking in
>>> read_extent_buffer_pages() again, and I think I may have found
>>> something.
>>>
>>> Let's say two threads simultaneously call read_extent_buffer_pages()
>>> on the same eb.  Thread 1 starts reading:
>>>
>>>       if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
>>>           return 0;
>>>
>>>       /*
>>>        * We could have had EXTENT_BUFFER_UPTODATE cleared by the write
>>>        * operation, which could potentially still be in flight.  In this case
>>>        * we simply want to return an error.
>>>        */
>>>       if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
>>>           return -EIO;
>>>
>>>       /* (Thread 2 pre-empted here) */
>>>
>>>       /* Someone else is already reading the buffer, just wait for it. */
>>>       if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
>>>           goto done;
>>>       ...
>>>
>>> Meanwhile, thread 2 does the same thing but gets preempted at the
>>> marked point, before testing EXTENT_BUFFER_READING.  Now the read
>>> finishes, and end_bbio_meta_read() does
>>>
>>>               btrfs_folio_set_uptodate(fs_info, folio, start, len);
>>>       ...
>>>       clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
>>>       smp_mb__after_atomic();
>>>       wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
>>>
>>> Now thread 2 resumes executing, atomically sets EXTENT_BUFFER_READING
>>> (again) and starts reading into the already-filled-in extent buffer.
>>> This might normally be a benign race, except end_bbio_meta_read() has
>>> also set EXTENT_BUFFER_UPTODATE.  So now if a third thread tries to
>>> read the same extent buffer, it will do
>>>
>>>       if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
>>>           return 0;
>>>
>>> and return 0 *while the eb is still under I/O*.  The caller will then
>>> try to read data from the extent buffer which is concurrently being
>>> updated by the extra read started by thread 2.
>>
>> Awesome! Not only did you get a super good trace, you even pinned down
>> the root cause.
>
> Thanks!
>
>> The trace indeed shows duplicated metadata read endio completions,
>> which means we do have a race where we shouldn't.
>
> Okay, glad to see I read the trace correctly.  Now that we think we
> understand it, can you think of a more reliable reproducer?  It would
> need lots of read_block_for_search() calls on the same block in parallel.
>
>> Although I'm not 100% sure about the fix, the normal way we handle
>> this would be to lock the page before the IO and unlock it at endio
>> time.
>
> Fair enough, I don't really know the locking rules in btrfs or fs/mm
> generally.  I'm pretty sure the double-checked approach from my patch
> is enough to protect read_extent_buffer_pages() from racing with
> itself in the way I described, but I don't know whether a lock is
> required for other reasons.  It would be nice to avoid taking a lock
> in this path, but I'm not sure how important that is.
>
> I do see that this code path used to take page locks until commit
> d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer
> reading").
>
> I'm currently re-running the reproducer with my patch.  If it survives
> the night without crashing, I'll post it formally, unless you'd like
> to do it a different way.

After digging deeper, your fix looks fine.
(Just remove the cleanup, as we need a minimal hotfix for backporting,
and don't forget to CC the fix to stable.  If it's a regression, please
also identify the commit that introduced it; I believe the commit you
mentioned is the culprit, since we previously had extra page locking
that prevented this from happening.)
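
For reference, the tags on the formal patch would look something like
this (please double-check that d7172f52e993 really is the first bad
commit, and which stable kernels carry it, before sending):

  Fixes: d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading")
  CC: stable@vger.kernel.org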

I can only come up with the following racing cases, and they all look
fine with your fix (test_and_set() below denotes the return value of
test_and_set_bit()):

Case 1:
                T1                |               T2
----------------------------------+-------------------------------
test_bit(UPTODATE) == false       |
                                  | test_bit(UPTODATE) == false
test_and_set(READING) == false    |
(T1 starts the read)              |
                                  | test_and_set(READING) == true
                                  | goto wait

Everything is fine; this is the old common case.


Case 2:
                T1                |               T2
----------------------------------+-------------------------------
test_bit(UPTODATE) == false       |
                                  | test_bit(UPTODATE) == false
test_and_set(READING) == false    |
end_io:                           |
set_bit(UPTODATE)                 |
clear_bit(READING)                |
                                  | test_and_set(READING) == false
                                  | re-check UPTODATE == true
                                  | done

This is the case where the double read used to happen; with your
re-check of UPTODATE after taking READING, it's fine now.

Case 3:
The extra race between UPTODATE and READING:

                T1                |               T2
----------------------------------+-------------------------------
test_bit(UPTODATE) == false       |
                                  | test_bit(UPTODATE) == false
test_and_set(READING) == false    |
end_io:                           |
set_bit(UPTODATE)                 |
                                  | test_and_set(READING) == true
                                  | wait
clear_bit(READING)                |

Still safe, because UPTODATE is set before READING is cleared.
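
For reference, the "wait" in cases 1 and 3 is the done: path of
read_extent_buffer_pages(), which from memory looks something like the
following; note that it re-checks UPTODATE after the wait, so a waiter
never returns success for a buffer whose read actually failed:

done:
	if (wait == WAIT_COMPLETE) {
		/* Sleep until the reading thread clears READING. */
		wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_READING,
			       TASK_UNINTERRUPTIBLE);
		if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
			return -EIO;
	}
	return 0;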

So yep, your fix should be fine.
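
If I read your patch correctly, the core of it is re-checking UPTODATE
right after winning the test_and_set_bit() on READING, i.e. something
like this (a sketch of the idea from memory, not your exact diff; the
unlock-and-wake sequence mirrors what end_bbio_meta_read() already does):

	/* Someone else is already reading the buffer, just wait for it. */
	if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
		goto done;

	/*
	 * Re-check UPTODATE now that we own READING: another reader may
	 * have set UPTODATE and cleared READING between our first check
	 * and the test_and_set_bit() above (case 2), so we must not start
	 * a second read into the now-valid buffer.
	 */
	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) {
		clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
		smp_mb__after_atomic();
		wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
		return 0;
	}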

Although initially I'd have preferred a more straightforward
spinlock-based solution like this:

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7441245b1ceb..88fcf1df87c7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4318,20 +4318,28 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num,
  	struct btrfs_bio *bbio;
  	bool ret;

-	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+	spin_lock(&eb->refs_lock);
+	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) {
+		spin_unlock(&eb->refs_lock);
  		return 0;
+	}

  	/*
  	 * We could have had EXTENT_BUFFER_UPTODATE cleared by the write
  	 * operation, which could potentially still be in flight.  In this case
  	 * we simply want to return an error.
  	 */
-	if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
+	if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))) {
+		spin_unlock(&eb->refs_lock);
  		return -EIO;
+	}

  	/* Someone else is already reading the buffer, just wait for it. */
-	if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
+	if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags)) {
+		spin_unlock(&eb->refs_lock);
  		goto done;
+	}
+	spin_unlock(&eb->refs_lock);

  	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
  	eb->read_mirror = 0;

But this is much larger than your fix.
You may want to add some extra explanation of the race to your fix's
changelog, though.

Otherwise, I'm pretty happy to see we finally pinned it down.

Thanks,
Qu
>> I believe we can do that and properly solve it.
>> And fall back to using a spinlock to put all the bit tests in a
>> critical section if the page-lock approach is not feasible.
>>
>> Thank you so much!
>> You really saved us all!
>
> No problem, happy to help :)
>
>> Thanks,
>> Qu
>
