* btrfs error: write time tree block corruption detected
@ 2021-03-06  9:10 chil L1n
  2021-03-08  8:41 ` Johannes Thumshirn
  0 siblings, 1 reply; 8+ messages in thread
From: chil L1n @ 2021-03-06  9:10 UTC (permalink / raw)
  To: linux-btrfs

Hi,

Just noticed that one of my btrfs mounts on a server was in read-only mode.

dmesg shows:
[2217355.427810] BTRFS info (device sda3): scrub: started on devid 1
[2221262.216646] BTRFS info (device sda3): scrub: finished on devid 1
with status: 0
[2390153.679168] BTRFS info (device sda4): scrub: started on devid 1
[2393339.627095] BTRFS info (device sda4): scrub: finished on devid 1
with status: 0
[2555511.868642] BTRFS critical (device sda4): corrupt leaf: root=258
block=250975895552 slot=78, bad key order, prev (256703 108 3276800)
current (256703 108 1310720)
[2555511.868650] BTRFS error (device sda4): block=250975895552 write
time tree block corruption detected
[2555511.916529] BTRFS: error (device sda4) in
btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
writing out transaction)
[2555511.916544] BTRFS info (device sda4): forced readonly
[2555511.916547] BTRFS warning (device sda4): Skipping commit of
aborted transaction.
[2555511.916551] BTRFS: error (device sda4) in
cleanup_transaction:1832: errno=-5 IO failure
[2555511.916560] BTRFS info (device sda4): delayed_refs has NO entry
[2555511.916687] BTRFS info (device sda4): delayed_refs has NO entry

Running "btrfs check" shows no further issues:

sudo btrfs check --force --readonly /dev/sda4
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/sda4
UUID: 72deb54c-96dd-42cf-a809-bef1a135f409
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 242436075520 bytes used, no error found
total csum bytes: 235186808
total tree bytes: 961773568
total fs tree bytes: 665829376
total extent tree bytes: 43728896
btree space waste bytes: 122086726
file data blocks allocated: 447700926464
 referenced 257665265664

Kernel:
Linux amd8 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC
2021 x86_64 x86_64 x86_64 GNU/Linux

Can someone help me to pinpoint the cause of this issue and prevent it
from happening again?
If more info is needed from my side, please let me know.

Cheers,

chill


* Re: btrfs error: write time tree block corruption detected
  2021-03-06  9:10 btrfs error: write time tree block corruption detected chil L1n
@ 2021-03-08  8:41 ` Johannes Thumshirn
  2021-03-08  8:56   ` chil L1n
  0 siblings, 1 reply; 8+ messages in thread
From: Johannes Thumshirn @ 2021-03-08  8:41 UTC (permalink / raw)
  To: chil L1n, linux-btrfs

On 06/03/2021 10:11, chil L1n wrote:
> [2555511.868642] BTRFS critical (device sda4): corrupt leaf: root=258
> block=250975895552 slot=78, bad key order, prev (256703 108 3276800)
> current (256703 108 1310720)
> [2555511.868650] BTRFS error (device sda4): block=250975895552 write
> time tree block corruption detected

This /might/ be a memory bitflip:

3276800 = 0b1100100000000000000000
1310720 = 0b101000000000000000000

I guess the highest bit of the current key's offset flipped, so it should have been:
3407872 = 0b1101000000000000000000
 
(3407872 - 3276800) / 4096.0
32.0

Can you run a memtest on the machine to check if the RAM is ok?
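
The arithmetic above can be rechecked with a few lines of Python. This is only a sketch: the three offsets are copied from this mail, while the variable names, the 22-bit display width and the helper `padded_bin` are made up for illustration.

```python
# Offsets copied from the dmesg "bad key order" line and the guess above.
prev_offset = 3276800    # offset of the previous key
cur_offset = 1310720     # offset of the current (suspect) key
guess_offset = 3407872   # value the current key may have had before a flip

WIDTH = 22  # assumption: just wide enough for the largest of the three values


def padded_bin(value: int, width: int = WIDTH) -> str:
    """Binary string padded with leading zeros to a fixed width."""
    return format(value, f"0{width}b")


for name, value in (("prev", prev_offset),
                    ("current", cur_offset),
                    ("guess", guess_offset)):
    print(f"{name:8} {value:8d} 0x{value:06x} {padded_bin(value)}")

# The logged value and the guessed value differ in exactly one bit position.
flipped = cur_offset ^ guess_offset
flipped_bits = bin(flipped).count("1")
print(hex(flipped), flipped_bits)

# The guessed offset sits a whole number of 4096-byte blocks after prev.
blocks_after_prev = (guess_offset - prev_offset) // 4096
print(blocks_after_prev)
```

Printing every value at the same fixed width keeps the bit columns aligned, which makes a dropped leading zero easy to spot.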

Byte,
	Johannes


* Re: btrfs error: write time tree block corruption detected
  2021-03-08  8:41 ` Johannes Thumshirn
@ 2021-03-08  8:56   ` chil L1n
  2021-03-08  9:23     ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: chil L1n @ 2021-03-08  8:56 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

Hi Johannes,

Thanks for the advice. I'm running memtester now. This will take some
time as the machine has 32GB RAM.
Regarding your explanation, I count two bit position differences, not
1. Can you explain your reasoning?

Thanks,

chill




* Re: btrfs error: write time tree block corruption detected
  2021-03-08  8:56   ` chil L1n
@ 2021-03-08  9:23     ` Qu Wenruo
  2021-03-08  9:28       ` Qu Wenruo
  2021-03-08  9:33       ` Qu Wenruo
  0 siblings, 2 replies; 8+ messages in thread
From: Qu Wenruo @ 2021-03-08  9:23 UTC (permalink / raw)
  To: chil L1n, Johannes Thumshirn; +Cc: linux-btrfs



On 2021/3/8 4:56 PM, chil L1n wrote:
> Hi Johannes,
>
> Thanks for the advice. I'm running memtester now. This will take some
> time as the machine has 32GB RAM.
> Regarding your explanation, I count two bit position differences, not
> 1. Can you explain your reasoning?

It looks like Johannes dropped a leading 0, which caused some confusion.

With the 0 padded correctly, the result is:

3276800 = 0b1100100000000000000000
1310720 = 0b0101000000000000000000

That's why I prefer to use hex:
3276800 = 0x320000
1310720 = 0x140000
diff    = 0x200000

Definitely one bit flipped.

Thanks,
Qu



* Re: btrfs error: write time tree block corruption detected
  2021-03-08  9:23     ` Qu Wenruo
@ 2021-03-08  9:28       ` Qu Wenruo
  2021-03-08  9:33       ` Qu Wenruo
  1 sibling, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2021-03-08  9:28 UTC (permalink / raw)
  To: chil L1n, Johannes Thumshirn; +Cc: linux-btrfs



On 2021/3/8 5:23 PM, Qu Wenruo wrote:
>
>
> On 2021/3/8 4:56 PM, chil L1n wrote:
>> Hi Johannes,
>>
>> Thanks for the advice. I'm running memtester now. This will take some
>> time as the machine has 32GB RAM.
>> Regarding your explanation, I count two bit position differences, not
>> 1. Can you explain your reasoning?
>
> It looks like Johannes missed one 0, and caused some confusion.
>
> With 0 padded correctly, the result is:
>
> 3276800 = 0b1100100000000000000000
> 1310720 = 0b0101000000000000000000

What the heck? The copy & paste caused even more problems for the binary
output...

>
> That's why I prefer to use hex:
> 3276800 = 0x320000
> 1310720 = 0x140000
> diff    = 0x200000

At least the hex output is still correct.

Yeah, next time, let's not use binary output anymore.

Thanks,
Qu


* Re: btrfs error: write time tree block corruption detected
  2021-03-08  9:23     ` Qu Wenruo
  2021-03-08  9:28       ` Qu Wenruo
@ 2021-03-08  9:33       ` Qu Wenruo
  2021-03-08 10:02         ` chil L1n
  1 sibling, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2021-03-08  9:33 UTC (permalink / raw)
  To: chil L1n, Johannes Thumshirn; +Cc: linux-btrfs



On 2021/3/8 5:23 PM, Qu Wenruo wrote:
>
>
> On 2021/3/8 4:56 PM, chil L1n wrote:
>> Hi Johannes,
>>
>> Thanks for the advice. I'm running memtester now. This will take some
>> time as the machine has 32GB RAM.
>> Regarding your explanation, I count two bit position differences, not
>> 1. Can you explain your reasoning?
>
> It looks like Johannes missed one 0, and caused some confusion.
>
> With 0 padded correctly, the result is:
>
> 3276800 = 0b1100100000000000000000
> 1310720 = 0b0101000000000000000000

Oh, no, the values are correct... It's my hex diff that is incorrect.
>
> That's why I prefer to use hex:
> 3276800 = 0x320000
> 1310720 = 0x140000
> diff    = 0x200000

The diff is 0x260000 (xor).

But that can still indicate a bit flip, at the 0x200000 bit.

As the current key should be larger than the previous key, one bit flip at
0x200000 can cause the problem and trigger the tree-checker.
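
A short Python sketch of that reasoning, using only the two offsets from the dmesg line. The helper `restoring_flips` and the 22-bit search width are assumptions made for illustration; it tries each single-bit flip of the current offset and keeps those that put it back above the previous one.

```python
prev_offset = 0x320000  # 3276800, offset of the previous key
cur_offset = 0x140000   # 1310720, offset of the current key

# A direct xor of the two logged offsets: three bit positions differ.
diff = prev_offset ^ cur_offset
diff_bits = bin(diff).count("1")
print(hex(diff), diff_bits)


def restoring_flips(prev: int, cur: int, width: int = 22) -> list:
    """Masks of single-bit flips in `cur` that make it larger than `prev`.

    `width` limits the search to the low bits; 22 is an assumption chosen
    to cover the values seen in the log.
    """
    return [1 << bit for bit in range(width) if (cur ^ (1 << bit)) > prev]


masks = restoring_flips(prev_offset, cur_offset)
print([hex(mask) for mask in masks])
```

Within those 22 bits only the 0x200000 mask restores the ordering, so the xor of the two logged offsets showing three differing bits does not contradict a single flipped bit.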

Thanks,
Qu


* Re: btrfs error: write time tree block corruption detected
  2021-03-08  9:33       ` Qu Wenruo
@ 2021-03-08 10:02         ` chil L1n
  2021-03-08 10:09           ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: chil L1n @ 2021-03-08 10:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Johannes Thumshirn, linux-btrfs

Hi Qu,

Thanks for the explanation.
Personally, I prefer binary to compare bit-level changes.
Actually, I also miscounted. I count 3 bit flips. Isn't that extremely
unlikely, assuming that each bit flip is independent?
Nonetheless, I'm running another RAM test with memtester and 6GB RAM
blocks.... still no errors.
Will post an update later today.

-- 
Cheers,

Chillin



* Re: btrfs error: write time tree block corruption detected
  2021-03-08 10:02         ` chil L1n
@ 2021-03-08 10:09           ` Qu Wenruo
  0 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2021-03-08 10:09 UTC (permalink / raw)
  To: chil L1n; +Cc: Johannes Thumshirn, linux-btrfs



On 2021/3/8 6:02 PM, chil L1n wrote:
> Hi Qu,
>
> Thanks for some explanation.
> Personally, I prefer binary to compare bit-level changes.
> Actually, I also miscounted. I count 3 bit flips.

Yes, you're right, the xor also shows 3 bit flips.

But the point is not about directly comparing the two key offsets.

The point is, the bit at 0x200000 can be flipped.

If that's the case, the remaining bits no longer matter, as that one
bit flip alone makes the current key smaller than the previous key,
which triggers the problem.
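
As a rough model of the ordering rule (not the kernel's tree-checker code), keys can be treated as (objectid, type, offset) tuples that must be strictly increasing within a leaf. The function `first_bad_slot` and the two-entry lists are hypothetical; only the key values come from the dmesg line.

```python
from typing import List, Optional, Tuple

Key = Tuple[int, int, int]  # (objectid, type, offset)


def first_bad_slot(keys: List[Key]) -> Optional[int]:
    """Return the first index whose key is not strictly greater than the
    key before it, or None if the whole list is in order."""
    for slot in range(1, len(keys)):
        # Python compares tuples field by field, left to right.
        if keys[slot] <= keys[slot - 1]:
            return slot
    return None


# The two keys exactly as logged: the second offset is smaller.
leaf_as_logged = [(256703, 108, 3276800), (256703, 108, 1310720)]

# The same pair with the 0x200000 bit set in the second offset.
leaf_if_bit_set = [(256703, 108, 3276800), (256703, 108, 1310720 | 0x200000)]

print(first_bad_slot(leaf_as_logged))
print(first_bad_slot(leaf_if_bit_set))
```

The returned index refers to the position in these short example lists, not to slot 78 of the real leaf.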

> Isn't that extremely
> unlikely, assuming that each bit flip is independent?
> Nonetheless, I'm running another RAM test with memtester and 6GB RAM
> blocks.... still no errors.
> Will post an update later today.

I'd recommend running UEFI memtest86.

This should really test the full system RAM, without anything else
affecting the result.
(This also means you are not able to use the computer obviously)

From my personal experience, especially with the write-time tree-checker,
it's almost certain that something is wrong with the system.

RAM is the most common cause, and personally I'm very proud that the
tree-checker has detected more than a dozen similar cases, quite a
lot of which turned out to be hardware problems.

Thanks,
Qu

