* System freeze with BTRFS corruption on 4 systems with kernel 5.12 (MANJARO)
@ 2021-05-19  5:39 Swâmi Petaramesh
  2021-05-19  7:25 ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Swâmi Petaramesh @ 2021-05-19  5:39 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi list,

(Please CC: me on replies, I'm not currently subscribed to the list)

This is to report a bug with Manjaro Linux 5.12.1-2 kernel that
immediately affected 4 different, usually stable machines after update
to kernel 5.12 from 5.11 or 5.10, and went away after reverting back to
either 5.10 or 5.11.

Kernel affected : Linux 5.12.1-2-MANJARO
(Not sure if I tried Manjaro Linux 5.12.1-1-MANJARO)

Kernel not affected : All previous versions up to 5.11.18-1-MANJARO

Symptoms : Under heavy disk usage (such as performing a system backup
onto external USB HD) the machine soon completely freezes and only a
hard power cycle can get it out of it.

After reboot, systems on which BTRFS is built over bcache may show heavy
filesystem corruption.

Happened on :

- HP Laptop 1 (Intel Atom) : BTRFS over LUKS on SSD : System freeze, no
BTRFS corruption after reboot.

- Dell Laptop 1 (Intel Core2 duo) : BTRFS over LUKS on SSD : System
freeze, no BTRFS corruption after reboot.

- HP Laptop 2 : One BTRFS FS over LUKS on SSD, and one BTRFS over bcache
over LUKS on HD+SSD : System freeze, SSD BTRFS was not corrupt but BTRFS
over bcache was severely corrupt, beyond repair and had to be rebuilt
and restored from backups.

- Asus old desktop with AMD Athlon 64 X2 : BTRFS RAID-1 over bcache over
LUKS on 2 HD + SSD : System freeze, heavy BTRFS corruption that could
however be fixed by simply running a “btrfs scrub” after reverting back
to a 5.10 Manjaro kernel.


To be thorough, I also have to report an Arch Linux Intel Celeron
machine running 5.12.4-arch1-2 kernel, BTRFS over LUKS on SSD, that has
been running for a while without showing any such symptom.


Hope these reports can be useful.

Best regards.

ॐ

-- 
Swâmi Petaramesh <swami@petaramesh.org> PGP 9076E32E


* Re: System freeze with BTRFS corruption on 4 systems with kernel 5.12 (MANJARO)
  2021-05-19  5:39 System freeze with BTRFS corruption on 4 systems with kernel 5.12 (MANJARO) Swâmi Petaramesh
@ 2021-05-19  7:25 ` Qu Wenruo
  2021-05-19  9:17   ` Swâmi Petaramesh
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2021-05-19  7:25 UTC (permalink / raw)
  To: Swâmi Petaramesh, Btrfs BTRFS



On 2021/5/19 1:39 PM, Swâmi Petaramesh wrote:
> Hi list,
>
> (Please CC: me on replies, I'm not currently subscribed to the list)
>
> This is to report a bug with Manjaro Linux 5.12.1-2 kernel that
> immediately affected 4 different, usually stable machines after update
> to kernel 5.12 from 5.11 or 5.10, and went away after reverting back to
> either 5.10 or 5.11.
>
> Kernel affected : Linux 5.12.1-2-MANJARO
> (Not sure if I tried Manjaro Linux 5.12.1-1-MANJARO)
>
> Kernel not affected : All previous versions up to 5.11.18-1-MANJARO
>
> Symptoms : Under heavy disk usage (such as performing a system backup
> onto external USB HD) the machine soon completely freezes and only a
> hard power cycle can get it out of it.

Any dying message?

>
> After reboot, systems on which BTRFS is built over bcache may show heavy
> filesystem corruption.

Which kind of corruption? Just data csum mismatch?

Does `btrfs check` report other problems?

Thanks,
Qu
>
> Happened on :
>
> - HP Laptop 1 (Intel Atom) : BTRFS over LUKS on SSD : System freeze, no
> BTRFS corruption after reboot.
>
> - Dell Laptop 1 (Intel Core2 duo) : BTRFS over LUKS on SSD : System
> freeze, no BTRFS corruption after reboot.
>
> - HP Laptop 2 : One BTRFS FS over LUKS on SSD, and one BTRFS over bcache
> over LUKS on HD+SSD : System freeze, SSD BTRFS was not corrupt but BTRFS
> over bcache was severely corrupt, beyond repair and had to be rebuilt
> and restored from backups.
>
> - Asus old desktop with AMD Athlon 64 X2 : BTRFS RAID-1 over bcache over
> LUKS on 2 HD + SSD : System freeze, heavy BTRFS corruption that could
> however be fixed by simply running a “btrfs scrub” after reverting back
> to a 5.10 Manjaro kernel.
>
>
> To be thorough, I also have to report an Arch Linux Intel Celeron
> machine running 5.12.4-arch1-2 kernel, BTRFS over LUKS on SSD, that has
> been running for a while without showing any such symptom.
>
>
> Hope these reports can be useful.
>
> Best regards.
>
> ॐ
>


* Re: System freeze with BTRFS corruption on 4 systems with kernel 5.12 (MANJARO)
  2021-05-19  7:25 ` Qu Wenruo
@ 2021-05-19  9:17   ` Swâmi Petaramesh
  2021-05-19 10:02     ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Swâmi Petaramesh @ 2021-05-19  9:17 UTC (permalink / raw)
  To: Qu Wenruo, Btrfs BTRFS

On 5/19/21 9:25 AM, Qu Wenruo wrote:
>
>> Kernel affected : Linux 5.12.1-2-MANJARO
>> (Not sure if I tried Manjaro Linux 5.12.1-1-MANJARO)
>>
>> Kernel not affected : All previous versions up to 5.11.18-1-MANJARO
>>
>> Symptoms : Under heavy disk usage (such as performing a system backup
>> onto external USB HD) the machine soon completely freezes and only a
>> hard power cycle can get it out of it.
>
> Any dying message?
>
No, just a sudden and complete system and disks freeze. Thus no 
messages, nothing logged.

>>
>> After reboot, systems on which BTRFS is built over bcache may show heavy
>> filesystem corruption.
>
> Which kind of corruption? Just data csum mismatch?

AFAIR it was some kind of “generation mismatch”, expected something, 
found another, in very large quantities.
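
For anyone hitting the same thing, here is a minimal sketch (not part of
the original report) of how such messages could be tallied from a saved
kernel log. It assumes the errors had the usual "parent transid verify
failed on <bytenr> wanted <gen> found <gen>" form; the function name and
the idea of tracking the largest generation gap are illustrative only.

```python
import re
from collections import Counter

# Assumed message shape, e.g.:
#   BTRFS error (device bcache0): parent transid verify failed on 30408704 wanted 1200 found 1187
TRANSID_RE = re.compile(
    r"\(device ([^)]+)\).*?parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def summarize_transid_errors(lines):
    """Count transid mismatches per device and track the largest generation gap.

    Returns (counts, max_gap), where counts maps device -> number of matching
    lines and max_gap maps device -> largest (wanted - found) seen.
    """
    counts = Counter()
    max_gap = {}
    for line in lines:
        match = TRANSID_RE.search(line)
        if match is None:
            continue  # not a transid mismatch line
        device = match.group(1)
        wanted = int(match.group(3))
        found = int(match.group(4))
        counts[device] += 1
        max_gap[device] = max(max_gap.get(device, 0), wanted - found)
    return counts, max_gap
```

It could be fed the lines of a saved `dmesg` or journal dump; a large gap
between "wanted" and "found" would hint at how many transactions never
reached the disk.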

The machine with BTRFS RAID-1 could heal itself by running a simple
btrfs scrub. I gave up on the non-RAID one: my previous experience with
similar errors made me think the FS was beyond repair, so I reformatted
and restored from backups.

> Does `btrfs check` report other problems?

I didn't try.

Thanks for the quick help :)

ॐ

-- 
Swâmi Petaramesh <swami@petaramesh.org> PGP 9076E32E



* Re: System freeze with BTRFS corruption on 4 systems with kernel 5.12 (MANJARO)
  2021-05-19  9:17   ` Swâmi Petaramesh
@ 2021-05-19 10:02     ` Qu Wenruo
  2021-05-19 13:46       ` Swâmi Petaramesh
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2021-05-19 10:02 UTC (permalink / raw)
  To: Swâmi Petaramesh, Btrfs BTRFS



On 2021/5/19 5:17 PM, Swâmi Petaramesh wrote:
> On 5/19/21 9:25 AM, Qu Wenruo wrote:
>>
>>> Kernel affected : Linux 5.12.1-2-MANJARO
>>> (Not sure if I tried Manjaro Linux 5.12.1-1-MANJARO)
>>>
>>> Kernel not affected : All previous versions up to 5.11.18-1-MANJARO
>>>
>>> Symptoms : Under heavy disk usage (such as performing a system backup
>>> onto external USB HD) the machine soon completely freezes and only a
>>> hard power cycle can get it out of it.
>>
>> Any dying message?
>>
> No, just a sudden and complete system and disks freeze. Thus no
> messages, nothing logged.

Have you tried something like netconsole to catch any output?

If it's a hang, a message would show up in dmesg after 120s.
But in that case, you should still be able to do a lot of things.

If it's something like BUG_ON(), it would immediately show up.
(And if the trace is not btrfs related, I bet it's something in the dm
layer)

Without the dying message, it's really hard to further debug.

Considering you have so many devices, it should be pretty simple to
set up a device running nc to receive all the netconsole output:
https://www.kernel.org/doc/html/latest/networking/netconsole.html
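
As an illustration of the receiving side, here is a minimal sketch (an
assumption, not something posted in this thread) of a small UDP listener
that could stand in for nc, plus a helper that assembles the
`netconsole=` parameter in the src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
layout described in the document linked above. All function names, ports
and addresses are hypothetical.

```python
import socket


def netconsole_param(src_port, src_ip, dev, tgt_port, tgt_ip, tgt_mac):
    """Assemble a netconsole= value: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac."""
    return f"{src_port}@{src_ip}/{dev},{tgt_port}@{tgt_ip}/{tgt_mac}"


def open_listener(port, host="0.0.0.0"):
    """Open a UDP socket bound to host:port (port 0 lets the OS pick one)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_console(sock, on_message, max_packets=None):
    """Pass each received datagram, decoded as text, to on_message.

    Runs until max_packets datagrams were handled, or forever if it is None.
    Returns the number of datagrams handled.
    """
    handled = 0
    while max_packets is None or handled < max_packets:
        data, _sender = sock.recvfrom(65535)
        on_message(data.decode("utf-8", errors="replace"))
        handled += 1
    return handled
```

On the logging machine one would call something like
`receive_console(open_listener(6666), print)` and leave it running; on
the machine under test, the string returned by `netconsole_param(...)`
would be passed as `modprobe netconsole netconsole=<string>`. Writing
each message to a file and flushing immediately would be a sensible
`on_message`, since the last lines before the freeze are the ones that
matter.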

>
>>>
>>> After reboot, systems on which BTRFS is built over bcache may show heavy
>>> filesystem corruption.
>>
>> Which kind of corruption? Just data csum mismatch?
>
> AFAIR it was some kind of “generation mismatch”, expected something,
> found another, in very large quantities.

That means flush command doesn't work as expected.

Considering there are extra layers involved, it's pretty hard to tell
which is the cause, btrfs or dm-* modules.

>
> The machine with BTRFS RAID-1 could heal itself out of this by running a
> simple btrfs scrub,

This further suggests that a lower layer may be doing something wrong.

If btrfs itself were causing the bug, the transid mismatch shouldn't
be recoverable at all.

A btrfs-caused error would mean broken COW, so all copies would
be corrupted.

It's really a good practice to have LUKS under all your fs, but it also
introduces an extra layer of flush problems.
Did you have any raw btrfs directly over HDD/SSD experiencing such a problem?

Thanks,
Qu

> I gave up on the non-RAID one my previous experience
> with similar errors making me think the FS was beyond repair, I
> reformatted and restored from backups.
>
>> Does `btrfs check` report other problems?
>
> I didn't try.
>
> Thanks for the quick help :)
>
> ॐ
>


* Re: System freeze with BTRFS corruption on 4 systems with kernel 5.12 (MANJARO)
  2021-05-19 10:02     ` Qu Wenruo
@ 2021-05-19 13:46       ` Swâmi Petaramesh
  0 siblings, 0 replies; 5+ messages in thread
From: Swâmi Petaramesh @ 2021-05-19 13:46 UTC (permalink / raw)
  To: Qu Wenruo, Btrfs BTRFS

On 5/19/21 12:02 PM, Qu Wenruo wrote:
>
> Have you tried something like net-console to catch something?

Nope but the machines were each time plain dead : screen frozen, mouse 
frozen, kbd frozen (LEDs not changing), no ssh, no ping, not even any 
reaction to [Magic SysRq] keys...

>
> If it's some hang, after 120s it would have some dmesg popping out.
> But in that hang case, you should still be able to do a lot of things.
>
More than a hang, it appears to be a complete kernel crash.

> Without the dying message, it's really hard to further debug.
>
I would guess so...

>> AFAIR it was some kind of “generation mismatch”, expected something,
>> found another, in very large quantities.
>
> That means flush command doesn't work as expected.
>
I would suppose that, since those machines run bcache in writeback mode,
some data didn't make it to permanent storage at the time the system
suffered a sudden death...

Thus incomplete or out-of-order data on disk.
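
To confirm which machines really run bcache in writeback mode, here is a
minimal sketch (added for illustration, not something run for this
report). It assumes the usual sysfs layout, where
/sys/block/bcacheN/bcache/cache_mode lists every mode and marks the
active one with square brackets; the function names are hypothetical.

```python
import re
from pathlib import Path


def active_cache_mode(text):
    """Return the bracketed entry from a cache_mode line, or None if absent.

    Example input: "writethrough [writeback] writearound none"
    """
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def bcache_modes(sys_block="/sys/block"):
    """Map each bcache device name found under sys_block to its active cache mode."""
    modes = {}
    for mode_file in sorted(Path(sys_block).glob("bcache*/bcache/cache_mode")):
        device = mode_file.parent.parent.name  # e.g. "bcache0"
        modes[device] = active_cache_mode(mode_file.read_text())
    return modes
```

Calling `bcache_modes()` on each machine would show, per device, whether
dirty data could have been sitting only on the caching SSD when the
freeze happened.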

> Considering there are extra layers involved, it's pretty hard to tell
> which is the cause, btrfs or dm-* modules.
>
Well... My view is that the systems crash with or without bcache, but
BTRFS gets corrupted only when bcache is in use. So I would say that
bcache is not responsible for the system crashing, but is responsible
for data not having been committed to disk properly, or in the right
order, at the time the system crashes...

I was wondering if you got any report of other kernel 5.12 issues with 
BTRFS in different configs, or kernel 5.12 crashes that might not be 
related to BTRFS...

>>
>> The machine with BTRFS RAID-1 could heal itself out of this by running a
>> simple btrfs scrub,
>
> This further proves it may be lower layer doing something wrong.
>
I would guess so...

> It's really a good practice to have LUKS under all your fs, but it also
> introduces an extra layer of flush problems.

Yes. However I've been doing this for years on a bunch of machines and 
never got any problem that would relate to this except with this 5.12 
kernel.

I was however wondering if some new optimizations introduced in BTRFS in
5.12 could have made it prone to crashes, or to something not being
properly committed to disk (use of fsyncs, barriers or whatever)...

> Did you have any raw btrfs directly over HDD/SSD experiencing such a
> problem?

Unfortunately I don't have any BTRFS outside of LUKS, except for /boot on
some machines, but that one gets so little activity that I wouldn't
expect an issue with a /boot partition.

Thanks again for your help Qu.


ॐ

-- 
Swâmi Petaramesh <swami@petaramesh.org> PGP 9076E32E


