* repeatable(ish) corrupt leaf filesystem splat on 5.1.x
@ 2019-07-04 21:03 Zygo Blaxell
  2019-07-05  0:06 ` Qu Wenruo
  2020-02-04  4:43 ` repeatable(ish) corrupt leaf filesystem splat on 5.1.x - fixed in 5.4.14, 5.5.0 Zygo Blaxell
  0 siblings, 2 replies; 4+ messages in thread
From: Zygo Blaxell @ 2019-07-04 21:03 UTC (permalink / raw)
  To: linux-btrfs


I've seen this twice in 3 days after releasing 5.1.x kernels from the
test lab:

5.1.15 on 2xSATA RAID1 SSD, during a balance:

	[48714.200014][ T3498] BTRFS critical (device dm-21): corrupt leaf: root=2 block=117776711680 slot=57, unexpected item end, have 109534755 expect 12632
	[48714.200381][ T3498] BTRFS critical (device dm-21): corrupt leaf: root=2 block=117776711680 slot=57, unexpected item end, have 109534755 expect 12632
	[48714.200399][ T9749] BTRFS: error (device dm-21) in __btrfs_free_extent:7109: errno=-5 IO failure
	[48714.200401][ T9749] BTRFS info (device dm-21): forced readonly
	[48714.200405][ T9749] BTRFS: error (device dm-21) in btrfs_run_delayed_refs:3008: errno=-5 IO failure
	[48714.200419][ T9749] BTRFS info (device dm-21): found 359 extents
	[48714.200442][ T9749] BTRFS info (device dm-21): 1 enospc errors during balance
	[48714.200445][ T9749] BTRFS info (device dm-21): balance: ended with status: -30

and 5.1.9 on 1xNVME, a few hours after some /proc NULL pointer dereference
bugs:

	[89244.144505][ T7009] BTRFS critical (device dm-4): corrupt leaf: root=2 block=1854946361344 slot=32, unexpected item end, have 1335222703 expect 15056
	[89244.144822][ T7009] BTRFS critical (device dm-4): corrupt leaf: root=2 block=1854946361344 slot=32, unexpected item end, have 1335222703 expect 15056
	[89244.144832][ T2403] BTRFS: error (device dm-4) in btrfs_run_delayed_refs:3008: errno=-5 IO failure
	[89244.144836][ T2403] BTRFS info (device dm-4): forced readonly

Each machine had been running 5.1.x for less than 24 hours after
being upgraded from 5.0.x.

The 5.1.9 machine had crashed (on 5.0.15) before, but a scrub had passed
while running 5.1.9 after the crash.  The filesystem failure occurred
20 hours later.  There were some other NULL pointer dereferences in that
uptime, so maybe 5.1.9 is just a generally buggy kernel that nobody
should ever run.  I expect better from 5.1.15, though, which had no
unusual events reported in the 8 hours between its post-reboot scrub
and the filesystem failure.

I have several other machines running 5.1.x kernels that have not yet had
such a failure--including all of my test machines, which don't seem to hit
this issue after 25+ days of stress-testing.  Most of the test machines
are using rotating disks, a few are running SSD+HDD with lvmcache.

One correlation that may be interesting:  both of the failing filesystems
had 1MB unallocated on all disks; all of the non-failing filesystems have
1GB or more unallocated on all disks.  I was running the balance on the
first filesystem to try to free up some unallocated space.  The second
filesystem died without any help from me.

It turns out that 'btrfs check --repair' can fix these!  First time
I've ever seen check --repair fix a broken filesystem.  A few files are
damaged, but the filesystem is read-write again and still working so far
(on a 5.0.21 kernel).


* Re: repeatable(ish) corrupt leaf filesystem splat on 5.1.x
  2019-07-04 21:03 repeatable(ish) corrupt leaf filesystem splat on 5.1.x Zygo Blaxell
@ 2019-07-05  0:06 ` Qu Wenruo
  2019-07-05  3:33   ` Zygo Blaxell
  2020-02-04  4:43 ` repeatable(ish) corrupt leaf filesystem splat on 5.1.x - fixed in 5.4.14, 5.5.0 Zygo Blaxell
  1 sibling, 1 reply; 4+ messages in thread
From: Qu Wenruo @ 2019-07-05  0:06 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs





On 2019/7/5 5:03 AM, Zygo Blaxell wrote:
> I've seen this twice in 3 days after releasing 5.1.x kernels from the
> test lab:
> 
> 5.1.15 on 2xSATA RAID1 SSD, during a balance:
> 
> 	[48714.200014][ T3498] BTRFS critical (device dm-21): corrupt leaf: root=2 block=117776711680 slot=57, unexpected item end, have 109534755 expect 12632
> 	[48714.200381][ T3498] BTRFS critical (device dm-21): corrupt leaf: root=2 block=117776711680 slot=57, unexpected item end, have 109534755 expect 12632
> 	[48714.200399][ T9749] BTRFS: error (device dm-21) in __btrfs_free_extent:7109: errno=-5 IO failure
> 	[48714.200401][ T9749] BTRFS info (device dm-21): forced readonly
> 	[48714.200405][ T9749] BTRFS: error (device dm-21) in btrfs_run_delayed_refs:3008: errno=-5 IO failure
> 	[48714.200419][ T9749] BTRFS info (device dm-21): found 359 extents
> 	[48714.200442][ T9749] BTRFS info (device dm-21): 1 enospc errors during balance
> 	[48714.200445][ T9749] BTRFS info (device dm-21): balance: ended with status: -30
> 
> and 5.1.9 on 1xNVME, a few hours after some /proc NULL pointer dereference
> bugs:
> 
> 	[89244.144505][ T7009] BTRFS critical (device dm-4): corrupt leaf: root=2 block=1854946361344 slot=32, unexpected item end, have 1335222703 expect 15056
> 	[89244.144822][ T7009] BTRFS critical (device dm-4): corrupt leaf: root=2 block=1854946361344 slot=32, unexpected item end, have 1335222703 expect 15056
> 	[89244.144832][ T2403] BTRFS: error (device dm-4) in btrfs_run_delayed_refs:3008: errno=-5 IO failure
> 	[89244.144836][ T2403] BTRFS info (device dm-4): forced readonly
> 
> Each machine had been running 5.1.x for less than 24 hours after
> being upgraded from 5.0.x.

All in btrfs_run_delayed_refs(), when updating the extent tree.

The check is there to prevent btrfs from accepting corrupted data, so
it runs at tree read time, where there should be no race.
And the check has been there for a while; v4.15 should be the first
kernel version with it.

So it doesn't look like a false alert, but like real corruption.
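
The invariant the checker enforces is simple.  A standalone sketch of
the check (simplified, not the kernel's actual tree-checker code; the
leaf data size assumes 16 KiB nodes):

	#include <stdint.h>
	#include <stdio.h>

	struct item { uint32_t offset; uint32_t size; };	/* simplified btrfs_item */

	/* Data area of a leaf; items pack their data downward from the end
	 * of the block.  16283 = 16384 minus the 101-byte header, an
	 * illustrative assumption. */
	#define LEAF_DATA_SIZE 16283u

	static int check_leaf_items(const struct item *items, int nritems)
	{
		for (int slot = 0; slot < nritems; slot++) {
			uint32_t have = items[slot].offset + items[slot].size;
			/* slot 0 must end at the end of the data area; each
			 * later item must end where the previous item's data
			 * begins */
			uint32_t expect = (slot == 0) ? LEAF_DATA_SIZE
						      : items[slot - 1].offset;
			if (have != expect) {
				fprintf(stderr, "corrupt leaf: slot=%d, unexpected item end, have %u expect %u\n",
					slot, have, expect);
				return -1;
			}
		}
		return 0;
	}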

Would you please provide the tree dump of the related tree blocks?

> 
> The 5.1.9 machine had crashed (on 5.0.15) before, but a scrub had passed
> while running 5.1.9 after the crash.  The filesystem failure occurred
> 20 hours later.  There were some other NULL pointer dereferences in that
> uptime, so maybe 5.1.9 is just a generally buggy kernel that nobody
> should ever run.  I expect better from 5.1.15, though, which had no
> unusual events reported in the 8 hours between its post-reboot scrub
> and the filesystem failure.

BTW, would you like to spend some extra time on a v5.2-rc kernel?
In v5.2 we have an extra write-time tree checker (and the read-time
tree checker is a little faster).
The current read-time-only checker is not good enough to expose the
cause of the corruption, let alone prevent it.

With v5.2, if the bug is still there, we could at least prevent the
corruption before it's too late.
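
To make the difference concrete -- a conceptual sketch only, reusing
check_leaf_items() from the sketch above (struct leaf and both helpers
are illustrative stand-ins, not kernel API):

	#include <errno.h>

	struct leaf { struct item items[128]; int nritems; };

	/* Read-time check (v4.15+): runs after a leaf is read back, so
	 * corruption is only caught once it is already on disk. */
	int validate_leaf_after_read(struct leaf *l)
	{
		return check_leaf_items(l->items, l->nritems) ? -EIO : 0;
	}

	/* Write-time check (new in v5.2): runs before a leaf is written,
	 * so in-memory corruption aborts the transaction while the
	 * on-disk metadata is still consistent. */
	int validate_leaf_before_write(struct leaf *l)
	{
		return check_leaf_items(l->items, l->nritems) ? -EUCLEAN : 0;
	}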

> 
> I have several other machines running 5.1.x kernels that have not yet had
> such a failure--including all of my test machines, which don't seem to hit
> this issue after 25+ days of stress-testing.  Most of the test machines
> are using rotating disks, a few are running SSD+HDD with lvmcache.
> 
> One correlation that may be interesting:  both of the failing filesystems
> had 1MB unallocated on all disks; all of the non-failing filesystems have
> 1GB or more unallocated on all disks.  I was running the balance on the
> first filesystem to try to free up some unallocated space.  The second
> filesystem died without any help from me.
> 
> It turns out that 'btrfs check --repair' can fix these!  First time
> I've ever seen check --repair fix a broken filesystem.  A few files are
> damaged, but the filesystem is read-write again and still working so far
> (on a 5.0.21 kernel).

Maybe one won't believe it, but btrfs check has had leaf re-ordering
and item offset repair from the very beginning.  I'm just not sure how
reliable it is.
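
Roughly, the offset repair can work like this (a sketch, not the actual
btrfs check code; it reuses struct item and LEAF_DATA_SIZE from the
sketch above).  If the item sizes are intact, correct offsets can be
recomputed by re-packing item data downward from the end of the leaf
data area:

	static void repair_item_offsets(struct item *items, int nritems)
	{
		uint32_t end = LEAF_DATA_SIZE;

		for (int slot = 0; slot < nritems; slot++) {
			/* place this item's data directly below the data of
			 * the item before it */
			items[slot].offset = end - items[slot].size;
			end = items[slot].offset;
		}
	}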

Thanks,
Qu


* Re: repeatable(ish) corrupt leaf filesystem splat on 5.1.x
  2019-07-05  0:06 ` Qu Wenruo
@ 2019-07-05  3:33   ` Zygo Blaxell
  0 siblings, 0 replies; 4+ messages in thread
From: Zygo Blaxell @ 2019-07-05  3:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


On Fri, Jul 05, 2019 at 08:06:13AM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/7/5 5:03 AM, Zygo Blaxell wrote:
> > I've seen this twice in 3 days after releasing 5.1.x kernels from the
> > test lab:
> > 
> > 5.1.15 on 2xSATA RAID1 SSD, during a balance:
> > 
> > 	[48714.200014][ T3498] BTRFS critical (device dm-21): corrupt leaf: root=2 block=117776711680 slot=57, unexpected item end, have 109534755 expect 12632
> > 	[48714.200381][ T3498] BTRFS critical (device dm-21): corrupt leaf: root=2 block=117776711680 slot=57, unexpected item end, have 109534755 expect 12632
> > 	[48714.200399][ T9749] BTRFS: error (device dm-21) in __btrfs_free_extent:7109: errno=-5 IO failure
> > 	[48714.200401][ T9749] BTRFS info (device dm-21): forced readonly
> > 	[48714.200405][ T9749] BTRFS: error (device dm-21) in btrfs_run_delayed_refs:3008: errno=-5 IO failure
> > 	[48714.200419][ T9749] BTRFS info (device dm-21): found 359 extents
> > 	[48714.200442][ T9749] BTRFS info (device dm-21): 1 enospc errors during balance
> > 	[48714.200445][ T9749] BTRFS info (device dm-21): balance: ended with status: -30
> > 
> > and 5.1.9 on 1xNVME, a few hours after some /proc NULL pointer dereference
> > bugs:
> > 
> > 	[89244.144505][ T7009] BTRFS critical (device dm-4): corrupt leaf: root=2 block=1854946361344 slot=32, unexpected item end, have 1335222703 expect 15056
> > 	[89244.144822][ T7009] BTRFS critical (device dm-4): corrupt leaf: root=2 block=1854946361344 slot=32, unexpected item end, have 1335222703 expect 15056
> > 	[89244.144832][ T2403] BTRFS: error (device dm-4) in btrfs_run_delayed_refs:3008: errno=-5 IO failure
> > 	[89244.144836][ T2403] BTRFS info (device dm-4): forced readonly
> > 
> > Each machine had been running 5.1.x for less than 24 hours after
> > being upgraded from 5.0.x.
> 
> All in btrfs_run_delayed_refs(), when updating the extent tree.
> 
> The check is there to prevent btrfs from accepting corrupted data, so
> it runs at tree read time, where there should be no race.
> And the check has been there for a while; v4.15 should be the first
> kernel version with it.
> 
> So it doesn't look like a false alert, but like real corruption.

There was definite persistent corruption in this case (which is different
from the "ghost" cases in the other thread).  I had to repair or replace
filesystems after each of these events.  I did a quick dump-tree and
found exactly the corruption stated in the kernel log, even after reboot.

> Would you please provide the tree dump of the related tree blocks?

Alas, these happened on production systems, so I had to repair or restore
the filesystems immediately and didn't have time to do more than a quick
analysis.  I also didn't know I was looking at a pattern of new behavior
until after the first two failures, so I didn't think of saving the tree.

I can't reproduce this on expendable test machines (they run a replica
of production workload with no problems, 27 days on 5.1.x so far),
and it's too expensive to keep running 5.1.x kernels in production if
an important filesystem is going down every few days.

> > The 5.1.9 machine had crashed (on 5.0.15) before, but a scrub had passed
> > while running 5.1.9 after the crash.  The filesystem failure occurred
> > 20 hours later.  There were some other NULL pointer dereferences in that
> > uptime, so maybe 5.1.9 is just a generally buggy kernel that nobody
> > should ever run.  I expect better from 5.1.15, though, which had no
> > unusual events reported in the 8 hours between its post-reboot scrub
> > and the filesystem failure.
> 
> BTW, would you like to spend some extra time on a v5.2-rc kernel?
> In v5.2 we have an extra write-time tree checker (and the read-time
> tree checker is a little faster).
> The current read-time-only checker is not good enough to expose the
> cause of the corruption, let alone prevent it.
> 
> With v5.2, if the bug is still there, we could at least prevent the
> corruption before it's too late.

I can build a v5.2-rc kernel and throw it at a test machine...but my test
machines can't reproduce this bug, so they might not tell us much.

> > I have several other machines running 5.1.x kernels that have not yet had
> > such a failure--including all of my test machines, which don't seem to hit
> > this issue after 25+ days of stress-testing.  Most of the test machines
> > are using rotating disks, a few are running SSD+HDD with lvmcache.
> > 
> > One correlation that may be interesting:  both of the failing filesystems
> > had 1MB unallocated on all disks; all of the non-failing filesystems have
> > 1GB or more unallocated on all disks.  I was running the balance on the
> > first filesystem to try to free up some unallocated space.  The second
> > filesystem died without any help from me.
> > 
> > It turns out that 'btrfs check --repair' can fix these!  First time
> > I've ever seen check --repair fix a broken filesystem.  A few files are
> > damaged, but the filesystem is read-write again and still working so far
> > (on a 5.0.21 kernel).
> 
> Maybe one won't believe it, but btrfs check has had leaf re-ordering
> and item offset repair from the very beginning.  I'm just not sure how
> reliable it is.

Repair seems to do well with simple software bugs (i.e. things that
corrupt only single items and don't lose entire leaf nodes) on small
filesystems (under 10 TB).  I just don't have many failures that fit
those constraints.  For me, it's either something outside of btrfs
(hardware failure, firmware bugs, or both) that completely destroys a
filesystem, or the filesystem is too large for btrfs check (it runs out
of RAM before it can do anything useful), or the filesystem runs for 5+
years with no problems at all.

> Thanks,
> Qu

* Re: repeatable(ish) corrupt leaf filesystem splat on 5.1.x - fixed in 5.4.14, 5.5.0
  2019-07-04 21:03 repeatable(ish) corrupt leaf filesystem splat on 5.1.x Zygo Blaxell
  2019-07-05  0:06 ` Qu Wenruo
@ 2020-02-04  4:43 ` Zygo Blaxell
  1 sibling, 0 replies; 4+ messages in thread
From: Zygo Blaxell @ 2020-02-04  4:43 UTC (permalink / raw)
  To: linux-btrfs


On Thu, Jul 04, 2019 at 05:03:23PM -0400, Zygo Blaxell wrote:
> I've seen this twice in 3 days after releasing 5.1.x kernels from the
> test lab:
> 
> 5.1.15 on 2xSATA RAID1 SSD, during a balance:
> 
> 	[48714.200014][ T3498] BTRFS critical (device dm-21): corrupt leaf: root=2 block=117776711680 slot=57, unexpected item end, have 109534755 expect 12632
> 	[48714.200381][ T3498] BTRFS critical (device dm-21): corrupt leaf: root=2 block=117776711680 slot=57, unexpected item end, have 109534755 expect 12632
> 	[48714.200399][ T9749] BTRFS: error (device dm-21) in __btrfs_free_extent:7109: errno=-5 IO failure
> 	[48714.200401][ T9749] BTRFS info (device dm-21): forced readonly
> 	[48714.200405][ T9749] BTRFS: error (device dm-21) in btrfs_run_delayed_refs:3008: errno=-5 IO failure
> 	[48714.200419][ T9749] BTRFS info (device dm-21): found 359 extents
> 	[48714.200442][ T9749] BTRFS info (device dm-21): 1 enospc errors during balance
> 	[48714.200445][ T9749] BTRFS info (device dm-21): balance: ended with status: -30
> 
> and 5.1.9 on 1xNVME, a few hours after some /proc NULL pointer dereference
> bugs:
> 
> 	[89244.144505][ T7009] BTRFS critical (device dm-4): corrupt leaf: root=2 block=1854946361344 slot=32, unexpected item end, have 1335222703 expect 15056
> 	[89244.144822][ T7009] BTRFS critical (device dm-4): corrupt leaf: root=2 block=1854946361344 slot=32, unexpected item end, have 1335222703 expect 15056
> 	[89244.144832][ T2403] BTRFS: error (device dm-4) in btrfs_run_delayed_refs:3008: errno=-5 IO failure
> 	[89244.144836][ T2403] BTRFS info (device dm-4): forced readonly
> 
> Each machine had been running 5.1.x for less than 24 hours after
> being upgraded from 5.0.x.
> 
> The 5.1.9 machine had crashed (on 5.0.15) before, but a scrub had passed
> while running 5.1.9 after the crash.  The filesystem failure occurred
> 20 hours later.  There were some other NULL pointer dereferences in that
> uptime, so maybe 5.1.9 is just a generally buggy kernel that nobody
> should ever run.  I expect better from 5.1.15, though, which had no
> unusual events reported in the 8 hours between its post-reboot scrub
> and the filesystem failure.
> 
> I have several other machines running 5.1.x kernels that have not yet had
> such a failure--including all of my test machines, which don't seem to hit
> this issue after 25+ days of stress-testing.  Most of the test machines
> are using rotating disks, a few are running SSD+HDD with lvmcache.
> 
> One correlation that may be interesting:  both of the failing filesystems
> had 1MB unallocated on all disks; all of the non-failing filesystems have
> 1GB or more unallocated on all disks.  I was running the balance on the
> first filesystem to try to free up some unallocated space.  The second
> filesystem died without any help from me.
> 
> It turns out that 'btrfs check --repair' can fix these!  First time
> I've ever seen check --repair fix a broken filesystem.  A few files are
> damaged, but the filesystem is read-write again and still working so far
> (on a 5.0.21 kernel).

Since this report, I have repeated this event several times on kernels
from 5.1 to 5.4, running my 10x rsync, bees dedupe, balance, scrub,
snapshot create and delete stress test.

The symptoms on each kernel are different because the bug interacts with
other capabilities and fixes in each kernel:

	5.1.21 - all test runs eventually end with corrupted metadata
	on disk (*)

	5.2.21, 5.3.18 - write time tree checker (usually) detects
	filesystem corruption and aborts transaction before metadata on
	disk is damaged

	5.4.13 - NULL pointer splats in various places, especially during
	snapshot create and during mount.  These end the test too quickly
	to tell whether there is also metadata corruption.

These are all fixed by:

	707de9c0806d btrfs: relocation: fix reloc_root lifespan and access

When this patch is applied to kernels 5.1, 5.2, or 5.3, it fixes all of
the above problems.
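
For the archives, my (hedged) understanding of the bug class: paths
that dereference root->reloc_root could race with relocation merging
and freeing the reloc root.  A paraphrase of the guard pattern -- not
the actual kernel code; stdatomic stands in for the kernel's bit ops
and barriers:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stddef.h>

	struct reloc_root;
	struct root {
		/* teardown sets this (release order) before it frees
		 * reloc_root */
		atomic_bool dead_reloc_tree;
		struct reloc_root *reloc_root;
	};

	/* Users of root->reloc_root check the dead flag first, so
	 * in-flight relocation paths stop dereferencing a pointer that
	 * is about to be freed. */
	static bool have_reloc_root(struct root *root)
	{
		if (atomic_load_explicit(&root->dead_reloc_tree,
					 memory_order_acquire))
			return false;
		return root->reloc_root != NULL;
	}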

Thanks Qu for this patch.

(*) or one of the tree mod log UAF bugs--but the metadata corruption
usually happens much faster.

