From: Christian Pernegger <pernegger@gmail.com>
To: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: first it froze, now the (btrfs) root fs won't mount ...
Date: Sun, 20 Oct 2019 12:11:47 +0200
Message-ID: <CAKbQEqG8Sb-5+wx4NMfjORc-XnCtytuoqRw4J9hk2Pj9BNY_9g@mail.gmail.com>
In-Reply-To: <4608b644-0fa3-7212-af45-0f4eebfb0be9@gmx.com>

[Re-send, hit reply instead of reply-all by mistake. Please CC me, I'm
not on the list.]

Good morning & thank you.

On Sun, 20 Oct 2019 at 02:38, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> It looks like you're using eGPU and the thunderbolt 3 connection disconnect?
> That would cause a kernel panic/hang or whatever.

No, it's a Radeon VII in a Gigabyte X570 Aorus Master. The board has
PCIe 4, otherwise nothing exotic.

> > [...]
> > BTRFS error [...]: bad tree block start, want 284041084928 have 0
> > BTRFS error [...]: failed to read block groups: -5
> > BTRFS error [...]: open_ctree failed
["big number" filled in above]

> This means some tree blocks didn't reach disk or just got wiped out.
> Are you using discard mount option?

Not to my knowledge. I didn't set "discard", and as far as I can
remember it didn't show up in the mount output, but it's possible
that it's on by default.
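
For reference, a minimal sketch of how the active mount options could be
checked for "discard". The helper name has_discard is hypothetical, and
findmnt (util-linux) is assumed to be available. It only reports on the
root filesystem that is currently mounted, so on the affected machine it
is useful only once the filesystem mounts again.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a comma-separated mount option string
# contains "discard" or "discard=<mode>". "nodiscard" does not match.
has_discard() {
    case ",$1," in
        *,discard,*|*,discard=*) return 0 ;;
        *) return 1 ;;
    esac
}

# Options of the currently mounted root filesystem (empty if findmnt fails).
opts=$(findmnt -no OPTIONS / 2>/dev/null || true)

if has_discard "$opts"; then
    echo "discard is active on /"
else
    echo "no discard option found on /"
fi
```

The same helper can be fed the options column of an /etc/fstab entry to
see whether discard was requested at mount time.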

> > running btrfs check gives:
> > checksum verify failed on 284041084928 found E4E3BDB6 wanted 00000000
> > checksum verify failed on 284041084928 found E4E3BDB6 wanted 00000000
> > bytenr mismatch, want=284041084928, have=0
> > ERROR: cannot open filesystem.
["big number" and "8-digit hex" filled in above]

> Again, some old tree blocks got wiped out.
> BTW, you don't need to wipe the numbers; sometimes they help developers find corner-case problems.

I was just being lazy, sorry about that.

> If it's the only problem, you can try this kernel branch to at least do
> a RO mount:
> https://github.com/adam900710/linux/tree/rescue_options
>
> Then mount the fs with "rescue=skipbg,ro" option.
> If the bad tree block is the only problem, it should be able to mount it.
>
> If that mount succeeded, and you can access all files, then it means
> only extent tree is corrupted, then you can try btrfs check
> --init-extent-tree, there are some reports of --init-extent-tree fixed
> the problem.

You wouldn't happen to know of a bootable rescue image that has this?
The affected machine obviously doesn't boot, getting the NVMe out
requires dismantling the CPU cooler, and TBH, I haven't built a kernel
in ~15 years.
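
As a sketch of the sequence suggested above, assuming a rescue system
that runs a kernel built from the rescue_options branch: the script below
only prints the commands (a dry run) rather than executing them. The
device path, the mount point, and the backup destination are placeholders
and not values from this thread, and recovery_plan is a hypothetical
helper name.

```shell
#!/bin/sh
# Dry-run sketch: prints the recovery steps instead of running them.
# DEV and MNT are placeholders (assumptions); adjust before use.
DEV=${DEV:-/dev/nvme0n1p2}
MNT=${MNT:-/mnt/rescue}

recovery_plan() {
    # Step 1: read-only mount that skips loading block groups
    # (requires the patched kernel with the rescue= mount options).
    echo "mount -t btrfs -o rescue=skipbg,ro $1 $2"
    # Step 2: copy the data somewhere safe while the mount is read-only.
    echo "cp -a $2/. /path/to/backup/"
    # Step 3: unmount, then rebuild the extent tree offline.
    echo "umount $2"
    echo "btrfs check --init-extent-tree $1"
}

recovery_plan "$DEV" "$MNT"
```

Rebuilding the extent tree rewrites metadata, so it only makes sense
after the read-only mount has worked and the files have been copied off.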

> About the cause, either btrfs didn't write some tree blocks correctly or
> the NVMe doesn't implement FUA/FLUSH correctly (which I don't believe is
> the case).
>
> So it's recommended to update the kernel to 5.3 kernel.

FWIW, it's a Samsung 970 Evo Plus.
TBH, I didn't expect to lose more than the last couple minutes of
writes in such a crash, certainly not an unmountable filesystem. So
I'd love to know what caused this so I can avoid it in future. But
first things first, have to get this thing up & running again ...

Cheers,
Christian

On Sun, 20 Oct 2019 at 02:38, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2019/10/20 6:34 AM, Christian Pernegger wrote:
> > [Please CC me, I'm not on the list.]
> >
> > Hello,
> >
> > I'm afraid I could use some help.
> >
> > The affected machine froze during a game, was entirely unresponsive
> > locally, though ssh still worked. For completeness' sake, dmesg had:
> > [110592.128512] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> > timeout, signaled seq=3404070, emitted seq=3404071
> > [110592.128545] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
> > information: process Xorg pid 1191 thread Xorg:cs0 pid 1204
> > [110592.128549] amdgpu 0000:0c:00.0: GPU reset begin!
> > [110592.138530] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
> > timeout, signaled seq=13149116, emitted seq=13149118
> > [110592.138577] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
> > information: process Overcooked.exe pid 4830 thread dxvk-submit pid
> > 4856
> > [110592.138579] amdgpu 0000:0c:00.0: GPU reset begin!
>
> It looks like you're using eGPU and the thunderbolt 3 connection disconnect?
> That would cause a kernel panic/hang or whatever.
>
> >
> > Oh well, I thought, and "shutdown -h now" it. That quit my ssh session
> > and locked me out, but otherwise didn't take, no reboot, still frozen.
> > Alt-SysRq-REISUB it was. That did it.
> >
> > Only now all I get is a rescue shell, the pertinent messages look to
> > be [everything is copied off the screen by hand]:
> > [...]
> > BTRFS info [...]: disk space caching is enabled
> > BTRFS info [...]: has skinny extents
> > BTRFS error [...]: bad tree block start, want [big number] have 0
> > BTRFS error [...]: failed to read block groups: -5
> > BTRFS error [...]: open_ctree failed
>
> This means some tree blocks didn't reach disk or just got wiped out.
>
> Are you using discard mount option?
>
> >
> > Mounting with -o ro,usebackuproot doesn't change anything.
> >
> > running btrfs check gives:
> > checksum verify failed on [same big number] found [8 digits hex] wanted 00000000
> > checksum verify failed on [same big number] found [8 digits hex] wanted 00000000
>
> Again, some old tree blocks got wiped out.
>
> BTW, you don't need to wipe the numbers; sometimes they help developers
> find corner-case problems.
>
> > bytenr mismatch, want=[same big number], have=0
> > ERROR: cannot open filesystem.
> >
> > That's all I've got, I'd really appreciate some help. There's hourly
> > snapshots courtesy of Timeshift, though I have a feeling those won't
> > help ...
>
> If it's the only problem, you can try this kernel branch to at least do
> a RO mount:
> https://github.com/adam900710/linux/tree/rescue_options
>
> Then mount the fs with "rescue=skipbg,ro" option.
> If the bad tree block is the only problem, it should be able to mount it.
>
> If that mount succeeded, and you can access all files, then it means
> only extent tree is corrupted, then you can try btrfs check
> --init-extent-tree, there are some reports of --init-extent-tree fixed
> the problem.
>
> >
> > Oh, it's a recent Linux Mint 19.2 install, default layout (@, @home),
> > Timeshift enabled; on a single device (NVMe). HWE kernel (Kernel
> > 5.0.0-31-generic), btrfs-progs 4.15.1.
>
> About the cause, either btrfs didn't write some tree blocks correctly or
> the NVMe doesn't implement FUA/FLUSH correctly (which I don't believe is
> the case).
>
> So it's recommended to update the kernel to 5.3 kernel.
>
> Thanks,
> Qu
>
> >
> > TIA,
> > Christian
> >
>
