linux-btrfs.vger.kernel.org archive mirror
* BTRFS failure after resume from hibernate
@ 2020-01-20 14:45 Robbie Smith
  2020-01-20 17:29 ` Nikolay Borisov
  2020-01-21  0:10 ` Qu Wenruo
  0 siblings, 2 replies; 14+ messages in thread
From: Robbie Smith @ 2020-01-20 14:45 UTC (permalink / raw)
  To: linux-btrfs

I put my laptop into hibernation mode for a few days so I could boot
up into Windows 10 to do some things, and upon waking up BTRFS has
borked itself, spitting out errors and locking itself into read-only
mode. Is there any up-to-date information on how to fix it, short of
wiping the partition and reinstalling (which is what I ended up
resorting to last time after none of the attempts to fix it worked)?
The error messages in my journal are:

BTRFS error (device dm-0): parent transid verify failed on
223458705408 wanted 144360 found 144376
BTRFS critical (device dm-0): corrupt leaf: block=223455346688 slot=23
extent bytenr=223451267072 len=16384 invalid generation, have 144376
expect (0, 144375]
BTRFS error (device dm-0): block=223455346688 read time tree block
corruption detected
BTRFS error (device dm-0): error loading props for ino 1032412 (root 258): -5

The parent transid messages are repeated a few times. There's nothing
fancy about my BTRFS setup: subvolumes are used to emulate my root and
home partition. No RAID, no compression, though the partition does sit
beneath a dm-crypt layer using LUKS. Hibernation is done onto a
separate swap partition on the same drive.

This is the second time in six months this has happened on this
laptop. The only other thing I can think of is that the laptop BIOS
reported that the charger wasn't supplying the correct wattage, and I
have no idea why it would do that—both laptop and charger are nearly
brand-new, less than a year old. The laptop model is a Lenovo ThinkPad
T470.

I've got backups, but reinstalling is a nuisance and I really don't
want to spend a couple of days getting the laptop working again. I
don't have a conveniently large drive lying around to mirror this one
onto.

Robbie


* Re: BTRFS failure after resume from hibernate
  2020-01-20 14:45 BTRFS failure after resume from hibernate Robbie Smith
@ 2020-01-20 17:29 ` Nikolay Borisov
  2020-01-21  0:10 ` Qu Wenruo
  1 sibling, 0 replies; 14+ messages in thread
From: Nikolay Borisov @ 2020-01-20 17:29 UTC (permalink / raw)
  To: Robbie Smith, linux-btrfs



On 20.01.20 at 16:45, Robbie Smith wrote:
> I put my laptop into hibernation mode for a few days so I could boot
> up into Windows 10 to do some things, and upon waking up BTRFS has
> borked itself, spitting out errors and locking itself into read-only
> mode. Is there any up-to-date information on how to fix it, short of
> wiping the partition and reinstalling (which is what I ended up
> resorting to last time after none of the attempts to fix it worked)?
> The error messages in my journal are:


Are your btrfs and Windows installations on the same disk but on different
partitions? While you were booted into Windows, did you perform any updates?


* Re: BTRFS failure after resume from hibernate
  2020-01-20 14:45 BTRFS failure after resume from hibernate Robbie Smith
  2020-01-20 17:29 ` Nikolay Borisov
@ 2020-01-21  0:10 ` Qu Wenruo
  2020-01-21  1:39   ` Robbie Smith
  1 sibling, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-01-21  0:10 UTC (permalink / raw)
  To: Robbie Smith, linux-btrfs





On 2020/1/20 10:45 PM, Robbie Smith wrote:
> I put my laptop into hibernation mode for a few days so I could boot
> up into Windows 10 to do some things, and upon waking up BTRFS has
> borked itself, spitting out errors and locking itself into read-only
> mode. Is there any up-to-date information on how to fix it, short of
> wiping the partition and reinstalling (which is what I ended up
> resorting to last time after none of the attempts to fix it worked)?
> The error messages in my journal are:
> 
> BTRFS error (device dm-0): parent transid verify failed on
> 223458705408 wanted 144360 found 144376

The fs is already corrupted at this point.
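
To make the numbers in that message easier to read: as far as I understand
it, "wanted" is the generation (transid) the parent node recorded for the
block and "found" is the generation stored in the block actually read from
disk. Below is a minimal sketch, not a btrfs tool, that pulls those fields
out of a log excerpt. The function name and the regular expression are my
own, chosen only for illustration.

```python
import re

# Pattern for the "parent transid verify failed" message quoted above.
# \s+ also matches the line break that the mail client inserted.
TRANSID_RE = re.compile(
    r"parent transid verify failed on\s+(?P<bytenr>\d+)\s+"
    r"wanted\s+(?P<wanted>\d+)\s+found\s+(?P<found>\d+)"
)

def parse_transid_errors(log_text):
    """Return a list of (bytenr, wanted, found) tuples found in log_text."""
    return [
        (int(m["bytenr"]), int(m["wanted"]), int(m["found"]))
        for m in TRANSID_RE.finditer(log_text)
    ]

log = (
    "BTRFS error (device dm-0): parent transid verify failed on\n"
    "223458705408 wanted 144360 found 144376\n"
)
for bytenr, wanted, found in parse_transid_errors(log):
    print(f"block {bytenr}: parent expects generation {wanted}, "
          f"block on disk has generation {found}")
```

Any mismatch between the two numbers means the parent and the child block
disagree about which transaction wrote the child.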

> BTRFS critical (device dm-0): corrupt leaf: block=223455346688 slot=23
> extent bytenr=223451267072 len=16384 invalid generation, have 144376
> expect (0, 144375]

This comes from a newer tree-checker check added in recent kernels.

It can be fixed with btrfs check in this branch:
https://github.com/adam900710/btrfs-progs/tree/extent_gen_repair

But the transid error can't be repaired, so fixing this one alone doesn't
make much sense.
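
The range printed in the tree-checker message quoted above is a half-open
interval: the generation must be greater than 0 and at most the upper bound.
A minimal sketch of that comparison follows, taking both numbers straight
from the message text. The helper name and the regular expression are
illustrative only; this is not the kernel code.

```python
import re

# Matches the tail of the tree-checker message: "have <N> expect (0, <M>]"
RANGE_RE = re.compile(r"have\s+(\d+)\s+expect\s+\(0,\s*(\d+)\]")

def generation_ok(message):
    """Return True if the reported generation lies in the range (0, upper]."""
    m = RANGE_RE.search(message)
    if m is None:
        raise ValueError("no generation range found in message")
    have, upper = int(m.group(1)), int(m.group(2))
    return 0 < have <= upper

msg = ("extent bytenr=223451267072 len=16384 invalid generation, have 144376\n"
       "expect (0, 144375]")
print(generation_ok(msg))  # False: 144376 is one past the upper bound
```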

> BTRFS error (device dm-0): block=223455346688 read time tree block
> corruption detected
> BTRFS error (device dm-0): error loading props for ino 1032412 (root 258): -5
> 
> The parent transid messages are repeated a few times. There's nothing
> fancy about my BTRFS setup: subvolumes are used to emulate my root and
> home partition. No RAID, no compression, though the partition does sit
> beneath a dm-crypt layer using LUKS. Hibernation is done onto a
> separate swap partition on the same drive.

Please provide the output of "btrfs check" and kernel version.

Thanks,
Qu




* Re: BTRFS failure after resume from hibernate
  2020-01-21  0:10 ` Qu Wenruo
@ 2020-01-21  1:39   ` Robbie Smith
  2020-01-21  1:49     ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Robbie Smith @ 2020-01-21  1:39 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 21 Jan 2020 at 11:10, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/1/20 10:45 PM, Robbie Smith wrote:
> > I put my laptop into hibernation mode for a few days so I could boot
> > up into Windows 10 to do some things, and upon waking up BTRFS has
> > borked itself, spitting out errors and locking itself into read-only
> > mode. Is there any up-to-date information on how to fix it, short of
> > wiping the partition and reinstalling (which is what I ended up
> > resorting to last time after none of the attempts to fix it worked)?
> > The error messages in my journal are:
> >
> > BTRFS error (device dm-0): parent transid verify failed on
> > 223458705408 wanted 144360 found 144376
>
> The fs is already corrupted at this point.
>
> > BTRFS critical (device dm-0): corrupt leaf: block=223455346688 slot=23
> > extent bytenr=223451267072 len=16384 invalid generation, have 144376
> > expect (0, 144375]
>
> This is one newer tree-checker added in latest kernel.
>
> It can be fixed with btrfs check in this branch:
> https://github.com/adam900710/btrfs-progs/tree/extent_gen_repair
>
> But that transid error can't be repair, so it doesn't make much sense.
>
> > BTRFS error (device dm-0): block=223455346688 read time tree block
> > corruption detected
> > BTRFS error (device dm-0): error loading props for ino 1032412 (root 258): -5
> >
> > The parent transid messages are repeated a few times. There's nothing
> > fancy about my BTRFS setup: subvolumes are used to emulate my root and
> > home partition. No RAID, no compression, though the partition does sit
> > beneath a dm-crypt layer using LUKS. Hibernation is done onto a
> > separate swap partition on the same drive.
>
> Please provide the output of "btrfs check" and kernel version.

Here's the kernel and btrfs information:

> # uname -a
> Linux rocinante 5.4.10-arch1-1 #1 SMP PREEMPT Thu, 09 Jan 2020 10:14:29 +0000 x86_64 GNU/Linux
>
> # btrfs --version
> btrfs-progs v5.4
>
> # btrfs fi df /
> Data, single: total=541.01GiB, used=538.54GiB
> System, DUP: total=8.00MiB, used=80.00KiB
> Metadata, DUP: total=3.00GiB, used=1.56GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> # btrfs fi show
> Label: 'rootfs'  uuid: 25ac1f63-5986-4eb8-920f-ed7a5354c076
>         Total devices 1 FS bytes used 540.11GiB
> devid    1 size 794.25GiB used 547.02GiB path /dev/mapper/cryptroot

I tried a btrfs check and it failed almost immediately.

> # btrfs check /dev/mapper/cryptroot
> Opening filesystem to check...
> ERROR: /dev/mapper/cryptroot is currently mounted, use --force if you really intend to check the filesystem
>
> # btrfs check --force /dev/mapper/cryptroot
> Opening filesystem to check...
> WARNING: filesystem mounted, continuing because of --force
> Checking filesystem on /dev/mapper/cryptroot
> UUID: 25ac1f63-5986-4eb8-920f-ed7a5354c076
> [1/7] checking root items
> parent transid verify failed on 223455674368 wanted 144355 found 144376
> parent transid verify failed on 223455674368 wanted 144355 found 144376
> parent transid verify failed on 223455674368 wanted 144355 found 144376
> Ignoring transid failure
> parent transid verify failed on 223452872704 wanted 144358 found 144376
> parent transid verify failed on 223452872704 wanted 144358 found 144376
> parent transid verify failed on 223452872704 wanted 144358 found 144376
> Ignoring transid failure
> ERROR: child eb corrupted: parent bytenr=223602655232 item=233 parent level=1 child level=2
> ERROR: failed to repair root items: Input/output error

I haven't rebooted the laptop, in case this issue makes the laptop
unbootable, but I could try re-running the check from a live USB and
an unmounted filesystem. My Arch Live USB is from June last year, and
it's got kernel 4.20 and btrfs-progs 4.19.1 on it—will they be new
enough, or should I fetch the latest Arch disk and flash a new one?

In answer to Nikolay's questions, both Windows and Linux share a disk
but are on separate partitions, and Windows did update itself. I've
had Windows updates occur while Linux is hibernated before, and it has
no reason to touch a partition it can't read and never mounts.

Robbie


* Re: BTRFS failure after resume from hibernate
  2020-01-21  1:39   ` Robbie Smith
@ 2020-01-21  1:49     ` Qu Wenruo
  2020-01-21  2:06       ` Robbie Smith
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-01-21  1:49 UTC (permalink / raw)
  To: Robbie Smith, linux-btrfs





On 2020/1/21 9:39 AM, Robbie Smith wrote:
> On Tue, 21 Jan 2020 at 11:10, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/1/20 10:45 PM, Robbie Smith wrote:
>>> I put my laptop into hibernation mode for a few days so I could boot
>>> up into Windows 10 to do some things, and upon waking up BTRFS has
>>> borked itself, spitting out errors and locking itself into read-only
>>> mode. Is there any up-to-date information on how to fix it, short of
>>> wiping the partition and reinstalling (which is what I ended up
>>> resorting to last time after none of the attempts to fix it worked)?
>>> The error messages in my journal are:
>>>
>>> BTRFS error (device dm-0): parent transid verify failed on
>>> 223458705408 wanted 144360 found 144376
>>
>> The fs is already corrupted at this point.
>>
>>> BTRFS critical (device dm-0): corrupt leaf: block=223455346688 slot=23
>>> extent bytenr=223451267072 len=16384 invalid generation, have 144376
>>> expect (0, 144375]
>>
>> This is one newer tree-checker added in latest kernel.
>>
>> It can be fixed with btrfs check in this branch:
>> https://github.com/adam900710/btrfs-progs/tree/extent_gen_repair
>>
>> But that transid error can't be repair, so it doesn't make much sense.
>>
>>> BTRFS error (device dm-0): block=223455346688 read time tree block
>>> corruption detected
>>> BTRFS error (device dm-0): error loading props for ino 1032412 (root 258): -5
>>>
>>> The parent transid messages are repeated a few times. There's nothing
>>> fancy about my BTRFS setup: subvolumes are used to emulate my root and
>>> home partition. No RAID, no compression, though the partition does sit
>>> beneath a dm-crypt layer using LUKS. Hibernation is done onto a
>>> separate swap partition on the same drive.
>>
>> Please provide the output of "btrfs check" and kernel version.
> 
> Here's the kernel and btrfs information:
> 
>> # uname -a
>> Linux rocinante 5.4.10-arch1-1 #1 SMP PREEMPT Thu, 09 Jan 2020 10:14:29 +0000 x86_64 GNU/Linux
>>
>> # btrfs --version
>> btrfs-progs v5.4
>>
>> # btrfs fi df /
>> Data, single: total=541.01GiB, used=538.54GiB
>> System, DUP: total=8.00MiB, used=80.00KiB
>> Metadata, DUP: total=3.00GiB, used=1.56GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> # btrfs fi show
>> Label: 'rootfs'  uuid: 25ac1f63-5986-4eb8-920f-ed7a5354c076
>>         Total devices 1 FS bytes used 540.11GiB
>> devid    1 size 794.25GiB used 547.02GiB path /dev/mapper/cryptroot
> 
> I tried a btrfs check and it failed almost immediately.
> 
>> # btrfs check /dev/mapper/cryptroot
>> Opening filesystem to check...
>> ERROR: /dev/mapper/cryptroot is currently mounted, use --force if you really intend to check the filesystem
>>
>> # btrfs check --force /dev/mapper/cryptroot
>> Opening filesystem to check...
>> WARNING: filesystem mounted, continuing because of --force
>> Checking filesystem on /dev/mapper/cryptroot
>> UUID: 25ac1f63-5986-4eb8-920f-ed7a5354c076
>> [1/7] checking root items
>> parent transid verify failed on 223455674368 wanted 144355 found 144376
>> parent transid verify failed on 223455674368 wanted 144355 found 144376
>> parent transid verify failed on 223455674368 wanted 144355 found 144376
>> Ignoring transid failure
>> parent transid verify failed on 223452872704 wanted 144358 found 144376
>> parent transid verify failed on 223452872704 wanted 144358 found 144376
>> parent transid verify failed on 223452872704 wanted 144358 found 144376
>> Ignoring transid failure
>> ERROR: child eb corrupted: parent bytenr=223602655232 item=233 parent level=1 child level=2
>> ERROR: failed to repair root items: Input/output error

The corruption looks like it happened in the root tree, which is almost
guaranteed to cause problems at the next mount.

It's highly recommended to start data salvage.
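
One possible first salvage step is a dry run of "btrfs restore", which reads
the device without mounting it and, with -D, only lists what it would copy.
The sketch below merely assembles such a command line and prints it; nothing
is executed. The device path is the one from this thread, the target
directory /mnt/rescue is a hypothetical mount point for a spare drive, and
whether restore can get past this particular corruption is not something I
can promise. It is best run from a live environment with the filesystem
unmounted.

```python
import shlex

def restore_command(device, target_dir, dry_run=True):
    """Build the argv for a btrfs restore run; nothing is executed here."""
    # -v: verbose output, -i: keep going when errors are encountered
    argv = ["btrfs", "restore", "-v", "-i"]
    if dry_run:
        argv.append("-D")  # dry run: list files, write nothing
    argv += [device, target_dir]
    return argv

# /mnt/rescue is a hypothetical target directory on a separate drive.
cmd = restore_command("/dev/mapper/cryptroot", "/mnt/rescue")
print(shlex.join(cmd))
```

If the dry run lists the files you care about, calling the helper with
dry_run=False gives the command for the real copy.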

> 
> I haven't rebooted the laptop, in case this issue makes the laptop
> unbootable, but I could try re-running the check from a live USB and
> an unmounted filesystem. My Arch Live USB is from June last year, and
> it's got kernel 4.20 and btrfs-progs 4.19.1 on it—will they be new
> enough, or should I fetch the latest Arch disk and flash a new one?

I don't believe even newer btrfs-progs can handle it.
But you can still try it as a last resort.

BTW did you have anything weird in dmesg?

> 
> In answer to Nikolay's questions, both Windows and Linux share a disk
> but are on separate partitions, and Windows did update itself. I've
> had Windows updates occur while Linux is hibernated before, and it has
> no reason to touch a partition it can't read and never mounts.

For the cause, I don't believe it's related to Windows, but the
hibernation part.

I'm not sure how hibernation interacts with the fs, but my guess is that
it should at least sync the fs.

Anyway, if something extra happened, dmesg should have some clue.


Another possible cause is a bug in some older (still v5.x) upstream
kernels: before v5.2.15/v5.3, a btrfs bug could leave part of the metadata
unsynced to disk, causing the same transid corruption.

And since you only hibernate and never reboot, the problem could have
remained undetected until today...

Thanks,
Qu




* Re: BTRFS failure after resume from hibernate
  2020-01-21  1:49     ` Qu Wenruo
@ 2020-01-21  2:06       ` Robbie Smith
  2020-01-21  2:26         ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Robbie Smith @ 2020-01-21  2:06 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 21 Jan 2020 at 12:49, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/1/21 9:39 AM, Robbie Smith wrote:
> > On Tue, 21 Jan 2020 at 11:10, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/1/20 10:45 PM, Robbie Smith wrote:
> >>> I put my laptop into hibernation mode for a few days so I could boot
> >>> up into Windows 10 to do some things, and upon waking up BTRFS has
> >>> borked itself, spitting out errors and locking itself into read-only
> >>> mode. Is there any up-to-date information on how to fix it, short of
> >>> wiping the partition and reinstalling (which is what I ended up
> >>> resorting to last time after none of the attempts to fix it worked)?
> >>> The error messages in my journal are:
> >>>
> >>> BTRFS error (device dm-0): parent transid verify failed on
> >>> 223458705408 wanted 144360 found 144376
> >>
> >> The fs is already corrupted at this point.
> >>
> >>> BTRFS critical (device dm-0): corrupt leaf: block=223455346688 slot=23
> >>> extent bytenr=223451267072 len=16384 invalid generation, have 144376
> >>> expect (0, 144375]
> >>
> >> This is one newer tree-checker added in latest kernel.
> >>
> >> It can be fixed with btrfs check in this branch:
> >> https://github.com/adam900710/btrfs-progs/tree/extent_gen_repair
> >>
> >> But that transid error can't be repair, so it doesn't make much sense.
> >>
> >>> BTRFS error (device dm-0): block=223455346688 read time tree block
> >>> corruption detected
> >>> BTRFS error (device dm-0): error loading props for ino 1032412 (root 258): -5
> >>>
> >>> The parent transid messages are repeated a few times. There's nothing
> >>> fancy about my BTRFS setup: subvolumes are used to emulate my root and
> >>> home partition. No RAID, no compression, though the partition does sit
> >>> beneath a dm-crypt layer using LUKS. Hibernation is done onto a
> >>> separate swap partition on the same drive.
> >>
> >> Please provide the output of "btrfs check" and kernel version.
> >
> > Here's the kernel and btrfs information:
> >
> >> # uname -a
> >> Linux rocinante 5.4.10-arch1-1 #1 SMP PREEMPT Thu, 09 Jan 2020 10:14:29 +0000 x86_64 GNU/Linux
> >>
> >> # btrfs --version
> >> btrfs-progs v5.4
> >>
> >> # btrfs fi df /
> >> Data, single: total=541.01GiB, used=538.54GiB
> >> System, DUP: total=8.00MiB, used=80.00KiB
> >> Metadata, DUP: total=3.00GiB, used=1.56GiB
> >> GlobalReserve, single: total=512.00MiB, used=0.00B
> >>
> >> # btrfs fi show
> >> Label: 'rootfs'  uuid: 25ac1f63-5986-4eb8-920f-ed7a5354c076
> >>         Total devices 1 FS bytes used 540.11GiB
> >> devid    1 size 794.25GiB used 547.02GiB path /dev/mapper/cryptroot
> >
> > I tried a btrfs check and it failed almost immediately.
> >
> >> # btrfs check /dev/mapper/cryptroot
> >> Opening filesystem to check...
> >> ERROR: /dev/mapper/cryptroot is currently mounted, use --force if you really intend to check the filesystem
> >>
> >> # btrfs check --force /dev/mapper/cryptroot
> >> Opening filesystem to check...
> >> WARNING: filesystem mounted, continuing because of --force
> >> Checking filesystem on /dev/mapper/cryptroot
> >> UUID: 25ac1f63-5986-4eb8-920f-ed7a5354c076
> >> [1/7] checking root items
> >> parent transid verify failed on 223455674368 wanted 144355 found 144376
> >> parent transid verify failed on 223455674368 wanted 144355 found 144376
> >> parent transid verify failed on 223455674368 wanted 144355 found 144376
> >> Ignoring transid failure
> >> parent transid verify failed on 223452872704 wanted 144358 found 144376
> >> parent transid verify failed on 223452872704 wanted 144358 found 144376
> >> parent transid verify failed on 223452872704 wanted 144358 found 144376
> >> Ignoring transid failure
> >> ERROR: child eb corrupted: parent bytenr=223602655232 item=233 parent level=1 child level=2
> >> ERROR: failed to repair root items: Input/output error
>
> The corruption looks happened on root tree. Which is mostly ensured to
> cause problem for next mount.
>
> It's highly recommended to start data salvage.
>
> >
> > I haven't rebooted the laptop, in case this issue makes the laptop
> > unbootable, but I could try re-running the check from a live USB and
> > an unmounted filesystem. My Arch Live USB is from June last year, and
> > it's got kernel 4.20 and btrfs-progs 4.19.1 on it—will they be new
> > enough, or should I fetch the latest Arch disk and flash a new one?
>
> I don't believe newer btrfs-progs can handle it at all.
> But you can still consider it as a last try.
>
> BTW did you have anything weird in dmesg?

dmesg is full of errors from journalctl because the filesystem is
read-only. Journalctl had paused after resume due to this, and I
thought I could catch newer messages by running it (isn't it supposed
to temporarily store logs in volatile storage?), and that made my
laptop completely die. Every program I had open segfaulted at once,
and now it's just spooling through dmesg with thousands (if not
millions) of lines about journalctl being unable to rotate the logs.
Amazingly enough, I'm still logged in remotely via ssh/mosh, but I
can't run any commands due to a bus error. I can't even su to root.

I guess I'll try rebooting it with a Live USB and running the check
again, and if that fails, it looks like I'll be spending my day
reinstalling everything. Do I have any better options? The only thing
that isn't backed up on this machine is my music collection, but
that's a local lossy copy generated from my lossless library on my
other machine, so I can recreate it if I need to (I'd rather not—if I
can mount the fs readonly, I might be able to copy that to a separate
drive).

What on Earth could possibly cause BTRFS to fail so badly like this,
with this specific error? I've been using BTRFS for years without
problems, except this and the exact same error on the same machine six
months ago.



* Re: BTRFS failure after resume from hibernate
  2020-01-21  2:06       ` Robbie Smith
@ 2020-01-21  2:26         ` Qu Wenruo
  2020-01-21  2:58           ` Robbie Smith
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-01-21  2:26 UTC (permalink / raw)
  To: Robbie Smith, linux-btrfs





On 2020/1/21 10:06 AM, Robbie Smith wrote:
> On Tue, 21 Jan 2020 at 12:49, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/1/21 9:39 AM, Robbie Smith wrote:
>>> On Tue, 21 Jan 2020 at 11:10, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/1/20 10:45 PM, Robbie Smith wrote:
>>>>> I put my laptop into hibernation mode for a few days so I could boot
>>>>> up into Windows 10 to do some things, and upon waking up BTRFS has
>>>>> borked itself, spitting out errors and locking itself into read-only
>>>>> mode. Is there any up-to-date information on how to fix it, short of
>>>>> wiping the partition and reinstalling (which is what I ended up
>>>>> resorting to last time after none of the attempts to fix it worked)?
>>>>> The error messages in my journal are:
>>>>>
>>>>> BTRFS error (device dm-0): parent transid verify failed on
>>>>> 223458705408 wanted 144360 found 144376
>>>>
>>>> The fs is already corrupted at this point.
>>>>
>>>>> BTRFS critical (device dm-0): corrupt leaf: block=223455346688 slot=23
>>>>> extent bytenr=223451267072 len=16384 invalid generation, have 144376
>>>>> expect (0, 144375]
>>>>
>>>> This is one newer tree-checker added in latest kernel.
>>>>
>>>> It can be fixed with btrfs check in this branch:
>>>> https://github.com/adam900710/btrfs-progs/tree/extent_gen_repair
>>>>
>>>> But that transid error can't be repair, so it doesn't make much sense.
>>>>
>>>>> BTRFS error (device dm-0): block=223455346688 read time tree block
>>>>> corruption detected
>>>>> BTRFS error (device dm-0): error loading props for ino 1032412 (root 258): -5
>>>>>
>>>>> The parent transid messages are repeated a few times. There's nothing
>>>>> fancy about my BTRFS setup: subvolumes are used to emulate my root and
>>>>> home partition. No RAID, no compression, though the partition does sit
>>>>> beneath a dm-crypt layer using LUKS. Hibernation is done onto a
>>>>> separate swap partition on the same drive.
>>>>
>>>> Please provide the output of "btrfs check" and kernel version.
>>>
>>> Here's the kernel and btrfs information:
>>>
>>>> # uname -a
>>>> Linux rocinante 5.4.10-arch1-1 #1 SMP PREEMPT Thu, 09 Jan 2020 10:14:29 +0000 x86_64 GNU/Linux
>>>>
>>>> # btrfs --version
>>>> btrfs-progs v5.4
>>>>
>>>> # btrfs fi df /
>>>> Data, single: total=541.01GiB, used=538.54GiB
>>>> System, DUP: total=8.00MiB, used=80.00KiB
>>>> Metadata, DUP: total=3.00GiB, used=1.56GiB
>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>
>>>> # btrfs fi show
>>>> Label: 'rootfs'  uuid: 25ac1f63-5986-4eb8-920f-ed7a5354c076
>>>>         Total devices 1 FS bytes used 540.11GiB
>>>> devid    1 size 794.25GiB used 547.02GiB path /dev/mapper/cryptroot
>>>
>>> I tried a btrfs check and it failed almost immediately.
>>>
>>>> # btrfs check /dev/mapper/cryptroot
>>>> Opening filesystem to check...
>>>> ERROR: /dev/mapper/cryptroot is currently mounted, use --force if you really intend to check the filesystem
>>>>
>>>> # btrfs check --force /dev/mapper/cryptroot
>>>> Opening filesystem to check...
>>>> WARNING: filesystem mounted, continuing because of --force
>>>> Checking filesystem on /dev/mapper/cryptroot
>>>> UUID: 25ac1f63-5986-4eb8-920f-ed7a5354c076
>>>> [1/7] checking root items
>>>> parent transid verify failed on 223455674368 wanted 144355 found 144376
>>>> parent transid verify failed on 223455674368 wanted 144355 found 144376
>>>> parent transid verify failed on 223455674368 wanted 144355 found 144376
>>>> Ignoring transid failure
>>>> parent transid verify failed on 223452872704 wanted 144358 found 144376
>>>> parent transid verify failed on 223452872704 wanted 144358 found 144376
>>>> parent transid verify failed on 223452872704 wanted 144358 found 144376
>>>> Ignoring transid failure
>>>> ERROR: child eb corrupted: parent bytenr=223602655232 item=233 parent level=1 child level=2
>>>> ERROR: failed to repair root items: Input/output error
>>
>> The corruption looks happened on root tree. Which is mostly ensured to
>> cause problem for next mount.
>>
>> It's highly recommended to start data salvage.
>>
>>>
>>> I haven't rebooted the laptop, in case this issue makes the laptop
>>> unbootable, but I could try re-running the check from a live USB and
>>> an unmounted filesystem. My Arch Live USB is from June last year, and
>>> it's got kernel 4.20 and btrfs-progs 4.19.1 on it—will they be new
>>> enough, or should I fetch the latest Arch disk and flash a new one?
>>
>> I don't believe newer btrfs-progs can handle it at all.
>> But you can still consider it as a last try.
>>
>> BTW did you have anything weird in dmesg?
> 
> dmesg is full of errors from journalctl because the filesystem is
> read-only. Journalctl had paused after resume due to this, and I
> thought I could catch newer messages by running it (isn't it supposed
> to temporarily store logs in volatile storage?), and that made my
> laptop completely die. Every program I had open segfaulted at once,
> and now it's just spooling through dmesg with thousands (if not
> millions) of lines about journalctl being unable to rotate the logs.
> Amazingly enough, I'm still logged in remotely via ssh/mosh, but I
> can't run any commands due to a bus error. I can't even su to root.

Well, when a fs gets fully corrupted, anything can happen.

> 
> I guess I try rebooting it with a Live USB, and running the check
> again, and if that fails, looks like I'll be spending my day
> reinstalling everything. Do I have any better options? The only thing
> that isn't backed up on this machine is my music collection, but
> that's a local lossy copy generated from my lossless library on my
> other machine, so I can recreate it if I need to (I'd rather not—if I
> can mount the fs readonly, I might be able to copy that to a separate
> drive).
> 
> What on Earth could possibly cause BTRFS to fail so badly like this,
> with this specific error? I've been using BTRFS for years without
> problems, except this and the exact same error on the same machine six
> months ago.

Really hard to say; at least three things could be involved in this problem.

- Btrfs itself
- Hibernation
- dm-crypt (less likely)

For btrfs, if you have used a kernel between v5.2.0 and v5.2.15, then
it's possible the fs was already corrupted but the corruption went undetected.
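
A quick way to test a kernel release string against that range is sketched
below. I am assuming the affected range is 5.2.0 up to but not including
5.2.15, in line with the "before v5.2.15/v5.3" wording earlier in this
thread; the function names are illustrative only. Note that this only
checks one release string, while what matters is every kernel the
filesystem has ever been written by.

```python
import re

def parse_kernel_version(release):
    """Extract (major, minor, patch) from a `uname -r` style string."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = m.groups()
    return int(major), int(minor), int(patch or 0)

def in_affected_range(release):
    """Assumption: affected means 5.2.0 <= version < 5.2.15."""
    return (5, 2, 0) <= parse_kernel_version(release) < (5, 2, 15)

print(in_affected_range("5.4.10-arch1-1"))  # False: the kernel reported in this thread
print(in_affected_range("5.2.9-arch1-1"))   # True: a hypothetical older kernel
```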

As for hibernation, Linux is not the best place to use it in the first
place.
(My ThinkPad X1 Carbon 6th has trouble with hibernation, so I rarely use
suspend/hibernation.)

Since Linux development is mostly server-oriented, such everyday consumer
operations may not be that well tested.

Things like Windows updating certain firmware could change the controller's
behavior and cause unexpected results.

So my personal recommendation is to avoid hibernation/suspend and to use
Windows as little as possible.

Thanks,
Qu




* Re: BTRFS failure after resume from hibernate
  2020-01-21  2:26         ` Qu Wenruo
@ 2020-01-21  2:58           ` Robbie Smith
  2020-01-21  3:05             ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Robbie Smith @ 2020-01-21  2:58 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 21 Jan 2020 at 13:26, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/1/21 10:06 AM, Robbie Smith wrote:
> > On Tue, 21 Jan 2020 at 12:49, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/1/21 9:39 AM, Robbie Smith wrote:
> >>> On Tue, 21 Jan 2020 at 11:10, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/1/20 10:45 PM, Robbie Smith wrote:
> >>>>> I put my laptop into hibernation mode for a few days so I could boot
> >>>>> up into Windows 10 to do some things, and upon waking up BTRFS has
> >>>>> borked itself, spitting out errors and locking itself into read-only
> >>>>> mode. Is there any up-to-date information on how to fix it, short of
> >>>>> wiping the partition and reinstalling (which is what I ended up
> >>>>> resorting to last time after none of the attempts to fix it worked)?
> >>>>> The error messages in my journal are:
> >>>>>
> >>>>> BTRFS error (device dm-0): parent transid verify failed on
> >>>>> 223458705408 wanted 144360 found 144376
> >>>>
> >>>> The fs is already corrupted at this point.
> >>>>
> >>>>> BTRFS critical (device dm-0): corrupt leaf: block=223455346688 slot=23
> >>>>> extent bytenr=223451267072 len=16384 invalid generation, have 144376
> >>>>> expect (0, 144375]
> >>>>
> >>>> This is one newer tree-checker added in latest kernel.
> >>>>
> >>>> It can be fixed with btrfs check in this branch:
> >>>> https://github.com/adam900710/btrfs-progs/tree/extent_gen_repair
> >>>>
> >>>> But that transid error can't be repair, so it doesn't make much sense.
> >>>>
> >>>>> BTRFS error (device dm-0): block=223455346688 read time tree block
> >>>>> corruption detected
> >>>>> BTRFS error (device dm-0): error loading props for ino 1032412 (root 258): -5
> >>>>>
> >>>>> The parent transid messages are repeated a few times. There's nothing
> >>>>> fancy about my BTRFS setup: subvolumes are used to emulate my root and
> >>>>> home partition. No RAID, no compression, though the partition does sit
> >>>>> beneath a dm-crypt layer using LUKS. Hibernation is done onto a
> >>>>> separate swap partion on the same drive.
> >>>>
> >>>> Please provide the output of "btrfs check" and kernel version.
> >>>
> >>> Here's the kernel and btrfs information:
> >>>
> >>>> # uname -a
> >>>> Linux rocinante 5.4.10-arch1-1 #1 SMP PREEMPT Thu, 09 Jan 2020 10:14:29 +0000 x86_64 GNU/Linux
> >>>>
> >>>> # btrfs --version
> >>>> btrfs-progs v5.4
> >>>>
> >>>> # btrfs fi df /
> >>>> Data, single: total=541.01GiB, used=538.54GiB
> >>>> System, DUP: total=8.00MiB, used=80.00KiB
> >>>> Metadata, DUP: total=3.00GiB, used=1.56GiB
> >>>> GlobalReserve, single: total=512.00MiB, used=0.00B
> >>>>
> >>>> # btrfs fi show
> >>>> Label: 'rootfs'  uuid: 25ac1f63-5986-4eb8-920f-ed7a5354c076
> >>>>         Total devices 1 FS bytes used 540.11GiB
> >>>> devid    1 size 794.25GiB used 547.02GiB path /dev/mapper/cryptroot
> >>>
> >>> I tried a btrfs check and it failed almost immediately.
> >>>
> >>>> # btrfs check /dev/mapper/cryptroot
> >>>> Opening filesystem to check...
> >>>> ERROR: /dev/mapper/cryptroot is currently mounted, use --force if you really intend to check the filesystem
> >>>>
> >>>> # btrfs check --force /dev/mapper/cryptroot
> >>>> Opening filesystem to check...
> >>>> WARNING: filesystem mounted, continuing because of --force
> >>>> Checking filesystem on /dev/mapper/cryptroot
> >>>> UUID: 25ac1f63-5986-4eb8-920f-ed7a5354c076
> >>>> [1/7] checking root items
> >>>> parent transid verify failed on 223455674368 wanted 144355 found 144376
> >>>> parent transid verify failed on 223455674368 wanted 144355 found 144376
> >>>> parent transid verify failed on 223455674368 wanted 144355 found 144376
> >>>> Ignoring transid failure
> >>>> parent transid verify failed on 223452872704 wanted 144358 found 144376
> >>>> parent transid verify failed on 223452872704 wanted 144358 found 144376
> >>>> parent transid verify failed on 223452872704 wanted 144358 found 144376
> >>>> Ignoring transid failure
> >>>> ERROR: child eb corrupted: parent bytenr=223602655232 item=233 parent level=1 child level=2
> >>>> ERROR: failed to repair root items: Input/output error
> >>
> >> The corruption looks happened on root tree. Which is mostly ensured to
> >> cause problem for next mount.
> >>
> >> It's highly recommended to start data salvage.
> >>
> >>>
> >>> I haven't rebooted the laptop, in case this issue makes the laptop
> >>> unbootable, but I could try re-running the check from a live USB and
> >>> an unmounted filesystem. My Arch Live USB is from June last year, and
> >>> it's got kernel 4.20 and btrfs-progs 4.19.1 on it—will they be new
> >>> enough, or should I fetch the latest Arch disk and flash a new one?
> >>
> >> I don't believe newer btrfs-progs can handle it at all.
> >> But you can still consider it as a last try.
> >>
> >> BTW did you have anything weird in dmesg?
> >
> > dmesg is full of errors from journalctl because the filesystem is
> > read-only. Journalctl had paused after resume due to this, and I
> > thought I could catch newer messages by running it (isn't it supposed
> > to temporarily store logs in volatile storage?), and that made my
> > laptop completely die. Every program I had open segfaulted at once,
> > and now it's just spooling through dmesg with thousands (if not
> > millions) of lines about journalctl being unable to rotate the logs.
> > Amazingly enough, I'm still logged in remotely via ssh/mosh, but I
> > can't run any commands due to a bus error. I can't even su to root.
>
> Well, when a fs get fully corrupted, everything can happen.
>
> >
> > I guess I try rebooting it with a Live USB, and running the check
> > again, and if that fails, looks like I'll be spending my day
> > reinstalling everything. Do I have any better options? The only thing
> > that isn't backed up on this machine is my music collection, but
> > that's a local lossy copy generated from my lossless library on my
> > other machine, so I can recreate it if I need to (I'd rather not—if I
> > can mount the fs readonly, I might be able to copy that to a separate
> > drive).
> >
> > What on Earth could possibly cause BTRFS to fail so badly like this,
> > with this specific error? I've been using BTRFS for years without
> > problems, except this and the exact same error on the same machine six
> > months ago.
>
> Really hard to say, there are at least 3 things related to this problem.
>
> - Btrfs itself
> - Hibernation
> - Dm-crypt (less possible)
>
> For btrfs, if you have used kernel between version v5.2.0 and v5.2.15,
> then it's possible the fs is already corrupted but not detected.
>
> For the hibernation part, Linux is not the best place to utilize it for
> the first place.
> (My ThinkPad X1 Carbon 6th suffers from hibernation, so I rarely use
> suspension/hiberation)
>
> Since linux development is mostly server oriented, such daily consumer
> operation may not be that well tested.
>
> Things like Windows updating certain firmware could break the controller
> behavior and cause unexpected behavior.
>
> So my personal recommendation is, to avoid hibernation/suspension, use
> Windows as little as possible.
>
> Thanks,
> Qu

Suspension works flawlessly for me, and hibernation usually does as
well. The one thing that has happened both times I've had a failure
has been something weird with the power: first time was a static shock
from walking on carpet and then touching the laptop, second time was
the BIOS reporting a wattage error with the charger.

I tried mounting the FS from a live USB and the mount said: "can't
read superblock on /dev/mapper/cryptroot" in addition to the transid
failures. Should I try running a `btrfs check --repair`? At this point
I'm pretty much resigned to reinstalling today, so I can't make things
any worse, can I?

I've also used kernels between versions 5.2.0 and 5.2.15 on both my
machines, so does that mean there's a risk of undetected disk errors
on my desktop as well? I don't have backups of my backups, and all my
drives use BTRFS because I like the subvolume/snapshot features. I
also don't have a backup of my music/video library because I don't
have another 5 TB HDD.


* Re: BTRFS failure after resume from hibernate
  2020-01-21  2:58           ` Robbie Smith
@ 2020-01-21  3:05             ` Qu Wenruo
  2020-01-21  3:51               ` Robbie Smith
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-01-21  3:05 UTC (permalink / raw)
  To: Robbie Smith, linux-btrfs



On 2020/1/21 10:58 AM, Robbie Smith wrote:
[...]
>>
>> Really hard to say, there are at least 3 things related to this problem.
>>
>> - Btrfs itself
>> - Hibernation
>> - Dm-crypt (less possible)
>>
>> For btrfs, if you have used kernel between version v5.2.0 and v5.2.15,
>> then it's possible the fs is already corrupted but not detected.
>>
>> For the hibernation part, Linux is not the best place to utilize it for
>> the first place.
>> (My ThinkPad X1 Carbon 6th suffers from hibernation, so I rarely use
>> suspension/hiberation)
>>
>> Since linux development is mostly server oriented, such daily consumer
>> operation may not be that well tested.
>>
>> Things like Windows updating certain firmware could break the controller
>> behavior and cause unexpected behavior.
>>
>> So my personal recommendation is, to avoid hibernation/suspension, use
>> Windows as little as possible.
>>
>> Thanks,
>> Qu
> 
> Suspension works flawlessly for me, and hibernation usually does as
> well. The one thing that has happened both times I've had a failure
> has been something weird with the power: first time was a static shock
> from walking on carpet and then touching the laptop, second time was
> the BIOS reporting a wattage error with the charger.

That doesn't sound right for a ThinkPad T-series machine...

> 
> I tried mounting the FS from a live USB and the mount said: "can't
> read superblock on /dev/mapper/cryptroot" in addition to the transid
> failures. Should I try running a `btrfs check --repair`? At this point
> I'm pretty much resigned to reinstalling today, so I can't make things
> any worse, can I?

Full output please.

> 
> I've also used kernel between version 5.2.0 and 5.2.15 on both my
> machines, so does that mean there's a risk of undetected disk errors
> on my desktop as well?

It's possible.

> I don't have backups of my backups, and all my
> drives use BTRFS because I like the subvolume/snapshot features. I
> also don't have a backup of my music/video library because I don't
> have another 5 TB HDD.

You can just run "btrfs check" from a liveUSB to check if the fs is OK.
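
To decide which machines are worth checking first, the version range
mentioned above can be tested against a kernel release string. This is
only a sketch: affected_kernel is a made-up helper name, and it assumes
v5.2.15 already carries the fix (the thread says "before
v5.2.15/v5.3"), so it flags 5.2.0 through 5.2.14.

```shell
# Hypothetical helper, not part of btrfs-progs.
# Returns success if the release string falls in 5.2.0 .. 5.2.14.
affected_kernel() {
    ver=${1%%-*}            # "5.2.9-arch1-1" -> "5.2.9"
    major=${ver%%.*}        # "5"
    rest=${ver#*.}          # "2.9"
    minor=${rest%%.*}       # "2"
    case $rest in
        *.*) patch=${rest#*.}; patch=${patch%%.*} ;;   # "9"
        *)   patch=0 ;;                                # "5.2" has no patch level
    esac
    # Assumption: 5.2.15 and later are treated as fixed.
    [ "$major" -eq 5 ] && [ "$minor" -eq 2 ] && [ "$patch" -lt 15 ]
}
```

For example, `affected_kernel "$(uname -r)" && echo "check this fs"`
tests the running kernel only; kernels used in the past would have to
come from the package manager's own history.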

Thanks,
Qu



* Re: BTRFS failure after resume from hibernate
  2020-01-21  3:05             ` Qu Wenruo
@ 2020-01-21  3:51               ` Robbie Smith
  2020-01-21 10:59                 ` Robbie Smith
  0 siblings, 1 reply; 14+ messages in thread
From: Robbie Smith @ 2020-01-21  3:51 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 21 Jan 2020 at 14:05, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/1/21 10:58 AM, Robbie Smith wrote:
> [...]
> >>
> >> Really hard to say, there are at least 3 things related to this problem.
> >>
> >> - Btrfs itself
> >> - Hibernation
> >> - Dm-crypt (less possible)
> >>
> >> For btrfs, if you have used kernel between version v5.2.0 and v5.2.15,
> >> then it's possible the fs is already corrupted but not detected.
> >>
> >> For the hibernation part, Linux is not the best place to utilize it for
> >> the first place.
> >> (My ThinkPad X1 Carbon 6th suffers from hibernation, so I rarely use
> >> suspension/hiberation)
> >>
> >> Since linux development is mostly server oriented, such daily consumer
> >> operation may not be that well tested.
> >>
> >> Things like Windows updating certain firmware could break the controller
> >> behavior and cause unexpected behavior.
> >>
> >> So my personal recommendation is, to avoid hibernation/suspension, use
> >> Windows as little as possible.
> >>
> >> Thanks,
> >> Qu
> >
> > Suspension works flawlessly for me, and hibernation usually does as
> > well. The one thing that has happened both times I've had a failure
> > has been something weird with the power: first time was a static shock
> > from walking on carpet and then touching the laptop, second time was
> > the BIOS reporting a wattage error with the charger.
>
> This doesn't look correct for ThinkPad T series machine...
>
> >
> > I tried mounting the FS from a live USB and the mount said: "can't
> > read superblock on /dev/mapper/cryptroot" in addition to the transid
> > failures. Should I try running a `btrfs check --repair`? At this point
> > I'm pretty much resigned to reinstalling today, so I can't make things
> > any worse, can I?
>
> Full output please.

I can't get the output from that mount run as it's lost in the shell
history. Attempting to mount now does nothing and just spits out:
> # mount -t btrfs -o ro,usebackuproot /dev/mapper/cryptroot /mnt/cryptroot
> [dmesg timestamp] BTRFS error (device dm-0): parent transid verify failed on 223452889088 wanted 144360 found 144376
> [dmesg timestamp] BTRFS error (device dm-0): parent transid verify failed on 223452889088 wanted 144360 found 144376

btrfs check prints the UUID, and that's it.
> # btrfs check /dev/mapper/cryptroot
> Opening filesystem to check...
> Checking filesystem on /dev/mapper/cryptroot
> UUID: 25ac1f63-5986-4eb8-920f-ed7a5354c076

Attempting a dry-run of btrfs restore gave me these messages. The fact
that it can read some files and find my /home subvolume gives me some
hope.
> # btrfs restore -D /dev/mapper/cryptroot /mnt/restore
> This is a dry-run, no files are going to be restored
> We have looped trying to restore files in /@home/robbie/.cache/chromium/Default/Code Cache/js too many times to be making progress, stopping
> We have looped trying to restore files in /@home/robbie/.cache/chromium/Default/Cache too many times to be making progress, stopping
> We have looped trying to restore files in /@home/robbie/.cache/chromium/Profile 1/Cache too many times to be making progress, stopping
> We have looped trying to restore files in /@home/robbie/.cache/chromium/Profile 2/Code Cache/js too many times to be making progress, stopping
> We have looped trying to restore files in /@home/robbie/.cache/chromium/Profile 2/Cache too many times to be making progress, stopping
> We have looped trying to restore files in /@home/robbie/.cache/thumbnails/large too many times to be making progress, stopping
> We have looped trying to restore files in /@home/robbie/.cache/mozilla/firefox/eedh8ma4.default-release/cache2/entries too many times to be making progress, stopping
> We have looped trying to restore files in /@home/robbie/.config/discord/Cache too many times to be making progress, stopping

I'm going to go get myself a new external drive, reformat it as ext4
or something (what would be the best filesystem to use? They always
come out of the box as NTFS for Windows), and then try restoring my
filesystem to that. Maybe I can recover things before attempting a
`btrfs check --repair`. The worst-case scenario then is that I have a
few corrupted files on a spare disk.
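
A rough outline of that salvage run, as a sketch rather than a tested
procedure. The target device /dev/sdX1 is a placeholder, not a name
from this thread, and with DRY_RUN=1 (the default here) the script only
prints what it would do. For btrfs restore, -i means ignore errors and
-v means verbose.

```shell
# Sketch only. TARGET_DEV is a placeholder for the new external drive;
# double-check it before setting DRY_RUN=0, because mkfs erases it.
SRC_DEV=${SRC_DEV:-/dev/mapper/cryptroot}
TARGET_DEV=${TARGET_DEV:-/dev/sdX1}
TARGET_MNT=${TARGET_MNT:-/mnt/restore}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

salvage() {
    run mkfs.ext4 "$TARGET_DEV"                       # wipes the target drive
    run mkdir -p "$TARGET_MNT"
    run mount "$TARGET_DEV" "$TARGET_MNT"
    run btrfs restore -i -v "$SRC_DEV" "$TARGET_MNT"  # copies files out; source is only read
}
```

Calling `salvage` with the defaults prints the four commands; nothing
touches a disk until DRY_RUN=0 is set.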


* Re: BTRFS failure after resume from hibernate
  2020-01-21  3:51               ` Robbie Smith
@ 2020-01-21 10:59                 ` Robbie Smith
  2020-01-21 11:57                   ` Andrei Borzenkov
                                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Robbie Smith @ 2020-01-21 10:59 UTC (permalink / raw)
  To: linux-btrfs

I think I have a hunch as to why this issue has occurred. I've had two
btrfs partition failures, and both times it was upon resuming from
hibernation. The key file for the encrypted swap was stored in
/root/key-file, and the openswap hook unlocks the encrypted root,
mounts it, reads the keyfile for the swap partition, and then unmounts
it again. Could this action be causing the transid to be incremented
somehow?

> /etc/initcpio/hooks/openswap
> run_hook ()
> {
>     ## Optional: To avoid race conditions
>     x=0;
>     while [ ! -b /dev/mapper/cryptroot ] && [ $x -le 10 ]; do
>        x=$((x+1))
>        sleep .2
>     done
>     ## End of optional
>
>     mkdir crypto_key_device
>     mount /dev/mapper/cryptroot crypto_key_device
>     cryptsetup open --key-file crypto_key_device/root/key-file /dev/disk/by-uuid/<UUID> swapDevice
>     umount crypto_key_device
> }

The very first line of swsusp[1] has a big fat warning about touching
data on the disk between suspend and resume, and in hindsight I
imagine this action may count. The openswap hook doesn't write
anything, but it's still accessing the disk (however, atime is
disabled in my mount options).

[1]https://www.kernel.org/doc/Documentation/power/swsusp.txt


* Re: BTRFS failure after resume from hibernate
  2020-01-21 10:59                 ` Robbie Smith
@ 2020-01-21 11:57                   ` Andrei Borzenkov
  2020-01-21 13:04                   ` Nikolay Borisov
  2020-01-22  0:43                   ` Chris Murphy
  2 siblings, 0 replies; 14+ messages in thread
From: Andrei Borzenkov @ 2020-01-21 11:57 UTC (permalink / raw)
  To: Robbie Smith; +Cc: Btrfs BTRFS

On Tue, Jan 21, 2020 at 2:01 PM Robbie Smith <zoqaeski@gmail.com> wrote:
>
> I think I have a hunch as to why this issue has occurred. I've had two
> btrfs partition failures, and both times it was upon resuming from
> hibernation. The key file for the encrypted swap was stored in
> /root/key-file, and the openswap hook unlocks the encrypted root,
> mounts it, reads the keyfile for the swap partition, and then unmounts
> it again. Could this action be causing the transid to be incremented
> somehow?
>

Of course. This means the on-disk state is different from the in-memory
state after resuming. You must not access a filesystem that is captured
in the hibernation image before resuming.

File a bug report against whichever component does it.

> > /etc/initcpio/hooks/openswap
> > run_hook ()
> > {
> >     ## Optional: To avoid race conditions
> >     x=0;
> >     while [ ! -b /dev/mapper/cryptroot ] && [ $x -le 10 ]; do
> >        x=$((x+1))
> >        sleep .2
> >     done
> >     ## End of optional
> >
> >     mkdir crypto_key_device
> >     mount /dev/mapper/cryptroot crypto_key_device

What /may/ work is to mount read-only, although even in this case
btrfs may replay the log from the previous transaction. "mount -o
ro,nologreplay" may work.

> >     cryptsetup open --key-file crypto_key_device/root/key-file /dev/disk/by-uuid/<UUID> swapDevice
> >     umount crypto_key_device
> > }
>
> The very first line of swsusp[1] has a big fat warning about touching
> data on the disk between suspend and resume, and in hindsight I
> imagine this action may count. The openswap hook doesn't write
> anything, but it's still accessing the disk (however, atime is
> disabled in my mount options).
>
> [1]https://www.kernel.org/doc/Documentation/power/swsusp.txt
>
> On Tue, 21 Jan 2020 at 14:51, Robbie Smith <zoqaeski@gmail.com> wrote:
> >
> > On Tue, 21 Jan 2020 at 14:05, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > >
> > >
> > >
> > > > On 2020/1/21 10:58 AM, Robbie Smith wrote:
> > > [...]
> > > >>
> > > >> Really hard to say, there are at least 3 things related to this problem.
> > > >>
> > > >> - Btrfs itself
> > > >> - Hibernation
> > > >> - Dm-crypt (less possible)
> > > >>
> > > >> For btrfs, if you have used kernel between version v5.2.0 and v5.2.15,
> > > >> then it's possible the fs is already corrupted but not detected.
> > > >>
> > > >> For the hibernation part, Linux is not the best place to utilize it for
> > > >> the first place.
> > > >> (My ThinkPad X1 Carbon 6th suffers from hibernation, so I rarely use
> > > >> suspension/hiberation)
> > > >>
> > > >> Since linux development is mostly server oriented, such daily consumer
> > > >> operation may not be that well tested.
> > > >>
> > > >> Things like Windows updating certain firmware could break the controller
> > > >> behavior and cause unexpected behavior.
> > > >>
> > > >> So my personal recommendation is, to avoid hibernation/suspension, use
> > > >> Windows as little as possible.
> > > >>
> > > >> Thanks,
> > > >> Qu
> > > >
> > > > Suspension works flawlessly for me, and hibernation usually does as
> > > > well. The one thing that has happened both times I've had a failure
> > > > has been something weird with the power: first time was a static shock
> > > > from walking on carpet and then touching the laptop, second time was
> > > > the BIOS reporting a wattage error with the charger.
> > >
> > > This doesn't look correct for ThinkPad T series machine...
> > >
> > > >
> > > > I tried mounting the FS from a live USB and the mount said: "can't
> > > > read superblock on /dev/mapper/cryptroot" in addition to the transid
> > > > failures. Should I try running a `btrfs check --repair`? At this point
> > > > I'm pretty much resigned to reinstalling today, so I can't make things
> > > > any worse, can I?
> > >
> > > Full output please.
> >
> > I can't get the output from that mount run as it's lost in the shell
> > history. Attempting to mount now does nothing and just spits out:
> > > # mount -t btrfs -o ro,usebackuproot /dev/mapper/cryptroot /mnt/cryptroot
> > > [dmesg timestamp] BTRFS error (device dm-0): parent transid verify failed on 223452889088 wanted 144360 found 144376
> > > [dmesg timestamp] BTRFS error (device dm-0): parent transid verify failed on 223452889088 wanted 144360 found 144376
> >
> > btrfs check prints the UUID, and that's it.
> > > # btrfs check /dev/mapper/cryptroot
> > > Opening filesystem to check...
> > > Checking filesystem on /dev/mapper/cryptroot
> > > UUID: 25ac1f63-5986-4eb8-920f-ed7a5354c076
> >
> > Attempting a dry-run of btrfs restore gave me these messages. The fact
> > that it can read some files and find my /home subvolume gives me some
> > hope.
> > > # btrfs restore -D /dev/mapper/cryptroot /mnt/restore
> > > This is a dry-run, no files are going to be restored
> > > We have looped trying to restore files in /@home/robbie/.cache/chromium/Default/Code Cache/js too many times to be making progress, stopping
> > > We have looped trying to restore files in /@home/robbie/.cache/chromium/Default/Cache too many times to be making progress, stopping
> > > We have looped trying to restore files in /@home/robbie/.cache/chromium/Profile 1/Cache too many times to be making progress, stopping
> > > We have looped trying to restore files in /@home/robbie/.cache/chromium/Profile 2/Code Cache/js too many times to be making progress, stopping
> > > We have looped trying to restore files in /@home/robbie/.cache/chromium/Profile 2/Cache too many times to be making progress, stopping
> > > We have looped trying to restore files in /@home/robbie/.cache/thumbnails/large too many times to be making progress, stopping
> > > We have looped trying to restore files in /@home/robbie/.cache/mozilla/firefox/eedh8ma4.default-release/cache2/entries too many times to be making progress, stopping
> > > We have looped trying to restore files in /@home/robbie/.config/discord/Cache too many times to be making progress, stopping
> >
> > I'm going to go get myself a new external drive, reformat it as ext4
> > or something (what would be the best filesystem to use?—they always
> > come out of the box as NTFS for Windows), and then try restoring my
> > filesystem to that. Maybe I can recover things before attempting a
> > `btrfs check --repair`. Worst case scenario then is that I have a few
> > corrupted files on a spare disk.
> >
> > >
> > > >
> > > > I've also used kernels between versions 5.2.0 and 5.2.15 on both my
> > > > machines, so does that mean there's a risk of undetected disk errors
> > > > on my desktop as well?
> > >
> > > It's possible.
> > >
> > > > I don't have backups of my backups, and all my
> > > > drives use BTRFS because I like the subvolume/snapshot features. I
> > > > also don't have a backup of my music/video library because I don't
> > > > have another 5 TB HDD.
> > >
> > > You can just run "btrfs check" from a liveUSB to check if the fs is OK.
> > >
> > > Thanks,
> > > Qu
> > >

* Re: BTRFS failure after resume from hibernate
  2020-01-21 10:59                 ` Robbie Smith
  2020-01-21 11:57                   ` Andrei Borzenkov
@ 2020-01-21 13:04                   ` Nikolay Borisov
  2020-01-22  0:43                   ` Chris Murphy
  2 siblings, 0 replies; 14+ messages in thread
From: Nikolay Borisov @ 2020-01-21 13:04 UTC (permalink / raw)
  To: Robbie Smith, linux-btrfs



On 21.01.20 at 12:59, Robbie Smith wrote:
> I think I have a hunch as to why this issue has occurred. I've had two
> btrfs partition failures, and both times it was upon resuming from
> hibernation. The key file for the encrypted swap was stored in
> /root/key-file, and the openswap hook unlocks the encrypted root,
> mounts it, reads the keyfile for the swap partition, and then unmounts
> it again. Could this action be causing the transid to be incremented
> somehow?
> 
>> /etc/initcpio/hooks/openswap
>> run_hook ()
>> {
>>     ## Optional: To avoid race conditions
>>     x=0;
>>     while [ ! -b /dev/mapper/cryptroot ] && [ $x -le 10 ]; do
>>        x=$((x+1))
>>        sleep .2
>>     done
>>     ## End of optional
>>
>>     mkdir crypto_key_device
>>     mount /dev/mapper/cryptroot crypto_key_device
>>     cryptsetup open --key-file crypto_key_device/root/key-file /dev/disk/by-uuid/<UUID> swapDevice
>>     umount crypto_key_device
>> }
> 
> The very first line of swsusp[1] has a big fat warning about touching
> data on the disk between suspend and resume, and in hindsight I
> imagine this action may count. The openswap hook doesn't write
> anything, but it's still accessing the disk (however, atime is
> disabled in my mount options).
> 
> [1]https://www.kernel.org/doc/Documentation/power/swsusp.txt

I just tested with a freshly created filesystem, and indeed just
mounting and unmounting the filesystem writes to the root tree, since it
has to synchronize the free space cache.
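
A minimal sketch of how such a test could be reproduced, assuming
btrfs-progs is installed and the filesystem is a scratch one that is
currently unmounted. It compares the superblock generation reported by
`btrfs inspect-internal dump-super` before and after a mount/umount
cycle. The function names `super_generation` and `check_mount_cycle`
are hypothetical helpers, not part of any tool, and this is not
necessarily how the test above was performed.

```shell
#!/bin/sh
# Extract the superblock generation from `btrfs inspect-internal dump-super`
# output supplied on stdin. Only the line whose first field is exactly
# "generation" is used (not chunk_root_generation and similar fields).
super_generation() {
    awk '$1 == "generation" { print $2; exit }'
}

# Hypothetical helper: mount and unmount DEV at MNT once and report whether
# the superblock generation changed. Needs root and an unmounted scratch
# filesystem; do not run it against a hibernated system's root device.
check_mount_cycle() {
    dev=$1
    mnt=$2
    before=$(btrfs inspect-internal dump-super "$dev" | super_generation) || return 1
    mount "$dev" "$mnt" && umount "$mnt" || return 1
    after=$(btrfs inspect-internal dump-super "$dev" | super_generation) || return 1
    echo "generation before=$before after=$after"
    # Succeeds only if the mount cycle left the generation untouched.
    [ "$before" = "$after" ]
}
```

Usage would be along the lines of `check_mount_cycle /dev/loop0 /mnt/test`
(both paths are placeholders). A non-zero exit status means the
generation moved, i.e. the mount cycle committed a transaction.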

* Re: BTRFS failure after resume from hibernate
  2020-01-21 10:59                 ` Robbie Smith
  2020-01-21 11:57                   ` Andrei Borzenkov
  2020-01-21 13:04                   ` Nikolay Borisov
@ 2020-01-22  0:43                   ` Chris Murphy
  2 siblings, 0 replies; 14+ messages in thread
From: Chris Murphy @ 2020-01-22  0:43 UTC (permalink / raw)
  To: Robbie Smith; +Cc: Btrfs BTRFS

On Tue, Jan 21, 2020 at 4:00 AM Robbie Smith <zoqaeski@gmail.com> wrote:
>
> I think I have a hunch as to why this issue has occurred. I've had two
> btrfs partition failures, and both times it was upon resuming from
> hibernation. The key file for the encrypted swap was stored in
> /root/key-file, and the openswap hook unlocks the encrypted root,
> mounts it, reads the keyfile for the swap partition, and then unmounts
> it again.

If it's a rw mount, it's a problem for sure. I'm pretty sure even an ro
mount isn't guaranteed to be read-only: it's ro only for user space, and
the kernel could still write. I think the only sure way is to use
blockdev --setro before mounting the volume.
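
A rough sketch of what that could look like for the openswap hook
quoted earlier in the thread, under several assumptions: `blockdev` is
available in the initramfs, the device is switched back with
`blockdev --setrw` before the real root mount, and the helper names
`run` and `open_swap_readonly` are made up for illustration. The mount
may well be refused if the filesystem needs log replay, since that
cannot happen on a read-only device. Treat this as an illustration of
the idea, not a tested fix.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (useful for inspecting the
# sequence without touching any device).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical read-only variant of the hook body.
#   $1: decrypted root device holding the key file
#   $2: encrypted swap partition (the <UUID> path elided in the thread)
open_swap_readonly() {
    root_dev=$1
    swap_dev=$2
    key_dir=crypto_key_device

    # Flag the block device read-only so kernel-side writes are rejected too.
    run blockdev --setro "$root_dev" || return 1
    run mkdir -p "$key_dir"
    if run mount -o ro "$root_dev" "$key_dir"; then
        run cryptsetup open --key-file "$key_dir/root/key-file" "$swap_dev" swapDevice
        rc=$?
        run umount "$key_dir"
    else
        rc=1
    fi
    # Assumption: restore read-write so the later real root mount can proceed.
    run blockdev --setrw "$root_dev"
    return "$rc"
}
```

With `DRY_RUN=1` set, calling
`open_swap_readonly /dev/mapper/cryptroot /dev/disk/by-uuid/<UUID>`
only prints the command sequence, which makes the ordering easy to
review before trying it on a real machine.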


> The very first line of swsusp[1] has a big fat warning about touching
> data on the disk between suspend and resume, and in hindsight I
> imagine this action may count.

Totally counts. The hibernation image has its own view of the file
system, which is restored when resuming from that image. I don't know
how much work it would take to not merely sync the file system at
hibernation time, but to force an unmount after the hibernation image
is written, and therefore require that it be (freshly) mounted upon
resuming from hibernation. That would prevent this problem, but it would
also make hibernation entry and resume more complicated.


-- 
Chris Murphy

end of thread

Thread overview: 14+ messages
2020-01-20 14:45 BTRFS failure after resume from hibernate Robbie Smith
2020-01-20 17:29 ` Nikolay Borisov
2020-01-21  0:10 ` Qu Wenruo
2020-01-21  1:39   ` Robbie Smith
2020-01-21  1:49     ` Qu Wenruo
2020-01-21  2:06       ` Robbie Smith
2020-01-21  2:26         ` Qu Wenruo
2020-01-21  2:58           ` Robbie Smith
2020-01-21  3:05             ` Qu Wenruo
2020-01-21  3:51               ` Robbie Smith
2020-01-21 10:59                 ` Robbie Smith
2020-01-21 11:57                   ` Andrei Borzenkov
2020-01-21 13:04                   ` Nikolay Borisov
2020-01-22  0:43                   ` Chris Murphy
