linux-btrfs.vger.kernel.org archive mirror
* btrfs root fs started remounting ro
       [not found] <CA+M2ft9zjGm7XJw1BUm364AMqGSd3a8QgsvQDCWz317qjP=o8g@mail.gmail.com>
@ 2020-02-07 17:52 ` John Hendy
  2020-02-07 20:21   ` Chris Murphy
  2020-02-07 23:42   ` Qu Wenruo
  2020-05-06  4:37 ` John Hendy
  1 sibling, 2 replies; 24+ messages in thread
From: John Hendy @ 2020-02-07 17:52 UTC (permalink / raw)
  To: Btrfs BTRFS

Greetings,

I'm resending, as this isn't showing in the archives. Perhaps it was
the attachments, which I've converted to pastebin links.

As an update, I'm now running off of a different drive (ssd, not the
nvme) and I got the error again! I'm now inclined to think this might
not be hardware after all, but something related to my setup or a bug
with chromium.

After a reboot, chromium wouldn't start for me and dmesg showed
similar parent transid/csum errors to my original post below. I used
btrfs-inspect-internal to find the inode traced to
~/.config/chromium/History. I deleted that, and got a new set of
errors tracing to ~/.config/chromium/Cookies. After I deleted that and
tried starting chromium, I found that my btrfs /home/jwhendy pool was
mounted ro just like the original problem below.
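
For reference, the inode-to-path lookups above were done with something
along these lines, with the inode number taken from the dmesg errors:

$ sudo btrfs inspect-internal inode-resolve -v <inode> /home/jwhendy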

dmesg after trying to start chromium:
- https://pastebin.com/CsCEQMJa

Thanks for any pointers, as it would now seem that my purchase of a
new m2.sata may not buy my way out of this problem! While I didn't
want to reinstall, at least new hardware is a simple fix. Now I'm
worried there is a deeper issue bound to recur :(

Best regards,
John

On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com> wrote:
>
> Greetings,
>
> I've had this issue occur twice, once ~1mo ago and once a couple of
> weeks ago. Chromium suddenly quit on me, and when trying to start it
> again, it complained about a lock file in ~. I tried to delete it
> manually and was informed I was on a read-only fs! I ended up biting
> the bullet and re-installing linux due to the number of dead end
> threads and slow response rates on diagnosing these issues, and the
> issue occurred again shortly after.
>
> $ uname -a
> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020 16:38:40
> +0000 x86_64 GNU/Linux
>
> $ btrfs --version
> btrfs-progs v5.4
>
> $ btrfs fi df /mnt/misc/ # full device; normally would be mounting a subvol on /
> Data, single: total=114.01GiB, used=80.88GiB
> System, single: total=32.00MiB, used=16.00KiB
> Metadata, single: total=2.01GiB, used=769.61MiB
> GlobalReserve, single: total=140.73MiB, used=0.00B
>
> This is a single device, no RAID, not on a VM. HP Zbook 15.
> nvme0n1                                       259:5    0 232.9G  0 disk
> ├─nvme0n1p1                                   259:6    0   512M  0
> part  (/boot/efi)
> ├─nvme0n1p2                                   259:7    0     1G  0 part  (/boot)
> └─nvme0n1p3                                   259:8    0 231.4G  0 part (btrfs)
>
> I have the following subvols:
> arch: used for / when booting arch
> jwhendy: used for /home/jwhendy on arch
> vault: shared data between distros on /mnt/vault
> bionic: root when booting ubuntu bionic
>
> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>
> dmesg, smartctl, btrfs check, and btrfs dev stats attached.

Edit: links now:
- btrfs check: https://pastebin.com/nz6Bc145
- dmesg: https://pastebin.com/1GGpNiqk
- smartctl: https://pastebin.com/ADtYqfrd

btrfs dev stats (not worth a link):

[/dev/mapper/old].write_io_errs    0
[/dev/mapper/old].read_io_errs     0
[/dev/mapper/old].flush_io_errs    0
[/dev/mapper/old].corruption_errs  0
[/dev/mapper/old].generation_errs  0


> If these are of interest, here are the reddit threads where I posted the
> issue and was referred here.
> 1) https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
> 2)  https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>
> It has been suggested this is a hardware issue. I've already ordered a
> replacement m2.sata, but for sanity it would be great to know
> definitively this was the case. If anything stands out above that
> could indicate I'm not setup properly re. btrfs, that would also be
> fantastic so I don't repeat the issue!
>
> The only thing I've stumbled on is that I have been mounting with
> rd.luks.options=discard and that manually running fstrim is preferred.
>
>
> Many thanks for any input/suggestions,
> John


* Re: btrfs root fs started remounting ro
  2020-02-07 17:52 ` btrfs root fs started remounting ro John Hendy
@ 2020-02-07 20:21   ` Chris Murphy
  2020-02-07 22:31     ` John Hendy
  2020-02-07 23:42   ` Qu Wenruo
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2020-02-07 20:21 UTC (permalink / raw)
  To: John Hendy; +Cc: Btrfs BTRFS

On Fri, Feb 7, 2020 at 10:52 AM John Hendy <jw.hendy@gmail.com> wrote:

> As an update, I'm now running off of a different drive (ssd, not the
> nvme) and I got the error again! I'm now inclined to think this might
> not be hardware after all, but something related to my setup or a bug
> with chromium.

Even if there's a Chromium bug, it should result in file system
corruption like what you're seeing.


> dmesg after trying to start chromium:
> - https://pastebin.com/CsCEQMJa

Could you post the entire dmesg, start to finish, for the boot in
which this first occurred?

This transid isn't realistic, in particular for a filesystem this new.

[   60.697438] BTRFS error (device dm-0): parent transid verify failed
on 202711384064 wanted 68719924810 found 448074
[   60.697457] BTRFS info (device dm-0): no csum found for inode 19064
start 2392064
[   60.697777] BTRFS warning (device dm-0): csum failed root 339 ino
19064 off 2392064 csum 0x8941f998 expected csum 0x00000000 mirror 1

Expected csum null? Are these files using chattr +C? Something like
this might help figure it out:

$ sudo btrfs insp inod -v 19064 /home
$ lsattr /path/to/that/file/

Report output for both.


> Thanks for any pointers, as it would now seem that my purchase of a
> new m2.sata may not buy my way out of this problem! While I didn't
> want to reinstall, at least new hardware is a simple fix. Now I'm
> worried there is a deeper issue bound to recur :(

Yep. And fixing Btrfs is not simple.

> > nvme0n1p3 is encrypted with dm-crypt/LUKS.

I don't think the problem is here, except that I sooner believe
there's a regression in dm-crypt or Btrfs with discards, than I
believe two different drives have discard related bugs.


> > The only thing I've stumbled on is that I have been mounting with
> > rd.luks.options=discard and that manually running fstrim is preferred.

This was the case for both the NVMe and SSD drives?

What was the kernel version this problem first appeared on with NVMe?
For the (new) SSD you're using 5.5.1, correct?

Can you correlate both corruption events to recent use of fstrim?

What are the make/model of both drives?

In the meantime, I suggest refreshing backups. Btrfs won't allow files
with checksums that it knows are corrupt to be copied to user space.
But it sounds like so far the only files affected are Chrome cache
files? If so, it's relatively straightforward to get back to a
healthy file system. And then it's time to start iterating some of the
setup to find out what's causing the problem.
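
One rough way to refresh backups here (a sketch; it assumes a backup
drive mounted at /mnt/backup, and rsync will simply error out on any
file whose checksum fails rather than copy bad data):

$ sudo rsync -aHAX --info=progress2 /home/jwhendy/ /mnt/backup/jwhendy/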


-- 
Chris Murphy


* Re: btrfs root fs started remounting ro
  2020-02-07 20:21   ` Chris Murphy
@ 2020-02-07 22:31     ` John Hendy
  2020-02-07 23:17       ` Chris Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: John Hendy @ 2020-02-07 22:31 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Fri, Feb 7, 2020 at 2:22 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> On Fri, Feb 7, 2020 at 10:52 AM John Hendy <jw.hendy@gmail.com> wrote:
>
> > As an update, I'm now running off of a different drive (ssd, not the
> > nvme) and I got the error again! I'm now inclined to think this might
> > not be hardware after all, but something related to my setup or a bug
> > with chromium.
>
> Even if there's a Chromium bug, it should result in file system
> corruption like what you're seeing.

I'm assuming you meant "*shouldn't* result in file system corruption"?

>
> > dmesg after trying to start chromium:
> > - https://pastebin.com/CsCEQMJa
>
> Could you post the entire dmesg, start to finish, for the boot in
> which this first occurred?

Indeed. Just reproduced it:
- https://pastebin.com/UJ8gbgFE

Aside: is there a preferred way for sharing these? The page I read
about this list said text couldn't exceed 100kb, but my original
appears to have bounced and the dmesg alone is >100kb... Just want to
make sure pastebin is cool and am happy to use something
better/preferred.

> This transid isn't realistic, in particular for a filesystem this new.

Clarification, and apologies for the confusion:
- the m2.sata in my original post was my primary drive and had an
issue, then I wiped, mkfs.btrfs from scratch, reinstalled linux, etc.
and it happened again.

- the ssd I'm now running on was the former boot drive in my last
computer which I was using as a backup drive for /mnt/vault pool but
still had the old root fs. After the m2.sata failure, I started
booting from it. It is not a new fs but >2yrs old.

If you'd like, let's stick to troubleshooting the ssd for now.

> [   60.697438] BTRFS error (device dm-0): parent transid verify failed
> on 202711384064 wanted 68719924810 found 448074
> [   60.697457] BTRFS info (device dm-0): no csum found for inode 19064
> start 2392064
> [   60.697777] BTRFS warning (device dm-0): csum failed root 339 ino
> 19064 off 2392064 csum 0x8941f998 expected csum 0x00000000 mirror 1
>
> Expected csum null? Are these files using chattr +C? Something like
> this might help figure it out:
>
> $ sudo btrfs insp inod -v 19064 /home

$ sudo btrfs insp inod -v 19056 /home/jwhendy
ioctl ret=0, bytes_left=4039, bytes_missing=0, cnt=1, missed=0
/home/jwhendy/.config/chromium/Default/Cookies

> $ lsattr /path/to/that/file/

$ lsattr /home/jwhendy/.config/chromium/Default/Cookies
-------------------- /home/jwhendy/.config/chromium/Default/Cookies

> Report output for both.
>
>
> > Thanks for any pointers, as it would now seem that my purchase of a
> > new m2.sata may not buy my way out of this problem! While I didn't
> > want to reinstall, at least new hardware is a simple fix. Now I'm
> > worried there is a deeper issue bound to recur :(
>
> Yep. And fixing Btrfs is not simple.
>
> > > nvme0n1p3 is encrypted with dm-crypt/LUKS.
>
> I don't think the problem is here, except that I sooner believe
> there's a regression in dm-crypt or Btrfs with discards, than I
> believe two different drives have discard related bugs.
>
>
> > > The only thing I've stumbled on is that I have been mounting with
> > > rd.luks.options=discard and that manually running fstrim is preferred.
>
> This was the case for both the NVMe and SSD drives?

Yes, though I have turned that off for the SSD ever since I started
booting from it. That said, I realized that discard is still in my
fstab... is this a potential source of the transid/csum issues? I've
now removed that and am about to reboot after I send this.

$ cat /etc/fstab
/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on / type btrfs
(rw,relatime,compress=lzo,ssd,discard,space_cache,subvolid=263,subvol=/arch)
/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on /home/jwhendy
type btrfs (rw,relatime,compress=lzo,ssd,discard,space_cache,subvolid=339,subvol=/jwhendy)
/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on /mnt/vault
type btrfs (rw,relatime,compress=lzo,ssd,discard,space_cache,subvolid=265,subvol=/vault)
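
If I end up dropping discard for good, I'm assuming the util-linux
periodic timer is the usual replacement; something like:

$ sudo systemctl enable --now fstrim.timer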

> What was the kernel version this problem first appeared on with NVMe?
> For the (new) SSD you're using 5.5.1, correct?

I just updated today which put me at 5.5.2, but in theory yes. And as
I went to check that I get an Input/Output error trying to check the
pacman log! Here's the dmesg with those new errors included:
- https://pastebin.com/QzYQ2RRg

I'm still mounted rw, but my gosh... what the heck is happening. The
output is for a different root/inode:

$ sudo btrfs insp inod -v 273 /
ioctl ret=0, bytes_left=4053, bytes_missing=0, cnt=1, missed=0
//var/log/pacman.log

Is the double // a concern for that file?

$ sudo lsattr /var/log/pacman.log
-------------------- /var/log/pacman.log

> Can you correlate both corruption events to recent use of fstrim?

I've never used fstrim manually on either drive.

> What are the make/model of both drives?

- ssd: Samsung 850 evo, 250G
- m2.sata: nvme Samsung 960 evo, 250G

> In the meantime, I suggest refreshing backups. Btrfs won't allow files
> with checksums that it knows are corrupt to be copied to user space.
> But it sounds like so far the only files affected are Chrome cache
> files? If so, it's relatively straightforward to get back to a
> healthy file system. And then it's time to start iterating some of the
> setup to find out what's causing the problem.

So far, it seemed limited to chromium. I'm not sure about the new
input/output error trying to cat/grep /var/log/pacman.log. I can also
mount my old drive ro just fine and have not done anything significant
on the new one. If/when we get to potentially destructive operations,
I'll certainly refresh backups prior to doing those.

Really appreciate the help!
John

>
> --
> Chris Murphy


* Re: btrfs root fs started remounting ro
  2020-02-07 22:31     ` John Hendy
@ 2020-02-07 23:17       ` Chris Murphy
  2020-02-08  4:37         ` John Hendy
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2020-02-07 23:17 UTC (permalink / raw)
  To: John Hendy; +Cc: Btrfs BTRFS

On Fri, Feb 7, 2020 at 3:31 PM John Hendy <jw.hendy@gmail.com> wrote:
>
> On Fri, Feb 7, 2020 at 2:22 PM Chris Murphy <lists@colorremedies.com> wrote:
> >
> > On Fri, Feb 7, 2020 at 10:52 AM John Hendy <jw.hendy@gmail.com> wrote:
> >
> > > As an update, I'm now running off of a different drive (ssd, not the
> > > nvme) and I got the error again! I'm now inclined to think this might
> > > not be hardware after all, but something related to my setup or a bug
> > > with chromium.
> >
> > Even if there's a Chromium bug, it should result in file system
> > corruption like what you're seeing.
>
> I'm assuming you meant "*shouldn't* result in file system corruption"?

Ha! Yes, of course.


> Indeed. Just reproduced it:
> - https://pastebin.com/UJ8gbgFE

[  126.656696] BTRFS info (device dm-0): turning on discard

I advise removing the discard mount option from /etc/fstab, and skipping
manual fstrim as well, so that no discards are issued at all and they
can be ruled out as a factor in these problems.


> Aside: is there a preferred way for sharing these? The page I read
> about this list said text couldn't exceed 100kb, but my original
> appears to have bounced and the dmesg alone is >100kb... Just want to
> make sure pastebin is cool and am happy to use something
> better/preferred.

Everyone has their own convention. My preferred convention is to put
the entire dmesg up on google drive, unedited, and include the URL.
And then I extract excerpts I think are relevant and paste into the
email body. That way search engines can find relevant threads.

> Clarification, and apologies for the confusion:
> - the m2.sata in my original post was my primary drive and had an
> issue, then I wiped, mkfs.btrfs from scratch, reinstalled linux, etc.
> and it happened again.
>
> - the ssd I'm now running on was the former boot drive in my last
> computer which I was using as a backup drive for /mnt/vault pool but
> still had the old root fs. After the m2.sata failure, I started
> booting from it. It is not a new fs but >2yrs old.

Got it. Well, it would be really bad luck, but not impossible, to have
two different drives with discard-related firmware bugs. But the point
of going through the tedious work to prove this is that such devices
will get the relevant (mis)feature blacklisted in the kernel for that
make/model so that no one else experiences it.




>
> If you'd like, let's stick to troubleshooting the ssd for now.
>
> > [   60.697438] BTRFS error (device dm-0): parent transid verify failed
> > on 202711384064 wanted 68719924810 found 448074

448074 is reasonable for a 2-year-old file system. I doubt 68719924810 is.


> $ lsattr /home/jwhendy/.config/chromium/Default/Cookies
> -------------------- /home/jwhendy/.config/chromium/Default/Cookies

No +C so these files should have csums.


> Yes, though I have turned that off for the SSD ever since I started
> booting from it. That said, I realized that discard is still in my
> fstab... is this a potential source of the transid/csum issues? I've
> now removed that and am about to reboot after I send this.

Maybe.


> I just updated today which put me at 5.5.2, but in theory yes. And as
> I went to check that I get an Input/Output error trying to check the
> pacman log! Here's the dmesg with those new errors included:
> - https://pastebin.com/QzYQ2RRg
>
> I'm still mounted rw, but my gosh... what the heck is happening. The
> output is for a different root/inode:

Understand that Btrfs is like a canary in the coal mine. It's *less*
tolerant of hardware problems than other file systems, because it
doesn't trust the hardware. Everything is checksummed. The instant
there's a problem, Btrfs will start complaining, and if it gets
confused it goes ro in order to stop spreading the corruption.


>
> $ sudo btrfs insp inod -v 273 /
> ioctl ret=0, bytes_left=4053, bytes_missing=0, cnt=1, missed=0
> //var/log/pacman.log
>
> Is the double // a concern for that file?

No it's just a convention.


> - ssd: Samsung 850 evo, 250G
> - m2.sata: nvme Samsung 960 evo, 250G

As a first step, stop using discard mount option. And delete all the
corrupt files by searching for other affected inodes. Once you're sure
they're all deleted, do a scrub and report back. If the scrub finds no
errors, then I suggest booting off install media and running 'btrfs
check --mode=lowmem' and reporting that output to the list also. Don't
use --repair even if there are reported problems.
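
Roughly, the sequence I have in mind (the device path is taken from
your fstab output; adjust as needed):

$ sudo btrfs scrub start -Bd /    # -B waits, -d prints per-device stats
$ sudo btrfs scrub status /
# then from install media, with the filesystem unmounted:
$ sudo btrfs check --mode=lowmem /dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14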

A general rule is to change only one thing at a time when
troubleshooting. That way you have a much easier time finding the
source of the problem. I'm not sure how quickly this problem started
to happen, days or weeks? But you want to go for about that long,
unless the problem happens again, to prove whether any change solved
the problem. Ideally, you revert to the suspected setting that causes
the problem to try and prove it's the source, but that's tedious and
up to you. It's fine to just not ever use the discard mount option if
that's what's causing the problem.

I can't really estimate whether that could be a defect in the SSD, or a
firmware bug that's maybe fixed with a firmware update, or a Btrfs
regression bug. BTW, I think your laptop has a more recent firmware
update available. 01.31 Rev.A 13.5 MB Nov 8, 2019. Could it be
related? *shrug* No idea. But it's vaguely possible. More likely such
things are drive firmware related.
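
If you want to check what's current, something along these lines might
help (assumes dmidecode and smartmontools are installed; the device
name is just an example):

$ sudo dmidecode -s bios-version                  # current system BIOS
$ sudo smartctl -i /dev/nvme0 | grep -i firmware  # drive firmware revision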

-- 
Chris Murphy


* Re: btrfs root fs started remounting ro
  2020-02-07 17:52 ` btrfs root fs started remounting ro John Hendy
  2020-02-07 20:21   ` Chris Murphy
@ 2020-02-07 23:42   ` Qu Wenruo
  2020-02-08  4:48     ` John Hendy
  1 sibling, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2020-02-07 23:42 UTC (permalink / raw)
  To: John Hendy, Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 5064 bytes --]



On 2020/2/8 上午1:52, John Hendy wrote:
> Greetings,
> 
> I'm resending, as this isn't showing in the archives. Perhaps it was
> the attachments, which I've converted to pastebin links.
> 
> As an update, I'm now running off of a different drive (ssd, not the
> nvme) and I got the error again! I'm now inclined to think this might
> not be hardware after all, but something related to my setup or a bug
> with chromium.
> 
> After a reboot, chromium wouldn't start for me and dmesg showed
> similar parent transid/csum errors to my original post below. I used
> btrfs-inspect-internal to find the inode traced to
> ~/.config/chromium/History. I deleted that, and got a new set of
> errors tracing to ~/.config/chromium/Cookies. After I deleted that and
> tried starting chromium, I found that my btrfs /home/jwhendy pool was
> mounted ro just like the original problem below.
> 
> dmesg after trying to start chromium:
> - https://pastebin.com/CsCEQMJa

So far, it's only a transid bug in your csum tree.

And two backref mismatches in the data backrefs.

In theory, you can fix your problem by `btrfs check --repair
--init-csum-tree`.

But I'm more interested in how this happened.

Have you ever experienced any power loss with your NVMe drive?
I'm not saying btrfs is unsafe against power loss; all filesystems
should be safe against power loss. I'm just curious whether mount-time
log replay is involved, or just regular internal log replay.

From your smartctl output, the drive has experienced 61 unsafe
shutdowns over 2144 power cycles.

Not sure if it's related.
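
For reference, those counters come from the NVMe SMART log; something
like this shows them (device name is just an example):

$ sudo smartctl -A /dev/nvme0
$ sudo nvme smart-log /dev/nvme0   # if nvme-cli is installed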

Another interesting point: do you remember the oldest kernel that has
run on this fs? v5.4 or v5.5?

Thanks,
Qu
> 
> Thanks for any pointers, as it would now seem that my purchase of a
> new m2.sata may not buy my way out of this problem! While I didn't
> want to reinstall, at least new hardware is a simple fix. Now I'm
> worried there is a deeper issue bound to recur :(
> 
> Best regards,
> John
> 
> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com> wrote:
>>
>> Greetings,
>>
>> I've had this issue occur twice, once ~1mo ago and once a couple of
>> weeks ago. Chromium suddenly quit on me, and when trying to start it
>> again, it complained about a lock file in ~. I tried to delete it
>> manually and was informed I was on a read-only fs! I ended up biting
>> the bullet and re-installing linux due to the number of dead end
>> threads and slow response rates on diagnosing these issues, and the
>> issue occurred again shortly after.
>>
>> $ uname -a
>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020 16:38:40
>> +0000 x86_64 GNU/Linux
>>
>> $ btrfs --version
>> btrfs-progs v5.4
>>
>> $ btrfs fi df /mnt/misc/ # full device; normally would be mounting a subvol on /
>> Data, single: total=114.01GiB, used=80.88GiB
>> System, single: total=32.00MiB, used=16.00KiB
>> Metadata, single: total=2.01GiB, used=769.61MiB
>> GlobalReserve, single: total=140.73MiB, used=0.00B
>>
>> This is a single device, no RAID, not on a VM. HP Zbook 15.
>> nvme0n1                                       259:5    0 232.9G  0 disk
>> ├─nvme0n1p1                                   259:6    0   512M  0
>> part  (/boot/efi)
>> ├─nvme0n1p2                                   259:7    0     1G  0 part  (/boot)
>> └─nvme0n1p3                                   259:8    0 231.4G  0 part (btrfs)
>>
>> I have the following subvols:
>> arch: used for / when booting arch
>> jwhendy: used for /home/jwhendy on arch
>> vault: shared data between distros on /mnt/vault
>> bionic: root when booting ubuntu bionic
>>
>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>>
>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
> 
> Edit: links now:
> - btrfs check: https://pastebin.com/nz6Bc145
> - dmesg: https://pastebin.com/1GGpNiqk
> - smartctl: https://pastebin.com/ADtYqfrd
> 
> btrfs dev stats (not worth a link):
> 
> [/dev/mapper/old].write_io_errs    0
> [/dev/mapper/old].read_io_errs     0
> [/dev/mapper/old].flush_io_errs    0
> [/dev/mapper/old].corruption_errs  0
> [/dev/mapper/old].generation_errs  0
> 
> 
>> If these are of interest, here are the reddit threads where I posted the
>> issue and was referred here.
>> 1) https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
>> 2)  https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>>
>> It has been suggested this is a hardware issue. I've already ordered a
>> replacement m2.sata, but for sanity it would be great to know
>> definitively this was the case. If anything stands out above that
>> could indicate I'm not setup properly re. btrfs, that would also be
>> fantastic so I don't repeat the issue!
>>
>> The only thing I've stumbled on is that I have been mounting with
>> rd.luks.options=discard and that manually running fstrim is preferred.
>>
>>
>> Many thanks for any input/suggestions,
>> John


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: btrfs root fs started remounting ro
  2020-02-07 23:17       ` Chris Murphy
@ 2020-02-08  4:37         ` John Hendy
  0 siblings, 0 replies; 24+ messages in thread
From: John Hendy @ 2020-02-08  4:37 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Fri, Feb 7, 2020 at 5:17 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> On Fri, Feb 7, 2020 at 3:31 PM John Hendy <jw.hendy@gmail.com> wrote:
> >
> > On Fri, Feb 7, 2020 at 2:22 PM Chris Murphy <lists@colorremedies.com> wrote:
> > >
> > > On Fri, Feb 7, 2020 at 10:52 AM John Hendy <jw.hendy@gmail.com> wrote:
> > >
> > > > As an update, I'm now running off of a different drive (ssd, not the
> > > > nvme) and I got the error again! I'm now inclined to think this might
> > > > not be hardware after all, but something related to my setup or a bug
> > > > with chromium.
> > >
> > > Even if there's a Chromium bug, it should result in file system
> > > corruption like what you're seeing.
> >
> > I'm assuming you meant "*shouldn't* result in file system corruption"?
>
> Ha! Yes, of course.
>
>
> > Indeed. Just reproduced it:
> > - https://pastebin.com/UJ8gbgFE
>
> [  126.656696] BTRFS info (device dm-0): turning on discard
>
> I advise removing the discard mount option from /etc/fstab, and skipping
> manual fstrim as well, so that no discards are issued at all and they
> can be ruled out as a factor in these problems.

Done!

/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on / type btrfs
(rw,relatime,compress=lzo,ssd,space_cache,subvolid=263,subvol=/arch)
/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on /home/jwhendy
type btrfs (rw,relatime,compress=lzo,ssd,space_cache,subvolid=339,subvol=/jwhendy)
/dev/mapper/luks-0712af67-3f01-4dde-9d45-194df9d29d14 on /mnt/vault
type btrfs (rw,relatime,compress=lzo,ssd,space_cache,subvolid=265,subvol=/vault)

> > Aside: is there a preferred way for sharing these? The page I read
> > about this list said text couldn't exceed 100kb, but my original
> > appears to have bounced and the dmesg alone is >100kb... Just want to
> > make sure pastebin is cool and am happy to use something
> > better/preferred.
>
> Everyone has their own convention. My preferred convention is to put
> the entire dmesg up on google drive, unedited, and include the URL.
> And then I extract excerpts I think are relevant and paste into the
> email body. That way search engines can find relevant threads.
>

Thanks for that. I'll stick to pastebin for now just for convenience.
Mainly I wanted to make sure that links to these were reasonable, and
sounds like this is okay for the list. Thanks!

> > Clarification, and apologies for the confusion:
> > - the m2.sata in my original post was my primary drive and had an
> > issue, then I wiped, mkfs.btrfs from scratch, reinstalled linux, etc.
> > and it happened again.
> >
> > - the ssd I'm now running on was the former boot drive in my last
> > computer which I was using as a backup drive for /mnt/vault pool but
> > still had the old root fs. After the m2.sata failure, I started
> > booting from it. It is not a new fs but >2yrs old.
>
> Got it. Well, it would be really bad luck, but not impossible, to have
> two different drives with discard-related firmware bugs. But the point
> of going through the tedious work to prove this is that such devices
> will get the relevant (mis)feature blacklisted in the kernel for that
> make/model so that no one else experiences it.

> >
> > If you'd like, let's stick to troubleshooting the ssd for now.
> >
> > > [   60.697438] BTRFS error (device dm-0): parent transid verify failed
> > > on 202711384064 wanted 68719924810 found 448074
>
> 448074 is reasonable for a 2-year-old file system. I doubt 68719924810 is.
>
>
> > $ lsattr /home/jwhendy/.config/chromium/Default/Cookies
> > -------------------- /home/jwhendy/.config/chromium/Default/Cookies
>
> No +C so these files should have csums.
>
>
> > Yes, though I have turned that off for the SSD ever since I started
> > booting from it. That said, I realized that discard is still in my
> > fstab... is this a potential source of the transid/csum issues? I've
> > now removed that and am about to reboot after I send this.
>
> Maybe.
>
>
> > I just updated today which put me at 5.5.2, but in theory yes. And as
> > I went to check that I get an Input/Output error trying to check the
> > pacman log! Here's the dmesg with those new errors included:
> > - https://pastebin.com/QzYQ2RRg
> >
> > I'm still mounted rw, but my gosh... what the heck is happening. The
> > output is for a different root/inode:
>
> Understand that Btrfs is like a canary in the coal mine. It's *less*
> tolerant of hardware problems than other file systems, because it
> doesn't trust the hardware. Everything is checksummed. The instant
> there's a problem, Btrfs will start complaining, and if it gets
> confused it goes ro in order to stop spreading the corruption.
>
>
> >
> > $ sudo btrfs insp inod -v 273 /
> > ioctl ret=0, bytes_left=4053, bytes_missing=0, cnt=1, missed=0
> > //var/log/pacman.log
> >
> > Is the double // a concern for that file?
>
> No it's just a convention.
>
>
> > - ssd: Samsung 850 evo, 250G
> > - m2.sata: nvme Samsung 960 evo, 250G
>
> As a first step, stop using discard mount option. And delete all the
> corrupt files by searching for other affected inodes. Once you're sure
> they're all deleted, do a scrub and report back. If the scrub finds no
> errors, then I suggest booting off install media and running 'btrfs
> check --mode=lowmem' and reporting that output to the list also. Don't
> use --repair even if there are reported problems.

I tried to remove .config/chromium, but ran into a weird problem. I
was getting an error on `rm` with a TransportSecurity file saying "No
such file or directory." More on that below. I also removed
/var/log/pacman.log, the other offending file from the previous inode
error. At this point I tried a `btrfs scrub start /` but it fails
(aborted):

[  126.520270] BTRFS error (device dm-0): parent transid verify failed
on 202711384064 wanted 68719924810 found 448074
[  126.532637] BTRFS info (device dm-0): scrub: not finished on devid
1 with status: -5

Full dmesg at that point:
- https://pastebin.com/9TvvMVpE

Brief aside before we get back to .config/chromium: after I sent the
last message and removed the discard option (but before I deleted
these files), I ran btrfs check from an arch install usb.
- https://pastebin.com/Wdg8aqTY

The first inode resolved to /var/log/journal so I just rm'd the whole
thing. Every subsequent inode on root 263 (/ mountpoint) resulted in
the following, so I think the problematic files on / are taken care of:
ERROR: ino paths ioctl: No such file or directory

This inode was also in the output of the btrfs check, and is the same
file I can't delete from above:

root 339 inode 17848 errors 200, dir isize wrong
    unresolved ref dir 17848 index 6 namelen 11 name File System
filetype 2 errors 2, no dir index
root 339 inode 4504988 errors 1, no inode item
    unresolved ref dir 17848 index 489287 namelen 17 name
TransportSecurity filetype 1 errors 5, no dir item, no inode ref

$ sudo btrfs insp inode -v 17848 /home/jwhendy/
[sudo] password for jwhendy:
ioctl ret=0, bytes_left=4034, bytes_missing=0, cnt=1, missed=0
/home/jwhendy//.local/share/Trash/expunged/3065996973

$ cd .local/share/Trash/expunged/3065996973/
$ ls
ls: cannot access 'TransportSecurity': No such file or directory
TransportSecurity
$ ls -la
ls: cannot access 'TransportSecurity': No such file or directory
total 0
drwx------ 1 jwhendy jwhendy 22 Feb  7 21:42 .
drwx------ 1 jwhendy jwhendy 20 Feb  7 21:46 ..
-????????? ? ?       ?        ?            ? TransportSecurity

Posts online suggest `rm -i -- ./*` but that doesn't work.

$ rm -i -- ./*
rm: cannot remove './TransportSecurity': No such file or directory

I also found a post suggesting this, potentially revealing weird,
non-obvious characters that might be present:
$ ls | od -a
0000000   T   r   a   n   s   p   o   r   t   S   e   c   u   r   i   t
0000020   y  nl
0000022

Not sure what to make of that. In other StackOverflow and similar
posts, the `rm -i -- ./*` does the trick. Yet another post suggested
moving to /tmp and rebooting, but I can't move it (same "no such file
or directory" error).

Any input on how to blow this thing up?
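
I'm guessing the usual inode-based removal tricks won't work either,
given check reports "no inode item" for it, but for completeness this
is what I'd otherwise try:

$ ls -i .                            # inode shows up as ? for this entry
$ sudo find . -inum 4504988 -delete  # inode number from btrfs check above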

> A general rule is to change only one thing at a time when
> troubleshooting. That way you have a much easier time finding the
> source of the problem. I'm not sure how quickly this problem started
> to happen, days or weeks? But you want to go for about that long,
> unless the problem happens again, to prove whether any change solved
> the problem. Ideally, you revert to the suspected setting that causes
> the problem to try and prove it's the source, but that's tedious and
> up to you. It's fine to just not ever use the discard mount option if
> that's what's causing the problem.
>
> I can't really estimate whether that could be a defect in the SSD, or a
> firmware bug that's maybe fixed with a firmware update, or a Btrfs
> regression bug. BTW, I think your laptop has a more recent firmware
> update available. 01.31 Rev.A 13.5 MB Nov 8, 2019. Could it be
> related? *shrug* No idea. But it's vaguely possible. More likely such
> things are drive firmware related.

firmware = BIOS? I can check that. Or if this is intel-ucode, I
just have whatever arch has as current...

Thanks again,
John

>
> --
> Chris Murphy


* Re: btrfs root fs started remounting ro
  2020-02-07 23:42   ` Qu Wenruo
@ 2020-02-08  4:48     ` John Hendy
  2020-02-08  7:29       ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: John Hendy @ 2020-02-08  4:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/2/8 上午1:52, John Hendy wrote:
> > Greetings,
> >
> > I'm resending, as this isn't showing in the archives. Perhaps it was
> > the attachments, which I've converted to pastebin links.
> >
> > As an update, I'm now running off of a different drive (ssd, not the
> > nvme) and I got the error again! I'm now inclined to think this might
> > not be hardware after all, but something related to my setup or a bug
> > with chromium.
> >
> > After a reboot, chromium wouldn't start for me and dmesg showed
> > similar parent transid/csum errors to my original post below. I used
> > btrfs-inspect-internal to find the inode traced to
> > ~/.config/chromium/History. I deleted that, and got a new set of
> > errors tracing to ~/.config/chromium/Cookies. After I deleted that and
> > tried starting chromium, I found that my btrfs /home/jwhendy pool was
> > mounted ro just like the original problem below.
> >
> > dmesg after trying to start chromium:
> > - https://pastebin.com/CsCEQMJa
>
> So far, it's only a transid bug in your csum tree.
>
> And two backref mismatches in the data backrefs.
>
> In theory, you can fix your problem by `btrfs check --repair
> --init-csum-tree`.
>

Now that I might be narrowing in on offending files, I'll wait to see
what you think from my last response to Chris. I did try the above
when I first ran into this:
- https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/

> But I'm more interested in how this happened.

Me too :)

> Have you ever experienced any power loss with your NVMe drive?
> I'm not saying btrfs is unsafe against power loss; all filesystems
> should be safe against power loss. I'm just curious whether mount-time
> log replay is involved, or just regular internal log replay.
>
> From your smartctl output, the drive has experienced 61 unsafe
> shutdowns over 2144 power cycles.

Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
caught off guard by low battery and instant power-off, I kick myself
and mean to set up a script to force poweroff before that happens. So,
indeed, I've lost power a ton. Surprised it was 61 times, but maybe
not over ~2 years. And actually, I mis-stated the age. I haven't
*booted* from this drive in almost 2yrs. It's a corporate laptop,
issued every 3, so the ssd drive is more like 5 years old.
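
The script I keep meaning to write is roughly this (assuming the usual
sysfs battery path; it would run from cron or a systemd timer):

#!/bin/sh
# force a clean shutdown before the battery dies on me again
bat=/sys/class/power_supply/BAT0
if [ "$(cat $bat/status)" = "Discharging" ] \
   && [ "$(cat $bat/capacity)" -le 5 ]; then
    systemctl poweroff
fi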

> Not sure if it's related.
>
> Another interesting point: do you remember the oldest kernel that has
> run on this fs? v5.4 or v5.5?

Hard to say, but arch linux maintains a package archive. The nvme
drive is from ~May 2018. The archives only go back to Jan 2019 and the
kernel/btrfs-progs was at 4.20 then:
- https://archive.archlinux.org/packages/l/linux/
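
Normally I'd check the kernel upgrade history with something like the
grep below, but of course pacman.log is one of the files that went bad:

$ grep 'upgraded linux ' /var/log/pacman.log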

Searching my Amazon orders, the SSD was in the 2015 time frame, so the
kernel version would have been even older.

Thanks for your input,
John

>
> Thanks,
> Qu
> >
> > Thanks for any pointers, as it would now seem that my purchase of a
> > new m2.sata may not buy my way out of this problem! While I didn't
> > want to reinstall, at least new hardware is a simple fix. Now I'm
> > worried there is a deeper issue bound to recur :(
> >
> > Best regards,
> > John
> >
> > On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com> wrote:
> >>
> >> Greetings,
> >>
> >> I've had this issue occur twice, once ~1mo ago and once a couple of
> >> weeks ago. Chromium suddenly quit on me, and when trying to start it
> >> again, it complained about a lock file in ~. I tried to delete it
> >> manually and was informed I was on a read-only fs! I ended up biting
> >> the bullet and re-installing linux due to the number of dead end
> >> threads and slow response rates on diagnosing these issues, and the
> >> issue occurred again shortly after.
> >>
> >> $ uname -a
> >> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020 16:38:40
> >> +0000 x86_64 GNU/Linux
> >>
> >> $ btrfs --version
> >> btrfs-progs v5.4
> >>
> >> $ btrfs fi df /mnt/misc/ # full device; normally would be mounting a subvol on /
> >> Data, single: total=114.01GiB, used=80.88GiB
> >> System, single: total=32.00MiB, used=16.00KiB
> >> Metadata, single: total=2.01GiB, used=769.61MiB
> >> GlobalReserve, single: total=140.73MiB, used=0.00B
> >>
> >> This is a single device, no RAID, not on a VM. HP Zbook 15.
> >> nvme0n1                                       259:5    0 232.9G  0 disk
> >> ├─nvme0n1p1                                   259:6    0   512M  0
> >> part  (/boot/efi)
> >> ├─nvme0n1p2                                   259:7    0     1G  0 part  (/boot)
> >> └─nvme0n1p3                                   259:8    0 231.4G  0 part (btrfs)
> >>
> >> I have the following subvols:
> >> arch: used for / when booting arch
> >> jwhendy: used for /home/jwhendy on arch
> >> vault: shared data between distros on /mnt/vault
> >> bionic: root when booting ubuntu bionic
> >>
> >> nvme0n1p3 is encrypted with dm-crypt/LUKS.
> >>
> >> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
> >
> > Edit: links now:
> > - btrfs check: https://pastebin.com/nz6Bc145
> > - dmesg: https://pastebin.com/1GGpNiqk
> > - smartctl: https://pastebin.com/ADtYqfrd
> >
> > btrfs dev stats (not worth a link):
> >
> > [/dev/mapper/old].write_io_errs    0
> > [/dev/mapper/old].read_io_errs     0
> > [/dev/mapper/old].flush_io_errs    0
> > [/dev/mapper/old].corruption_errs  0
> > [/dev/mapper/old].generation_errs  0
> >
> >
> >> If these are of interest, here are the reddit threads where I posted the
> >> issue and was referred here.
> >> 1) https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
> >> 2)  https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
> >>
> >> It has been suggested this is a hardware issue. I've already ordered a
> >> replacement m2.sata, but for sanity it would be great to know
> >> definitively this was the case. If anything stands out above that
> >> could indicate I'm not setup properly re. btrfs, that would also be
> >> fantastic so I don't repeat the issue!
> >>
> >> The only thing I've stumbled on is that I have been mounting with
> >> rd.luks.options=discard and that manually running fstrim is preferred.
> >>
> >>
> >> Many thanks for any input/suggestions,
> >> John
>


* Re: btrfs root fs started remounting ro
  2020-02-08  4:48     ` John Hendy
@ 2020-02-08  7:29       ` Qu Wenruo
  2020-02-08 19:56         ` John Hendy
  0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2020-02-08  7:29 UTC (permalink / raw)
  To: John Hendy; +Cc: Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 7529 bytes --]



On 2020/2/8 下午12:48, John Hendy wrote:
> On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/2/8 上午1:52, John Hendy wrote:
>>> Greetings,
>>>
>>> I'm resending, as this isn't showing in the archives. Perhaps it was
>>> the attachments, which I've converted to pastebin links.
>>>
>>> As an update, I'm now running off of a different drive (ssd, not the
>>> nvme) and I got the error again! I'm now inclined to think this might
>>> not be hardware after all, but something related to my setup or a bug
>>> with chromium.
>>>
>>> After a reboot, chromium wouldn't start for me and dmesg showed
>>> similar parent transid/csum errors to my original post below. I used
>>> btrfs-inspect-internal to find the inode traced to
>>> ~/.config/chromium/History. I deleted that, and got a new set of
>>> errors tracing to ~/.config/chromium/Cookies. After I deleted that and
>>> tried starting chromium, I found that my btrfs /home/jwhendy pool was
>>> mounted ro just like the original problem below.
>>>
>>> dmesg after trying to start chromium:
>>> - https://pastebin.com/CsCEQMJa
>>
>> So far, it's only a transid bug in your csum tree.
>>
>> And two backref mismatches in the data backrefs.
>>
>> In theory, you can fix your problem by `btrfs check --repair
>> --init-csum-tree`.
>>
> 
> Now that I might be narrowing in on offending files, I'll wait to see
> what you think from my last response to Chris. I did try the above
> when I first ran into this:
> - https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/

That RO is caused by the missing data backref.

Which can be fixed by btrfs check --repair.

Then you should be able to delete the offending files. (Or the whole
chromium cache, and switch to firefox if you wish :P )

But also please keep in mind that the transid mismatch appears to be in
your csum tree, which means your csum tree is no longer reliable, and
may cause -EIO reading unrelated files.

Thus it's recommended to re-fill the csum tree by --init-csum-tree.

It can be done altogether by --repair --init-csum-tree, but to be safe,
please run --repair only first, then make sure btrfs check reports no
error after that. Then go --init-csum-tree.
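
That is, roughly (from a live environment with the fs unmounted; the
device path is only an example):

$ sudo btrfs check /dev/mapper/<your-luks-dev>                  # read-only baseline
$ sudo btrfs check --repair /dev/mapper/<your-luks-dev>
$ sudo btrfs check /dev/mapper/<your-luks-dev>                  # confirm it's clean
$ sudo btrfs check --init-csum-tree /dev/mapper/<your-luks-dev> # rebuild csums last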

> 
>> But I'm more interested in how this happened.
> 
> Me too :)
> 
>> Have you ever experienced any power loss with your NVMe drive?
>> I'm not saying btrfs is unsafe against power loss; all filesystems
>> should be safe against power loss. I'm just curious whether mount-time
>> log replay is involved, or just regular internal log replay.
>>
>> From your smartctl output, the drive has experienced 61 unsafe
>> shutdowns over 2144 power cycles.
> 
> Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
> caught off guard by low battery and instant power-off, I kick myself
> and mean to set up a script to force poweroff before that happens. So,
> indeed, I've lost power a ton. Surprised it was 61 times, but maybe
> not over ~2 years. And actually, I mis-stated the age. I haven't
> *booted* from this drive in almost 2yrs. It's a corporate laptop,
> issued every 3, so the ssd drive is more like 5 years old.
> 
>> Not sure if it's related.
>>
>> Another interesting point: do you remember the oldest kernel that has
>> run on this fs? v5.4 or v5.5?
> 
> Hard to say, but arch linux maintains a package archive. The nvme
> drive is from ~May 2018. The archives only go back to Jan 2019 and the
> kernel/btrfs-progs was at 4.20 then:
> - https://archive.archlinux.org/packages/l/linux/

There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
cause metadata corruption. And the symptom is transid error, which also
matches your problem.

Thanks,
Qu

> 
> Searching my Amazon orders, the SSD was in the 2015 time frame, so the
> kernel version would have been even older.
> 
> Thanks for your input,
> John
> 
>>
>> Thanks,
>> Qu
>>>
>>> Thanks for any pointers, as it would now seem that my purchase of a
>>> new m2.sata may not buy my way out of this problem! While I didn't
>>> want to reinstall, at least new hardware is a simple fix. Now I'm
>>> worried there is a deeper issue bound to recur :(
>>>
>>> Best regards,
>>> John
>>>
>>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com> wrote:
>>>>
>>>> Greetings,
>>>>
>>>> I've had this issue occur twice, once ~1mo ago and once a couple of
>>>> weeks ago. Chromium suddenly quit on me, and when trying to start it
>>>> again, it complained about a lock file in ~. I tried to delete it
>>>> manually and was informed I was on a read-only fs! I ended up biting
>>>> the bullet and re-installing linux due to the number of dead end
>>>> threads and slow response rates on diagnosing these issues, and the
>>>> issue occurred again shortly after.
>>>>
>>>> $ uname -a
>>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020 16:38:40
>>>> +0000 x86_64 GNU/Linux
>>>>
>>>> $ btrfs --version
>>>> btrfs-progs v5.4
>>>>
>>>> $ btrfs fi df /mnt/misc/ # full device; normally would be mounting a subvol on /
>>>> Data, single: total=114.01GiB, used=80.88GiB
>>>> System, single: total=32.00MiB, used=16.00KiB
>>>> Metadata, single: total=2.01GiB, used=769.61MiB
>>>> GlobalReserve, single: total=140.73MiB, used=0.00B
>>>>
>>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
>>>> nvme0n1                                       259:5    0 232.9G  0 disk
>>>> ├─nvme0n1p1                                   259:6    0   512M  0
>>>> part  (/boot/efi)
>>>> ├─nvme0n1p2                                   259:7    0     1G  0 part  (/boot)
>>>> └─nvme0n1p3                                   259:8    0 231.4G  0 part (btrfs)
>>>>
>>>> I have the following subvols:
>>>> arch: used for / when booting arch
>>>> jwhendy: used for /home/jwhendy on arch
>>>> vault: shared data between distros on /mnt/vault
>>>> bionic: root when booting ubuntu bionic
>>>>
>>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>>>>
>>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
>>>
>>> Edit: links now:
>>> - btrfs check: https://pastebin.com/nz6Bc145
>>> - dmesg: https://pastebin.com/1GGpNiqk
>>> - smartctl: https://pastebin.com/ADtYqfrd
>>>
>>> btrfs dev stats (not worth a link):
>>>
>>> [/dev/mapper/old].write_io_errs    0
>>> [/dev/mapper/old].read_io_errs     0
>>> [/dev/mapper/old].flush_io_errs    0
>>> [/dev/mapper/old].corruption_errs  0
>>> [/dev/mapper/old].generation_errs  0
>>>
>>>
>>>> If these are of interest, here are the reddit threads where I posted the
>>>> issue and was referred here.
>>>> 1) https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
>>>> 2)  https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>>>>
>>>> It has been suggested this is a hardware issue. I've already ordered a
>>>> replacement m2.sata, but for sanity it would be great to know
>>>> definitively this was the case. If anything stands out above that
>>>> could indicate I'm not setup properly re. btrfs, that would also be
>>>> fantastic so I don't repeat the issue!
>>>>
>>>> The only thing I've stumbled on is that I have been mounting with
>>>> rd.luks.options=discard and that manually running fstrim is preferred.
>>>>
>>>>
>>>> Many thanks for any input/suggestions,
>>>> John
>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: btrfs root fs started remounting ro
  2020-02-08  7:29       ` Qu Wenruo
@ 2020-02-08 19:56         ` John Hendy
       [not found]           ` <CA+M2ft9dcMKKQstZVcGQ=9MREbfhPF5GG=xoMoh5Aq8MK9P8wA@mail.gmail.com>
  2020-02-09  3:46           ` Chris Murphy
  0 siblings, 2 replies; 24+ messages in thread
From: John Hendy @ 2020-02-08 19:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

This is not going so hot. Updates:

booted from arch install, pre repair btrfs check:
- https://pastebin.com/6vNaSdf2

btrfs check --mode=lowmem as requested by Chris:
- https://pastebin.com/uSwSTVVY

Then I did btrfs check --repair, which segfaulted at the end. I've
typed the output below from pictures I took:

Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/mapper/ssd
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
parent transid verify failed on 20271138064 wanted 68719924810 found 448074
parent transid verify failed on 20271138064 wanted 68719924810 found 448074
Ignoring transid failure
# ... repeated the previous two lines maybe hundreds of times
# ended with this:
ref mismatch on [12797435904 268505088] extent item 1, found 412
[1] 1814 segmentation fault (core dumped) btrfs check --repair /dev/mapper/ssd

This was with btrfs-progs 5.4 (the install USB is maybe a month old).

Here is the output of btrfs check after the --repair attempt:
- https://pastebin.com/6MYRNdga
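
If a metadata-only dump would help with debugging the segfault, I can
try capturing one from the install USB; I'm guessing at the invocation:

$ sudo btrfs-image -c9 -t4 /dev/mapper/ssd /path/to/ssd-metadump.img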

I rebooted to write this email given the segfault, as I wanted to make
sure I should still follow up --repair with --init-csum-tree. I had
pictures of the --repair output, but Firefox just wouldn't load
imgur.com for me to post the pics and was acting really weird. Growing
suspicious, I checked dmesg and found things have gone ro on me :(
Here is the dmesg from this session:
- https://pastebin.com/a2z7xczy

The gist is:

[   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
block=172703744 slot=0, csum end range (12980568064) goes beyond the
start range (12980297728) of the next csum item
[   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
total ptrs 34 free space 29 owner 7
[   40.997942]     item 0 key (18446744073709551606 128 12979060736)
itemoff 14811 itemsize 1472
[   40.997944]     item 1 key (18446744073709551606 128 12980297728)
itemoff 13895 itemsize 916
[   40.997945]     item 2 key (18446744073709551606 128 12981235712)
itemoff 13811 itemsize 84
# ... there's maybe 30 of these item n key lines in total
[   40.997984] BTRFS error (device dm-0): block=172703744 write time
tree block corruption detected
[   41.016793] BTRFS: error (device dm-0) in
btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
writing out transaction)
[   41.016799] BTRFS info (device dm-0): forced readonly
[   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
transaction.
[   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
errno=-5 IO failure
[   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
[   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
transaction.
[   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
[   44.509418] systemd-journald[416]:
/var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
Journal file corrupted, rotating.
[   44.509440] systemd-journald[416]: Failed to rotate
/var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
Read-only file system
[   44.509450] systemd-journald[416]: Failed to rotate
/var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
Read-only file system
[   44.509540] systemd-journald[416]: Failed to write entry (23 items,
705 bytes) despite vacuuming, ignoring: Bad message
# ... then a bunch of these failed journal attempts (of note:
/var/log/journal was one of the bad inodes from btrfs check
previously)

Kindly let me know what you would recommend. I'm sadly back to an
unusable system vs. a complaining/worrisome one. This is similar to
the behavior I had with the m2.sata nvme drive in my original
experience. After trying all of --repair, --init-csum-tree, and
--init-extent-tree, I couldn't boot anymore. After my dm-crypt
password at boot, I just saw a bunch of [FAILED] in the text splash
output. Hoping to not repeat that with this drive.
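
In the meantime I'll try to refresh backups from the install USB with a
read-only mount, something like this (usebackuproot being my guess at
the right fallback option):

$ sudo mount -o ro,usebackuproot /dev/mapper/ssd /mnt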

Thanks,
John


On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/2/8 下午12:48, John Hendy wrote:
> > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/2/8 上午1:52, John Hendy wrote:
> >>> Greetings,
> >>>
> >>> I'm resending, as this isn't showing in the archives. Perhaps it was
> >>> the attachments, which I've converted to pastebin links.
> >>>
> >>> As an update, I'm now running off of a different drive (ssd, not the
> >>> nvme) and I got the error again! I'm now inclined to think this might
> >>> not be hardware after all, but something related to my setup or a bug
> >>> with chromium.
> >>>
> >>> After a reboot, chromium wouldn't start for me and dmesg showed
> >>> similar parent transid/csum errors to my original post below. I used
> >>> btrfs-inspect-internal to find the inode traced to
> >>> ~/.config/chromium/History. I deleted that, and got a new set of
> >>> errors tracing to ~/.config/chromium/Cookies. After I deleted that and
> >>> tried starting chromium, I found that my btrfs /home/jwhendy pool was
> >>> mounted ro just like the original problem below.
> >>>
> >>> dmesg after trying to start chromium:
> >>> - https://pastebin.com/CsCEQMJa
> >>
> >> So far, it's only a transid bug in your csum tree.
> >>
> >> And two backref mismatches in the data backrefs.
> >>
> >> In theory, you can fix your problem by `btrfs check --repair
> >> --init-csum-tree`.
> >>
> >
> > Now that I might be narrowing in on offending files, I'll wait to see
> > what you think from my last response to Chris. I did try the above
> > when I first ran into this:
> > - https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
>
> That RO is caused by the missing data backref.
>
> Which can be fixed by btrfs check --repair.
>
> Then you should be able to delete the offending files. (Or the whole
> chromium cache, and switch to firefox if you wish :P )
>
> But also please keep in mind that the transid mismatch appears to be in
> your csum tree, which means your csum tree is no longer reliable, and
> may cause -EIO reading unrelated files.
>
> Thus it's recommended to re-fill the csum tree by --init-csum-tree.
>
> It can be done altogether by --repair --init-csum-tree, but to be safe,
> please run --repair only first, then make sure btrfs check reports no
> error after that. Then go --init-csum-tree.
>
> >
> >> But I'm more interested in how this happened.
> >
> > Me too :)
> >
> >> Have you ever experienced any power loss with your NVMe drive?
> >> I'm not saying btrfs is unsafe against power loss; all filesystems
> >> should be safe against power loss. I'm just curious whether mount-time
> >> log replay is involved, or just regular internal log replay.
> >>
> >> From your smartctl output, the drive has experienced 61 unsafe
> >> shutdowns over 2144 power cycles.
> >
> > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
> > caught off guard by low battery and instant power-off, I kick myself
> > and mean to set up a script to force poweroff before that happens. So,
> > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
> > not over ~2 years. And actually, I mis-stated the age. I haven't
> > *booted* from this drive in almost 2yrs. It's a corporate laptop,
> > issued every 3, so the ssd drive is more like 5 years old.
> >
> >> Not sure if it's related.
> >>
> >> Another interesting point: do you remember the oldest kernel that has
> >> run on this fs? v5.4 or v5.5?
> >
> > Hard to say, but arch linux maintains a package archive. The nvme
> > drive is from ~May 2018. The archives only go back to Jan 2019 and the
> > kernel/btrfs-progs was at 4.20 then:
> > - https://archive.archlinux.org/packages/l/linux/
>
> There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
> cause metadata corruption. And the symptom is transid error, which also
> matches your problem.
>
> Thanks,
> Qu
>
> >
> > Searching my Amazon orders, the SSD was in the 2015 time frame, so the
> > kernel version would have been even older.
> >
> > Thanks for your input,
> > John
> >
> >>
> >> Thanks,
> >> Qu
> >>>
> >>> Thanks for any pointers, as it would now seem that my purchase of a
> >>> new m2.sata may not buy my way out of this problem! While I didn't
> >>> want to reinstall, at least new hardware is a simple fix. Now I'm
> >>> worried there is a deeper issue bound to recur :(
> >>>
> >>> Best regards,
> >>> John
> >>>
> >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com> wrote:
> >>>>
> >>>> Greetings,
> >>>>
> >>>> I've had this issue occur twice, once ~1mo ago and once a couple of
> >>>> weeks ago. Chromium suddenly quit on me, and when trying to start it
> >>>> again, it complained about a lock file in ~. I tried to delete it
> >>>> manually and was informed I was on a read-only fs! I ended up biting
> >>>> the bullet and re-installing linux due to the number of dead end
> >>>> threads and slow response rates on diagnosing these issues, and the
> >>>> issue occurred again shortly after.
> >>>>
> >>>> $ uname -a
> >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020 16:38:40
> >>>> +0000 x86_64 GNU/Linux
> >>>>
> >>>> $ btrfs --version
> >>>> btrfs-progs v5.4
> >>>>
> >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be mounting a subvol on /
> >>>> Data, single: total=114.01GiB, used=80.88GiB
> >>>> System, single: total=32.00MiB, used=16.00KiB
> >>>> Metadata, single: total=2.01GiB, used=769.61MiB
> >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
> >>>>
> >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
> >>>> nvme0n1                                       259:5    0 232.9G  0 disk
> >>>> ├─nvme0n1p1                                   259:6    0   512M  0
> >>>> part  (/boot/efi)
> >>>> ├─nvme0n1p2                                   259:7    0     1G  0 part  (/boot)
> >>>> └─nvme0n1p3                                   259:8    0 231.4G  0 part (btrfs)
> >>>>
> >>>> I have the following subvols:
> >>>> arch: used for / when booting arch
> >>>> jwhendy: used for /home/jwhendy on arch
> >>>> vault: shared data between distros on /mnt/vault
> >>>> bionic: root when booting ubuntu bionic
> >>>>
> >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
> >>>>
> >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
> >>>
> >>> Edit: links now:
> >>> - btrfs check: https://pastebin.com/nz6Bc145
> >>> - dmesg: https://pastebin.com/1GGpNiqk
> >>> - smartctl: https://pastebin.com/ADtYqfrd
> >>>
> >>> btrfs dev stats (not worth a link):
> >>>
> >>> [/dev/mapper/old].write_io_errs    0
> >>> [/dev/mapper/old].read_io_errs     0
> >>> [/dev/mapper/old].flush_io_errs    0
> >>> [/dev/mapper/old].corruption_errs  0
> >>> [/dev/mapper/old].generation_errs  0
> >>>
> >>>
> >>>> If these are of interest, here are reddit threads where I posted the
> >>>> issue and was referred here.
> >>>> 1) https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
> >>>> 2)  https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
> >>>>
> >>>> It has been suggested this is a hardware issue. I've already ordered a
> >>>> replacement m2.sata, but for sanity it would be great to know
> >>>> definitively this was the case. If anything stands out above that
> >>>> could indicate I'm not set up properly re. btrfs, that would also be
> >>>> fantastic so I don't repeat the issue!
> >>>>
> >>>> The only thing I've stumbled on is that I have been mounting with
> >>>> rd.luks.options=discard and that manually running fstrim is preferred.
> >>>>
> >>>>
> >>>> Many thanks for any input/suggestions,
> >>>> John
> >>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
       [not found]           ` <CA+M2ft9dcMKKQstZVcGQ=9MREbfhPF5GG=xoMoh5Aq8MK9P8wA@mail.gmail.com>
@ 2020-02-08 23:56             ` Qu Wenruo
  2020-02-09  0:51               ` John Hendy
  0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2020-02-08 23:56 UTC (permalink / raw)
  To: John Hendy; +Cc: Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 14766 bytes --]



On 2020/2/9 上午5:57, John Hendy wrote:
> On phone due to no OS, so apologies if this is in html mode. Indeed, I
> can't mount or boot any longer. I get the error:
> 
> Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed
> to recover log tree)
> BTRFS error (device dm-0): open_ctree failed

That can be easily fixed by `btrfs rescue zero-log`.

At least, btrfs check --repair didn't make things worse.
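
For reference, the sequence would be something like this (assuming the fs is
unmounted and /dev/mapper/ssd is the device, adjust to your setup):

$ btrfs rescue zero-log /dev/mapper/ssd
$ btrfs check /dev/mapper/ssd    # read-only check before mounting again

Note that zero-log throws away the log tree, so anything fsync'ed just before
the crash may be lost, but nothing else is touched.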

Thanks,
Qu
> 
> John
> 
> On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@gmail.com
> <mailto:jw.hendy@gmail.com>> wrote:
> 
>     This is not going so hot. Updates:
> 
>     booted from arch install, pre repair btrfs check:
>     - https://pastebin.com/6vNaSdf2
> 
>     btrfs check --mode=lowmem as requested by Chris:
>     - https://pastebin.com/uSwSTVVY
> 
>     Then I did btrfs check --repair, which seg faulted at the end. I've
>     typed them off of pictures I took:
> 
>     Starting repair.
>     Opening filesystem to check...
>     Checking filesystem on /dev/mapper/ssd
>     [1/7] checking root items
>     Fixed 0 roots.
>     [2/7] checking extents
>     parent transid verify failed on 20271138064 wanted 68719924810 found
>     448074
>     parent transid verify failed on 20271138064 wanted 68719924810 found
>     448074
>     Ignoring transid failure
>     # ... repeated the previous two lines maybe hundreds of times
>     # ended with this:
>     ref mismatch on [12797435904 268505088] extent item 1, found 412
>     [1] 1814 segmentation fault (core dumped) btrfs check --repair
>     /dev/mapper/ssd
> 
>     This was with btrfs-progs 5.4 (the install USB is maybe a month old).
> 
>     Here is the output of btrfs check after the --repair attempt:
>     - https://pastebin.com/6MYRNdga
> 
>     I rebooted to write this email given the seg fault, as I wanted to
>     make sure that I should still follow up --repair with
>     --init-csum-tree. I had pictures of the --repair output, but Firefox
>     just wouldn't load imgur.com <http://imgur.com> for me to post the
>     pics and was acting
>     really weird. In suspiciously checking dmesg, things have gone ro on
>     really weird. Checking dmesg out of suspicion, I found things have gone ro on
>     - https://pastebin.com/a2z7xczy
> 
>     The gist is:
> 
>     [   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
>     block=172703744 slot=0, csum end range (12980568064) goes beyond the
>     start range (12980297728) of the next csum item
>     [   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
>     total ptrs 34 free space 29 owner 7
>     [   40.997942]     item 0 key (18446744073709551606 128 12979060736)
>     itemoff 14811 itemsize 1472
>     [   40.997944]     item 1 key (18446744073709551606 128 12980297728)
>     itemoff 13895 itemsize 916
>     [   40.997945]     item 2 key (18446744073709551606 128 12981235712)
>     itemoff 13811 itemsize 84
>     # ... there's maybe 30 of these item n key lines in total
>     [   40.997984] BTRFS error (device dm-0): block=172703744 write time
>     tree block corruption detected
>     [   41.016793] BTRFS: error (device dm-0) in
>     btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
>     writing out transaction)
>     [   41.016799] BTRFS info (device dm-0): forced readonly
>     [   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
>     transaction.
>     [   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
>     errno=-5 IO failure
>     [   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
>     [   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
>     transaction.
>     [   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
>     [   44.509418] systemd-journald[416]:
>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
>     Journal file corrupted, rotating.
>     [   44.509440] systemd-journald[416]: Failed to rotate
>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
>     Read-only file system
>     [   44.509450] systemd-journald[416]: Failed to rotate
>     /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
>     Read-only file system
>     [   44.509540] systemd-journald[416]: Failed to write entry (23 items,
>     705 bytes) despite vacuuming, ignoring: Bad message
>     # ... then a bunch of these failed journal attempts (of note:
>     /var/log/journal was one of the bad inodes from btrfs check
>     previously)
> 
>     Kindly let me know what you would recommend. I'm sadly back to an
>     unusable system vs. a complaining/worrisome one. This is similar to
>     the behavior I had with the m2.sata nvme drive in my original
>     experience. After trying all of --repair, --init-csum-tree, and
>     --init-extent-tree, I couldn't boot anymore. After my dm-crypt
>     password at boot, I just saw a bunch of [FAILED] in the text splash
>     output. Hoping to not repeat that with this drive.
> 
>     Thanks,
>     John
> 
> 
>     On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com
>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
>     >
>     >
>     >
>     > On 2020/2/8 下午12:48, John Hendy wrote:
>     > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com
>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
>     > >>
>     > >>
>     > >>
>     > >> On 2020/2/8 上午1:52, John Hendy wrote:
>     > >>> Greetings,
>     > >>>
>     > >>> I'm resending, as this isn't showing in the archives. Perhaps
>     it was
>     > >>> the attachments, which I've converted to pastebin links.
>     > >>>
>     > >>> As an update, I'm now running off of a different drive (ssd,
>     not the
>     > >>> nvme) and I got the error again! I'm now inclined to think
>     this might
>     > >>> not be hardware after all, but something related to my setup
>     or a bug
>     > >>> with chromium.
>     > >>>
>     > >>> After a reboot, chromium wouldn't start for me and demsg showed
>     > >>> similar parent transid/csum errors to my original post below.
>     I used
>     > >>> btrfs-inspect-internal to find the inode traced to
>     > >>> ~/.config/chromium/History. I deleted that, and got a new set of
>     > >>> errors tracing to ~/.config/chromium/Cookies. After I deleted
>     that and
>     > >>> tried starting chromium, I found that my btrfs /home/jwhendy
>     pool was
>     > >>> mounted ro just like the original problem below.
>     > >>>
>     > >>> dmesg after trying to start chromium:
>     > >>> - https://pastebin.com/CsCEQMJa
>     > >>
>     > >> So far, it's only transid bug in your csum tree.
>     > >>
>     > >> And two backref mismatch in data backref.
>     > >>
>     > >> In theory, you can fix your problem by `btrfs check --repair
>     > >> --init-csum-tree`.
>     > >>
>     > >
>     > > Now that I might be narrowing in on offending files, I'll wait
>     to see
>     > > what you think from my last response to Chris. I did try the above
>     > > when I first ran into this:
>     > > -
>     https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
>     >
>     > That RO is caused by the missing data backref.
>     >
>     > Which can be fixed by btrfs check --repair.
>     >
>     > Then you should be able to delete offending files them. (Or the whole
>     > chromium cache, and switch to firefox if you wish :P )
>     >
>     > But also please keep in mind that, the transid mismatch looks
>     happen in
>     > your csum tree, which means your csum tree is no longer reliable, and
>     > may cause -EIO reading unrelated files.
>     >
>     > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
>     >
>     > It can be done altogether by --repair --init-csum-tree, but to be
>     safe,
>     > please run --repair only first, then make sure btrfs check reports no
>     > error after that. Then go --init-csum-tree.
>     >
>     > >
>     > >> But I'm more interesting in how this happened.
>     > >
>     > > Me too :)
>     > >
>     > >> Have your every experienced any power loss for your NVME drive?
>     > >> I'm not say btrfs is unsafe against power loss, all fs should
>     be safe
>     > >> against power loss, I'm just curious about if mount time log
>     replay is
>     > >> involved, or just regular internal log replay.
>     > >>
>     > >> From your smartctl, the drive experienced 61 unsafe shutdown
>     with 2144
>     > >> power cycles.
>     > >
>     > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
>     > > caught off gaurd by low battery and instant power-off, I kick myself
>     > > and mean to set up a script to force poweroff before that
>     happens. So,
>     > > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
>     > > not over ~2 years. And actually, I mis-stated the age. I haven't
>     > > *booted* from this drive in almost 2yrs. It's a corporate laptop,
>     > > issued every 3, so the ssd drive is more like 5 years old.
>     > >
>     > >> Not sure if it's related.
>     > >>
>     > >> Another interesting point is, did you remember what's the
>     oldest kernel
>     > >> running on this fs? v5.4 or v5.5?
>     > >
>     > > Hard to say, but arch linux maintains a package archive. The nvme
>     > > drive is from ~May 2018. The archives only go back to Jan 2019
>     and the
>     > > kernel/btrfs-progs was at 4.20 then:
>     > > - https://archive.archlinux.org/packages/l/linux/
>     >
>     > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
>     > cause metadata corruption. And the symptom is transid error, which
>     also
>     > matches your problem.
>     >
>     > Thanks,
>     > Qu
>     >
>     > >
>     > > Searching my Amazon orders, the SSD was in the 2015 time frame,
>     so the
>     > > kernel version would have been even older.
>     > >
>     > > Thanks for your input,
>     > > John
>     > >
>     > >>
>     > >> Thanks,
>     > >> Qu
>     > >>>
>     > >>> Thanks for any pointers, as it would now seem that my purchase
>     of a
>     > >>> new m2.sata may not buy my way out of this problem! While I didn't
>     > >>> want to reinstall, at least new hardware is a simple fix. Now I'm
>     > >>> worried there is a deeper issue bound to recur :(
>     > >>>
>     > >>> Best regards,
>     > >>> John
>     > >>>
>     > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com
>     <mailto:jw.hendy@gmail.com>> wrote:
>     > >>>>
>     > >>>> Greetings,
>     > >>>>
>     > >>>> I've had this issue occur twice, once ~1mo ago and once a
>     couple of
>     > >>>> weeks ago. Chromium suddenly quit on me, and when trying to
>     start it
>     > >>>> again, it complained about a lock file in ~. I tried to delete it
>     > >>>> manually and was informed I was on a read-only fs! I ended up
>     biting
>     > >>>> the bullet and re-installing linux due to the number of dead end
>     > >>>> threads and slow response rates on diagnosing these issues,
>     and the
>     > >>>> issue occurred again shortly after.
>     > >>>>
>     > >>>> $ uname -a
>     > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020
>     16:38:40
>     > >>>> +0000 x86_64 GNU/Linux
>     > >>>>
>     > >>>> $ btrfs --version
>     > >>>> btrfs-progs v5.4
>     > >>>>
>     > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be
>     mounting a subvol on /
>     > >>>> Data, single: total=114.01GiB, used=80.88GiB
>     > >>>> System, single: total=32.00MiB, used=16.00KiB
>     > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
>     > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
>     > >>>>
>     > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
>     > >>>> nvme0n1                                       259:5    0
>     232.9G  0 disk
>     > >>>> ├─nvme0n1p1                                   259:6    0 
>      512M  0
>     > >>>> part  (/boot/efi)
>     > >>>> ├─nvme0n1p2                                   259:7    0   
>      1G  0 part  (/boot)
>     > >>>> └─nvme0n1p3                                   259:8    0
>     231.4G  0 part (btrfs)
>     > >>>>
>     > >>>> I have the following subvols:
>     > >>>> arch: used for / when booting arch
>     > >>>> jwhendy: used for /home/jwhendy on arch
>     > >>>> vault: shared data between distros on /mnt/vault
>     > >>>> bionic: root when booting ubuntu bionic
>     > >>>>
>     > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>     > >>>>
>     > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
>     > >>>
>     > >>> Edit: links now:
>     > >>> - btrfs check: https://pastebin.com/nz6Bc145
>     > >>> - dmesg: https://pastebin.com/1GGpNiqk
>     > >>> - smartctl: https://pastebin.com/ADtYqfrd
>     > >>>
>     > >>> btrfs dev stats (not worth a link):
>     > >>>
>     > >>> [/dev/mapper/old].write_io_errs    0
>     > >>> [/dev/mapper/old].read_io_errs     0
>     > >>> [/dev/mapper/old].flush_io_errs    0
>     > >>> [/dev/mapper/old].corruption_errs  0
>     > >>> [/dev/mapper/old].generation_errs  0
>     > >>>
>     > >>>
>     > >>>> If these are of interested, here are reddit threads where I
>     posted the
>     > >>>> issue and was referred here.
>     > >>>> 1)
>     https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
>     > >>>> 2) 
>     https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>     > >>>>
>     > >>>> It has been suggested this is a hardware issue. I've already
>     ordered a
>     > >>>> replacement m2.sata, but for sanity it would be great to know
>     > >>>> definitively this was the case. If anything stands out above that
>     > >>>> could indicate I'm not setup properly re. btrfs, that would
>     also be
>     > >>>> fantastic so I don't repeat the issue!
>     > >>>>
>     > >>>> The only thing I've stumbled on is that I have been mounting with
>     > >>>> rd.luks.options=discard and that manually running fstrim is
>     preferred.
>     > >>>>
>     > >>>>
>     > >>>> Many thanks for any input/suggestions,
>     > >>>> John
>     > >>
>     >
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-02-08 23:56             ` Qu Wenruo
@ 2020-02-09  0:51               ` John Hendy
  2020-02-09  0:59                 ` John Hendy
  2020-02-09  1:07                 ` Qu Wenruo
  0 siblings, 2 replies; 24+ messages in thread
From: John Hendy @ 2020-02-09  0:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Sat, Feb 8, 2020 at 5:56 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/2/9 上午5:57, John Hendy wrote:
> > On phone due to no OS, so apologies if this is in html mode. Indeed, I
> > can't mount or boot any longer. I get the error:
> >
> > Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed
> > to recover log tree)
> > BTRFS error (device dm-0): open_ctree failed
>
> That can be easily fixed by `btrfs rescue zero-log`.
>

Whew. This was most helpful and it is wonderful to be booting at
least. I think the outstanding issues are:
- what should I do about `btrfs check --repair` seg faulting?
- how can I deal with this (probably related to the seg fault) ghost file
that cannot be deleted?
- I'm not sure if you looked at the post --repair log, but there are a ton
of these errors that weren't there before:

backpointer mismatch on [13037375488 20480]
ref mismatch on [13037395968 892928] extent item 0, found 1
data backref 13037395968 root 263 owner 4257169 offset 0 num_refs 0
not found in extent tree
incorrect local backref count on 13037395968 root 263 owner 4257169
offset 0 found 1 wanted 0 back 0x5627f59cadc0

Here is the latest btrfs check output after the zero-log operation.
- https://pastebin.com/KWeUnk0y

I'm hoping once that file is deleted, it's a matter of
--init-csum-tree and perhaps I'm set? Or --init-extent-tree?

Thanks,
John

> At least, btrfs check --repair didn't make things worse.
>
> Thanks,
> Qu
> >
> > John
> >
> > On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@gmail.com
> > <mailto:jw.hendy@gmail.com>> wrote:
> >
> >     This is not going so hot. Updates:
> >
> >     booted from arch install, pre repair btrfs check:
> >     - https://pastebin.com/6vNaSdf2
> >
> >     btrfs check --mode=lowmem as requested by Chris:
> >     - https://pastebin.com/uSwSTVVY
> >
> >     Then I did btrfs check --repair, which seg faulted at the end. I've
> >     typed them off of pictures I took:
> >
> >     Starting repair.
> >     Opening filesystem to check...
> >     Checking filesystem on /dev/mapper/ssd
> >     [1/7] checking root items
> >     Fixed 0 roots.
> >     [2/7] checking extents
> >     parent transid verify failed on 20271138064 wanted 68719924810 found
> >     448074
> >     parent transid verify failed on 20271138064 wanted 68719924810 found
> >     448074
> >     Ignoring transid failure
> >     # ... repeated the previous two lines maybe hundreds of times
> >     # ended with this:
> >     ref mismatch on [12797435904 268505088] extent item 1, found 412
> >     [1] 1814 segmentation fault (core dumped) btrfs check --repair
> >     /dev/mapper/ssd
> >
> >     This was with btrfs-progs 5.4 (the install USB is maybe a month old).
> >
> >     Here is the output of btrfs check after the --repair attempt:
> >     - https://pastebin.com/6MYRNdga
> >
> >     I rebooted to write this email given the seg fault, as I wanted to
> >     make sure that I should still follow-up --repair with
> >     --init-csum-tree. I had pictures of the --repair output, but Firefox
> >     just wouldn't load imgur.com <http://imgur.com> for me to post the
> >     pics and was acting
> >     really weird. In suspiciously checking dmesg, things have gone ro on
> >     me :(  Here is the dmesg from this session:
> >     - https://pastebin.com/a2z7xczy
> >
> >     The gist is:
> >
> >     [   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
> >     block=172703744 slot=0, csum end range (12980568064) goes beyond the
> >     start range (12980297728) of the next csum item
> >     [   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
> >     total ptrs 34 free space 29 owner 7
> >     [   40.997942]     item 0 key (18446744073709551606 128 12979060736)
> >     itemoff 14811 itemsize 1472
> >     [   40.997944]     item 1 key (18446744073709551606 128 12980297728)
> >     itemoff 13895 itemsize 916
> >     [   40.997945]     item 2 key (18446744073709551606 128 12981235712)
> >     itemoff 13811 itemsize 84
> >     # ... there's maybe 30 of these item n key lines in total
> >     [   40.997984] BTRFS error (device dm-0): block=172703744 write time
> >     tree block corruption detected
> >     [   41.016793] BTRFS: error (device dm-0) in
> >     btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
> >     writing out transaction)
> >     [   41.016799] BTRFS info (device dm-0): forced readonly
> >     [   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
> >     transaction.
> >     [   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
> >     errno=-5 IO failure
> >     [   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
> >     [   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
> >     transaction.
> >     [   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
> >     [   44.509418] systemd-journald[416]:
> >     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
> >     Journal file corrupted, rotating.
> >     [   44.509440] systemd-journald[416]: Failed to rotate
> >     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
> >     Read-only file system
> >     [   44.509450] systemd-journald[416]: Failed to rotate
> >     /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
> >     Read-only file system
> >     [   44.509540] systemd-journald[416]: Failed to write entry (23 items,
> >     705 bytes) despite vacuuming, ignoring: Bad message
> >     # ... then a bunch of these failed journal attempts (of note:
> >     /var/log/journal was one of the bad inodes from btrfs check
> >     previously)
> >
> >     Kindly let me know what you would recommend. I'm sadly back to an
> >     unusable system vs. a complaining/worrisome one. This is similar to
> >     the behavior I had with the m2.sata nvme drive in my original
> >     experience. After trying all of --repair, --init-csum-tree, and
> >     --init-extent-tree, I couldn't boot anymore. After my dm-crypt
> >     password at boot, I just saw a bunch of [FAILED] in the text splash
> >     output. Hoping to not repeat that with this drive.
> >
> >     Thanks,
> >     John
> >
> >
> >     On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com
> >     <mailto:quwenruo.btrfs@gmx.com>> wrote:
> >     >
> >     >
> >     >
> >     > On 2020/2/8 下午12:48, John Hendy wrote:
> >     > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com
> >     <mailto:quwenruo.btrfs@gmx.com>> wrote:
> >     > >>
> >     > >>
> >     > >>
> >     > >> On 2020/2/8 上午1:52, John Hendy wrote:
> >     > >>> Greetings,
> >     > >>>
> >     > >>> I'm resending, as this isn't showing in the archives. Perhaps
> >     it was
> >     > >>> the attachments, which I've converted to pastebin links.
> >     > >>>
> >     > >>> As an update, I'm now running off of a different drive (ssd,
> >     not the
> >     > >>> nvme) and I got the error again! I'm now inclined to think
> >     this might
> >     > >>> not be hardware after all, but something related to my setup
> >     or a bug
> >     > >>> with chromium.
> >     > >>>
> >     > >>> After a reboot, chromium wouldn't start for me and demsg showed
> >     > >>> similar parent transid/csum errors to my original post below.
> >     I used
> >     > >>> btrfs-inspect-internal to find the inode traced to
> >     > >>> ~/.config/chromium/History. I deleted that, and got a new set of
> >     > >>> errors tracing to ~/.config/chromium/Cookies. After I deleted
> >     that and
> >     > >>> tried starting chromium, I found that my btrfs /home/jwhendy
> >     pool was
> >     > >>> mounted ro just like the original problem below.
> >     > >>>
> >     > >>> dmesg after trying to start chromium:
> >     > >>> - https://pastebin.com/CsCEQMJa
> >     > >>
> >     > >> So far, it's only transid bug in your csum tree.
> >     > >>
> >     > >> And two backref mismatch in data backref.
> >     > >>
> >     > >> In theory, you can fix your problem by `btrfs check --repair
> >     > >> --init-csum-tree`.
> >     > >>
> >     > >
> >     > > Now that I might be narrowing in on offending files, I'll wait
> >     to see
> >     > > what you think from my last response to Chris. I did try the above
> >     > > when I first ran into this:
> >     > > -
> >     https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
> >     >
> >     > That RO is caused by the missing data backref.
> >     >
> >     > Which can be fixed by btrfs check --repair.
> >     >
> >     > Then you should be able to delete offending files them. (Or the whole
> >     > chromium cache, and switch to firefox if you wish :P )
> >     >
> >     > But also please keep in mind that, the transid mismatch looks
> >     happen in
> >     > your csum tree, which means your csum tree is no longer reliable, and
> >     > may cause -EIO reading unrelated files.
> >     >
> >     > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
> >     >
> >     > It can be done altogether by --repair --init-csum-tree, but to be
> >     safe,
> >     > please run --repair only first, then make sure btrfs check reports no
> >     > error after that. Then go --init-csum-tree.
> >     >
> >     > >
> >     > >> But I'm more interesting in how this happened.
> >     > >
> >     > > Me too :)
> >     > >
> >     > >> Have your every experienced any power loss for your NVME drive?
> >     > >> I'm not say btrfs is unsafe against power loss, all fs should
> >     be safe
> >     > >> against power loss, I'm just curious about if mount time log
> >     replay is
> >     > >> involved, or just regular internal log replay.
> >     > >>
> >     > >> From your smartctl, the drive experienced 61 unsafe shutdown
> >     with 2144
> >     > >> power cycles.
> >     > >
> >     > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
> >     > > caught off gaurd by low battery and instant power-off, I kick myself
> >     > > and mean to set up a script to force poweroff before that
> >     happens. So,
> >     > > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
> >     > > not over ~2 years. And actually, I mis-stated the age. I haven't
> >     > > *booted* from this drive in almost 2yrs. It's a corporate laptop,
> >     > > issued every 3, so the ssd drive is more like 5 years old.
> >     > >
> >     > >> Not sure if it's related.
> >     > >>
> >     > >> Another interesting point is, did you remember what's the
> >     oldest kernel
> >     > >> running on this fs? v5.4 or v5.5?
> >     > >
> >     > > Hard to say, but arch linux maintains a package archive. The nvme
> >     > > drive is from ~May 2018. The archives only go back to Jan 2019
> >     and the
> >     > > kernel/btrfs-progs was at 4.20 then:
> >     > > - https://archive.archlinux.org/packages/l/linux/
> >     >
> >     > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
> >     > cause metadata corruption. And the symptom is transid error, which
> >     also
> >     > matches your problem.
> >     >
> >     > Thanks,
> >     > Qu
> >     >
> >     > >
> >     > > Searching my Amazon orders, the SSD was in the 2015 time frame,
> >     so the
> >     > > kernel version would have been even older.
> >     > >
> >     > > Thanks for your input,
> >     > > John
> >     > >
> >     > >>
> >     > >> Thanks,
> >     > >> Qu
> >     > >>>
> >     > >>> Thanks for any pointers, as it would now seem that my purchase
> >     of a
> >     > >>> new m2.sata may not buy my way out of this problem! While I didn't
> >     > >>> want to reinstall, at least new hardware is a simple fix. Now I'm
> >     > >>> worried there is a deeper issue bound to recur :(
> >     > >>>
> >     > >>> Best regards,
> >     > >>> John
> >     > >>>
> >     > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com
> >     <mailto:jw.hendy@gmail.com>> wrote:
> >     > >>>>
> >     > >>>> Greetings,
> >     > >>>>
> >     > >>>> I've had this issue occur twice, once ~1mo ago and once a
> >     couple of
> >     > >>>> weeks ago. Chromium suddenly quit on me, and when trying to
> >     start it
> >     > >>>> again, it complained about a lock file in ~. I tried to delete it
> >     > >>>> manually and was informed I was on a read-only fs! I ended up
> >     biting
> >     > >>>> the bullet and re-installing linux due to the number of dead end
> >     > >>>> threads and slow response rates on diagnosing these issues,
> >     and the
> >     > >>>> issue occurred again shortly after.
> >     > >>>>
> >     > >>>> $ uname -a
> >     > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020
> >     16:38:40
> >     > >>>> +0000 x86_64 GNU/Linux
> >     > >>>>
> >     > >>>> $ btrfs --version
> >     > >>>> btrfs-progs v5.4
> >     > >>>>
> >     > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be
> >     mounting a subvol on /
> >     > >>>> Data, single: total=114.01GiB, used=80.88GiB
> >     > >>>> System, single: total=32.00MiB, used=16.00KiB
> >     > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
> >     > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
> >     > >>>>
> >     > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
> >     > >>>> nvme0n1                                       259:5    0
> >     232.9G  0 disk
> >     > >>>> ├─nvme0n1p1                                   259:6    0
> >      512M  0
> >     > >>>> part  (/boot/efi)
> >     > >>>> ├─nvme0n1p2                                   259:7    0
> >      1G  0 part  (/boot)
> >     > >>>> └─nvme0n1p3                                   259:8    0
> >     231.4G  0 part (btrfs)
> >     > >>>>
> >     > >>>> I have the following subvols:
> >     > >>>> arch: used for / when booting arch
> >     > >>>> jwhendy: used for /home/jwhendy on arch
> >     > >>>> vault: shared data between distros on /mnt/vault
> >     > >>>> bionic: root when booting ubuntu bionic
> >     > >>>>
> >     > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
> >     > >>>>
> >     > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
> >     > >>>
> >     > >>> Edit: links now:
> >     > >>> - btrfs check: https://pastebin.com/nz6Bc145
> >     > >>> - dmesg: https://pastebin.com/1GGpNiqk
> >     > >>> - smartctl: https://pastebin.com/ADtYqfrd
> >     > >>>
> >     > >>> btrfs dev stats (not worth a link):
> >     > >>>
> >     > >>> [/dev/mapper/old].write_io_errs    0
> >     > >>> [/dev/mapper/old].read_io_errs     0
> >     > >>> [/dev/mapper/old].flush_io_errs    0
> >     > >>> [/dev/mapper/old].corruption_errs  0
> >     > >>> [/dev/mapper/old].generation_errs  0
> >     > >>>
> >     > >>>
> >     > >>>> If these are of interested, here are reddit threads where I
> >     posted the
> >     > >>>> issue and was referred here.
> >     > >>>> 1)
> >     https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
> >     > >>>> 2)
> >     https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
> >     > >>>>
> >     > >>>> It has been suggested this is a hardware issue. I've already
> >     ordered a
> >     > >>>> replacement m2.sata, but for sanity it would be great to know
> >     > >>>> definitively this was the case. If anything stands out above that
> >     > >>>> could indicate I'm not setup properly re. btrfs, that would
> >     also be
> >     > >>>> fantastic so I don't repeat the issue!
> >     > >>>>
> >     > >>>> The only thing I've stumbled on is that I have been mounting with
> >     > >>>> rd.luks.options=discard and that manually running fstrim is
> >     preferred.
> >     > >>>>
> >     > >>>>
> >     > >>>> Many thanks for any input/suggestions,
> >     > >>>> John
> >     > >>
> >     >
> >
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-02-09  0:51               ` John Hendy
@ 2020-02-09  0:59                 ` John Hendy
  2020-02-09  1:09                   ` Qu Wenruo
  2020-02-09  1:07                 ` Qu Wenruo
  1 sibling, 1 reply; 24+ messages in thread
From: John Hendy @ 2020-02-09  0:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

Also, if it's of interest, the zero-log trick was new to me. For my
original m2.sata nvme drive, I'd already run all of --init-csum-tree,
--init-extent-tree, and --repair (unsure on the order of the first
two, but --repair was definitely last) but could then not mount it. I
just ran `btrfs rescue zero-log` on it and here is the very brief
output from a btrfs check:

$ sudo btrfs check /dev/mapper/nvme
Opening filesystem to check...
Checking filesystem on /dev/mapper/nvme
UUID: 488f733d-1dfd-4a0f-ab2f-ba690e095fe4
[1/7] checking root items
[2/7] checking extents
data backref 40762777600 root 256 owner 525787 offset 0 num_refs 0 not
found in extent tree
incorrect local backref count on 40762777600 root 256 owner 525787
offset 0 found 1 wanted 0 back 0x5635831f9a20
incorrect local backref count on 40762777600 root 4352 owner 525787
offset 0 found 0 wanted 1 back 0x56357e5a3c70
backref disk bytenr does not match extent record, bytenr=40762777600,
ref bytenr=0
backpointer mismatch on [40762777600 4096]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 87799443456 bytes used, error(s) found
total csum bytes: 84696784
total tree bytes: 954220544
total fs tree bytes: 806535168
total extent tree bytes: 47710208
btree space waste bytes: 150766636
file data blocks allocated: 87780622336
 referenced 94255783936

If that looks promising, I'm hoping that the ssd we're currently
working on will follow suit! I'll await your recommendation on my
earlier questions about the SSD, and any suggestions for the backref
errors on the nvme drive above.

Many thanks,
John

On Sat, Feb 8, 2020 at 6:51 PM John Hendy <jw.hendy@gmail.com> wrote:
>
> On Sat, Feb 8, 2020 at 5:56 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> >
> > On 2020/2/9 上午5:57, John Hendy wrote:
> > > On phone due to no OS, so apologies if this is in html mode. Indeed, I
> > > can't mount or boot any longer. I get the error:
> > >
> > > Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed
> > > to recover log tree)
> > > BTRFS error (device dm-0): open_ctree failed
> >
> > That can be easily fixed by `btrfs rescue zero-log`.
> >
>
> Whew. This was most helpful and it is wonderful to be booting at
> least. I think the outstanding issues are:
> - what should I do about `btrfs check --repair seg` faulting?
> - how can I deal with this (probably related to seg fault) ghost file
> that cannot be deleted?
> - I'm not sure if you looked at the post --repair log, but there a ton
> of these errors that didn't used to be there:
>
> backpointer mismatch on [13037375488 20480]
> ref mismatch on [13037395968 892928] extent item 0, found 1
> data backref 13037395968 root 263 owner 4257169 offset 0 num_refs 0
> not found in extent tree
> incorrect local backref count on 13037395968 root 263 owner 4257169
> offset 0 found 1 wanted 0 back 0x5627f59cadc0
>
> Here is the latest btrfs check output after the zero-log operation.
> - https://pastebin.com/KWeUnk0y
>
> I'm hoping once that file is deleted, it's a matter of
> --init-csum-tree and perhaps I'm set? Or --init-extent-tree?
>
> Thanks,
> John
>
> > At least, btrfs check --repair didn't make things worse.
> >
> > Thanks,
> > Qu
> > >
> > > John
> > >
> > > On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@gmail.com
> > > <mailto:jw.hendy@gmail.com>> wrote:
> > >
> > >     This is not going so hot. Updates:
> > >
> > >     booted from arch install, pre repair btrfs check:
> > >     - https://pastebin.com/6vNaSdf2
> > >
> > >     btrfs check --mode=lowmem as requested by Chris:
> > >     - https://pastebin.com/uSwSTVVY
> > >
> > >     Then I did btrfs check --repair, which seg faulted at the end. I've
> > >     typed them off of pictures I took:
> > >
> > >     Starting repair.
> > >     Opening filesystem to check...
> > >     Checking filesystem on /dev/mapper/ssd
> > >     [1/7] checking root items
> > >     Fixed 0 roots.
> > >     [2/7] checking extents
> > >     parent transid verify failed on 20271138064 wanted 68719924810 found
> > >     448074
> > >     parent transid verify failed on 20271138064 wanted 68719924810 found
> > >     448074
> > >     Ignoring transid failure
> > >     # ... repeated the previous two lines maybe hundreds of times
> > >     # ended with this:
> > >     ref mismatch on [12797435904 268505088] extent item 1, found 412
> > >     [1] 1814 segmentation fault (core dumped) btrfs check --repair
> > >     /dev/mapper/ssd
> > >
> > >     This was with btrfs-progs 5.4 (the install USB is maybe a month old).
> > >
> > >     Here is the output of btrfs check after the --repair attempt:
> > >     - https://pastebin.com/6MYRNdga
> > >
> > >     I rebooted to write this email given the seg fault, as I wanted to
> > >     make sure that I should still follow-up --repair with
> > >     --init-csum-tree. I had pictures of the --repair output, but Firefox
> > >     just wouldn't load imgur.com <http://imgur.com> for me to post the
> > >     pics and was acting
> > >     really weird. In suspiciously checking dmesg, things have gone ro on
> > >     me :(  Here is the dmesg from this session:
> > >     - https://pastebin.com/a2z7xczy
> > >
> > >     The gist is:
> > >
> > >     [   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
> > >     block=172703744 slot=0, csum end range (12980568064) goes beyond the
> > >     start range (12980297728) of the next csum item
> > >     [   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
> > >     total ptrs 34 free space 29 owner 7
> > >     [   40.997942]     item 0 key (18446744073709551606 128 12979060736)
> > >     itemoff 14811 itemsize 1472
> > >     [   40.997944]     item 1 key (18446744073709551606 128 12980297728)
> > >     itemoff 13895 itemsize 916
> > >     [   40.997945]     item 2 key (18446744073709551606 128 12981235712)
> > >     itemoff 13811 itemsize 84
> > >     # ... there's maybe 30 of these item n key lines in total
> > >     [   40.997984] BTRFS error (device dm-0): block=172703744 write time
> > >     tree block corruption detected
> > >     [   41.016793] BTRFS: error (device dm-0) in
> > >     btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
> > >     writing out transaction)
> > >     [   41.016799] BTRFS info (device dm-0): forced readonly
> > >     [   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
> > >     transaction.
> > >     [   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
> > >     errno=-5 IO failure
> > >     [   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
> > >     [   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
> > >     transaction.
> > >     [   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
> > >     [   44.509418] systemd-journald[416]:
> > >     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
> > >     Journal file corrupted, rotating.
> > >     [   44.509440] systemd-journald[416]: Failed to rotate
> > >     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
> > >     Read-only file system
> > >     [   44.509450] systemd-journald[416]: Failed to rotate
> > >     /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
> > >     Read-only file system
> > >     [   44.509540] systemd-journald[416]: Failed to write entry (23 items,
> > >     705 bytes) despite vacuuming, ignoring: Bad message
> > >     # ... then a bunch of these failed journal attempts (of note:
> > >     /var/log/journal was one of the bad inodes from btrfs check
> > >     previously)
> > >
> > >     Kindly let me know what you would recommend. I'm sadly back to an
> > >     unusable system vs. a complaining/worrisome one. This is similar to
> > >     the behavior I had with the m2.sata nvme drive in my original
> > >     experience. After trying all of --repair, --init-csum-tree, and
> > >     --init-extent-tree, I couldn't boot anymore. After my dm-crypt
> > >     password at boot, I just saw a bunch of [FAILED] in the text splash
> > >     output. Hoping to not repeat that with this drive.
> > >
> > >     Thanks,
> > >     John
> > >
> > >
> > >     On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com
> > >     <mailto:quwenruo.btrfs@gmx.com>> wrote:
> > >     >
> > >     >
> > >     >
> > >     > On 2020/2/8 下午12:48, John Hendy wrote:
> > >     > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com
> > >     <mailto:quwenruo.btrfs@gmx.com>> wrote:
> > >     > >>
> > >     > >>
> > >     > >>
> > >     > >> On 2020/2/8 上午1:52, John Hendy wrote:
> > >     > >>> Greetings,
> > >     > >>>
> > >     > >>> I'm resending, as this isn't showing in the archives. Perhaps
> > >     it was
> > >     > >>> the attachments, which I've converted to pastebin links.
> > >     > >>>
> > >     > >>> As an update, I'm now running off of a different drive (ssd,
> > >     not the
> > >     > >>> nvme) and I got the error again! I'm now inclined to think
> > >     this might
> > >     > >>> not be hardware after all, but something related to my setup
> > >     or a bug
> > >     > >>> with chromium.
> > >     > >>>
> > >     > >>> After a reboot, chromium wouldn't start for me and demsg showed
> > >     > >>> similar parent transid/csum errors to my original post below.
> > >     I used
> > >     > >>> btrfs-inspect-internal to find the inode traced to
> > >     > >>> ~/.config/chromium/History. I deleted that, and got a new set of
> > >     > >>> errors tracing to ~/.config/chromium/Cookies. After I deleted
> > >     that and
> > >     > >>> tried starting chromium, I found that my btrfs /home/jwhendy
> > >     pool was
> > >     > >>> mounted ro just like the original problem below.
> > >     > >>>
> > >     > >>> dmesg after trying to start chromium:
> > >     > >>> - https://pastebin.com/CsCEQMJa
> > >     > >>
> > >     > >> So far, it's only transid bug in your csum tree.
> > >     > >>
> > >     > >> And two backref mismatch in data backref.
> > >     > >>
> > >     > >> In theory, you can fix your problem by `btrfs check --repair
> > >     > >> --init-csum-tree`.
> > >     > >>
> > >     > >
> > >     > > Now that I might be narrowing in on offending files, I'll wait
> > >     to see
> > >     > > what you think from my last response to Chris. I did try the above
> > >     > > when I first ran into this:
> > >     > > -
> > >     https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
> > >     >
> > >     > That RO is caused by the missing data backref.
> > >     >
> > >     > Which can be fixed by btrfs check --repair.
> > >     >
> > >     > Then you should be able to delete offending files them. (Or the whole
> > >     > chromium cache, and switch to firefox if you wish :P )
> > >     >
> > >     > But also please keep in mind that, the transid mismatch looks
> > >     happen in
> > >     > your csum tree, which means your csum tree is no longer reliable, and
> > >     > may cause -EIO reading unrelated files.
> > >     >
> > >     > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
> > >     >
> > >     > It can be done altogether by --repair --init-csum-tree, but to be
> > >     safe,
> > >     > please run --repair only first, then make sure btrfs check reports no
> > >     > error after that. Then go --init-csum-tree.
> > >     >
> > >     > >
> > >     > >> But I'm more interesting in how this happened.
> > >     > >
> > >     > > Me too :)
> > >     > >
> > >     > >> Have your every experienced any power loss for your NVME drive?
> > >     > >> I'm not say btrfs is unsafe against power loss, all fs should
> > >     be safe
> > >     > >> against power loss, I'm just curious about if mount time log
> > >     replay is
> > >     > >> involved, or just regular internal log replay.
> > >     > >>
> > >     > >> From your smartctl, the drive experienced 61 unsafe shutdown
> > >     with 2144
> > >     > >> power cycles.
> > >     > >
> > >     > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
> > >     > > caught off gaurd by low battery and instant power-off, I kick myself
> > >     > > and mean to set up a script to force poweroff before that
> > >     happens. So,
> > >     > > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
> > >     > > not over ~2 years. And actually, I mis-stated the age. I haven't
> > >     > > *booted* from this drive in almost 2yrs. It's a corporate laptop,
> > >     > > issued every 3, so the ssd drive is more like 5 years old.
> > >     > >
> > >     > >> Not sure if it's related.
> > >     > >>
> > >     > >> Another interesting point is, did you remember what's the
> > >     oldest kernel
> > >     > >> running on this fs? v5.4 or v5.5?
> > >     > >
> > >     > > Hard to say, but arch linux maintains a package archive. The nvme
> > >     > > drive is from ~May 2018. The archives only go back to Jan 2019
> > >     and the
> > >     > > kernel/btrfs-progs was at 4.20 then:
> > >     > > - https://archive.archlinux.org/packages/l/linux/
> > >     >
> > >     > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
> > >     > cause metadata corruption. And the symptom is transid error, which
> > >     also
> > >     > matches your problem.
> > >     >
> > >     > Thanks,
> > >     > Qu
> > >     >
> > >     > >
> > >     > > Searching my Amazon orders, the SSD was in the 2015 time frame,
> > >     so the
> > >     > > kernel version would have been even older.
> > >     > >
> > >     > > Thanks for your input,
> > >     > > John
> > >     > >
> > >     > >>
> > >     > >> Thanks,
> > >     > >> Qu
> > >     > >>>
> > >     > >>> Thanks for any pointers, as it would now seem that my purchase
> > >     of a
> > >     > >>> new m2.sata may not buy my way out of this problem! While I didn't
> > >     > >>> want to reinstall, at least new hardware is a simple fix. Now I'm
> > >     > >>> worried there is a deeper issue bound to recur :(
> > >     > >>>
> > >     > >>> Best regards,
> > >     > >>> John
> > >     > >>>
> > >     > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com
> > >     <mailto:jw.hendy@gmail.com>> wrote:
> > >     > >>>>
> > >     > >>>> Greetings,
> > >     > >>>>
> > >     > >>>> I've had this issue occur twice, once ~1mo ago and once a
> > >     couple of
> > >     > >>>> weeks ago. Chromium suddenly quit on me, and when trying to
> > >     start it
> > >     > >>>> again, it complained about a lock file in ~. I tried to delete it
> > >     > >>>> manually and was informed I was on a read-only fs! I ended up
> > >     biting
> > >     > >>>> the bullet and re-installing linux due to the number of dead end
> > >     > >>>> threads and slow response rates on diagnosing these issues,
> > >     and the
> > >     > >>>> issue occurred again shortly after.
> > >     > >>>>
> > >     > >>>> $ uname -a
> > >     > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020
> > >     16:38:40
> > >     > >>>> +0000 x86_64 GNU/Linux
> > >     > >>>>
> > >     > >>>> $ btrfs --version
> > >     > >>>> btrfs-progs v5.4
> > >     > >>>>
> > >     > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be
> > >     mounting a subvol on /
> > >     > >>>> Data, single: total=114.01GiB, used=80.88GiB
> > >     > >>>> System, single: total=32.00MiB, used=16.00KiB
> > >     > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
> > >     > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
> > >     > >>>>
> > >     > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
> > >     > >>>> nvme0n1                                       259:5    0
> > >     232.9G  0 disk
> > >     > >>>> ├─nvme0n1p1                                   259:6    0
> > >      512M  0
> > >     > >>>> part  (/boot/efi)
> > >     > >>>> ├─nvme0n1p2                                   259:7    0
> > >      1G  0 part  (/boot)
> > >     > >>>> └─nvme0n1p3                                   259:8    0
> > >     231.4G  0 part (btrfs)
> > >     > >>>>
> > >     > >>>> I have the following subvols:
> > >     > >>>> arch: used for / when booting arch
> > >     > >>>> jwhendy: used for /home/jwhendy on arch
> > >     > >>>> vault: shared data between distros on /mnt/vault
> > >     > >>>> bionic: root when booting ubuntu bionic
> > >     > >>>>
> > >     > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
> > >     > >>>>
> > >     > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
> > >     > >>>
> > >     > >>> Edit: links now:
> > >     > >>> - btrfs check: https://pastebin.com/nz6Bc145
> > >     > >>> - dmesg: https://pastebin.com/1GGpNiqk
> > >     > >>> - smartctl: https://pastebin.com/ADtYqfrd
> > >     > >>>
> > >     > >>> btrfs dev stats (not worth a link):
> > >     > >>>
> > >     > >>> [/dev/mapper/old].write_io_errs    0
> > >     > >>> [/dev/mapper/old].read_io_errs     0
> > >     > >>> [/dev/mapper/old].flush_io_errs    0
> > >     > >>> [/dev/mapper/old].corruption_errs  0
> > >     > >>> [/dev/mapper/old].generation_errs  0
> > >     > >>>
> > >     > >>>
> > >     > >>>> If these are of interested, here are reddit threads where I
> > >     posted the
> > >     > >>>> issue and was referred here.
> > >     > >>>> 1)
> > >     https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
> > >     > >>>> 2)
> > >     https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
> > >     > >>>>
> > >     > >>>> It has been suggested this is a hardware issue. I've already
> > >     ordered a
> > >     > >>>> replacement m2.sata, but for sanity it would be great to know
> > >     > >>>> definitively this was the case. If anything stands out above that
> > >     > >>>> could indicate I'm not setup properly re. btrfs, that would
> > >     also be
> > >     > >>>> fantastic so I don't repeat the issue!
> > >     > >>>>
> > >     > >>>> The only thing I've stumbled on is that I have been mounting with
> > >     > >>>> rd.luks.options=discard and that manually running fstrim is
> > >     preferred.
> > >     > >>>>
> > >     > >>>>
> > >     > >>>> Many thanks for any input/suggestions,
> > >     > >>>> John
> > >     > >>
> > >     >
> > >
> >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-02-09  0:51               ` John Hendy
  2020-02-09  0:59                 ` John Hendy
@ 2020-02-09  1:07                 ` Qu Wenruo
  2020-02-09  4:10                   ` John Hendy
  1 sibling, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2020-02-09  1:07 UTC (permalink / raw)
  To: John Hendy; +Cc: Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 17307 bytes --]



On 2020/2/9 上午8:51, John Hendy wrote:
> On Sat, Feb 8, 2020 at 5:56 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/2/9 上午5:57, John Hendy wrote:
>>> On phone due to no OS, so apologies if this is in html mode. Indeed, I
>>> can't mount or boot any longer. I get the error:
>>>
>>> Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed
>>> to recover log tree)
>>> BTRFS error (device dm-0): open_ctree failed
>>
>> That can be easily fixed by `btrfs rescue zero-log`.
>>
> 
> Whew. This was most helpful and it is wonderful to be booting at
> least. I think the outstanding issues are:
> - what should I do about `btrfs check --repair seg` faulting?

That needs extra debugging. But you can try `btrfs check --repair
--mode=lowmem`, which can sometimes bring better results than the regular mode.
The trade-off is much slower speed.
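
For example (assuming /dev/mapper/ssd as before, run from the install USB with
the fs unmounted):

$ btrfs check --mode=lowmem /dev/mapper/ssd            # read-only pass first
$ btrfs check --repair --mode=lowmem /dev/mapper/ssd   # only if the first pass looks sane

Low-memory mode repair support is still partial, so treat it as a best-effort
attempt.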

> - how can I deal with this (probably related to seg fault) ghost file
> that cannot be deleted?

Only `btrfs check` can handle it; the kernel will only fall back to RO to
prevent further corruption.
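
If you want to see which file owns a problematic extent before repairing, the
bytenr from the check output can be mapped back to a path while the fs is
mounted, e.g. with the 13037395968 bytenr from your log and the fs on /mnt:

$ btrfs inspect-internal logical-resolve 13037395968 /mnt
$ btrfs inspect-internal inode-resolve <inode number> /mnt   # if you have an inode number instead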

> - I'm not sure if you looked at the post --repair log, but there a ton
> of these errors that didn't used to be there:
> 
> backpointer mismatch on [13037375488 20480]
> ref mismatch on [13037395968 892928] extent item 0, found 1
> data backref 13037395968 root 263 owner 4257169 offset 0 num_refs 0
> not found in extent tree
> incorrect local backref count on 13037395968 root 263 owner 4257169
> offset 0 found 1 wanted 0 back 0x5627f59cadc0

All the 13037395968-related lines are just one problem; it's the original mode
producing human-unfriendly output.

But the extra transid errors look kinda dangerous.

I'd recommend backing up important data first before trying to repair.

> 
> Here is the latest btrfs check output after the zero-log operation.
> - https://pastebin.com/KWeUnk0y
> 
> I'm hoping once that file is deleted, it's a matter of
> --init-csum-tree and perhaps I'm set? Or --init-extent-tree?

--init-csum-tree has the lowest priority, so it doesn't really matter.

--init-extent-tree would in theory rebuild your extent tree from scratch,
but the problem is that the transid mismatch may cause it to go wrong.

So please back up your data before trying any repair.
After the backup, please try `btrfs check --repair --mode=lowmem` first.
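
For the backup itself, a rough sketch (mount point and destination
path are placeholders):

$ sudo mount -o ro /dev/mapper/ssd /mnt/rescue
$ sudo rsync -a /mnt/rescue/ /path/to/backup/

If it no longer mounts even read-only, `btrfs restore` can copy files
out without mounting the filesystem:

$ sudo btrfs restore /dev/mapper/ssd /path/to/backup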

Thanks,
Qu
> 
> Thanks,
> John
> 
>> At least, btrfs check --repair didn't make things worse.
>>
>> Thanks,
>> Qu
>>>
>>> John
>>>
>>> On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@gmail.com
>>> <mailto:jw.hendy@gmail.com>> wrote:
>>>
>>>     This is not going so hot. Updates:
>>>
>>>     booted from arch install, pre repair btrfs check:
>>>     - https://pastebin.com/6vNaSdf2
>>>
>>>     btrfs check --mode=lowmem as requested by Chris:
>>>     - https://pastebin.com/uSwSTVVY
>>>
>>>     Then I did btrfs check --repair, which seg faulted at the end. I've
>>>     typed them off of pictures I took:
>>>
>>>     Starting repair.
>>>     Opening filesystem to check...
>>>     Checking filesystem on /dev/mapper/ssd
>>>     [1/7] checking root items
>>>     Fixed 0 roots.
>>>     [2/7] checking extents
>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
>>>     448074
>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
>>>     448074
>>>     Ignoring transid failure
>>>     # ... repeated the previous two lines maybe hundreds of times
>>>     # ended with this:
>>>     ref mismatch on [12797435904 268505088] extent item 1, found 412
>>>     [1] 1814 segmentation fault (core dumped) btrfs check --repair
>>>     /dev/mapper/ssd
>>>
>>>     This was with btrfs-progs 5.4 (the install USB is maybe a month old).
>>>
>>>     Here is the output of btrfs check after the --repair attempt:
>>>     - https://pastebin.com/6MYRNdga
>>>
>>>     I rebooted to write this email given the seg fault, as I wanted to
>>>     make sure that I should still follow-up --repair with
>>>     --init-csum-tree. I had pictures of the --repair output, but Firefox
>>>     just wouldn't load imgur.com <http://imgur.com> for me to post the
>>>     pics and was acting
>>>     really weird. In suspiciously checking dmesg, things have gone ro on
>>>     me :(  Here is the dmesg from this session:
>>>     - https://pastebin.com/a2z7xczy
>>>
>>>     The gist is:
>>>
>>>     [   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
>>>     block=172703744 slot=0, csum end range (12980568064) goes beyond the
>>>     start range (12980297728) of the next csum item
>>>     [   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
>>>     total ptrs 34 free space 29 owner 7
>>>     [   40.997942]     item 0 key (18446744073709551606 128 12979060736)
>>>     itemoff 14811 itemsize 1472
>>>     [   40.997944]     item 1 key (18446744073709551606 128 12980297728)
>>>     itemoff 13895 itemsize 916
>>>     [   40.997945]     item 2 key (18446744073709551606 128 12981235712)
>>>     itemoff 13811 itemsize 84
>>>     # ... there's maybe 30 of these item n key lines in total
>>>     [   40.997984] BTRFS error (device dm-0): block=172703744 write time
>>>     tree block corruption detected
>>>     [   41.016793] BTRFS: error (device dm-0) in
>>>     btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
>>>     writing out transaction)
>>>     [   41.016799] BTRFS info (device dm-0): forced readonly
>>>     [   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
>>>     transaction.
>>>     [   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
>>>     errno=-5 IO failure
>>>     [   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
>>>     [   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
>>>     transaction.
>>>     [   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
>>>     [   44.509418] systemd-journald[416]:
>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
>>>     Journal file corrupted, rotating.
>>>     [   44.509440] systemd-journald[416]: Failed to rotate
>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
>>>     Read-only file system
>>>     [   44.509450] systemd-journald[416]: Failed to rotate
>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
>>>     Read-only file system
>>>     [   44.509540] systemd-journald[416]: Failed to write entry (23 items,
>>>     705 bytes) despite vacuuming, ignoring: Bad message
>>>     # ... then a bunch of these failed journal attempts (of note:
>>>     /var/log/journal was one of the bad inodes from btrfs check
>>>     previously)
>>>
>>>     Kindly let me know what you would recommend. I'm sadly back to an
>>>     unusable system vs. a complaining/worrisome one. This is similar to
>>>     the behavior I had with the m2.sata nvme drive in my original
>>>     experience. After trying all of --repair, --init-csum-tree, and
>>>     --init-extent-tree, I couldn't boot anymore. After my dm-crypt
>>>     password at boot, I just saw a bunch of [FAILED] in the text splash
>>>     output. Hoping to not repeat that with this drive.
>>>
>>>     Thanks,
>>>     John
>>>
>>>
>>>     On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com
>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>     >
>>>     >
>>>     >
>>>     > On 2020/2/8 下午12:48, John Hendy wrote:
>>>     > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com
>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>     > >>
>>>     > >>
>>>     > >>
>>>     > >> On 2020/2/8 上午1:52, John Hendy wrote:
>>>     > >>> Greetings,
>>>     > >>>
>>>     > >>> I'm resending, as this isn't showing in the archives. Perhaps
>>>     it was
>>>     > >>> the attachments, which I've converted to pastebin links.
>>>     > >>>
>>>     > >>> As an update, I'm now running off of a different drive (ssd,
>>>     not the
>>>     > >>> nvme) and I got the error again! I'm now inclined to think
>>>     this might
>>>     > >>> not be hardware after all, but something related to my setup
>>>     or a bug
>>>     > >>> with chromium.
>>>     > >>>
>>>     > >>> After a reboot, chromium wouldn't start for me and demsg showed
>>>     > >>> similar parent transid/csum errors to my original post below.
>>>     I used
>>>     > >>> btrfs-inspect-internal to find the inode traced to
>>>     > >>> ~/.config/chromium/History. I deleted that, and got a new set of
>>>     > >>> errors tracing to ~/.config/chromium/Cookies. After I deleted
>>>     that and
>>>     > >>> tried starting chromium, I found that my btrfs /home/jwhendy
>>>     pool was
>>>     > >>> mounted ro just like the original problem below.
>>>     > >>>
>>>     > >>> dmesg after trying to start chromium:
>>>     > >>> - https://pastebin.com/CsCEQMJa
>>>     > >>
>>>     > >> So far, it's only transid bug in your csum tree.
>>>     > >>
>>>     > >> And two backref mismatch in data backref.
>>>     > >>
>>>     > >> In theory, you can fix your problem by `btrfs check --repair
>>>     > >> --init-csum-tree`.
>>>     > >>
>>>     > >
>>>     > > Now that I might be narrowing in on offending files, I'll wait
>>>     to see
>>>     > > what you think from my last response to Chris. I did try the above
>>>     > > when I first ran into this:
>>>     > > -
>>>     https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
>>>     >
>>>     > That RO is caused by the missing data backref.
>>>     >
>>>     > Which can be fixed by btrfs check --repair.
>>>     >
>>>     > Then you should be able to delete offending files them. (Or the whole
>>>     > chromium cache, and switch to firefox if you wish :P )
>>>     >
>>>     > But also please keep in mind that, the transid mismatch looks
>>>     happen in
>>>     > your csum tree, which means your csum tree is no longer reliable, and
>>>     > may cause -EIO reading unrelated files.
>>>     >
>>>     > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
>>>     >
>>>     > It can be done altogether by --repair --init-csum-tree, but to be
>>>     safe,
>>>     > please run --repair only first, then make sure btrfs check reports no
>>>     > error after that. Then go --init-csum-tree.
>>>     >
>>>     > >
>>>     > >> But I'm more interesting in how this happened.
>>>     > >
>>>     > > Me too :)
>>>     > >
>>>     > >> Have your every experienced any power loss for your NVME drive?
>>>     > >> I'm not say btrfs is unsafe against power loss, all fs should
>>>     be safe
>>>     > >> against power loss, I'm just curious about if mount time log
>>>     replay is
>>>     > >> involved, or just regular internal log replay.
>>>     > >>
>>>     > >> From your smartctl, the drive experienced 61 unsafe shutdown
>>>     with 2144
>>>     > >> power cycles.
>>>     > >
>>>     > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
>>>     > > caught off gaurd by low battery and instant power-off, I kick myself
>>>     > > and mean to set up a script to force poweroff before that
>>>     happens. So,
>>>     > > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
>>>     > > not over ~2 years. And actually, I mis-stated the age. I haven't
>>>     > > *booted* from this drive in almost 2yrs. It's a corporate laptop,
>>>     > > issued every 3, so the ssd drive is more like 5 years old.
>>>     > >
>>>     > >> Not sure if it's related.
>>>     > >>
>>>     > >> Another interesting point is, did you remember what's the
>>>     oldest kernel
>>>     > >> running on this fs? v5.4 or v5.5?
>>>     > >
>>>     > > Hard to say, but arch linux maintains a package archive. The nvme
>>>     > > drive is from ~May 2018. The archives only go back to Jan 2019
>>>     and the
>>>     > > kernel/btrfs-progs was at 4.20 then:
>>>     > > - https://archive.archlinux.org/packages/l/linux/
>>>     >
>>>     > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
>>>     > cause metadata corruption. And the symptom is transid error, which
>>>     also
>>>     > matches your problem.
>>>     >
>>>     > Thanks,
>>>     > Qu
>>>     >
>>>     > >
>>>     > > Searching my Amazon orders, the SSD was in the 2015 time frame,
>>>     so the
>>>     > > kernel version would have been even older.
>>>     > >
>>>     > > Thanks for your input,
>>>     > > John
>>>     > >
>>>     > >>
>>>     > >> Thanks,
>>>     > >> Qu
>>>     > >>>
>>>     > >>> Thanks for any pointers, as it would now seem that my purchase
>>>     of a
>>>     > >>> new m2.sata may not buy my way out of this problem! While I didn't
>>>     > >>> want to reinstall, at least new hardware is a simple fix. Now I'm
>>>     > >>> worried there is a deeper issue bound to recur :(
>>>     > >>>
>>>     > >>> Best regards,
>>>     > >>> John
>>>     > >>>
>>>     > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com
>>>     <mailto:jw.hendy@gmail.com>> wrote:
>>>     > >>>>
>>>     > >>>> Greetings,
>>>     > >>>>
>>>     > >>>> I've had this issue occur twice, once ~1mo ago and once a
>>>     couple of
>>>     > >>>> weeks ago. Chromium suddenly quit on me, and when trying to
>>>     start it
>>>     > >>>> again, it complained about a lock file in ~. I tried to delete it
>>>     > >>>> manually and was informed I was on a read-only fs! I ended up
>>>     biting
>>>     > >>>> the bullet and re-installing linux due to the number of dead end
>>>     > >>>> threads and slow response rates on diagnosing these issues,
>>>     and the
>>>     > >>>> issue occurred again shortly after.
>>>     > >>>>
>>>     > >>>> $ uname -a
>>>     > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020
>>>     16:38:40
>>>     > >>>> +0000 x86_64 GNU/Linux
>>>     > >>>>
>>>     > >>>> $ btrfs --version
>>>     > >>>> btrfs-progs v5.4
>>>     > >>>>
>>>     > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be
>>>     mounting a subvol on /
>>>     > >>>> Data, single: total=114.01GiB, used=80.88GiB
>>>     > >>>> System, single: total=32.00MiB, used=16.00KiB
>>>     > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
>>>     > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
>>>     > >>>>
>>>     > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
>>>     > >>>> nvme0n1                                       259:5    0
>>>     232.9G  0 disk
>>>     > >>>> ├─nvme0n1p1                                   259:6    0
>>>      512M  0
>>>     > >>>> part  (/boot/efi)
>>>     > >>>> ├─nvme0n1p2                                   259:7    0
>>>      1G  0 part  (/boot)
>>>     > >>>> └─nvme0n1p3                                   259:8    0
>>>     231.4G  0 part (btrfs)
>>>     > >>>>
>>>     > >>>> I have the following subvols:
>>>     > >>>> arch: used for / when booting arch
>>>     > >>>> jwhendy: used for /home/jwhendy on arch
>>>     > >>>> vault: shared data between distros on /mnt/vault
>>>     > >>>> bionic: root when booting ubuntu bionic
>>>     > >>>>
>>>     > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>>>     > >>>>
>>>     > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
>>>     > >>>
>>>     > >>> Edit: links now:
>>>     > >>> - btrfs check: https://pastebin.com/nz6Bc145
>>>     > >>> - dmesg: https://pastebin.com/1GGpNiqk
>>>     > >>> - smartctl: https://pastebin.com/ADtYqfrd
>>>     > >>>
>>>     > >>> btrfs dev stats (not worth a link):
>>>     > >>>
>>>     > >>> [/dev/mapper/old].write_io_errs    0
>>>     > >>> [/dev/mapper/old].read_io_errs     0
>>>     > >>> [/dev/mapper/old].flush_io_errs    0
>>>     > >>> [/dev/mapper/old].corruption_errs  0
>>>     > >>> [/dev/mapper/old].generation_errs  0
>>>     > >>>
>>>     > >>>
>>>     > >>>> If these are of interested, here are reddit threads where I
>>>     posted the
>>>     > >>>> issue and was referred here.
>>>     > >>>> 1)
>>>     https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
>>>     > >>>> 2)
>>>     https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>>>     > >>>>
>>>     > >>>> It has been suggested this is a hardware issue. I've already
>>>     ordered a
>>>     > >>>> replacement m2.sata, but for sanity it would be great to know
>>>     > >>>> definitively this was the case. If anything stands out above that
>>>     > >>>> could indicate I'm not setup properly re. btrfs, that would
>>>     also be
>>>     > >>>> fantastic so I don't repeat the issue!
>>>     > >>>>
>>>     > >>>> The only thing I've stumbled on is that I have been mounting with
>>>     > >>>> rd.luks.options=discard and that manually running fstrim is
>>>     preferred.
>>>     > >>>>
>>>     > >>>>
>>>     > >>>> Many thanks for any input/suggestions,
>>>     > >>>> John
>>>     > >>
>>>     >
>>>
>>



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-02-09  0:59                 ` John Hendy
@ 2020-02-09  1:09                   ` Qu Wenruo
  2020-02-09  1:20                     ` John Hendy
  0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2020-02-09  1:09 UTC (permalink / raw)
  To: John Hendy; +Cc: Btrfs BTRFS





On 2020/2/9 8:59 AM, John Hendy wrote:
> Also, if it's of interest, the zero-log trick was new to me. For my
> original m2.sata nvme drive, I'd already run all of --init-csum-tree,
> --init-extent-tree, and --repair (unsure on the order of the first
> two, but --repair was definitely last) but could then not mount it. I
> just ran `btrfs rescue zero-log` on it and here is the very brief
> output from a btrfs check:
> 
> $ sudo btrfs check /dev/mapper/nvme
> Opening filesystem to check...
> Checking filesystem on /dev/mapper/nvme
> UUID: 488f733d-1dfd-4a0f-ab2f-ba690e095fe4
> [1/7] checking root items
> [2/7] checking extents
> data backref 40762777600 root 256 owner 525787 offset 0 num_refs 0 not
> found in extent tree
> incorrect local backref count on 40762777600 root 256 owner 525787
> offset 0 found 1 wanted 0 back 0x5635831f9a20
> incorrect local backref count on 40762777600 root 4352 owner 525787
> offset 0 found 0 wanted 1 back 0x56357e5a3c70
> backref disk bytenr does not match extent record, bytenr=40762777600,
> ref bytenr=0
> backpointer mismatch on [40762777600 4096]

At this stage, btrfs check --repair should be able to fix it.

Or does it still segfault?
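
If it does, a backtrace from the crash would help. Assuming
systemd-coredump is catching core dumps (the Arch default), something
like this should show where it died:

$ coredumpctl gdb btrfs
(gdb) bt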

Thanks,
Qu
> ERROR: errors found in extent allocation tree or chunk allocation
> [3/7] checking free space cache
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 87799443456 bytes used, error(s) found
> total csum bytes: 84696784
> total tree bytes: 954220544
> total fs tree bytes: 806535168
> total extent tree bytes: 47710208
> btree space waste bytes: 150766636
> file data blocks allocated: 87780622336
>  referenced 94255783936
> 
> If that looks promising... I'm hoping that the ssd we're currently
> working on will follow suit! I'll await your recommendation for what
> to do on the previous inquiries for the SSD, and if you have any
> suggestions for the backref errors on the nvme drive above.
> 
> Many thanks,
> John
> 
> On Sat, Feb 8, 2020 at 6:51 PM John Hendy <jw.hendy@gmail.com> wrote:
>>
>> On Sat, Feb 8, 2020 at 5:56 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>
>>>
>>>
>>> On 2020/2/9 上午5:57, John Hendy wrote:
>>>> On phone due to no OS, so apologies if this is in html mode. Indeed, I
>>>> can't mount or boot any longer. I get the error:
>>>>
>>>> Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed
>>>> to recover log tree)
>>>> BTRFS error (device dm-0): open_ctree failed
>>>
>>> That can be easily fixed by `btrfs rescue zero-log`.
>>>
>>
>> Whew. This was most helpful and it is wonderful to be booting at
>> least. I think the outstanding issues are:
>> - what should I do about `btrfs check --repair seg` faulting?
>> - how can I deal with this (probably related to seg fault) ghost file
>> that cannot be deleted?
>> - I'm not sure if you looked at the post --repair log, but there a ton
>> of these errors that didn't used to be there:
>>
>> backpointer mismatch on [13037375488 20480]
>> ref mismatch on [13037395968 892928] extent item 0, found 1
>> data backref 13037395968 root 263 owner 4257169 offset 0 num_refs 0
>> not found in extent tree
>> incorrect local backref count on 13037395968 root 263 owner 4257169
>> offset 0 found 1 wanted 0 back 0x5627f59cadc0
>>
>> Here is the latest btrfs check output after the zero-log operation.
>> - https://pastebin.com/KWeUnk0y
>>
>> I'm hoping once that file is deleted, it's a matter of
>> --init-csum-tree and perhaps I'm set? Or --init-extent-tree?
>>
>> Thanks,
>> John
>>
>>> At least, btrfs check --repair didn't make things worse.
>>>
>>> Thanks,
>>> Qu
>>>>
>>>> John
>>>>
>>>> On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@gmail.com
>>>> <mailto:jw.hendy@gmail.com>> wrote:
>>>>
>>>>     This is not going so hot. Updates:
>>>>
>>>>     booted from arch install, pre repair btrfs check:
>>>>     - https://pastebin.com/6vNaSdf2
>>>>
>>>>     btrfs check --mode=lowmem as requested by Chris:
>>>>     - https://pastebin.com/uSwSTVVY
>>>>
>>>>     Then I did btrfs check --repair, which seg faulted at the end. I've
>>>>     typed them off of pictures I took:
>>>>
>>>>     Starting repair.
>>>>     Opening filesystem to check...
>>>>     Checking filesystem on /dev/mapper/ssd
>>>>     [1/7] checking root items
>>>>     Fixed 0 roots.
>>>>     [2/7] checking extents
>>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
>>>>     448074
>>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
>>>>     448074
>>>>     Ignoring transid failure
>>>>     # ... repeated the previous two lines maybe hundreds of times
>>>>     # ended with this:
>>>>     ref mismatch on [12797435904 268505088] extent item 1, found 412
>>>>     [1] 1814 segmentation fault (core dumped) btrfs check --repair
>>>>     /dev/mapper/ssd
>>>>
>>>>     This was with btrfs-progs 5.4 (the install USB is maybe a month old).
>>>>
>>>>     Here is the output of btrfs check after the --repair attempt:
>>>>     - https://pastebin.com/6MYRNdga
>>>>
>>>>     I rebooted to write this email given the seg fault, as I wanted to
>>>>     make sure that I should still follow-up --repair with
>>>>     --init-csum-tree. I had pictures of the --repair output, but Firefox
>>>>     just wouldn't load imgur.com <http://imgur.com> for me to post the
>>>>     pics and was acting
>>>>     really weird. In suspiciously checking dmesg, things have gone ro on
>>>>     me :(  Here is the dmesg from this session:
>>>>     - https://pastebin.com/a2z7xczy
>>>>
>>>>     The gist is:
>>>>
>>>>     [   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
>>>>     block=172703744 slot=0, csum end range (12980568064) goes beyond the
>>>>     start range (12980297728) of the next csum item
>>>>     [   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
>>>>     total ptrs 34 free space 29 owner 7
>>>>     [   40.997942]     item 0 key (18446744073709551606 128 12979060736)
>>>>     itemoff 14811 itemsize 1472
>>>>     [   40.997944]     item 1 key (18446744073709551606 128 12980297728)
>>>>     itemoff 13895 itemsize 916
>>>>     [   40.997945]     item 2 key (18446744073709551606 128 12981235712)
>>>>     itemoff 13811 itemsize 84
>>>>     # ... there's maybe 30 of these item n key lines in total
>>>>     [   40.997984] BTRFS error (device dm-0): block=172703744 write time
>>>>     tree block corruption detected
>>>>     [   41.016793] BTRFS: error (device dm-0) in
>>>>     btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
>>>>     writing out transaction)
>>>>     [   41.016799] BTRFS info (device dm-0): forced readonly
>>>>     [   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
>>>>     transaction.
>>>>     [   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
>>>>     errno=-5 IO failure
>>>>     [   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
>>>>     [   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
>>>>     transaction.
>>>>     [   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
>>>>     [   44.509418] systemd-journald[416]:
>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
>>>>     Journal file corrupted, rotating.
>>>>     [   44.509440] systemd-journald[416]: Failed to rotate
>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
>>>>     Read-only file system
>>>>     [   44.509450] systemd-journald[416]: Failed to rotate
>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
>>>>     Read-only file system
>>>>     [   44.509540] systemd-journald[416]: Failed to write entry (23 items,
>>>>     705 bytes) despite vacuuming, ignoring: Bad message
>>>>     # ... then a bunch of these failed journal attempts (of note:
>>>>     /var/log/journal was one of the bad inodes from btrfs check
>>>>     previously)
>>>>
>>>>     Kindly let me know what you would recommend. I'm sadly back to an
>>>>     unusable system vs. a complaining/worrisome one. This is similar to
>>>>     the behavior I had with the m2.sata nvme drive in my original
>>>>     experience. After trying all of --repair, --init-csum-tree, and
>>>>     --init-extent-tree, I couldn't boot anymore. After my dm-crypt
>>>>     password at boot, I just saw a bunch of [FAILED] in the text splash
>>>>     output. Hoping to not repeat that with this drive.
>>>>
>>>>     Thanks,
>>>>     John
>>>>
>>>>
>>>>     On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com
>>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>>     >
>>>>     >
>>>>     >
>>>>     > On 2020/2/8 下午12:48, John Hendy wrote:
>>>>     > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com
>>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>>     > >>
>>>>     > >>
>>>>     > >>
>>>>     > >> On 2020/2/8 上午1:52, John Hendy wrote:
>>>>     > >>> Greetings,
>>>>     > >>>
>>>>     > >>> I'm resending, as this isn't showing in the archives. Perhaps
>>>>     it was
>>>>     > >>> the attachments, which I've converted to pastebin links.
>>>>     > >>>
>>>>     > >>> As an update, I'm now running off of a different drive (ssd,
>>>>     not the
>>>>     > >>> nvme) and I got the error again! I'm now inclined to think
>>>>     this might
>>>>     > >>> not be hardware after all, but something related to my setup
>>>>     or a bug
>>>>     > >>> with chromium.
>>>>     > >>>
>>>>     > >>> After a reboot, chromium wouldn't start for me and demsg showed
>>>>     > >>> similar parent transid/csum errors to my original post below.
>>>>     I used
>>>>     > >>> btrfs-inspect-internal to find the inode traced to
>>>>     > >>> ~/.config/chromium/History. I deleted that, and got a new set of
>>>>     > >>> errors tracing to ~/.config/chromium/Cookies. After I deleted
>>>>     that and
>>>>     > >>> tried starting chromium, I found that my btrfs /home/jwhendy
>>>>     pool was
>>>>     > >>> mounted ro just like the original problem below.
>>>>     > >>>
>>>>     > >>> dmesg after trying to start chromium:
>>>>     > >>> - https://pastebin.com/CsCEQMJa
>>>>     > >>
>>>>     > >> So far, it's only transid bug in your csum tree.
>>>>     > >>
>>>>     > >> And two backref mismatch in data backref.
>>>>     > >>
>>>>     > >> In theory, you can fix your problem by `btrfs check --repair
>>>>     > >> --init-csum-tree`.
>>>>     > >>
>>>>     > >
>>>>     > > Now that I might be narrowing in on offending files, I'll wait
>>>>     to see
>>>>     > > what you think from my last response to Chris. I did try the above
>>>>     > > when I first ran into this:
>>>>     > > -
>>>>     https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
>>>>     >
>>>>     > That RO is caused by the missing data backref.
>>>>     >
>>>>     > Which can be fixed by btrfs check --repair.
>>>>     >
>>>>     > Then you should be able to delete offending files them. (Or the whole
>>>>     > chromium cache, and switch to firefox if you wish :P )
>>>>     >
>>>>     > But also please keep in mind that, the transid mismatch looks
>>>>     happen in
>>>>     > your csum tree, which means your csum tree is no longer reliable, and
>>>>     > may cause -EIO reading unrelated files.
>>>>     >
>>>>     > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
>>>>     >
>>>>     > It can be done altogether by --repair --init-csum-tree, but to be
>>>>     safe,
>>>>     > please run --repair only first, then make sure btrfs check reports no
>>>>     > error after that. Then go --init-csum-tree.
>>>>     >
>>>>     > >
>>>>     > >> But I'm more interesting in how this happened.
>>>>     > >
>>>>     > > Me too :)
>>>>     > >
>>>>     > >> Have your every experienced any power loss for your NVME drive?
>>>>     > >> I'm not say btrfs is unsafe against power loss, all fs should
>>>>     be safe
>>>>     > >> against power loss, I'm just curious about if mount time log
>>>>     replay is
>>>>     > >> involved, or just regular internal log replay.
>>>>     > >>
>>>>     > >> From your smartctl, the drive experienced 61 unsafe shutdown
>>>>     with 2144
>>>>     > >> power cycles.
>>>>     > >
>>>>     > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
>>>>     > > caught off gaurd by low battery and instant power-off, I kick myself
>>>>     > > and mean to set up a script to force poweroff before that
>>>>     happens. So,
>>>>     > > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
>>>>     > > not over ~2 years. And actually, I mis-stated the age. I haven't
>>>>     > > *booted* from this drive in almost 2yrs. It's a corporate laptop,
>>>>     > > issued every 3, so the ssd drive is more like 5 years old.
>>>>     > >
>>>>     > >> Not sure if it's related.
>>>>     > >>
>>>>     > >> Another interesting point is, did you remember what's the
>>>>     oldest kernel
>>>>     > >> running on this fs? v5.4 or v5.5?
>>>>     > >
>>>>     > > Hard to say, but arch linux maintains a package archive. The nvme
>>>>     > > drive is from ~May 2018. The archives only go back to Jan 2019
>>>>     and the
>>>>     > > kernel/btrfs-progs was at 4.20 then:
>>>>     > > - https://archive.archlinux.org/packages/l/linux/
>>>>     >
>>>>     > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
>>>>     > cause metadata corruption. And the symptom is transid error, which
>>>>     also
>>>>     > matches your problem.
>>>>     >
>>>>     > Thanks,
>>>>     > Qu
>>>>     >
>>>>     > >
>>>>     > > Searching my Amazon orders, the SSD was in the 2015 time frame,
>>>>     so the
>>>>     > > kernel version would have been even older.
>>>>     > >
>>>>     > > Thanks for your input,
>>>>     > > John
>>>>     > >
>>>>     > >>
>>>>     > >> Thanks,
>>>>     > >> Qu
>>>>     > >>>
>>>>     > >>> Thanks for any pointers, as it would now seem that my purchase
>>>>     of a
>>>>     > >>> new m2.sata may not buy my way out of this problem! While I didn't
>>>>     > >>> want to reinstall, at least new hardware is a simple fix. Now I'm
>>>>     > >>> worried there is a deeper issue bound to recur :(
>>>>     > >>>
>>>>     > >>> Best regards,
>>>>     > >>> John
>>>>     > >>>
>>>>     > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com
>>>>     <mailto:jw.hendy@gmail.com>> wrote:
>>>>     > >>>>
>>>>     > >>>> Greetings,
>>>>     > >>>>
>>>>     > >>>> I've had this issue occur twice, once ~1mo ago and once a
>>>>     couple of
>>>>     > >>>> weeks ago. Chromium suddenly quit on me, and when trying to
>>>>     start it
>>>>     > >>>> again, it complained about a lock file in ~. I tried to delete it
>>>>     > >>>> manually and was informed I was on a read-only fs! I ended up
>>>>     biting
>>>>     > >>>> the bullet and re-installing linux due to the number of dead end
>>>>     > >>>> threads and slow response rates on diagnosing these issues,
>>>>     and the
>>>>     > >>>> issue occurred again shortly after.
>>>>     > >>>>
>>>>     > >>>> $ uname -a
>>>>     > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020
>>>>     16:38:40
>>>>     > >>>> +0000 x86_64 GNU/Linux
>>>>     > >>>>
>>>>     > >>>> $ btrfs --version
>>>>     > >>>> btrfs-progs v5.4
>>>>     > >>>>
>>>>     > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be
>>>>     mounting a subvol on /
>>>>     > >>>> Data, single: total=114.01GiB, used=80.88GiB
>>>>     > >>>> System, single: total=32.00MiB, used=16.00KiB
>>>>     > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
>>>>     > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
>>>>     > >>>>
>>>>     > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
>>>>     > >>>> nvme0n1                                       259:5    0
>>>>     232.9G  0 disk
>>>>     > >>>> ├─nvme0n1p1                                   259:6    0
>>>>      512M  0
>>>>     > >>>> part  (/boot/efi)
>>>>     > >>>> ├─nvme0n1p2                                   259:7    0
>>>>      1G  0 part  (/boot)
>>>>     > >>>> └─nvme0n1p3                                   259:8    0
>>>>     231.4G  0 part (btrfs)
>>>>     > >>>>
>>>>     > >>>> I have the following subvols:
>>>>     > >>>> arch: used for / when booting arch
>>>>     > >>>> jwhendy: used for /home/jwhendy on arch
>>>>     > >>>> vault: shared data between distros on /mnt/vault
>>>>     > >>>> bionic: root when booting ubuntu bionic
>>>>     > >>>>
>>>>     > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>>>>     > >>>>
>>>>     > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
>>>>     > >>>
>>>>     > >>> Edit: links now:
>>>>     > >>> - btrfs check: https://pastebin.com/nz6Bc145
>>>>     > >>> - dmesg: https://pastebin.com/1GGpNiqk
>>>>     > >>> - smartctl: https://pastebin.com/ADtYqfrd
>>>>     > >>>
>>>>     > >>> btrfs dev stats (not worth a link):
>>>>     > >>>
>>>>     > >>> [/dev/mapper/old].write_io_errs    0
>>>>     > >>> [/dev/mapper/old].read_io_errs     0
>>>>     > >>> [/dev/mapper/old].flush_io_errs    0
>>>>     > >>> [/dev/mapper/old].corruption_errs  0
>>>>     > >>> [/dev/mapper/old].generation_errs  0
>>>>     > >>>
>>>>     > >>>
>>>>     > >>>> If these are of interested, here are reddit threads where I
>>>>     posted the
>>>>     > >>>> issue and was referred here.
>>>>     > >>>> 1)
>>>>     https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
>>>>     > >>>> 2)
>>>>     https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>>>>     > >>>>
>>>>     > >>>> It has been suggested this is a hardware issue. I've already
>>>>     ordered a
>>>>     > >>>> replacement m2.sata, but for sanity it would be great to know
>>>>     > >>>> definitively this was the case. If anything stands out above that
>>>>     > >>>> could indicate I'm not setup properly re. btrfs, that would
>>>>     also be
>>>>     > >>>> fantastic so I don't repeat the issue!
>>>>     > >>>>
>>>>     > >>>> The only thing I've stumbled on is that I have been mounting with
>>>>     > >>>> rd.luks.options=discard and that manually running fstrim is
>>>>     preferred.
>>>>     > >>>>
>>>>     > >>>>
>>>>     > >>>> Many thanks for any input/suggestions,
>>>>     > >>>> John
>>>>     > >>
>>>>     >
>>>>
>>>



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-02-09  1:09                   ` Qu Wenruo
@ 2020-02-09  1:20                     ` John Hendy
  2020-02-09  1:24                       ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: John Hendy @ 2020-02-09  1:20 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Sat, Feb 8, 2020 at 7:09 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/2/9 上午8:59, John Hendy wrote:
> > Also, if it's of interest, the zero-log trick was new to me. For my
> > original m2.sata nvme drive, I'd already run all of --init-csum-tree,
> > --init-extent-tree, and --repair (unsure on the order of the first
> > two, but --repair was definitely last) but could then not mount it. I
> > just ran `btrfs rescue zero-log` on it and here is the very brief
> > output from a btrfs check:
> >
> > $ sudo btrfs check /dev/mapper/nvme
> > Opening filesystem to check...
> > Checking filesystem on /dev/mapper/nvme
> > UUID: 488f733d-1dfd-4a0f-ab2f-ba690e095fe4
> > [1/7] checking root items
> > [2/7] checking extents
> > data backref 40762777600 root 256 owner 525787 offset 0 num_refs 0 not
> > found in extent tree
> > incorrect local backref count on 40762777600 root 256 owner 525787
> > offset 0 found 1 wanted 0 back 0x5635831f9a20
> > incorrect local backref count on 40762777600 root 4352 owner 525787
> > offset 0 found 0 wanted 1 back 0x56357e5a3c70
> > backref disk bytenr does not match extent record, bytenr=40762777600,
> > ref bytenr=0
> > backpointer mismatch on [40762777600 4096]
>
> At this stage, btrfs check --repair should be able to fix it.
>
> Or does it still segfault?

This was the original problematic drive, the m2.sata. I just did
`btrfs check --repair` and it completed with:

$ sudo btrfs check --repair /dev/mapper/nvme
enabling repair mode
WARNING:

    Do not use --repair unless you are advised to do so by a developer
    or an experienced user, and then only after having accepted that no
    fsck can successfully repair all types of filesystem corruption. Eg.
    some software or hardware bugs can fatally damage a volume.
    The operation will start in 10 seconds.
    Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/mapper/nvme
UUID: 488f733d-1dfd-4a0f-ab2f-ba690e095fe4
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
data backref 40762777600 root 256 owner 525787 offset 0 num_refs 0 not
found in extent tree
incorrect local backref count on 40762777600 root 256 owner 525787
offset 0 found 1 wanted 0 back 0x5561d1f74ee0
incorrect local backref count on 40762777600 root 4352 owner 525787
offset 0 found 0 wanted 1 back 0x5561cd31f220
backref disk bytenr does not match extent record, bytenr=40762777600,
ref bytenr=0
backpointer mismatch on [40762777600 4096]
repair deleting extent record: key [40762777600,168,4096]
adding new data backref on 40762777600 root 256 owner 525787 offset 0 found 1
Repaired extent references for 40762777600
No device size related problem found
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 87799443456 bytes used, no error found
total csum bytes: 84696784
total tree bytes: 954220544
total fs tree bytes: 806535168
total extent tree bytes: 47710208
btree space waste bytes: 150766636
file data blocks allocated: 87780622336
 referenced 94255783936

Here is the output of btrfs check on this drive now:

$ sudo btrfs check /dev/mapper/nvme
Opening filesystem to check...
Checking filesystem on /dev/mapper/nvme
UUID: 488f733d-1dfd-4a0f-ab2f-ba690e095fe4
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 87799443456 bytes used, no error found
total csum bytes: 84696784
total tree bytes: 954220544
total fs tree bytes: 806535168
total extent tree bytes: 47710208
btree space waste bytes: 150766636
file data blocks allocated: 87780622336
 referenced 94255783936

How is that looking? I'll boot back into a USB drive to try --repair
--mode=lowmem on the SSD. My continued worry is the spurious file I
can't delete. Is that something `btrfs check --repair` will try to fix, or is
there something else that needs to be done? It seems this inode is
tripping things up and I can't find a way to get rid of that file.
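
For reference, the inode-to-path lookup I've been doing is along these
lines (the inode number is just an example taken from the check output,
and the path is where the subvolume is mounted):

$ sudo btrfs inspect-internal inode-resolve 4257169 /home/jwhendy

That maps the inode back to a file name, but actually deleting the
resulting file is the part that keeps failing.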

John


>
> Thanks,
> Qu
> > ERROR: errors found in extent allocation tree or chunk allocation
> > [3/7] checking free space cache
> > [4/7] checking fs roots
> > [5/7] checking only csums items (without verifying data)
> > [6/7] checking root refs
> > [7/7] checking quota groups skipped (not enabled on this FS)
> > found 87799443456 bytes used, error(s) found
> > total csum bytes: 84696784
> > total tree bytes: 954220544
> > total fs tree bytes: 806535168
> > total extent tree bytes: 47710208
> > btree space waste bytes: 150766636
> > file data blocks allocated: 87780622336
> >  referenced 94255783936
> >
> > If that looks promising... I'm hoping that the ssd we're currently
> > working on will follow suit! I'll await your recommendation for what
> > to do on the previous inquiries for the SSD, and if you have any
> > suggestions for the backref errors on the nvme drive above.
> >
> > Many thanks,
> > John
> >
> > On Sat, Feb 8, 2020 at 6:51 PM John Hendy <jw.hendy@gmail.com> wrote:
> >>
> >> On Sat, Feb 8, 2020 at 5:56 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2020/2/9 上午5:57, John Hendy wrote:
> >>>> On phone due to no OS, so apologies if this is in html mode. Indeed, I
> >>>> can't mount or boot any longer. I get the error:
> >>>>
> >>>> Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed
> >>>> to recover log tree)
> >>>> BTRFS error (device dm-0): open_ctree failed
> >>>
> >>> That can be easily fixed by `btrfs rescue zero-log`.
> >>>
> >>
> >> Whew. This was most helpful and it is wonderful to be booting at
> >> least. I think the outstanding issues are:
> >> - what should I do about `btrfs check --repair seg` faulting?
> >> - how can I deal with this (probably related to seg fault) ghost file
> >> that cannot be deleted?
> >> - I'm not sure if you looked at the post --repair log, but there a ton
> >> of these errors that didn't used to be there:
> >>
> >> backpointer mismatch on [13037375488 20480]
> >> ref mismatch on [13037395968 892928] extent item 0, found 1
> >> data backref 13037395968 root 263 owner 4257169 offset 0 num_refs 0
> >> not found in extent tree
> >> incorrect local backref count on 13037395968 root 263 owner 4257169
> >> offset 0 found 1 wanted 0 back 0x5627f59cadc0
> >>
> >> Here is the latest btrfs check output after the zero-log operation.
> >> - https://pastebin.com/KWeUnk0y
> >>
> >> I'm hoping once that file is deleted, it's a matter of
> >> --init-csum-tree and perhaps I'm set? Or --init-extent-tree?
> >>
> >> Thanks,
> >> John
> >>
> >>> At least, btrfs check --repair didn't make things worse.
> >>>
> >>> Thanks,
> >>> Qu
> >>>>
> >>>> John
> >>>>
> >>>> On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@gmail.com
> >>>> <mailto:jw.hendy@gmail.com>> wrote:
> >>>>
> >>>>     This is not going so hot. Updates:
> >>>>
> >>>>     booted from arch install, pre repair btrfs check:
> >>>>     - https://pastebin.com/6vNaSdf2
> >>>>
> >>>>     btrfs check --mode=lowmem as requested by Chris:
> >>>>     - https://pastebin.com/uSwSTVVY
> >>>>
> >>>>     Then I did btrfs check --repair, which seg faulted at the end. I've
> >>>>     typed them off of pictures I took:
> >>>>
> >>>>     Starting repair.
> >>>>     Opening filesystem to check...
> >>>>     Checking filesystem on /dev/mapper/ssd
> >>>>     [1/7] checking root items
> >>>>     Fixed 0 roots.
> >>>>     [2/7] checking extents
> >>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
> >>>>     448074
> >>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
> >>>>     448074
> >>>>     Ignoring transid failure
> >>>>     # ... repeated the previous two lines maybe hundreds of times
> >>>>     # ended with this:
> >>>>     ref mismatch on [12797435904 268505088] extent item 1, found 412
> >>>>     [1] 1814 segmentation fault (core dumped) btrfs check --repair
> >>>>     /dev/mapper/ssd
> >>>>
> >>>>     This was with btrfs-progs 5.4 (the install USB is maybe a month old).
> >>>>
> >>>>     Here is the output of btrfs check after the --repair attempt:
> >>>>     - https://pastebin.com/6MYRNdga
> >>>>
> >>>>     I rebooted to write this email given the seg fault, as I wanted to
> >>>>     make sure that I should still follow-up --repair with
> >>>>     --init-csum-tree. I had pictures of the --repair output, but Firefox
> >>>>     just wouldn't load imgur.com <http://imgur.com> for me to post the
> >>>>     pics and was acting
> >>>>     really weird. In suspiciously checking dmesg, things have gone ro on
> >>>>     me :(  Here is the dmesg from this session:
> >>>>     - https://pastebin.com/a2z7xczy
> >>>>
> >>>>     The gist is:
> >>>>
> >>>>     [   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
> >>>>     block=172703744 slot=0, csum end range (12980568064) goes beyond the
> >>>>     start range (12980297728) of the next csum item
> >>>>     [   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
> >>>>     total ptrs 34 free space 29 owner 7
> >>>>     [   40.997942]     item 0 key (18446744073709551606 128 12979060736)
> >>>>     itemoff 14811 itemsize 1472
> >>>>     [   40.997944]     item 1 key (18446744073709551606 128 12980297728)
> >>>>     itemoff 13895 itemsize 916
> >>>>     [   40.997945]     item 2 key (18446744073709551606 128 12981235712)
> >>>>     itemoff 13811 itemsize 84
> >>>>     # ... there's maybe 30 of these item n key lines in total
> >>>>     [   40.997984] BTRFS error (device dm-0): block=172703744 write time
> >>>>     tree block corruption detected
> >>>>     [   41.016793] BTRFS: error (device dm-0) in
> >>>>     btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
> >>>>     writing out transaction)
> >>>>     [   41.016799] BTRFS info (device dm-0): forced readonly
> >>>>     [   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
> >>>>     transaction.
> >>>>     [   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
> >>>>     errno=-5 IO failure
> >>>>     [   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
> >>>>     [   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
> >>>>     transaction.
> >>>>     [   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
> >>>>     [   44.509418] systemd-journald[416]:
> >>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
> >>>>     Journal file corrupted, rotating.
> >>>>     [   44.509440] systemd-journald[416]: Failed to rotate
> >>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
> >>>>     Read-only file system
> >>>>     [   44.509450] systemd-journald[416]: Failed to rotate
> >>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
> >>>>     Read-only file system
> >>>>     [   44.509540] systemd-journald[416]: Failed to write entry (23 items,
> >>>>     705 bytes) despite vacuuming, ignoring: Bad message
> >>>>     # ... then a bunch of these failed journal attempts (of note:
> >>>>     /var/log/journal was one of the bad inodes from btrfs check
> >>>>     previously)
> >>>>
> >>>>     Kindly let me know what you would recommend. I'm sadly back to an
> >>>>     unusable system vs. a complaining/worrisome one. This is similar to
> >>>>     the behavior I had with the m2.sata nvme drive in my original
> >>>>     experience. After trying all of --repair, --init-csum-tree, and
> >>>>     --init-extent-tree, I couldn't boot anymore. After my dm-crypt
> >>>>     password at boot, I just saw a bunch of [FAILED] in the text splash
> >>>>     output. Hoping to not repeat that with this drive.
> >>>>
> >>>>     Thanks,
> >>>>     John
> >>>>
> >>>>
> >>>>     On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com
> >>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
> >>>>     >
> >>>>     >
> >>>>     >
> >>>>     > On 2020/2/8 下午12:48, John Hendy wrote:
> >>>>     > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com
> >>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
> >>>>     > >>
> >>>>     > >>
> >>>>     > >>
> >>>>     > >> On 2020/2/8 上午1:52, John Hendy wrote:
> >>>>     > >>> Greetings,
> >>>>     > >>>
> >>>>     > >>> I'm resending, as this isn't showing in the archives. Perhaps
> >>>>     it was
> >>>>     > >>> the attachments, which I've converted to pastebin links.
> >>>>     > >>>
> >>>>     > >>> As an update, I'm now running off of a different drive (ssd,
> >>>>     not the
> >>>>     > >>> nvme) and I got the error again! I'm now inclined to think
> >>>>     this might
> >>>>     > >>> not be hardware after all, but something related to my setup
> >>>>     or a bug
> >>>>     > >>> with chromium.
> >>>>     > >>>
> >>>>     > >>> After a reboot, chromium wouldn't start for me and demsg showed
> >>>>     > >>> similar parent transid/csum errors to my original post below.
> >>>>     I used
> >>>>     > >>> btrfs-inspect-internal to find the inode traced to
> >>>>     > >>> ~/.config/chromium/History. I deleted that, and got a new set of
> >>>>     > >>> errors tracing to ~/.config/chromium/Cookies. After I deleted
> >>>>     that and
> >>>>     > >>> tried starting chromium, I found that my btrfs /home/jwhendy
> >>>>     pool was
> >>>>     > >>> mounted ro just like the original problem below.
> >>>>     > >>>
> >>>>     > >>> dmesg after trying to start chromium:
> >>>>     > >>> - https://pastebin.com/CsCEQMJa
> >>>>     > >>
> >>>>     > >> So far, it's only transid bug in your csum tree.
> >>>>     > >>
> >>>>     > >> And two backref mismatch in data backref.
> >>>>     > >>
> >>>>     > >> In theory, you can fix your problem by `btrfs check --repair
> >>>>     > >> --init-csum-tree`.
> >>>>     > >>
> >>>>     > >
> >>>>     > > Now that I might be narrowing in on offending files, I'll wait
> >>>>     to see
> >>>>     > > what you think from my last response to Chris. I did try the above
> >>>>     > > when I first ran into this:
> >>>>     > > -
> >>>>     https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
> >>>>     >
> >>>>     > That RO is caused by the missing data backref.
> >>>>     >
> >>>>     > Which can be fixed by btrfs check --repair.
> >>>>     >
> >>>>     > Then you should be able to delete offending files them. (Or the whole
> >>>>     > chromium cache, and switch to firefox if you wish :P )
> >>>>     >
> >>>>     > But also please keep in mind that, the transid mismatch looks
> >>>>     happen in
> >>>>     > your csum tree, which means your csum tree is no longer reliable, and
> >>>>     > may cause -EIO reading unrelated files.
> >>>>     >
> >>>>     > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
> >>>>     >
> >>>>     > It can be done altogether by --repair --init-csum-tree, but to be
> >>>>     safe,
> >>>>     > please run --repair only first, then make sure btrfs check reports no
> >>>>     > error after that. Then go --init-csum-tree.
> >>>>     >
> >>>>     > >
> >>>>     > >> But I'm more interesting in how this happened.
> >>>>     > >
> >>>>     > > Me too :)
> >>>>     > >
> >>>>     > >> Have your every experienced any power loss for your NVME drive?
> >>>>     > >> I'm not say btrfs is unsafe against power loss, all fs should
> >>>>     be safe
> >>>>     > >> against power loss, I'm just curious about if mount time log
> >>>>     replay is
> >>>>     > >> involved, or just regular internal log replay.
> >>>>     > >>
> >>>>     > >> From your smartctl, the drive experienced 61 unsafe shutdown
> >>>>     with 2144
> >>>>     > >> power cycles.
> >>>>     > >
> >>>>     > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
> >>>>     > > caught off gaurd by low battery and instant power-off, I kick myself
> >>>>     > > and mean to set up a script to force poweroff before that
> >>>>     happens. So,
> >>>>     > > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
> >>>>     > > not over ~2 years. And actually, I mis-stated the age. I haven't
> >>>>     > > *booted* from this drive in almost 2yrs. It's a corporate laptop,
> >>>>     > > issued every 3, so the ssd drive is more like 5 years old.
> >>>>     > >
> >>>>     > >> Not sure if it's related.
> >>>>     > >>
> >>>>     > >> Another interesting point is, did you remember what's the
> >>>>     oldest kernel
> >>>>     > >> running on this fs? v5.4 or v5.5?
> >>>>     > >
> >>>>     > > Hard to say, but arch linux maintains a package archive. The nvme
> >>>>     > > drive is from ~May 2018. The archives only go back to Jan 2019
> >>>>     and the
> >>>>     > > kernel/btrfs-progs was at 4.20 then:
> >>>>     > > - https://archive.archlinux.org/packages/l/linux/
> >>>>     >
> >>>>     > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
> >>>>     > cause metadata corruption. And the symptom is transid error, which
> >>>>     also
> >>>>     > matches your problem.
> >>>>     >
> >>>>     > Thanks,
> >>>>     > Qu
> >>>>     >
> >>>>     > >
> >>>>     > > Searching my Amazon orders, the SSD was in the 2015 time frame,
> >>>>     so the
> >>>>     > > kernel version would have been even older.
> >>>>     > >
> >>>>     > > Thanks for your input,
> >>>>     > > John
> >>>>     > >
> >>>>     > >>
> >>>>     > >> Thanks,
> >>>>     > >> Qu
> >>>>     > >>>
> >>>>     > >>> Thanks for any pointers, as it would now seem that my purchase
> >>>>     of a
> >>>>     > >>> new m2.sata may not buy my way out of this problem! While I didn't
> >>>>     > >>> want to reinstall, at least new hardware is a simple fix. Now I'm
> >>>>     > >>> worried there is a deeper issue bound to recur :(
> >>>>     > >>>
> >>>>     > >>> Best regards,
> >>>>     > >>> John
> >>>>     > >>>
> >>>>     > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com
> >>>>     <mailto:jw.hendy@gmail.com>> wrote:
> >>>>     > >>>>
> >>>>     > >>>> Greetings,
> >>>>     > >>>>
> >>>>     > >>>> I've had this issue occur twice, once ~1mo ago and once a
> >>>>     couple of
> >>>>     > >>>> weeks ago. Chromium suddenly quit on me, and when trying to
> >>>>     start it
> >>>>     > >>>> again, it complained about a lock file in ~. I tried to delete it
> >>>>     > >>>> manually and was informed I was on a read-only fs! I ended up
> >>>>     biting
> >>>>     > >>>> the bullet and re-installing linux due to the number of dead end
> >>>>     > >>>> threads and slow response rates on diagnosing these issues,
> >>>>     and the
> >>>>     > >>>> issue occurred again shortly after.
> >>>>     > >>>>
> >>>>     > >>>> $ uname -a
> >>>>     > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020
> >>>>     16:38:40
> >>>>     > >>>> +0000 x86_64 GNU/Linux
> >>>>     > >>>>
> >>>>     > >>>> $ btrfs --version
> >>>>     > >>>> btrfs-progs v5.4
> >>>>     > >>>>
> >>>>     > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be
> >>>>     mounting a subvol on /
> >>>>     > >>>> Data, single: total=114.01GiB, used=80.88GiB
> >>>>     > >>>> System, single: total=32.00MiB, used=16.00KiB
> >>>>     > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
> >>>>     > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
> >>>>     > >>>>
> >>>>     > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
> >>>>     > >>>> nvme0n1                                       259:5    0
> >>>>     232.9G  0 disk
> >>>>     > >>>> ├─nvme0n1p1                                   259:6    0
> >>>>      512M  0
> >>>>     > >>>> part  (/boot/efi)
> >>>>     > >>>> ├─nvme0n1p2                                   259:7    0
> >>>>      1G  0 part  (/boot)
> >>>>     > >>>> └─nvme0n1p3                                   259:8    0
> >>>>     231.4G  0 part (btrfs)
> >>>>     > >>>>
> >>>>     > >>>> I have the following subvols:
> >>>>     > >>>> arch: used for / when booting arch
> >>>>     > >>>> jwhendy: used for /home/jwhendy on arch
> >>>>     > >>>> vault: shared data between distros on /mnt/vault
> >>>>     > >>>> bionic: root when booting ubuntu bionic
> >>>>     > >>>>
> >>>>     > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
> >>>>     > >>>>
> >>>>     > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
> >>>>     > >>>
> >>>>     > >>> Edit: links now:
> >>>>     > >>> - btrfs check: https://pastebin.com/nz6Bc145
> >>>>     > >>> - dmesg: https://pastebin.com/1GGpNiqk
> >>>>     > >>> - smartctl: https://pastebin.com/ADtYqfrd
> >>>>     > >>>
> >>>>     > >>> btrfs dev stats (not worth a link):
> >>>>     > >>>
> >>>>     > >>> [/dev/mapper/old].write_io_errs    0
> >>>>     > >>> [/dev/mapper/old].read_io_errs     0
> >>>>     > >>> [/dev/mapper/old].flush_io_errs    0
> >>>>     > >>> [/dev/mapper/old].corruption_errs  0
> >>>>     > >>> [/dev/mapper/old].generation_errs  0
> >>>>     > >>>
> >>>>     > >>>
> >>>>     > >>>> If these are of interested, here are reddit threads where I
> >>>>     posted the
> >>>>     > >>>> issue and was referred here.
> >>>>     > >>>> 1)
> >>>>     https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
> >>>>     > >>>> 2)
> >>>>     https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
> >>>>     > >>>>
> >>>>     > >>>> It has been suggested this is a hardware issue. I've already
> >>>>     ordered a
> >>>>     > >>>> replacement m2.sata, but for sanity it would be great to know
> >>>>     > >>>> definitively this was the case. If anything stands out above that
> >>>>     > >>>> could indicate I'm not setup properly re. btrfs, that would
> >>>>     also be
> >>>>     > >>>> fantastic so I don't repeat the issue!
> >>>>     > >>>>
> >>>>     > >>>> The only thing I've stumbled on is that I have been mounting with
> >>>>     > >>>> rd.luks.options=discard and that manually running fstrim is
> >>>>     preferred.
> >>>>     > >>>>
> >>>>     > >>>>
> >>>>     > >>>> Many thanks for any input/suggestions,
> >>>>     > >>>> John
> >>>>     > >>
> >>>>     >
> >>>>
> >>>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-02-09  1:20                     ` John Hendy
@ 2020-02-09  1:24                       ` Qu Wenruo
  2020-02-09  1:49                         ` John Hendy
  0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2020-02-09  1:24 UTC (permalink / raw)
  To: John Hendy; +Cc: Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 23852 bytes --]



On 2020/2/9 上午9:20, John Hendy wrote:
> On Sat, Feb 8, 2020 at 7:09 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/2/9 上午8:59, John Hendy wrote:
>>> Also, if it's of interest, the zero-log trick was new to me. For my
>>> original m2.sata nvme drive, I'd already run all of --init-csum-tree,
>>> --init-extent-tree, and --repair (unsure on the order of the first
>>> two, but --repair was definitely last) but could then not mount it. I
>>> just ran `btrfs rescue zero-log` on it and here is the very brief
>>> output from a btrfs check:
>>>
>>> $ sudo btrfs check /dev/mapper/nvme
>>> Opening filesystem to check...
>>> Checking filesystem on /dev/mapper/nvme
>>> UUID: 488f733d-1dfd-4a0f-ab2f-ba690e095fe4
>>> [1/7] checking root items
>>> [2/7] checking extents
>>> data backref 40762777600 root 256 owner 525787 offset 0 num_refs 0 not
>>> found in extent tree
>>> incorrect local backref count on 40762777600 root 256 owner 525787
>>> offset 0 found 1 wanted 0 back 0x5635831f9a20
>>> incorrect local backref count on 40762777600 root 4352 owner 525787
>>> offset 0 found 0 wanted 1 back 0x56357e5a3c70
>>> backref disk bytenr does not match extent record, bytenr=40762777600,
>>> ref bytenr=0
>>> backpointer mismatch on [40762777600 4096]
>>
>> At this stage, btrfs check --repair should be able to fix it.
>>
>> Or does it still segfault?
> 
> This was the original problematic drive, the m2.sata. I just did
> `btrfs check --repair` and it completed with:
> 
> $ sudo btrfs check --repair /dev/mapper/nvme
> enabling repair mode
> WARNING:
> 
>     Do not use --repair unless you are advised to do so by a developer
>     or an experienced user, and then only after having accepted that no
>     fsck can successfully repair all types of filesystem corruption. Eg.
>     some software or hardware bugs can fatally damage a volume.
>     The operation will start in 10 seconds.
>     Use Ctrl-C to stop it.
> 10 9 8 7 6 5 4 3 2 1
> Starting repair.
> Opening filesystem to check...
> Checking filesystem on /dev/mapper/nvme
> UUID: 488f733d-1dfd-4a0f-ab2f-ba690e095fe4
> [1/7] checking root items
> Fixed 0 roots.
> [2/7] checking extents
> data backref 40762777600 root 256 owner 525787 offset 0 num_refs 0 not
> found in extent tree
> incorrect local backref count on 40762777600 root 256 owner 525787
> offset 0 found 1 wanted 0 back 0x5561d1f74ee0
> incorrect local backref count on 40762777600 root 4352 owner 525787
> offset 0 found 0 wanted 1 back 0x5561cd31f220
> backref disk bytenr does not match extent record, bytenr=40762777600,
> ref bytenr=0
> backpointer mismatch on [40762777600 4096]
> repair deleting extent record: key [40762777600,168,4096]
> adding new data backref on 40762777600 root 256 owner 525787 offset 0 found 1
> Repaired extent references for 40762777600
> No device size related problem found
> [3/7] checking free space cache
> cache and super generation don't match, space cache will be invalidated
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 87799443456 bytes used, no error found
> total csum bytes: 84696784
> total tree bytes: 954220544
> total fs tree bytes: 806535168
> total extent tree bytes: 47710208
> btree space waste bytes: 150766636
> file data blocks allocated: 87780622336
>  referenced 94255783936
> 
> The output of btrfs check now on this drive:
> 
> $ sudo btrfs check /dev/mapper/nvme
> Opening filesystem to check...
> Checking filesystem on /dev/mapper/nvme
> UUID: 488f733d-1dfd-4a0f-ab2f-ba690e095fe4
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space cache
> cache and super generation don't match, space cache will be invalidated
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 87799443456 bytes used, no error found
> total csum bytes: 84696784
> total tree bytes: 954220544
> total fs tree bytes: 806535168
> total extent tree bytes: 47710208
> btree space waste bytes: 150766636
> file data blocks allocated: 87780622336
>  referenced 94255783936

Just as it said, there is no error found by btrfs-check.

If you want to be extra safe, please run `btrfs check` again, using
v5.4.1 (which adds an extra check for extent item generation).

At this stage, at least v5.3 kernel should be able to mount it, and
delete offending files.

v5.4 is a little more strict on extent item generation. But if you
delete the offending files using v5.3, everything should be fine.

If you want to be absolutely safe, you can run `btrfs check
--check-data-csum` to do a scrub-like check on data.
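
For reference, a minimal sequence would look something like this (device path
taken from your output above; run it while the filesystem is unmounted):

$ sudo btrfs check /dev/mapper/nvme                    # plain read-only check with progs v5.4.1
$ sudo btrfs check --check-data-csum /dev/mapper/nvme  # also verifies data checksums, scrub-like but offline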

Thanks,
Qu
> 
> How is that looking? I'll boot back into a usb drive to try --repair
> --mode=lowmem on the SSD. My continued worry is the spurious file I
> can't delete. Is that something btrfs --repair will try to fix or is
> there something else that needs to be done? It seems this inode is
> tripping things up and I can't find a way to get rid of that file.
> 
> John
> 
> 
>>
>> Thanks,
>> Qu
>>> ERROR: errors found in extent allocation tree or chunk allocation
>>> [3/7] checking free space cache
>>> [4/7] checking fs roots
>>> [5/7] checking only csums items (without verifying data)
>>> [6/7] checking root refs
>>> [7/7] checking quota groups skipped (not enabled on this FS)
>>> found 87799443456 bytes used, error(s) found
>>> total csum bytes: 84696784
>>> total tree bytes: 954220544
>>> total fs tree bytes: 806535168
>>> total extent tree bytes: 47710208
>>> btree space waste bytes: 150766636
>>> file data blocks allocated: 87780622336
>>>  referenced 94255783936
>>>
>>> If that looks promising... I'm hoping that the ssd we're currently
>>> working on will follow suit! I'll await your recommendation for what
>>> to do on the previous inquiries for the SSD, and if you have any
>>> suggestions for the backref errors on the nvme drive above.
>>>
>>> Many thanks,
>>> John
>>>
>>> On Sat, Feb 8, 2020 at 6:51 PM John Hendy <jw.hendy@gmail.com> wrote:
>>>>
>>>> On Sat, Feb 8, 2020 at 5:56 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2020/2/9 上午5:57, John Hendy wrote:
>>>>>> On phone due to no OS, so apologies if this is in html mode. Indeed, I
>>>>>> can't mount or boot any longer. I get the error:
>>>>>>
>>>>>> Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed
>>>>>> to recover log tree)
>>>>>> BTRFS error (device dm-0): open_ctree failed
>>>>>
>>>>> That can be easily fixed by `btrfs rescue zero-log`.
>>>>>
>>>>
>>>> Whew. This was most helpful and it is wonderful to be booting at
>>>> least. I think the outstanding issues are:
>>>> - what should I do about `btrfs check --repair seg` faulting?
>>>> - how can I deal with this (probably related to seg fault) ghost file
>>>> that cannot be deleted?
> >>>> - I'm not sure if you looked at the post --repair log, but there are a ton
> >>>> of these errors that didn't use to be there:
>>>>
>>>> backpointer mismatch on [13037375488 20480]
>>>> ref mismatch on [13037395968 892928] extent item 0, found 1
>>>> data backref 13037395968 root 263 owner 4257169 offset 0 num_refs 0
>>>> not found in extent tree
>>>> incorrect local backref count on 13037395968 root 263 owner 4257169
>>>> offset 0 found 1 wanted 0 back 0x5627f59cadc0
>>>>
>>>> Here is the latest btrfs check output after the zero-log operation.
>>>> - https://pastebin.com/KWeUnk0y
>>>>
>>>> I'm hoping once that file is deleted, it's a matter of
>>>> --init-csum-tree and perhaps I'm set? Or --init-extent-tree?
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>>> At least, btrfs check --repair didn't make things worse.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>>
>>>>>> John
>>>>>>
>>>>>> On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@gmail.com
>>>>>> <mailto:jw.hendy@gmail.com>> wrote:
>>>>>>
>>>>>>     This is not going so hot. Updates:
>>>>>>
>>>>>>     booted from arch install, pre repair btrfs check:
>>>>>>     - https://pastebin.com/6vNaSdf2
>>>>>>
>>>>>>     btrfs check --mode=lowmem as requested by Chris:
>>>>>>     - https://pastebin.com/uSwSTVVY
>>>>>>
>>>>>>     Then I did btrfs check --repair, which seg faulted at the end. I've
>>>>>>     typed them off of pictures I took:
>>>>>>
>>>>>>     Starting repair.
>>>>>>     Opening filesystem to check...
>>>>>>     Checking filesystem on /dev/mapper/ssd
>>>>>>     [1/7] checking root items
>>>>>>     Fixed 0 roots.
>>>>>>     [2/7] checking extents
>>>>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
>>>>>>     448074
>>>>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
>>>>>>     448074
>>>>>>     Ignoring transid failure
>>>>>>     # ... repeated the previous two lines maybe hundreds of times
>>>>>>     # ended with this:
>>>>>>     ref mismatch on [12797435904 268505088] extent item 1, found 412
>>>>>>     [1] 1814 segmentation fault (core dumped) btrfs check --repair
>>>>>>     /dev/mapper/ssd
>>>>>>
>>>>>>     This was with btrfs-progs 5.4 (the install USB is maybe a month old).
>>>>>>
>>>>>>     Here is the output of btrfs check after the --repair attempt:
>>>>>>     - https://pastebin.com/6MYRNdga
>>>>>>
>>>>>>     I rebooted to write this email given the seg fault, as I wanted to
>>>>>>     make sure that I should still follow-up --repair with
>>>>>>     --init-csum-tree. I had pictures of the --repair output, but Firefox
>>>>>>     just wouldn't load imgur.com <http://imgur.com> for me to post the
>>>>>>     pics and was acting
>>>>>>     really weird. In suspiciously checking dmesg, things have gone ro on
>>>>>>     me :(  Here is the dmesg from this session:
>>>>>>     - https://pastebin.com/a2z7xczy
>>>>>>
>>>>>>     The gist is:
>>>>>>
>>>>>>     [   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
>>>>>>     block=172703744 slot=0, csum end range (12980568064) goes beyond the
>>>>>>     start range (12980297728) of the next csum item
>>>>>>     [   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
>>>>>>     total ptrs 34 free space 29 owner 7
>>>>>>     [   40.997942]     item 0 key (18446744073709551606 128 12979060736)
>>>>>>     itemoff 14811 itemsize 1472
>>>>>>     [   40.997944]     item 1 key (18446744073709551606 128 12980297728)
>>>>>>     itemoff 13895 itemsize 916
>>>>>>     [   40.997945]     item 2 key (18446744073709551606 128 12981235712)
>>>>>>     itemoff 13811 itemsize 84
>>>>>>     # ... there's maybe 30 of these item n key lines in total
>>>>>>     [   40.997984] BTRFS error (device dm-0): block=172703744 write time
>>>>>>     tree block corruption detected
>>>>>>     [   41.016793] BTRFS: error (device dm-0) in
>>>>>>     btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
>>>>>>     writing out transaction)
>>>>>>     [   41.016799] BTRFS info (device dm-0): forced readonly
>>>>>>     [   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
>>>>>>     transaction.
>>>>>>     [   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
>>>>>>     errno=-5 IO failure
>>>>>>     [   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
>>>>>>     [   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
>>>>>>     transaction.
>>>>>>     [   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
>>>>>>     [   44.509418] systemd-journald[416]:
>>>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
>>>>>>     Journal file corrupted, rotating.
>>>>>>     [   44.509440] systemd-journald[416]: Failed to rotate
>>>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
>>>>>>     Read-only file system
>>>>>>     [   44.509450] systemd-journald[416]: Failed to rotate
>>>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
>>>>>>     Read-only file system
>>>>>>     [   44.509540] systemd-journald[416]: Failed to write entry (23 items,
>>>>>>     705 bytes) despite vacuuming, ignoring: Bad message
>>>>>>     # ... then a bunch of these failed journal attempts (of note:
>>>>>>     /var/log/journal was one of the bad inodes from btrfs check
>>>>>>     previously)
>>>>>>
>>>>>>     Kindly let me know what you would recommend. I'm sadly back to an
>>>>>>     unusable system vs. a complaining/worrisome one. This is similar to
>>>>>>     the behavior I had with the m2.sata nvme drive in my original
>>>>>>     experience. After trying all of --repair, --init-csum-tree, and
>>>>>>     --init-extent-tree, I couldn't boot anymore. After my dm-crypt
>>>>>>     password at boot, I just saw a bunch of [FAILED] in the text splash
>>>>>>     output. Hoping to not repeat that with this drive.
>>>>>>
>>>>>>     Thanks,
>>>>>>     John
>>>>>>
>>>>>>
>>>>>>     On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com
>>>>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>>>>     >
>>>>>>     >
>>>>>>     >
>>>>>>     > On 2020/2/8 下午12:48, John Hendy wrote:
>>>>>>     > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com
>>>>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>>>>     > >>
>>>>>>     > >>
>>>>>>     > >>
>>>>>>     > >> On 2020/2/8 上午1:52, John Hendy wrote:
>>>>>>     > >>> Greetings,
>>>>>>     > >>>
>>>>>>     > >>> I'm resending, as this isn't showing in the archives. Perhaps
>>>>>>     it was
>>>>>>     > >>> the attachments, which I've converted to pastebin links.
>>>>>>     > >>>
>>>>>>     > >>> As an update, I'm now running off of a different drive (ssd,
>>>>>>     not the
>>>>>>     > >>> nvme) and I got the error again! I'm now inclined to think
>>>>>>     this might
>>>>>>     > >>> not be hardware after all, but something related to my setup
>>>>>>     or a bug
>>>>>>     > >>> with chromium.
>>>>>>     > >>>
>>>>>>     > >>> After a reboot, chromium wouldn't start for me and demsg showed
>>>>>>     > >>> similar parent transid/csum errors to my original post below.
>>>>>>     I used
>>>>>>     > >>> btrfs-inspect-internal to find the inode traced to
>>>>>>     > >>> ~/.config/chromium/History. I deleted that, and got a new set of
>>>>>>     > >>> errors tracing to ~/.config/chromium/Cookies. After I deleted
>>>>>>     that and
>>>>>>     > >>> tried starting chromium, I found that my btrfs /home/jwhendy
>>>>>>     pool was
>>>>>>     > >>> mounted ro just like the original problem below.
>>>>>>     > >>>
>>>>>>     > >>> dmesg after trying to start chromium:
>>>>>>     > >>> - https://pastebin.com/CsCEQMJa
>>>>>>     > >>
>>>>>>     > >> So far, it's only transid bug in your csum tree.
>>>>>>     > >>
>>>>>>     > >> And two backref mismatch in data backref.
>>>>>>     > >>
>>>>>>     > >> In theory, you can fix your problem by `btrfs check --repair
>>>>>>     > >> --init-csum-tree`.
>>>>>>     > >>
>>>>>>     > >
>>>>>>     > > Now that I might be narrowing in on offending files, I'll wait
>>>>>>     to see
>>>>>>     > > what you think from my last response to Chris. I did try the above
>>>>>>     > > when I first ran into this:
>>>>>>     > > -
>>>>>>     https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
>>>>>>     >
>>>>>>     > That RO is caused by the missing data backref.
>>>>>>     >
>>>>>>     > Which can be fixed by btrfs check --repair.
>>>>>>     >
>>>>>>     > Then you should be able to delete offending files them. (Or the whole
>>>>>>     > chromium cache, and switch to firefox if you wish :P )
>>>>>>     >
>>>>>>     > But also please keep in mind that, the transid mismatch looks
>>>>>>     happen in
>>>>>>     > your csum tree, which means your csum tree is no longer reliable, and
>>>>>>     > may cause -EIO reading unrelated files.
>>>>>>     >
>>>>>>     > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
>>>>>>     >
>>>>>>     > It can be done altogether by --repair --init-csum-tree, but to be
>>>>>>     safe,
>>>>>>     > please run --repair only first, then make sure btrfs check reports no
>>>>>>     > error after that. Then go --init-csum-tree.
>>>>>>     >
>>>>>>     > >
>>>>>>     > >> But I'm more interesting in how this happened.
>>>>>>     > >
>>>>>>     > > Me too :)
>>>>>>     > >
>>>>>>     > >> Have your every experienced any power loss for your NVME drive?
>>>>>>     > >> I'm not say btrfs is unsafe against power loss, all fs should
>>>>>>     be safe
>>>>>>     > >> against power loss, I'm just curious about if mount time log
>>>>>>     replay is
>>>>>>     > >> involved, or just regular internal log replay.
>>>>>>     > >>
>>>>>>     > >> From your smartctl, the drive experienced 61 unsafe shutdown
>>>>>>     with 2144
>>>>>>     > >> power cycles.
>>>>>>     > >
>>>>>>     > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
>>>>>>     > > caught off gaurd by low battery and instant power-off, I kick myself
>>>>>>     > > and mean to set up a script to force poweroff before that
>>>>>>     happens. So,
>>>>>>     > > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
>>>>>>     > > not over ~2 years. And actually, I mis-stated the age. I haven't
>>>>>>     > > *booted* from this drive in almost 2yrs. It's a corporate laptop,
>>>>>>     > > issued every 3, so the ssd drive is more like 5 years old.
>>>>>>     > >
>>>>>>     > >> Not sure if it's related.
>>>>>>     > >>
>>>>>>     > >> Another interesting point is, did you remember what's the
>>>>>>     oldest kernel
>>>>>>     > >> running on this fs? v5.4 or v5.5?
>>>>>>     > >
>>>>>>     > > Hard to say, but arch linux maintains a package archive. The nvme
>>>>>>     > > drive is from ~May 2018. The archives only go back to Jan 2019
>>>>>>     and the
>>>>>>     > > kernel/btrfs-progs was at 4.20 then:
>>>>>>     > > - https://archive.archlinux.org/packages/l/linux/
>>>>>>     >
>>>>>>     > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
>>>>>>     > cause metadata corruption. And the symptom is transid error, which
>>>>>>     also
>>>>>>     > matches your problem.
>>>>>>     >
>>>>>>     > Thanks,
>>>>>>     > Qu
>>>>>>     >
>>>>>>     > >
>>>>>>     > > Searching my Amazon orders, the SSD was in the 2015 time frame,
>>>>>>     so the
>>>>>>     > > kernel version would have been even older.
>>>>>>     > >
>>>>>>     > > Thanks for your input,
>>>>>>     > > John
>>>>>>     > >
>>>>>>     > >>
>>>>>>     > >> Thanks,
>>>>>>     > >> Qu
>>>>>>     > >>>
>>>>>>     > >>> Thanks for any pointers, as it would now seem that my purchase
>>>>>>     of a
>>>>>>     > >>> new m2.sata may not buy my way out of this problem! While I didn't
>>>>>>     > >>> want to reinstall, at least new hardware is a simple fix. Now I'm
>>>>>>     > >>> worried there is a deeper issue bound to recur :(
>>>>>>     > >>>
>>>>>>     > >>> Best regards,
>>>>>>     > >>> John
>>>>>>     > >>>
>>>>>>     > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com
>>>>>>     <mailto:jw.hendy@gmail.com>> wrote:
>>>>>>     > >>>>
>>>>>>     > >>>> Greetings,
>>>>>>     > >>>>
>>>>>>     > >>>> I've had this issue occur twice, once ~1mo ago and once a
>>>>>>     couple of
>>>>>>     > >>>> weeks ago. Chromium suddenly quit on me, and when trying to
>>>>>>     start it
>>>>>>     > >>>> again, it complained about a lock file in ~. I tried to delete it
>>>>>>     > >>>> manually and was informed I was on a read-only fs! I ended up
>>>>>>     biting
>>>>>>     > >>>> the bullet and re-installing linux due to the number of dead end
>>>>>>     > >>>> threads and slow response rates on diagnosing these issues,
>>>>>>     and the
>>>>>>     > >>>> issue occurred again shortly after.
>>>>>>     > >>>>
>>>>>>     > >>>> $ uname -a
>>>>>>     > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020
>>>>>>     16:38:40
>>>>>>     > >>>> +0000 x86_64 GNU/Linux
>>>>>>     > >>>>
>>>>>>     > >>>> $ btrfs --version
>>>>>>     > >>>> btrfs-progs v5.4
>>>>>>     > >>>>
>>>>>>     > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be
>>>>>>     mounting a subvol on /
>>>>>>     > >>>> Data, single: total=114.01GiB, used=80.88GiB
>>>>>>     > >>>> System, single: total=32.00MiB, used=16.00KiB
>>>>>>     > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
>>>>>>     > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
>>>>>>     > >>>>
>>>>>>     > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
>>>>>>     > >>>> nvme0n1                                       259:5    0
>>>>>>     232.9G  0 disk
>>>>>>     > >>>> ├─nvme0n1p1                                   259:6    0
>>>>>>      512M  0
>>>>>>     > >>>> part  (/boot/efi)
>>>>>>     > >>>> ├─nvme0n1p2                                   259:7    0
>>>>>>      1G  0 part  (/boot)
>>>>>>     > >>>> └─nvme0n1p3                                   259:8    0
>>>>>>     231.4G  0 part (btrfs)
>>>>>>     > >>>>
>>>>>>     > >>>> I have the following subvols:
>>>>>>     > >>>> arch: used for / when booting arch
>>>>>>     > >>>> jwhendy: used for /home/jwhendy on arch
>>>>>>     > >>>> vault: shared data between distros on /mnt/vault
>>>>>>     > >>>> bionic: root when booting ubuntu bionic
>>>>>>     > >>>>
>>>>>>     > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>>>>>>     > >>>>
>>>>>>     > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
>>>>>>     > >>>
>>>>>>     > >>> Edit: links now:
>>>>>>     > >>> - btrfs check: https://pastebin.com/nz6Bc145
>>>>>>     > >>> - dmesg: https://pastebin.com/1GGpNiqk
>>>>>>     > >>> - smartctl: https://pastebin.com/ADtYqfrd
>>>>>>     > >>>
>>>>>>     > >>> btrfs dev stats (not worth a link):
>>>>>>     > >>>
>>>>>>     > >>> [/dev/mapper/old].write_io_errs    0
>>>>>>     > >>> [/dev/mapper/old].read_io_errs     0
>>>>>>     > >>> [/dev/mapper/old].flush_io_errs    0
>>>>>>     > >>> [/dev/mapper/old].corruption_errs  0
>>>>>>     > >>> [/dev/mapper/old].generation_errs  0
>>>>>>     > >>>
>>>>>>     > >>>
>>>>>>     > >>>> If these are of interested, here are reddit threads where I
>>>>>>     posted the
>>>>>>     > >>>> issue and was referred here.
>>>>>>     > >>>> 1)
>>>>>>     https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
>>>>>>     > >>>> 2)
>>>>>>     https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>>>>>>     > >>>>
>>>>>>     > >>>> It has been suggested this is a hardware issue. I've already
>>>>>>     ordered a
>>>>>>     > >>>> replacement m2.sata, but for sanity it would be great to know
>>>>>>     > >>>> definitively this was the case. If anything stands out above that
>>>>>>     > >>>> could indicate I'm not setup properly re. btrfs, that would
>>>>>>     also be
>>>>>>     > >>>> fantastic so I don't repeat the issue!
>>>>>>     > >>>>
>>>>>>     > >>>> The only thing I've stumbled on is that I have been mounting with
>>>>>>     > >>>> rd.luks.options=discard and that manually running fstrim is
>>>>>>     preferred.
>>>>>>     > >>>>
>>>>>>     > >>>>
>>>>>>     > >>>> Many thanks for any input/suggestions,
>>>>>>     > >>>> John
>>>>>>     > >>
>>>>>>     >
>>>>>>
>>>>>
>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-02-09  1:24                       ` Qu Wenruo
@ 2020-02-09  1:49                         ` John Hendy
  0 siblings, 0 replies; 24+ messages in thread
From: John Hendy @ 2020-02-09  1:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Sat, Feb 8, 2020 at 7:24 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/2/9 上午9:20, John Hendy wrote:
> > On Sat, Feb 8, 2020 at 7:09 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/2/9 上午8:59, John Hendy wrote:
> >>> Also, if it's of interest, the zero-log trick was new to me. For my
> >>> original m2.sata nvme drive, I'd already run all of --init-csum-tree,
> >>> --init-extent-tree, and --repair (unsure on the order of the first
> >>> two, but --repair was definitely last) but could then not mount it. I
> >>> just ran `btrfs rescue zero-log` on it and here is the very brief
> >>> output from a btrfs check:
> >>>

[snip]

> > The output of btrfs check now on this drive:
> >
> > $ sudo btrfs check /dev/mapper/nvme
> > Opening filesystem to check...
> > Checking filesystem on /dev/mapper/nvme
> > UUID: 488f733d-1dfd-4a0f-ab2f-ba690e095fe4
> > [1/7] checking root items
> > [2/7] checking extents
> > [3/7] checking free space cache
> > cache and super generation don't match, space cache will be invalidated
> > [4/7] checking fs roots
> > [5/7] checking only csums items (without verifying data)
> > [6/7] checking root refs
> > [7/7] checking quota groups skipped (not enabled on this FS)
> > found 87799443456 bytes used, no error found
> > total csum bytes: 84696784
> > total tree bytes: 954220544
> > total fs tree bytes: 806535168
> > total extent tree bytes: 47710208
> > btree space waste bytes: 150766636
> > file data blocks allocated: 87780622336
> >  referenced 94255783936
>
> Just as it said, there is no error found by btrfs-check.

My apologies. I think we are circling around on which drive is which.

1) NVME, m2.sata, the original drive of this thread:
- had the ro issues, I reinstalled linux, then ro occurred again,
prompting this thread
- on my own, I did --init-csum-tree, --init-extent-tree, and --repair
but it then wouldn't boot
- you gave the zero-log trick for the *other* drive, which I then
applied to this one
- zero-log lets it mount again, and btrfs check --repair appeared to work
- btrfs check is now reporting no issues
- the --check-data-csum on 5.4 also looks good

2) The SSD, the drive which I started using after my nvme woes above
- in tracking down offending files by inode, I deleted some; another
cannot be deleted, no matter what:

$ ls -la
ls: cannot access 'TransportSecurity': No such file or directory
total 0
drwx------ 1 jwhendy jwhendy 22 Feb  8 18:47 .
drwx------ 1 jwhendy jwhendy 18 Feb  7 22:22 ..
-????????? ? ?       ?        ?            ? TransportSecurity

- I have not been able to run --repair successfully due to segfault
- per your advice, I am about to try btrfs check --repair --mode=lowmem on it
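
For reference, the exact invocation I plan to run from the live USB is roughly
the following (the LUKS partition node is a placeholder from memory; adjust as
needed):

$ sudo cryptsetup open /dev/sdXn ssd                        # /dev/sdXn = the SSD's LUKS partition (placeholder)
$ sudo btrfs check --repair --mode=lowmem /dev/mapper/ssd   # low-memory mode: much slower, but may succeed where the regular mode segfaults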

In summary, thanks to your help I might have recovered the nvme drive,
but it's unclear to me where the SSD is at. The latest on the SSD
(copying from earlier thread for convenience):

Here is the output of btrfs check after the --repair attempt (which seg faulted):
- https://pastebin.com/6MYRNdga

Here was the dmesg after I rebooted from that --repair attempt and it went ro:
- https://pastebin.com/a2z7xczy

The only thing that's happened since then is `btrfs rescue zero-log` on it.

John

> If you want to be extra safe, please run `btrfs check` again, using
> v5.4.1 (which adds an extra check for extent item generation).
>
> At this stage, at least v5.3 kernel should be able to mount it, and
> delete offending files.
>
> v5.4 is a little more strict on extent item generation. But if you
> delete the offending files using v5.3, everything should be fine.
>
> If you want to be abosultely safe, you can run `btrfs check
> --check-data-csum` to do a scrub-like check on data.
>
> Thanks,
> Qu
> >
> > How is that looking? I'll boot back into a usb drive to try --repair
> > --mode=lowmem on the SSD. My continued worry is the spurious file I
> > can't delete. Is that something btrfs --repair will try to fix or is
> > there something else that needs to be done? It seems this inode is
> > tripping things up and I can't find a way to get rid of that file.
> >
> > John
> >
> >
> >>
> >> Thanks,
> >> Qu
> >>> ERROR: errors found in extent allocation tree or chunk allocation
> >>> [3/7] checking free space cache
> >>> [4/7] checking fs roots
> >>> [5/7] checking only csums items (without verifying data)
> >>> [6/7] checking root refs
> >>> [7/7] checking quota groups skipped (not enabled on this FS)
> >>> found 87799443456 bytes used, error(s) found
> >>> total csum bytes: 84696784
> >>> total tree bytes: 954220544
> >>> total fs tree bytes: 806535168
> >>> total extent tree bytes: 47710208
> >>> btree space waste bytes: 150766636
> >>> file data blocks allocated: 87780622336
> >>>  referenced 94255783936
> >>>
> >>> If that looks promising... I'm hoping that the ssd we're currently
> >>> working on will follow suit! I'll await your recommendation for what
> >>> to do on the previous inquiries for the SSD, and if you have any
> >>> suggestions for the backref errors on the nvme drive above.
> >>>
> >>> Many thanks,
> >>> John
> >>>
> >>> On Sat, Feb 8, 2020 at 6:51 PM John Hendy <jw.hendy@gmail.com> wrote:
> >>>>
> >>>> On Sat, Feb 8, 2020 at 5:56 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2020/2/9 上午5:57, John Hendy wrote:
> >>>>>> On phone due to no OS, so apologies if this is in html mode. Indeed, I
> >>>>>> can't mount or boot any longer. I get the error:
> >>>>>>
> >>>>>> Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed
> >>>>>> to recover log tree)
> >>>>>> BTRFS error (device dm-0): open_ctree failed
> >>>>>
> >>>>> That can be easily fixed by `btrfs rescue zero-log`.
> >>>>>
> >>>>
> >>>> Whew. This was most helpful and it is wonderful to be booting at
> >>>> least. I think the outstanding issues are:
> >>>> - what should I do about `btrfs check --repair seg` faulting?
> >>>> - how can I deal with this (probably related to seg fault) ghost file
> >>>> that cannot be deleted?
> >>>> - I'm not sure if you looked at the post --repair log, but there are a ton
> >>>> of these errors that didn't use to be there:
> >>>>
> >>>> backpointer mismatch on [13037375488 20480]
> >>>> ref mismatch on [13037395968 892928] extent item 0, found 1
> >>>> data backref 13037395968 root 263 owner 4257169 offset 0 num_refs 0
> >>>> not found in extent tree
> >>>> incorrect local backref count on 13037395968 root 263 owner 4257169
> >>>> offset 0 found 1 wanted 0 back 0x5627f59cadc0
> >>>>
> >>>> Here is the latest btrfs check output after the zero-log operation.
> >>>> - https://pastebin.com/KWeUnk0y
> >>>>
> >>>> I'm hoping once that file is deleted, it's a matter of
> >>>> --init-csum-tree and perhaps I'm set? Or --init-extent-tree?
> >>>>
> >>>> Thanks,
> >>>> John
> >>>>
> >>>>> At least, btrfs check --repair didn't make things worse.
> >>>>>
> >>>>> Thanks,
> >>>>> Qu
> >>>>>>
> >>>>>> John
> >>>>>>
> >>>>>> On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@gmail.com
> >>>>>> <mailto:jw.hendy@gmail.com>> wrote:
> >>>>>>
> >>>>>>     This is not going so hot. Updates:
> >>>>>>
> >>>>>>     booted from arch install, pre repair btrfs check:
> >>>>>>     - https://pastebin.com/6vNaSdf2
> >>>>>>
> >>>>>>     btrfs check --mode=lowmem as requested by Chris:
> >>>>>>     - https://pastebin.com/uSwSTVVY
> >>>>>>
> >>>>>>     Then I did btrfs check --repair, which seg faulted at the end. I've
> >>>>>>     typed them off of pictures I took:
> >>>>>>
> >>>>>>     Starting repair.
> >>>>>>     Opening filesystem to check...
> >>>>>>     Checking filesystem on /dev/mapper/ssd
> >>>>>>     [1/7] checking root items
> >>>>>>     Fixed 0 roots.
> >>>>>>     [2/7] checking extents
> >>>>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
> >>>>>>     448074
> >>>>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
> >>>>>>     448074
> >>>>>>     Ignoring transid failure
> >>>>>>     # ... repeated the previous two lines maybe hundreds of times
> >>>>>>     # ended with this:
> >>>>>>     ref mismatch on [12797435904 268505088] extent item 1, found 412
> >>>>>>     [1] 1814 segmentation fault (core dumped) btrfs check --repair
> >>>>>>     /dev/mapper/ssd
> >>>>>>
> >>>>>>     This was with btrfs-progs 5.4 (the install USB is maybe a month old).
> >>>>>>
> >>>>>>     Here is the output of btrfs check after the --repair attempt:
> >>>>>>     - https://pastebin.com/6MYRNdga
> >>>>>>
> >>>>>>     I rebooted to write this email given the seg fault, as I wanted to
> >>>>>>     make sure that I should still follow-up --repair with
> >>>>>>     --init-csum-tree. I had pictures of the --repair output, but Firefox
> >>>>>>     just wouldn't load imgur.com <http://imgur.com> for me to post the
> >>>>>>     pics and was acting
> >>>>>>     really weird. In suspiciously checking dmesg, things have gone ro on
> >>>>>>     me :(  Here is the dmesg from this session:
> >>>>>>     - https://pastebin.com/a2z7xczy
> >>>>>>
> >>>>>>     The gist is:
> >>>>>>
> >>>>>>     [   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
> >>>>>>     block=172703744 slot=0, csum end range (12980568064) goes beyond the
> >>>>>>     start range (12980297728) of the next csum item
> >>>>>>     [   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
> >>>>>>     total ptrs 34 free space 29 owner 7
> >>>>>>     [   40.997942]     item 0 key (18446744073709551606 128 12979060736)
> >>>>>>     itemoff 14811 itemsize 1472
> >>>>>>     [   40.997944]     item 1 key (18446744073709551606 128 12980297728)
> >>>>>>     itemoff 13895 itemsize 916
> >>>>>>     [   40.997945]     item 2 key (18446744073709551606 128 12981235712)
> >>>>>>     itemoff 13811 itemsize 84
> >>>>>>     # ... there's maybe 30 of these item n key lines in total
> >>>>>>     [   40.997984] BTRFS error (device dm-0): block=172703744 write time
> >>>>>>     tree block corruption detected
> >>>>>>     [   41.016793] BTRFS: error (device dm-0) in
> >>>>>>     btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
> >>>>>>     writing out transaction)
> >>>>>>     [   41.016799] BTRFS info (device dm-0): forced readonly
> >>>>>>     [   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
> >>>>>>     transaction.
> >>>>>>     [   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
> >>>>>>     errno=-5 IO failure
> >>>>>>     [   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
> >>>>>>     [   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
> >>>>>>     transaction.
> >>>>>>     [   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
> >>>>>>     [   44.509418] systemd-journald[416]:
> >>>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
> >>>>>>     Journal file corrupted, rotating.
> >>>>>>     [   44.509440] systemd-journald[416]: Failed to rotate
> >>>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
> >>>>>>     Read-only file system
> >>>>>>     [   44.509450] systemd-journald[416]: Failed to rotate
> >>>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
> >>>>>>     Read-only file system
> >>>>>>     [   44.509540] systemd-journald[416]: Failed to write entry (23 items,
> >>>>>>     705 bytes) despite vacuuming, ignoring: Bad message
> >>>>>>     # ... then a bunch of these failed journal attempts (of note:
> >>>>>>     /var/log/journal was one of the bad inodes from btrfs check
> >>>>>>     previously)
> >>>>>>
> >>>>>>     Kindly let me know what you would recommend. I'm sadly back to an
> >>>>>>     unusable system vs. a complaining/worrisome one. This is similar to
> >>>>>>     the behavior I had with the m2.sata nvme drive in my original
> >>>>>>     experience. After trying all of --repair, --init-csum-tree, and
> >>>>>>     --init-extent-tree, I couldn't boot anymore. After my dm-crypt
> >>>>>>     password at boot, I just saw a bunch of [FAILED] in the text splash
> >>>>>>     output. Hoping to not repeat that with this drive.
> >>>>>>
> >>>>>>     Thanks,
> >>>>>>     John
> >>>>>>
> >>>>>>
> >>>>>>     On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com
> >>>>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
> >>>>>>     >
> >>>>>>     >
> >>>>>>     >
> >>>>>>     > On 2020/2/8 下午12:48, John Hendy wrote:
> >>>>>>     > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com
> >>>>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
> >>>>>>     > >>
> >>>>>>     > >>
> >>>>>>     > >>
> >>>>>>     > >> On 2020/2/8 上午1:52, John Hendy wrote:
> >>>>>>     > >>> Greetings,
> >>>>>>     > >>>
> >>>>>>     > >>> I'm resending, as this isn't showing in the archives. Perhaps
> >>>>>>     it was
> >>>>>>     > >>> the attachments, which I've converted to pastebin links.
> >>>>>>     > >>>
> >>>>>>     > >>> As an update, I'm now running off of a different drive (ssd,
> >>>>>>     not the
> >>>>>>     > >>> nvme) and I got the error again! I'm now inclined to think
> >>>>>>     this might
> >>>>>>     > >>> not be hardware after all, but something related to my setup
> >>>>>>     or a bug
> >>>>>>     > >>> with chromium.
> >>>>>>     > >>>
> >>>>>>     > >>> After a reboot, chromium wouldn't start for me and dmesg showed
> >>>>>>     > >>> similar parent transid/csum errors to my original post below.
> >>>>>>     I used
> >>>>>>     > >>> btrfs-inspect-internal to find the inode traced to
> >>>>>>     > >>> ~/.config/chromium/History. I deleted that, and got a new set of
> >>>>>>     > >>> errors tracing to ~/.config/chromium/Cookies. After I deleted
> >>>>>>     that and
> >>>>>>     > >>> tried starting chromium, I found that my btrfs /home/jwhendy
> >>>>>>     pool was
> >>>>>>     > >>> mounted ro just like the original problem below.
> >>>>>>     > >>>
> >>>>>>     > >>> dmesg after trying to start chromium:
> >>>>>>     > >>> - https://pastebin.com/CsCEQMJa
> >>>>>>     > >>
> >>>>>>     > >> So far, it's only transid bug in your csum tree.
> >>>>>>     > >>
> >>>>>>     > >> And two backref mismatch in data backref.
> >>>>>>     > >>
> >>>>>>     > >> In theory, you can fix your problem by `btrfs check --repair
> >>>>>>     > >> --init-csum-tree`.
> >>>>>>     > >>
> >>>>>>     > >
> >>>>>>     > > Now that I might be narrowing in on offending files, I'll wait
> >>>>>>     to see
> >>>>>>     > > what you think from my last response to Chris. I did try the above
> >>>>>>     > > when I first ran into this:
> >>>>>>     > > -
> >>>>>>     https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
> >>>>>>     >
> >>>>>>     > That RO is caused by the missing data backref.
> >>>>>>     >
> >>>>>>     > Which can be fixed by btrfs check --repair.
> >>>>>>     >
> >>>>>>     > Then you should be able to delete the offending files. (Or the whole
> >>>>>>     > chromium cache, and switch to firefox if you wish :P )
> >>>>>>     >
> >>>>>>     > But also please keep in mind that, the transid mismatch looks
> >>>>>>     happen in
> >>>>>>     > your csum tree, which means your csum tree is no longer reliable, and
> >>>>>>     > may cause -EIO reading unrelated files.
> >>>>>>     >
> >>>>>>     > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
> >>>>>>     >
> >>>>>>     > It can be done altogether by --repair --init-csum-tree, but to be
> >>>>>>     safe,
> >>>>>>     > please run --repair only first, then make sure btrfs check reports no
> >>>>>>     > error after that. Then go --init-csum-tree.
> >>>>>>     >
> >>>>>>     > >
> >>>>>>     > >> But I'm more interested in how this happened.
> >>>>>>     > >
> >>>>>>     > > Me too :)
> >>>>>>     > >
> >>>>>>     > >> Have you ever experienced any power loss for your NVME drive?
> >>>>>>     > >> I'm not saying btrfs is unsafe against power loss, all fs should
> >>>>>>     be safe
> >>>>>>     > >> against power loss, I'm just curious about if mount time log
> >>>>>>     replay is
> >>>>>>     > >> involved, or just regular internal log replay.
> >>>>>>     > >>
> >>>>>>     > >> From your smartctl, the drive experienced 61 unsafe shutdown
> >>>>>>     with 2144
> >>>>>>     > >> power cycles.
> >>>>>>     > >
> >>>>>>     > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
> >>>>>>     > > caught off guard by low battery and instant power-off, I kick myself
> >>>>>>     > > and mean to set up a script to force poweroff before that
> >>>>>>     happens. So,
> >>>>>>     > > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
> >>>>>>     > > not over ~2 years. And actually, I mis-stated the age. I haven't
> >>>>>>     > > *booted* from this drive in almost 2yrs. It's a corporate laptop,
> >>>>>>     > > issued every 3, so the ssd drive is more like 5 years old.
> >>>>>>     > >
> >>>>>>     > >> Not sure if it's related.
> >>>>>>     > >>
> >>>>>>     > >> Another interesting point is, did you remember what's the
> >>>>>>     oldest kernel
> >>>>>>     > >> running on this fs? v5.4 or v5.5?
> >>>>>>     > >
> >>>>>>     > > Hard to say, but arch linux maintains a package archive. The nvme
> >>>>>>     > > drive is from ~May 2018. The archives only go back to Jan 2019
> >>>>>>     and the
> >>>>>>     > > kernel/btrfs-progs was at 4.20 then:
> >>>>>>     > > - https://archive.archlinux.org/packages/l/linux/
> >>>>>>     >
> >>>>>>     > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
> >>>>>>     > cause metadata corruption. And the symptom is transid error, which
> >>>>>>     also
> >>>>>>     > matches your problem.
> >>>>>>     >
> >>>>>>     > Thanks,
> >>>>>>     > Qu
> >>>>>>     >
> >>>>>>     > >
> >>>>>>     > > Searching my Amazon orders, the SSD was in the 2015 time frame,
> >>>>>>     so the
> >>>>>>     > > kernel version would have been even older.
> >>>>>>     > >
> >>>>>>     > > Thanks for your input,
> >>>>>>     > > John
> >>>>>>     > >
> >>>>>>     > >>
> >>>>>>     > >> Thanks,
> >>>>>>     > >> Qu
> >>>>>>     > >>>
> >>>>>>     > >>> Thanks for any pointers, as it would now seem that my purchase
> >>>>>>     of a
> >>>>>>     > >>> new m2.sata may not buy my way out of this problem! While I didn't
> >>>>>>     > >>> want to reinstall, at least new hardware is a simple fix. Now I'm
> >>>>>>     > >>> worried there is a deeper issue bound to recur :(
> >>>>>>     > >>>
> >>>>>>     > >>> Best regards,
> >>>>>>     > >>> John
> >>>>>>     > >>>
> >>>>>>     > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com
> >>>>>>     <mailto:jw.hendy@gmail.com>> wrote:
> >>>>>>     > >>>>
> >>>>>>     > >>>> Greetings,
> >>>>>>     > >>>>
> >>>>>>     > >>>> I've had this issue occur twice, once ~1mo ago and once a
> >>>>>>     couple of
> >>>>>>     > >>>> weeks ago. Chromium suddenly quit on me, and when trying to
> >>>>>>     start it
> >>>>>>     > >>>> again, it complained about a lock file in ~. I tried to delete it
> >>>>>>     > >>>> manually and was informed I was on a read-only fs! I ended up
> >>>>>>     biting
> >>>>>>     > >>>> the bullet and re-installing linux due to the number of dead end
> >>>>>>     > >>>> threads and slow response rates on diagnosing these issues,
> >>>>>>     and the
> >>>>>>     > >>>> issue occurred again shortly after.
> >>>>>>     > >>>>
> >>>>>>     > >>>> $ uname -a
> >>>>>>     > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020
> >>>>>>     16:38:40
> >>>>>>     > >>>> +0000 x86_64 GNU/Linux
> >>>>>>     > >>>>
> >>>>>>     > >>>> $ btrfs --version
> >>>>>>     > >>>> btrfs-progs v5.4
> >>>>>>     > >>>>
> >>>>>>     > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be
> >>>>>>     mounting a subvol on /
> >>>>>>     > >>>> Data, single: total=114.01GiB, used=80.88GiB
> >>>>>>     > >>>> System, single: total=32.00MiB, used=16.00KiB
> >>>>>>     > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
> >>>>>>     > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
> >>>>>>     > >>>>
> >>>>>>     > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
> >>>>>>     > >>>> nvme0n1                                       259:5    0
> >>>>>>     232.9G  0 disk
> >>>>>>     > >>>> ├─nvme0n1p1                                   259:6    0
> >>>>>>      512M  0
> >>>>>>     > >>>> part  (/boot/efi)
> >>>>>>     > >>>> ├─nvme0n1p2                                   259:7    0
> >>>>>>      1G  0 part  (/boot)
> >>>>>>     > >>>> └─nvme0n1p3                                   259:8    0
> >>>>>>     231.4G  0 part (btrfs)
> >>>>>>     > >>>>
> >>>>>>     > >>>> I have the following subvols:
> >>>>>>     > >>>> arch: used for / when booting arch
> >>>>>>     > >>>> jwhendy: used for /home/jwhendy on arch
> >>>>>>     > >>>> vault: shared data between distros on /mnt/vault
> >>>>>>     > >>>> bionic: root when booting ubuntu bionic
> >>>>>>     > >>>>
> >>>>>>     > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
> >>>>>>     > >>>>
> >>>>>>     > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
> >>>>>>     > >>>
> >>>>>>     > >>> Edit: links now:
> >>>>>>     > >>> - btrfs check: https://pastebin.com/nz6Bc145
> >>>>>>     > >>> - dmesg: https://pastebin.com/1GGpNiqk
> >>>>>>     > >>> - smartctl: https://pastebin.com/ADtYqfrd
> >>>>>>     > >>>
> >>>>>>     > >>> btrfs dev stats (not worth a link):
> >>>>>>     > >>>
> >>>>>>     > >>> [/dev/mapper/old].write_io_errs    0
> >>>>>>     > >>> [/dev/mapper/old].read_io_errs     0
> >>>>>>     > >>> [/dev/mapper/old].flush_io_errs    0
> >>>>>>     > >>> [/dev/mapper/old].corruption_errs  0
> >>>>>>     > >>> [/dev/mapper/old].generation_errs  0
> >>>>>>     > >>>
> >>>>>>     > >>>
> >>>>>>     > >>>> If these are of interest, here are reddit threads where I
> >>>>>>     posted the
> >>>>>>     > >>>> issue and was referred here.
> >>>>>>     > >>>> 1)
> >>>>>>     https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
> >>>>>>     > >>>> 2)
> >>>>>>     https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
> >>>>>>     > >>>>
> >>>>>>     > >>>> It has been suggested this is a hardware issue. I've already
> >>>>>>     ordered a
> >>>>>>     > >>>> replacement m2.sata, but for sanity it would be great to know
> >>>>>>     > >>>> definitively this was the case. If anything stands out above that
> >>>>>>     > >>>> could indicate I'm not set up properly re. btrfs, that would
> >>>>>>     also be
> >>>>>>     > >>>> fantastic so I don't repeat the issue!
> >>>>>>     > >>>>
> >>>>>>     > >>>> The only thing I've stumbled on is that I have been mounting with
> >>>>>>     > >>>> rd.luks.options=discard and that manually running fstrim is
> >>>>>>     preferred.
> >>>>>>     > >>>>
> >>>>>>     > >>>>
> >>>>>>     > >>>> Many thanks for any input/suggestions,
> >>>>>>     > >>>> John
> >>>>>>     > >>
> >>>>>>     >
> >>>>>>
> >>>>>
> >>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-02-08 19:56         ` John Hendy
       [not found]           ` <CA+M2ft9dcMKKQstZVcGQ=9MREbfhPF5GG=xoMoh5Aq8MK9P8wA@mail.gmail.com>
@ 2020-02-09  3:46           ` Chris Murphy
  1 sibling, 0 replies; 24+ messages in thread
From: Chris Murphy @ 2020-02-09  3:46 UTC (permalink / raw)
  To: John Hendy; +Cc: Qu Wenruo, Btrfs BTRFS

On Sat, Feb 8, 2020 at 12:57 PM John Hendy <jw.hendy@gmail.com> wrote:
>
> This was with btrfs-progs 5.4 (the install USB is maybe a month old).

5.4.1 is current and has extra checks for extent items, although I
have no idea if the extent problems you're running into are fixable
with the 5.4.1 enhancements.
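
On an arch-based live USB that should just be a package update away, something
like (assuming the live environment has network access):

$ sudo pacman -Sy btrfs-progs   # should pull in the current 5.4.1
$ btrfs --version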


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-02-09  1:07                 ` Qu Wenruo
@ 2020-02-09  4:10                   ` John Hendy
  2020-02-09  5:01                     ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: John Hendy @ 2020-02-09  4:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Sat, Feb 8, 2020 at 7:07 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/2/9 上午8:51, John Hendy wrote:
> > On Sat, Feb 8, 2020 at 5:56 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/2/9 上午5:57, John Hendy wrote:
> >>> On phone due to no OS, so apologies if this is in html mode. Indeed, I
> >>> can't mount or boot any longer. I get the error:
> >>>
> >>> Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed
> >>> to recover log tree)
> >>> BTRFS error (device dm-0): open_ctree failed
> >>
> >> That can be easily fixed by `btrfs rescue zero-log`.
> >>
> >
> > Whew. This was most helpful and it is wonderful to be booting at
> > least. I think the outstanding issues are:
> > - what should I do about `btrfs check --repair seg` faulting?
>
> That needs extra debugging. But you can try `btrfs check --repair
> --mode=lowmem` which sometimes can bring better result than regular mode.
> The trade-off is much slower speed.
>
> > - how can I deal with this (probably related to seg fault) ghost file
> > that cannot be deleted?
>
> Only `btrfs check` can handle it; the kernel will only fall back to RO to
> prevent further corruption.
>
> > - I'm not sure if you looked at the post --repair log, but there are a ton
> > of these errors that didn't use to be there:
> >
> > backpointer mismatch on [13037375488 20480]
> > ref mismatch on [13037395968 892928] extent item 0, found 1
> > data backref 13037395968 root 263 owner 4257169 offset 0 num_refs 0
> > not found in extent tree
> > incorrect local backref count on 13037395968 root 263 owner 4257169
> > offset 0 found 1 wanted 0 back 0x5627f59cadc0
>
> All the 13037395968-related lines are just one problem; it's the original mode
> producing human-unfriendly output.
>
> But the extra transid looks kinda dangerous.
>
> I'd recommend to backup important data first before trying to repair.
>
> >
> > Here is the latest btrfs check output after the zero-log operation.
> > - https://pastebin.com/KWeUnk0y
> >
> > I'm hoping once that file is deleted, it's a matter of
> > --init-csum-tree and perhaps I'm set? Or --init-extent-tree?
>
> --init-csum-tree has the least priority, thus it doesn't really matter.
>
> --init-extent-tree would in theory reset your extent tree, but the
> problem is, the transid mismatch may cause something wrong.
>
> So please backup your data before trying any repair.
> After data backup, please try `btrfs check --repair --mode=lowmem` first.
>

Current status:

- the nvme seems healed! All is well, and a scrub completed
successfully as well. Currently booted into that.

- the ssd is not doing well. I tried to do a backup and got a ton of
issues with rsync (input/output errors, unable to verify transaction).
I gave up as it just wasn't working well and would remount ro during
these operations. Then, I did `btrfs check --repair --mode=lowmem`. It
didn't seg fault, and did look to fix that spurious file (or at least
mention it).
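
For anyone following along, the backup attempt looked roughly like this
(mount points are illustrative; btrfs restore is an alternative I have not
tried yet for pulling files off without mounting):

$ sudo mount -o ro /dev/mapper/ssd /mnt/rescue                 # read-only so nothing new gets written
$ rsync -aHAXv /mnt/rescue/ /mnt/backup/ 2> rsync-errors.log   # rsync keeps going past I/O errors; log them for later
$ sudo btrfs restore -v /dev/mapper/ssd /mnt/backup            # fallback if the fs refuses to mount at all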

Here's the current btrfs check output after the --repair --mode=lowmem attempt:
- https://pastebin.com/fHCHqrk7

If there are any suggestions on salvaging this, I would love to try.
For now, I still have my original nvme drive working as an OS again
and discard options are off everywhere. I can report back if this
continues to work.

Of interest to the list, I ran into these threads which you may already know of:
- https://linustechtips.com/main/topic/1066931-linux-51-kernel-hit-by-ssd-trim-bug-which-causes-massive-data-loss/
(dm-crypt + Samsung SSD + 5.1 kernel = data loss). From googling, 5.1
would have been ~May 2019 for arch linux, so well within this drive's
life
- also, the arch wiki
(https://wiki.archlinux.org/index.php/Solid_state_drive#Continuous_TRIM)
says certain drives have trim errors and certain features are
blacklisted in the kernel
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c#n4522).
My Samsung 850 SSD is in that list. I'm guessing some bad symptoms
occurred to earn it a spot on that list...
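
If it helps anyone confirm the same situation, the discard capabilities the
kernel actually exposes for a drive can be checked with something like this
(device name is an example; mine is the SATA SSD):

$ lsblk -D /dev/sda          # DISC-GRAN/DISC-MAX of 0 means discard is not usable on the device
$ sudo smartctl -i /dev/sda  # the model string here is what the libata blacklist matches on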

Current mount options to sanity check:
/dev/mapper/luks-dc2c470e-ec77-43df-bbe8-110c678785c2 on / type btrfs
(rw,relatime,compress=lzo,ssd,space_cache,subvolid=256,subvol=/arch)
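
For completeness, one way to double-check that discards are disabled end to
end, and to switch to periodic trimming instead (mapping name taken from the
mount line above; the fstrim.timer unit ships with util-linux):

$ sudo dmsetup table luks-dc2c470e-ec77-43df-bbe8-110c678785c2   # no 'allow_discards' flag = dm-crypt drops discards
$ sudo systemctl enable --now fstrim.timer                       # weekly fstrim instead of continuous discard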

I will also do my best to be extra rigorous about avoiding power loss as well.
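
Since the unsafe shutdowns all came from letting the battery run flat under
i3, the guard I'm planning to hook up is something like this (battery name
and threshold are assumptions for my hardware; run it from a timer or cron):

#!/bin/sh
# power off cleanly before the battery dies and cuts power mid-write
cap=$(cat /sys/class/power_supply/BAT0/capacity)
status=$(cat /sys/class/power_supply/BAT0/status)
if [ "$status" = "Discharging" ] && [ "$cap" -le 5 ]; then
    systemctl poweroff
fi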

Fingers crossed this was all about trim/discard.

Many thanks to Chris and Qu for the help. As you can imagine these
situations are awful and one can feel quite powerless. Really
appreciate the coaching and persistence.

Best regards,
John


> Thanks,
> Qu
> >
> > Thanks,
> > John
> >
> >> At least, btrfs check --repair didn't make things worse.
> >>
> >> Thanks,
> >> Qu
> >>>
> >>> John
> >>>
> >>> On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@gmail.com
> >>> <mailto:jw.hendy@gmail.com>> wrote:
> >>>
> >>>     This is not going so hot. Updates:
> >>>
> >>>     booted from arch install, pre repair btrfs check:
> >>>     - https://pastebin.com/6vNaSdf2
> >>>
> >>>     btrfs check --mode=lowmem as requested by Chris:
> >>>     - https://pastebin.com/uSwSTVVY
> >>>
> >>>     Then I did btrfs check --repair, which seg faulted at the end. I've
> >>>     typed them off of pictures I took:
> >>>
> >>>     Starting repair.
> >>>     Opening filesystem to check...
> >>>     Checking filesystem on /dev/mapper/ssd
> >>>     [1/7] checking root items
> >>>     Fixed 0 roots.
> >>>     [2/7] checking extents
> >>>     parent transid verify failed on 20271138064 wanted 68719924810 found
> >>>     448074
> >>>     parent transid verify failed on 20271138064 wanted 68719924810 found
> >>>     448074
> >>>     Ignoring transid failure
> >>>     # ... repeated the previous two lines maybe hundreds of times
> >>>     # ended with this:
> >>>     ref mismatch on [12797435904 268505088] extent item 1, found 412
> >>>     [1] 1814 segmentation fault (core dumped) btrfs check --repair
> >>>     /dev/mapper/ssd
> >>>
> >>>     This was with btrfs-progs 5.4 (the install USB is maybe a month old).
> >>>
> >>>     Here is the output of btrfs check after the --repair attempt:
> >>>     - https://pastebin.com/6MYRNdga
> >>>
> >>>     I rebooted to write this email given the seg fault, as I wanted to
> >>>     make sure that I should still follow-up --repair with
> >>>     --init-csum-tree. I had pictures of the --repair output, but Firefox
> >>>     just wouldn't load imgur.com <http://imgur.com> for me to post the
> >>>     pics and was acting
> >>>     really weird. In suspiciously checking dmesg, things have gone ro on
> >>>     me :(  Here is the dmesg from this session:
> >>>     - https://pastebin.com/a2z7xczy
> >>>
> >>>     The gist is:
> >>>
> >>>     [   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
> >>>     block=172703744 slot=0, csum end range (12980568064) goes beyond the
> >>>     start range (12980297728) of the next csum item
> >>>     [   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
> >>>     total ptrs 34 free space 29 owner 7
> >>>     [   40.997942]     item 0 key (18446744073709551606 128 12979060736)
> >>>     itemoff 14811 itemsize 1472
> >>>     [   40.997944]     item 1 key (18446744073709551606 128 12980297728)
> >>>     itemoff 13895 itemsize 916
> >>>     [   40.997945]     item 2 key (18446744073709551606 128 12981235712)
> >>>     itemoff 13811 itemsize 84
> >>>     # ... there's maybe 30 of these item n key lines in total
> >>>     [   40.997984] BTRFS error (device dm-0): block=172703744 write time
> >>>     tree block corruption detected
> >>>     [   41.016793] BTRFS: error (device dm-0) in
> >>>     btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
> >>>     writing out transaction)
> >>>     [   41.016799] BTRFS info (device dm-0): forced readonly
> >>>     [   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
> >>>     transaction.
> >>>     [   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
> >>>     errno=-5 IO failure
> >>>     [   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
> >>>     [   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
> >>>     transaction.
> >>>     [   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
> >>>     [   44.509418] systemd-journald[416]:
> >>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
> >>>     Journal file corrupted, rotating.
> >>>     [   44.509440] systemd-journald[416]: Failed to rotate
> >>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
> >>>     Read-only file system
> >>>     [   44.509450] systemd-journald[416]: Failed to rotate
> >>>     /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
> >>>     Read-only file system
> >>>     [   44.509540] systemd-journald[416]: Failed to write entry (23 items,
> >>>     705 bytes) despite vacuuming, ignoring: Bad message
> >>>     # ... then a bunch of these failed journal attempts (of note:
> >>>     /var/log/journal was one of the bad inodes from btrfs check
> >>>     previously)
> >>>
> >>>     Kindly let me know what you would recommend. I'm sadly back to an
> >>>     unusable system vs. a complaining/worrisome one. This is similar to
> >>>     the behavior I had with the m2.sata nvme drive in my original
> >>>     experience. After trying all of --repair, --init-csum-tree, and
> >>>     --init-extent-tree, I couldn't boot anymore. After my dm-crypt
> >>>     password at boot, I just saw a bunch of [FAILED] in the text splash
> >>>     output. Hoping to not repeat that with this drive.
> >>>
> >>>     Thanks,
> >>>     John
> >>>
> >>>
> >>>     On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com
> >>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
> >>>     >
> >>>     >
> >>>     >
> >>>     > On 2020/2/8 下午12:48, John Hendy wrote:
> >>>     > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com
> >>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
> >>>     > >>
> >>>     > >>
> >>>     > >>
> >>>     > >> On 2020/2/8 上午1:52, John Hendy wrote:
> >>>     > >>> Greetings,
> >>>     > >>>
> >>>     > >>> I'm resending, as this isn't showing in the archives. Perhaps
> >>>     it was
> >>>     > >>> the attachments, which I've converted to pastebin links.
> >>>     > >>>
> >>>     > >>> As an update, I'm now running off of a different drive (ssd,
> >>>     not the
> >>>     > >>> nvme) and I got the error again! I'm now inclined to think
> >>>     this might
> >>>     > >>> not be hardware after all, but something related to my setup
> >>>     or a bug
> >>>     > >>> with chromium.
> >>>     > >>>
> >>>     > >>> After a reboot, chromium wouldn't start for me and demsg showed
> >>>     > >>> similar parent transid/csum errors to my original post below.
> >>>     I used
> >>>     > >>> btrfs-inspect-internal to find the inode traced to
> >>>     > >>> ~/.config/chromium/History. I deleted that, and got a new set of
> >>>     > >>> errors tracing to ~/.config/chromium/Cookies. After I deleted
> >>>     that and
> >>>     > >>> tried starting chromium, I found that my btrfs /home/jwhendy
> >>>     pool was
> >>>     > >>> mounted ro just like the original problem below.
> >>>     > >>>
> >>>     > >>> dmesg after trying to start chromium:
> >>>     > >>> - https://pastebin.com/CsCEQMJa
> >>>     > >>
> >>>     > >> So far, it's only transid bug in your csum tree.
> >>>     > >>
> >>>     > >> And two backref mismatch in data backref.
> >>>     > >>
> >>>     > >> In theory, you can fix your problem by `btrfs check --repair
> >>>     > >> --init-csum-tree`.
> >>>     > >>
> >>>     > >
> >>>     > > Now that I might be narrowing in on offending files, I'll wait
> >>>     to see
> >>>     > > what you think from my last response to Chris. I did try the above
> >>>     > > when I first ran into this:
> >>>     > > -
> >>>     https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
> >>>     >
> >>>     > That RO is caused by the missing data backref.
> >>>     >
> >>>     > Which can be fixed by btrfs check --repair.
> >>>     >
> >>>     > Then you should be able to delete offending files them. (Or the whole
> >>>     > chromium cache, and switch to firefox if you wish :P )
> >>>     >
> >>>     > But also please keep in mind that, the transid mismatch looks
> >>>     happen in
> >>>     > your csum tree, which means your csum tree is no longer reliable, and
> >>>     > may cause -EIO reading unrelated files.
> >>>     >
> >>>     > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
> >>>     >
> >>>     > It can be done altogether by --repair --init-csum-tree, but to be
> >>>     safe,
> >>>     > please run --repair only first, then make sure btrfs check reports no
> >>>     > error after that. Then go --init-csum-tree.
> >>>     >
> >>>     > >
> >>>     > >> But I'm more interesting in how this happened.
> >>>     > >
> >>>     > > Me too :)
> >>>     > >
> >>>     > >> Have your every experienced any power loss for your NVME drive?
> >>>     > >> I'm not say btrfs is unsafe against power loss, all fs should
> >>>     be safe
> >>>     > >> against power loss, I'm just curious about if mount time log
> >>>     replay is
> >>>     > >> involved, or just regular internal log replay.
> >>>     > >>
> >>>     > >> From your smartctl, the drive experienced 61 unsafe shutdown
> >>>     with 2144
> >>>     > >> power cycles.
> >>>     > >
> >>>     > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
> >>>     > > caught off gaurd by low battery and instant power-off, I kick myself
> >>>     > > and mean to set up a script to force poweroff before that
> >>>     happens. So,
> >>>     > > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
> >>>     > > not over ~2 years. And actually, I mis-stated the age. I haven't
> >>>     > > *booted* from this drive in almost 2yrs. It's a corporate laptop,
> >>>     > > issued every 3, so the ssd drive is more like 5 years old.
> >>>     > >
> >>>     > >> Not sure if it's related.
> >>>     > >>
> >>>     > >> Another interesting point is, did you remember what's the
> >>>     oldest kernel
> >>>     > >> running on this fs? v5.4 or v5.5?
> >>>     > >
> >>>     > > Hard to say, but arch linux maintains a package archive. The nvme
> >>>     > > drive is from ~May 2018. The archives only go back to Jan 2019
> >>>     and the
> >>>     > > kernel/btrfs-progs was at 4.20 then:
> >>>     > > - https://archive.archlinux.org/packages/l/linux/
> >>>     >
> >>>     > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
> >>>     > cause metadata corruption. And the symptom is transid error, which
> >>>     also
> >>>     > matches your problem.
> >>>     >
> >>>     > Thanks,
> >>>     > Qu
> >>>     >
> >>>     > >
> >>>     > > Searching my Amazon orders, the SSD was in the 2015 time frame,
> >>>     so the
> >>>     > > kernel version would have been even older.
> >>>     > >
> >>>     > > Thanks for your input,
> >>>     > > John
> >>>     > >
> >>>     > >>
> >>>     > >> Thanks,
> >>>     > >> Qu
> >>>     > >>>
> >>>     > >>> Thanks for any pointers, as it would now seem that my purchase
> >>>     of a
> >>>     > >>> new m2.sata may not buy my way out of this problem! While I didn't
> >>>     > >>> want to reinstall, at least new hardware is a simple fix. Now I'm
> >>>     > >>> worried there is a deeper issue bound to recur :(
> >>>     > >>>
> >>>     > >>> Best regards,
> >>>     > >>> John
> >>>     > >>>
> >>>     > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com
> >>>     <mailto:jw.hendy@gmail.com>> wrote:
> >>>     > >>>>
> >>>     > >>>> Greetings,
> >>>     > >>>>
> >>>     > >>>> I've had this issue occur twice, once ~1mo ago and once a
> >>>     couple of
> >>>     > >>>> weeks ago. Chromium suddenly quit on me, and when trying to
> >>>     start it
> >>>     > >>>> again, it complained about a lock file in ~. I tried to delete it
> >>>     > >>>> manually and was informed I was on a read-only fs! I ended up
> >>>     biting
> >>>     > >>>> the bullet and re-installing linux due to the number of dead end
> >>>     > >>>> threads and slow response rates on diagnosing these issues,
> >>>     and the
> >>>     > >>>> issue occurred again shortly after.
> >>>     > >>>>
> >>>     > >>>> $ uname -a
> >>>     > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020
> >>>     16:38:40
> >>>     > >>>> +0000 x86_64 GNU/Linux
> >>>     > >>>>
> >>>     > >>>> $ btrfs --version
> >>>     > >>>> btrfs-progs v5.4
> >>>     > >>>>
> >>>     > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be
> >>>     mounting a subvol on /
> >>>     > >>>> Data, single: total=114.01GiB, used=80.88GiB
> >>>     > >>>> System, single: total=32.00MiB, used=16.00KiB
> >>>     > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
> >>>     > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
> >>>     > >>>>
> >>>     > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
> >>>     > >>>> nvme0n1                                       259:5    0
> >>>     232.9G  0 disk
> >>>     > >>>> ├─nvme0n1p1                                   259:6    0
> >>>      512M  0
> >>>     > >>>> part  (/boot/efi)
> >>>     > >>>> ├─nvme0n1p2                                   259:7    0
> >>>      1G  0 part  (/boot)
> >>>     > >>>> └─nvme0n1p3                                   259:8    0
> >>>     231.4G  0 part (btrfs)
> >>>     > >>>>
> >>>     > >>>> I have the following subvols:
> >>>     > >>>> arch: used for / when booting arch
> >>>     > >>>> jwhendy: used for /home/jwhendy on arch
> >>>     > >>>> vault: shared data between distros on /mnt/vault
> >>>     > >>>> bionic: root when booting ubuntu bionic
> >>>     > >>>>
> >>>     > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
> >>>     > >>>>
> >>>     > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
> >>>     > >>>
> >>>     > >>> Edit: links now:
> >>>     > >>> - btrfs check: https://pastebin.com/nz6Bc145
> >>>     > >>> - dmesg: https://pastebin.com/1GGpNiqk
> >>>     > >>> - smartctl: https://pastebin.com/ADtYqfrd
> >>>     > >>>
> >>>     > >>> btrfs dev stats (not worth a link):
> >>>     > >>>
> >>>     > >>> [/dev/mapper/old].write_io_errs    0
> >>>     > >>> [/dev/mapper/old].read_io_errs     0
> >>>     > >>> [/dev/mapper/old].flush_io_errs    0
> >>>     > >>> [/dev/mapper/old].corruption_errs  0
> >>>     > >>> [/dev/mapper/old].generation_errs  0
> >>>     > >>>
> >>>     > >>>
> >>>     > >>>> If these are of interested, here are reddit threads where I
> >>>     posted the
> >>>     > >>>> issue and was referred here.
> >>>     > >>>> 1)
> >>>     https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
> >>>     > >>>> 2)
> >>>     https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
> >>>     > >>>>
> >>>     > >>>> It has been suggested this is a hardware issue. I've already
> >>>     ordered a
> >>>     > >>>> replacement m2.sata, but for sanity it would be great to know
> >>>     > >>>> definitively this was the case. If anything stands out above that
> >>>     > >>>> could indicate I'm not setup properly re. btrfs, that would
> >>>     also be
> >>>     > >>>> fantastic so I don't repeat the issue!
> >>>     > >>>>
> >>>     > >>>> The only thing I've stumbled on is that I have been mounting with
> >>>     > >>>> rd.luks.options=discard and that manually running fstrim is
> >>>     preferred.
> >>>     > >>>>
> >>>     > >>>>
> >>>     > >>>> Many thanks for any input/suggestions,
> >>>     > >>>> John
> >>>     > >>
> >>>     >
> >>>
> >>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-02-09  4:10                   ` John Hendy
@ 2020-02-09  5:01                     ` Qu Wenruo
  0 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2020-02-09  5:01 UTC (permalink / raw)
  To: John Hendy; +Cc: Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 21072 bytes --]



On 2020/2/9 at 12:10 PM, John Hendy wrote:
> On Sat, Feb 8, 2020 at 7:07 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/2/9 上午8:51, John Hendy wrote:
>>> On Sat, Feb 8, 2020 at 5:56 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/2/9 上午5:57, John Hendy wrote:
>>>>> On phone due to no OS, so apologies if this is in html mode. Indeed, I
>>>>> can't mount or boot any longer. I get the error:
>>>>>
>>>>> Error (device dm-0) in btrfs_replay_log:2228: errno=-22 unknown (Failed
>>>>> to recover log tree)
>>>>> BTRFS error (device dm-0): open_ctree failed
>>>>
>>>> That can be easily fixed by `btrfs rescue zero-log`.
>>>>
>>>
>>> Whew. This was most helpful and it is wonderful to be booting at
>>> least. I think the outstanding issues are:
>>> - what should I do about `btrfs check --repair seg` faulting?
>>
>> That needs extra debugging. But you can try `btrfs check --repair
>> --mode=lowmem` which sometimes can bring better result than regular mode.
>> The trade-off is much slower speed.
>>
>>> - how can I deal with this (probably related to seg fault) ghost file
>>> that cannot be deleted?
>>
>> Only `btrfs check` can handle it, kernel will only fallback to RO to
>> prevent further corruption.
>>
>>> - I'm not sure if you looked at the post --repair log, but there a ton
>>> of these errors that didn't used to be there:
>>>
>>> backpointer mismatch on [13037375488 20480]
>>> ref mismatch on [13037395968 892928] extent item 0, found 1
>>> data backref 13037395968 root 263 owner 4257169 offset 0 num_refs 0
>>> not found in extent tree
>>> incorrect local backref count on 13037395968 root 263 owner 4257169
>>> offset 0 found 1 wanted 0 back 0x5627f59cadc0
>>
>> All 13037395968 related line is just one problem, it's the original mode
>> doing human-unfriendly output.
>>
>> But the extra transid looks kinda dangerous.
>>
>> I'd recommend to backup important data first before trying to repair.
>>
>>>
>>> Here is the latest btrfs check output after the zero-log operation.
>>> - https://pastebin.com/KWeUnk0y
>>>
>>> I'm hoping once that file is deleted, it's a matter of
>>> --init-csum-tree and perhaps I'm set? Or --init-extent-tree?
>>
>> --init-csum-tree has the least priority, thus it doesn't really matter.
>>
>> --init-extent-tree would in theory reset your extent tree, but the
>> problem is, the transid mismatch may cause something wrong.
>>
>> So please backup your data before trying any repair.
>> After data backup, please try `btrfs check --repair --mode=lowmem` first.
>>
> 
> Current status:
> 
> - the nvme seems healed! All is well, and a scrub completed
> successfully as well. Currently booted into that.

Great, we can just forget that case now.

> 
> - the ssd is not doing well. I tried to do a backup and got a ton of
> issues with rsync (input/output errors, unable to verify transaction).
> I gave up as it just wasn't working well and would remount ro during
> these operations. Then, I did `btrfs check --repair --mode=lowmem`. It
> didn't seg fault, and did look to fix that spurious file (or at least
> mention it).
> 
> Here's the current btrfs check output after the --repair --mode=lowmem attempt:
> - https://pastebin.com/fHCHqrk7

The problems are more serious than your NVME one.

Transid errors in the csum tree.

A transid error by itself already means metadata COW has been broken,
which also comes with extent tree corruption.
Either trim problems or the v5.2 bug can lead to this kind of corruption.

Your best option for salvaging data at this point is btrfs restore.
Since the csum tree is corrupted, a lot of normal data reads will fail anyway.

For repair, you may try --init-csum-tree first.
As you have nothing to lose, you may also try --init-extent-tree,
or maybe even both.
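
A rough sketch of that sequence (the backup target /mnt/backup is just
an example; run these against the unmounted filesystem):

$ sudo btrfs restore -v /dev/mapper/ssd /mnt/backup   # salvage readable data first

$ sudo btrfs check --repair --init-csum-tree /dev/mapper/ssd
$ sudo btrfs check /dev/mapper/ssd                    # re-check the result

# only if errors remain, rebuild the extent tree as well
$ sudo btrfs check --repair --init-extent-tree /dev/mapper/ssd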

Thanks,
Qu

> 
> If there are any suggestions on salvaging this, I would love to try.
> For now, I still have my original nvme drive working as an OS again
> and discard options are off everywhere. I can report back if this
> continues to work.
> 
> Of interest to the list, I ran into these threads which you may already know of:
> - https://linustechtips.com/main/topic/1066931-linux-51-kernel-hit-by-ssd-trim-bug-which-causes-massive-data-loss/
> (dm-crypt + Samsung SSD + 5.1 kernel = data loss). From googling, 5.1
> would have been ~May 2019 for arch linux, so well within this drive's
> life
> - also, the arch wiki
> (https://wiki.archlinux.org/index.php/Solid_state_drive#Continuous_TRIM)
> says certain drives have trim errors and certain features are
> blacklisted in the kernel
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c#n4522).
> My Samsung 850 SSD is in that list. I'm guessing some bad symptoms
> occurred to earn it a spot on that list...
> 
> Current mount options to sanity check:
> /dev/mapper/luks-dc2c470e-ec77-43df-bbe8-110c678785c2 on / type btrfs
> (rw,relatime,compress=lzo,ssd,space_cache,subvolid=256,subvol=/arch
> 
> I will also do my best not to be extra rigorous about power loss as well.
> 
> Fingers crossed this was all about trim/discard.
> 
> Many thanks to Chris and Qu for the help. As you can imagine these
> situations are awful and one can feel quite powerless. Really
> appreciate the coaching and persistence.
> 
> Best regards,
> John
> 
> 
>> Thanks,
>> Qu
>>>
>>> Thanks,
>>> John
>>>
>>>> At least, btrfs check --repair didn't make things worse.
>>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> John
>>>>>
>>>>> On Sat, Feb 8, 2020, 1:56 PM John Hendy <jw.hendy@gmail.com
>>>>> <mailto:jw.hendy@gmail.com>> wrote:
>>>>>
>>>>>     This is not going so hot. Updates:
>>>>>
>>>>>     booted from arch install, pre repair btrfs check:
>>>>>     - https://pastebin.com/6vNaSdf2
>>>>>
>>>>>     btrfs check --mode=lowmem as requested by Chris:
>>>>>     - https://pastebin.com/uSwSTVVY
>>>>>
>>>>>     Then I did btrfs check --repair, which seg faulted at the end. I've
>>>>>     typed them off of pictures I took:
>>>>>
>>>>>     Starting repair.
>>>>>     Opening filesystem to check...
>>>>>     Checking filesystem on /dev/mapper/ssd
>>>>>     [1/7] checking root items
>>>>>     Fixed 0 roots.
>>>>>     [2/7] checking extents
>>>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
>>>>>     448074
>>>>>     parent transid verify failed on 20271138064 wanted 68719924810 found
>>>>>     448074
>>>>>     Ignoring transid failure
>>>>>     # ... repeated the previous two lines maybe hundreds of times
>>>>>     # ended with this:
>>>>>     ref mismatch on [12797435904 268505088] extent item 1, found 412
>>>>>     [1] 1814 segmentation fault (core dumped) btrfs check --repair
>>>>>     /dev/mapper/ssd
>>>>>
>>>>>     This was with btrfs-progs 5.4 (the install USB is maybe a month old).
>>>>>
>>>>>     Here is the output of btrfs check after the --repair attempt:
>>>>>     - https://pastebin.com/6MYRNdga
>>>>>
>>>>>     I rebooted to write this email given the seg fault, as I wanted to
>>>>>     make sure that I should still follow-up --repair with
>>>>>     --init-csum-tree. I had pictures of the --repair output, but Firefox
>>>>>     just wouldn't load imgur.com <http://imgur.com> for me to post the
>>>>>     pics and was acting
>>>>>     really weird. In suspiciously checking dmesg, things have gone ro on
>>>>>     me :(  Here is the dmesg from this session:
>>>>>     - https://pastebin.com/a2z7xczy
>>>>>
>>>>>     The gist is:
>>>>>
>>>>>     [   40.997935] BTRFS critical (device dm-0): corrupt leaf: root=7
>>>>>     block=172703744 slot=0, csum end range (12980568064) goes beyond the
>>>>>     start range (12980297728) of the next csum item
>>>>>     [   40.997941] BTRFS info (device dm-0): leaf 172703744 gen 450983
>>>>>     total ptrs 34 free space 29 owner 7
>>>>>     [   40.997942]     item 0 key (18446744073709551606 128 12979060736)
>>>>>     itemoff 14811 itemsize 1472
>>>>>     [   40.997944]     item 1 key (18446744073709551606 128 12980297728)
>>>>>     itemoff 13895 itemsize 916
>>>>>     [   40.997945]     item 2 key (18446744073709551606 128 12981235712)
>>>>>     itemoff 13811 itemsize 84
>>>>>     # ... there's maybe 30 of these item n key lines in total
>>>>>     [   40.997984] BTRFS error (device dm-0): block=172703744 write time
>>>>>     tree block corruption detected
>>>>>     [   41.016793] BTRFS: error (device dm-0) in
>>>>>     btrfs_commit_transaction:2332: errno=-5 IO failure (Error while
>>>>>     writing out transaction)
>>>>>     [   41.016799] BTRFS info (device dm-0): forced readonly
>>>>>     [   41.016802] BTRFS warning (device dm-0): Skipping commit of aborted
>>>>>     transaction.
>>>>>     [   41.016804] BTRFS: error (device dm-0) in cleanup_transaction:1890:
>>>>>     errno=-5 IO failure
>>>>>     [   41.016807] BTRFS info (device dm-0): delayed_refs has NO entry
>>>>>     [   41.023473] BTRFS warning (device dm-0): Skipping commit of aborted
>>>>>     transaction.
>>>>>     [   41.024297] BTRFS info (device dm-0): delayed_refs has NO entry
>>>>>     [   44.509418] systemd-journald[416]:
>>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
>>>>>     Journal file corrupted, rotating.
>>>>>     [   44.509440] systemd-journald[416]: Failed to rotate
>>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/system.journal:
>>>>>     Read-only file system
>>>>>     [   44.509450] systemd-journald[416]: Failed to rotate
>>>>>     /var/log/journal/45c06c25e25f434195204efa939019ab/user-1000.journal:
>>>>>     Read-only file system
>>>>>     [   44.509540] systemd-journald[416]: Failed to write entry (23 items,
>>>>>     705 bytes) despite vacuuming, ignoring: Bad message
>>>>>     # ... then a bunch of these failed journal attempts (of note:
>>>>>     /var/log/journal was one of the bad inodes from btrfs check
>>>>>     previously)
>>>>>
>>>>>     Kindly let me know what you would recommend. I'm sadly back to an
>>>>>     unusable system vs. a complaining/worrisome one. This is similar to
>>>>>     the behavior I had with the m2.sata nvme drive in my original
>>>>>     experience. After trying all of --repair, --init-csum-tree, and
>>>>>     --init-extent-tree, I couldn't boot anymore. After my dm-crypt
>>>>>     password at boot, I just saw a bunch of [FAILED] in the text splash
>>>>>     output. Hoping to not repeat that with this drive.
>>>>>
>>>>>     Thanks,
>>>>>     John
>>>>>
>>>>>
>>>>>     On Sat, Feb 8, 2020 at 1:29 AM Qu Wenruo <quwenruo.btrfs@gmx.com
>>>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>>>     >
>>>>>     >
>>>>>     >
>>>>>     > On 2020/2/8 下午12:48, John Hendy wrote:
>>>>>     > > On Fri, Feb 7, 2020 at 5:42 PM Qu Wenruo <quwenruo.btrfs@gmx.com
>>>>>     <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>>>     > >>
>>>>>     > >>
>>>>>     > >>
>>>>>     > >> On 2020/2/8 上午1:52, John Hendy wrote:
>>>>>     > >>> Greetings,
>>>>>     > >>>
>>>>>     > >>> I'm resending, as this isn't showing in the archives. Perhaps
>>>>>     it was
>>>>>     > >>> the attachments, which I've converted to pastebin links.
>>>>>     > >>>
>>>>>     > >>> As an update, I'm now running off of a different drive (ssd,
>>>>>     not the
>>>>>     > >>> nvme) and I got the error again! I'm now inclined to think
>>>>>     this might
>>>>>     > >>> not be hardware after all, but something related to my setup
>>>>>     or a bug
>>>>>     > >>> with chromium.
>>>>>     > >>>
>>>>>     > >>> After a reboot, chromium wouldn't start for me and demsg showed
>>>>>     > >>> similar parent transid/csum errors to my original post below.
>>>>>     I used
>>>>>     > >>> btrfs-inspect-internal to find the inode traced to
>>>>>     > >>> ~/.config/chromium/History. I deleted that, and got a new set of
>>>>>     > >>> errors tracing to ~/.config/chromium/Cookies. After I deleted
>>>>>     that and
>>>>>     > >>> tried starting chromium, I found that my btrfs /home/jwhendy
>>>>>     pool was
>>>>>     > >>> mounted ro just like the original problem below.
>>>>>     > >>>
>>>>>     > >>> dmesg after trying to start chromium:
>>>>>     > >>> - https://pastebin.com/CsCEQMJa
>>>>>     > >>
>>>>>     > >> So far, it's only transid bug in your csum tree.
>>>>>     > >>
>>>>>     > >> And two backref mismatch in data backref.
>>>>>     > >>
>>>>>     > >> In theory, you can fix your problem by `btrfs check --repair
>>>>>     > >> --init-csum-tree`.
>>>>>     > >>
>>>>>     > >
>>>>>     > > Now that I might be narrowing in on offending files, I'll wait
>>>>>     to see
>>>>>     > > what you think from my last response to Chris. I did try the above
>>>>>     > > when I first ran into this:
>>>>>     > > -
>>>>>     https://lore.kernel.org/linux-btrfs/CA+M2ft8FpjdDQ7=XwMdYQazhyB95aha_D4WU_n15M59QrimrRg@mail.gmail.com/
>>>>>     >
>>>>>     > That RO is caused by the missing data backref.
>>>>>     >
>>>>>     > Which can be fixed by btrfs check --repair.
>>>>>     >
>>>>>     > Then you should be able to delete offending files them. (Or the whole
>>>>>     > chromium cache, and switch to firefox if you wish :P )
>>>>>     >
>>>>>     > But also please keep in mind that, the transid mismatch looks
>>>>>     happen in
>>>>>     > your csum tree, which means your csum tree is no longer reliable, and
>>>>>     > may cause -EIO reading unrelated files.
>>>>>     >
>>>>>     > Thus it's recommended to re-fill the csum tree by --init-csum-tree.
>>>>>     >
>>>>>     > It can be done altogether by --repair --init-csum-tree, but to be
>>>>>     safe,
>>>>>     > please run --repair only first, then make sure btrfs check reports no
>>>>>     > error after that. Then go --init-csum-tree.
>>>>>     >
>>>>>     > >
>>>>>     > >> But I'm more interesting in how this happened.
>>>>>     > >
>>>>>     > > Me too :)
>>>>>     > >
>>>>>     > >> Have your every experienced any power loss for your NVME drive?
>>>>>     > >> I'm not say btrfs is unsafe against power loss, all fs should
>>>>>     be safe
>>>>>     > >> against power loss, I'm just curious about if mount time log
>>>>>     replay is
>>>>>     > >> involved, or just regular internal log replay.
>>>>>     > >>
>>>>>     > >> From your smartctl, the drive experienced 61 unsafe shutdown
>>>>>     with 2144
>>>>>     > >> power cycles.
>>>>>     > >
>>>>>     > > Uhhh, hell yes, sadly. I'm a dummy running i3 and every time I get
>>>>>     > > caught off gaurd by low battery and instant power-off, I kick myself
>>>>>     > > and mean to set up a script to force poweroff before that
>>>>>     happens. So,
>>>>>     > > indeed, I've lost power a ton. Surprised it was 61 times, but maybe
>>>>>     > > not over ~2 years. And actually, I mis-stated the age. I haven't
>>>>>     > > *booted* from this drive in almost 2yrs. It's a corporate laptop,
>>>>>     > > issued every 3, so the ssd drive is more like 5 years old.
>>>>>     > >
>>>>>     > >> Not sure if it's related.
>>>>>     > >>
>>>>>     > >> Another interesting point is, did you remember what's the
>>>>>     oldest kernel
>>>>>     > >> running on this fs? v5.4 or v5.5?
>>>>>     > >
>>>>>     > > Hard to say, but arch linux maintains a package archive. The nvme
>>>>>     > > drive is from ~May 2018. The archives only go back to Jan 2019
>>>>>     and the
>>>>>     > > kernel/btrfs-progs was at 4.20 then:
>>>>>     > > - https://archive.archlinux.org/packages/l/linux/
>>>>>     >
>>>>>     > There is a known bug in v5.2.0~v5.2.14 (fixed in v5.2.15), which could
>>>>>     > cause metadata corruption. And the symptom is transid error, which
>>>>>     also
>>>>>     > matches your problem.
>>>>>     >
>>>>>     > Thanks,
>>>>>     > Qu
>>>>>     >
>>>>>     > >
>>>>>     > > Searching my Amazon orders, the SSD was in the 2015 time frame,
>>>>>     so the
>>>>>     > > kernel version would have been even older.
>>>>>     > >
>>>>>     > > Thanks for your input,
>>>>>     > > John
>>>>>     > >
>>>>>     > >>
>>>>>     > >> Thanks,
>>>>>     > >> Qu
>>>>>     > >>>
>>>>>     > >>> Thanks for any pointers, as it would now seem that my purchase
>>>>>     of a
>>>>>     > >>> new m2.sata may not buy my way out of this problem! While I didn't
>>>>>     > >>> want to reinstall, at least new hardware is a simple fix. Now I'm
>>>>>     > >>> worried there is a deeper issue bound to recur :(
>>>>>     > >>>
>>>>>     > >>> Best regards,
>>>>>     > >>> John
>>>>>     > >>>
>>>>>     > >>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com
>>>>>     <mailto:jw.hendy@gmail.com>> wrote:
>>>>>     > >>>>
>>>>>     > >>>> Greetings,
>>>>>     > >>>>
>>>>>     > >>>> I've had this issue occur twice, once ~1mo ago and once a
>>>>>     couple of
>>>>>     > >>>> weeks ago. Chromium suddenly quit on me, and when trying to
>>>>>     start it
>>>>>     > >>>> again, it complained about a lock file in ~. I tried to delete it
>>>>>     > >>>> manually and was informed I was on a read-only fs! I ended up
>>>>>     biting
>>>>>     > >>>> the bullet and re-installing linux due to the number of dead end
>>>>>     > >>>> threads and slow response rates on diagnosing these issues,
>>>>>     and the
>>>>>     > >>>> issue occurred again shortly after.
>>>>>     > >>>>
>>>>>     > >>>> $ uname -a
>>>>>     > >>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020
>>>>>     16:38:40
>>>>>     > >>>> +0000 x86_64 GNU/Linux
>>>>>     > >>>>
>>>>>     > >>>> $ btrfs --version
>>>>>     > >>>> btrfs-progs v5.4
>>>>>     > >>>>
>>>>>     > >>>> $ btrfs fi df /mnt/misc/ # full device; normally would be
>>>>>     mounting a subvol on /
>>>>>     > >>>> Data, single: total=114.01GiB, used=80.88GiB
>>>>>     > >>>> System, single: total=32.00MiB, used=16.00KiB
>>>>>     > >>>> Metadata, single: total=2.01GiB, used=769.61MiB
>>>>>     > >>>> GlobalReserve, single: total=140.73MiB, used=0.00B
>>>>>     > >>>>
>>>>>     > >>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
>>>>>     > >>>> nvme0n1                                       259:5    0
>>>>>     232.9G  0 disk
>>>>>     > >>>> ├─nvme0n1p1                                   259:6    0
>>>>>      512M  0
>>>>>     > >>>> part  (/boot/efi)
>>>>>     > >>>> ├─nvme0n1p2                                   259:7    0
>>>>>      1G  0 part  (/boot)
>>>>>     > >>>> └─nvme0n1p3                                   259:8    0
>>>>>     231.4G  0 part (btrfs)
>>>>>     > >>>>
>>>>>     > >>>> I have the following subvols:
>>>>>     > >>>> arch: used for / when booting arch
>>>>>     > >>>> jwhendy: used for /home/jwhendy on arch
>>>>>     > >>>> vault: shared data between distros on /mnt/vault
>>>>>     > >>>> bionic: root when booting ubuntu bionic
>>>>>     > >>>>
>>>>>     > >>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>>>>>     > >>>>
>>>>>     > >>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
>>>>>     > >>>
>>>>>     > >>> Edit: links now:
>>>>>     > >>> - btrfs check: https://pastebin.com/nz6Bc145
>>>>>     > >>> - dmesg: https://pastebin.com/1GGpNiqk
>>>>>     > >>> - smartctl: https://pastebin.com/ADtYqfrd
>>>>>     > >>>
>>>>>     > >>> btrfs dev stats (not worth a link):
>>>>>     > >>>
>>>>>     > >>> [/dev/mapper/old].write_io_errs    0
>>>>>     > >>> [/dev/mapper/old].read_io_errs     0
>>>>>     > >>> [/dev/mapper/old].flush_io_errs    0
>>>>>     > >>> [/dev/mapper/old].corruption_errs  0
>>>>>     > >>> [/dev/mapper/old].generation_errs  0
>>>>>     > >>>
>>>>>     > >>>
>>>>>     > >>>> If these are of interested, here are reddit threads where I
>>>>>     posted the
>>>>>     > >>>> issue and was referred here.
>>>>>     > >>>> 1)
>>>>>     https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
>>>>>     > >>>> 2)
>>>>>     https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>>>>>     > >>>>
>>>>>     > >>>> It has been suggested this is a hardware issue. I've already
>>>>>     ordered a
>>>>>     > >>>> replacement m2.sata, but for sanity it would be great to know
>>>>>     > >>>> definitively this was the case. If anything stands out above that
>>>>>     > >>>> could indicate I'm not setup properly re. btrfs, that would
>>>>>     also be
>>>>>     > >>>> fantastic so I don't repeat the issue!
>>>>>     > >>>>
>>>>>     > >>>> The only thing I've stumbled on is that I have been mounting with
>>>>>     > >>>> rd.luks.options=discard and that manually running fstrim is
>>>>>     preferred.
>>>>>     > >>>>
>>>>>     > >>>>
>>>>>     > >>>> Many thanks for any input/suggestions,
>>>>>     > >>>> John
>>>>>     > >>
>>>>>     >
>>>>>
>>>>
>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
       [not found] <CA+M2ft9zjGm7XJw1BUm364AMqGSd3a8QgsvQDCWz317qjP=o8g@mail.gmail.com>
  2020-02-07 17:52 ` btrfs root fs started remounting ro John Hendy
@ 2020-05-06  4:37 ` John Hendy
  2020-05-06  6:13   ` Qu Wenruo
  1 sibling, 1 reply; 24+ messages in thread
From: John Hendy @ 2020-05-06  4:37 UTC (permalink / raw)
  To: Btrfs BTRFS

Greetings,


I'm following up to the below as this just occurred again. I think
there is something odd in the interaction between btrfs and browsers.
Since the last time, I was able to recover my drive, and I have disabled
continuous trim (and have not manually trimmed, for that matter).
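
For reference, a quick sanity check that continuous discard really is
off (using my /home mount as an example) would be something like:

$ findmnt -no OPTIONS /home/jwhendy   # "discard" should not appear
$ cat /proc/cmdline                   # nor rd.luks.options=discard
$ systemctl status fstrim.timer       # shows whether periodic trim runs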

I've switched to firefox almost exclusively (I can only think of a
handful of times I've used chromium since), but the earlier problem was
related to the chromium cache, and the problem this time was the file:

.cache/mozilla/firefox/tqxxilph.default-release/cache2/entries/D8FD7600C30A3A68D18D98B233F9C5DD3F7DDAD0

In this particular instance, I suspended my computer and resumed to
find the filesystem read-only. I had opened it intending to reboot into
Windows, only to find that I couldn't save my open file in emacs.

The dmesg is here: https://pastebin.com/B8nUkYzB

The file above was flagged as uncorrectable by btrfs scrub, but after I
manually deleted it, a second scrub completed with no errors.
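
For anyone hitting the same thing, going from a scrub error to a path
looks roughly like this (the logical address is whatever dmesg reports;
newer kernels may already print the path in the scrub message):

$ sudo btrfs scrub start -B /home/jwhendy
$ sudo dmesg | grep -i 'checksum error'
$ sudo btrfs inspect-internal logical-resolve <logical-from-dmesg> /home/jwhendy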

$ btrfs --version
btrfs-progs v5.6

$ uname -a
Linux voltaur 5.6.10-arch1-1 #1 SMP PREEMPT Sat, 02 May 2020 19:11:54
+0000 x86_64 GNU/Linux

I don't know how to reproduce this at all, but it has always been
browser-cache related. There are similar reports out there, but no
obvious pattern or solution.
- https://forum.manjaro.org/t/root-and-home-become-read-only/46944
- https://bbs.archlinux.org/viewtopic.php?id=224243

Is there anything else I can check to figure out why this might occur?

Best regards,
John


On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com> wrote:
>
> Greetings,
>
> I've had this issue occur twice, once ~1mo ago and once a couple of
> weeks ago. Chromium suddenly quit on me, and when trying to start it
> again, it complained about a lock file in ~. I tried to delete it
> manually and was informed I was on a read-only fs! I ended up biting
> the bullet and re-installing linux due to the number of dead end
> threads and slow response rates on diagnosing these issues, and the
> issue occurred again shortly after.
>
> $ uname -a
> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020 16:38:40
> +0000 x86_64 GNU/Linux
>
> $ btrfs --version
> btrfs-progs v5.4
>
> $ btrfs fi df /mnt/misc/ # full device; normally would be mounting a subvol on /
> Data, single: total=114.01GiB, used=80.88GiB
> System, single: total=32.00MiB, used=16.00KiB
> Metadata, single: total=2.01GiB, used=769.61MiB
> GlobalReserve, single: total=140.73MiB, used=0.00B
>
> This is a single device, no RAID, not on a VM. HP Zbook 15.
> nvme0n1                                       259:5    0 232.9G  0 disk
> ├─nvme0n1p1                                   259:6    0   512M  0
> part  (/boot/efi)
> ├─nvme0n1p2                                   259:7    0     1G  0 part  (/boot)
> └─nvme0n1p3                                   259:8    0 231.4G  0 part (btrfs)
>
> I have the following subvols:
> arch: used for / when booting arch
> jwhendy: used for /home/jwhendy on arch
> vault: shared data between distros on /mnt/vault
> bionic: root when booting ubuntu bionic
>
> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>
> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
>
> If these are of interested, here are reddit threads where I posted the
> issue and was referred here.
> 1) https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
> 2)  https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>
> It has been suggested this is a hardware issue. I've already ordered a
> replacement m2.sata, but for sanity it would be great to know
> definitively this was the case. If anything stands out above that
> could indicate I'm not setup properly re. btrfs, that would also be
> fantastic so I don't repeat the issue!
>
> The only thing I've stumbled on is that I have been mounting with
> rd.luks.options=discard and that manually running fstrim is preferred.
>
>
> Many thanks for any input/suggestions,
> John

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-05-06  4:37 ` John Hendy
@ 2020-05-06  6:13   ` Qu Wenruo
  2020-05-06 15:29     ` John Hendy
  0 siblings, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2020-05-06  6:13 UTC (permalink / raw)
  To: John Hendy, Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 5150 bytes --]



On 2020/5/6 at 12:37 PM, John Hendy wrote:
> Greetings,
> 
> 
> I'm following up to the below as this just occurred again. I think
> there is something odd between btrfs behavior and browsers. Since the
> last time, I was able to recover my drive, and have disabled
> continuous trim (and have not manually trimmed for that matter).
> 
> I've switched to firefox almost exclusively (I can think of a handful
> of times using it), but the problem was related chromium cache and the
> problem this time was the file:
> 
> .cache/mozilla/firefox/tqxxilph.default-release/cache2/entries/D8FD7600C30A3A68D18D98B233F9C5DD3F7DDAD0
> 
> In this particular instance, I suspended my computer, and resumed to
> find it read only. I opened it to reboot into windows, finding I
> couldn't save my open file in emacs.
> 
> The dmesg is here: https://pastebin.com/B8nUkYzB

The reason is the write-time tree checker; I'm surprised it got triggered:

[68515.682152] BTRFS critical (device dm-0): corrupt leaf: root=257
block=156161818624 slot=22 ino=1312604, name hash mismatch with key,
have 0x000000007a63c07f expect 0x00000000006820bc

Unfortunately, the dump included in the dmesg doesn't contain the file
name, so I'm not sure which file is the culprit, but it does give the
inode number: 1312604.


But considering this is from the write-time tree checker, not the
read-time tree checker, it means your on-disk data was not corrupted
from the very beginning; more likely your RAM (maybe related to
suspend?) is causing the problem.

> 
> The file above was found uncorrectable via btrfs scrub, but after I
> manually deleted it the scrub succeeded on the second try with no
> errors.

Unfortunately, it may not be related to that file, unless that file has
inode number 1312604.

That is to say, this is a completely different case.

Given your previous csum corruption, have you considered a full
memtest?
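
A boot-time pass with memtest86+ is the most thorough option; if you
want a quicker (though less complete) userspace check from the running
system, and assuming the memtester package is installed, something like
this also exercises a chunk of RAM:

$ sudo memtester 2048 3   # lock and test 2048 MB of RAM, 3 passes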

Thanks,
Qu

> 
> $ btrfs --version
> btrfs-progs v5.6
> 
> $ uname -a
> Linux voltaur 5.6.10-arch1-1 #1 SMP PREEMPT Sat, 02 May 2020 19:11:54
> +0000 x86_64 GNU/Linux
> 
> I don't know how to reproduce this at all, but it's always been
> browser cache related. There are similar issues out there, but no
> obvious pattern/solutions.
> - https://forum.manjaro.org/t/root-and-home-become-read-only/46944
> - https://bbs.archlinux.org/viewtopic.php?id=224243
> 
> Anything else to check on why this might occur?
> 
> Best regards,
> John
> 
> 
> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com> wrote:
>>
>> Greetings,
>>
>> I've had this issue occur twice, once ~1mo ago and once a couple of
>> weeks ago. Chromium suddenly quit on me, and when trying to start it
>> again, it complained about a lock file in ~. I tried to delete it
>> manually and was informed I was on a read-only fs! I ended up biting
>> the bullet and re-installing linux due to the number of dead end
>> threads and slow response rates on diagnosing these issues, and the
>> issue occurred again shortly after.
>>
>> $ uname -a
>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020 16:38:40
>> +0000 x86_64 GNU/Linux
>>
>> $ btrfs --version
>> btrfs-progs v5.4
>>
>> $ btrfs fi df /mnt/misc/ # full device; normally would be mounting a subvol on /
>> Data, single: total=114.01GiB, used=80.88GiB
>> System, single: total=32.00MiB, used=16.00KiB
>> Metadata, single: total=2.01GiB, used=769.61MiB
>> GlobalReserve, single: total=140.73MiB, used=0.00B
>>
>> This is a single device, no RAID, not on a VM. HP Zbook 15.
>> nvme0n1                                       259:5    0 232.9G  0 disk
>> ├─nvme0n1p1                                   259:6    0   512M  0
>> part  (/boot/efi)
>> ├─nvme0n1p2                                   259:7    0     1G  0 part  (/boot)
>> └─nvme0n1p3                                   259:8    0 231.4G  0 part (btrfs)
>>
>> I have the following subvols:
>> arch: used for / when booting arch
>> jwhendy: used for /home/jwhendy on arch
>> vault: shared data between distros on /mnt/vault
>> bionic: root when booting ubuntu bionic
>>
>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>>
>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
>>
>> If these are of interested, here are reddit threads where I posted the
>> issue and was referred here.
>> 1) https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
>> 2)  https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>>
>> It has been suggested this is a hardware issue. I've already ordered a
>> replacement m2.sata, but for sanity it would be great to know
>> definitively this was the case. If anything stands out above that
>> could indicate I'm not setup properly re. btrfs, that would also be
>> fantastic so I don't repeat the issue!
>>
>> The only thing I've stumbled on is that I have been mounting with
>> rd.luks.options=discard and that manually running fstrim is preferred.
>>
>>
>> Many thanks for any input/suggestions,
>> John


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-05-06  6:13   ` Qu Wenruo
@ 2020-05-06 15:29     ` John Hendy
  2020-05-06 22:50       ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: John Hendy @ 2020-05-06 15:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Wed, May 6, 2020 at 1:13 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/5/6 下午12:37, John Hendy wrote:
> > Greetings,
> >
> >
> > I'm following up to the below as this just occurred again. I think
> > there is something odd between btrfs behavior and browsers. Since the
> > last time, I was able to recover my drive, and have disabled
> > continuous trim (and have not manually trimmed for that matter).
> >
> > I've switched to firefox almost exclusively (I can think of a handful
> > of times using it), but the problem was related chromium cache and the
> > problem this time was the file:
> >
> > .cache/mozilla/firefox/tqxxilph.default-release/cache2/entries/D8FD7600C30A3A68D18D98B233F9C5DD3F7DDAD0
> >
> > In this particular instance, I suspended my computer, and resumed to
> > find it read only. I opened it to reboot into windows, finding I
> > couldn't save my open file in emacs.
> >
> > The dmesg is here: https://pastebin.com/B8nUkYzB
>
> The reason is write time tree checker, surprised it get triggered:
>
> [68515.682152] BTRFS critical (device dm-0): corrupt leaf: root=257
> block=156161818624 slot=22 ino=1312604, name hash mismatch with key,
> have 0x000000007a63c07f expect 0x00000000006820bc
>
> In the dump included in the dmesg, unfortunately it doesn't include the
> file name so I'm not sure which one is the culprit, but it has the inode
> number, 1312604.

Thanks for the input. The inode resolves to this path, which is the
directory containing the file that btrfs scrub flagged.

$ sudo btrfs inspect-internal inode-resolve 1312604 /home/jwhendy
/home/jwhendy/.cache/mozilla/firefox/tqxxilph.default-release/cache2/entries
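
A quick cross-check of the same mapping is to ask ls for the directory's
inode number directly; it should come back as 1312604:

$ ls -id /home/jwhendy/.cache/mozilla/firefox/tqxxilph.default-release/cache2/entries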

> But consider this is from write time tree checker, not from read time
> tree checker, this means, it's not your on-disk data corrupted from the
> very beginning, but possibly your RAM (maybe related to suspension?)
> causing the problem.

Interesting. I suspend all the time and have never encountered this,
but I do recall that the last thing I did was send an email (in
firefox) and quickly close my laptop afterward.

> >
> > The file above was found uncorrectable via btrfs scrub, but after I
> > manually deleted it the scrub succeeded on the second try with no
> > errors.
>
> Unfortunately, it may not related to that file, unless that file has the
> inode number 1312604.
>
> That to say, this is a completely different case.
>
> Considering your previous csum corruption, have you considered a full
> memtest?

I can certainly do this. At what point could hardware be ruled out so
that something else can be pursued or troubleshot? Or is this a lost
cause to try to understand?

Many thanks,
John

> Thanks,
> Qu
>
> >
> > $ btrfs --version
> > btrfs-progs v5.6
> >
> > $ uname -a
> > Linux voltaur 5.6.10-arch1-1 #1 SMP PREEMPT Sat, 02 May 2020 19:11:54
> > +0000 x86_64 GNU/Linux
> >
> > I don't know how to reproduce this at all, but it's always been
> > browser cache related. There are similar issues out there, but no
> > obvious pattern/solutions.
> > - https://forum.manjaro.org/t/root-and-home-become-read-only/46944
> > - https://bbs.archlinux.org/viewtopic.php?id=224243
> >
> > Anything else to check on why this might occur?
> >
> > Best regards,
> > John
> >
> >
> > On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com> wrote:
> >>
> >> Greetings,
> >>
> >> I've had this issue occur twice, once ~1mo ago and once a couple of
> >> weeks ago. Chromium suddenly quit on me, and when trying to start it
> >> again, it complained about a lock file in ~. I tried to delete it
> >> manually and was informed I was on a read-only fs! I ended up biting
> >> the bullet and re-installing linux due to the number of dead end
> >> threads and slow response rates on diagnosing these issues, and the
> >> issue occurred again shortly after.
> >>
> >> $ uname -a
> >> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020 16:38:40
> >> +0000 x86_64 GNU/Linux
> >>
> >> $ btrfs --version
> >> btrfs-progs v5.4
> >>
> >> $ btrfs fi df /mnt/misc/ # full device; normally would be mounting a subvol on /
> >> Data, single: total=114.01GiB, used=80.88GiB
> >> System, single: total=32.00MiB, used=16.00KiB
> >> Metadata, single: total=2.01GiB, used=769.61MiB
> >> GlobalReserve, single: total=140.73MiB, used=0.00B
> >>
> >> This is a single device, no RAID, not on a VM. HP Zbook 15.
> >> nvme0n1                                       259:5    0 232.9G  0 disk
> >> ├─nvme0n1p1                                   259:6    0   512M  0
> >> part  (/boot/efi)
> >> ├─nvme0n1p2                                   259:7    0     1G  0 part  (/boot)
> >> └─nvme0n1p3                                   259:8    0 231.4G  0 part (btrfs)
> >>
> >> I have the following subvols:
> >> arch: used for / when booting arch
> >> jwhendy: used for /home/jwhendy on arch
> >> vault: shared data between distros on /mnt/vault
> >> bionic: root when booting ubuntu bionic
> >>
> >> nvme0n1p3 is encrypted with dm-crypt/LUKS.
> >>
> >> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
> >>
> >> If these are of interested, here are reddit threads where I posted the
> >> issue and was referred here.
> >> 1) https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
> >> 2)  https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
> >>
> >> It has been suggested this is a hardware issue. I've already ordered a
> >> replacement m2.sata, but for sanity it would be great to know
> >> definitively this was the case. If anything stands out above that
> >> could indicate I'm not setup properly re. btrfs, that would also be
> >> fantastic so I don't repeat the issue!
> >>
> >> The only thing I've stumbled on is that I have been mounting with
> >> rd.luks.options=discard and that manually running fstrim is preferred.
> >>
> >>
> >> Many thanks for any input/suggestions,
> >> John
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: btrfs root fs started remounting ro
  2020-05-06 15:29     ` John Hendy
@ 2020-05-06 22:50       ` Qu Wenruo
  0 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2020-05-06 22:50 UTC (permalink / raw)
  To: John Hendy; +Cc: Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 6440 bytes --]



On 2020/5/6 at 11:29 PM, John Hendy wrote:
> On Wed, May 6, 2020 at 1:13 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/5/6 下午12:37, John Hendy wrote:
>>> Greetings,
>>>
>>>
>>> I'm following up to the below as this just occurred again. I think
>>> there is something odd between btrfs behavior and browsers. Since the
>>> last time, I was able to recover my drive, and have disabled
>>> continuous trim (and have not manually trimmed for that matter).
>>>
>>> I've switched to firefox almost exclusively (I can think of a handful
>>> of times using it), but the problem was related chromium cache and the
>>> problem this time was the file:
>>>
>>> .cache/mozilla/firefox/tqxxilph.default-release/cache2/entries/D8FD7600C30A3A68D18D98B233F9C5DD3F7DDAD0
>>>
>>> In this particular instance, I suspended my computer, and resumed to
>>> find it read only. I opened it to reboot into windows, finding I
>>> couldn't save my open file in emacs.
>>>
>>> The dmesg is here: https://pastebin.com/B8nUkYzB
>>
>> The reason is write time tree checker, surprised it get triggered:
>>
>> [68515.682152] BTRFS critical (device dm-0): corrupt leaf: root=257
>> block=156161818624 slot=22 ino=1312604, name hash mismatch with key,
>> have 0x000000007a63c07f expect 0x00000000006820bc
>>
>> In the dump included in the dmesg, unfortunately it doesn't include the
>> file name so I'm not sure which one is the culprit, but it has the inode
>> number, 1312604.
> 
> Thanks for the input. The inode resolves to this path, but it's the
> same base path as the problematic file for btrfs scrub.
> 
> $ sudo btrfs inspect-internal inode-resolve 1312604 /home/jwhendy
> /home/jwhendy/.cache/mozilla/firefox/tqxxilph.default-release/cache2/entries
> 
>> But consider this is from write time tree checker, not from read time
>> tree checker, this means, it's not your on-disk data corrupted from the
>> very beginning, but possibly your RAM (maybe related to suspension?)
>> causing the problem.
> 
> Interesting. I suspend al the time and have never encountered this,
> but I do recall sending an email (in firefox) and quickly closing my
> computer afterward as the last thing I did.
> 
>>>
>>> The file above was found uncorrectable via btrfs scrub, but after I
>>> manually deleted it the scrub succeeded on the second try with no
>>> errors.
>>
>> Unfortunately, it may not related to that file, unless that file has the
>> inode number 1312604.
>>
>> That to say, this is a completely different case.
>>
>> Considering your previous csum corruption, have you considered a full
>> memtest?
> 
> I can certainly do this. At what point could hardware be ruled out and
> something else pursued or troubleshot? Or is this a lost cause to try
> and understand?

If a full memtest run finishes without problems, then we're hitting
something that should be impossible.

As there shouldn't be anything else that can cause a write-time tree
checker error, especially a name hash mismatch.
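
Since the write-time checker rejects the block before it reaches the
disk, once the memtest is done it may be worth confirming that nothing
persisted, e.g. from the install USB with the filesystem unmounted
(the mapper name below is a placeholder for your opened LUKS device):

$ sudo btrfs check --readonly /dev/mapper/<luks-mapping>
$ sudo btrfs device stats /home/jwhendy   # after mounting again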

Thanks,
Qu

> 
> Many thanks,
> John
> 
>> Thanks,
>> Qu
>>
>>>
>>> $ btrfs --version
>>> btrfs-progs v5.6
>>>
>>> $ uname -a
>>> Linux voltaur 5.6.10-arch1-1 #1 SMP PREEMPT Sat, 02 May 2020 19:11:54
>>> +0000 x86_64 GNU/Linux
>>>
>>> I don't know how to reproduce this at all, but it's always been
>>> browser cache related. There are similar issues out there, but no
>>> obvious pattern/solutions.
>>> - https://forum.manjaro.org/t/root-and-home-become-read-only/46944
>>> - https://bbs.archlinux.org/viewtopic.php?id=224243
>>>
>>> Anything else to check on why this might occur?
>>>
>>> Best regards,
>>> John
>>>
>>>
>>> On Wed, Feb 5, 2020 at 10:01 AM John Hendy <jw.hendy@gmail.com> wrote:
>>>>
>>>> Greetings,
>>>>
>>>> I've had this issue occur twice, once ~1mo ago and once a couple of
>>>> weeks ago. Chromium suddenly quit on me, and when trying to start it
>>>> again, it complained about a lock file in ~. I tried to delete it
>>>> manually and was informed I was on a read-only fs! I ended up biting
>>>> the bullet and re-installing linux due to the number of dead end
>>>> threads and slow response rates on diagnosing these issues, and the
>>>> issue occurred again shortly after.
>>>>
>>>> $ uname -a
>>>> Linux whammy 5.5.1-arch1-1 #1 SMP PREEMPT Sat, 01 Feb 2020 16:38:40
>>>> +0000 x86_64 GNU/Linux
>>>>
>>>> $ btrfs --version
>>>> btrfs-progs v5.4
>>>>
>>>> $ btrfs fi df /mnt/misc/ # full device; normally would be mounting a subvol on /
>>>> Data, single: total=114.01GiB, used=80.88GiB
>>>> System, single: total=32.00MiB, used=16.00KiB
>>>> Metadata, single: total=2.01GiB, used=769.61MiB
>>>> GlobalReserve, single: total=140.73MiB, used=0.00B
>>>>
>>>> This is a single device, no RAID, not on a VM. HP Zbook 15.
>>>> nvme0n1                                       259:5    0 232.9G  0 disk
>>>> ├─nvme0n1p1                                   259:6    0   512M  0
>>>> part  (/boot/efi)
>>>> ├─nvme0n1p2                                   259:7    0     1G  0 part  (/boot)
>>>> └─nvme0n1p3                                   259:8    0 231.4G  0 part (btrfs)
>>>>
>>>> I have the following subvols:
>>>> arch: used for / when booting arch
>>>> jwhendy: used for /home/jwhendy on arch
>>>> vault: shared data between distros on /mnt/vault
>>>> bionic: root when booting ubuntu bionic
>>>>
>>>> nvme0n1p3 is encrypted with dm-crypt/LUKS.
>>>>
>>>> dmesg, smartctl, btrfs check, and btrfs dev stats attached.
>>>>
>>>> If these are of interested, here are reddit threads where I posted the
>>>> issue and was referred here.
>>>> 1) https://www.reddit.com/r/btrfs/comments/ejqhyq/any_hope_of_recovering_from_various_errors_root/
>>>> 2)  https://www.reddit.com/r/btrfs/comments/erh0f6/second_time_btrfs_root_started_remounting_as_ro/
>>>>
>>>> It has been suggested this is a hardware issue. I've already ordered a
>>>> replacement m2.sata, but for sanity it would be great to know
>>>> definitively this was the case. If anything stands out above that
>>>> could indicate I'm not setup properly re. btrfs, that would also be
>>>> fantastic so I don't repeat the issue!
>>>>
>>>> The only thing I've stumbled on is that I have been mounting with
>>>> rd.luks.options=discard and that manually running fstrim is preferred.
>>>>
>>>>
>>>> Many thanks for any input/suggestions,
>>>> John
>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-05-06 22:50 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CA+M2ft9zjGm7XJw1BUm364AMqGSd3a8QgsvQDCWz317qjP=o8g@mail.gmail.com>
2020-02-07 17:52 ` btrfs root fs started remounting ro John Hendy
2020-02-07 20:21   ` Chris Murphy
2020-02-07 22:31     ` John Hendy
2020-02-07 23:17       ` Chris Murphy
2020-02-08  4:37         ` John Hendy
2020-02-07 23:42   ` Qu Wenruo
2020-02-08  4:48     ` John Hendy
2020-02-08  7:29       ` Qu Wenruo
2020-02-08 19:56         ` John Hendy
     [not found]           ` <CA+M2ft9dcMKKQstZVcGQ=9MREbfhPF5GG=xoMoh5Aq8MK9P8wA@mail.gmail.com>
2020-02-08 23:56             ` Qu Wenruo
2020-02-09  0:51               ` John Hendy
2020-02-09  0:59                 ` John Hendy
2020-02-09  1:09                   ` Qu Wenruo
2020-02-09  1:20                     ` John Hendy
2020-02-09  1:24                       ` Qu Wenruo
2020-02-09  1:49                         ` John Hendy
2020-02-09  1:07                 ` Qu Wenruo
2020-02-09  4:10                   ` John Hendy
2020-02-09  5:01                     ` Qu Wenruo
2020-02-09  3:46           ` Chris Murphy
2020-05-06  4:37 ` John Hendy
2020-05-06  6:13   ` Qu Wenruo
2020-05-06 15:29     ` John Hendy
2020-05-06 22:50       ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).