All of lore.kernel.org
* How to heal this btrfs fi corruption?
@ 2019-12-19 20:00 Ralf Zerres
  2019-12-19 20:59 ` Chris Murphy
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Ralf Zerres @ 2019-12-19 20:00 UTC (permalink / raw)
  To: 'linux-btrfs@vger.kernel.org'

Dear list,

at a customer site I can't mount a given btrfs device in rw mode.
This is production data; I do have a backup and managed to mount the filesystem in ro mode, so I copied out the relevant data.
Having said this, if btrfs check --repair can't heal the situation, I could reformat the filesystem and start all over.
But I would prefer to save the time and take the healing as proof of the "production ready" status of btrfs-progs.

Here are the details:

kernel: 5.2.2 (Ubuntu 18.04.3)
btrfs-progs: 5.2.1
HBA: DELL Perc
# storcli /c0/v0
# 0/0   RAID5 Optl  RW     Yes     RWBD  -   OFF 7.274 TB SSD-Data
#btrfs fi show /dev/sdX
#Label: 'Data-Ssd'  uuid: <my uuid>
#        Total devices 1 FS bytes used 7.12TiB
#        devid    1 size 7.27TiB used 7.27TiB path /dev/<mydev>

What happened:
Customer filled up the filesystem (lots of snapshots in a couple of subvolumes).
The system was running kernel 4.15 and btrfs-progs 4.15. I updated the kernel and btrfs-progs on the assumption
that more recent mainline tools could do a better job, since they have seen lots of fixes.

1) As a first step, I ran

# btrfs check --mode lowmem --progress /dev/<mydev> 

got extent mismatches and wrong extent CRCs

2) As a second step i did try to mount in recovery mode

# mount -t btrfs -o defaults,recovery,skip_balance /dev/<mydev> /mnt

I included skip_balance, since there might be an unfinished balance run. But this didn't work out.
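For the record, the comma-separated option list must not contain spaces; as written above, mount(8) would treat "recovery" and "skip_balance" as extra non-option arguments. On newer kernels, recovery is also a deprecated alias for usebackuproot. A corrected retry might look like this (the device path is a placeholder):

```shell
# The option list is one token; spaces in the original command break it
# apart. "recovery" still works as a deprecated alias, but on kernels
# since ~4.6 the documented name is "usebackuproot".
DEV=/dev/sdX   # placeholder for the real device

mount -t btrfs -o recovery,skip_balance "$DEV" /mnt

# If the mount still fails, the first btrfs error in dmesg is the one
# that matters:
dmesg | grep -i btrfs | head -n 20
```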

3) As a third step, I got it mounted in ro mode

# mount -t btrfs -o ro /dev/<mydev> /mnt

Here is the data reported by usage:

# btrfs fi usage /mnt
# Overall:
#    Device size:                   7.27TiB
#    Device allocated:              7.27TiB
#    Device unallocated:            1.00MiB
#    Device missing:                  0.00B
#    Used:                          7.13TiB
#    Free (estimated):            134.13GiB      (min: 134.13GiB)
#    Data ratio:                       1.00
#    Metadata ratio:                   2.00
#    Global reserve:              512.00MiB      (used: 0.00B)
#
# Data,single: Size:7.23TiB, Used:7.10TiB
#   /dev/<mydev>        7.23TiB
#
# Metadata,DUP: Size:21.50GiB, Used:14.31GiB
#   /dev/<mydev>       43.00GiB
#
# System,DUP: Size:8.00MiB, Used:864.00KiB
#   /dev/<mydev>       16.00MiB

# Unallocated:
#   /dev/<mydev>        1.00MiB

Obviously, totally filled up.
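A quick way to confirm that state from a script; this is just a sketch that parses the `btrfs fi usage` output quoted above:

```shell
# Extract the "Device unallocated" figure from `btrfs fi usage` output.
# Fed the output above, it prints "1.00MiB": the allocator has no room
# left to create the new metadata chunk that a repair would need.
btrfs_usage_unallocated() {
    awk -F: '/Device unallocated/ { gsub(/^[ \t]+/, "", $2); print $2 }'
}
# Usage: btrfs fi usage /mnt | btrfs_usage_unallocated
```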
At that time i copied out all relevant data - you never know ... Finished!

Then I tried to unmount, but that went nowhere and forced a reboot.


4) As a fourth step, I tried to repair it

# btrfs check --mode lowmem --progress --repair /dev/<mydev>
# enabling repair mode
# WARNING: low-memory mode repair support is only partial
# Opening filesystem to check...
# Checking filesystem on /dev/<mydev>
# UUID: <my UUID>
# [1/7] checking root items                      (0:00:33 elapsed, 20853512 items checked)
# Fixed 0 roots.
# ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: # 28, have: 34
# ERROR: fail to allocate new chunk No space left on device
# Try to exclude all metadata blcoks and extents, it may be slow
# Delete backref in extent [1988733435904 134217728]07:16 elapsed, 40435 items checked)
# ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 27, have: 34
# Delete backref in extent [1988733435904 134217728]
# ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 26, have: 34
# ERROR: commit_root already set when starting transaction
# ERROR: fail to start transaction: Invalid argument
# ERROR: extent[2017321811968, 134217728] referencer count mismatch (root: 261, owner: 287, offset: 2281701376) wanted: 3215, have: 3319
# ERROR: commit_root already set when starting transaction
# ERROR: fail to start transaction Invalid argument

This ends with a core-dump.

Last but not least, my questions:

I'm not experienced enough to solve this issue myself and need your help.
Is it worth the time and effort to solve this issue? Developers might be interested in having a real-life testbed?
Do you need any further info that would help to solve the issue?


Best regards
Ralf






^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to heal this btrfs fi corruption?
  2019-12-19 20:00 How to heal this btrfs fi corruption? Ralf Zerres
@ 2019-12-19 20:59 ` Chris Murphy
  2019-12-19 21:25 ` Martin Steigerwald
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2019-12-19 20:59 UTC (permalink / raw)
  To: Ralf Zerres; +Cc: linux-btrfs, Qu Wenruo

On Thu, Dec 19, 2019 at 1:07 PM Ralf Zerres <Ralf.Zerres@networkx.de> wrote:
>
> Dear list,
>
> at a customer site I can't mount a given btrfs device in rw mode.
> This is production data; I do have a backup and managed to mount the filesystem in ro mode, so I copied out the relevant data.
> Having said this, if btrfs check --repair can't heal the situation, I could reformat the filesystem and start all over.
> But I would prefer to save the time and take the healing as proof of the "production ready" status of btrfs-progs.
>
> Here are the details:
>
> kernel: 5.2.2 (Ubuntu 18.04.3)

Unfortunate that these versions are still easily obtained. 5.2.0 -
5.2.14 had a pernicious bug. I can't tell if it applies in your case,
though.

Btrfs: fix unwritten extent buffers and hangs on future writeback attempts
https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdmanana@kernel.org/T/#u

The bug is fixed since 5.2.15.



> btrfs-progs: 5.2.1
> # btrfs check --mode lowmem --progress --repair /dev/<mydev>
> # enabling repair mode
> # WARNING: low-memory mode repair support is only partial
> # Opening filesystem to check...
> # Checking filesystem on /dev/<mydev>
> # UUID: <my UUID>
> # [1/7] checking root items                      (0:00:33 elapsed, 20853512 items checked)
> # Fixed 0 roots.
> # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: # 28, have: 34
> # ERROR: fail to allocate new chunk No space left on device
> # Try to exclude all metadata blcoks and extents, it may be slow
> # Delete backref in extent [1988733435904 134217728]07:16 elapsed, 40435 items checked)
> # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 27, have: 34
> # Delete backref in extent [1988733435904 134217728]
> # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 26, have: 34
> # ERROR: commit_root already set when starting transaction
> # ERROR: fail to start transaction: Invalid argument
> # ERROR: extent[2017321811968, 134217728] referencer count mismatch (root: 261, owner: 287, offset: 2281701376) wanted: 3215, have: 3319
> # ERROR: commit_root already set when starting transaction
> # ERROR: fail to start transaction Invalid argument
>
> This ends with a core-dump.

Well, it's easy to say a crash is a bug, and I'm also not sure if it's
fixed in btrfs-progs 5.4. But it might help isolate the problem if you
attach dmesg. At least the good news is there's a backup; but creating
a new volume and restoring this much data will be a little tedious.


--
Chris Murphy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to heal this btrfs fi corruption?
  2019-12-19 20:00 How to heal this btrfs fi corruption? Ralf Zerres
  2019-12-19 20:59 ` Chris Murphy
@ 2019-12-19 21:25 ` Martin Steigerwald
  2019-12-19 21:43   ` Chris Murphy
  2019-12-20  6:05 ` Qu Wenruo
       [not found] ` <CAK-xaQbGiO=b3XFS929DFcG=B3fsuT7AAFKLSaECaXbgUyZqzw@mail.gmail.com>
  3 siblings, 1 reply; 9+ messages in thread
From: Martin Steigerwald @ 2019-12-19 21:25 UTC (permalink / raw)
  To: Ralf Zerres; +Cc: 'linux-btrfs@vger.kernel.org'

Hi Ralf.

Ralf Zerres - 19.12.19, 21:00:12 CET:
> at a customer site I can't mount a given btrfs device in rw mode.
> This is production data; I do have a backup and managed to mount the
> filesystem in ro mode, so I copied out the relevant data. Having said
> this, if btrfs check --repair can't heal the situation, I could
> reformat the filesystem and start all over. But I would prefer to save
> the time and take the healing as proof of the "production ready"
> status of btrfs-progs.
> 
> Here are the details:
> 
> kernel: 5.2.2 (Ubuntu 18.04.3)
> btrfs-progs: 5.2.1
[…]
> 4) As a fourth step, I tried to repair it
> 
> # btrfs check --mode lowmem --progress --repair /dev/<mydev>
> # enabling repair mode
> # WARNING: low-memory mode repair support is only partial
> # Opening filesystem to check...
> # Checking filesystem on /dev/<mydev>
> # UUID: <my UUID>
> # [1/7] checking root items                      (0:00:33 elapsed,
> 20853512 items checked) 
> # Fixed 0 roots.
> # ERROR: extent[1988733435904, 134217728] referencer count mismatch
> (root: 261, owner: 286, offset: 5905580032) wanted: # 28, have: 34 
> #  ERROR: fail to allocate new chunk No space left on device

Maybe the filesystem check failed due to that error?

Just guesswork, though!

You could try adding a device to the filesystem and then check again. It 
could even be a good (!) USB stick. This way BTRFS would have some 
additional space and maybe 'btrfs check' would complete.

May or may not work, no idea. But I noticed that the check itself 
mentioned an out of space condition so I thought I'd mention it.
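(For anyone following this suggestion later: the usual shape of the workaround is below. Note that btrfs device add needs a writable mount, which may be exactly what is unavailable here; device names are placeholders.)

```shell
# Temporarily add a small spare device so the chunk allocator gets some
# unallocated space, then remove it again once things are healthy
# (removal migrates any data back off the spare).
btrfs device add /dev/sdY /mnt
# ... re-run the check or balance here ...
btrfs device remove /dev/sdY /mnt
```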

Best of success,
-- 
Martin



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to heal this btrfs fi corruption?
  2019-12-19 21:25 ` Martin Steigerwald
@ 2019-12-19 21:43   ` Chris Murphy
  2019-12-19 22:34     ` Remi Gauvin
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2019-12-19 21:43 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Ralf Zerres, linux-btrfs

On Thu, Dec 19, 2019 at 2:35 PM Martin Steigerwald <martin@lichtvoll.de> wrote:
>
> Hi Ralf.
>
> Ralf Zerres - 19.12.19, 21:00:12 CET:
> > at a customer site I can't mount a given btrfs device in rw mode.
> > This is production data; I do have a backup and managed to mount the
> > filesystem in ro mode, so I copied out the relevant data. Having said
> > this, if btrfs check --repair can't heal the situation, I could
> > reformat the filesystem and start all over. But I would prefer to save
> > the time and take the healing as proof of the "production ready"
> > status of btrfs-progs.
> >
> > Here are the details:
> >
> > kernel: 5.2.2 (Ubuntu 18.04.3)
> > btrfs-progs: 5.2.1
> […]
> > 4) As a fourth step, I tried to repair it
> >
> > # btrfs check --mode lowmem --progress --repair /dev/<mydev>
> > # enabling repair mode
> > # WARNING: low-memory mode repair support is only partial
> > # Opening filesystem to check...
> > # Checking filesystem on /dev/<mydev>
> > # UUID: <my UUID>
> > # [1/7] checking root items                      (0:00:33 elapsed,
> > 20853512 items checked)
> > # Fixed 0 roots.
> > # ERROR: extent[1988733435904, 134217728] referencer count mismatch
> > (root: 261, owner: 286, offset: 5905580032) wanted: # 28, have: 34
> > #  ERROR: fail to allocate new chunk No space left on device
>
> Maybe the filesystem check failed due to that error?
>
> Just guesswork, though!
>
> You could try adding a device to the filesystem and then check again. It
> could even be a good (!) USB stick. This way BTRFS would have some
> additional space and maybe 'btrfs check' would complete.
>
> May or may not work, no idea. But I noticed that the check itself
> mentioned an out of space condition so I thought I'd mention it.

It's bogus.

> #    Free (estimated):            134.13GiB      (min: 134.13GiB)

I don't recommend adding another device until the problem is better
understood. Hopefully a developer can respond.

It might be helpful to upgrade to a 5.3 or 5.4 kernel, which has more
consistency checks. If there's a call trace produced at mount or
during runtime it might give a developer useful information.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to heal this btrfs fi corruption?
  2019-12-19 21:43   ` Chris Murphy
@ 2019-12-19 22:34     ` Remi Gauvin
  2019-12-19 23:18       ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Remi Gauvin @ 2019-12-19 22:34 UTC (permalink / raw)
  To: Chris Murphy, Martin Steigerwald; +Cc: Ralf Zerres, linux-btrfs



On 2019-12-19 4:43 p.m., Chris Murphy wrote:
> It's bogus.
>
>> #    Free (estimated):            134.13GiB      (min: 134.13GiB)


Perhaps not.

Lots of free space, but it's *all* allocated.


#    Device size:                   7.27TiB
#    Device allocated:              7.27TiB

# Metadata,DUP: Size:21.50GiB, Used:14.31GiB
#   /dev/<mydev>       43.00GiB
#
# System,DUP: Size:8.00MiB, Used:864.00KiB
#   /dev/<mydev>       16.00MiB

# Unallocated:
#   /dev/<mydev>        1.00MiB



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to heal this btrfs fi corruption?
  2019-12-19 22:34     ` Remi Gauvin
@ 2019-12-19 23:18       ` Chris Murphy
  0 siblings, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2019-12-19 23:18 UTC (permalink / raw)
  To: Remi Gauvin, Btrfs BTRFS

On Thu, Dec 19, 2019 at 3:34 PM Remi Gauvin <remi@georgianit.com> wrote:
>
> On 2019-12-19 4:43 p.m., Chris Murphy wrote:
> > It's bogus.
> >
> >> #    Free (estimated):            134.13GiB      (min: 134.13GiB)
>
>
> Perhaps not.
>
> Lots of free space, but it's *all* allocated.
>
>
> #    Device size:                   7.27TiB
> #    Device allocated:              7.27TiB
>
> # Metadata,DUP: Size:21.50GiB, Used:14.31GiB
> #   /dev/<mydev>       43.00GiB
> #
> # System,DUP: Size:8.00MiB, Used:864.00KiB
> #   /dev/<mydev>       16.00MiB
>
> # Unallocated:
> #   /dev/<mydev>        1.00MiB
>

True. The more recent cases of enospc seem to happen with plenty of
unused space still available in allocated block groups, as is the case here.

It's possible a newer kernel will produce more helpful error reporting;
additionally, mount with the enospc_debug mount option.
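Roughly like this (a sketch; the device path is a placeholder):

```shell
# enospc_debug makes btrfs dump block-group / space-info details to the
# kernel log when an ENOSPC condition is hit; that log is what a
# developer would want attached.
mount -t btrfs -o ro,enospc_debug /dev/sdX /mnt
dmesg | tail -n 100 > enospc-debug.log
```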


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to heal this btrfs fi corruption?
  2019-12-19 20:00 How to heal this btrfs fi corruption? Ralf Zerres
  2019-12-19 20:59 ` Chris Murphy
  2019-12-19 21:25 ` Martin Steigerwald
@ 2019-12-20  6:05 ` Qu Wenruo
  2019-12-20 11:36   ` Ralf Zerres
       [not found] ` <CAK-xaQbGiO=b3XFS929DFcG=B3fsuT7AAFKLSaECaXbgUyZqzw@mail.gmail.com>
  3 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2019-12-20  6:05 UTC (permalink / raw)
  To: Ralf Zerres, 'linux-btrfs@vger.kernel.org'





On 2019/12/20 4:00 AM, Ralf Zerres wrote:
> Dear list,
> 
> at a customer site I can't mount a given btrfs device in rw mode.
> This is production data; I do have a backup and managed to mount the filesystem in ro mode, so I copied out the relevant data.
> Having said this, if btrfs check --repair can't heal the situation, I could reformat the filesystem and start all over.
> But I would prefer to save the time and take the healing as proof of the "production ready" status of btrfs-progs.
> 
> Here are the details:
> 
> kernel: 5.2.2 (Ubuntu 18.04.3)
> btrfs-progs: 5.2.1
> HBA: DELL Perc
> # storcli /c0/v0
> # 0/0   RAID5 Optl  RW     Yes     RWBD  -   OFF 7.274 TB SSD-Data
> #btrfs fi show /dev/sdX
> #Label: 'Data-Ssd'  uuid: <my uuid>
> #        Total devices 1 FS bytes used 7.12TiB
> #        devid    1 size 7.27TiB used 7.27TiB path /dev/<mydev>
> 
> What happened:
> Customer filled up the filesystem (lots of snapshots in a couple of subvolumes).
> The system was running kernel 4.15 and btrfs-progs 4.15. I updated the kernel and btrfs-progs on the assumption
> that more recent mainline tools could do a better job, since they have seen lots of fixes.
> 
> 1) As a first step, I ran
> 
> # btrfs check --mode lowmem --progress /dev/<mydev>

The initial report would help a lot to determine the root cause of the
corruption in the first place.

But if btrfs check (both modes) reports errors, you'd better not assume
--repair can do a better job.

Currently btrfs check is only good at finding problems, not really
fixing them.

There are too many things to consider when doing a repair, so --repair
is far from "production ready".
That's why in the v5.4 progs we added an extra wait time for --repair.
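(For the record, a safe pattern is to run both check modes read-only first and keep the logs; --readonly is the default, but spelling it out makes the intent clear. A sketch, with the device path as a placeholder:)

```shell
# Read-only checks never modify the filesystem. The two modes are
# separate implementations, so their reports can differ; keep both logs.
btrfs check --readonly --progress /dev/sdX 2>&1 | tee check-original.log
btrfs check --readonly --mode lowmem --progress /dev/sdX 2>&1 | tee check-lowmem.log
```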

> 
> got extent mismatches and wrong extent CRCs
> 
> 2) As a second step i did try to mount in recovery mode
> 
> # mount -t btrfs -o defaults,recovery,skip_balance /dev/<mydev> /mnt
> 
> I included skip_balance, since there might be an unfinished balance run. But this didn't work out.

The dmesg would help to find out what went wrong.

Just a tip for such reports: the initial error message is always the
most important thing.

> 
> 3) As a third step, I got it mounted in ro mode
> 
> # mount -t btrfs -o ro /dev/<mydev> /mnt
> 
> Here is the data reported by usage:
> 
> # btrfs fi usage /mnt
> # Overall:
> #    Device size:                   7.27TiB
> #    Device allocated:              7.27TiB
> #    Device unallocated:            1.00MiB
> #    Device missing:                  0.00B
> #    Used:                          7.13TiB
> #    Free (estimated):            134.13GiB      (min: 134.13GiB)
> #    Data ratio:                       1.00
> #    Metadata ratio:                   2.00
> #    Global reserve:              512.00MiB      (used: 0.00B)
> #
> # Data,single: Size:7.23TiB, Used:7.10TiB
> #   /dev/<mydev>        7.23TiB
> #
> # Metadata,DUP: Size:21.50GiB, Used:14.31GiB
> #   /dev/<mydev>       43.00GiB
> #
> # System,DUP: Size:8.00MiB, Used:864.00KiB
> #   /dev/<mydev>       16.00MiB
> 
> # Unallocated:
> #   /dev/<mydev>        1.00MiB
> 
> Obviously, totally filled up.
> At that time i copied out all relevant data - you never know ... Finished!
> 
> Then I tried to unmount, but that went nowhere and forced a reboot.
> 
> 
> 4) As a fourth step, I tried to repair it
> 
> # btrfs check --mode lowmem --progress --repair /dev/<mydev>
> # enabling repair mode
> # WARNING: low-memory mode repair support is only partial
> # Opening filesystem to check...
> # Checking filesystem on /dev/<mydev>
> # UUID: <my UUID>
> # [1/7] checking root items                      (0:00:33 elapsed, 20853512 items checked)
> # Fixed 0 roots.
> # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: # 28, have: 34
> # ERROR: fail to allocate new chunk No space left on device
> # Try to exclude all metadata blcoks and extents, it may be slow
> # Delete backref in extent [1988733435904 134217728]07:16 elapsed, 40435 items checked)
> # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 27, have: 34
> # Delete backref in extent [1988733435904 134217728]
> # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 26, have: 34
> # ERROR: commit_root already set when starting transaction
> # ERROR: fail to start transaction: Invalid argument
> # ERROR: extent[2017321811968, 134217728] referencer count mismatch (root: 261, owner: 287, offset: 2281701376) wanted: 3215, have: 3319
> # ERROR: commit_root already set when starting transaction
> # ERROR: fail to start transaction Invalid argument
> 
> This ends with a core-dump.
> 
> Last but not least, my questions:
> 
> I'm not experienced enough to solve this issue myself and need your help. 
> Is it worth the time and effort to solve this issue?

I don't think it would be worth it, unless you're a really super kind
guy who wants to make btrfs-progs better.
The time to repair the image could easily exceed the time to restore
the backup, not to mention that the repair isn't guaranteed to succeed.

> Developers might be interested in having a real-life testbed?
> Do you need any further info that would help to solve the issue?

In this case, the history of the corruption would be more useful.

But since it's a 4.15 kernel, which may not have enough fixes backported
(it's an Ubuntu kernel, not a SUSE one), and 5.2.2 is not safe at all
(you need 5.3.0 or 5.2.15), we can't even determine whether it was 5.2.2
that caused the corruption in the first place.

So I'm not sure we can get more juice out of the report.

Thanks,
Qu

> 
> 
> Best regards
> Ralf
> 
> 
> 
> 
> 



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to heal this btrfs fi corruption?
  2019-12-20  6:05 ` Qu Wenruo
@ 2019-12-20 11:36   ` Ralf Zerres
  0 siblings, 0 replies; 9+ messages in thread
From: Ralf Zerres @ 2019-12-20 11:36 UTC (permalink / raw)
  To: linux-btrfs, quwenruo.btrfs






On Friday, 20.12.2019 at 14:05 +0800, Qu Wenruo wrote:
> 
On 2019/12/20 4:00 AM, Ralf Zerres wrote:
> > Dear list,
> > 
> > at a customer site I can't mount a given btrfs device in rw mode.
> > This is production data; I do have a backup and managed to mount the filesystem in ro mode, so I copied out the relevant data.
> > Having said this, if btrfs check --repair can't heal the situation, I could reformat the filesystem and start all over.
> > But I would prefer to save the time and take the healing as proof of the "production ready" status of btrfs-progs.
> > 
> > Here are the details:
> > 
> > kernel: 5.2.2 (Ubuntu 18.04.3)
> > btrfs-progs: 5.2.1
> > HBA: DELL Perc
> > # storcli /c0/v0
> > # 0/0   RAID5 Optl  RW     Yes     RWBD  -   OFF 7.274 TB SSD-Data
> > #btrfs fi show /dev/sdX
> > #Label: 'Data-Ssd'  uuid: <my uuid>
> > #        Total devices 1 FS bytes used 7.12TiB
> > #        devid    1 size 7.27TiB used 7.27TiB path /dev/<mydev>
> > 
> > What happened:
> > Customer filled up the filesystem (lots of snapshots in a couple of subvolumes).
> > The system was running kernel 4.15 and btrfs-progs 4.15. I updated the kernel and btrfs-progs on the assumption
> > that more recent mainline tools could do a better job, since they have seen lots of fixes.
> > 
> > 1) As a first step, I ran
> > 
> > # btrfs check --mode lowmem --progress /dev/<mydev>
> 
> The initial report would help a lot to determine the root cause of the
> corruption in the first place.
> 
> But if btrfs check (both modes) reports errors, you'd better not assume
> --repair can do a better job.
> 
> Currently btrfs check is only good at finding problems, not really
> fixing them.
> 
Thanks for this clarification.

> There are too many things to consider when doing a repair, so --repair
> is far from "production ready".
> That's why in the v5.4 progs we added an extra wait time for --repair.
> 
which means we have to wait until development can finish this task.
Until then I will regard --repair as a WIP feature that may or may not
help. I'll only use it on data sets for which valid backups exist, or
be prepared to lose data.

> > 
> > got extent mismatches and wrong extent CRCs
> > 
> > 2) As a second step i did try to mount in recovery mode
> > 
> > # mount -t btrfs -o defaults,recovery,skip_balance /dev/<mydev> /mnt
> > 
> > I included skip_balance, since there might be an unfinished balance run. But this didn't work out.
> 
> The dmesg would help to find out what went wrong.
> 
> Just a tip for such reports: the initial error message is always the
> most important thing.
> 
> > 
> > 3) As a third step, I got it mounted in ro mode
> > 
> > # mount -t btrfs -o ro /dev/<mydev> /mnt
> > 
> > Here is the data reported by usage:
> > 
> > # btrfs fi usage /mnt
> > # Overall:
> > #    Device size:                   7.27TiB
> > #    Device allocated:              7.27TiB
> > #    Device unallocated:            1.00MiB
> > #    Device missing:                  0.00B
> > #    Used:                          7.13TiB
> > #    Free (estimated):            134.13GiB      (min: 134.13GiB)
> > #    Data ratio:                       1.00
> > #    Metadata ratio:                   2.00
> > #    Global reserve:              512.00MiB      (used: 0.00B)
> > #
> > # Data,single: Size:7.23TiB, Used:7.10TiB
> > #   /dev/<mydev>        7.23TiB
> > #
> > # Metadata,DUP: Size:21.50GiB, Used:14.31GiB
> > #   /dev/<mydev>       43.00GiB
> > #
> > # System,DUP: Size:8.00MiB, Used:864.00KiB
> > #   /dev/<mydev>       16.00MiB
> > 
> > # Unallocated:
> > #   /dev/<mydev>        1.00MiB
> > 
> > Obviously, totally filled up.
> > At that time i copied out all relevant data - you never know ... Finished!
> > 
> > Then I tried to unmount, but that went nowhere and forced a reboot.
> > 
> > 
> > 4) As a fourth step, I tried to repair it
> > 
> > # btrfs check --mode lowmem --progress --repair /dev/<mydev>
> > # enabling repair mode
> > # WARNING: low-memory mode repair support is only partial
> > # Opening filesystem to check...
> > # Checking filesystem on /dev/<mydev>
> > # UUID: <my UUID>
> > # [1/7] checking root items                      (0:00:33 elapsed, 20853512 items checked)
> > # Fixed 0 roots.
> > # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: # 28, have: 34
> > # ERROR: fail to allocate new chunk No space left on device
> > # Try to exclude all metadata blcoks and extents, it may be slow
> > # Delete backref in extent [1988733435904 134217728]07:16 elapsed, 40435 items checked)
> > # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 27, have: 34
> > # Delete backref in extent [1988733435904 134217728]
> > # ERROR: extent[1988733435904, 134217728] referencer count mismatch (root: 261, owner: 286, offset: 5905580032) wanted: 26, have: 34
> > # ERROR: commit_root already set when starting transaction
> > # ERROR: fail to start transaction: Invalid argument
> > # ERROR: extent[2017321811968, 134217728] referencer count mismatch (root: 261, owner: 287, offset: 2281701376) wanted: 3215, have: 3319
> > # ERROR: commit_root already set when starting transaction
> > # ERROR: fail to start transaction Invalid argument
> > 
> > This ends with a core-dump.
> > 
> > Last but not least, my questions:
> > 
> > I'm not experienced enough to solve this issue myself and need your help. 
> > Is it worth the time and effort to solve this issue?
> 
> I don't think it would be worth it, unless you're a really super kind
> guy who wants to make btrfs-progs better.
> The time to repair the image could easily exceed the time to restore
> the backup, not to mention that the repair isn't guaranteed to succeed.
> 
I will give btrfs-progs 5.4 a run on a system booted with a 5.4 kernel.
The ssd-pool is still available in the corrupted state, and it will not
go into production anyway before the capacity can be extended.
The disks are ordered and on their way.
I will just run --repair as an academic exercise (not calling myself a
super nice guy). But it might give some insight.

> > Developers might be interested in having a real-life testbed?
> > Do you need any further info that would help to solve the issue?
> 
> In this case, the history of the corruption would be more useful.
> 
> But since it's a 4.15 kernel, which may not have enough fixes backported
> (it's an Ubuntu kernel, not a SUSE one), and 5.2.2 is not safe at all
> (you need 5.3.0 or 5.2.15), we can't even determine whether it was 5.2.2
> that caused the corruption in the first place.

Well, I do expect 5.4.0 to be equally valid. Too bad that there is no
official backport for Ubuntu stable (aka 18.04.x).

> So I'm not sure if we can get more juice from the report.
> 
When I add the new disks to the RAID5, I will definitely create a fresh
btrfs filesystem to be sure it is clean and has no faults. Then the
subvols and data will be restored with btrfs send/receive.
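For completeness, that restore path might look roughly like this (paths are placeholders; the source snapshots must be read-only to be sendable):

```shell
# Full send of the first snapshot into the fresh filesystem:
btrfs send /backup/snap/data-2019-12-19 | btrfs receive /mnt/newfs/

# Later snapshots incrementally, with -p naming a parent snapshot that
# already exists on the receiving side:
btrfs send -p /backup/snap/data-2019-12-19 \
    /backup/snap/data-2019-12-20 | btrfs receive /mnt/newfs/
```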
 
> Thanks,
> Qu
> 

Qu, thanks a bunch for your time and the fruitful information.
Ralf

> 
> > 
> > 
> > Best regards
> > Ralf
> > 
> > 
> > 
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to heal this btrfs fi corruption?
       [not found] ` <CAK-xaQbGiO=b3XFS929DFcG=B3fsuT7AAFKLSaECaXbgUyZqzw@mail.gmail.com>
@ 2019-12-20 13:38   ` Ralf Zerres
  0 siblings, 0 replies; 9+ messages in thread
From: Ralf Zerres @ 2019-12-20 13:38 UTC (permalink / raw)
  To: andrea.gelmini; +Cc: linux-btrfs, quwenruo.btrfs

On Friday, 20.12.2019 at 14:01 +0100, Andrea Gelmini wrote:
> 
> On Fri, 20 Dec 2019 at 12:40, Ralf Zerres <Ralf.Zerres@networkx.de>
> wrote:
> 
> > Well, I do expect 5.4.0 to be equally valid. Too bad that there is no
> > official backport for Ubuntu stable (aka 18.04.x)
> 
> Use this:
> 
> https://kernel.ubuntu.com/~kernel-ppa/mainline/

Thanks for the link. That is exactly where I pull the kernels from ...
> I have used it in production for years.
> 
> Also, my personal point of view: Qu and the Facebook guys are doing
> incredible work and improvements on btrfs. But I don't feel
> comfortable using it in production. It's still too early.
> 

Yes, the improvement is seen in every version and is very much
appreciated.
But be fair: if you use btrfs as advertised (RAID1 or RAID0, no
gigantic qgroup dependencies, a reasonable number of snapshots per
subvolume (< 64)), the filesystem is stable. I've been running it in a
production environment for 2 years.

> Ciao,
> Gelma
> 

just my two cents ...
Ralf

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-12-20 13:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-19 20:00 How to heal this btrfs fi corruption? Ralf Zerres
2019-12-19 20:59 ` Chris Murphy
2019-12-19 21:25 ` Martin Steigerwald
2019-12-19 21:43   ` Chris Murphy
2019-12-19 22:34     ` Remi Gauvin
2019-12-19 23:18       ` Chris Murphy
2019-12-20  6:05 ` Qu Wenruo
2019-12-20 11:36   ` Ralf Zerres
     [not found] ` <CAK-xaQbGiO=b3XFS929DFcG=B3fsuT7AAFKLSaECaXbgUyZqzw@mail.gmail.com>
2019-12-20 13:38   ` Ralf Zerres

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.