* ERROR... please contact btrfs developers
@ 2020-09-30  1:44 Eric Levy
  2020-09-30  2:03 ` Qu Wenruo
  2020-10-05  1:33 ` Eric Levy
  0 siblings, 2 replies; 10+ messages in thread
From: Eric Levy @ 2020-09-30  1:44 UTC (permalink / raw)
  To: linux-btrfs

I recently upgraded a Linux system running on btrfs from a 5.3.x
kernel to a 5.4.x version. The system failed to run for more than a
few minutes after the upgrade, because the root mount degraded to a
read-only state. I continued to use the system by booting using the
5.3.x kernel.

Some time later, I attempted to migrate the root subvolume using a
send-receive command pairing, and noticed that the operation would
invariably abort before completion. I also noticed that a full file
walk of the mounted volume was impossible, because operations on some
files generated errors at the file-system level.
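
For context, the pairing was roughly of the following form, with a
read-only snapshot as the source; the paths shown here are illustrative
rather than my exact ones:

$ sudo btrfs send /mnt/snapshots/root-snap | sudo btrfs receive /mnt/backup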

Upon investigating using a check command, I learned that the file
system had errors.

Examining the error report (not saved), I noticed that overall my
situation had rather clear similarities to one described in an earlier
discussion [1].

Unfortunately, it appears that the differences in the kernels may have
corrupted the file system.

Eager for a resolution, and encouraged by an optimistic comment toward
the end of that discussion, I chose to run a check operation on the
partition with the --repair flag included.
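
For reference, the repair command was essentially of this form:

$ sudo btrfs check --repair /dev/sda5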

Perhaps not surprisingly to some, a read-only check operation performed
after the attempted repair gave a much more discouraging report,
suggesting that the damage to the file system was made worse, not
better, by the operation. I realize that this possibility is explained
in the documentation.

At the moment, the full report appears as below.

Presently, the file system mounts, but the ability to successfully
read files degrades the longer the system is mounted and the more
files are read during a continuous mount. Experiments involving
unmounting and then mounting again give some indication that this
degradation is not entirely permanent.

What possibility is open to recover all or part of the file system?
After such a rescue attempt, would I have any way to know what is lost
versus saved? Might I expect corruption within the file contents that
would not be detected by the rescue effort?

I would be thankful for any guidance that might lead to restoring the data.


[1] https://www.spinics.net/lists/linux-btrfs/msg96735.html
---

Opening filesystem to check...
Checking filesystem on /dev/sda5
UUID: 9a4da0b6-7e39-4a5f-85eb-74acd11f5b94
[1/7] checking root items
[2/7] checking extents
ERROR: invalid generation for extent 4064026624, have 94810718697136
expect (0, 33469925]
ERROR: invalid generation for extent 16323178496, have 94811372174048
expect (0, 33469925]
ERROR: invalid generation for extent 79980945408, have 94811372219744
expect (0, 33469925]
ERROR: invalid generation for extent 318963990528, have 94810111593504
expect (0, 33469925]
ERROR: invalid generation for extent 319650189312, have 14758526976
expect (0, 33469925]
ERROR: invalid generation for extent 319677259776, have 414943019007
expect (0, 33469925]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
block group 71962722304 has wrong amount of free space, free space
cache has 266420224 block group has 266354688
ERROR: free space cache has more free space than block group item,
this could leads to serious corruption, please contact btrfs
developers
failed to load free space cache for block group 71962722304
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
found 399845548032 bytes used, error(s) found
total csum bytes: 349626220
total tree bytes: 5908873216
total fs tree bytes: 4414324736
total extent tree bytes: 879493120
btree space waste bytes: 1122882578
file data blocks allocated: 550505705472
 referenced 512080416768


* Re: ERROR... please contact btrfs developers
  2020-09-30  1:44 ERROR... please contact btrfs developers Eric Levy
@ 2020-09-30  2:03 ` Qu Wenruo
       [not found]   ` <CA++hEgyyn0Os1-w-WE8seXCrDJVosgLnfL1pU7e2p_LpqRmJ_Q@mail.gmail.com>
       [not found]   ` <CA++hEgwsLH=9-PCpkR4X2MEqSwwK6ZMhpb+YEB=ze-kOJ8cwaQ@mail.gmail.com>
  2020-10-05  1:33 ` Eric Levy
  1 sibling, 2 replies; 10+ messages in thread
From: Qu Wenruo @ 2020-09-30  2:03 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs


On 2020/9/30 9:44 AM, Eric Levy wrote:
> I recently upgraded a Linux system running on btrfs from a 5.3.x
> kernel to a 5.4.x version. The system failed to run for more than a
> few minutes after the upgrade, because the root mount degraded to a
> read-only state. I continued to use the system by booting using the
> 5.3.x kernel.

Dmesg please. But according to your btrfs check result, I think it's
caused by bad extent generation numbers written by older kernels.

> 
> Some time later, I attempted to migrate the root subvolume using a
> send-receive command pairing, and noticed that the operation would
> invariably abort before completion. I also noticed that a full file
> walk of the mounted volume was impossible, because operations on some
> files generated errors from the file-system level.
> 
> Upon investigating using a check command, I learned that the file
> system had errors.
> 
> Examining the error report (not saved), I noticed that overall my
> situation had rather clear similarities to one described in an earlier
> discussion [1].
> 
> Unfortunately, it appears that the differences in the kernels may have
> corrupted the file system.

Nope, your fs is still fine.

> 
> Based on eagerness for a resolution, and on an optimistic comment
> toward the end of the discussion, I chose to run a check operation on
> the partition with the --repair flag included.

And obviously it won't help, since we don't have extent item repair
functionality yet.

There is an off-tree branch to do the repair:
https://github.com/adam900710/btrfs-progs/tree/extent_gen_repair

You could try that to see if it works.
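
Roughly, assuming the usual btrfs-progs build dependencies are
installed:

$ git clone -b extent_gen_repair https://github.com/adam900710/btrfs-progs.git
$ cd btrfs-progs
$ ./autogen.sh && ./configure && make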

Thanks,
Qu

> [...]


* Re: ERROR... please contact btrfs developers
  2020-09-30  1:44 ERROR... please contact btrfs developers Eric Levy
  2020-09-30  2:03 ` Qu Wenruo
@ 2020-10-05  1:33 ` Eric Levy
  2020-10-05  1:36   ` Qu Wenruo
  1 sibling, 1 reply; 10+ messages in thread
From: Eric Levy @ 2020-10-05  1:33 UTC (permalink / raw)
  To: linux-btrfs

A new observation: when mounted with the RO option, although the mount
still degrades after continued use, it is more stable than in standard
RW mode. At this point, I believe I have recovered the crucial folders,
though I have no guarantee that no files are missing or corrupted. I
would hope to restore this filesystem to a fully functional state, or
otherwise clone the subvolumes successfully to another partition, with
as much data recovery as possible.
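
For reference, the read-only mount is simply something like the
following, using my usual mount point:

$ sudo mount -o ro /dev/sda5 /mnt/custom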

Even with RO, the receive command still fails rather abruptly, though
not always immediately:

ERROR: send ioctl failed with -5: Input/output error

I have written to the list because I believe that doing so best
satisfies the request within the error message to "please contact
btrfs developers". If another avenue of communication is more
suitable, then please advise.

On Tue, Sep 29, 2020 at 9:44 PM Eric Levy <ericlevy@gmail.com> wrote:
> [...]


* Re: ERROR... please contact btrfs developers
  2020-10-05  1:33 ` Eric Levy
@ 2020-10-05  1:36   ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2020-10-05  1:36 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs


On 2020/10/5 9:33 AM, Eric Levy wrote:
> A new observation is that I notice that through the RO option,
> although the mount still degrades after continued use, it is more
> stable than in a standard RW mode. At this point, I believe I have
> recovered the crucial folders, though I have no guarantee that no
> files are missing or corrupted.

As replied, none of your data is lost or corrupted by this incident.

You can revert to an older kernel without the extent item generation
check and continue your usage without any problem.

Also as said, using this branch with "btrfs check --repair" should solve
your problem and make the fs safe against the latest kernel:
https://github.com/adam900710/btrfs-progs/tree/extent_gen_repair
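
I.e. after building that branch, something like:

$ sudo ./btrfs check --repair /dev/sda5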

Thanks,
Qu


> [...]



* Re: ERROR... please contact btrfs developers
       [not found]   ` <CA++hEgyyn0Os1-w-WE8seXCrDJVosgLnfL1pU7e2p_LpqRmJ_Q@mail.gmail.com>
@ 2020-10-05  2:38     ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2020-10-05  2:38 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs


On 2020/10/5 10:20 AM, Eric Levy wrote:
>> There is an off-tree branch to do the repair:
>> https://github.com/adam900710/btrfs-progs/tree/extent_gen_repair
>>
>> You could try that to see if it works.
> 
> Would this utility build and run on a stock kernel?

It's kernel independent. So choose whatever kernel you like.

> The documentation
> suggests otherwise. It would be difficult to perform an operation on
> an environment other than a standard recovery loaded on a USB stick or
> an active desktop running an up-to-date distribution.
> 
> Also from the build output (Linux 5.4.0):
> 
>     [CC]     btrfs-convert.o
> In file included from btrfs-convert.c:22:
> kerncompat.h:300: warning: "__bitwise" redefined
>   300 | #define __bitwise
>       |
> In file included from kerncompat.h:30,
>                  from btrfs-convert.c:22:
> /usr/include/linux/types.h:22: note: this is the location of the
> previous definition
>    22 | #define __bitwise __bitwise__
>       |
> btrfs-convert.c: In function ‘ext2_xattr_check_entry’:
> btrfs-convert.c:626:11: error: ‘struct ext2_ext_attr_entry’ has no
> member named ‘e_value_block’
>   626 |  if (entry->e_value_block != 0 || value_size > size ||
>       |           ^~
> make: *** [Makefile:131: btrfs-convert.o] Error 1

This error is mostly due to older e2fsprogs headers.

You can skip the convert tool with './configure --disable-convert'.
You may also want to disable the documentation and zstd, or you will
need more dependencies.
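
Something like this should be enough (exact option names may differ
slightly between btrfs-progs versions):

$ ./configure --disable-convert --disable-documentation --disable-zstd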

Thanks,
Qu

> 



* Re: ERROR... please contact btrfs developers
       [not found]   ` <CA++hEgwsLH=9-PCpkR4X2MEqSwwK6ZMhpb+YEB=ze-kOJ8cwaQ@mail.gmail.com>
@ 2020-10-05  7:14     ` Qu Wenruo
       [not found]     ` <CA++hEgzbFsf6LgPb+XJbf-kkEYEy0cYAbaF=+m3pbEdSd+f62g@mail.gmail.com>
  1 sibling, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2020-10-05  7:14 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs


On 2020/10/5 11:35 AM, Eric Levy wrote:
>> There is an off-tree branch to do the repair:
>> https://github.com/adam900710/btrfs-progs/tree/extent_gen_repair
> 
> Ok. I was able to build and run. Part of the earlier confusion was
> from reading the documentation in the wrong branch of the repository.
> 
> I ran the repair, and now the check passes in both the stock and
> forked version of the utility.
> 
> However, the file system is still behaving badly. It reverts to RO
> mode after several minutes of use.

Would you provide the dmesg from when the fs goes RO?

It should be a different problem.

Thanks,
Qu
> 
> Even a scrub operation fails (the 'aborted' result was not from a manual
> intervention).
> 
> $ btrfs check /dev/sda5
> Opening filesystem to check...
> Checking filesystem on /dev/sda5
> UUID: 9a4da0b6-7e39-4a5f-85eb-74acd11f5b94
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space cache
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups
> Rescan hasn't been initialized, a difference in qgroup accounting is expected
> Qgroup are marked as inconsistent.
> found 399944884224 bytes used, no error found
> total csum bytes: 349626220
> total tree bytes: 6007685120
> total fs tree bytes: 4510924800
> total extent tree bytes: 881704960
> btree space waste bytes: 1148459015
> file data blocks allocated: 570546290688
>  referenced 530623602688
> 
> $ sudo btrfs scrub start -B /mnt/custom
> ERROR: scrubbing /mnt/custom failed for device id 1: ret=-1, errno=5
> (Input/output error)
> scrub canceled for 9a4da0b6-7e39-4a5f-85eb-74acd11f5b94
> Scrub started:    Sun Oct  4 23:25:22 2020
> Status:           aborted
> Duration:         0:01:41
> Total to scrub:   378.04GiB
> Rate:             0.00B/s
> Error summary:    no errors found
> 



* Re: ERROR... please contact btrfs developers
       [not found]     ` <CA++hEgzbFsf6LgPb+XJbf-kkEYEy0cYAbaF=+m3pbEdSd+f62g@mail.gmail.com>
@ 2020-10-05  8:51       ` Qu Wenruo
       [not found]         ` <CA++hEgwdYmfGFudNvkBR6zo3Ux01UFRwHN1WDd7csH5_jBZ0Rg@mail.gmail.com>
       [not found]       ` <CA++hEgzRkz+qQQf_+YBX2r5bBiNvtexiguPG99jBzVM6JhtPzg@mail.gmail.com>
  1 sibling, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2020-10-05  8:51 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs


On 2020/10/5 3:58 PM, Eric Levy wrote:
> Well, I see the complaint about limited disk space. I suppose it is a
> surprise to me that disk usage causes this problem, because the mount
> was fully functional under kernel versions 5.3.x.
> 
> Is the best solution simply to free disk space? If so, then the act
> would have to fall in the time window during which the mount retains
> RW state.
> 
> 
> [277402.736070] usb 2-1: new SuperSpeed Gen 1 USB device number 34
> using xhci_hcd
> [277402.756644] usb 2-1: New USB device found, idVendor=152d,
> idProduct=0583, bcdDevice= 2.08
> [277402.756648] usb 2-1: New USB device strings: Mfr=1, Product=2,
> SerialNumber=3
> [277402.756650] usb 2-1: Product: USB to PCIE Bridge
> [277402.756652] usb 2-1: Manufacturer: JMicron
> [277402.756653] usb 2-1: SerialNumber: 0123456789ABCDEF
> [277402.761143] scsi host1: uas
> [277402.761780] scsi 1:0:0:0: Direct-Access     JMicron  Generic
>    0208 PQ: 0 ANSI: 6
> [277402.762495] sd 1:0:0:0: Attached scsi generic sg0 type 0
> [277404.731514] sd 1:0:0:0: [sda] 1000215216 512-byte logical blocks:
> (512 GB/477 GiB)
> [277404.731517] sd 1:0:0:0: [sda] 4096-byte physical blocks
> [277404.731635] sd 1:0:0:0: [sda] Write Protect is off
> [277404.731637] sd 1:0:0:0: [sda] Mode Sense: 5f 00 00 08
> [277404.731877] sd 1:0:0:0: [sda] Write cache: enabled, read cache:
> enabled, doesn't support DPO or FUA
> [277404.732086] sd 1:0:0:0: [sda] Optimal transfer size 33553920 bytes
> not a multiple of physical block size (4096 bytes)
> [277405.557376]  sda: sda1 sda2 sda3 sda4 sda5
> [277405.559054] sd 1:0:0:0: [sda] Attached SCSI disk
> [277410.526995] BTRFS info (device sda5): disk space caching is enabled
> [277410.526999] BTRFS info (device sda5): has skinny extents
> [277410.877674] BTRFS info (device sda5): enabling ssd optimizations
> [277431.217341] BTRFS info (device sda5): checking UUID tree
> [277435.854502] BTRFS info (device sda5): balance: resume
> -mconvert=dup,soft -sconvert=dup,soft
> [277435.854776] BTRFS info (device sda5): relocating block group
> 518463160320 flags metadata
> [277506.944843] BTRFS info (device sda5): relocating block group
> 483205840896 flags metadata
> [277597.158985] BTRFS: error (device sda5) in
> btrfs_drop_snapshot:5428: errno=-28 No space left

Oh, that's a completely different bug.

Somehow btrfs exhausted the metadata space.

This is normally caused by unbalanced data/metadata usage and
multi-device setups. (Currently, RAID1/RAID0/RAID10/RAID5/RAID6 can all
over-estimate the available space and cause ENOSPC to happen in a
critical context, where we can only abort the transaction to avoid
further corruption.)

For now I guess you should not do any balance, but instead try to remove
as many unused files as possible, until you have enough unallocated
space for metadata.

To check your unallocated space, you can use btrfs fi usage:
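
For example:

$ sudo btrfs filesystem usage <mountpoint>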

You need enough "Device unallocated" for your profile. (1MiB is not
usable, as that is reserved space for superblock)

E.g. if you're using RAID1, you need *2* devices with enough unallocated
space.

Normally it is pretty hard to free a full data block group just by
deleting files/snapshots.
But if you manage it, you can then balance the data chunks to free more
space and get out of the ENOSPC spiral.
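
For example, a filtered data balance (the usage threshold here is just
a starting point):

$ sudo btrfs balance start -dusage=10 <mountpoint>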

I really need to push the fix harder before it affects more people.

Thanks,
Qu

> [277597.158988] BTRFS info (device sda5): forced readonly
> [277597.159022] BTRFS info (device sda5): 2 enospc errors during balance
> [277597.159022] BTRFS info (device sda5): balance: ended with status: -30
> [277607.030026] BTRFS info (device sda5): delayed_refs has NO entry
> 
> On Sun, Oct 4, 2020 at 11:35 PM Eric Levy <ericlevy@gmail.com> wrote:
>> [...]



* Re: ERROR... please contact btrfs developers
       [not found]       ` <CA++hEgzRkz+qQQf_+YBX2r5bBiNvtexiguPG99jBzVM6JhtPzg@mail.gmail.com>
@ 2020-10-05  8:54         ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2020-10-05  8:54 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs


On 2020/10/5 4:25 PM, Eric Levy wrote:
> I freed considerable space by destroying the swap file and old
> snapshots. The usage is down to 86% of the total volume, as reported
> by df. Nevertheless, I get the same message, "No space left";

Yeah, it's really hard to free enough space to release one data block
group.

E.g. if a data block group sized 10G has just 4K used by a file, the
whole 10G can't be freed for metadata usage. Btrfs can still utilize
the remaining (10G - 4K) space, but only for data, not metadata.

Thus it's really hard to free enough contiguous space to release a full
data block group.

BTW, btrfs fi usage output would definitely help in this case.

Thanks,
Qu

> 
> [279314.876489] BTRFS info (device sda5): relocating block group
> 518463160320 flags metadata
> [279372.777369] BTRFS: error (device sda5) in
> btrfs_drop_snapshot:5428: errno=-28 No space left
> 
> 
> On Mon, Oct 5, 2020 at 3:58 AM Eric Levy <ericlevy@gmail.com> wrote:
>>
>> Well, I see the complaint about limited disk space. I suppose it is a
>> surprise to me that disk usage causes this problem, because the mount
>> was fully functional under kernel versions 5.3.x.
>>
>> Is the best solution simply to free disk space? If so, then the act
>> would have to fall in the time window during which the mount retains
>> RW state.



* Re: ERROR... please contact btrfs developers
       [not found]         ` <CA++hEgwdYmfGFudNvkBR6zo3Ux01UFRwHN1WDd7csH5_jBZ0Rg@mail.gmail.com>
@ 2020-10-05  9:17           ` Qu Wenruo
       [not found]             ` <CA++hEgx4d_-Y4Be7_fpDLTbCnN2-2yAecbyjJWSJuU-qSFvVuw@mail.gmail.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2020-10-05  9:17 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 2801 bytes --]



On 2020/10/5 5:05 PM, Eric Levy wrote:
> The volume is not RAID, only a single NVMe card.

Then this means we may have a bigger problem.

Normally btrfs space reservation keeps enough margin for critical
operations, so it returns ENOSPC before we actually start a
space-consuming operation, and we never hit an aborted-transaction
problem like yours.

If it's a single device and we still hit the case where we need more
space than we have, then either the space reservation code or the space
consumer, btrfs_drop_snapshot(), has something wrong.
> 
> I deleted all the files and subvolumes except what I need to recover.

You may want to wait until btrfs has really dropped the deleted
subvolumes/snapshots.

The command is "btrfs subv sync <mount>". Then check whether the usage
drops.
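
E.g. with the mount point you used earlier:

$ sudo btrfs subvolume sync /mnt/custom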

> 
> Below is the dump from fi usage. It looks as though 1.00Mb is all that
> is unallocated.
> 
> 
> $ sudo btrfs fi usage /mnt/custom
> Overall:
>     Device size:         378.04GiB
>     Device allocated:         378.04GiB
>     Device unallocated:           1.00MiB
>     Device missing:             0.00B
>     Used:             323.54GiB
>     Free (estimated):          53.13GiB    (min: 53.13GiB)
>     Data ratio:                  1.00
>     Metadata ratio:              1.00
>     Global reserve:         512.00MiB    (used: 0.00B)
> 
> Data,single: Size:371.02GiB, Used:317.89GiB (85.68%)
>    /dev/sdb5     371.02GiB
> 
> Metadata,single: Size:7.01GiB, Used:5.65GiB (80.66%)
>    /dev/sdb5       7.01GiB

This is strange. Even accounting for the GlobalRSV (0.5G), we should
still have 1GiB of space for metadata.
We shouldn't use that much space just for deletion.

Would you please try a more up-to-date kernel to see if it works?
(One simple option is a rolling-release ISO, like openSUSE Tumbleweed or
the Arch install ISO.)

I wonder whether it may be a bug not fixed in v4.15 due to the difficulty
of backporting.


BTW, when mounting the fs, you may want to use the skip_balance mount
option.
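
E.g.:

$ sudo mount -o skip_balance <device> <mountpoint>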

Thanks,
Qu

> 
> System,single: Size:4.00MiB, Used:64.00KiB (1.56%)
>    /dev/sdb5       4.00MiB
> 
> Unallocated:
>    /dev/sdb5       1.00MiB
> 
> 
>> Oh, that's a completely different bug.
>>
>> Somehow btrfs exhausted the metadata space.
>>
>> Normally caused by unbalanced data/metadata usage and multi-device.
>> (Currently, RAID1/RAID0/RAID10/RAID5/RAID6 can all over-estimate the
>> available space, and cause ENOSPC to happen in critical context, where
>> we can only abort transaction to avoid further corruption)
>>
>> Currently I guess you need to don't do any balance, but try to remove as
>> many unused files as possible, until you have enough unallocated space
>> for metadata.
>>
>> To check your unallocated space, you can use btrfs fi usage:



* Re: ERROR... please contact btrfs developers
       [not found]             ` <CA++hEgx4d_-Y4Be7_fpDLTbCnN2-2yAecbyjJWSJuU-qSFvVuw@mail.gmail.com>
@ 2020-10-05 12:12               ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2020-10-05 12:12 UTC (permalink / raw)
  To: Eric Levy, linux-btrfs


On 2020/10/5 8:03 PM, Eric Levy wrote:
> I have two pieces of good news.
> 
> First, after the extents were repaired, a RO mount was sufficient to
> duplicate successfully the subvolumes on other media using the send
> and receive commands.
> 
> Second, with respect to the further problem of the balance operation
> failing, I was able to synchronize the subvolumes by mounting with
> balancing disabled. The effect of this step was to gain 22GB of
> unallocated space. Balancing succeeded after I unmounted and then
> mounted normally.

Great!

> 
> I then began a scrub operation, which completed successfully.

That confirms your data is all good.

Since your previous btrfs check confirmed your metadata is also good, and
you freed enough unallocated space, your fs is officially very healthy now.

Really all good news.

> 
> I am hopeful that we may consider this file system healthy, with no
> loss of data due to file system problems. Currently, all the evidence
> suggests as such.
> 
> Needless to say, it would be very valuable to pursue improvements that
> reduce the chances of similar problems for users in the future.

I still have doubts about the ENOSPC behavior, though: it looks like
something is wrong with the space reservation code, and it should not
happen in the single-device case, at least on the latest kernel.

But since you now have enough free space, I guess we can't find out the
cause anymore.

Thanks,
Qu

> 
> 
> On Mon, Oct 5, 2020 at 5:17 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> [...]


