* Unrecoverable scrub errors
@ 2017-11-17 15:41 Nazar Mokrynskyi
  2017-11-18  3:19 ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Nazar Mokrynskyi @ 2017-11-17 15:41 UTC (permalink / raw)
  To: linux-btrfs

Hi folks,

I'm a long-term btrfs user (it has been on my root partition and other filesystems permanently for ~3 years now, with compression, and for most of that time with RAID0 across various SSDs).

In simple terms, my setup consists of a root partition and a backup partition. Automated snapshots are taken on the root partition and then copied to an online backup partition (send/receive, handled by "Just backup btrfs") and occasionally to an offline backup partition (handled by "Btrfs sync subvolumes").

I've recently found that my online backup partition has some unrecoverable errors, as reported after running a scrub:

> scrub status for 82cfcb0f-0b80-4764-bed6-f529f2030ac5
>         scrub started at Fri Nov 17 15:05:12 2017 and finished after 02:07:30
>         total bytes scrubbed: 915.16GiB with 12 errors
>         error details: csum=12
>         corrected errors: 0, uncorrectable errors: 12, unverified errors: 0
dmesg (all of this relates to the errors mentioned above):

> [551049.038718] BTRFS warning (device dm-2): checksum error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238048: metadata leaf (level 0) in tree 985
> [551049.038720] BTRFS warning (device dm-2): checksum error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238048: metadata leaf (level 0) in tree 985
> [551049.038723] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> [551049.039634] BTRFS warning (device dm-2): checksum error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238176: metadata leaf (level 0) in tree 985
> [551049.039635] BTRFS warning (device dm-2): checksum error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238176: metadata leaf (level 0) in tree 985
> [551049.039637] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> [551049.413114] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551049.413473] BTRFS warning (device dm-2): checksum error at logical 470069477376 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238080: metadata leaf (level 0) in tree 985
> [551049.413473] BTRFS warning (device dm-2): checksum error at logical 470069477376 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238080: metadata leaf (level 0) in tree 985
> [551049.413475] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
> [551049.413685] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069477376 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551049.413910] BTRFS warning (device dm-2): checksum error at logical 470069493760 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238112: metadata leaf (level 0) in tree 985
> [551049.413911] BTRFS warning (device dm-2): checksum error at logical 470069493760 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238112: metadata leaf (level 0) in tree 985
> [551049.413912] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
> [551049.414121] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069493760 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551049.414354] BTRFS warning (device dm-2): checksum error at logical 470069510144 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238144: metadata leaf (level 0) in tree 985
> [551049.414355] BTRFS warning (device dm-2): checksum error at logical 470069510144 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238144: metadata leaf (level 0) in tree 985
> [551049.414356] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
> [551049.414567] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069510144 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551049.479023] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551049.479989] BTRFS warning (device dm-2): checksum error at logical 470069542912 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238208: metadata leaf (level 0) in tree 985
> [551049.479993] BTRFS warning (device dm-2): checksum error at logical 470069542912 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238208: metadata leaf (level 0) in tree 985
> [551049.479997] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
> [551049.523539] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069542912 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551051.672589] BTRFS warning (device dm-2): checksum error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 943286624: metadata leaf (level 0) in tree 985
> [551051.672593] BTRFS warning (device dm-2): checksum error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 943286624: metadata leaf (level 0) in tree 985
> [551051.672597] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
> [551051.820776] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551051.945310] BTRFS warning (device dm-2): checksum error at logical 470069477376 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 943286656: metadata leaf (level 0) in tree 985
> [551051.945314] BTRFS warning (device dm-2): checksum error at logical 470069477376 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 943286656: metadata leaf (level 0) in tree 985
> [551051.945318] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 8, gen 0
> [551052.112245] BTRFS warning (device dm-2): checksum error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 943286752: metadata leaf (level 0) in tree 985
> [551052.112247] BTRFS warning (device dm-2): checksum error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 943286752: metadata leaf (level 0) in tree 985
> [551052.112248] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 9, gen 0
> [551052.183671] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069477376 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551052.253278] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551052.260305] BTRFS warning (device dm-2): checksum error at logical 470069493760 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 943286688: metadata leaf (level 0) in tree 985
> [551052.260307] BTRFS warning (device dm-2): checksum error at logical 470069493760 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 943286688: metadata leaf (level 0) in tree 985
> [551052.260308] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
> [551052.300024] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069493760 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
This is an online backup partition and I have an offline backup partition with the same data, so I'm not very concerned about losing any data here, but I would like to repair it.

Are there any better options before resorting to `btrfsck --repair`? Maybe I can find the snapshot that contains the file with the wrong checksum and remove that snapshot, or something like that?

> nazar-pc@nazar-pc ~> sudo btrfs filesystem show /media/Backup
> Label: 'Backup'  uuid: 82cfcb0f-0b80-4764-bed6-f529f2030ac5
>     Total devices 1 FS bytes used 896.20GiB
>     devid    1 size 1.00TiB used 920.09GiB path /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
>
> nazar-pc@nazar-pc ~> sudo btrfs filesystem df /media/Backup
> Data, single: total=879.01GiB, used=877.24GiB
> System, DUP: total=40.00MiB, used=128.00KiB
> Metadata, DUP: total=20.50GiB, used=18.96GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> nazar-pc@nazar-pc ~> btrfs --version
> btrfs-progs v4.13.3
>
> nazar-pc@nazar-pc ~> uname -a
> Linux nazar-pc 4.13.0-16-generic #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

-- 
Sincerely, Nazar Mokrynskyi
github.com/nazar-pc



* Re: Unrecoverable scrub errors
  2017-11-17 15:41 Unrecoverable scrub errors Nazar Mokrynskyi
@ 2017-11-18  3:19 ` Chris Murphy
  2017-11-18  3:33   ` Adam Borowski
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2017-11-18  3:19 UTC (permalink / raw)
  To: Nazar Mokrynskyi; +Cc: Btrfs BTRFS

On Fri, Nov 17, 2017 at 8:41 AM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:

>> [551049.038718] BTRFS warning (device dm-2): checksum error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238048: metadata leaf (level 0) in tree 985
>> [551049.038720] BTRFS warning (device dm-2): checksum error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238048: metadata leaf (level 0) in tree 985
>> [551049.038723] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>> [551049.039634] BTRFS warning (device dm-2): checksum error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238176: metadata leaf (level 0) in tree 985
>> [551049.039635] BTRFS warning (device dm-2): checksum error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238176: metadata leaf (level 0) in tree 985
>> [551049.039637] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
>> [551049.413114] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1


These are metadata errors. Are there any other storage stack related
errors in the previous 2-5 minutes, such as read errors (UNC) or SATA
link reset messages?
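
Something like this is a quick way to look through the kernel log; just
a sketch, and the grep patterns are only approximate:

# journalctl -k | grep -iE 'UNC|ata[0-9]+.*(error|reset)|blk_update_request'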

> Are there any better options before resorting to `btrfsck --repair`?

I wouldn't try it just yet. What do you get from btrfs check without
--repair? It will check the metadata and should run into the same
problem, but if it craps out, chances are --repair will too.
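
For example (a sketch; plain btrfs check is read-only by default, and
the filesystem has to be unmounted first):

# umount /media/Backup
# btrfs check /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1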


>Maybe I can find the snapshot that contains the file with the wrong checksum and remove that snapshot, or something like that?

It's not a file. It's a metadata leaf.


>> nazar-pc@nazar-pc ~> sudo btrfs filesystem df /media/Backup
>> Data, single: total=879.01GiB, used=877.24GiB
>> System, DUP: total=40.00MiB, used=128.00KiB
>> Metadata, DUP: total=20.50GiB, used=18.96GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B

Metadata is DUP, but both copies have corruption. Kinda strange. But I
don't know how close the DUP copies are to each other, or whether a
big enough media defect could explain this.
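
One way to check, as a sketch: with the filesystem unmounted, map one
of the failing logical addresses and compare the physical offsets
reported for the two mirrors:

# btrfs-map-logical -l 470069460992 /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1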

What do you get for smartctl -l scterc /dev/ (the whole physical
device, not the dm device)?

In the meantime, take the drive offline (umount it), run smartctl
-t long, and after that finishes, smartctl -x. Attach the output as a
plain text file; it should be small enough for the list to handle,
and that avoids reformatting problems.
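
That is, something along these lines (a sketch; /dev/sdX is a stand-in
for the real physical device):

# smartctl -l scterc /dev/sdX
# smartctl -t long /dev/sdX      (prints an estimated completion time)
# smartctl -x /dev/sdX           (after the self-test finishes)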


-- 
Chris Murphy


* Re: Unrecoverable scrub errors
  2017-11-18  3:19 ` Chris Murphy
@ 2017-11-18  3:33   ` Adam Borowski
  2017-11-18  8:15     ` Nazar Mokrynskyi
  0 siblings, 1 reply; 13+ messages in thread
From: Adam Borowski @ 2017-11-18  3:33 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nazar Mokrynskyi, Btrfs BTRFS

On Fri, Nov 17, 2017 at 08:19:11PM -0700, Chris Murphy wrote:
> On Fri, Nov 17, 2017 at 8:41 AM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
> 
> >> [551049.038718] BTRFS warning (device dm-2): checksum error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238048: metadata leaf (level 0) in tree 985
> >> [551049.038720] BTRFS warning (device dm-2): checksum error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238048: metadata leaf (level 0) in tree 985
> >> [551049.038723] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> >> [551049.039634] BTRFS warning (device dm-2): checksum error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238176: metadata leaf (level 0) in tree 985
> >> [551049.039635] BTRFS warning (device dm-2): checksum error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238176: metadata leaf (level 0) in tree 985
> >> [551049.039637] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> >> [551049.413114] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> 
> These are metadata errors. Are there any other storage stack related
> errors in the previous 2-5 minutes, such as read errors (UNC) or SATA
> link reset messages?
> 
> >Maybe I can find the snapshot that contains the file with the wrong checksum
> > and remove that snapshot, or something like that?
> 
> It's not a file. It's a metadata leaf.

Just for the record: had this been a data block (i.e., a non-inline file
extent), the dmesg message would include one of the filenames that refer to
that extent.  To clear the error, you'd need to remove all such files.
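
For data extents, the affected paths can also be enumerated from the
logical address; a sketch (this only works for data blocks, so it would
fail on the metadata leaves in this thread):

# btrfs inspect-internal logical-resolve 470069460992 /media/Backup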

> >> nazar-pc@nazar-pc ~> sudo btrfs filesystem df /media/Backup
> >> Data, single: total=879.01GiB, used=877.24GiB
> >> System, DUP: total=40.00MiB, used=128.00KiB
> >> Metadata, DUP: total=20.50GiB, used=18.96GiB
> >> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Metadata is DUP, but both copies have corruption. Kinda strange. But I
> don't know how close the DUP copies are to each other, or whether a
> big enough media defect could explain this.

The original post mentioned SSDs (but it was unclear whether _this_
filesystem is backed by one).  If so, DUP is nearly worthless, as both copies
will be written to physical cells next to each other, no matter what
positions the FTL reports for them.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄⠀⠀⠀⠀ sky.  Your cat demands food.  The priority should be obvious...


* Re: Unrecoverable scrub errors
  2017-11-18  3:33   ` Adam Borowski
@ 2017-11-18  8:15     ` Nazar Mokrynskyi
  2017-11-19  3:19       ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Nazar Mokrynskyi @ 2017-11-18  8:15 UTC (permalink / raw)
  To: Adam Borowski, Chris Murphy; +Cc: Btrfs BTRFS

I can assure you that the drive (it is an HDD) is perfectly functional, with 0 SMART errors or warnings, and doesn't have any problems. dmesg is clean in that regard too, so the HDD itself can be excluded from the potential causes.

There were, however, some memory-related issues on my machine a few months ago, so there is a chance that data might have been written incorrectly to the drive back then (I hadn't run a scrub on the backup drive for a long time).

How can I identify which files this metadata belongs to, so that I can replace or simply remove those files?

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc

18.11.17 05:33, Adam Borowski wrote:
> On Fri, Nov 17, 2017 at 08:19:11PM -0700, Chris Murphy wrote:
>> On Fri, Nov 17, 2017 at 8:41 AM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>>
>>>> [551049.038718] BTRFS warning (device dm-2): checksum error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238048: metadata leaf (level 0) in tree 985
>>>> [551049.038720] BTRFS warning (device dm-2): checksum error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238048: metadata leaf (level 0) in tree 985
>>>> [551049.038723] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>>>> [551049.039634] BTRFS warning (device dm-2): checksum error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238176: metadata leaf (level 0) in tree 985
>>>> [551049.039635] BTRFS warning (device dm-2): checksum error at logical 470069526528 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 942238176: metadata leaf (level 0) in tree 985
>>>> [551049.039637] BTRFS error (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
>>>> [551049.413114] BTRFS error (device dm-2): unable to fixup (regular) error at logical 470069460992 on dev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
>> These are metadata errors. Are there any other storage stack related
>> errors in the previous 2-5 minutes, such as read errors (UNC) or SATA
>> link reset messages?
>>
>>> Maybe I can find the snapshot that contains the file with the wrong checksum
>>> and remove that snapshot, or something like that?
>> It's not a file. It's a metadata leaf.
> Just for the record: had this been a data block (i.e., a non-inline file
> extent), the dmesg message would include one of the filenames that refer to
> that extent.  To clear the error, you'd need to remove all such files.
>
>>>> nazar-pc@nazar-pc ~> sudo btrfs filesystem df /media/Backup
>>>> Data, single: total=879.01GiB, used=877.24GiB
>>>> System, DUP: total=40.00MiB, used=128.00KiB
>>>> Metadata, DUP: total=20.50GiB, used=18.96GiB
>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>> Metadata is DUP, but both copies have corruption. Kinda strange. But I
>> don't know how close the DUP copies are to each other, or whether a
>> big enough media defect could explain this.
> The original post mentioned SSDs (but it was unclear whether _this_
> filesystem is backed by one).  If so, DUP is nearly worthless, as both
> copies will be written to physical cells next to each other, no matter what
> positions the FTL reports for them.
>
>
> Meow!


* Re: Unrecoverable scrub errors
  2017-11-18  8:15     ` Nazar Mokrynskyi
@ 2017-11-19  3:19       ` Chris Murphy
  2017-11-19  3:45         ` Nazar Mokrynskyi
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2017-11-19  3:19 UTC (permalink / raw)
  To: Nazar Mokrynskyi; +Cc: Adam Borowski, Chris Murphy, Btrfs BTRFS

On Sat, Nov 18, 2017 at 1:15 AM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
> I can assure you that the drive (it is an HDD) is perfectly functional, with 0 SMART errors or warnings, and doesn't have any problems. dmesg is clean in that regard too, so the HDD itself can be excluded from the potential causes.
>
> There were, however, some memory-related issues on my machine a few months ago, so there is a chance that data might have been written incorrectly to the drive back then (I hadn't run a scrub on the backup drive for a long time).
>
> How can I identify which files this metadata belongs to, so that I can replace or simply remove those files?

You might look through the archives about bad RAM and btrfs check
--repair, and include Hugo Mills in the search; I'm pretty sure there
is code in repair that can fix certain kinds of memory-induced
corruption in metadata. But I have no idea if this is that type, or if
repair can make things worse in this case. So I'd say you get
everything off this file system that you want, and then go ahead and
try --repair and see what happens.
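
For the copying-off step, something as simple as this should do; a
sketch, with /mnt/rescue as a hypothetical destination:

# mount -o ro /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 /media/Backup
# rsync -aHAX /media/Backup/ /mnt/rescue/     (/mnt/rescue is a placeholder)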

One alternative is to just leave it alone. If you're not hitting these
leaves in day to day operation, they won't hurt anything.

Another alternative is to umount, and use btrfs-debug-tree -b  on one
of the leaf/node addresses and see what you get (probably an error),
but it might still also show the node content so we have some idea
what's affected by the error. If it flat out refuses to show the node,
might be a feature request to get a flag that forces display of the
node such as it is...



-- 
Chris Murphy


* Re: Unrecoverable scrub errors
  2017-11-19  3:19       ` Chris Murphy
@ 2017-11-19  3:45         ` Nazar Mokrynskyi
  2017-11-19  4:40           ` Chris Murphy
       [not found]           ` <CAJCQCtRKPekR+bEE0xaX02Wwz0E_N1wcUxJdc_D-C9By2qMvWw@mail.gmail.com>
  0 siblings, 2 replies; 13+ messages in thread
From: Nazar Mokrynskyi @ 2017-11-19  3:45 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Adam Borowski, Btrfs BTRFS

19.11.17 05:19, Chris Murphy wrote:
> On Sat, Nov 18, 2017 at 1:15 AM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>> I can assure you that the drive (it is an HDD) is perfectly functional, with 0 SMART errors or warnings, and doesn't have any problems. dmesg is clean in that regard too, so the HDD itself can be excluded from the potential causes.
>>
>> There were, however, some memory-related issues on my machine a few months ago, so there is a chance that data might have been written incorrectly to the drive back then (I hadn't run a scrub on the backup drive for a long time).
>>
>> How can I identify which files this metadata belongs to, so that I can replace or simply remove those files?
> You might look through the archives about bad RAM and btrfs check
> --repair, and include Hugo Mills in the search; I'm pretty sure there
> is code in repair that can fix certain kinds of memory-induced
> corruption in metadata. But I have no idea if this is that type, or if
> repair can make things worse in this case. So I'd say you get
> everything off this file system that you want, and then go ahead and
> try --repair and see what happens.

In this case I'm not sure whether the data was written incorrectly, or the checksum, or both. So I'd like to first identify the affected files, check them manually, and then decide what to do, especially since there are not many errors yet.

> One alternative is to just leave it alone. If you're not hitting these
> leaves in day to day operation, they won't hurt anything.
It was working like that for some time, but I have a suspicion that it occasionally causes spikes of disk activity because of these errors (which is why I ran the scrub in the first place).
> Another alternative is to umount, and use btrfs-debug-tree -b  on one
> of the leaf/node addresses and see what you get (probably an error),
> but it might still also show the node content so we have some idea
> what's affected by the error. If it flat out refuses to show the node,
> might be a feature request to get a flag that forces display of the
> node such as it is...

Here is what I've got:

> nazar-pc@nazar-pc ~> sudo btrfs-debug-tree -b 470069460992 /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> btrfs-progs v4.13.3
> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
> Csum didn't match
> ERROR: failed to read 470069460992
Looks like I indeed need a --force here.



* Re: Unrecoverable scrub errors
  2017-11-19  3:45         ` Nazar Mokrynskyi
@ 2017-11-19  4:40           ` Chris Murphy
       [not found]           ` <CAJCQCtRKPekR+bEE0xaX02Wwz0E_N1wcUxJdc_D-C9By2qMvWw@mail.gmail.com>
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2017-11-19  4:40 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sat, Nov 18, 2017 at 8:45 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
> 19.11.17 05:19, Chris Murphy wrote:
>> On Sat, Nov 18, 2017 at 1:15 AM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>>> I can assure you that the drive (it is an HDD) is perfectly functional, with 0 SMART errors or warnings, and doesn't have any problems. dmesg is clean in that regard too, so the HDD itself can be excluded from the potential causes.
>>>
>>> There were, however, some memory-related issues on my machine a few months ago, so there is a chance that data might have been written incorrectly to the drive back then (I hadn't run a scrub on the backup drive for a long time).
>>>
>>> How can I identify which files this metadata belongs to, so that I can replace or simply remove those files?
>> You might look through the archives about bad RAM and btrfs check
>> --repair, and include Hugo Mills in the search; I'm pretty sure there
>> is code in repair that can fix certain kinds of memory-induced
>> corruption in metadata. But I have no idea if this is that type, or if
>> repair can make things worse in this case. So I'd say you get
>> everything off this file system that you want, and then go ahead and
>> try --repair and see what happens.
>
> In this case I'm not sure whether the data was written incorrectly, or the checksum, or both. So I'd like to first identify the affected files, check them manually, and then decide what to do, especially since there are not many errors yet.
>
>> One alternative is to just leave it alone. If you're not hitting these
>> leaves in day to day operation, they won't hurt anything.
> It was working like that for some time, but I have a suspicion that it occasionally causes spikes of disk activity because of these errors (which is why I ran the scrub in the first place).
>> Another alternative is to umount, and use btrfs-debug-tree -b  on one
>> of the leaf/node addresses and see what you get (probably an error),
>> but it might still also show the node content so we have some idea
>> what's affected by the error. If it flat out refuses to show the node,
>> might be a feature request to get a flag that forces display of the
>> node such as it is...
>
> Here is what I've got:
>
>> nazar-pc@nazar-pc ~> sudo btrfs-debug-tree -b 470069460992 /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
>> btrfs-progs v4.13.3
>> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
>> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
>> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
>> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
>> Csum didn't match
>> ERROR: failed to read 470069460992
> Looks like I indeed need a --force here.
>

Huh, seems overdue. But what do I know?

You can use btrfs-map-logical -l to get a physical address for this
leaf, and then plug that into dd:

# dd if=/dev/ skip=<physicaladdress> bs=1 count=16384 2>/dev/null | hexdump -C

The gotcha, of course, is that this is not translated into the plainer
language output of btrfs-debug-tree, so you're in the weeds with the
on-disk format documentation. But maybe you'll see filenames on the
right-hand side of the hexdump output, and maybe that's enough... Or
maybe it's worth computing a csum of that leaf to check against the
stored csum, which is found in the first field of the leaf. I'd
expect the csum itself is what's wrong, because if you get memory
corruption while creating the node, the resulting csum will be *correct*
for that malformed node and there'd be no csum error; you'd just see
some other crazy faceplant.


Example.

I need a metadata leaf, so I ask btrfs-debug-tree to show the file tree
of an empty subvolume. In your case, you've already got a bad leaf
address, so you just plug that into btrfs-map-logical as shown below:

# btrfs-debug-tree -t 340 /dev/nvme0n1p8
btrfs-progs v4.13.3
file tree key (340 ROOT_ITEM 0)
leaf 155375550464 items 3 free space 15942 generation 249992 owner 340
leaf 155375550464 flags 0x1(WRITTEN) backref revision 1
fs uuid 2662057f-e6c7-47fa-8af9-ad933a22f6ec
chunk uuid 1df72dcf-f515-404a-894a-f7345f988793
    item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
        generation 50968 transid 249992 size 0 nbytes 0
        block group 0 mode 40700 links 1 uid 0 gid 0 rdev 0
        sequence 0 flags 0x124(none)
        atime 1510866942.430740536 (2017-11-16 14:15:42)
        ctime 1511053088.58606103 (2017-11-18 17:58:08)
        mtime 1494741970.844618722 (2017-05-14 00:06:10)
        otime 1494741970.844618722 (2017-05-14 00:06:10)
    item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
        index 0 namelen 2 name: ..
    item 2 key (256 XATTR_ITEM 3817753667) itemoff 16017 itemsize 94
        location key (0 UNKNOWN.0 0) type XATTR
        transid 50969 data_len 48 name_len 16
        name: security.selinux
        data system_u:object_r:systemd_machined_var_lib_t:s0
total bytes 75161927680
bytes used 23639638016
uuid 2662057f-e6c7-47fa-8af9-ad933a22f6ec

Get a physical address from the Btrfs internal logical address:

# btrfs-map-logical -l 155375550464 /dev/nvme0n1p8
mirror 1 logical 155375550464 physical 1609220096 device /dev/nvme0n1p8


Read that physical address using dd. I want bs=1 because all Btrfs
addresses are in bytes, while dd defaults to 512-byte blocks. And I need
a count of 16384 bytes (16KiB), which is the default leaf size.

# dd if=/dev/nvme0n1p8 skip=1609220096 bs=1 count=16384 2>/dev/null | hexdump -C
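
Since the physical address here happens to be 4096-aligned
(1609220096 = 392876 * 4096), the same read goes much faster with a
larger block size; a sketch:

# dd if=/dev/nvme0n1p8 skip=392876 bs=4096 count=4 2>/dev/null | hexdump -C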



-- 
Chris Murphy


* Re: Unrecoverable scrub errors
       [not found]           ` <CAJCQCtRKPekR+bEE0xaX02Wwz0E_N1wcUxJdc_D-C9By2qMvWw@mail.gmail.com>
@ 2017-11-19  5:13             ` Nazar Mokrynskyi
  2017-11-19  5:23               ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Nazar Mokrynskyi @ 2017-11-19  5:13 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

19.11.17 06:33, Chris Murphy wrote:
> On Sat, Nov 18, 2017 at 8:45 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>> 19.11.17 05:19, Chris Murphy wrote:
>>> On Sat, Nov 18, 2017 at 1:15 AM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>>>> I can assure you that the drive (it is an HDD) is perfectly functional, with 0 SMART errors or warnings, and doesn't have any problems. dmesg is clean in that regard too, so the HDD itself can be excluded from the potential causes.
>>>>
>>>> There were, however, some memory-related issues on my machine a few months ago, so there is a chance that data might have been written incorrectly to the drive back then (I hadn't run a scrub on the backup drive for a long time).
>>>>
>>>> How can I identify which files this metadata belongs to, so that I can replace or simply remove those files?
>>> You might look through the archives about bad RAM and btrfs check
>>> --repair, and include Hugo Mills in the search; I'm pretty sure there
>>> is code in repair that can fix certain kinds of memory-induced
>>> corruption in metadata. But I have no idea if this is that type, or if
>>> repair can make things worse in this case. So I'd say you get
>>> everything off this file system that you want, and then go ahead and
>>> try --repair and see what happens.
>> In this case I'm not sure whether the data was written incorrectly, or the checksum, or both. So I'd like to first identify the affected files, check them manually, and then decide what to do, especially since there are not many errors yet.
>>
>>> One alternative is to just leave it alone. If you're not hitting these
>>> leaves in day to day operation, they won't hurt anything.
>> It was working like that for some time, but I have a suspicion that it occasionally causes spikes of disk activity because of these errors (which is why I ran the scrub in the first place).
>>> Another alternative is to umount, and use btrfs-debug-tree -b  on one
>>> of the leaf/node addresses and see what you get (probably an error),
>>> but it might still also show the node content so we have some idea
>>> what's affected by the error. If it flat out refuses to show the node,
>>> might be a feature request to get a flag that forces display of the
>>> node such as it is...
>> Here is what I've got:
>>
>>> nazar-pc@nazar-pc ~> sudo btrfs-debug-tree -b 470069460992 /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
>>> btrfs-progs v4.13.3
>>> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
>>> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
>>> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
>>> checksum verify failed on 470069460992 found FD171FBB wanted 54C49539
>>> Csum didn't match
>>> ERROR: failed to read 470069460992
>> Looks like I indeed need a --force here.
>>
> Huh, seems overdue. But what do I know?
>
> You can use btrfs-map-logical -l to get a physical address for this
> leaf, and then plug that into dd:
>
> # dd if=/dev/ skip=<physicaladdress> bs=1 count=16384 2>/dev/null | hexdump -C
>
> The gotcha, of course, is that this is not translated into the plainer
> language output of btrfs-debug-tree, so you're in the weeds with the
> on-disk format documentation. But maybe you'll see filenames on the
> right-hand side of the hexdump output, and maybe that's enough... Or
> maybe it's worth computing a csum of that leaf to check against the
> stored csum, which is found in the first field of the leaf. I'd
> expect the csum itself is what's wrong, because if you get memory
> corruption while creating the node, the resulting csum will be *correct*
> for that malformed node and there'd be no csum error; you'd just see
> some other crazy faceplant.

That was eventually useful:

* found some familiar file names (mangled eCryptfs file names from the times when I used it for my home directory) and decided to search for them in old snapshots of the home directory (about 1/3 of the snapshots on that partition)
* the file name was present in snapshots going back to July 2015, but while searching through the snapshot from 2016-10-26_18:47:04 I got an I/O error reported by the find command on one directory
* tried to open that directory in a file manager: same error, it fails to open
* after removing this (let's call it "broken") snapshot I started a new scrub; hopefully it'll finish fine

If it is not actually related to the recent memory issues, I'd be positively surprised. Not sure what happened towards the end of October 2016 though, especially since the backups were on a different physical device back then.
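
For reference, the search itself was roughly this; a sketch, where the
snapshot path and the name pattern are only stand-ins:

for snap in /media/Backup/snapshots/*; do
    # path and pattern are placeholders; find exits non-zero on I/O errors
    sudo find "$snap" -name 'ECRYPTFS_FNEK_ENCRYPTED.*' >/dev/null || echo "error under $snap"
done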

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc



* Re: Unrecoverable scrub errors
  2017-11-19  5:13             ` Nazar Mokrynskyi
@ 2017-11-19  5:23               ` Chris Murphy
  2017-11-19  5:30                 ` Nazar Mokrynskyi
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2017-11-19  5:23 UTC (permalink / raw)
  To: Nazar Mokrynskyi; +Cc: Chris Murphy, Btrfs BTRFS

On Sat, Nov 18, 2017 at 10:13 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:

>
> That was eventually useful:
>
> * found some familiar file names (mangled eCryptfs file names from the times when I used it for my home directory) and decided to search for them in old snapshots of the home directory (about 1/3 of the snapshots on that partition)
> * the file name was present in snapshots going back to July 2015, but while searching through the snapshot from 2016-10-26_18:47:04 I got an I/O error reported by the find command on one directory
> * tried to open that directory in a file manager: same error, it fails to open
> * after removing this (let's call it "broken") snapshot I started a new scrub; hopefully it'll finish fine
>
> If it is not actually related to the recent memory issues, I'd be positively surprised. Not sure what happened towards the end of October 2016 though, especially since the backups were on a different physical device back then.

Wrong csum computation during the transfer? Did you use btrfs send/receive?

-- 
Chris Murphy


* Re: Unrecoverable scrub errors
  2017-11-19  5:23               ` Chris Murphy
@ 2017-11-19  5:30                 ` Nazar Mokrynskyi
  2017-11-19 11:17                   ` Nazar Mokrynskyi
  0 siblings, 1 reply; 13+ messages in thread
From: Nazar Mokrynskyi @ 2017-11-19  5:30 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

19.11.17 07:23, Chris Murphy wrote:
> On Sat, Nov 18, 2017 at 10:13 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>
>> That was eventually useful:
>>
>> * found some familiar file names (mangled eCryptfs file names from the times when I used it for my home directory) and decided to search for them in old snapshots of the home directory (about 1/3 of the snapshots on that partition)
>> * the file name was present in snapshots going back to July 2015, but while searching through the snapshot from 2016-10-26_18:47:04 I got an I/O error reported by the find command on one directory
>> * tried to open that directory in a file manager: same error, it fails to open
>> * after removing this (let's call it "broken") snapshot I started a new scrub; hopefully it'll finish fine
>>
>> If it is not actually related to the recent memory issues, I'd be positively surprised. Not sure what happened towards the end of October 2016 though, especially since the backups were on a different physical device back then.
> Wrong csum computation during the transfer? Did you use btrfs send/receive?

Yes, I've used send/receive to copy snapshots from the primary SSD to the backup HDD.

I'm not sure when the wrong csum computation happened, since the SSD contains only the most recent snapshots and only the HDD has the older ones. Even if the error happened on the SSD, those older snapshots are long gone and there is no way to check.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc



* Re: Unrecoverable scrub errors
  2017-11-19  5:30                 ` Nazar Mokrynskyi
@ 2017-11-19 11:17                   ` Nazar Mokrynskyi
  2017-11-19 20:39                     ` Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 13+ messages in thread
From: Nazar Mokrynskyi @ 2017-11-19 11:17 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Looks like it is not going to resolve nicely.

After removing that problematic snapshot, the filesystem quickly becomes read-only, like so:

> [23552.839055] BTRFS error (device dm-2): cleaner transaction attach returned -30
> [23577.374390] BTRFS info (device dm-2): use lzo compression
> [23577.374391] BTRFS info (device dm-2): disk space caching is enabled
> [23577.374392] BTRFS info (device dm-2): has skinny extents
> [23577.506214] BTRFS info (device dm-2): bdev /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, flush 0, corrupt 24, gen 0
> [23795.026390] BTRFS error (device dm-2): bad tree block start 0 470069510144
> [23795.148193] BTRFS error (device dm-2): bad tree block start 56 470069542912
> [23795.148424] BTRFS warning (device dm-2): dm-2 checksum verify failed on 470069460992 wanted 54C49539 found FD171FBB level 0
> [23795.148526] BTRFS error (device dm-2): bad tree block start 0 470069493760
> [23795.150461] BTRFS error (device dm-2): bad tree block start 1459617832 470069477376
> [23795.639781] BTRFS error (device dm-2): bad tree block start 0 470069510144
> [23795.655487] BTRFS error (device dm-2): bad tree block start 0 470069510144
> [23795.655496] BTRFS: error (device dm-2) in btrfs_drop_snapshot:9244: errno=-5 IO failure
> [23795.655498] BTRFS info (device dm-2): forced readonly
Check and repair don't help either:

> nazar-pc@nazar-pc ~> sudo btrfs check -p /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> Checking filesystem on /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> UUID: 82cfcb0f-0b80-4764-bed6-f529f2030ac5
> Extent back ref already exists for 797694840832 parent 330760175616 root 0 owner 0 offset 0 num_refs 1
> parent transid verify failed on 470072098816 wanted 1431 found 307965
> parent transid verify failed on 470072098816 wanted 1431 found 307965
> parent transid verify failed on 470072098816 wanted 1431 found 307965
> parent transid verify failed on 470072098816 wanted 1431 found 307965
> Ignoring transid failure
> leaf parent key incorrect 470072098816
> bad block 470072098816
>
> ERROR: errors found in extent allocation tree or chunk allocation
> There is no free space entry for 797694844928-797694808064
> There is no free space entry for 797694844928-797819535360
> cache appears valid but isn't 796745793536
> There is no free space entry for 814739984384-814739988480
> There is no free space entry for 814739984384-814999404544
> cache appears valid but isn't 813925662720
> block group 894456299520 has wrong amount of free space
> failed to load free space cache for block group 894456299520
> block group 922910457856 has wrong amount of free space
> failed to load free space cache for block group 922910457856
>
> ERROR: errors found in free space cache
> found 963515335717 bytes used, error(s) found
> total csum bytes: 921699896
> total tree bytes: 20361920512
> total fs tree bytes: 17621073920
> total extent tree bytes: 1629323264
> btree space waste bytes: 3812167723
> file data blocks allocated: 21167059447808
>  referenced 2283091746816
>
> nazar-pc@nazar-pc ~> sudo btrfs check --repair -p /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> enabling repair mode
> Checking filesystem on /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> UUID: 82cfcb0f-0b80-4764-bed6-f529f2030ac5
> Extent back ref already exists for 797694840832 parent 330760175616 root 0 owner 0 offset 0 num_refs 1
> parent transid verify failed on 470072098816 wanted 1431 found 307965
> parent transid verify failed on 470072098816 wanted 1431 found 307965
> parent transid verify failed on 470072098816 wanted 1431 found 307965
> parent transid verify failed on 470072098816 wanted 1431 found 307965
> Ignoring transid failure
> leaf parent key incorrect 470072098816
> bad block 470072098816
>
> ERROR: errors found in extent allocation tree or chunk allocation
> Fixed 0 roots.
> There is no free space entry for 797694844928-797694808064
> There is no free space entry for 797694844928-797819535360
> cache appears valid but isn't 796745793536
> There is no free space entry for 814739984384-814739988480
> There is no free space entry for 814739984384-814999404544
> cache appears valid but isn't 813925662720
> block group 894456299520 has wrong amount of free space
> failed to load free space cache for block group 894456299520
> block group 922910457856 has wrong amount of free space
> failed to load free space cache for block group 922910457856
>
> ERROR: errors found in free space cache
> found 963515335717 bytes used, error(s) found
> total csum bytes: 921699896
> total tree bytes: 20361920512
> total fs tree bytes: 17621073920
> total extent tree bytes: 1629323264
> btree space waste bytes: 3812167723
> file data blocks allocated: 21167059447808
>  referenced 2283091746816
Anything else I can try before starting from scratch?
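
Worst case, I suppose I can still pull individual files off read-only
with btrfs restore before recreating the filesystem; a sketch, with
/mnt/rescue as a hypothetical destination:

# btrfs restore -v /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 /mnt/rescue     (/mnt/rescue is a placeholder)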

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc

19.11.17 07:30, Nazar Mokrynskyi wrote:
> 19.11.17 07:23, Chris Murphy wrote:
>> On Sat, Nov 18, 2017 at 10:13 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>>
>>> That was eventually useful:
>>>
>>> * found some familiar file names (mangled eCryptfs file names from the times when I used it for my home directory) and decided to search for them in old snapshots of the home directory (about 1/3 of the snapshots on that partition)
>>> * the file name was present in snapshots going back to July 2015, but while searching through the snapshot from 2016-10-26_18:47:04 I got an I/O error reported by the find command on one directory
>>> * tried to open that directory in a file manager: same error, it fails to open
>>> * after removing this (let's call it "broken") snapshot I started a new scrub; hopefully it'll finish fine
>>>
>>> If it is not actually related to the recent memory issues, I'd be positively surprised. Not sure what happened towards the end of October 2016 though, especially since the backups were on a different physical device back then.
>> Wrong csum computation during the transfer? Did you use btrfs send/receive?
> Yes, I've used send/receive to copy snapshots from the primary SSD to the backup HDD.
>
> I'm not sure when the wrong csum computation happened, since the SSD contains only the most recent snapshots and only the HDD has the older ones. Even if the error happened on the SSD, those older snapshots are long gone and there is no way to check.
>
> Sincerely, Nazar Mokrynskyi
> github.com/nazar-pc
>


* Re: Unrecoverable scrub errors
  2017-11-19 11:17                   ` Nazar Mokrynskyi
@ 2017-11-19 20:39                     ` Roy Sigurd Karlsbakk
  2017-11-19 21:18                       ` Nazar Mokrynskyi
  0 siblings, 1 reply; 13+ messages in thread
From: Roy Sigurd Karlsbakk @ 2017-11-19 20:39 UTC (permalink / raw)
  To: Nazar Mokrynskyi; +Cc: Chris Murphy, linux-btrfs

I guess not using RAID-0 would be a good start…

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Carve the good into stone; write the bad in snow.

----- Original Message -----
> From: "Nazar Mokrynskyi" <nazar@mokrynskyi.com>
> To: "Chris Murphy" <lists@colorremedies.com>
> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Sunday, 19 November, 2017 12:17:36
> Subject: Re: Unrecoverable scrub errors

> Looks like it is not going to resolve nicely.
> 
> After removing that problematic snapshot, the filesystem quickly becomes
> read-only, like so:
> 
>> [23552.839055] BTRFS error (device dm-2): cleaner transaction attach returned
>> -30
>> [23577.374390] BTRFS info (device dm-2): use lzo compression
>> [23577.374391] BTRFS info (device dm-2): disk space caching is enabled
>> [23577.374392] BTRFS info (device dm-2): has skinny extents
>> [23577.506214] BTRFS info (device dm-2): bdev
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0,
>> flush 0, corrupt 24, gen 0
>> [23795.026390] BTRFS error (device dm-2): bad tree block start 0 470069510144
>> [23795.148193] BTRFS error (device dm-2): bad tree block start 56 470069542912
>> [23795.148424] BTRFS warning (device dm-2): dm-2 checksum verify failed on
>> 470069460992 wanted 54C49539 found FD171FBB level 0
>> [23795.148526] BTRFS error (device dm-2): bad tree block start 0 470069493760
>> [23795.150461] BTRFS error (device dm-2): bad tree block start 1459617832
>> 470069477376
>> [23795.639781] BTRFS error (device dm-2): bad tree block start 0 470069510144
>> [23795.655487] BTRFS error (device dm-2): bad tree block start 0 470069510144
>> [23795.655496] BTRFS: error (device dm-2) in btrfs_drop_snapshot:9244: errno=-5
>> IO failure
>> [23795.655498] BTRFS info (device dm-2): forced readonly
> Check and repair don't help either:
> 
>> nazar-pc@nazar-pc ~> sudo btrfs check -p
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
>> Checking filesystem on
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
>> UUID: 82cfcb0f-0b80-4764-bed6-f529f2030ac5
>> Extent back ref already exists for 797694840832 parent 330760175616 root 0 owner
>> 0 offset 0 num_refs 1
>> parent transid verify failed on 470072098816 wanted 1431 found 307965
>> parent transid verify failed on 470072098816 wanted 1431 found 307965
>> parent transid verify failed on 470072098816 wanted 1431 found 307965
>> parent transid verify failed on 470072098816 wanted 1431 found 307965
>> Ignoring transid failure
>> leaf parent key incorrect 470072098816
>> bad block 470072098816
>>
>> ERROR: errors found in extent allocation tree or chunk allocation
>> There is no free space entry for 797694844928-797694808064
>> There is no free space entry for 797694844928-797819535360
>> cache appears valid but isn't 796745793536
>> There is no free space entry for 814739984384-814739988480
>> There is no free space entry for 814739984384-814999404544
>> cache appears valid but isn't 813925662720
>> block group 894456299520 has wrong amount of free space
>> failed to load free space cache for block group 894456299520
>> block group 922910457856 has wrong amount of free space
>> failed to load free space cache for block group 922910457856
>>
>> ERROR: errors found in free space cache
>> found 963515335717 bytes used, error(s) found
>> total csum bytes: 921699896
>> total tree bytes: 20361920512
>> total fs tree bytes: 17621073920
>> total extent tree bytes: 1629323264
>> btree space waste bytes: 3812167723
>> file data blocks allocated: 21167059447808
>>  referenced 2283091746816
>>
>> nazar-pc@nazar-pc ~> sudo btrfs check --repair -p
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
>> enabling repair mode
>> Checking filesystem on
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
>> UUID: 82cfcb0f-0b80-4764-bed6-f529f2030ac5
>> Extent back ref already exists for 797694840832 parent 330760175616 root 0 owner
>> 0 offset 0 num_refs 1
>> parent transid verify failed on 470072098816 wanted 1431 found 307965
>> parent transid verify failed on 470072098816 wanted 1431 found 307965
>> parent transid verify failed on 470072098816 wanted 1431 found 307965
>> parent transid verify failed on 470072098816 wanted 1431 found 307965
>> Ignoring transid failure
>> leaf parent key incorrect 470072098816
>> bad block 470072098816
>>
>> ERROR: errors found in extent allocation tree or chunk allocation
>> Fixed 0 roots.
>> There is no free space entry for 797694844928-797694808064
>> There is no free space entry for 797694844928-797819535360
>> cache appears valid but isn't 796745793536
>> There is no free space entry for 814739984384-814739988480
>> There is no free space entry for 814739984384-814999404544
>> cache appears valid but isn't 813925662720
>> block group 894456299520 has wrong amount of free space
>> failed to load free space cache for block group 894456299520
>> block group 922910457856 has wrong amount of free space
>> failed to load free space cache for block group 922910457856
>>
>> ERROR: errors found in free space cache
>> found 963515335717 bytes used, error(s) found
>> total csum bytes: 921699896
>> total tree bytes: 20361920512
>> total fs tree bytes: 17621073920
>> total extent tree bytes: 1629323264
>> btree space waste bytes: 3812167723
>> file data blocks allocated: 21167059447808
>>  referenced 2283091746816
> Anything else I can try before starting from scratch?
> 
> Sincerely, Nazar Mokrynskyi
> github.com/nazar-pc
> 
> 19.11.17 07:30, Nazar Mokrynskyi wrote:
>> 19.11.17 07:23, Chris Murphy wrote:
>>> On Sat, Nov 18, 2017 at 10:13 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>>>
>>>> That was eventually useful:
>>>>
>>>> * found some familiar file names (mangled eCryptfs file names from the
>>>> times when I used it for my home directory) and decided to search for
>>>> them in old snapshots of the home directory (about 1/3 of the snapshots
>>>> on that partition)
>>>> * the file name was present in snapshots going back to July 2015, but
>>>> while searching through the snapshot from 2016-10-26_18:47:04 I got an
>>>> I/O error reported by the find command on one directory
>>>> * tried to open that directory in a file manager: same error, it fails
>>>> to open
>>>> * after removing this (let's call it "broken") snapshot I started a new
>>>> scrub; hopefully it'll finish fine
>>>>
>>>> If it is not actually related to the recent memory issues, I'd be
>>>> positively surprised. Not sure what happened towards the end of October
>>>> 2016 though, especially since the backups were on a different physical
>>>> device back then.
>>> Wrong csum computation during the transfer? Did you use btrfs send/receive?
>> Yes, I've used send/receive to copy snapshots from the primary SSD to the backup HDD.
>>
>> I'm not sure when the wrong csum computation happened, since the SSD
>> contains only the most recent snapshots and only the HDD has the older
>> ones. Even if the error happened on the SSD, those older snapshots are
>> long gone and there is no way to check.
>>
>> Sincerely, Nazar Mokrynskyi
>> github.com/nazar-pc
>>


* Re: Unrecoverable scrub errors
  2017-11-19 20:39                     ` Roy Sigurd Karlsbakk
@ 2017-11-19 21:18                       ` Nazar Mokrynskyi
  0 siblings, 0 replies; 13+ messages in thread
From: Nazar Mokrynskyi @ 2017-11-19 21:18 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: linux-btrfs

This particular partition was initially created in July 2015. I've added/removed drives a few times when migrating from older to newer hardware, but I never used RAID0 or any other RAID level on this filesystem.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc

19.11.17 22:39, Roy Sigurd Karlsbakk wrote:
> I guess not using RAID-0 would be a good start…
>
> Best regards
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 98013356
> http://blogg.karlsbakk.net/
> GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
> --
> Carve the good into stone; write the bad in snow.

