* FIDEDUPERANGE woes may continue (or unrelated issue?)
@ 2020-02-22  6:18 halfdog
  2020-02-22  7:22 ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: halfdog @ 2020-02-22  6:18 UTC (permalink / raw)
  To: linux-btrfs

Hello list,

Thanks for the fix to the bug reported in "FIDEDUPERANGE woes".

I am now running a kernel that includes that fix. I verified that
deduplication now also works at file ends. Deduplication also picks
up the unfragmented version of a file in cases where the second file
had been left fragmented by the earlier bug, with a separate single
tail-block extent.
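
(For reference, a minimal sketch of what the old, unconditional
rounding did to the 155489-byte reproducer file quoted below,
assuming the 4096-byte sector size used there:)

    # Old behaviour: dedupe length rounded down to the sector size
    # even when the range ended at EOF.
    length = 155489                        # reproducer file size
    sector = 4096
    aligned = (length // sector) * sector  # 151552 bytes actually deduped
    tail = length - aligned                # 3937 data bytes left in a
                                           # separate one-block tail extent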

Looking at the statistics, I noticed that "duperemove" did not
submit all small files for deduplication, for an unknown reason
(not relevant to the current problem though).

So I modified the process to submit all identical files and started
running it on a drive whose small files were badly fragmented by the
earlier bug. The run aborted in the middle of the night with the
following errors:

[53011.819395] BTRFS error (device dm-1): bad tree block start, want 90184380416 have 5483876589805514108
[53011.819426] BTRFS: error (device dm-1) in __btrfs_free_extent:3080: errno=-5 IO failure
[53011.819434] BTRFS info (device dm-1): forced readonly
[53011.819441] BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188: errno=-5 IO failure
[53624.948618] BTRFS error (device dm-1): parent transid verify failed on 67651596288 wanted 27538 found 27200
[53651.093717] BTRFS error (device dm-1): parent transid verify failed on 67651596288 wanted 27538 found 27200
[53706.250680] BTRFS error (device dm-1): parent transid verify failed on 67651596288 wanted 27538 found 27200

Could this be related to the previous fix, or is it more likely
just a random hardware/software bug at a weird time?

I also tried to run btrfs check on it, but it fails to open the
device and thus never attempts a repair ...

# btrfs check --progress --clear-space-cache v1 --clear-space-cache v2 --super 1 [device]
using SB copy 1, bytenr 67108864
Opening filesystem to check...
checksum verify failed on 90184388608 found 00000000 wanted FFFFFF96
checksum verify failed on 90184388608 found 00000000 wanted FFFFFF96
bad tree block 90184388608, bytenr mismatch, want=90184388608, have=15484548251864346855
ERROR: failed to read block groups: Input/output error
ERROR: cannot open file system

Any ideas?


PS: I have a database of checksums for all files on another drive,
in case comparing file content would be useful.


Filipe Manana writes:
> On Sat, Dec 14, 2019 at 5:46 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Fri, Dec 13, 2019 at 05:32:14PM +0800, Qu Wenruo wrote:
> > >
> > >
> > > On 2019/12/13 上午12:15, halfdog wrote:
> > > > Hello list,
> > > >
> > > > Using btrfs on
> > > >
> > > > Linux version 5.3.0-2-amd64 (debian-kernel@lists.debian.org) (gcc version 9.2.1 20191109 (Debian 9.2.1-19)) #1 SMP Debian 5.3.9-3 (2019-11-19)
> > > >
> > > > the FIDEDUPERANGE exposes weird behaviour on two identical but
> > > > not too large files that seems to be depending on the file size.
> > > > Before FIDEDUPERANGE both files have a single extent, afterwards
> > > > first file is still single extent, second file has all bytes sharing
> > > > with the extent of the first file except for the last 4096 bytes.
> > > >
> > > > Is there anything known about a bug fixed since the above mentioned
> > > > kernel version?
> > > >
> > > >
> > > >
> > > > If no, does following reproducer still show the same behaviour
> > > > on current Linux kernel (my Python test tools also attached)?
> > > >
> > > >> dd if=/dev/zero bs=1M count=32 of=disk
> > > >> mkfs.btrfs --mixed --metadata single --data single --nodesize 4096 disk
> > > >> mount disk /mnt/test
> > > >> mkdir /mnt/test/x
> > > >> dd bs=1 count=155489 if=/dev/urandom of=/mnt/test/x/file-0
> > >
> > > 155489 is not sector size aligned, thus the last extent will be padded
> > > with zero.
> > >
> > > >> cat /mnt/test/x/file-0 > /mnt/test/x/file-1
> > >
> > > Same for the new file.
> > >
> > > For the tailing padding part, it's not aligned, and it's smaller than
> > > the inode size.
> > >
> > > Thus we won't dedupe that tailing part.
> >
> > We definitely *must* dedupe the tailing part on btrfs; otherwise, we can't
> > eliminate the reference to the last (partial) block in the last extent of
> > the file, and there is no way to dedupe the _entire_ file in this example.
> > It does pretty bad things to dedupe hit rates on uncompressed contiguous
> > files, where you can lose an average of 64MB of space per file.
> >
> > I had been wondering why dedupe scores seemed so low on recent kernels,
> > and this bug would certainly contribute to that.
> >
> > It worked in 4.20, broken in 5.0.  My guess is commit
> > 34a28e3d77535efc7761aa8d67275c07d1fe2c58 ("Btrfs: use
> > generic_remap_file_range_prep() for cloning and deduplication") but I
> > haven't run a test to confirm.
>
> The problem comes from the generic (vfs) code, which always rounds
> down the deduplication length to a multiple of the filesystem's sector
> size.
> That should be done only when the destination range's end does not
> match the destination's file size, to avoid the corruption I found
> over a year ago [1].
> For some odd reason it has the correct behavior for cloning, it only
> rounds down the destination range's end is less then the destination's
> file size.
>
> I'll see if I get that fixed next week.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0
>
> >
> >
> > > Thanks,
> > > Qu
> > >
> > > >
> > > >> ./SimpleIndexer x > x.json
> > > >> ./IndexDeduplicationAnalyzer --IndexFile /mnt/test/x.json /mnt/test/x > dedup.list
> > > > Got dict: {b'/mnt/test/x/file-0': [(0, 5316608, 155648)], b'/mnt/test/x/file-1': [(0, 5472256, 155648)]}
> > > > ...
> > > >> strace -s256 -f btrfs-extent-same 155489 /mnt/test/x/file-0 0 /mnt/test/x/file-1 0 2>&1 | grep -E -e FIDEDUPERANGE
> > > > ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE, {src_offset=0, src_length=155489, dest_count=1, info=[{dest_fd=4, dest_offset=0}]} => {info=[{bytes_deduped=155489, status=0}]}) = 0
> > > >> ./IndexDeduplicationAnalyzer --IndexFile /mnt/test/x.json /mnt/test/x > dedup.list
> > > > Got dict: {b'/mnt/test/x/file-0': [(0, 5316608, 155648)], b'/mnt/test/x/file-1': [(0, 5316608, 151552), (151552, 5623808, 4096)]}
> > > > ...
> > > >> strace -s256 -f btrfs-extent-same 155489 /mnt/test/x/file-0 0 /mnt/test/x/file-1 0 2>&1 | grep -E -e FIDEDUPERANGE
> > > > ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE, {src_offset=0, src_length=155489, dest_count=1, info=[{dest_fd=4, dest_offset=0}]} => {info=[{bytes_deduped=155489, status=0}]}) = 0
> > > >> strace -s256 -f btrfs-extent-same 4096 /mnt/test/x/file-0 151552 /mnt/test/x/file-1 151552 2>&1 | grep -E -e FIDEDUPERANGE
> > > > ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE, {src_offset=151552, src_length=4096, dest_count=1, info=[{dest_fd=4, dest_offset=151552}]}) = -1 EINVAL (Invalid argument)
> > > >
> > >
> >
> >
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-02-22  6:18 FIDEDUPERANGE woes may continue (or unrelated issue?) halfdog
@ 2020-02-22  7:22 ` Qu Wenruo
  2020-02-22 18:30   ` halfdog
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-02-22  7:22 UTC (permalink / raw)
  To: halfdog, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 7800 bytes --]



On 2020/2/22 下午2:18, halfdog wrote:
> Hello list,
> 
> Thanks for the fix to the bug reported in "FIDEDUPERANGE woes".
> 
> I now got a kernel including that fix. I verified, that deduplication
> now also worked on file ends. Deduplication also picked up the
> unfragmented version of a file in cases, where the second file
> was fragmented due to the bug having a single tail block extent.
> 
> Looking now at the statistics, I noticed that the "duperemove"
> did not submit all small files to deduplication for unknown
> reason (also not relevant for the current problem).
> 
> So I modified the process to submit all identical files and started
> to run it on a drive with bad fragmentation of small files due to
> earlier bug. This procedure terminated middle of night with
> following errors:
> 
> [53011.819395] BTRFS error (device dm-1): bad tree block start, want 90184380416 have 5483876589805514108
> [53011.819426] BTRFS: error (device dm-1) in __btrfs_free_extent:3080: errno=-5 IO failure

Extent tree already screwed up.

This means your fs is already screwed up.

> [53011.819434] BTRFS info (device dm-1): forced readonly
> [53011.819441] BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188: errno=-5 IO failure
> [53624.948618] BTRFS error (device dm-1): parent transid verify failed on 67651596288 wanted 27538 found 27200
> [53651.093717] BTRFS error (device dm-1): parent transid verify failed on 67651596288 wanted 27538 found 27200
> [53706.250680] BTRFS error (device dm-1): parent transid verify failed on 67651596288 wanted 27538 found 27200

Transid error is not a good sign either.

> 
> Could this be related to the previous fix or is this more likely
> just a random hardware/software bug at a weird time?

Not sure, but I don't think the reflink-beyond-eof fix is related to the
bug, as the main problem is the tree block COW.

Normally such transid corruption is related to extent tree corruption,
which matches the result from btrfs check.

Thanks,
Qu

> 
> I also tried to apply the btrfs-check to that, but it fails opening
> the device, thus not attempting to repair it ...
> 
> # btrfs check --progress --clear-space-cache v1 --clear-space-cache v2 --super 1 [device]
> using SB copy 1, bytenr 67108864
> Opening filesystem to check...
> checksum verify failed on 90184388608 found 00000000 wanted FFFFFF96
> checksum verify failed on 90184388608 found 00000000 wanted FFFFFF96
> bad tree block 90184388608, bytenr mismatch, want=90184388608, have=15484548251864346855
> ERROR: failed to read block groups: Input/output error
> ERROR: cannot open file system
> 
> Any ideas?
> 
> 
> PS: I have a database of all file checksums on another drive
> for comparison of file content, if useful.
> 
> 
> Filipe Manana writes:
>> On Sat, Dec 14, 2019 at 5:46 AM Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> wrote:
>>>
>>> On Fri, Dec 13, 2019 at 05:32:14PM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2019/12/13 =E4=B8=8A=E5=8D=8812:15, halfdog wrote:
>>>>> Hello list,
>>>>>
>>>>> Using btrfs on
>>>>>
>>>>> Linux version 5.3.0-2-amd64 (debian-kernel@lists.debian.org) (gcc ver=
>> sion 9.2.1 20191109 (Debian 9.2.1-19)) #1 SMP Debian 5.3.9-3 (2019-11-19)
>>>>>
>>>>> the FIDEDUPERANGE exposes weird behaviour on two identical but
>>>>> not too large files that seems to be depending on the file size.
>>>>> Before FIDEDUPERANGE both files have a single extent, afterwards
>>>>> first file is still single extent, second file has all bytes sharing
>>>>> with the extent of the first file except for the last 4096 bytes.
>>>>>
>>>>> Is there anything known about a bug fixed since the above mentioned
>>>>> kernel version?
>>>>>
>>>>>
>>>>>
>>>>> If no, does following reproducer still show the same behaviour
>>>>> on current Linux kernel (my Python test tools also attached)?
>>>>>
>>>>>> dd if=3D/dev/zero bs=3D1M count=3D32 of=3Ddisk
>>>>>> mkfs.btrfs --mixed --metadata single --data single --nodesize 4096 d=
>> isk
>>>>>> mount disk /mnt/test
>>>>>> mkdir /mnt/test/x
>>>>>> dd bs=3D1 count=3D155489 if=3D/dev/urandom of=3D/mnt/test/x/file-0
>>>>
>>>> 155489 is not sector size aligned, thus the last extent will be padded
>>>> with zero.
>>>>
>>>>>> cat /mnt/test/x/file-0 > /mnt/test/x/file-1
>>>>
>>>> Same for the new file.
>>>>
>>>> For the tailing padding part, it's not aligned, and it's smaller than
>>>> the inode size.
>>>>
>>>> Thus we won't dedupe that tailing part.
>>>
>>> We definitely *must* dedupe the tailing part on btrfs; otherwise, we can'=
>> t
>>> eliminate the reference to the last (partial) block in the last extent of
>>> the file, and there is no way to dedupe the _entire_ file in this example=
>> .
>>> It does pretty bad things to dedupe hit rates on uncompressed contiguous
>>> files, where you can lose an average of 64MB of space per file.
>>>
>>> I had been wondering why dedupe scores seemed so low on recent kernels,
>>> and this bug would certainly contribute to that.
>>>
>>> It worked in 4.20, broken in 5.0.  My guess is commit
>>> 34a28e3d77535efc7761aa8d67275c07d1fe2c58 ("Btrfs: use
>>> generic_remap_file_range_prep() for cloning and deduplication") but I
>>> haven't run a test to confirm.
>>
>> The problem comes from the generic (vfs) code, which always rounds
>> down the deduplication length to a multiple of the filesystem's sector
>> size.
>> That should be done only when the destination range's end does not
>> match the destination's file size, to avoid the corruption I found
>> over a year ago [1].
>> For some odd reason it has the correct behavior for cloning, it only
>> rounds down the destination range's end is less then the destination's
>> file size.
>>
>> I'll see if I get that fixed next week.
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/comm=
>> it/?id=3Dde02b9f6bb65a6a1848f346f7a3617b7a9b930c0
>>
>>>
>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>>> ./SimpleIndexer x > x.json
>>>>>> ./IndexDeduplicationAnalyzer --IndexFile /mnt/test/x.json /mnt/test/=
>> x > dedup.list
>>>>> Got dict: {b'/mnt/test/x/file-0': [(0, 5316608, 155648)], b'/mnt/test=
>> /x/file-1': [(0, 5472256, 155648)]}
>>>>> ...
>>>>>> strace -s256 -f btrfs-extent-same 155489 /mnt/test/x/file-0 0 /mnt/t=
>> est/x/file-1 0 2>&1 | grep -E -e FIDEDUPERANGE
>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE, {src_offset=3D0=
>> , src_length=3D155489, dest_count=3D1, info=3D[{dest_fd=3D4, dest_offset=3D=
>> 0}]} =3D> {info=3D[{bytes_deduped=3D155489, status=3D0}]}) =3D 0
>>>>>> ./IndexDeduplicationAnalyzer --IndexFile /mnt/test/x.json /mnt/test/=
>> x > dedup.list
>>>>> Got dict: {b'/mnt/test/x/file-0': [(0, 5316608, 155648)], b'/mnt/test=
>> /x/file-1': [(0, 5316608, 151552), (151552, 5623808, 4096)]}
>>>>> ...
>>>>>> strace -s256 -f btrfs-extent-same 155489 /mnt/test/x/file-0 0 /mnt/t=
>> est/x/file-1 0 2>&1 | grep -E -e FIDEDUPERANGE
>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE, {src_offset=3D0=
>> , src_length=3D155489, dest_count=3D1, info=3D[{dest_fd=3D4, dest_offset=3D=
>> 0}]} =3D> {info=3D[{bytes_deduped=3D155489, status=3D0}]}) =3D 0
>>>>>> strace -s256 -f btrfs-extent-same 4096 /mnt/test/x/file-0 151552 /mn=
>> t/test/x/file-1 151552 2>&1 | grep -E -e FIDEDUPERANGE
>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE, {src_offset=3D1=
>> 51552, src_length=3D4096, dest_count=3D1, info=3D[{dest_fd=3D4, dest_offset=
>> =3D151552}]}) =3D -1 EINVAL (Invalid argument)
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --=20
>> Filipe David Manana,
>>
>> =E2=80=9CWhether you think you can, or you think you can't =E2=80=94 you're=
>>  right.=E2=80=9D
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-02-22  7:22 ` Qu Wenruo
@ 2020-02-22 18:30   ` halfdog
  2020-03-24  8:27     ` halfdog
  0 siblings, 1 reply; 13+ messages in thread
From: halfdog @ 2020-02-22 18:30 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo writes:
> On 2020/2/22 下午2:18, halfdog wrote:
>> Hello list,
>>
>> Thanks for the fix to the bug reported in "FIDEDUPERANGE woes".
>>
>> I now got a kernel including that fix. I verified, that
>> deduplication now also worked on file ends. Deduplication
>> also picked up the unfragmented version of a file in cases,
>> where the second file was fragmented due to the bug having
>> a single tail block extent.
>>
>> Looking now at the statistics, I noticed that the "duperemove"
>> did not submit all small files to deduplication for unknown
>> reason (also not relevant for the current problem).
>>
>> So I modified the process to submit all identical files and
>> started to run it on a drive with bad fragmentation of small
>> files due to earlier bug. This procedure terminated middle
>> of night with following errors:
>>
>> [53011.819395] BTRFS error (device dm-1): bad tree block start,
>> want 90184380416 have 5483876589805514108 [53011.819426] BTRFS:
>> error (device dm-1) in __btrfs_free_extent:3080: errno=-5
>> IO failure
>
> Extent tree already screwed up.
>
> This means your fs is already screwed up.

This is what I assumed. It is just a strange coincidence that this
happened exactly the first time after a fsck that I used the disk
to run massive deduplication, on a filesystem where roughly every
file exists in about 20 copies, with at least one duplicate for
each set of identical files.

>> [53011.819434] BTRFS info (device dm-1): forced readonly
>> [53011.819441] BTRFS: error (device dm-1) in
>> btrfs_run_delayed_refs:2188: errno=-5 IO failure [53624.948618]
>> BTRFS error (device dm-1): parent transid verify failed on
>> 67651596288 wanted 27538 found 27200 [53651.093717] BTRFS
>> error (device dm-1): parent transid verify failed on 67651596288
>> wanted 27538 found 27200 [53706.250680] BTRFS error (device
>> dm-1): parent transid verify failed on 67651596288 wanted
>> 27538 found 27200
>
> Transid error is not a good sign either.
>
>>
>> Could this be related to the previous fix or is this more
>> likely just a random hardware/software bug at a weird time?
>
> Not sure, but I don't think the reflink-beyond-eof fix is related
> to the bug, as the main problem is the tree block COW.
>
> Normally such transid corruption is related to extent tree
> corruption, which matches the result from btrfs check.

So it seems, but the reason is unclear. At least no hardware
failures (SMART, ECC, processor temperature) and no bus transfer
errors were reported, so this might be a software fault.

As the data on the disk is not really needed and the disk is
currently cold with no need to bring it online soon, it could be
used for debugging the cause of the corruption.

If there is any interest in that, I could try to assist, e.g. by
creating a VM with a trunk kernel and tools for debugging and
executing the tests. Otherwise I will just flatten the filesystem
and reinitialize the disk.


>> I also tried to apply the btrfs-check to that, but it fails
>> opening the device, thus not attempting to repair it ...
>>
>> # btrfs check --progress --clear-space-cache v1 --clear-space-cache
>> v2 --super 1 [device] using SB copy 1, bytenr 67108864 Opening
>> filesystem to check... checksum verify failed on 90184388608
>> found 00000000 wanted FFFFFF96 checksum verify failed on
>> 90184388608 found 00000000 wanted FFFFFF96 bad tree block
>> 90184388608, bytenr mismatch, want=90184388608,
>> have=15484548251864346855 ERROR: failed to read block groups:
>> Input/output error ERROR: cannot open file system
>>
>> Any ideas?
>>
>>
>> PS: I have a database of all file checksums on another drive
>> for comparison of file content, if useful.
>>
>>
>> Filipe Manana writes:
>>> On Sat, Dec 14, 2019 at 5:46 AM Zygo Blaxell
>>> <ce3g8jdj@umail.furryterror.org> wrote:
>>>>
>>>> On Fri, Dec 13, 2019 at 05:32:14PM +0800, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2019/12/13 =E4=B8=8A=E5=8D=8812:15, halfdog wrote:
>>>>>> Hello list,
>>>>>>
>>>>>> Using btrfs on
>>>>>>
>>>>>> Linux version 5.3.0-2-amd64 (debian-kernel@lists.debian.org)
>>>>>> (gcc ver=
>>> sion 9.2.1 20191109 (Debian 9.2.1-19)) #1 SMP Debian 5.3.9-3
>>> (2019-11-19)
>>>>>>
>>>>>> the FIDEDUPERANGE exposes weird behaviour on two identical
>>>>>> but not too large files that seems to be depending on
>>>>>> the file size. Before FIDEDUPERANGE both files have a
>>>>>> single extent, afterwards first file is still single extent,
>>>>>> second file has all bytes sharing with the extent of the
>>>>>> first file except for the last 4096 bytes.
>>>>>>
>>>>>> Is there anything known about a bug fixed since the above
>>>>>> mentioned kernel version?
>>>>>>
>>>>>>
>>>>>>
>>>>>> If no, does following reproducer still show the same behaviour
>>>>>> on current Linux kernel (my Python test tools also attached)?
>>>>>>
>>>>>>> dd if=3D/dev/zero bs=3D1M count=3D32 of=3Ddisk mkfs.btrfs
>>>>>>> --mixed --metadata single --data single --nodesize 4096
>>>>>>> d=
>>> isk
>>>>>>> mount disk /mnt/test mkdir /mnt/test/x dd bs=3D1
>>>>>>> count=3D155489 if=3D/dev/urandom of=3D/mnt/test/x/file-0
>>>>>
>>>>> 155489 is not sector size aligned, thus the last extent
>>>>> will be padded with zero.
>>>>>
>>>>>>> cat /mnt/test/x/file-0 > /mnt/test/x/file-1
>>>>>
>>>>> Same for the new file.
>>>>>
>>>>> For the tailing padding part, it's not aligned, and it's
>>>>> smaller than the inode size.
>>>>>
>>>>> Thus we won't dedupe that tailing part.
>>>>
>>>> We definitely *must* dedupe the tailing part on btrfs; otherwise,
>>>> we can'=
>>> t
>>>> eliminate the reference to the last (partial) block in the
>>>> last extent of the file, and there is no way to dedupe the
>>>> _entire_ file in this example=
>>> .
>>>> It does pretty bad things to dedupe hit rates on uncompressed
>>>> contiguous files, where you can lose an average of 64MB
>>>> of space per file.
>>>>
>>>> I had been wondering why dedupe scores seemed so low on
>>>> recent kernels, and this bug would certainly contribute
>>>> to that.
>>>>
>>>> It worked in 4.20, broken in 5.0.  My guess is commit
>>>> 34a28e3d77535efc7761aa8d67275c07d1fe2c58 ("Btrfs: use
>>>> generic_remap_file_range_prep() for cloning and deduplication")
>>>> but I haven't run a test to confirm.
>>>
>>> The problem comes from the generic (vfs) code, which always
>>> rounds down the deduplication length to a multiple of the
>>> filesystem's sector size. That should be done only when the
>>> destination range's end does not match the destination's
>>> file size, to avoid the corruption I found over a year ago
>>> [1]. For some odd reason it has the correct behavior for
>>> cloning, it only rounds down the destination range's end
>>> is less then the destination's file size.
>>>
>>> I'll see if I get that fixed next week.
>>>
>>> [1]
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/comm=
>>> it/?id=3Dde02b9f6bb65a6a1848f346f7a3617b7a9b930c0
>>>
>>>>
>>>>
>>>>> Thanks, Qu
>>>>>
>>>>>>
>>>>>>> ./SimpleIndexer x > x.json ./IndexDeduplicationAnalyzer
>>>>>>> --IndexFile /mnt/test/x.json /mnt/test/=
>>> x > dedup.list
>>>>>> Got dict: {b'/mnt/test/x/file-0': [(0, 5316608, 155648)],
>>>>>> b'/mnt/test=
>>> /x/file-1': [(0, 5472256, 155648)]}
>>>>>> ...
>>>>>>> strace -s256 -f btrfs-extent-same 155489 /mnt/test/x/file-0
>>>>>>> 0 /mnt/t=
>>> est/x/file-1 0 2>&1 | grep -E -e FIDEDUPERANGE
>>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
>>>>>> {src_offset=3D0=
>>> , src_length=3D155489, dest_count=3D1, info=3D[{dest_fd=3D4,
>>> dest_offset=3D= 0}]} =3D> {info=3D[{bytes_deduped=3D155489,
>>> status=3D0}]}) =3D 0
>>>>>>> ./IndexDeduplicationAnalyzer --IndexFile /mnt/test/x.json
>>>>>>> /mnt/test/=
>>> x > dedup.list
>>>>>> Got dict: {b'/mnt/test/x/file-0': [(0, 5316608, 155648)],
>>>>>> b'/mnt/test=
>>> /x/file-1': [(0, 5316608, 151552), (151552, 5623808, 4096)]}
>>>>>> ...
>>>>>>> strace -s256 -f btrfs-extent-same 155489 /mnt/test/x/file-0
>>>>>>> 0 /mnt/t=
>>> est/x/file-1 0 2>&1 | grep -E -e FIDEDUPERANGE
>>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
>>>>>> {src_offset=3D0=
>>> , src_length=3D155489, dest_count=3D1, info=3D[{dest_fd=3D4,
>>> dest_offset=3D= 0}]} =3D> {info=3D[{bytes_deduped=3D155489,
>>> status=3D0}]}) =3D 0
>>>>>>> strace -s256 -f btrfs-extent-same 4096 /mnt/test/x/file-0
>>>>>>> 151552 /mn=
>>> t/test/x/file-1 151552 2>&1 | grep -E -e FIDEDUPERANGE
>>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
>>>>>> {src_offset=3D1=
>>> 51552, src_length=3D4096, dest_count=3D1, info=3D[{dest_fd=3D4,
>>> dest_offset= =3D151552}]}) =3D -1 EINVAL (Invalid argument)
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --=20 Filipe David Manana,
>>>
>>> =E2=80=9CWhether you think you can, or you think you can't
>>> =E2=80=94 you're= right.=E2=80=9D
>>
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-02-22 18:30   ` halfdog
@ 2020-03-24  8:27     ` halfdog
  2020-03-25  3:53       ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: halfdog @ 2020-03-24  8:27 UTC (permalink / raw)
  To: linux-btrfs

Hello list,

It seems the woes really did continue ... After trashing the old,
corrupted filesystem (see the old message below) I started rebuilding
the storage. Synchronization from another (still working) storage
should have performed roughly the same actions as the initial build
(apart from the number and timing of mounts/unmounts, transfer
interruptions, ...).

It does not seem to be a mere coincidence that the corruption
occurred while deduplicating the exact same file as last time.
While the corruption last time made the disk completely inaccessible,
this time the filesystem was just remounted read-only with a
different error message:

[156603.177699] BTRFS error (device dm-1): parent transid verify failed on 6680428544 wanted 12947 found 12945
[156603.177707] BTRFS: error (device dm-1) in __btrfs_free_extent:3080: errno=-5 IO failure
[156603.177708] BTRFS info (device dm-1): forced readonly
[156603.177711] BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188: errno=-5 IO failure

As the bug seems to be somehow reproducible, I would like to try
to develop a reproducer and a fix for it as an exercise.
Unfortunately the fault only occurs after transferring and
deduplicating ~20TB of data.

Are there any recommendations, e.g. on how to "bisect" the problem?
Is there a way (a switch or a source code modification) to log all
internal btrfs state transitions for later analysis? Any other ideas
for debugging this? Just creating the same number of snapshots and
putting only that single file into each of them did not trigger the
bug during deduplication.

Kind regards,
hd



halfdog writes:
> Qu Wenruo writes:
>> On 2020/2/22 下午2:18, halfdog wrote:
>>> Hello list,
>>>
>>> Thanks for the fix to the bug reported in "FIDEDUPERANGE
>>> woes".
>>>
>>> I now got a kernel including that fix. I verified, that
>>> deduplication now also worked on file ends. Deduplication
>>> also picked up the unfragmented version of a file in cases,
>>> where the second file was fragmented due to the bug having
>>> a single tail block extent.
>>>
>>> Looking now at the statistics, I noticed that the "duperemove"
>>> did not submit all small files to deduplication for unknown
>>> reason (also not relevant for the current problem).
>>>
>>> So I modified the process to submit all identical files and
>>> started to run it on a drive with bad fragmentation of small
>>> files due to earlier bug. This procedure terminated middle
>>> of night with following errors:
>>>
>>> [53011.819395] BTRFS error (device dm-1): bad tree block
>>> start, want 90184380416 have 5483876589805514108 [53011.819426]
>>> BTRFS: error (device dm-1) in __btrfs_free_extent:3080: errno=-5
>>> IO failure
>>
>> Extent tree already screwed up.
>>
>> This means your fs is already screwed up.
>
> This is what I assumed. It is just a funny coincidence, that
> this happened exactly using the disk the first time after a
> fsck to run massive deduplication on a filesystem with approximately
> each file existing in 20x copies, with at least one duplicated
> copy for each set of same files.
>
>>> [53011.819434] BTRFS info (device dm-1): forced readonly
>>> [53011.819441] BTRFS: error (device dm-1) in
>>> btrfs_run_delayed_refs:2188: errno=-5 IO failure [53624.948618]
>>> BTRFS error (device dm-1): parent transid verify failed on
>>> 67651596288 wanted 27538 found 27200 [53651.093717] BTRFS
>>> error (device dm-1): parent transid verify failed on 67651596288
>>> wanted 27538 found 27200 [53706.250680] BTRFS error (device
>>> dm-1): parent transid verify failed on 67651596288 wanted
>>> 27538 found 27200
>>
>> Transid error is not a good sign either.
>>
>>>
>>> Could this be related to the previous fix or is this more
>>> likely just a random hardware/software bug at a weird time?
>>
>> Not sure, but I don't think the reflink-beyond-eof fix is
>> related to the bug, as the main problem is the tree block
>> COW.
>>
>> Normally such transid corruption is related to extent tree
>> corruption, which matches the result from btrfs check.
>
> So it seems, but the reason is unclear. At least there were
> no hardware failures (SMART, ECC, processor temperature) reported,
> no bus transfer errors, so this might be a software fault.
>
> As the data on the disk is not really needed, the disk is currently
> cold without any need to bring it online soon, this would allow
> to use it for debugging the cause of corruption.
>
> If there is any interest in that, I could try to assist, e.g.
> creating VM with trunk kernel/tools for debugging and execute
> the tests. Otherwise I would just flatten the filesystem and
> initialize the disk anew.
>
>
>>> I also tried to apply the btrfs-check to that, but it fails
>>> opening the device, thus not attempting to repair it ...
>>>
>>> # btrfs check --progress --clear-space-cache v1
>>> --clear-space-cache v2 --super 1 [device] using SB copy 1,
>>> bytenr 67108864 Opening filesystem to check... checksum verify
>>> failed on 90184388608 found 00000000 wanted FFFFFF96 checksum
>>> verify failed on 90184388608 found 00000000 wanted FFFFFF96
>>> bad tree block 90184388608, bytenr mismatch, want=90184388608,
>>> have=15484548251864346855 ERROR: failed to read block groups:
>>> Input/output error ERROR: cannot open file system
>>>
>>> Any ideas?
>>>
>>>
>>> PS: I have a database of all file checksums on another drive
>>> for comparison of file content, if useful.
>>>
>>>
>>> Filipe Manana writes:
>>>> On Sat, Dec 14, 2019 at 5:46 AM Zygo Blaxell
>>>> <ce3g8jdj@umail.furryterror.org> wrote:
>>>>>
>>>>> On Fri, Dec 13, 2019 at 05:32:14PM +0800, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2019/12/13 =E4=B8=8A=E5=8D=8812:15, halfdog wrote:
>>>>>>> Hello list,
>>>>>>>
>>>>>>> Using btrfs on
>>>>>>>
>>>>>>> Linux version 5.3.0-2-amd64 (debian-kernel@lists.debian.org)
>>>>>>> (gcc ver=
>>>> sion 9.2.1 20191109 (Debian 9.2.1-19)) #1 SMP Debian 5.3.9-3
>>>> (2019-11-19)
>>>>>>>
>>>>>>> the FIDEDUPERANGE exposes weird behaviour on two identical
>>>>>>> but not too large files that seems to be depending on
>>>>>>> the file size. Before FIDEDUPERANGE both files have a
>>>>>>> single extent, afterwards first file is still single
>>>>>>> extent, second file has all bytes sharing with the extent
>>>>>>> of the first file except for the last 4096 bytes.
>>>>>>>
>>>>>>> Is there anything known about a bug fixed since the above
>>>>>>> mentioned kernel version?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> If no, does following reproducer still show the same
>>>>>>> behaviour on current Linux kernel (my Python test tools
>>>>>>> also attached)?
>>>>>>>
>>>>>>>> dd if=3D/dev/zero bs=3D1M count=3D32 of=3Ddisk mkfs.btrfs
>>>>>>>> --mixed --metadata single --data single --nodesize 4096
>>>>>>>> d=
>>>> isk
>>>>>>>> mount disk /mnt/test mkdir /mnt/test/x dd bs=3D1
>>>>>>>> count=3D155489 if=3D/dev/urandom of=3D/mnt/test/x/file-0
>>>>>>
>>>>>> 155489 is not sector size aligned, thus the last extent
>>>>>> will be padded with zero.
>>>>>>
>>>>>>>> cat /mnt/test/x/file-0 > /mnt/test/x/file-1
>>>>>>
>>>>>> Same for the new file.
>>>>>>
>>>>>> For the tailing padding part, it's not aligned, and it's
>>>>>> smaller than the inode size.
>>>>>>
>>>>>> Thus we won't dedupe that tailing part.
>>>>>
>>>>> We definitely *must* dedupe the tailing part on btrfs;
>>>>> otherwise, we can'=
>>>> t
>>>>> eliminate the reference to the last (partial) block in
>>>>> the last extent of the file, and there is no way to dedupe
>>>>> the _entire_ file in this example=
>>>> .
>>>>> It does pretty bad things to dedupe hit rates on uncompressed
>>>>> contiguous files, where you can lose an average of 64MB
>>>>> of space per file.
>>>>>
>>>>> I had been wondering why dedupe scores seemed so low on
>>>>> recent kernels, and this bug would certainly contribute
>>>>> to that.
>>>>>
>>>>> It worked in 4.20, broken in 5.0.  My guess is commit
>>>>> 34a28e3d77535efc7761aa8d67275c07d1fe2c58 ("Btrfs: use
>>>>> generic_remap_file_range_prep() for cloning and deduplication")
>>>>> but I haven't run a test to confirm.
>>>>
>>>> The problem comes from the generic (vfs) code, which always
>>>> rounds down the deduplication length to a multiple of the
>>>> filesystem's sector size. That should be done only when
>>>> the destination range's end does not match the destination's
>>>> file size, to avoid the corruption I found over a year ago
>>>> [1]. For some odd reason it has the correct behavior for
>>>> cloning, it only rounds down the destination range's end
>>>> is less then the destination's file size.
>>>>
>>>> I'll see if I get that fixed next week.
>>>>
>>>> [1]
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/comm=
>>>> it/?id=3Dde02b9f6bb65a6a1848f346f7a3617b7a9b930c0
>>>>
>>>>>
>>>>>
>>>>>> Thanks, Qu
>>>>>>
>>>>>>>
>>>>>>>> ./SimpleIndexer x > x.json ./IndexDeduplicationAnalyzer
>>>>>>>> --IndexFile /mnt/test/x.json /mnt/test/=
>>>> x > dedup.list
>>>>>>> Got dict: {b'/mnt/test/x/file-0': [(0, 5316608, 155648)],
>>>>>>> b'/mnt/test=
>>>> /x/file-1': [(0, 5472256, 155648)]}
>>>>>>> ...
>>>>>>>> strace -s256 -f btrfs-extent-same 155489 /mnt/test/x/file-0
>>>>>>>> 0 /mnt/t=
>>>> est/x/file-1 0 2>&1 | grep -E -e FIDEDUPERANGE
>>>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
>>>>>>> {src_offset=3D0=
>>>> , src_length=3D155489, dest_count=3D1, info=3D[{dest_fd=3D4,
>>>> dest_offset=3D= 0}]} =3D> {info=3D[{bytes_deduped=3D155489,
>>>> status=3D0}]}) =3D 0
>>>>>>>> ./IndexDeduplicationAnalyzer --IndexFile /mnt/test/x.json
>>>>>>>> /mnt/test/=
>>>> x > dedup.list
>>>>>>> Got dict: {b'/mnt/test/x/file-0': [(0, 5316608, 155648)],
>>>>>>> b'/mnt/test=
>>>> /x/file-1': [(0, 5316608, 151552), (151552, 5623808, 4096)]}
>>>>>>> ...
>>>>>>>> strace -s256 -f btrfs-extent-same 155489 /mnt/test/x/file-0
>>>>>>>> 0 /mnt/t=
>>>> est/x/file-1 0 2>&1 | grep -E -e FIDEDUPERANGE
>>>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
>>>>>>> {src_offset=3D0=
>>>> , src_length=3D155489, dest_count=3D1, info=3D[{dest_fd=3D4,
>>>> dest_offset=3D= 0}]} =3D> {info=3D[{bytes_deduped=3D155489,
>>>> status=3D0}]}) =3D 0
>>>>>>>> strace -s256 -f btrfs-extent-same 4096 /mnt/test/x/file-0
>>>>>>>> 151552 /mn=
>>>> t/test/x/file-1 151552 2>&1 | grep -E -e FIDEDUPERANGE
>>>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
>>>>>>> {src_offset=3D1=
>>>> 51552, src_length=3D4096, dest_count=3D1, info=3D[{dest_fd=3D4,
>>>> dest_offset= =3D151552}]}) =3D -1 EINVAL (Invalid argument)
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --=20 Filipe David Manana,
>>>>
>>>> =E2=80=9CWhether you think you can, or you think you can't
>>>> =E2=80=94 you're= right.=E2=80=9D
>>>
>>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-03-24  8:27     ` halfdog
@ 2020-03-25  3:53       ` Zygo Blaxell
  2020-03-26  9:53         ` halfdog
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-03-25  3:53 UTC (permalink / raw)
  To: halfdog; +Cc: linux-btrfs

On Tue, Mar 24, 2020 at 08:27:22AM +0000, halfdog wrote:
> Hello list,
> 
> It seems the woes really continued ... After trashing the old,
> corrupted filesystem (see old message below) I started rebuilding
> the storage. Synchronization from another (still working) storage
> roughly should have performed the same actions as during initial
> build (minus count and time of mounts/unmounts, transfer interrupts,
> ...).
> 
> It does not seem to be a mere coincidence, that the corruption
> occured when deduplicating the exact same file as last time.
> While corruption last time made disk completely inaccessible,
> this time it just was mounted readonly with a different error
> message:
> 
> [156603.177699] BTRFS error (device dm-1): parent transid verify failed on 6680428544 wanted 12947 found 12945
> [156603.177707] BTRFS: error (device dm-1) in __btrfs_free_extent:3080: errno=-5 IO failure
> [156603.177708] BTRFS info (device dm-1): forced readonly
> [156603.177711] BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188: errno=-5 IO failure

Normally those messages mean your hardware is dropping writes somewhere;
however, you previously reported running kernels 5.3.0 and 5.3.9, so
there may be another explanation.

Try kernel 4.19.x, 5.4.19, 5.5.3, or later.  Definitely do not use kernels
from 5.1-rc1 to 5.4.13 inclusive unless backported fixes are included.

I mention 5.5.3 and 5.4.19 instead of 5.5.0 and 5.4.14 because the
later ones include the EOF dedupe fix.  4.19 avoids the regressions of
later kernels.

> As it seems that the bug here is somehow reproducible, I would
> like to try to develop a reproducer exploit and fix for that
> bug as an excercise. Unfortunately the fault occurs only after
> transfering and deduplicating ~20TB of data.
> 
> Are there any recommendations e.g. how to "bisect" that problem?

Find someone who has already done it and ask.  ;)

Upgrade straight from 5.0.21 to 5.4.14 (or 5.4.19 if you want the dedupe
fix too).  Don't run any kernel in between for btrfs.

There was a bug introduced in 5.1-rc1, fixed in 5.4.14, which corrupts
metadata.  It's a UAF bug, so its behavior can be unpredictable, but quite
often the symptom is corrupted metadata or write-time tree-checker errors.
Sometimes you just get a harmless NULL dereference crash, or some noise
warnings.

There are at least two other filesystem corrupting bugs with lifetimes
overlapping that range of kernel versions; however both of those were
fixed by 5.3.

> Is there a way (switch or source code modification) to log all
> internal btrfs state transitions for later analysis? 

There are (e.g. the dm write logger), but most bugs that would be
found in unit tests by such tools have been fixed by the time a kernel
is released, and they'll only tell you that btrfs did something wrong,
not why.

Also, there can be tens of thousands of btrfs state transitions per
second during dedupe, so the volume of logs themselves can present data
wrangling challenges.

The more invasively you try to track internal btrfs state, the more the
tools become _part_ of that state, and introduce additional problems.
e.g. there is the ref verifier, and the _bug fix history_ of the ref
verifier...
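
For something lightweight, the existing btrfs tracepoints can be
tailed from tracefs; a minimal sketch (assuming tracefs is mounted
at /sys/kernel/tracing and the kernel has the btrfs trace events
compiled in):

    import os

    TRACEFS = "/sys/kernel/tracing"   # assumption: tracefs mount point

    def tail_btrfs_events():
        # Enable the whole btrfs trace event group and stream the records.
        with open(os.path.join(TRACEFS, "events/btrfs/enable"), "w") as f:
            f.write("1")
        with open(os.path.join(TRACEFS, "trace_pipe")) as pipe:
            for line in pipe:
                # One line per event; expect a very high rate during
                # dedupe, so filter or compress before storing anything.
                print(line, end="")

Even that only shows what btrfs did and in what order, not whether
it was correct.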

> Other ideas for debugging that?

Run dedupe on a test machine with a few TB test corpus (or whatever your
workload is) on a debug-enabled kernel, report every bug that kills the
box or hurts the data, update the kernel to get fixes for the bugs that
were reported.  Repeat until the box stops crapping itself, then use the
kernel it stopped on (5.4.14 in this case).  Do that for every kernel
upgrade because regressions are a thing.

> Just creating the same number of snapshots
> and putting just that single file into each of them did not trigger
> the bug during deduplication.

Dedupe itself is fine, but some of the supporting ioctls a deduper has to
use to get information about the filesystem structure triggered a lot of
bugs.

> Kind regards,
> hd
> 
> 
> 
> halfdog writes:
> > Qu Wenruo writes:
> >> On 2020/2/22 下午2:18, halfdog wrote:
> >>> Hello list,
> >>>
> >>> Thanks for the fix to the bug reported in "FIDEDUPERANGE
> >>> woes".
> >>>
> >>> I now got a kernel including that fix. I verified, that
> >>> deduplication now also worked on file ends. Deduplication
> >>> also picked up the unfragmented version of a file in cases,
> >>> where the second file was fragmented due to the bug having
> >>> a single tail block extent.
> >>>
> >>> Looking now at the statistics, I noticed that the "duperemove"
> >>> did not submit all small files to deduplication for unknown
> >>> reason (also not relevant for the current problem).
> >>>
> >>> So I modified the process to submit all identical files and
> >>> started to run it on a drive with bad fragmentation of small
> >>> files due to earlier bug. This procedure terminated middle
> >>> of night with following errors:
> >>>
> >>> [53011.819395] BTRFS error (device dm-1): bad tree block
> >>> start, want 90184380416 have 5483876589805514108 [53011.819426]
> >>> BTRFS: error (device dm-1) in __btrfs_free_extent:3080: errno=-5
> >>> IO failure
> >>
> >> Extent tree already screwed up.
> >>
> >> This means your fs is already screwed up.
> >
> > This is what I assumed. It is just a funny coincidence, that
> > this happened exactly using the disk the first time after a
> > fsck to run massive deduplication on a filesystem with approximately
> > each file existing in 20x copies, with at least one duplicated
> > copy for each set of same files.
> >
> >>> [53011.819434] BTRFS info (device dm-1): forced readonly
> >>> [53011.819441] BTRFS: error (device dm-1) in
> >>> btrfs_run_delayed_refs:2188: errno=-5 IO failure [53624.948618]
> >>> BTRFS error (device dm-1): parent transid verify failed on
> >>> 67651596288 wanted 27538 found 27200 [53651.093717] BTRFS
> >>> error (device dm-1): parent transid verify failed on 67651596288
> >>> wanted 27538 found 27200 [53706.250680] BTRFS error (device
> >>> dm-1): parent transid verify failed on 67651596288 wanted
> >>> 27538 found 27200
> >>
> >> Transid error is not a good sign either.
> >>
> >>>
> >>> Could this be related to the previous fix or is this more
> >>> likely just a random hardware/software bug at a weird time?
> >>
> >> Not sure, but I don't think the reflink-beyond-eof fix is
> >> related to the bug, as the main problem is the tree block
> >> COW.
> >>
> >> Normally such transid corruption is related to extent tree
> >> corruption, which matches the result from btrfs check.
> >
> > So it seems, but the reason is unclear. At least there were
> > no hardware failures (SMART, ECC, processor temperature) reported,
> > no bus transfer errors, so this might be a software fault.
> >
> > As the data on the disk is not really needed, the disk is currently
> > cold without any need to bring it online soon, this would allow
> > to use it for debugging the cause of corruption.
> >
> > If there is any interest in that, I could try to assist, e.g.
> > creating VM with trunk kernel/tools for debugging and execute
> > the tests. Otherwise I would just flatten the filesystem and
> > initialize the disk anew.
> >
> >
> >>> I also tried to apply the btrfs-check to that, but it fails
> >>> opening the device, thus not attempting to repair it ...
> >>>
> >>> # btrfs check --progress --clear-space-cache v1
> >>> --clear-space-cache v2 --super 1 [device] using SB copy 1,
> >>> bytenr 67108864 Opening filesystem to check... checksum verify
> >>> failed on 90184388608 found 00000000 wanted FFFFFF96 checksum
> >>> verify failed on 90184388608 found 00000000 wanted FFFFFF96
> >>> bad tree block 90184388608, bytenr mismatch, want=90184388608,
> >>> have=15484548251864346855 ERROR: failed to read block groups:
> >>> Input/output error ERROR: cannot open file system
> >>>
> >>> Any ideas?
> >>>
> >>>
> >>> PS: I have a database of all file checksums on another drive
> >>> for comparison of file content, if useful.
> >>>
> >>>
> >>> Filipe Manana writes:
> >>>> On Sat, Dec 14, 2019 at 5:46 AM Zygo Blaxell
> >>>> <ce3g8jdj@umail.furryterror.org> wrote:
> >>>>>
> >>>>> On Fri, Dec 13, 2019 at 05:32:14PM +0800, Qu Wenruo wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 2019/12/13 =E4=B8=8A=E5=8D=8812:15, halfdog wrote:
> >>>>>>> Hello list,
> >>>>>>>
> >>>>>>> Using btrfs on
> >>>>>>>
> >>>>>>> Linux version 5.3.0-2-amd64 (debian-kernel@lists.debian.org)
> >>>>>>> (gcc ver=
> >>>> sion 9.2.1 20191109 (Debian 9.2.1-19)) #1 SMP Debian 5.3.9-3
> >>>> (2019-11-19)
> >>>>>>>
> >>>>>>> the FIDEDUPERANGE exposes weird behaviour on two identical
> >>>>>>> but not too large files that seems to be depending on
> >>>>>>> the file size. Before FIDEDUPERANGE both files have a
> >>>>>>> single extent, afterwards first file is still single
> >>>>>>> extent, second file has all bytes sharing with the extent
> >>>>>>> of the first file except for the last 4096 bytes.
> >>>>>>>
> >>>>>>> Is there anything known about a bug fixed since the above
> >>>>>>> mentioned kernel version?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> If no, does following reproducer still show the same
> >>>>>>> behaviour on current Linux kernel (my Python test tools
> >>>>>>> also attached)?
> >>>>>>>
> >>>>>>>> dd if=3D/dev/zero bs=3D1M count=3D32 of=3Ddisk mkfs.btrfs
> >>>>>>>> --mixed --metadata single --data single --nodesize 4096
> >>>>>>>> d=
> >>>> isk
> >>>>>>>> mount disk /mnt/test mkdir /mnt/test/x dd bs=3D1
> >>>>>>>> count=3D155489 if=3D/dev/urandom of=3D/mnt/test/x/file-0
> >>>>>>
> >>>>>> 155489 is not sector size aligned, thus the last extent
> >>>>>> will be padded with zero.
> >>>>>>
> >>>>>>>> cat /mnt/test/x/file-0 > /mnt/test/x/file-1
> >>>>>>
> >>>>>> Same for the new file.
> >>>>>>
> >>>>>> For the tailing padding part, it's not aligned, and it's
> >>>>>> smaller than the inode size.
> >>>>>>
> >>>>>> Thus we won't dedupe that tailing part.
> >>>>>
> >>>>> We definitely *must* dedupe the tailing part on btrfs;
> >>>>> otherwise, we can'=
> >>>> t
> >>>>> eliminate the reference to the last (partial) block in
> >>>>> the last extent of the file, and there is no way to dedupe
> >>>>> the _entire_ file in this example=
> >>>> .
> >>>>> It does pretty bad things to dedupe hit rates on uncompressed
> >>>>> contiguous files, where you can lose an average of 64MB
> >>>>> of space per file.
> >>>>>
> >>>>> I had been wondering why dedupe scores seemed so low on
> >>>>> recent kernels, and this bug would certainly contribute
> >>>>> to that.
> >>>>>
> >>>>> It worked in 4.20, broken in 5.0.  My guess is commit
> >>>>> 34a28e3d77535efc7761aa8d67275c07d1fe2c58 ("Btrfs: use
> >>>>> generic_remap_file_range_prep() for cloning and deduplication")
> >>>>> but I haven't run a test to confirm.
> >>>>
> >>>> The problem comes from the generic (vfs) code, which always
> >>>> rounds down the deduplication length to a multiple of the
> >>>> filesystem's sector size. That should be done only when
> >>>> the destination range's end does not match the destination's
> >>>> file size, to avoid the corruption I found over a year ago
> >>>> [1]. For some odd reason it has the correct behavior for
> >>>> cloning, it only rounds down the destination range's end
> >>>> is less then the destination's file size.
> >>>>
> >>>> I'll see if I get that fixed next week.
> >>>>
> >>>> [1]
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/comm=
> >>>> it/?id=3Dde02b9f6bb65a6a1848f346f7a3617b7a9b930c0
> >>>>
> >>>>>
> >>>>>
> >>>>>> Thanks, Qu
> >>>>>>
> >>>>>>>
> >>>>>>>> ./SimpleIndexer x > x.json ./IndexDeduplicationAnalyzer
> >>>>>>>> --IndexFile /mnt/test/x.json /mnt/test/=
> >>>> x > dedup.list
> >>>>>>> Got dict: {b'/mnt/test/x/file-0': [(0, 5316608, 155648)],
> >>>>>>> b'/mnt/test=
> >>>> /x/file-1': [(0, 5472256, 155648)]}
> >>>>>>> ...
> >>>>>>>> strace -s256 -f btrfs-extent-same 155489 /mnt/test/x/file-0
> >>>>>>>> 0 /mnt/t=
> >>>> est/x/file-1 0 2>&1 | grep -E -e FIDEDUPERANGE
> >>>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
> >>>>>>> {src_offset=3D0=
> >>>> , src_length=3D155489, dest_count=3D1, info=3D[{dest_fd=3D4,
> >>>> dest_offset=3D= 0}]} =3D> {info=3D[{bytes_deduped=3D155489,
> >>>> status=3D0}]}) =3D 0
> >>>>>>>> ./IndexDeduplicationAnalyzer --IndexFile /mnt/test/x.json
> >>>>>>>> /mnt/test/=
> >>>> x > dedup.list
> >>>>>>> Got dict: {b'/mnt/test/x/file-0': [(0, 5316608, 155648)],
> >>>>>>> b'/mnt/test=
> >>>> /x/file-1': [(0, 5316608, 151552), (151552, 5623808, 4096)]}
> >>>>>>> ...
> >>>>>>>> strace -s256 -f btrfs-extent-same 155489 /mnt/test/x/file-0
> >>>>>>>> 0 /mnt/t=
> >>>> est/x/file-1 0 2>&1 | grep -E -e FIDEDUPERANGE
> >>>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
> >>>>>>> {src_offset=3D0=
> >>>> , src_length=3D155489, dest_count=3D1, info=3D[{dest_fd=3D4,
> >>>> dest_offset=3D= 0}]} =3D> {info=3D[{bytes_deduped=3D155489,
> >>>> status=3D0}]}) =3D 0
> >>>>>>>> strace -s256 -f btrfs-extent-same 4096 /mnt/test/x/file-0
> >>>>>>>> 151552 /mn=
> >>>> t/test/x/file-1 151552 2>&1 | grep -E -e FIDEDUPERANGE
> >>>>>>> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
> >>>>>>> {src_offset=3D1=
> >>>> 51552, src_length=3D4096, dest_count=3D1, info=3D[{dest_fd=3D4,
> >>>> dest_offset= =3D151552}]}) =3D -1 EINVAL (Invalid argument)
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --=20 Filipe David Manana,
> >>>>
> >>>> =E2=80=9CWhether you think you can, or you think you can't
> >>>> =E2=80=94 you're= right.=E2=80=9D
> >>>
> >>
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-03-25  3:53       ` Zygo Blaxell
@ 2020-03-26  9:53         ` halfdog
  2020-03-26 12:12           ` Qu Wenruo
  2020-03-26 13:23           ` Zygo Blaxell
  0 siblings, 2 replies; 13+ messages in thread
From: halfdog @ 2020-03-26  9:53 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

Thanks for your lengthy reply!

Zygo Blaxell writes:
> On Tue, Mar 24, 2020 at 08:27:22AM +0000, halfdog wrote:
>> Hello list,
>>
>> It seems the woes really continued ... After trashing the
>> old, corrupted filesystem (see old message below) I started
>> rebuilding the storage. Synchronization from another (still
>> working) storage roughly should have performed the same actions
>> as during initial build (minus count and time of mounts/unmounts,
>> transfer interrupts, ...).
>>
>> It does not seem to be a mere coincidence, that the corruption
>> occured when deduplicating the exact same file as last time.
>> While corruption last time made disk completely inaccessible,
>> this time it just was mounted readonly with a different error
>> message:
>>
>> [156603.177699] BTRFS error (device dm-1): parent transid
>> verify failed on 6680428544 wanted 12947 found 12945
>> [156603.177707] BTRFS: error (device dm-1) in
>> __btrfs_free_extent:3080: errno=-5 IO failure [156603.177708]
>> BTRFS info (device dm-1): forced readonly [156603.177711]
>> BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188:
>> errno=-5 IO failure
>
> Normally those messages mean your hardware is dropping writes
> somewhere; however, you previously reported running kernels
> 5.3.0 and 5.3.9, so there may be another explanation.
>
> Try kernel 4.19.x, 5.4.19, 5.5.3, or later.  Definitely do
> not use kernels from 5.1-rc1 to 5.4.13 inclusive unless backported
> fixes are included.

Sorry, I forgot to update on that: I used the old kernel at first,
but also managed to reproduce it on
ii  linux-image-5.4.0-4-amd64            5.4.19-1                            amd64        Linux 5.4 for 64-bit PCs (signed)
Linux version 5.4.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 9.2.1 20200203 (Debian 9.2.1-28)) #1 SMP Debian 5.4.19-1 (2020-02-13)

> I mention 5.5.3 and 5.4.19 instead of 5.5.0 and 5.4.14 because
> the later ones include the EOF dedupe fix.  4.19 avoids the
> regressions of later kernels.

5.4.19-1 matches your spec, but as the latest Debian experimental
is "linux-signed-amd64 (5.5~rc5+1~exp1)", which is also above your
5.5.3 recommendation, should I try again with that kernel, or even
take the "5.5~rc5+1~exp1" config, apply it to yesterday's 5.5.13
and build my own kernel?

>> As it seems that the bug here is somehow reproducible, I would
>> like to try to develop a reproducer exploit and fix for that
>> bug as an excercise. Unfortunately the fault occurs only after
>> transfering and deduplicating ~20TB of data.
>>
>> Are there any recommendations e.g. how to "bisect" that problem?
>
> Find someone who has already done it and ask.  ;)

Seems I found someone with good recommendations already :)

Thank you!

> Upgrade straight from 5.0.21 to 5.4.14 (or 5.4.19 if you want
> the dedupe fix too).  Don't run any kernel in between for btrfs.
>
> There was a bug introduced in 5.1-rc1, fixed in 5.4.14, which
> corrupts metadata.  It's a UAF bug, so its behavior can be
> unpredictable, but quite often the symptom is corrupted metadata
> or write-time tree-checker errors. Sometimes you just get a
> harmless NULL dereference crash, or some noise warnings.
>
> There are at least two other filesystem corrupting bugs with
> lifetimes overlapping that range of kernel versions; however
> both of those were fixed by 5.3.

So maybe moving from my 5.4.19-1 to the 5.5+ series is recommended
anyway?

>> Is there a way (switch or source code modification) to log
>> all internal btrfs state transitions for later analysis?
>
> There are (e.g. the dm write logger), but most bugs that would
> be found in unit tests by such tools have been fixed by the
> time a kernel is released, and they'll only tell you that btrfs
> did something wrong, not why.

As the IO seems sane and the reported error "verify failed on
6680428544 wanted 12947 found 12945" does not seem to point to a
data structure problem at a sector/page/block boundary
(12947 == 0x3293), I would also guess that basic IO/paging is not
involved, but that the data structure is corrupted in memory and
then used directly or written and reread ... therefore I would not
consider write logs the first way to go.

> Also, there can be tens of thousands of btrfs state transitions
> per second during dedupe, so the volume of logs themselves
> can present data wrangling challenges.

Yes, that's why I'm asking. Maybe someone has already taken up that
challenge, as such a tool chain (generic transaction logging with
userspace stream compression and analysis) would be quite handy for
this kind of task, but a hell of an effort to build ...

> The more invasively you try to track internal btrfs state,
> the more the tools become _part_ of that state, and introduce
> additional problems. e.g. there is the ref verifier, and the
> _bug fix history_ of the ref verifier...

That is right. Therefore I hoped that some minimally invasive
toolset might already be available for the kernel, or could be
written, e.g.:

* Install an alternative kernel page fault handler
* Set breakpoints on btrfs functions
  * When entering the function, record the return address, stack
    and register arguments and send them to userspace
  * Strip the write bits from the kernel page table for most pages
    except those needed by the page fault handler
  * Continue execution
* For each page fault, flip back to the original page table, send
  information about the write fault (what, where) to userspace,
  perform the faulted instruction, then switch back to the
  read-only page table and continue the btrfs function
* When returning from the last btrfs function, also switch back
  to the standard page table

Being completely btrfs-agnostic, such a tool should not introduce
any btrfs-specific issues through the analysis process itself. Does
anyone know of such a tool, or a simplified version of it?
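
The closest simplified version I can think of would be the kernel's
existing kprobe events: a minimal sketch (assuming tracefs is
mounted at /sys/kernel/tracing; the event group name is mine) that
logs every entry into one btrfs function:

    import os

    TRACEFS = "/sys/kernel/tracing"  # assumption: tracefs mount point

    def probe_btrfs_function(func="__btrfs_free_extent"):
        # Register a kprobe on entry of the given function; register or
        # stack values could be appended to the probe definition later.
        with open(os.path.join(TRACEFS, "kprobe_events"), "a") as f:
            f.write("p:btrfsdbg/%s %s\n" % (func, func))
        with open(os.path.join(TRACEFS, "events/btrfsdbg/enable"), "w") as f:
            f.write("1")
        # Stream the hits; this will be very chatty during deduplication.
        with open(os.path.join(TRACEFS, "trace_pipe")) as pipe:
            for line in pipe:
                print(line, end="")

That is of course far from full write-protection tracking, but it
is btrfs-agnostic and cheap to try.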

Doing something similar via qemu/kernel debugging tools might be
easier to implement, but probably too slow to handle that huge
amount of data.

>> Other ideas for debugging that?
>
> Run dedupe on a test machine with a few TB test corpus (or
> whatever your workload is) on a debug-enabled kernel, report
> every bug that kills the box or hurts the data, update the
> kernel to get fixes for the bugs that were reported.  Repeat
> until the box stops crapping itself, then use the kernel it
> stopped on (5.4.14 in this case).  Do that for every kernel
> upgrade because regressions are a thing.

Well, that seems like overkill. My btrfs is not haunted by a load
of bugs, just by one that corrupted the filesystem twice when trying
to deduplicate the same set of files.

As described, just creating a btrfs with only that file did not
trigger the corruption. If this is not a super-rare coincidence,
then something in the other 20TB of transferred files must have
corrupted the filesystem, or at least brought it into a state where
deduplication of exactly that problematic set of files then
triggered the final fault.

>> Just creating the same number of snapshots and putting just
>> that single file into each of them did not trigger the bug
>> during deduplication.
>
> Dedupe itself is fine, but some of the supporting ioctls a
> deduper has to use to get information about the filesystem
> structure triggered a lot of bugs.

To get rid of that, I already ripped out quite a bit of the userspace
deduping part. I now do the extent queries in a Python tool
using ctypes, split the dedup request into smaller chunks (to
improve logging granularity) and just use the deduper to do
that single FIDEDUPERANGE call (I was too lazy to ctype that
in Python too).
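
For what it is worth, issuing FIDEDUPERANGE directly from Python
needs very little code. A minimal sketch using struct/fcntl instead
of ctypes; the ioctl number is the x86-64 value of
_IOWR(0x94, 54, struct file_dedupe_range) from linux/fs.h and may
differ on other ABIs, and all error handling is omitted:

import fcntl
import struct

FIDEDUPERANGE = 0xC0189436  # assumed x86-64 value, see linux/fs.h

def dedupe_one(src_fd, src_off, length, dst_fd, dst_off):
    # struct file_dedupe_range header followed by one file_dedupe_range_info;
    # the destination fd should be open read-write (or the caller be owner/root)
    hdr = struct.pack("=QQHHI", src_off, length, 1, 0, 0)
    info = struct.pack("=qQQiI", dst_fd, dst_off, 0, 0, 0)
    buf = bytearray(hdr + info)
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)
    # the kernel fills in bytes_deduped and status of the info entry
    bytes_deduped, status = struct.unpack_from("=Qi", buf, len(hdr) + 16)
    return bytes_deduped, status  # status 0: deduped, 1: data differs, <0: -errno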

Still deduplicating the same files caused corruption again.

hd

> ...


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-03-26  9:53         ` halfdog
@ 2020-03-26 12:12           ` Qu Wenruo
  2020-03-30  8:10             ` halfdog
  2020-03-26 13:23           ` Zygo Blaxell
  1 sibling, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2020-03-26 12:12 UTC (permalink / raw)
  To: halfdog, Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 8552 bytes --]



On 2020/3/26 下午5:53, halfdog wrote:
> Thanks for your lengthy reply!
> 
> Zygo Blaxell writes:
>> On Tue, Mar 24, 2020 at 08:27:22AM +0000, halfdog wrote:
>>> Hello list,
>>>
>>> It seems the woes really continued ... After trashing the
>>> old, corrupted filesystem (see old message below) I started
>>> rebuilding the storage. Synchronization from another (still
>>> working) storage roughly should have performed the same actions
>>> as during initial build (minus count and time of mounts/unmounts,
>>> transfer interrupts, ...).
>>>
>>> It does not seem to be a mere coincidence, that the corruption
>>> occured when deduplicating the exact same file as last time.
>>> While corruption last time made disk completely inaccessible,
>>> this time it just was mounted readonly with a different error
>>> message:
>>>
>>> [156603.177699] BTRFS error (device dm-1): parent transid
>>> verify failed on 6680428544 wanted 12947 found 12945
>>> [156603.177707] BTRFS: error (device dm-1) in
>>> __btrfs_free_extent:3080: errno=-5 IO failure [156603.177708]
>>> BTRFS info (device dm-1): forced readonly [156603.177711]
>>> BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188:
>>> errno=-5 IO failure
>>
>> Normally those messages mean your hardware is dropping writes
>> somewhere; however, you previously reported running kernels
>> 5.3.0 and 5.3.9, so there may be another explanation.
>>
>> Try kernel 4.19.x, 5.4.19, 5.5.3, or later.  Definitely do
>> not use kernels from 5.1-rc1 to 5.4.13 inclusive unless backported
>> fixes are included.
> 
> Sorry, I forgot to update on that: I used the old kernel but also
> managed t reproduce on
> ii  linux-image-5.4.0-4-amd64            5.4.19-1                            amd64        Linux 5.4 for 64-bit PCs (signed)
> Linux version 5.4.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 9.2.1 20200203 (Debian 9.2.1-28)) #1 SMP Debian 5.4.19-1 (2020-02-13)
> 
>> I mention 5.5.3 and 5.4.19 instead of 5.5.0 and 5.4.14 because
>> the later ones include the EOF dedupe fix.  4.19 avoids the
>> regressions of later kernels.
> 
> 5.4.19-1 matches your spec, but as latest Debian experimental
> is "linux-signed-amd64 (5.5~rc5+1~exp1)", which is also above
> your  5.5.3 recommendation, should I try again with that kernel
> or even use the "5.5~rc5+1~exp1" config to apply it to yesterays
> 5.5.13 LTS and build an own kernel?

Kernel version aside, would you like to mention any other history
of the fs?

Especially, did the system always shut down/reboot cleanly?

And furthermore, what's the storage stack below btrfs?
(Things like bcache, lvm, dmraid)

Also, what is the specific storage hardware (e.g. a SATA/SAS HDD with
its model name, and the raid card if involved)?

Have you experienced the same problem on other systems?

Thanks,
Qu

> 
>>> As it seems that the bug here is somehow reproducible, I would
>>> like to try to develop a reproducer exploit and fix for that
>>> bug as an excercise. Unfortunately the fault occurs only after
>>> transfering and deduplicating ~20TB of data.
>>>
>>> Are there any recommendations e.g. how to "bisect" that problem?
>>
>> Find someone who has already done it and ask.  ;)
> 
> Seems I found someone with good recommendations already :)
> 
> Thank you!
> 
>> Upgrade straight from 5.0.21 to 5.4.14 (or 5.4.19 if you want
>> the dedupe fix too).  Don't run any kernel in between for btrfs.
>>
>> There was a bug introduced in 5.1-rc1, fixed in 5.4.14, which
>> corrupts metadata.  It's a UAF bug, so its behavior can be
>> unpredictable, but quite often the symptom is corrupted metadata
>> or write-time tree-checker errors. Sometimes you just get a
>> harmless NULL dereference crash, or some noise warnings.
>>
>> There are at least two other filesystem corrupting bugs with
>> lifetimes overlapping that range of kernel versions; however
>> both of those were fixed by 5.3.
> 
> So maybe leaving my 5.4.19-1 to the 5.5+ series sounds like recommended
> anyway?
> 
>>> Is there a way (switch or source code modification) to log
>>> all internal btrfs state transitions for later analysis?
>>
>> There are (e.g. the dm write logger), but most bugs that would
>> be found in unit tests by such tools have been fixed by the
>> time a kernel is released, and they'll only tell you that btrfs
>> did something wrong, not why.
> 
> As IO seems sane, the error reported "verify failed on 6680428544
> wanted 12947 found 12945" seems not to point to a data structure
> problem at a sector/page/block boundary (12947==0x3293), I would
> also guess, that basic IO/paging is not involved in it, but that
> the data structure is corrupted in memory and used directly or
> written/reread ... therefore I would deem write logs also as
> not the first way to go ..
> 
>> Also, there can be tens of thousands of btrfs state transitions
>> per second during dedupe, so the volume of logs themselves
>> can present data wrangling challenges.
> 
> Yes, that's why me asking. Maybe someone has already taken up
> that challenge as such a tool-chain (generic transaction logging
> with userspace stream compression, analysis) might be quite
> handy for such task, but hell effort to build ...
> 
>> The more invasively you try to track internal btrfs state,
>> the more the tools become _part_ of that state, and introduce
>> additional problems. e.g. there is the ref verifier, and the
>> _bug fix history_ of the ref verifier...
> 
> That is right. Therefore I hoped, that some minimal invasive
> toolsets might be available already for kernel or maybe could
> be written, e.g.
> 
> * Install an alternative kernel page fault handler
> * Set breakpoints on btrfs functions
>   * When entering the function, record return address, stack
>     and register arguments, send to userspace
>   * Strip write bits kernel from page table for most pages
>     exept those needed by page fault handler
>   * Continue execution
> * For each pagefault, the pagefault flips back to original
>   page table, sends information about write fault (what, where)
>   to userspace, performs the faulted instruction before switching
>   back to read-only page table and continuing btrfs function
> * When returning from the last btrfs function, also switch back
>   to standard page table.
> 
> By being completely btrfs-agnostic, such tool should not introduce
> any btrfs-specific issues due to the analysis process. Does someone
> know about such a tool or a simplified version of it?
> 
> Doing similar over qemu/kernel debugging tools might be easier
> to implement but too slow to handle that huge amount of data.
> 
>>> Other ideas for debugging that?
>>
>> Run dedupe on a test machine with a few TB test corpus (or
>> whatever your workload is) on a debug-enabled kernel, report
>> every bug that kills the box or hurts the data, update the
>> kernel to get fixes for the bugs that were reported.  Repeat
>> until the box stops crapping itself, then use the kernel it
>> stopped on (5.4.14 in this case).  Do that for every kernel
>> upgrade because regressions are a thing.
> 
> Well, that seems like overkill. My btrfs is not haunted by a
> load of bugs, just one that corrupted the filesystem two times
> when trying to deduplicate the same set of files.
> 
> As desccribed, just creating a btrfs with only that file did
> not trigger the corruption. If this is not a super-rare coincidence,
> then something in the other 20TB of transferred files has to
> have corrupted the file system or at least brought it to a state,
> where then deduplication of exact that problematic set of files
> triggered the final fault.
> 
>>> Just creating the same number of snapshots and putting just
>>> that single file into each of them did not trigger the bug
>>> during deduplication.
>>
>> Dedupe itself is fine, but some of the supporting ioctls a
>> deduper has to use to get information about the filesystem
>> structure triggered a lot of bugs.
> 
> To get rid of that, I already ripped out quite some of the userspace
> deduping part. I now do the extent queries in a Python tool
> using ctypes, split the dedup request into smaller chunks (to
> improve logging granularity) and just use the deduper to do
> that single FIDEDUPERANGE call (I was to lazy to ctype that
> in Python too).
> 
> Still deduplicating the same files caused corruption again.
> 
> hd
> 
>> ...
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-03-26  9:53         ` halfdog
  2020-03-26 12:12           ` Qu Wenruo
@ 2020-03-26 13:23           ` Zygo Blaxell
  2020-03-30  8:37             ` halfdog
  1 sibling, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-03-26 13:23 UTC (permalink / raw)
  To: halfdog; +Cc: linux-btrfs

On Thu, Mar 26, 2020 at 09:53:08AM +0000, halfdog wrote:
> Thanks for your lengthy reply!
> 
> Zygo Blaxell writes:
> > On Tue, Mar 24, 2020 at 08:27:22AM +0000, halfdog wrote:
> >> Hello list,
> >>
> >> It seems the woes really continued ... After trashing the
> >> old, corrupted filesystem (see old message below) I started
> >> rebuilding the storage. Synchronization from another (still
> >> working) storage roughly should have performed the same actions
> >> as during initial build (minus count and time of mounts/unmounts,
> >> transfer interrupts, ...).
> >>
> >> It does not seem to be a mere coincidence, that the corruption
> >> occured when deduplicating the exact same file as last time.
> >> While corruption last time made disk completely inaccessible,
> >> this time it just was mounted readonly with a different error
> >> message:
> >>
> >> [156603.177699] BTRFS error (device dm-1): parent transid
> >> verify failed on 6680428544 wanted 12947 found 12945
> >> [156603.177707] BTRFS: error (device dm-1) in
> >> __btrfs_free_extent:3080: errno=-5 IO failure [156603.177708]
> >> BTRFS info (device dm-1): forced readonly [156603.177711]
> >> BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188:
> >> errno=-5 IO failure
> >
> > Normally those messages mean your hardware is dropping writes
> > somewhere; however, you previously reported running kernels
> > 5.3.0 and 5.3.9, so there may be another explanation.
> >
> > Try kernel 4.19.x, 5.4.19, 5.5.3, or later.  Definitely do
> > not use kernels from 5.1-rc1 to 5.4.13 inclusive unless backported
> > fixes are included.
> 
> Sorry, I forgot to update on that: I used the old kernel but also
> managed t reproduce on
> ii  linux-image-5.4.0-4-amd64            5.4.19-1                            amd64        Linux 5.4 for 64-bit PCs (signed)
> Linux version 5.4.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 9.2.1 20200203 (Debian 9.2.1-28)) #1 SMP Debian 5.4.19-1 (2020-02-13)
> 
> > I mention 5.5.3 and 5.4.19 instead of 5.5.0 and 5.4.14 because
> > the later ones include the EOF dedupe fix.  4.19 avoids the
> > regressions of later kernels.
> 
> 5.4.19-1 matches your spec, but as latest Debian experimental
> is "linux-signed-amd64 (5.5~rc5+1~exp1)", which is also above
> your  5.5.3 recommendation, should I try again with that kernel
> or even use the "5.5~rc5+1~exp1" config to apply it to yesterays
> 5.5.13 LTS and build an own kernel?

I would try a mainline kernel just to make sure Debian didn't backport
something they shouldn't have.

> >> As it seems that the bug here is somehow reproducible, I would
> >> like to try to develop a reproducer exploit and fix for that
> >> bug as an excercise. Unfortunately the fault occurs only after
> >> transfering and deduplicating ~20TB of data.
> >>
> >> Are there any recommendations e.g. how to "bisect" that problem?
> >
> > Find someone who has already done it and ask.  ;)
> 
> Seems I found someone with good recommendations already :)
> 
> Thank you!
> 
> > Upgrade straight from 5.0.21 to 5.4.14 (or 5.4.19 if you want
> > the dedupe fix too).  Don't run any kernel in between for btrfs.
> >
> > There was a bug introduced in 5.1-rc1, fixed in 5.4.14, which
> > corrupts metadata.  It's a UAF bug, so its behavior can be
> > unpredictable, but quite often the symptom is corrupted metadata
> > or write-time tree-checker errors. Sometimes you just get a
> > harmless NULL dereference crash, or some noise warnings.
> >
> > There are at least two other filesystem corrupting bugs with
> > lifetimes overlapping that range of kernel versions; however
> > both of those were fixed by 5.3.
> 
> So maybe leaving my 5.4.19-1 to the 5.5+ series sounds like recommended
> anyway?

I've logged an order of magnitude more testing hours on 5.4.x than
on 5.5.x, so I can't recommend one over the other due to lack of
symmetrical data.

I'd really like to know if you can reproduce this on 4.19 or 5.0.21.
Those kernels predate the 5.1.x reloc_root lifespan change that we
are still debugging to this day (in addition to metadata corruption,
that change also seems to cause the recent looping balance issue and
possibly other issues that haven't been identified yet).

> >> Is there a way (switch or source code modification) to log
> >> all internal btrfs state transitions for later analysis?
> >
> > There are (e.g. the dm write logger), but most bugs that would
> > be found in unit tests by such tools have been fixed by the
> > time a kernel is released, and they'll only tell you that btrfs
> > did something wrong, not why.
> 
> As IO seems sane, the error reported "verify failed on 6680428544
> wanted 12947 found 12945" seems not to point to a data structure
> problem at a sector/page/block boundary (12947==0x3293), I would
> also guess, that basic IO/paging is not involved in it, but that
> the data structure is corrupted in memory and used directly or
> written/reread ... therefore I would deem write logs also as
> not the first way to go ..

transids are _timestamps_ not _positions_.  The message is saying while
walking the tree, we were expecting to read a block that was written at
time 12947 (the reference item's timestamp), but we found one that was
written at an earlier time 12945 (the referred block's timestamp) instead.

This check is normally used to detect firmware bugs where the drive said
sectors were written (so we wrote the referring block) but the sectors
were not in fact written (so the referenced block contains old data with
a good csum and some other check fields).

transid verify also detects split-brain raid1 failures:  drive A completed
write 12945 but drive B did not, then later we read drive B and due to
the transid verify check, we can know that drive B is wrong and copy
replacement data from drive A to correct drive B.

You can also get transid verify from host memory corruption, either
due to RAM failure or kernel bug.  RAM failure is not usually so
repeatable, but a kernel bug could be.

> > Also, there can be tens of thousands of btrfs state transitions
> > per second during dedupe, so the volume of logs themselves
> > can present data wrangling challenges.
> 
> Yes, that's why me asking. Maybe someone has already taken up
> that challenge as such a tool-chain (generic transaction logging
> with userspace stream compression, analysis) might be quite
> handy for such task, but hell effort to build ...
> 
> > The more invasively you try to track internal btrfs state,
> > the more the tools become _part_ of that state, and introduce
> > additional problems. e.g. there is the ref verifier, and the
> > _bug fix history_ of the ref verifier...
> 
> That is right. Therefore I hoped, that some minimal invasive
> toolsets might be available already for kernel or maybe could
> be written, e.g.
> 
> * Install an alternative kernel page fault handler
> * Set breakpoints on btrfs functions
>   * When entering the function, record return address, stack
>     and register arguments, send to userspace
>   * Strip write bits kernel from page table for most pages
>     exept those needed by page fault handler
>   * Continue execution
> * For each pagefault, the pagefault flips back to original
>   page table, sends information about write fault (what, where)
>   to userspace, performs the faulted instruction before switching
>   back to read-only page table and continuing btrfs function
> * When returning from the last btrfs function, also switch back
>   to standard page table.
> 
> By being completely btrfs-agnostic, such tool should not introduce
> any btrfs-specific issues due to the analysis process. Does someone
> know about such a tool or a simplified version of it?
> 
> Doing similar over qemu/kernel debugging tools might be easier
> to implement but too slow to handle that huge amount of data.
> 
> >> Other ideas for debugging that?
> >
> > Run dedupe on a test machine with a few TB test corpus (or
> > whatever your workload is) on a debug-enabled kernel, report
> > every bug that kills the box or hurts the data, update the
> > kernel to get fixes for the bugs that were reported.  Repeat
> > until the box stops crapping itself, then use the kernel it
> > stopped on (5.4.14 in this case).  Do that for every kernel
> > upgrade because regressions are a thing.
> 
> Well, that seems like overkill. My btrfs is not haunted by a
> load of bugs, just one that corrupted the filesystem two times
> when trying to deduplicate the same set of files.
> 
> As desccribed, just creating a btrfs with only that file did
> not trigger the corruption. If this is not a super-rare coincidence,
> then something in the other 20TB of transferred files has to
> have corrupted the file system or at least brought it to a state,
> where then deduplication of exact that problematic set of files
> triggered the final fault.

The above is a description of my standard test setup, which in turn
is a more compact version of production workloads.  I haven't seen
a 5.4.x filesystem splat since 5.4.14, but OTOH that's only 2 months.

You could be finding something new here, but we have to make sure there's
nothing wrong with your test setup.  Running a test with an older kernel
would help with that.

> >> Just creating the same number of snapshots and putting just
> >> that single file into each of them did not trigger the bug
> >> during deduplication.
> >
> > Dedupe itself is fine, but some of the supporting ioctls a
> > deduper has to use to get information about the filesystem
> > structure triggered a lot of bugs.
> 
> To get rid of that, I already ripped out quite some of the userspace
> deduping part. I now do the extent queries in a Python tool
> using ctypes, split the dedup request into smaller chunks (to
> improve logging granularity) and just use the deduper to do
> that single FIDEDUPERANGE call (I was to lazy to ctype that
> in Python too).
> 
> Still deduplicating the same files caused corruption again.
> 
> hd
> 
> > ...
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-03-26 12:12           ` Qu Wenruo
@ 2020-03-30  8:10             ` halfdog
  0 siblings, 0 replies; 13+ messages in thread
From: halfdog @ 2020-03-30  8:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Qu Wenruo writes:
> On 2020/3/26 下午5:53, halfdog wrote:
>> Thanks for your lengthy reply!
>>
>> Zygo Blaxell writes:
>>> On Tue, Mar 24, 2020 at 08:27:22AM +0000, halfdog wrote:
>>>> Hello list,
>>>>
>>>> It seems the woes really continued ... After trashing the
>>>> old, corrupted filesystem (see old message below) I started
>>>> rebuilding the storage. Synchronization from another (still
>>>> working) storage roughly should have performed the same
>>>> actions as during initial build (minus count and time of
>>>> mounts/unmounts, transfer interrupts, ...).
>>>>
>>>> It does not seem to be a mere coincidence, that the corruption
>>>> occured when deduplicating the exact same file as last time.
>>>> While corruption last time made disk completely inaccessible,
>>>> this time it just was mounted readonly with a different
>>>> error message:
>>>>
>>>> [156603.177699] BTRFS error (device dm-1): parent transid
>>>> verify failed on 6680428544 wanted 12947 found 12945
>>>> [156603.177707] BTRFS: error (device dm-1) in
>>>> __btrfs_free_extent:3080: errno=-5 IO failure [156603.177708]
>>>> BTRFS info (device dm-1): forced readonly [156603.177711]
>>>> BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188:
>>>> errno=-5 IO failure
>>>
>>> Normally those messages mean your hardware is dropping writes
>>> somewhere; however, you previously reported running kernels
>>> 5.3.0 and 5.3.9, so there may be another explanation.
>>>
>>> Try kernel 4.19.x, 5.4.19, 5.5.3, or later.  Definitely do
>>> not use kernels from 5.1-rc1 to 5.4.13 inclusive unless backported
>>> fixes are included.
>>
>> Sorry, I forgot to update on that: I used the old kernel but
>> also managed t reproduce on ii  linux-image-5.4.0-4-amd64
>>            5.4.19-1                            amd64     
>>   Linux 5.4 for 64-bit PCs (signed) Linux version 5.4.0-4-amd64
>> (debian-kernel@lists.debian.org) (gcc version 9.2.1 20200203
>> (Debian 9.2.1-28)) #1 SMP Debian 5.4.19-1 (2020-02-13)
>>
>>> I mention 5.5.3 and 5.4.19 instead of 5.5.0 and 5.4.14 because
>>> the later ones include the EOF dedupe fix.  4.19 avoids the
>>> regressions of later kernels.
>>
>> 5.4.19-1 matches your spec, but as latest Debian experimental
>> is "linux-signed-amd64 (5.5~rc5+1~exp1)", which is also above
>> your  5.5.3 recommendation, should I try again with that kernel
>> or even use the "5.5~rc5+1~exp1" config to apply it to yesterays
>> 5.5.13 LTS and build an own kernel?
>
> Despite the kernel version, would you like to mention any other
> history of the fs?

After crashing the btrfs the first time, I changed the underlying
crypto keys (so each data block from the previous btrfs would become
garbage if picked up in error, e.g. via wrong metadata offsets),
created a cryptsetup volume for the whole partition, formatted it
freshly and started filling it. As filling/dedup took long, the
btrfs was unmounted in between, but not more than 10 times.

> Especially about even clean shutdown/reboot of the system?

After the crash of the old fs, I started to adhere to the following
procedure for unmounting:

sync
[10sec]
sync
umount
sync
[10sec]
sync
cryptsetup close
sync
[10sec]
sync
[10sec]
sg_sync

Afterwards the dmesg log was checked and verified to be clean.

> And further more, what's the storage stack below btrfs? (Things
> like bcache, lvm, dmraid)

Following layout:
* USB-Disk
* Partition table
* cryptsetup on partition
* btrfs on /dev/mapper/... device

Regarding IO errors and instability, the USB layer is the most risky
part, so all logs were monitored for any USB messages (none found).
Due to encryption, any bit error on transfer would have corrupted
at least one cipher block, including everything following it until
the end of the sector, so any observed error should have contained
random data. This was not found either: all error messages showed
counts that were off by a small amount, not random values.

> Furthermore, the specific storage hardware (e.g. SATA/SAS HDD
> with its model name, the raid card if involved)

2TB SATA disk; no smartctl messages, no reallocated sectors, no
g-force sensor warnings, no temperature problems; "smartctl -t long"
completes successfully.

The USB hardware had been in use for a similar application (backup)
for ~4 years, but with encrypted ext4. During that time the yearly
checksum-list verification of all files always passed without any problem.

No RAID or other special hardware is in place. Disk data sanity is
validated regularly against an external database of all files and
their hashes. Synchronization is done using those databases, and
transfers or changes are done using rsync with immediate hash checks
afterwards.

On the one disk in question I have now also started to use the
deduplication features, but they are disabled for all other redundant
copies, as I do not want to lose all backup storages at once due to
some kind of bug. As deduplication might have been related to the
two crashes, I have no plans to expand the use of deduplication to
the two other devices yet.

> Have you experienced the same problem on other systems?

Not yet: as btrfs is getting more mature and would really be the
perfect filesystem, I want to get btrfs into production. Using it
for redundant backup disks for about 1-2 years, with all the standard
monthly/yearly backup validation procedures in place, serves as the
test for production fitness. Therefore I only have those two crashes,
both on backup hardware, so far.

>>>> As it seems that the bug here is somehow reproducible, I
>>>> would like to try to develop a reproducer exploit and fix
>>>> for that bug as an excercise. Unfortunately the fault occurs
>>>> only after transfering and deduplicating ~20TB of data.
>>>>
>>>> Are there any recommendations e.g. how to "bisect" that
>>>> problem?
>>>
>>> Find someone who has already done it and ask.  ;)
>>
>> Seems I found someone with good recommendations already :)
>>
>> Thank you!
>>
>>> Upgrade straight from 5.0.21 to 5.4.14 (or 5.4.19 if you
>>> want the dedupe fix too).  Don't run any kernel in between
>>> for btrfs.
>>>
>>> There was a bug introduced in 5.1-rc1, fixed in 5.4.14, which
>>> corrupts metadata.  It's a UAF bug, so its behavior can be
>>> unpredictable, but quite often the symptom is corrupted metadata
>>> or write-time tree-checker errors. Sometimes you just get
>>> a harmless NULL dereference crash, or some noise warnings.
>>>
>>> There are at least two other filesystem corrupting bugs with
>>> lifetimes overlapping that range of kernel versions; however
>>> both of those were fixed by 5.3.
>>
>> So maybe leaving my 5.4.19-1 to the 5.5+ series sounds like
>> recommended anyway?
>>
>>>> Is there a way (switch or source code modification) to log
>>>> all internal btrfs state transitions for later analysis?
>>>
>>> There are (e.g. the dm write logger), but most bugs that
>>> would be found in unit tests by such tools have been fixed
>>> by the time a kernel is released, and they'll only tell you
>>> that btrfs did something wrong, not why.
>>
>> As IO seems sane, the error reported "verify failed on 6680428544
>> wanted 12947 found 12945" seems not to point to a data structure
>> problem at a sector/page/block boundary (12947==0x3293), I
>> would also guess, that basic IO/paging is not involved in
>> it, but that the data structure is corrupted in memory and
>> used directly or written/reread ... therefore I would deem
>> write logs also as not the first way to go ..
>>
>>> Also, there can be tens of thousands of btrfs state transitions
>>> per second during dedupe, so the volume of logs themselves
>>> can present data wrangling challenges.
>>
>> Yes, that's why me asking. Maybe someone has already taken
>> up that challenge as such a tool-chain (generic transaction
>> logging with userspace stream compression, analysis) might
>> be quite handy for such task, but hell effort to build ...
>>
>>> The more invasively you try to track internal btrfs state,
>>> the more the tools become _part_ of that state, and introduce
>>> additional problems. e.g. there is the ref verifier, and
>>> the _bug fix history_ of the ref verifier...
>>
>> That is right. Therefore I hoped, that some minimal invasive
>> toolsets might be available already for kernel or maybe could
>> be written, e.g.
>>
>> * Install an alternative kernel page fault handler * Set
>> breakpoints on btrfs functions * When entering the function,
>> record return address, stack and register arguments, send
>> to userspace * Strip write bits kernel from page table for
>> most pages exept those needed by page fault handler * Continue
>> execution * For each pagefault, the pagefault flips back to
>> original page table, sends information about write fault (what,
>> where) to userspace, performs the faulted instruction before
>> switching back to read-only page table and continuing btrfs
>> function * When returning from the last btrfs function, also
>> switch back to standard page table.
>>
>> By being completely btrfs-agnostic, such tool should not introduce
>> any btrfs-specific issues due to the analysis process. Does
>> someone know about such a tool or a simplified version of
>> it?
>>
>> Doing similar over qemu/kernel debugging tools might be easier
>> to implement but too slow to handle that huge amount of data.
>>
>>>> Other ideas for debugging that?
>>>
>>> Run dedupe on a test machine with a few TB test corpus (or
>>> whatever your workload is) on a debug-enabled kernel, report
>>> every bug that kills the box or hurts the data, update the
>>> kernel to get fixes for the bugs that were reported.  Repeat
>>> until the box stops crapping itself, then use the kernel
>>> it stopped on (5.4.14 in this case).  Do that for every kernel
>>> upgrade because regressions are a thing.
>>
>> Well, that seems like overkill. My btrfs is not haunted by
>> a load of bugs, just one that corrupted the filesystem two
>> times when trying to deduplicate the same set of files.
>>
>> As desccribed, just creating a btrfs with only that file did
>> not trigger the corruption. If this is not a super-rare
>> coincidence, then something in the other 20TB of transferred
>> files has to have corrupted the file system or at least brought
>> it to a state, where then deduplication of exact that problematic
>> set of files triggered the final fault.
>>
>>>> Just creating the same number of snapshots and putting just
>>>> that single file into each of them did not trigger the bug
>>>> during deduplication.
>>>
>>> Dedupe itself is fine, but some of the supporting ioctls
>>> a deduper has to use to get information about the filesystem
>>> structure triggered a lot of bugs.
>>
>> To get rid of that, I already ripped out quite some of the
>> userspace deduping part. I now do the extent queries in a
>> Python tool using ctypes, split the dedup request into smaller
>> chunks (to improve logging granularity) and just use the deduper
>> to do that single FIDEDUPERANGE call (I was to lazy to ctype
>> that in Python too).
>>
>> Still deduplicating the same files caused corruption again.
>>
>> hd
>>
>>> ...


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-03-26 13:23           ` Zygo Blaxell
@ 2020-03-30  8:37             ` halfdog
  2020-03-31  8:13               ` halfdog
  0 siblings, 1 reply; 13+ messages in thread
From: halfdog @ 2020-03-30  8:37 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

Zygo Blaxell writes:
> On Thu, Mar 26, 2020 at 09:53:08AM +0000, halfdog wrote:
>> Thanks for your lengthy reply!
>>
>> Zygo Blaxell writes:
>>> On Tue, Mar 24, 2020 at 08:27:22AM +0000, halfdog wrote:
>>>> Hello list,
>>>>
>>>> It seems the woes really continued ... After trashing the
>>>> old, corrupted filesystem (see old message below) I started
>>>> rebuilding the storage. Synchronization from another (still
>>>> working) storage roughly should have performed the same
>>>> actions as during initial build (minus count and time of
>>>> mounts/unmounts, transfer interrupts, ...).
>>>>
>>>> It does not seem to be a mere coincidence, that the corruption
>>>> occured when deduplicating the exact same file as last time.
>>>> While corruption last time made disk completely inaccessible,
>>>> this time it just was mounted readonly with a different
>>>> error message:
>>>>
>>>> [156603.177699] BTRFS error (device dm-1): parent transid
>>>> verify failed on 6680428544 wanted 12947 found 12945
>>>> [156603.177707] BTRFS: error (device dm-1) in
>>>> __btrfs_free_extent:3080: errno=-5 IO failure [156603.177708]
>>>> BTRFS info (device dm-1): forced readonly [156603.177711]
>>>> BTRFS: error (device dm-1) in btrfs_run_delayed_refs:2188:
>>>> errno=-5 IO failure
>>>
>>> Normally those messages mean your hardware is dropping writes
>>> somewhere; however, you previously reported running kernels
>>> 5.3.0 and 5.3.9, so there may be another explanation.
>>>
>>> Try kernel 4.19.x, 5.4.19, 5.5.3, or later.  Definitely do
>>> not use kernels from 5.1-rc1 to 5.4.13 inclusive unless backported
>>> fixes are included.
>>
>> Sorry, I forgot to update on that: I used the old kernel but
>> also managed t reproduce on ii  linux-image-5.4.0-4-amd64
>>            5.4.19-1                            amd64     
>>   Linux 5.4 for 64-bit PCs (signed) Linux version 5.4.0-4-amd64
>> (debian-kernel@lists.debian.org) (gcc version 9.2.1 20200203
>> (Debian 9.2.1-28)) #1 SMP Debian 5.4.19-1 (2020-02-13)
>>
>>> I mention 5.5.3 and 5.4.19 instead of 5.5.0 and 5.4.14 because
>>> the later ones include the EOF dedupe fix.  4.19 avoids the
>>> regressions of later kernels.
>>
>> 5.4.19-1 matches your spec, but as latest Debian experimental
>> is "linux-signed-amd64 (5.5~rc5+1~exp1)", which is also above
>> your  5.5.3 recommendation, should I try again with that kernel
>> or even use the "5.5~rc5+1~exp1" config to apply it to yesterays
>> 5.5.13 LTS and build an own kernel?
>
> I would try a mainline kernel just to make sure Debian didn't
> backport something they shouldn't have.

OK, so let's go for that... If I got you right, you mentioned
two scenarios that might yield relevant information:

* Try a mainline kernel prior to the "reloc_root" change to see if
  the bug can already be reproduced with that one.
* Try a new 5.5.3 or later to see if the bug can still be reproduced.

Which of the two would be of higher value to you for the first test?

Could you please share a kernel.org link to the exact tarball
that should be tested? If there is a specific kernel configuration
you deem better suited for tests, that would be useful too. Otherwise
I would take the config from a Debian package with a kernel version
quite close to it and adapt it to the given kernel.

>>>> As it seems that the bug here is somehow reproducible, I
>>>> would like to try to develop a reproducer exploit and fix
>>>> for that bug as an excercise. Unfortunately the fault occurs
>>>> only after transfering and deduplicating ~20TB of data.
>>>>
>>>> Are there any recommendations e.g. how to "bisect" that
>>>> problem?
>>>
>>> Find someone who has already done it and ask.  ;)
>>
>> Seems I found someone with good recommendations already :)
>>
>> Thank you!
>>
>>> Upgrade straight from 5.0.21 to 5.4.14 (or 5.4.19 if you
>>> want the dedupe fix too).  Don't run any kernel in between
>>> for btrfs.
>>>
>>> There was a bug introduced in 5.1-rc1, fixed in 5.4.14, which
>>> corrupts metadata.  It's a UAF bug, so its behavior can be
>>> unpredictable, but quite often the symptom is corrupted metadata
>>> or write-time tree-checker errors. Sometimes you just get
>>> a harmless NULL dereference crash, or some noise warnings.
>>>
>>> There are at least two other filesystem corrupting bugs with
>>> lifetimes overlapping that range of kernel versions; however
>>> both of those were fixed by 5.3.
>>
>> So maybe leaving my 5.4.19-1 to the 5.5+ series sounds like
>> recommended anyway?
>
> I've logged an order of magnitude more testing hours on 5.4.x
> than on 5.5.x, so I can't recommend one over the other due
> to lack of symmetrical data.
>
> I'd really like to know if you can reproduce this on 4.19 or
> 5.0.21. Those kernels predate the 5.1.x reloc_root lifespan
> change that we are still debugging to this day (in addition
> to metadata corruption, that change also seems to cause the
> recent looping balance issue and possibly other issues that
> haven't been identified yet).
>
>>>> Is there a way (switch or source code modification) to log
>>>> all internal btrfs state transitions for later analysis?
>>>
>>> There are (e.g. the dm write logger), but most bugs that
>>> would be found in unit tests by such tools have been fixed
>>> by the time a kernel is released, and they'll only tell you
>>> that btrfs did something wrong, not why.
>>
>> As IO seems sane, the error reported "verify failed on 6680428544
>> wanted 12947 found 12945" seems not to point to a data structure
>> problem at a sector/page/block boundary (12947==0x3293), I
>> would also guess, that basic IO/paging is not involved in
>> it, but that the data structure is corrupted in memory and
>> used directly or written/reread ... therefore I would deem
>> write logs also as not the first way to go ..
>
> transids are _timestamps_ not _positions_.  The message is
> saying while walking the tree, we were expecting to read a
> block that was written at time 12947 (the reference item's
> timestamp), but we found one that was written at an earlier
> time 12945 (the referred block's timestamp) instead.

Thanks for the explanation. As the timestamps are close, this pretty
much proves that no bit-wise storage hardware failure is involved,
as that would scramble the whole cipher block. A bit-wise RAM error
is also unlikely; that would more likely look something like
12947 vs 134230675 (0x3293 vs 0x8003293).

Still, silently lost writes of whole blocks cannot be ruled out.
But as metadata is small compared to data, and all data was verified
against my external checksum database after writing, it would be
very unlikely that the storage hardware loses a block belonging to
the metadata of the same file twice while the other approximately
536870912 blocks did not suffer any such loss.

> This check is normally used to detect firmware bugs where the
> drive said sectors were written (so we wrote the referring
> block) but the sectors were not in fact written (so the referenced
> block contains old data with a good csum and some other check
> fields).
>
> transid verify also detects split-brain raid1 failures:  drive
> A completed write 12945 but drive B did not, then later we
> read drive B and due to the transid verify check, we can know
> that drive B is wrong and copy replacement data from drive
> A to correct drive B.

Good design! But there is no RAID involved here ...

> You can also get transid verify from host memory corruption,
> either due to RAM failure or kernel bug.  RAM failure is not
> usually so repeatable, but a kernel bug could be.

Again, it would be very weird for it to corrupt the same file's
metadata twice while keeping TBs of hash-verified data intact ...

>>> Also, there can be tens of thousands of btrfs state transitions
>>> per second during dedupe, so the volume of logs themselves
>>> can present data wrangling challenges.
>>
>> Yes, that's why me asking. Maybe someone has already taken
>> up that challenge as such a tool-chain (generic transaction
>> logging with userspace stream compression, analysis) might
>> be quite handy for such task, but hell effort to build ...
>>
>>> The more invasively you try to track internal btrfs state,
>>> the more the tools become _part_ of that state, and introduce
>>> additional problems. e.g. there is the ref verifier, and
>>> the _bug fix history_ of the ref verifier...
>>
>> That is right. Therefore I hoped, that some minimal invasive
>> toolsets might be available already for kernel or maybe could
>> be written, e.g.
>>
>> * Install an alternative kernel page fault handler * Set
>> breakpoints on btrfs functions * When entering the function,
>> record return address, stack and register arguments, send
>> to userspace * Strip write bits kernel from page table for
>> most pages exept those needed by page fault handler * Continue
>> execution * For each pagefault, the pagefault flips back to
>> original page table, sends information about write fault (what,
>> where) to userspace, performs the faulted instruction before
>> switching back to read-only page table and continuing btrfs
>> function * When returning from the last btrfs function, also
>> switch back to standard page table.
>>
>> By being completely btrfs-agnostic, such tool should not introduce
>> any btrfs-specific issues due to the analysis process. Does
>> someone know about such a tool or a simplified version of
>> it?
>>
>> Doing similar over qemu/kernel debugging tools might be easier
>> to implement but too slow to handle that huge amount of data.
>>
>>>> Other ideas for debugging that?
>>>
>>> Run dedupe on a test machine with a few TB test corpus (or
>>> whatever your workload is) on a debug-enabled kernel, report
>>> every bug that kills the box or hurts the data, update the
>>> kernel to get fixes for the bugs that were reported.  Repeat
>>> until the box stops crapping itself, then use the kernel
>>> it stopped on (5.4.14 in this case).  Do that for every kernel
>>> upgrade because regressions are a thing.
>>
>> Well, that seems like overkill. My btrfs is not haunted by
>> a load of bugs, just one that corrupted the filesystem two
>> times when trying to deduplicate the same set of files.
>>
>> As desccribed, just creating a btrfs with only that file did
>> not trigger the corruption. If this is not a super-rare
>> coincidence, then something in the other 20TB of transferred
>> files has to have corrupted the file system or at least brought
>> it to a state, where then deduplication of exact that problematic
>> set of files triggered the final fault.
>
> The above is a description of my standard test setup, which
> in turn is a more compact version of production workloads.
>  I haven't seen a 5.4.x filesystem splat since 5.4.14, but
> OTOH that's only 2 months.
>
> You could be finding something new here, but we have to make
> sure there's nothing wrong with your test setup.  Running a
> test with an older kernel would help with that.

So I count that as a 2:1 vote for testing an older kernel instead
of a new one ...

>>>> Just creating the same number of snapshots and putting just
>>>> that single file into each of them did not trigger the bug
>>>> during deduplication.
>>>
>>> Dedupe itself is fine, but some of the supporting ioctls
>>> a deduper has to use to get information about the filesystem
>>> structure triggered a lot of bugs.
>>
>> To get rid of that, I already ripped out quite some of the
>> userspace deduping part. I now do the extent queries in a
>> Python tool using ctypes, split the dedup request into smaller
>> chunks (to improve logging granularity) and just use the deduper
>> to do that single FIDEDUPERANGE call (I was to lazy to ctype
>> that in Python too).
>>
>> Still deduplicating the same files caused corruption again.
>>
>> hd
>>
>>> ...


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-03-30  8:37             ` halfdog
@ 2020-03-31  8:13               ` halfdog
  2020-04-01 23:31                 ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: halfdog @ 2020-03-31  8:13 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs

halfdog writes:
> Zygo Blaxell writes:
>> ...
>> I would try a mainline kernel just to make sure Debian didn't
>> backport something they shouldn't have.
>
> OK, so let's go for that... If I got you right, you mentioned
> two scenarios, that might yield relevant information:
>
> * Try a mainline kernel prior to "reloc_root" to see if the bug
>   could already be reproduced with that one.
> * Try a new 5.5.3 or later to see if the bug still can be reproduced.
>
> Which of both would be or higher value to you for the first test?
>
> Could you please share a kernel.org link to the exact tarball
> that should be tested? If there is a specific kernel configuration
> you deem superior for tests, that would be useful too. Otherwise
> I would use one from a Debian package with a kernel version quite
> close and adapt it to the given kernel.

Yesterday I started preparing the test infrastructure and checking
whether my old test documentation still works with current software.
I ran a modified trinity test on a 128MB btrfs loop mount. The
test started at 12:02; at 14:30 trinity was OOM-killed. As I
did not monitor the virtual machine, over the next hours, without
trinity running any more, other processes were killed one after
another until at 21:13 finally init was killed too.
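
For reference, the small loop-mounted test filesystem described above
can be recreated along these lines (a sketch only; paths and the size
are examples, mkfs.btrfs may want --mixed for very small images, and
root is required for the mount):

#!/usr/bin/env python3
# Create a small image file, format it as btrfs and loop-mount it.
import os
import subprocess
import tempfile

def make_btrfs_loop(size_mb=128, mountpoint="/mnt/btrfs-test"):
    img = tempfile.NamedTemporaryFile(prefix="btrfs-", suffix=".img", delete=False)
    img.truncate(size_mb * 1024 * 1024)   # sparse backing file
    img.close()
    subprocess.run(["mkfs.btrfs", "-f", img.name], check=True)
    os.makedirs(mountpoint, exist_ok=True)
    subprocess.run(["mount", "-o", "loop", img.name, mountpoint], check=True)
    return img.name, mountpoint

if __name__ == "__main__":
    print(make_btrfs_loop())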

As I run similar tests for many days on ext4 filesystems, could
this be related to a btrfs memory leak, even one leaking just due
to the btrfs kernel workers? If so, when compiling a test kernel,
is there any option you recommend setting to verify/rule out/
pin-point btrfs leakage with trinity?

> ...


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-03-31  8:13               ` halfdog
@ 2020-04-01 23:31                 ` Zygo Blaxell
  2020-04-30  4:49                   ` halfdog
  0 siblings, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2020-04-01 23:31 UTC (permalink / raw)
  To: halfdog; +Cc: linux-btrfs

On Tue, Mar 31, 2020 at 08:13:30AM +0000, halfdog wrote:
> halfdog writes:
> > Zygo Blaxell writes:
> >> ...
> >> I would try a mainline kernel just to make sure Debian didn't
> >> backport something they shouldn't have.
> >
> > OK, so let's go for that... If I got you right, you mentioned
> > two scenarios, that might yield relevant information:
> >
> > * Try a mainline kernel prior to "reloc_root" to see if the bug
> >   could already be reproduced with that one.
> > * Try a new 5.5.3 or later to see if the bug still can be reproduced.
> >
> > Which of both would be or higher value to you for the first test?
> >
> > Could you please share a kernel.org link to the exact tarball
> > that should be tested? If there is a specific kernel configuration
> > you deem superior for tests, that would be useful too. Otherwise
> > I would use one from a Debian package with a kernel version quite
> > close and adapt it to the given kernel.
> 
> Yesterday I started preparing test infrastructure and to see
> if my old test documentation still works with current software.
> I ran a modified trinity test on a 128MB btrfs loop mount. The
> test started at 12:02, at 14:30 trinity was OOM killed. As I
> did not monitor the virtual machine, over the next hours without
> trinity running any more also other processes were killed one
> after another until at 21:13 finally also init was killed.
> 
> As I run similar tests for many days on ext4 filesystems, could
> this be related to a btrfs memory leak even leaking just due
> to the btrfs kernel workers? 

How big is the test VM?  I run btrfs on machines as small as 512M,
but no smaller--and I don't try to do memory-hungry things like dedupe
or balance on such machines.  Some kernel debug options use a lot of
memory too, or break it up into pieces too small to use (e.g. KASAN and
the btrfs ref verifier).

> If so, when compiling a test kernel,
> is there any option you recommend setting to verify/rule out/
> pin-point btrfs leakage with trinity?

There is kmemleak.

You can also run 'slabtop' and just watch the numbers grow.
'slabtop' usually names the thing that is leaking.

If the thing you've got too much of is btrfs_delayed_ref_head,
you should definitely go back to 4.19 for now.
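
For the "watch the numbers grow" part, a rough helper along these
lines could be used instead of slabtop (hypothetical script, run as
root; with SLUB cache merging some btrfs caches may only show up
under their own names when booting with slab_nomerge):

#!/usr/bin/env python3
# Periodically print the object counts of all btrfs-related slab caches.
import time

def btrfs_slabs():
    counts = {}
    with open("/proc/slabinfo") as f:
        for line in f.readlines()[2:]:           # skip the two header lines
            fields = line.split()
            if fields[0].startswith("btrfs"):
                counts[fields[0]] = int(fields[2])   # total number of objects
    return counts

while True:
    for name, objs in sorted(btrfs_slabs().items()):
        print("%-40s %10d" % (name, objs))
    print("-" * 52)
    time.sleep(10)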

> > ...
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FIDEDUPERANGE woes may continue (or unrelated issue?)
  2020-04-01 23:31                 ` Zygo Blaxell
@ 2020-04-30  4:49                   ` halfdog
  0 siblings, 0 replies; 13+ messages in thread
From: halfdog @ 2020-04-30  4:49 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

Zygo Blaxell writes:
> On Tue, Mar 31, 2020 at 08:13:30AM +0000, halfdog wrote:
>> halfdog writes:
>>> Zygo Blaxell writes:
>>>> ... I would try a mainline kernel just to make sure Debian
>>>> didn't backport something they shouldn't have.
>>>
>>> OK, so let's go for that... If I got you right, you mentioned
>>> two scenarios, that might yield relevant information:
>>>
>>> * Try a mainline kernel prior to "reloc_root" to see if the
>>> bug could already be reproduced with that one. * Try a new
>>> 5.5.3 or later to see if the bug still can be reproduced.
>>>
>>> Which of both would be or higher value to you for the first
>>> test?
>>>
>>> Could you please share a kernel.org link to the exact tarball
>>> that should be tested? If there is a specific kernel configuration
>>> you deem superior for tests, that would be useful too. Otherwise
>>> I would use one from a Debian package with a kernel version
>>> quite close and adapt it to the given kernel.
>>
>> Yesterday I started preparing test infrastructure and to see
>> if my old test documentation still works with current software.
>> I ran a modified trinity test on a 128MB btrfs loop mount.
>> The test started at 12:02, at 14:30 trinity was OOM killed.
>> As I did not monitor the virtual machine, over the next hours
>> without trinity running any more also other processes were
>> killed one after another until at 21:13 finally also init
>> was killed.
>>
>> As I run similar tests for many days on ext4 filesystems,
>> could this be related to a btrfs memory leak even leaking
>> just due to the btrfs kernel workers?
>
> How big is the test VM?  I run btrfs on machines as small as
> 512M, but no smaller--and I don't try to do memory-hungry things
> like dedupe or balance on such machines.  Some kernel debug
> options use a lot of memory too, or break it up into pieces
> too small to use (e.g. KASAN and the btrfs ref verifier).

Sorry, I did not manage to run more VM tests since my last mail.
My VM is also 512MB ...

>> If so, when compiling a test kernel, is there any option you
>> recommend setting to verify/rule out/ pin-point btrfs leakage
>> with trinity?
>
> There is kmemleak.

I will try to run/analyze this when there is more time, as I
assume I will be quite slow at it due to being inexperienced.

> You can also run 'slabtop' and just watch the numbers grow.
> 'slabtop' usually names the thing that is leaking.
>
> If the thing you've got too much of is btrfs_delayed_ref_head,
> you should definitely go back to 4.19 for now.
>
>>> ...
>>

What I had time for (as all the steps were scripted already from
previous testing) was to rerun the deduplication tests after Debian
released a new kernel:

Linux version 5.5.0-1-amd64 (debian-kernel@lists.debian.org) (gcc version 9.3.0 (Debian 9.3.0-8)) #1 SMP Debian 5.5.13-2 (2020-03-30)
ii  linux-image-5.5.0-1-amd64            5.5.13-2                            amd64        Linux 5.5 for 64-bit PCs (signed)

As I noticed last time that the deduplication process died quite hard
and might not flush stderr/stdout in that situation, I added
flush instructions after each successful file deduplication.

Result this time:

[39391.950277] BTRFS error (device dm-1): bad tree block start, want 1723369025536 have 17869752742738540124
[39391.950291] BTRFS: error (device dm-1) in btrfs_start_dirty_block_groups:2472: errno=-5 IO failure
[39391.950294] BTRFS info (device dm-1): forced readonly
[39392.021961] BTRFS warning (device dm-1): Skipping commit of aborted transaction.

want: 0x19140cc7000 (1723369025536) vs 0xf7fe1d7106fe365c (17869752742738540124)

The deduplication failure happened after more than 100hrs of deduplication
IO. With the improvements from flushing/logging it could be pinpointed
to have happened just a few seconds AFTER the very same file
that caused the last 3 file system corruption events.

With the old logging structure, the logging would have reported the
exact same file as with all the previous events.

Currently I am running a tool to search for the source of the
"0xf7fe1d7106fe365c" value on disk and in file content, as it
does not seem to be a kernel or userspace address, machine code,
... Maybe this gives a hint about what happened. Is this value
a stored checksum of a block/entry?
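
For reference, that kind of scan only needs a few lines of Python.
A rough sketch (device path and value are examples; it assumes the
value is stored little-endian, as btrfs on-disk structures are):

#!/usr/bin/env python3
# Scan a block device or image file for a 64-bit little-endian value
# and print the byte offset of every hit.
import struct
import sys

def scan(path, value, chunk_size=64 * 1024 * 1024):
    needle = struct.pack("<Q", value)
    overlap = len(needle) - 1
    offset = 0
    prev_tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = prev_tail + chunk
            pos = data.find(needle)
            while pos != -1:
                print("0x%x: found" % (offset - len(prev_tail) + pos))
                pos = data.find(needle, pos + 1)
            prev_tail = data[-overlap:]
            offset += len(chunk)

if __name__ == "__main__":
    # e.g. scan /dev/mapper/backup 0xf7fe1d7106fe365c
    scan(sys.argv[1], int(sys.argv[2], 0))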

The "want" value is found easly (multiple tree/superblocks?):

Searching for 2 patterns, maximum length 8.
0x13ff0459e: found want
0x14032a59e: found want
0x192d401d4: found want

What is weird, though, is the page/block/sector-wise alignment: 0x59e vs 0x1d4

As it seems that the crash happens after deduping ~100 small files
after that large one, I will try to create a test case with the same
structure. Last time my test case only emulated the situation
with the large file, which did not trigger any filesystem corruption.
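
A rough sketch of a generator for that kind of test layout (names,
sizes and the mount point are made up; two identical copies of each
file are created so the deduper has something to merge):

#!/usr/bin/env python3
# Create one large file plus ~100 identical small files, twice.
import os

LARGE_SIZE = 1 << 30     # sparse 1 GiB "large" file
SMALL_COUNT = 100
SMALL_SIZE = 4096

def make_tree(base, large_head, small_data):
    os.makedirs(base, exist_ok=True)
    with open(os.path.join(base, "large.bin"), "wb") as f:
        f.write(large_head)       # some real data at the start ...
        f.truncate(LARGE_SIZE)    # ... the rest left as a hole
    for i in range(SMALL_COUNT):
        with open(os.path.join(base, "small-%03d.bin" % i), "wb") as f:
            f.write(small_data)

if __name__ == "__main__":
    large_head = os.urandom(1 << 20)
    small_data = os.urandom(SMALL_SIZE)
    for copy in ("copy-a", "copy-b"):
        make_tree(os.path.join("/mnt/btrfs-test", copy), large_head, small_data)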

hd


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-04-30  4:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-22  6:18 FIDEDUPERANGE woes may continue (or unrelated issue?) halfdog
2020-02-22  7:22 ` Qu Wenruo
2020-02-22 18:30   ` halfdog
2020-03-24  8:27     ` halfdog
2020-03-25  3:53       ` Zygo Blaxell
2020-03-26  9:53         ` halfdog
2020-03-26 12:12           ` Qu Wenruo
2020-03-30  8:10             ` halfdog
2020-03-26 13:23           ` Zygo Blaxell
2020-03-30  8:37             ` halfdog
2020-03-31  8:13               ` halfdog
2020-04-01 23:31                 ` Zygo Blaxell
2020-04-30  4:49                   ` halfdog
