All of lore.kernel.org
 help / color / mirror / Atom feed
* FIDEDUPERANGE claims to succeed for non-identical files
@ 2022-12-20  0:10 Zbigniew Halas
  2022-12-22  8:25 ` Amir Goldstein
  0 siblings, 1 reply; 7+ messages in thread
From: Zbigniew Halas @ 2022-12-20  0:10 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel

Hi,

I noticed a strange and misleading edge case in FIDEDUPERANGE ioctl.
For the files with the following content:
f1: abcd
f2: efghi
FIDEDUPERANGE claims to succeed and reports 4 bytes deduplicated,
despite the files being clearly different. Strace output:
openat(AT_FDCWD, "f1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "f2", O_WRONLY|O_CLOEXEC) = 4
ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE, {src_offset=0,
src_length=4, dest_count=1, info=[{dest_fd=4, dest_offset=0}]} =>
{info=[{bytes_deduped=4, status=0}]}) = 0

The reason is that generic_remap_checks function is doing block
alignment of the deduplication length when the end of the destination
file is not at the end of the deduplication range (as described in the
comment):
/*
* If the user wanted us to link to the infile's EOF, round up to the
* next block boundary for this check.
*
* Otherwise, make sure the count is also block-aligned, having
* already confirmed the starting offsets' block alignment.
*/

So it effectively becomes a zero-length deduplication, which succeeds.
Despite that it's reported as a successful 4 bytes deduplication.

For a very similar test case, but with the files of the same length:
f3: abcd
f4: efgh
FIDEDUPERANGE fails with FILE_DEDUPE_RANGE_DIFFERS. Strace output:
openat(AT_FDCWD, "f3", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "f4", O_WRONLY|O_CLOEXEC) = 4
ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE, {src_offset=0,
src_length=4, dest_count=1, info=[{dest_fd=4, dest_offset=0}]} =>
{info=[{bytes_deduped=0, status=1}]}) = 0

For this case generic_remap_checks does not alter deduplication
length, and deduplication fails when comparing the content of the
files.

Cheers,
Zbigniew Halas

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FIDEDUPERANGE claims to succeed for non-identical files
  2022-12-20  0:10 FIDEDUPERANGE claims to succeed for non-identical files Zbigniew Halas
@ 2022-12-22  8:25 ` Amir Goldstein
  2022-12-22 14:41   ` Zbigniew Halas
  0 siblings, 1 reply; 7+ messages in thread
From: Amir Goldstein @ 2022-12-22  8:25 UTC (permalink / raw)
  To: Zbigniew Halas, Darrick J. Wong, Filipe Manana; +Cc: viro, linux-fsdevel

On Tue, Dec 20, 2022 at 2:51 AM Zbigniew Halas <zhalas@gmail.com> wrote:
>
> Hi,
>
> I noticed a strange and misleading edge case in FIDEDUPERANGE ioctl.
> For the files with the following content:
> f1: abcd
> f2: efghi
> FIDEDUPERANGE claims to succeed and reports 4 bytes deduplicated,
> despite the files being clearly different. Strace output:
> openat(AT_FDCWD, "f1", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "f2", O_WRONLY|O_CLOEXEC) = 4
> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE, {src_offset=0,
> src_length=4, dest_count=1, info=[{dest_fd=4, dest_offset=0}]} =>
> {info=[{bytes_deduped=4, status=0}]}) = 0
>
> The reason is that generic_remap_checks function is doing block
> alignment of the deduplication length when the end of the destination
> file is not at the end of the deduplication range (as described in the
> comment):
> /*
> * If the user wanted us to link to the infile's EOF, round up to the
> * next block boundary for this check.
> *
> * Otherwise, make sure the count is also block-aligned, having
> * already confirmed the starting offsets' block alignment.
> */
>
> So it effectively becomes a zero-length deduplication, which succeeds.
> Despite that it's reported as a successful 4 bytes deduplication.

Ouch!

>
> For a very similar test case, but with the files of the same length:
> f3: abcd
> f4: efgh
> FIDEDUPERANGE fails with FILE_DEDUPE_RANGE_DIFFERS. Strace output:
> openat(AT_FDCWD, "f3", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "f4", O_WRONLY|O_CLOEXEC) = 4
> ioctl(3, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE, {src_offset=0,
> src_length=4, dest_count=1, info=[{dest_fd=4, dest_offset=0}]} =>
> {info=[{bytes_deduped=0, status=1}]}) = 0
>
> For this case generic_remap_checks does not alter deduplication
> length, and deduplication fails when comparing the content of the
> files.
>

Thanks for the analysis.
Would you be interested in trying to fix the bug and writing a test?
I can help if you would like.

It's hard to follow all the changes since
54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs")
in v4.5, but it *looks* like this behavior might have been in btrfs,
before the ioctl was promoted to vfs.. not sure.

We have fstests coverage for the "good" case of same size src/dst
(generic/136), but I didn't find a test for the non-same size src/dst.

In any case, vfs_dedupe_file_range_one() and ->remap_file_range()
do not even have an interface to return the actual bytes_deduped,
so I do not see how any of the REMAP_FILE_CAN_SHORTEN cases
are valid, regardless of EOF.

What am I missing?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FIDEDUPERANGE claims to succeed for non-identical files
  2022-12-22  8:25 ` Amir Goldstein
@ 2022-12-22 14:41   ` Zbigniew Halas
  2023-01-04 17:25     ` Darrick J. Wong
  0 siblings, 1 reply; 7+ messages in thread
From: Zbigniew Halas @ 2022-12-22 14:41 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Darrick J. Wong, Filipe Manana, viro, linux-fsdevel

On Thu, Dec 22, 2022 at 9:25 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Thanks for the analysis.
> Would you be interested in trying to fix the bug and writing a test?
> I can help if you would like.

I can give it a try unless it turns out that some deep VFS changes are
required, but let's try to narrow down the reasonable API semantics
first.

> It's hard to follow all the changes since
> 54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs")
> in v4.5, but it *looks* like this behavior might have been in btrfs,
> before the ioctl was promoted to vfs.. not sure.
>
> We have fstests coverage for the "good" case of same size src/dst
> (generic/136), but I didn't find a test for the non-same size src/dst.
>
> In any case, vfs_dedupe_file_range_one() and ->remap_file_range()
> do not even have an interface to return the actual bytes_deduped,
> so I do not see how any of the REMAP_FILE_CAN_SHORTEN cases
> are valid, regardless of EOF.

Not sure about this, it looks to me that they are actually returning
the number of bytes deduped, but the value is not used, but maybe I'm
missing something.
Anyway I think there are valid cases when REMAP_FILE_CAN_SHORTEN makes sense.
For example if a source file content is a prefix of a destination file
content and we want to dedup the whole range of the source file
without REMAP_FILE_CAN_SHORTEN,
then the ioctl will only succeed when the end of the source file is at
the block boundary, otherwise it will just fail. This will render the
API very inconsistent.

Cheers,
Zbigniew

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FIDEDUPERANGE claims to succeed for non-identical files
  2022-12-22 14:41   ` Zbigniew Halas
@ 2023-01-04 17:25     ` Darrick J. Wong
  2023-01-04 19:36       ` Zbigniew Halas
  0 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2023-01-04 17:25 UTC (permalink / raw)
  To: Zbigniew Halas; +Cc: Amir Goldstein, Filipe Manana, viro, linux-fsdevel

On Thu, Dec 22, 2022 at 03:41:30PM +0100, Zbigniew Halas wrote:
> On Thu, Dec 22, 2022 at 9:25 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Thanks for the analysis.
> > Would you be interested in trying to fix the bug and writing a test?
> > I can help if you would like.
> 
> I can give it a try unless it turns out that some deep VFS changes are
> required, but let's try to narrow down the reasonable API semantics
> first.
> 
> > It's hard to follow all the changes since
> > 54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs")
> > in v4.5, but it *looks* like this behavior might have been in btrfs,
> > before the ioctl was promoted to vfs.. not sure.
> >
> > We have fstests coverage for the "good" case of same size src/dst
> > (generic/136), but I didn't find a test for the non-same size src/dst.
> >
> > In any case, vfs_dedupe_file_range_one() and ->remap_file_range()
> > do not even have an interface to return the actual bytes_deduped,

The number of bytes remapped is passed back via the loff_t return value
of ->remap_file_range.  If CAN_SHORTEN is set, the VFS and the fs
implementation are allowed to reduce the @len parameter as much as they
want.  TBH I'm mystified why the original btrfs dedupe ioctl wouldn't
allow deduplication of common prefixes, which means that len only gets
shortened to avoid weird problems when dealing with eof being in the
middle of a block.

(Or not, since there's clearly a bug.)

> > so I do not see how any of the REMAP_FILE_CAN_SHORTEN cases
> > are valid, regardless of EOF.
> 
> Not sure about this, it looks to me that they are actually returning
> the number of bytes deduped, but the value is not used, but maybe I'm
> missing something.
> Anyway I think there are valid cases when REMAP_FILE_CAN_SHORTEN makes sense.
> For example if a source file content is a prefix of a destination file
> content and we want to dedup the whole range of the source file
> without REMAP_FILE_CAN_SHORTEN,
> then the ioctl will only succeed when the end of the source file is at
> the block boundary, otherwise it will just fail. This will render the
> API very inconsistent.

<nod> I'll try to figure out where the len adjusting went wrong here and
report back.

--D

> Cheers,
> Zbigniew

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FIDEDUPERANGE claims to succeed for non-identical files
  2023-01-04 17:25     ` Darrick J. Wong
@ 2023-01-04 19:36       ` Zbigniew Halas
  2023-02-04  2:23         ` Darrick J. Wong
  0 siblings, 1 reply; 7+ messages in thread
From: Zbigniew Halas @ 2023-01-04 19:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Amir Goldstein, Filipe Manana, viro, linux-fsdevel

On Wed, Jan 4, 2023 at 6:25 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Dec 22, 2022 at 03:41:30PM +0100, Zbigniew Halas wrote:
> > On Thu, Dec 22, 2022 at 9:25 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > Thanks for the analysis.
> > > Would you be interested in trying to fix the bug and writing a test?
> > > I can help if you would like.
> >
> > I can give it a try unless it turns out that some deep VFS changes are
> > required, but let's try to narrow down the reasonable API semantics
> > first.
> >
> > > It's hard to follow all the changes since
> > > 54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs")
> > > in v4.5, but it *looks* like this behavior might have been in btrfs,
> > > before the ioctl was promoted to vfs.. not sure.
> > >
> > > We have fstests coverage for the "good" case of same size src/dst
> > > (generic/136), but I didn't find a test for the non-same size src/dst.
> > >
> > > In any case, vfs_dedupe_file_range_one() and ->remap_file_range()
> > > do not even have an interface to return the actual bytes_deduped,
>
> The number of bytes remapped is passed back via the loff_t return value
> of ->remap_file_range.  If CAN_SHORTEN is set, the VFS and the fs
> implementation are allowed to reduce the @len parameter as much as they
> want.  TBH I'm mystified why the original btrfs dedupe ioctl wouldn't
> allow deduplication of common prefixes, which means that len only gets
> shortened to avoid weird problems when dealing with eof being in the
> middle of a block.
>
> (Or not, since there's clearly a bug.)
>
> > > so I do not see how any of the REMAP_FILE_CAN_SHORTEN cases
> > > are valid, regardless of EOF.
> >
> > Not sure about this, it looks to me that they are actually returning
> > the number of bytes deduped, but the value is not used, but maybe I'm
> > missing something.
> > Anyway I think there are valid cases when REMAP_FILE_CAN_SHORTEN makes sense.
> > For example if a source file content is a prefix of a destination file
> > content and we want to dedup the whole range of the source file
> > without REMAP_FILE_CAN_SHORTEN,
> > then the ioctl will only succeed when the end of the source file is at
> > the block boundary, otherwise it will just fail. This will render the
> > API very inconsistent.
>
> <nod> I'll try to figure out where the len adjusting went wrong here and
> report back.

Hey Darrick,

Len adjustment happens in generic_remap_checks, specifically here:
    if (pos_in + count == size_in &&
        (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) {
        bcount = ALIGN(size_in, bs) - pos_in;
    } else {
        if (!IS_ALIGNED(count, bs))
            count = ALIGN_DOWN(count, bs);
        bcount = count;
    }
the problem is that in this particular case length is set to zero, and
it makes generic_remap_file_range_prep to succeed.
I think that 2 things are needed to make this API behave reasonably:
* always use the original length for the file content comparison, not
the truncated one (there currently is also a short circuit for zero
length that skips comparison, this would need to be revisited as well)
- otherwise we may report successful deduplication, despite the ranges
being clearly different and only the truncated ranges being the same.
* report real, possibly truncated deduplication length to the userland

Cheers,
Zbigniew

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FIDEDUPERANGE claims to succeed for non-identical files
  2023-01-04 19:36       ` Zbigniew Halas
@ 2023-02-04  2:23         ` Darrick J. Wong
  2023-02-04  7:00           ` Amir Goldstein
  0 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2023-02-04  2:23 UTC (permalink / raw)
  To: Zbigniew Halas
  Cc: Amir Goldstein, Filipe Manana, viro, linux-fsdevel, Dave Chinner

On Wed, Jan 04, 2023 at 08:36:16PM +0100, Zbigniew Halas wrote:
> On Wed, Jan 4, 2023 at 6:25 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, Dec 22, 2022 at 03:41:30PM +0100, Zbigniew Halas wrote:
> > > On Thu, Dec 22, 2022 at 9:25 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > Thanks for the analysis.
> > > > Would you be interested in trying to fix the bug and writing a test?
> > > > I can help if you would like.
> > >
> > > I can give it a try unless it turns out that some deep VFS changes are
> > > required, but let's try to narrow down the reasonable API semantics
> > > first.
> > >
> > > > It's hard to follow all the changes since
> > > > 54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs")
> > > > in v4.5, but it *looks* like this behavior might have been in btrfs,
> > > > before the ioctl was promoted to vfs.. not sure.
> > > >
> > > > We have fstests coverage for the "good" case of same size src/dst
> > > > (generic/136), but I didn't find a test for the non-same size src/dst.
> > > >
> > > > In any case, vfs_dedupe_file_range_one() and ->remap_file_range()
> > > > do not even have an interface to return the actual bytes_deduped,
> >
> > The number of bytes remapped is passed back via the loff_t return value
> > of ->remap_file_range.  If CAN_SHORTEN is set, the VFS and the fs
> > implementation are allowed to reduce the @len parameter as much as they
> > want.  TBH I'm mystified why the original btrfs dedupe ioctl wouldn't
> > allow deduplication of common prefixes, which means that len only gets
> > shortened to avoid weird problems when dealing with eof being in the
> > middle of a block.
> >
> > (Or not, since there's clearly a bug.)
> >
> > > > so I do not see how any of the REMAP_FILE_CAN_SHORTEN cases
> > > > are valid, regardless of EOF.
> > >
> > > Not sure about this, it looks to me that they are actually returning
> > > the number of bytes deduped, but the value is not used, but maybe I'm
> > > missing something.
> > > Anyway I think there are valid cases when REMAP_FILE_CAN_SHORTEN makes sense.
> > > For example if a source file content is a prefix of a destination file
> > > content and we want to dedup the whole range of the source file
> > > without REMAP_FILE_CAN_SHORTEN,
> > > then the ioctl will only succeed when the end of the source file is at
> > > the block boundary, otherwise it will just fail. This will render the
> > > API very inconsistent.
> >
> > <nod> I'll try to figure out where the len adjusting went wrong here and
> > report back.
> 
> Hey Darrick,
> 
> Len adjustment happens in generic_remap_checks, specifically here:
>     if (pos_in + count == size_in &&
>         (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) {
>         bcount = ALIGN(size_in, bs) - pos_in;
>     } else {
>         if (!IS_ALIGNED(count, bs))
>             count = ALIGN_DOWN(count, bs);
>         bcount = count;
>     }
> the problem is that in this particular case length is set to zero, and
> it makes generic_remap_file_range_prep to succeed.
> I think that 2 things are needed to make this API behave reasonably:
> * always use the original length for the file content comparison, not
> the truncated one (there currently is also a short circuit for zero
> length that skips comparison, this would need to be revisited as well)
> - otherwise we may report successful deduplication, despite the ranges
> being clearly different and only the truncated ranges being the same.
> * report real, possibly truncated deduplication length to the userland

I was about to send in a patch when git blame told me that Linus was the
last person to change the last line of this chunk:

		deduped = vfs_dedupe_file_range_one(file, off, dst_file,
						    info->dest_offset, len,
						    REMAP_FILE_CAN_SHORTEN);
		if (deduped == -EBADE)
			info->status = FILE_DEDUPE_RANGE_DIFFERS;
		else if (deduped < 0)
			info->status = deduped;
		else
			info->bytes_deduped = len;

And then I remembered that I had totally repressed my entire memory of
this unpleasant discussion wherein I got the short end of the stick not
because I'd designed the weird btrfs ioctl that eventually became
FIDEDUPERANGE but because I was the last person to make any real
changes:
https://lore.kernel.org/all/a7c93559-4ba1-df2f-7a85-55a143696405@tu-darmstadt.de/

Originally, btrfs returned the request length, but also didn't do any
checking or operation length reduction to prevent people from remapping
a partial EOF block into the middle of a file:
https://elixir.bootlin.com/linux/v3.19/source/fs/btrfs/ioctl.c#L3018

Dave and I spent a long time adapting XFS to this interface and hoisted
the generic(ish) parts of the btrfs code to the VFS.  The VFS support
code returned the number of bytes remapped, though at that point either
the original request succeeded or it didn't, even if doing so set up
data corruption:
https://elixir.bootlin.com/linux/v4.5/source/fs/read_write.c#L1655

Miklos went and committed this change containing no description of why
it was made.  This patch (AFAICT) was never sent to fsdevel and does not
seem to have been reviewed by anybody.  The patch sets up the current
broken behavior where the kernel can dedupe fewer bytes than was asked
for, yet return the original request length:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5740c99e9d30b81fcc478797e7215c61e241f44e

Five months later I noticed and fixed the problems with remapping
partial eof blocks into the middle of files and fixed it, but none of us
noticed there was a logic bomb lurking in bytes_deduped:
https://lore.kernel.org/linux-fsdevel/153981625504.5568.2708520119290577378.stgit@magnolia/

A different person wrote a test for sub-block dedupe, but even he failed
to notice the broken behavior:
https://lore.kernel.org/fstests/20181105111445.11870-1-fdmanana@kernel.org/

Four years later (last summer), someone produced an attempt to fix this
particular weird discrepancy:
https://lore.kernel.org/all/b5805118-7e56-3d43-28e9-9e0198ee43f3@tu-darmstadt.de/

Which then Dave asked for a revert because it broke generic/517:
https://lore.kernel.org/all/20220714223238.GH3600936@dread.disaster.area/

Since then, AFAICT there's been no followup to "We just need a bit of
time to look at the various major dedupe apps and check that they still
do the right thing w.r.t. proposed change."  The manpage for
FIDEDUPERANGE says that bytes_deduped contains the number of bytes
actually remapped, but that doesn't matter because Linus only cares
about past kernel behavior.  That behavior is now a mess, since every
LTS kernel since 4.19 has been over-reporting the number of bytes
deduped.

Unfortunately, while *I* think the correct behavior is to report bytes
remapped even if that quantity becomes zero due to alignment issues, we
can't report zero here because duperemove will go into an infinite loop
because the author did the usual "len += bytes_deduped" without checking
for a zero length.  This causes generic/561 to run forever.  One could
argue that we could return "extents differ" in that case, but then that
breaks generic/517 and generic/182 which aren't expecting that behavior
either.

It's probably just time for us to just burn FIEDEDUPERANGE to the ground
and for me to convert another of my other hobbies into my profession.

--D

> Cheers,
> Zbigniew

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FIDEDUPERANGE claims to succeed for non-identical files
  2023-02-04  2:23         ` Darrick J. Wong
@ 2023-02-04  7:00           ` Amir Goldstein
  0 siblings, 0 replies; 7+ messages in thread
From: Amir Goldstein @ 2023-02-04  7:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Zbigniew Halas, Filipe Manana, viro, linux-fsdevel, Dave Chinner

On Sat, Feb 4, 2023 at 4:23 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jan 04, 2023 at 08:36:16PM +0100, Zbigniew Halas wrote:
> > On Wed, Jan 4, 2023 at 6:25 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Thu, Dec 22, 2022 at 03:41:30PM +0100, Zbigniew Halas wrote:
> > > > On Thu, Dec 22, 2022 at 9:25 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > >
> > > > > Thanks for the analysis.
> > > > > Would you be interested in trying to fix the bug and writing a test?
> > > > > I can help if you would like.
> > > >
> > > > I can give it a try unless it turns out that some deep VFS changes are
> > > > required, but let's try to narrow down the reasonable API semantics
> > > > first.
> > > >
> > > > > It's hard to follow all the changes since
> > > > > 54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs")
> > > > > in v4.5, but it *looks* like this behavior might have been in btrfs,
> > > > > before the ioctl was promoted to vfs.. not sure.
> > > > >
> > > > > We have fstests coverage for the "good" case of same size src/dst
> > > > > (generic/136), but I didn't find a test for the non-same size src/dst.
> > > > >
> > > > > In any case, vfs_dedupe_file_range_one() and ->remap_file_range()
> > > > > do not even have an interface to return the actual bytes_deduped,
> > >
> > > The number of bytes remapped is passed back via the loff_t return value
> > > of ->remap_file_range.  If CAN_SHORTEN is set, the VFS and the fs
> > > implementation are allowed to reduce the @len parameter as much as they
> > > want.  TBH I'm mystified why the original btrfs dedupe ioctl wouldn't
> > > allow deduplication of common prefixes, which means that len only gets
> > > shortened to avoid weird problems when dealing with eof being in the
> > > middle of a block.
> > >
> > > (Or not, since there's clearly a bug.)
> > >
> > > > > so I do not see how any of the REMAP_FILE_CAN_SHORTEN cases
> > > > > are valid, regardless of EOF.
> > > >
> > > > Not sure about this, it looks to me that they are actually returning
> > > > the number of bytes deduped, but the value is not used, but maybe I'm
> > > > missing something.
> > > > Anyway I think there are valid cases when REMAP_FILE_CAN_SHORTEN makes sense.
> > > > For example if a source file content is a prefix of a destination file
> > > > content and we want to dedup the whole range of the source file
> > > > without REMAP_FILE_CAN_SHORTEN,
> > > > then the ioctl will only succeed when the end of the source file is at
> > > > the block boundary, otherwise it will just fail. This will render the
> > > > API very inconsistent.
> > >
> > > <nod> I'll try to figure out where the len adjusting went wrong here and
> > > report back.
> >
> > Hey Darrick,
> >
> > Len adjustment happens in generic_remap_checks, specifically here:
> >     if (pos_in + count == size_in &&
> >         (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) {
> >         bcount = ALIGN(size_in, bs) - pos_in;
> >     } else {
> >         if (!IS_ALIGNED(count, bs))
> >             count = ALIGN_DOWN(count, bs);
> >         bcount = count;
> >     }
> > the problem is that in this particular case length is set to zero, and
> > it makes generic_remap_file_range_prep to succeed.
> > I think that 2 things are needed to make this API behave reasonably:
> > * always use the original length for the file content comparison, not
> > the truncated one (there currently is also a short circuit for zero
> > length that skips comparison, this would need to be revisited as well)
> > - otherwise we may report successful deduplication, despite the ranges
> > being clearly different and only the truncated ranges being the same.
> > * report real, possibly truncated deduplication length to the userland
>
> I was about to send in a patch when git blame told me that Linus was the
> last person to change the last line of this chunk:
>
>                 deduped = vfs_dedupe_file_range_one(file, off, dst_file,
>                                                     info->dest_offset, len,
>                                                     REMAP_FILE_CAN_SHORTEN);
>                 if (deduped == -EBADE)
>                         info->status = FILE_DEDUPE_RANGE_DIFFERS;
>                 else if (deduped < 0)
>                         info->status = deduped;
>                 else
>                         info->bytes_deduped = len;
>
> And then I remembered that I had totally repressed my entire memory of
> this unpleasant discussion wherein I got the short end of the stick not
> because I'd designed the weird btrfs ioctl that eventually became
> FIDEDUPERANGE but because I was the last person to make any real
> changes:
> https://lore.kernel.org/all/a7c93559-4ba1-df2f-7a85-55a143696405@tu-darmstadt.de/
>
> Originally, btrfs returned the request length, but also didn't do any
> checking or operation length reduction to prevent people from remapping
> a partial EOF block into the middle of a file:
> https://elixir.bootlin.com/linux/v3.19/source/fs/btrfs/ioctl.c#L3018
>
> Dave and I spent a long time adapting XFS to this interface and hoisted
> the generic(ish) parts of the btrfs code to the VFS.  The VFS support
> code returned the number of bytes remapped, though at that point either
> the original request succeeded or it didn't, even if doing so set up
> data corruption:
> https://elixir.bootlin.com/linux/v4.5/source/fs/read_write.c#L1655
>
> Miklos went and committed this change containing no description of why
> it was made.  This patch (AFAICT) was never sent to fsdevel and does not
> seem to have been reviewed by anybody.  The patch sets up the current
> broken behavior where the kernel can dedupe fewer bytes than was asked
> for, yet return the original request length:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5740c99e9d30b81fcc478797e7215c61e241f44e
>
> Five months later I noticed and fixed the problems with remapping
> partial eof blocks into the middle of files and fixed it, but none of us
> noticed there was a logic bomb lurking in bytes_deduped:
> https://lore.kernel.org/linux-fsdevel/153981625504.5568.2708520119290577378.stgit@magnolia/
>
> A different person wrote a test for sub-block dedupe, but even he failed
> to notice the broken behavior:
> https://lore.kernel.org/fstests/20181105111445.11870-1-fdmanana@kernel.org/
>
> Four years later (last summer), someone produced an attempt to fix this
> particular weird discrepancy:
> https://lore.kernel.org/all/b5805118-7e56-3d43-28e9-9e0198ee43f3@tu-darmstadt.de/
>
> Which then Dave asked for a revert because it broke generic/517:
> https://lore.kernel.org/all/20220714223238.GH3600936@dread.disaster.area/
>
> Since then, AFAICT there's been no followup to "We just need a bit of
> time to look at the various major dedupe apps and check that they still
> do the right thing w.r.t. proposed change."  The manpage for
> FIDEDUPERANGE says that bytes_deduped contains the number of bytes
> actually remapped, but that doesn't matter because Linus only cares
> about past kernel behavior.  That behavior is now a mess, since every
> LTS kernel since 4.19 has been over-reporting the number of bytes
> deduped.
>

Wow! What a story.

> Unfortunately, while *I* think the correct behavior is to report bytes
> remapped even if that quantity becomes zero due to alignment issues, we
> can't report zero here because duperemove will go into an infinite loop
> because the author did the usual "len += bytes_deduped" without checking
> for a zero length.  This causes generic/561 to run forever.  One could
> argue that we could return "extents differ" in that case, but then that
> breaks generic/517 and generic/182 which aren't expecting that behavior
> either.

But the bug report that started this thread is not about reporting actual vs.
requested dedupe size - it is about reporting that data in two ranges is the
same when it really differs - that could lead to very real data corruption.

Can we solve that by passing the requested len (and not the shotend len)
to {vfs,dax}_dedupe_file_range_compare() and then go on actually deduping
only the shortened len? (or return in case of shortened len == 0).

It does not look like this would break neither generic/517 nor generic/182
unless I am missing something?

>
> It's probably just time for us to just burn FIEDEDUPERANGE to the ground
> and for me to convert another of my other hobbies into my profession.
>

Have you considered a git historian career for retirement? ;)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-02-04  7:00 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-20  0:10 FIDEDUPERANGE claims to succeed for non-identical files Zbigniew Halas
2022-12-22  8:25 ` Amir Goldstein
2022-12-22 14:41   ` Zbigniew Halas
2023-01-04 17:25     ` Darrick J. Wong
2023-01-04 19:36       ` Zbigniew Halas
2023-02-04  2:23         ` Darrick J. Wong
2023-02-04  7:00           ` Amir Goldstein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.