linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: fstests <fstests@vger.kernel.org>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	Filipe Manana <fdmanana@suse.com>,
	linux-xfs@vger.kernel.org,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
Date: Tue, 21 Aug 2018 16:55:46 +0100	[thread overview]
Message-ID: <CAL3q7H5Oicg_Ekesg5cVo8Y2BZTcYUbB92SYcTmpBDW2iao4Vg@mail.gmail.com> (raw)
In-Reply-To: <20180820010932.GV2234@dastard>

On Mon, Aug 20, 2018 at 2:09 AM, Dave Chinner <david@fromorbit.com> wrote:
> [cc linux-fsdevel now, too]
>
> On Mon, Aug 20, 2018 at 09:11:26AM +1000, Dave Chinner wrote:
>> [cc linux-xfs@vger.kernel.org]
>>
>> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdmanana@kernel.org wrote:
>> > From: Filipe Manana <fdmanana@suse.com>
>> >
>> > Test that deduplication of an entire file that has a size that is not
>> > aligned to the filesystem's block size into a different file does not
>> > corrupt the destination's file data.
>
> Ok, I've looked at this now. My first question is where did all the
> magic offsets in this test come from? i.e. how was this bug
> found and who is it affecting?

I found it myself. I'm not aware of any users or applications affected by it.

>
>> > This test is motivated by a bug found in Btrfs which is fixed by the
>> > following patch for the linux kernel:
>> >
>> >   "Btrfs: fix data corruption when deduplicating between different files"
>> >
>> > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
>> > with the same corruption as in Btrfs - some bytes of a block get replaced
>> > with zeroes after the deduplication.
>>
>> Filipe, in future can please report XFS bugs you find to the XFS
>> list the moment you find them. We shouldn't ever find out about a
>> data corruption bug we need to fix via a "oh, by the way" comment in
>> a commit message for a regression test....
>
> This becomes much more relevant because of what I've just found....
>
> .....
>
>> > +# The first byte with a value of 0xae starts at an offset (2518890) which is not
>> > +# a multiple of the block size.
>> > +$XFS_IO_PROG -f \
>> > +   -c "pwrite -S 0x6b 0 2518890" \
>> > +   -c "pwrite -S 0xae 2518890 102398" \
>> > +   $SCRATCH_MNT/foo | _filter_xfs_io
>> > +
>> > +# Create a second file with a length not aligned to the block size, whose bytes
>> > +# all have the value 0x6b, so that its extent(s) can be deduplicated with the
>> > +# first file.
>> > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | _filter_xfs_io
>> > +
>> > +# The file is filled with bytes having the value 0x6b from offset 0 to offset
>> > +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
>> > +echo "File content before deduplication:"
>> > +od -t x1 $SCRATCH_MNT/foo
>
> Please use "od -Ad -t x1 <file>" so the file offsets reported by od
> match the offsets used in the test (i.e. in decimal, not octal).

Will do, in the next test version after agreement on the fix.

>
>> > +
>> > +# Now deduplicate the entire second file into a range of the first file that
>> > +# also has all bytes with the value 0x6b. The destination range's end offset
>> > +# must not be aligned to the block size and must be less then the offset of
>> > +# the first byte with the value 0xae (byte at offset 2518890).
>> > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \
>> > +   | _filter_xfs_io
>
> Ok, now it gets fun. dedupe to non-block aligned rtanges is supposed
> to be rejected by the kernel in vfs_clone_file_prep_inodes(). i.e
> this check:
>
>         /* Only reflink if we're aligned to block boundaries */
>         if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
>             !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
>                 return -EINVAL;
>
> And it's pretty clear that a length of 557771 is not block aligned
> (being an odd number).
>
> So why was this dedupe request even accepted by the kernel? Well,
> I think there's a bug in the check just above this:
>
>         /* If we're linking to EOF, continue to the block boundary. */
>         if (pos_in + *len == isize)
>                 blen = ALIGN(isize, bs) - pos_in;
>         else
>                 blen = *len;

Yes, btrfs, for the dedupe call, also has its own place where it does
the same thing,
at fs/btrfs/ioctl.c:extent_same_check_offsets().
And that's precisely what made me suspicious about it, together with
what you note below about the call to btrfs_cmp_data() using the
original, unaligned, length.

However, I just ran the same test using reflink and not dedupe and the
same problem happens. In earlier versions of the test/debugging I
either did not notice
or made some mistake because I hadn't seen the same problem for the
reflink/clone operation, and since we only do that extra rounding in
the btrfs dedupe code path,
and not on the clone one, I took the conclusion that only dedupe was
affected, but that is also done in the VFS as you just pointed.
I only noticied XFS also failed at the last moment, when I had the
test case complete.

>
> basically, when the "in" file dedupe/reflink range is to EOF, it
> expands the range to include /all/ of the block that contains the
> EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
> EOF/. But when we go to compare the data in these ranges, it
> truncates the comparison to the length that the user asked for
> (i.e. *len) and not the extended block length.
>
> IOWs, it doesn't compare the bytes beyond EOF in the source block to
> the data in the destination block it would replace, and so doesn't
> fail the compare like it should.
>
> And, well, btrfs has the same bug. extent_same_check_offsets()
> extends the range for alignment so it passes alignment checks, but
> then /explicitly/ uses the original length for the data compare
> and dedupe. i.e:
>
>        /* pass original length for comparison so we stay within i_size */
>         ret = btrfs_cmp_data(olen, cmp);
>         if (ret == 0)
>                 ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1);
>
> This is what we should see if someone tried to dedupe the EOF block
> of a file:
>
> generic/505     - output mismatch (see /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad)
>     --- tests/generic/505.out   2018-08-20 09:36:58.449015709 +1000
>     +++ /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad    2018-08-20 10:45:47.245482160 +1000
>     @@ -13,8 +13,7 @@
>      *
>      2621280 ae ae ae ae ae ae ae ae
>      2621288
>     -deduped 557771/557771 bytes at offset 1957888
>     -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>     +XFS_IOC_FILE_EXTENT_SAME: Invalid argument
>      File content after deduplication and before unmounting:
>     ...
>     (Run 'diff -u tests/generic/505.out /home/dave/src/xfstests-dev/results//xfs/generic/505.out.bad'  to see the entire diff)
>
> i.e. the dedupe request should fail as we cannot dedupe the EOF
> block.
>
> The patch below does this for the VFS dedupe code. it's not a final
> patch, it's just a demonstration of how this should never have got
> past the range checks. Open questions:
>
>         - is documenting rejection on request alignment grounds
>           (i.e. EINVAL) in the man page sufficient for app
>           developers to understand what is going on here?

Maybe. No idea if that would "break" existing applications and use cases.

>         - should we just round down the EOF dedupe request to the
>           block before EOF so dedupe still succeeds?

That was my idea because dedupe is allowed to deduplicate less then
what is requested.

>         - should file cloning (i.e. reflink) be allow allowing the
>           EOF block to be shared somewhere inside EOF in the
>           destination file? That's stale data exposure, yes?

No, but for cloning I'm not sure which approach is better, to round
down or reject with -EINVAL.

>         - should clone only allow sharing of the EOF block if it's a
>           either a full file clone or a matching range clone and
>           isize is the same for both src and dest files?

I think it should.

cheers

>
> Discuss.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
>
> [RFC] vfs: fix data corruption w/ unaligned dedupe ranges
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Exposed by fstests generic/505 on XFS, caused by extending the
> bblock match range to include the partial EOF block, but then
> allowing unknown data beyond EOF to be considered a "match" to data
> in the destination file because the comparison is only made to the
> end of the source file. This corrupts the destination file when the
> source extent is shared with it.
>
> Open questions:
>
>         - is documenting rejection on request alignment grounds
>           (i.e. EINVAL) in the man page sufficient for app
>           developers to understand what is going on here?
>         - should we just silently round down the EOF dedupe request
>           to the block before EOF so dedupe still succeeds?
>         - should file cloning (i.e. reflink) be allow allowing the
>           EOF block to be shared somewhere inside EOF in the
>           destination file? That's stale data exposure, yes?
>         - should clone only allow sharing of the EOF block if it's a
>           either a full file clone or a matching range clone and
>           isize is the same for both src and dest files?
>
> Btrfs also has the same bug in extent_same_check_offsets() and
> btrfs_extent_same_range() that is not addressed by this patch.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 153f8f690490..3844359a4597 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1768,8 +1768,17 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>                         return -EINVAL;
>         }
>
> -       /* If we're linking to EOF, continue to the block boundary. */
> -       if (pos_in + *len == isize)
> +       /* Reflink allows linking to EOF, so round the length up to allow us to
> +        * shared the last block in the file. We don't care what is beyond the
> +        * EOF point in the block that gets shared, as we can't access it and
> +        * attempts to extent the file will break the sharing.
> +        *
> +        * For dedupe, sharing the EOF block is illegal because the bytes beyond
> +        * EOF are undefined (i.e. not readable) and so can't be deduped. Hence
> +        * we do not extend/align the length and instead let the alignment
> +        * checks below reject it.
> +        */
> +       if (!is_dedupe && pos_in + *len == isize)
>                 blen = ALIGN(isize, bs) - pos_in;
>         else
>                 blen = *len;

      parent reply	other threads:[~2018-08-21 19:16 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20180817083924.16916-1-fdmanana@kernel.org>
     [not found] ` <20180819231126.GU2234@dastard>
2018-08-20  1:09   ` [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files) Dave Chinner
2018-08-20 15:33     ` Darrick J. Wong
2018-08-21  0:49       ` Dave Chinner
2018-08-21  1:17         ` Eric Sandeen
2018-08-21  4:49           ` Dave Chinner
2018-08-23 12:58       ` Zygo Blaxell
2018-08-24  2:19         ` Zygo Blaxell
2018-08-30  6:27         ` Dave Chinner
2018-08-31  5:10           ` Zygo Blaxell
2018-09-06  8:38             ` Dave Chinner
2018-09-07  3:53               ` Zygo Blaxell
2018-09-10  9:06                 ` Dave Chinner
2018-09-19  4:12                   ` Zygo Blaxell
2018-09-21  2:59                     ` Dave Chinner
2018-09-21  4:40                       ` Zygo Blaxell
2018-10-01 20:34                         ` Andreas Dilger
2018-08-21 15:55     ` Filipe Manana [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAL3q7H5Oicg_Ekesg5cVo8Y2BZTcYUbB92SYcTmpBDW2iao4Vg@mail.gmail.com \
    --to=fdmanana@kernel.org \
    --cc=david@fromorbit.com \
    --cc=fdmanana@suse.com \
    --cc=fstests@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).