From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>,
	Eric Sandeen <sandeen@redhat.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	linux-cifs@vger.kernel.org,
	overlayfs <linux-unionfs@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>,
	Linux Btrfs <linux-btrfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	ocfs2-devel@oss.oracle.com
Subject: Re: [PATCH v3 00/25] fs: fixes for serious clone/dedupe problems
Date: Thu, 11 Oct 2018 08:55:04 -0700
Message-ID: <20181011155504.GZ28243@magnolia>
In-Reply-To: <CAOQ4uxgOvOOnKL5TsC9jpjBsepAgtQ56Hhjh7WDeXM7m0=dz7g@mail.gmail.com>

On Thu, Oct 11, 2018 at 11:33:57AM +0300, Amir Goldstein wrote:
> On Thu, Oct 11, 2018 at 7:12 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > Hi all,
> >
> > Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
> > reflink implementation, and tracked it down to reflink forgetting to do
> > some of the file-extending activities that must happen for regular
> > writes.
> >
> > We then started auditing the clone, dedupe, and copyfile code and
> > realized that from a file contents perspective, clonerange isn't any
> > different from a regular file write.  Unfortunately, we also noticed
> > that *unlike* a regular write, clonerange skips a ton of overflow
> > checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
> > and RLIMIT_FSIZE.  We also observed that cloning into a file did not
> > strip security privileges (suid, capabilities) like a regular write
> > would.  I also noticed that xfs and ocfs2 need to dump the page cache
> > before remapping blocks, not after.
> >
> > In fixing the range checking problems I also realized that both dedupe
> > and copyfile tell userspace how much of the requested operation was
> > acted upon.  Since the range validation can shorten a clone request (or
> > we can ENOSPC midway through), we might as well plumb the short
> > operation reporting back through the VFS indirection code to userspace.
> >
> > So, here's the whole giant pile of patches[1] that fix all the problems.
> > This branch is against 4.19-rc7 with Dave Chinner's XFS for-next branch.
> > The patch "generic: test reflink side effects" recently sent to fstests
> > exercises the fixes in this series.  Tests are in [2].
> >
> > --D
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel
> 
> I tested your branch with overlayfs over xfs.
> I did not observe any failures with -g clone except for test generic/937
> which also failed on xfs in my test.

Ok, matches what I saw overnight.  Good, that means I (at least
theoretically) know how to test overlayfs now. :)

> I thought that you forgot to mention I needed to grab xfsprogs from djwong-devel
> for commit e84a9e93 ("xfs_io: dedupe command should only complain
> if we don't dedupe anything"), but even with this change the test still fails:
> 
> generic/937     - output mismatch (see
> /old/home/amir/src/fstests/xfstests-dev/results//generic/937.out.bad)
>     --- tests/generic/937.out   2018-10-11 08:23:00.630938364 +0300
>     +++ /old/home/amir/src/fstests/xfstests-dev/results//generic/937.out.bad
>    2018-10-11 10:54:40.448134832 +0300
>     @@ -4,8 +4,7 @@
>      39578c21e2cb9f6049b1cf7fc7be12a6  TEST_DIR/test-937/file2
>      Files 1-2 do not match (intentional)
>      (partial) dedupe the middle blocks together
>     -deduped XXXX/XXXX bytes at offset XXXX
>     -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>     +XFS_IOC_FILE_EXTENT_SAME: Extents did not match.

Ohhh, right, g/937 is the test to see if the dedupe implementation will
return a short bytes_deduped if a single byte at the end of the range
doesn't match.  I'll have to update that because...

I reverted the FIDEDUPERANGE behavior to set ->info[x].bytes_deduped =
->src_length even if we rounded the length down to the nearest block
boundary to avoid incorrect sharing of blocks on files with
non-block-aligned EOF.  It turned out that the existing FIDEDUPERANGE
users will hang in infinite loops if the kernel returns ->info[x].status
== FILE_DEDUPE_RANGE_SAME but ->info[x].bytes_deduped < ->src_length.

It seems really stupid to me that the kernel now lies to userspace to
avoid breaking it, but that's what btrfs does so we're stuck with that.
For now.
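
To illustrate the hang described above, here is a minimal sketch of how a userspace caller might tolerate a short bytes_deduped instead of retrying forever. The helper names dedupe_advance() and dedupe_range() are hypothetical and are not taken from any existing tool; only FIDEDUPERANGE, struct file_dedupe_range, and the FILE_DEDUPE_RANGE_* status values come from <linux/fs.h>. The sketch assumes one destination per call and the same offset in both files.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIDEDUPERANGE, struct file_dedupe_range */

/*
 * Hypothetical helper: decide how far to advance after one FIDEDUPERANGE
 * call.  Returns the number of bytes to skip, or 0 if the caller should stop.
 */
uint64_t dedupe_advance(int32_t status, uint64_t bytes_deduped,
                        uint64_t requested)
{
    assert(requested > 0);
    if (status != FILE_DEDUPE_RANGE_SAME)
        return 0;   /* contents differ or per-file error: stop */
    if (bytes_deduped == 0)
        return 0;   /* no forward progress: stop rather than spin */
    return bytes_deduped < requested ? bytes_deduped : requested;
}

/*
 * Hypothetical wrapper: dedupe [off, off + len) of src_fd against the same
 * range of dst_fd.  Returns the bytes reported as deduped, or -1 with errno
 * set if the ioctl itself fails.
 */
int64_t dedupe_range(int src_fd, int dst_fd, uint64_t off, uint64_t len)
{
    struct file_dedupe_range *arg;
    uint64_t done = 0;

    /* One header plus room for a single destination record. */
    arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
    if (!arg)
        return -1;

    while (done < len) {
        uint64_t step;

        arg->src_offset = off + done;
        arg->src_length = len - done;
        arg->dest_count = 1;
        arg->info[0].dest_fd = dst_fd;
        arg->info[0].dest_offset = off + done;
        arg->info[0].bytes_deduped = 0;
        arg->info[0].status = 0;

        if (ioctl(src_fd, FIDEDUPERANGE, arg) < 0) {
            int saved = errno;

            free(arg);
            errno = saved;
            return -1;
        }

        step = dedupe_advance(arg->info[0].status,
                              arg->info[0].bytes_deduped, len - done);
        if (step == 0)
            break;
        done += step;
    }

    free(arg);
    return (int64_t)done;
}
```

The point of splitting out dedupe_advance() is that the loop always either moves forward or exits, so a short result with status == FILE_DEDUPE_RANGE_SAME cannot cause an endless retry of the same range.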

>      Compare sections
> 
> One thing that *is* different with overlayfs test is that filefrag crashes
> on this same test:
> 
>     QA output created by 937
>     Create the original files
>     35ac8d7917305c385c30f3d82c30a8f6  TEST_DIR/test-937/file1
>     39578c21e2cb9f6049b1cf7fc7be12a6  TEST_DIR/test-937/file2
>     Files 1-2 do not match (intentional)
>     (partial) dedupe the middle blocks together
>     XFS_IOC_FILE_EXTENT_SAME: Extents did not match.
>     ./tests/generic/937: line 59: 19242 Floating point exception(core
> dumped) ${FILEFRAG_PROG} -v $testdir/file1 >> $seqres.full
>     ./tests/generic/937: line 60: 19244 Floating point exception(core
> dumped) ${FILEFRAG_PROG} -v $testdir/file2 >> $seqres.full
> 
> It looks like an overlayfs v4.19-rc1 regression - FIGETBSZ returns zero.
> I never noticed this regression before, because none of the generic tests
> are using filefrag.

Funny, I was wondering just the other day if there were any filesystems
that set s_blocksize == 0... :)
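
For reference, a "Floating point exception" on Linux is SIGFPE, which integer division by zero also raises.  The sketch below assumes (without having checked the filefrag source) that the crash comes from dividing by the block size that FIGETBSZ returned.  The helper names bytes_to_blocks() and get_block_size() are hypothetical; only FIGETBSZ itself comes from <linux/fs.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIGETBSZ */

/*
 * Hypothetical helper: round a byte count up to whole blocks.
 * Returns 0 on success, or -1 if blksize is not a usable divisor.
 */
int bytes_to_blocks(uint64_t bytes, int blksize, uint64_t *blocks)
{
    uint64_t bs;

    if (blksize <= 0)
        return -1;  /* a zero divisor here would raise SIGFPE */
    bs = (uint64_t)blksize;
    *blocks = bytes / bs + (bytes % bs != 0);
    return 0;
}

/*
 * Hypothetical wrapper: ask the filesystem for its block size and reject
 * a non-positive answer instead of passing it on to later arithmetic.
 * Returns 0 on success, -1 on ioctl failure or an unusable block size.
 */
int get_block_size(int fd, int *blksize)
{
    int bsz = 0;

    if (ioctl(fd, FIGETBSZ, &bsz) < 0)
        return -1;
    if (bsz <= 0)
        return -1;
    *blksize = bsz;
    return 0;
}
```

A tool that validated the block size this way would report an error on a filesystem that returns zero rather than dumping core.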

--D

> Thanks,
> Amir.
