Re: Question about clone_range() metadata stability

From: Trond Myklebust <trondmy@hammerspace.com>
To: "darrick.wong@oracle.com" <darrick.wong@oracle.com>
Cc: "david@fromorbit.com" <david@fromorbit.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: Question about clone_range() metadata stability
Date: Tue, 3 Dec 2019 23:00:26 +0000	[thread overview]
Message-ID: <5bef0bc533c3ab768538a58e3ae7fb9ad26a5d83.camel@hammerspace.com> (raw)
In-Reply-To: <20191203163526.GD7323@magnolia>

On Tue, 2019-12-03 at 08:35 -0800, Darrick J. Wong wrote:
> On Tue, Dec 03, 2019 at 07:36:29AM +0000, Trond Myklebust wrote:
> > On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> > > On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > > > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust
> > > > wrote:
> > > > > Hi all
> > > > > 
> > > > > A quick question about clone_range() and guarantees around
> > > > > metadata
> > > > > stability.
> > > > > 
> > > > > Are users required to call fsync/fsync_range() after calling
> > > > > clone_range() in order to guarantee that the cloned range
> > > > > metadata is
> > > > > persisted?
> > > > 
> > > > Yes.
> > > > 
> > > > > I'm assuming that it is required in order to guarantee that
> > > > > data is persisted.
> > > > 
> > > > Data and metadata.  XFS and ocfs2's reflink implementations
> > > > will
> > > > flush
> > > > the page cache before starting the remap, but they both require
> > > > fsync to
> > > > force the log/journal to disk.
> > > 
> > > So we need to call xfs_fs_nfs_commit_metadata() to get that done
> > > post vfs_clone_file_range() completion on the server side, yes?
> > > 
> > 
> > I chose to implement this using a full call to vfs_fsync_range(),
> > since
> > we really do want to ensure data stability as well. Consider, for
> > instance, the case where client A is running an application, and
> > client
> > B runs vfs_clone_file_range() in order to create a point in time
> > snapshot of the file for disaster recovery purposes...
> 
> Seems reasonable, since (alas) we didn't define the ->remap_range api
> to
> guarantee that for you.
> 
> > > > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > > > word for
> > > > it.)
> > > > 
> > > > > I'm asking because knfsd currently just does a call to
> > > > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE
> > > > > operation. It
> > > > > does
> > > > > not call fsync()/fsync_range() on the destination file, and
> > > > > since
> > > > > the
> > > > > NFSv4.2 protocol does not require you to perform any other
> > > > > operation in
> > > > > order to persist data/metadata, I'm worried that we may be
> > > > > corrupting
> > > > > the cloned file if the NFS server crashes at the wrong moment
> > > > > after the
> > > > > client has been told the clone completed.
> > > 
> > > Yup, that's exactly what server side calls to commit_metadata()
> > > are
> > > supposed to address.
> > > 
> > > I suspect to be correct, this might require commit_metadata() to
> > > be
> > > called on both the source and destination inodes, as both of them
> > > may have modified metadata as a result of the clone operation.
> > > For
> > > XFS one of them will be a no-op, but for other filesystems that
> > > don't implement ->commit_metadata, we'll need to call
> > > sync_inode_metadata() on both inodes...
> > > 
> > 
> > That's interesting. I hadn't considered that a clone might cause
> > the
> > source metadata to change as well. What kind of change specifically
> > are
> > we talking about? Is it just delayed block allocation, or is there
> > more?
> 
> In XFS' case, we added a per-inode flag to help us bypass the
> reference
> count lookup during a write if the file has never shared any blocks,
> so
> if you never share anything, you'll never pay any of the runtime
> costs
> of the COW mechanism.
> 
> ocfs2's design has a reference count tree that is shared between
> groups
> of files that have been reflinked from each other.  So if you start
> with
> unshared files A and B and clone A to A1 and A2; and B to B1 and B2,
> then A* will have their own refcount tree and B* will also have their
> own refcount tree.
> 
> Either way, nfs has to assume that changes could have been made to
> the
> source file.

Interesting. Thanks for the explanation! I'll try to send off an
amended patch to Bruce (hopefully before he merges).

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com