linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Trond Myklebust <trondmy@hammerspace.com>
To: "darrick.wong@oracle.com" <darrick.wong@oracle.com>
Cc: "david@fromorbit.com" <david@fromorbit.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: Question about clone_range() metadata stability
Date: Tue, 3 Dec 2019 23:00:26 +0000	[thread overview]
Message-ID: <5bef0bc533c3ab768538a58e3ae7fb9ad26a5d83.camel@hammerspace.com> (raw)
In-Reply-To: <20191203163526.GD7323@magnolia>

On Tue, 2019-12-03 at 08:35 -0800, Darrick J. Wong wrote:
> On Tue, Dec 03, 2019 at 07:36:29AM +0000, Trond Myklebust wrote:
> > On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> > > On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > > > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust
> > > > wrote:
> > > > > Hi all
> > > > > 
> > > > > A quick question about clone_range() and guarantees around
> > > > > metadata
> > > > > stability.
> > > > > 
> > > > > Are users required to call fsync/fsync_range() after calling
> > > > > clone_range() in order to guarantee that the cloned range
> > > > > metadata is
> > > > > persisted?
> > > > 
> > > > Yes.
> > > > 
> > > > > I'm assuming that it is required in order to guarantee that
> > > > > data is persisted.
> > > > 
> > > > Data and metadata.  XFS and ocfs2's reflink implementations
> > > > will
> > > > flush
> > > > the page cache before starting the remap, but they both require
> > > > fsync to
> > > > force the log/journal to disk.
> > > 
> > > So we need to call xfs_fs_nfs_commit_metadata() to get that done
> > > post vfs_clone_file_range() completion on the server side, yes?
> > > 
> > 
> > I chose to implement this using a full call to vfs_fsync_range(),
> > since
> > we really do want to ensure data stability as well. Consider, for
> > instance, the case where client A is running an application, and
> > client
> > B runs vfs_clone_file_range() in order to create a point in time
> > snapshot of the file for disaster recovery purposes...
> 
> Seems reasonable, since (alas) we didn't define the ->remap_range api
> to
> guarantee that for you.
> 
> > > > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > > > word for
> > > > it.)
> > > > 
> > > > > I'm asking because knfsd currently just does a call to
> > > > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE
> > > > > operation. It
> > > > > does
> > > > > not call fsync()/fsync_range() on the destination file, and
> > > > > since
> > > > > the
> > > > > NFSv4.2 protocol does not require you to perform any other
> > > > > operation in
> > > > > order to persist data/metadata, I'm worried that we may be
> > > > > corrupting
> > > > > the cloned file if the NFS server crashes at the wrong moment
> > > > > after the
> > > > > client has been told the clone completed.
> > > 
> > > Yup, that's exactly what server side calls to commit_metadata()
> > > are
> > > supposed to address.
> > > 
> > > I suspect to be correct, this might require commit_metadata() to
> > > be
> > > called on both the source and destination inodes, as both of them
> > > may have modified metadata as a result of the clone operation.
> > > For
> > > XFS one of them will be a no-op, but for other filesystems that
> > > don't implement ->commit_metadata, we'll need to call
> > > sync_inode_metadata() on both inodes...
> > > 
> > 
> > That's interesting. I hadn't considered that a clone might cause
> > the
> > source metadata to change as well. What kind of change specifically
> > are
> > we talking about? Is it just delayed block allocation, or is there
> > more?
> 
> In XFS' case, we added a per-inode flag to help us bypass the
> reference
> count lookup during a write if the file has never shared any blocks,
> so
> if you never share anything, you'll never pay any of the runtime
> costs
> of the COW mechanism.
> 
> ocfs2's design has a reference count tree that is shared between
> groups
> of files that have been reflinked from each other.  So if you start
> with
> unshared files A and B and clone A to A1 and A2; and B to B1 and B2,
> then A* will have their own refcount tree and B* will also have their
> own refcount tree.
> 
> Either way, nfs has to assume that changes could have been made to
> the
> source file.

Interesting. Thanks for the explanation! I'll try to send off an
amended patch to Bruce (hopefully before he merges).

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



  reply	other threads:[~2019-12-03 23:00 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-27 18:38 Question about clone_range() metadata stability Trond Myklebust
2019-11-27 20:21 ` Darrick J. Wong
2019-11-29 12:43   ` Filipe Manana
2019-12-01 21:05   ` Dave Chinner
2019-12-02 17:09     ` Darrick J. Wong
2019-12-03  7:36     ` Trond Myklebust
2019-12-03 16:35       ` Darrick J. Wong
2019-12-03 23:00         ` Trond Myklebust [this message]
2019-12-06  1:31       ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5bef0bc533c3ab768538a58e3ae7fb9ad26a5d83.camel@hammerspace.com \
    --to=trondmy@hammerspace.com \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).