From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: "david@fromorbit.com" <david@fromorbit.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: Question about clone_range() metadata stability
Date: Tue, 3 Dec 2019 08:35:26 -0800
Message-ID: <20191203163526.GD7323@magnolia>
In-Reply-To: <52f1afb6e0a2026840da6f4b98a5e01a247447e5.camel@hammerspace.com>

On Tue, Dec 03, 2019 at 07:36:29AM +0000, Trond Myklebust wrote:
> On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> > On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > > Hi all
> > > >
> > > > A quick question about clone_range() and guarantees around
> > > > metadata stability.
> > > >
> > > > Are users required to call fsync/fsync_range() after calling
> > > > clone_range() in order to guarantee that the cloned range
> > > > metadata is persisted?
> > >
> > > Yes.
> > >
> > > > I'm assuming that it is required in order to guarantee that
> > > > data is persisted.
> > >
> > > Data and metadata.  XFS and ocfs2's reflink implementations will
> > > flush the page cache before starting the remap, but they both
> > > require fsync to force the log/journal to disk.
> >
> > So we need to call xfs_fs_nfs_commit_metadata() to get that done
> > post vfs_clone_file_range() completion on the server side, yes?
> >
>
> I chose to implement this using a full call to vfs_fsync_range(), since
> we really do want to ensure data stability as well. Consider, for
> instance, the case where client A is running an application, and client
> B runs vfs_clone_file_range() in order to create a point in time
> snapshot of the file for disaster recovery purposes...

Seems reasonable, since (alas) we didn't define the ->remap_range api to
guarantee that for you.

> > > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > > word for it.)
> > >
> > > > I'm asking because knfsd currently just does a call to
> > > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation.
> > > > It does not call fsync()/fsync_range() on the destination file,
> > > > and since the NFSv4.2 protocol does not require you to perform
> > > > any other operation in order to persist data/metadata, I'm
> > > > worried that we may be corrupting the cloned file if the NFS
> > > > server crashes at the wrong moment after the client has been
> > > > told the clone completed.
> >
> > Yup, that's exactly what server side calls to commit_metadata() are
> > supposed to address.
> >
> > I suspect to be correct, this might require commit_metadata() to be
> > called on both the source and destination inodes, as both of them
> > may have modified metadata as a result of the clone operation. For
> > XFS one of them will be a no-op, but for other filesystems that
> > don't implement ->commit_metadata, we'll need to call
> > sync_inode_metadata() on both inodes...
> >
>
> That's interesting. I hadn't considered that a clone might cause the
> source metadata to change as well. What kind of change specifically are
> we talking about? Is it just delayed block allocation, or is there
> more?

In XFS' case, we added a per-inode flag to help us bypass the reference
count lookup during a write if the file has never shared any blocks, so
if you never share anything, you'll never pay any of the runtime costs
of the COW mechanism.

ocfs2's design has a reference count tree that is shared between groups
of files that have been reflinked from each other.  So if you start with
unshared files A and B and clone A to A1 and A2; and B to B1 and B2,
then A* will have their own refcount tree and B* will also have their
own refcount tree.

Either way, nfs has to assume that changes could have been made to the
source file.

--D

> Thanks
>   Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
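
For the userspace side of the "Yes" above (clone first, then fsync), a
minimal sketch might look like the following.  It assumes Linux with the
FICLONE ioctl from <linux/fs.h>; the helper name clone_file_durably() is
hypothetical and does not come from this thread.  It is not the knfsd
change being discussed, which stays in the kernel and uses
vfs_fsync_range().  Following the reasoning above that the source's
metadata may change as well, the sketch syncs both files.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>	/* FICLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): reflink src_path into
 * dst_path, then fsync both files so the remap survives a crash.
 * Returns 0 on success or a negative errno value.
 */
int clone_file_durably(const char *src_path, const char *dst_path)
{
	int src_fd, dst_fd, ret = 0;

	src_fd = open(src_path, O_RDONLY);
	if (src_fd < 0)
		return -errno;

	dst_fd = open(dst_path, O_WRONLY | O_CREAT, 0644);
	if (dst_fd < 0) {
		ret = -errno;
		goto out_src;
	}

	/* Share all of src's blocks with dst; fails if reflink is unsupported. */
	if (ioctl(dst_fd, FICLONE, src_fd) < 0) {
		ret = -errno;
		goto out_dst;
	}

	/* The clone alone is not durable: push dst data and metadata to disk. */
	if (fsync(dst_fd) < 0) {
		ret = -errno;
		goto out_dst;
	}

	/* The source's metadata may have been modified by the clone too. */
	if (fsync(src_fd) < 0)
		ret = -errno;

out_dst:
	close(dst_fd);
out_src:
	close(src_fd);
	return ret;
}
```

On a filesystem without reflink support the ioctl fails and the helper
returns the negative errno without syncing anything, so a caller can
fall back to an ordinary copy.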