Re: [PATCH] nfs: track writeback errors with errseq_t

From: Jeff Layton <jlayton@redhat.com>
To: NeilBrown <neilb@suse.com>, Jeff Layton <jlayton@kernel.org>,
	trond.myklebust@primarydata.com, anna.schumaker@netapp.com
Cc: linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] nfs: track writeback errors with errseq_t
Date: Tue, 29 Aug 2017 06:54:18 -0400	[thread overview]
Message-ID: <1504004058.4679.7.camel@redhat.com> (raw)
In-Reply-To: <87pobfgo9q.fsf@notabene.neil.brown.name>

On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
> On Mon, Aug 28 2017, Jeff Layton wrote:
> 
> > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
> > > On Fri, Aug 25 2017, Jeff Layton wrote:
> > > 
> > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
> > > > > From: Jeff Layton <jlayton@redhat.com>
> > > > > 
> > > > > There is some ambiguity in nfs about how writeback errors are
> > > > > tracked.
> > > > > 
> > > > > For instance, nfs_pageio_add_request calls mapping_set_error when
> > > > > the
> > > > > add fails, but we track errors that occur after adding the
> > > > > request
> > > > > with a dedicated int error in the open context.
> > > > > 
> > > > > Now that we have better infrastructure for the vfs layer, this
> > > > > latter int is now unnecessary. Just have
> > > > > nfs_context_set_write_error set
> > > > > the error in the mapping when one occurs.
> > > > > 
> > > > > Have NFS use file_write_and_wait_range to initiate and wait on
> > > > > writeback
> > > > > of the data, and then check again after issuing the commit(s).
> > > > > 
> > > > > With this, we also don't need to pay attention to the ERROR_WRITE
> > > > > flag for reporting, and just clear it to indicate to subsequent
> > > > > writers that they should try to go asynchronous again.
> > > > > 
> > > > > In nfs_page_async_flush, sample the error before locking and
> > > > > joining
> > > > > the requests, and check for errors since that point.
> > > > > 
> > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > ---
> > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
> > > > >  fs/nfs/inode.c         |  3 +--
> > > > >  fs/nfs/write.c         |  8 ++++++--
> > > > >  include/linux/nfs_fs.h |  1 -
> > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
> > > > > 
> > > > > I have a baling wire and duct tape solution for testing this with
> > > > > xfstests (using iptables REJECT targets and soft mounts). This
> > > > > seems to
> > > > > make nfs do the right thing.
> > > > > 
> > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > > > index 5713eb32a45e..15d3c6faafd3 100644
> > > > > --- a/fs/nfs/file.c
> > > > > +++ b/fs/nfs/file.c
> > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file *file,
> > > > > loff_t start, loff_t end, int datasync)
> > > > >  {
> > > > >  	struct nfs_open_context *ctx =
> > > > > nfs_file_open_context(file);
> > > > >  	struct inode *inode = file_inode(file);
> > > > > -	int have_error, do_resend, status;
> > > > > -	int ret = 0;
> > > > > +	int do_resend, status;
> > > > > +	int ret;
> > > > >  
> > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file,
> > > > > datasync);
> > > > >  
> > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
> > > > >  	do_resend =
> > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
> > > > > -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
> > > > > &ctx->flags);
> > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > flags);
> > > > > 
> > > > > -	if (have_error) {
> > > > > -		ret = xchg(&ctx->error, 0);
> > > > > -		if (ret)
> > > > > -			goto out;
> > > > > -	}
> > > > > -	if (status < 0) {
> > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > +
> > > > > +	/* Recheck and advance after the commit */
> > > > > +	status = file_check_and_advance_wb_err(file);
> > > 
> > > This change makes the code inconsistent with the comment above the
> > > function, which still references ctx->error.  The intent of the
> > > comment
> > > is still correct, but the details have changed.
> > > 
> > 
> > Good catch. I'll fix that up in a respin.
> > 
> > > Also, there is a call to mapping_set_error() in
> > > nfs_pageio_add_request().
> > > I wonder if that should be changed to
> > >   nfs_context_set_write_error(req->wb_context, desc->pg_error)
> > > ??
> > > 
> > 
> > Trickier question...
> > 
> > I'm not quite sure what semantics we're looking for with
> > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
> > synchronous, but I'm not quite sure why it gets cleared the way it
> > does. It's set on any error but cleared before issuing a commit.
> > 
> > I added a similar flag to Ceph inodes recently, but only clear it when
> > a write succeeds. Wouldn't that make more sense here as well?
> 
> It is a bit hard to wrap one's mind around.
> 
> In the original code (commit 7b159fc18d417980) it looks like:
>  - test-and-clear bit
>  - write and sync
>  - test-bit
> 
> This does, I think, seem safer than "clear on successful write" as the
> writes could complete out-of-order and I wouldn't be surprised if the
> unsuccessful ones completed with an error before the successful one -
> particularly with an error like EDQUOT.
> 
> However the current code does the writes before the test-and-clear, and
> only does the commit afterwards.  That makes it less clear why the
> current sequence is a good idea.
> 
> However ... nfs_file_fsync_commit() is only called if
> filemap_write_and_wait_range() returned with success, so we only clear
> the flag after successful writes(?).
> 
> Oh....
> This patch from me:
> 
> Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error handling.")
> 
> seems to have been reverted by
> 
> Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if page writeback failed")
> 
> which probably isn't good.  It appears that this code is very fragile
> and easily broken.
> Maybe we need to work out exactly what is required, and document it - so
> we can stop breaking it.
> Or maybe we need some unit tests.....
> 

Yes, laying out what's necessary for this would be very helpful. We
clearly want to set the flag when an error occurs. Under what
circumstances should we be clearing it?

I'm not sure we can really do much better than clearing it on a
successful write. With Ceph, was that this is just a hint to the write
submission mechanism and we generally aren't too concerned if a few slip
past in either direction.
-- 
Jeff Layton <jlayton@redhat.com>