From: NeilBrown <neilb@suse.com>
To: Trond Myklebust <trondmy@primarydata.com>,
"anna.schumaker\@netapp.com" <anna.schumaker@netapp.com>,
"jlayton\@redhat.com" <jlayton@redhat.com>,
"jlayton\@kernel.org" <jlayton@kernel.org>
Cc: "linux-nfs\@vger.kernel.org" <linux-nfs@vger.kernel.org>,
"linux-fsdevel\@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH] nfs: track writeback errors with errseq_t
Date: Mon, 11 Sep 2017 13:24:52 +1000 [thread overview]
Message-ID: <874ls9diij.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <1504796087.3561.7.camel@primarydata.com>
[-- Attachment #1: Type: text/plain, Size: 16524 bytes --]
On Thu, Sep 07 2017, Trond Myklebust wrote:
> On Thu, 2017-09-07 at 07:35 -0400, Jeff Layton wrote:
>> On Thu, 2017-09-07 at 13:37 +1000, NeilBrown wrote:
>> > On Tue, Aug 29 2017, Jeff Layton wrote:
>> >
>> > > On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
>> > > > On Mon, Aug 28 2017, Jeff Layton wrote:
>> > > >
>> > > > > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
>> > > > > > On Fri, Aug 25 2017, Jeff Layton wrote:
>> > > > > >
>> > > > > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
>> > > > > > > > From: Jeff Layton <jlayton@redhat.com>
>> > > > > > > >
>> > > > > > > > There is some ambiguity in nfs about how writeback
>> > > > > > > > errors are
>> > > > > > > > tracked.
>> > > > > > > >
>> > > > > > > > For instance, nfs_pageio_add_request calls
>> > > > > > > > mapping_set_error when
>> > > > > > > > the
>> > > > > > > > add fails, but we track errors that occur after adding
>> > > > > > > > the
>> > > > > > > > request
>> > > > > > > > with a dedicated int error in the open context.
>> > > > > > > >
>> > > > > > > > Now that we have better infrastructure for the vfs
>> > > > > > > > layer, this
>> > > > > > > > latter int is now unnecessary. Just have
>> > > > > > > > nfs_context_set_write_error set
>> > > > > > > > the error in the mapping when one occurs.
>> > > > > > > >
>> > > > > > > > Have NFS use file_write_and_wait_range to initiate and
>> > > > > > > > wait on
>> > > > > > > > writeback
>> > > > > > > > of the data, and then check again after issuing the
>> > > > > > > > commit(s).
>> > > > > > > >
>> > > > > > > > With this, we also don't need to pay attention to the
>> > > > > > > > ERROR_WRITE
>> > > > > > > > flag for reporting, and just clear it to indicate to
>> > > > > > > > subsequent
>> > > > > > > > writers that they should try to go asynchronous again.
>> > > > > > > >
>> > > > > > > > In nfs_page_async_flush, sample the error before
>> > > > > > > > locking and
>> > > > > > > > joining
>> > > > > > > > the requests, and check for errors since that point.
>> > > > > > > >
>> > > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > > > > > > ---
>> > > > > > > > fs/nfs/file.c | 24 +++++++++++-------------
>> > > > > > > > fs/nfs/inode.c | 3 +--
>> > > > > > > > fs/nfs/write.c | 8 ++++++--
>> > > > > > > > include/linux/nfs_fs.h | 1 -
>> > > > > > > > 4 files changed, 18 insertions(+), 18 deletions(-)
>> > > > > > > >
>> > > > > > > > I have a baling wire and duct tape solution for testing
>> > > > > > > > this with
>> > > > > > > > xfstests (using iptables REJECT targets and soft
>> > > > > > > > mounts). This
>> > > > > > > > seems to
>> > > > > > > > make nfs do the right thing.
>> > > > > > > >
>> > > > > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> > > > > > > > index 5713eb32a45e..15d3c6faafd3 100644
>> > > > > > > > --- a/fs/nfs/file.c
>> > > > > > > > +++ b/fs/nfs/file.c
>> > > > > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file
>> > > > > > > > *file,
>> > > > > > > > loff_t start, loff_t end, int datasync)
>> > > > > > > > {
>> > > > > > > > struct nfs_open_context *ctx =
>> > > > > > > > nfs_file_open_context(file);
>> > > > > > > > struct inode *inode = file_inode(file);
>> > > > > > > > - int have_error, do_resend, status;
>> > > > > > > > - int ret = 0;
>> > > > > > > > + int do_resend, status;
>> > > > > > > > + int ret;
>> > > > > > > >
>> > > > > > > > dprintk("NFS: fsync file(%pD2) datasync %d\n",
>> > > > > > > > file,
>> > > > > > > > datasync);
>> > > > > > > >
>> > > > > > > > nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>> > > > > > > > do_resend =
>> > > > > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx-
>> > > > > > > > >flags);
>> > > > > > > > - have_error =
>> > > > > > > > test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
>> > > > > > > > &ctx->flags);
>> > > > > > > > - status = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > > > > > > - have_error |=
>> > > > > > > > test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
>> > > > > > > > > flags);
>> > > > > > > >
>> > > > > > > > - if (have_error) {
>> > > > > > > > - ret = xchg(&ctx->error, 0);
>> > > > > > > > - if (ret)
>> > > > > > > > - goto out;
>> > > > > > > > - }
>> > > > > > > > - if (status < 0) {
>> > > > > > > > + clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
>> > > > > > > > >flags);
>> > > > > > > > + ret = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > > > > > > +
>> > > > > > > > + /* Recheck and advance after the commit */
>> > > > > > > > + status = file_check_and_advance_wb_err(file);
>> > > > > >
>> > > > > > This change makes the code inconsistent with the comment
>> > > > > > above the
>> > > > > > function, which still references ctx->error. The intent of
>> > > > > > the
>> > > > > > comment
>> > > > > > is still correct, but the details have changed.
>> > > > > >
>> > > > >
>> > > > > Good catch. I'll fix that up in a respin.
>> > > > >
>> > > > > > Also, there is a call to mapping_set_error() in
>> > > > > > nfs_pageio_add_request().
>> > > > > > I wonder if that should be changed to
>> > > > > > nfs_context_set_write_error(req->wb_context, desc-
>> > > > > > >pg_error)
>> > > > > > ??
>> > > > > >
>> > > > >
>> > > > > Trickier question...
>> > > > >
>> > > > > I'm not quite sure what semantics we're looking for with
>> > > > > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
>> > > > > synchronous, but I'm not quite sure why it gets cleared the
>> > > > > way it
>> > > > > does. It's set on any error but cleared before issuing a
>> > > > > commit.
>> > > > >
>> > > > > I added a similar flag to Ceph inodes recently, but only
>> > > > > clear it when
>> > > > > a write succeeds. Wouldn't that make more sense here as well?
>> > > >
>> > > > It is a bit hard to wrap one's mind around.
>> > > >
>> > > > In the original code (commit 7b159fc18d417980) it looks like:
>> > > > - test-and-clear bit
>> > > > - write and sync
>> > > > - test-bit
>> > > >
>> > > > This does, I think, seem safer than "clear on successful write"
>> > > > as the
>> > > > writes could complete out-of-order and I wouldn't be surprised
>> > > > if the
>> > > > unsuccessful ones completed with an error before the successful
>> > > > one -
>> > > > particularly with an error like EDQUOT.
>> > > >
>> > > > However the current code does the writes before the test-and-
>> > > > clear, and
>> > > > only does the commit afterwards. That makes it less clear why
>> > > > the
>> > > > current sequence is a good idea.
>> > > >
>> > > > However ... nfs_file_fsync_commit() is only called if
>> > > > filemap_write_and_wait_range() returned with success, so we
>> > > > only clear
>> > > > the flag after successful writes(?).
>> > > >
>> > > > Oh....
>> > > > This patch from me:
>> > > >
>> > > > Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error
>> > > > handling.")
>> > > >
>> > > > seems to have been reverted by
>> > > >
>> > > > Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if
>> > > > page writeback failed")
>> > > >
>> > > > which probably isn't good. It appears that this code is very
>> > > > fragile
>> > > > and easily broken.
>> >
>> > On further investigation, I think the problem that I fixed and then
>> > we
>> > reintroduced will be fixed again - more permanently - by your
>> > patch.
>> > The root problem is that nfs keeps error codes in a different way
>> > to the
>> > MM core. By unifying those, the problem goes.
>> > (The specific problem is that writes which hit EDQUOT on the server
>> > can
>> > report EIO on the client).
>> >
>> >
>> > > > Maybe we need to work out exactly what is required, and
>> > > > document it - so
>> > > > we can stop breaking it.
>> > > > Or maybe we need some unit tests.....
>> > > >
>> > >
>> > > Yes, laying out what's necessary for this would be very helpful.
>> > > We
>> > > clearly want to set the flag when an error occurs. Under what
>> > > circumstances should we be clearing it?
>> >
>> > Well.... looking back at 7b159fc18d417980f57ae which introduced
>> > the
>> > flag, prior to that write errors (ctx->error) were only reported by
>> > nfs_file_flush and nfs_fsync, so only one close() and fsync().
>> >
>> > After that commit, setting the flag would mean that errors could be
>> > returned by 'write'. So clearing as part of returning the error
>> > makes
>> > perfect sense.
>> >
>> > As long as the error gets recorded, and gets returned when it is
>> > recorded, it doesn't much matter when the flag is cleared. With
>> > your
>> > patches we don't need to flag any more to get errors reliably
>> > reported.
>> >
>> > Leaving the flag set means that writes go more slowly - we don't
>> > get
>> > large queue of background rights building up but destined for
>> > failure.
>> > This is the main point made in the comment message when the flag
>> > was
>> > introduced.
>> > Of course, by the time we first get an error there could already
>> > by a large queue, so we probably want that to drain completely
>> > before
>> > allowing async writes again.
>
> We already have this functionality implemented in the existing code.
>
>> >
>> > It might make sense to have 2 flags. One which says "writes should
>> > be
>> > synchronous", another that says "There was an error recently".
>> > We clear the error flag before calling nfs_fsync, and if it is
>> > still
>> > clear afterwards, we clear the sync-writes flag. Maybe that is
>> > more
>> > complex than needed though.
>> >
>
> We also need to preserve the NFS_CONTEXT_RESEND_WRITES flag. I don't
> see any global mechanism that will replace that.
>
>> > I'm leaning towards your suggestion that it doesn't matter very
>> > much
>> > when it gets cleared, and clearing it on any successful write is
>> > simplest.
>> >
>> > So I'm still in favor of using nfs_context_set_write_error() in
>> > nfs_pageio_add_request(), primarily because it is most consistent -
>> > we
>> > don't need exceptions.
>>
>> Thanks for taking a closer look. I can easily make the change above,
>> and
>> I do think that keeping this mechanism as simple as possible will
>> make
>> it easier to prevent bitrot.
>>
>> That said... NFS_CONTEXT_ERROR_WRITE is a per ctx flag, and the ctx
>> is a
>> per open file description object.
>>
>> Is that the correct way to track this? All of the ctx's will share
>> the
>> same inode. If we're getting writeback errors for one context, it's
>> quite likely that we'll be seeing them via others.
>>
>> I suppose the counterargument is when we have things like expiring
>> krb5
>> tickets. Write failures via an expiring set of creds may have no
>> effect
>> on writeback via other creds.
>>
>> Still, I think a per-inode flag might make more sense here.
>>
>> Thoughts?
>
> As far as I'm concerned, that would be a regression. The most common
> problem when flushing writeback data to the server aside from ENOSPC
> (and possibly ESTALE) is EACCES, which is particular to the file
> descriptor that opened the file.
>
> File contexts, and NFS_CONTEXT_ERROR_WRITE solve that problem by being
> private to the file descriptor.
Thanks for the reminder that errors are per-context and this patch drops
this. The per-context nature of errors in NFS was the reason that I
nagged Jeff to make errseq_t a stand-alone type rather than just a part
of address_space. I had envisaged that it would be embedded in the
open_context as well.
We still could do that, but as there is precisely one open-file for each
open_context, the gains are not great.
However, while looking over the code to make sure I really understood it
and all the possible consequences of changing to errseq_t I found a few
anomalies. The patch below addresses them all.
Would you see if they may sense to you?
Thanks,
NeilBrown
From: NeilBrown <neilb@suse.com>
Date: Mon, 11 Sep 2017 13:15:50 +1000
Subject: [PATCH] NFS: various changes relating to reporting IO errors.
1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
They aren't used.
2/ Make nfs_context_set_write_error() a "static inline" in internal.h
so we can...
3/ Use nfs_context_set_write_error() instead of mapping_set_error()
if nfs_pageio_add_request() fails before sending any request.
NFS generally keeps errors in the open_context, not the mapping,
so this is more consistent.
4/ If filemap_write_and_write_range() reports any error, still
check ctx->error. The value in ctx->error is likely to be
more useful. As part of this, NFS_CONTEXT_ERROR_WRITE is
cleared slightly earlier, before nfs_file_fsync_commit() is called,
rather than at the start of that function.
Signed-off-by: NeilBrown <neilb@suse.com>
---
fs/nfs/file.c | 16 ++++++++++------
fs/nfs/internal.h | 7 +++++++
fs/nfs/pagelist.c | 4 ++--
fs/nfs/write.c | 7 -------
4 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index af330c31f627..ab324f14081f 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -208,21 +208,19 @@ EXPORT_SYMBOL_GPL(nfs_file_mmap);
* fall back to doing a synchronous write.
*/
static int
-nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end, int datasync)
+nfs_file_fsync_commit(struct file *file, int datasync)
{
struct nfs_open_context *ctx = nfs_file_open_context(file);
struct inode *inode = file_inode(file);
- int have_error, do_resend, status;
+ int do_resend, status;
int ret = 0;
dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
- have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
status = nfs_commit_inode(inode, FLUSH_SYNC);
- have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
- if (have_error) {
+ if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
ret = xchg(&ctx->error, 0);
if (ret)
goto out;
@@ -247,10 +245,16 @@ nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
trace_nfs_fsync_enter(inode);
do {
+ struct nfs_open_context *ctx = nfs_file_open_context(file);
ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
+ if (test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
+ int ret2 = xchg(&ctx->error, 0);
+ if (ret2)
+ ret = ret2;
+ }
if (ret != 0)
break;
- ret = nfs_file_fsync_commit(file, start, end, datasync);
+ ret = nfs_file_fsync_commit(file, datasync);
if (!ret)
ret = pnfs_sync_inode(inode, !!datasync);
/*
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index dc456416d2be..44c8962fec91 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -769,3 +769,10 @@ static inline bool nfs_error_is_fatal(int err)
return false;
}
}
+
+static inline void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
+{
+ ctx->error = error;
+ smp_wmb();
+ set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
+}
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index de9066a92c0d..0ebd26b9a6bd 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -1198,8 +1198,8 @@ out_failed:
/* remember fatal errors */
if (nfs_error_is_fatal(desc->pg_error))
- mapping_set_error(desc->pg_inode->i_mapping,
- desc->pg_error);
+ nfs_context_set_write_error(req->wb_context,
+ desc->pg_error);
func = desc->pg_completion_ops->error_cleanup;
for (midx = 0; midx < desc->pg_mirror_count; midx++) {
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b1af5dee5e0a..f702bf2def79 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -147,13 +147,6 @@ static void nfs_io_completion_put(struct nfs_io_completion *ioc)
kref_put(&ioc->refcount, nfs_io_completion_release);
}
-static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
-{
- ctx->error = error;
- smp_wmb();
- set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
-}
-
/*
* nfs_page_find_head_request_locked - find head request associated with @page
*
--
2.14.0.rc0.dirty
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
next prev parent reply other threads:[~2017-09-11 3:25 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-07-20 19:42 [PATCH] nfs: track writeback errors with errseq_t Jeff Layton
2017-08-25 17:59 ` Jeff Layton
2017-08-27 23:24 ` NeilBrown
2017-08-28 11:47 ` Jeff Layton
2017-08-29 1:23 ` NeilBrown
2017-08-29 10:54 ` Jeff Layton
2017-09-07 3:37 ` NeilBrown
2017-09-07 11:35 ` Jeff Layton
2017-09-07 14:54 ` Trond Myklebust
2017-09-11 3:24 ` NeilBrown [this message]
2017-09-11 10:46 ` Jeff Layton
2017-09-11 21:52 ` NeilBrown
2017-09-12 15:20 ` Jeff Layton
2017-09-12 21:47 ` NeilBrown
2017-09-13 12:23 ` Jeff Layton
2017-09-13 23:50 ` [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes NeilBrown
2017-09-14 10:48 ` Jeff Layton
2017-09-15 7:50 ` Michael Kerrisk (man-pages)
2017-09-15 8:25 ` NeilBrown
2017-09-28 3:01 ` NeilBrown
2017-09-28 12:20 ` Jeff Layton
2017-09-28 16:19 ` Michael Kerrisk (man-opages)
2017-09-12 2:24 ` [PATCH] nfs: track writeback errors with errseq_t Trond Myklebust
2017-09-12 5:29 ` NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=874ls9diij.fsf@notabene.neil.brown.name \
--to=neilb@suse.com \
--cc=anna.schumaker@netapp.com \
--cc=jlayton@kernel.org \
--cc=jlayton@redhat.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=trondmy@primarydata.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).