linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] nfs: track writeback errors with errseq_t
@ 2017-07-20 19:42 Jeff Layton
  2017-08-25 17:59 ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2017-07-20 19:42 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker; +Cc: linux-nfs, linux-fsdevel, NeilBrown

From: Jeff Layton <jlayton@redhat.com>

There is some ambiguity in nfs about how writeback errors are tracked.

For instance, nfs_pageio_add_request calls mapping_set_error when the
add fails, but we track errors that occur after adding the request
with a dedicated int error in the open context.

Now that we have better infrastructure for the vfs layer, this
latter int is now unnecessary. Just have nfs_context_set_write_error set
the error in the mapping when one occurs.

Have NFS use file_write_and_wait_range to initiate and wait on writeback
of the data, and then check again after issuing the commit(s).

With this, we also don't need to pay attention to the ERROR_WRITE
flag for reporting, and just clear it to indicate to subsequent
writers that they should try to go asynchronous again.

In nfs_page_async_flush, sample the error before locking and joining
the requests, and check for errors since that point.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfs/file.c          | 24 +++++++++++-------------
 fs/nfs/inode.c         |  3 +--
 fs/nfs/write.c         |  8 ++++++--
 include/linux/nfs_fs.h |  1 -
 4 files changed, 18 insertions(+), 18 deletions(-)

I have a baling wire and duct tape solution for testing this with
xfstests (using iptables REJECT targets and soft mounts). This seems to
make nfs do the right thing.

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5713eb32a45e..15d3c6faafd3 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct nfs_open_context *ctx = nfs_file_open_context(file);
 	struct inode *inode = file_inode(file);
-	int have_error, do_resend, status;
-	int ret = 0;
+	int do_resend, status;
+	int ret;
 
 	dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
 
 	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
 	do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
-	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
-	status = nfs_commit_inode(inode, FLUSH_SYNC);
-	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
-	if (have_error) {
-		ret = xchg(&ctx->error, 0);
-		if (ret)
-			goto out;
-	}
-	if (status < 0) {
+	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
+	ret = nfs_commit_inode(inode, FLUSH_SYNC);
+
+	/* Recheck and advance after the commit */
+	status = file_check_and_advance_wb_err(file);
+	if (!ret)
 		ret = status;
+	if (ret)
 		goto out;
-	}
+
 	do_resend |= test_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
 	if (do_resend)
 		ret = -EAGAIN;
@@ -247,7 +245,7 @@ nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	trace_nfs_fsync_enter(inode);
 
 	do {
-		ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
+		ret = file_write_and_wait_range(file, start, end);
 		if (ret != 0)
 			break;
 		ret = nfs_file_fsync_commit(file, start, end, datasync);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 109279d6d91b..c48f673c5cc9 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -900,7 +900,6 @@ struct nfs_open_context *alloc_nfs_open_context(struct dentry *dentry,
 	ctx->state = NULL;
 	ctx->mode = f_mode;
 	ctx->flags = 0;
-	ctx->error = 0;
 	ctx->flock_owner = (fl_owner_t)filp;
 	nfs_init_lock_context(&ctx->lock_context);
 	ctx->lock_context.open_context = ctx;
@@ -1009,7 +1008,7 @@ void nfs_file_clear_open_context(struct file *filp)
 		 * We fatal error on write before. Try to writeback
 		 * every page again.
 		 */
-		if (ctx->error < 0)
+		if (filemap_check_wb_err(inode->i_mapping, filp->f_wb_err))
 			invalidate_inode_pages2(inode->i_mapping);
 		filp->private_data = NULL;
 		spin_lock(&inode->i_lock);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b1af5dee5e0a..c2fcaf07cd24 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -149,7 +149,9 @@ static void nfs_io_completion_put(struct nfs_io_completion *ioc)
 
 static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
 {
-	ctx->error = error;
+	struct inode *inode = d_inode(ctx->dentry);
+
+	mapping_set_error(inode->i_mapping, error);
 	smp_wmb();
 	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
 }
@@ -628,6 +630,8 @@ static int nfs_page_async_flush(struct nfs_pageio_descriptor *pgio,
 {
 	struct nfs_page *req;
 	int ret = 0;
+	struct address_space *mapping = page_file_mapping(page);
+	errseq_t since = filemap_sample_wb_err(mapping);
 
 	req = nfs_lock_and_join_requests(page, nonblock);
 	if (!req)
@@ -641,7 +645,7 @@ static int nfs_page_async_flush(struct nfs_pageio_descriptor *pgio,
 
 	ret = 0;
 	/* If there is a fatal error that covers this write, just exit */
-	if (nfs_error_is_fatal_on_server(req->wb_context->error))
+	if (nfs_error_is_fatal_on_server(filemap_check_wb_err(mapping, since)))
 		goto out_launder;
 
 	if (!nfs_pageio_add_request(pgio, req)) {
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index e52cc55ac300..a96b0bd52b32 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -77,7 +77,6 @@ struct nfs_open_context {
 #define NFS_CONTEXT_RESEND_WRITES	(1)
 #define NFS_CONTEXT_BAD			(2)
 #define NFS_CONTEXT_UNLOCK	(3)
-	int error;
 
 	struct list_head list;
 	struct nfs4_threshold	*mdsthreshold;
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-07-20 19:42 [PATCH] nfs: track writeback errors with errseq_t Jeff Layton
@ 2017-08-25 17:59 ` Jeff Layton
  2017-08-27 23:24   ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2017-08-25 17:59 UTC (permalink / raw)
  To: Jeff Layton, trond.myklebust, anna.schumaker
  Cc: linux-nfs, linux-fsdevel, NeilBrown

On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> There is some ambiguity in nfs about how writeback errors are tracked.
> 
> For instance, nfs_pageio_add_request calls mapping_set_error when the
> add fails, but we track errors that occur after adding the request
> with a dedicated int error in the open context.
> 
> Now that we have better infrastructure for the vfs layer, this
> latter int is now unnecessary. Just have nfs_context_set_write_error set
> the error in the mapping when one occurs.
> 
> Have NFS use file_write_and_wait_range to initiate and wait on writeback
> of the data, and then check again after issuing the commit(s).
> 
> With this, we also don't need to pay attention to the ERROR_WRITE
> flag for reporting, and just clear it to indicate to subsequent
> writers that they should try to go asynchronous again.
> 
> In nfs_page_async_flush, sample the error before locking and joining
> the requests, and check for errors since that point.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/nfs/file.c          | 24 +++++++++++-------------
>  fs/nfs/inode.c         |  3 +--
>  fs/nfs/write.c         |  8 ++++++--
>  include/linux/nfs_fs.h |  1 -
>  4 files changed, 18 insertions(+), 18 deletions(-)
> 
> I have a baling wire and duct tape solution for testing this with
> xfstests (using iptables REJECT targets and soft mounts). This seems to
> make nfs do the right thing.
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 5713eb32a45e..15d3c6faafd3 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end, int datasync)
>  {
>  	struct nfs_open_context *ctx = nfs_file_open_context(file);
>  	struct inode *inode = file_inode(file);
> -	int have_error, do_resend, status;
> -	int ret = 0;
> +	int do_resend, status;
> +	int ret;
>  
>  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
>  
>  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>  	do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
> -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> -	status = nfs_commit_inode(inode, FLUSH_SYNC);
> -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> -	if (have_error) {
> -		ret = xchg(&ctx->error, 0);
> -		if (ret)
> -			goto out;
> -	}
> -	if (status < 0) {
> +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
> +
> +	/* Recheck and advance after the commit */
> +	status = file_check_and_advance_wb_err(file);
> +	if (!ret)
>  		ret = status;
> +	if (ret)
>  		goto out;
> -	}
> +
>  	do_resend |= test_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
>  	if (do_resend)
>  		ret = -EAGAIN;
> @@ -247,7 +245,7 @@ nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>  	trace_nfs_fsync_enter(inode);
>  
>  	do {
> -		ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> +		ret = file_write_and_wait_range(file, start, end);
>  		if (ret != 0)
>  			break;
>  		ret = nfs_file_fsync_commit(file, start, end, datasync);
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 109279d6d91b..c48f673c5cc9 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -900,7 +900,6 @@ struct nfs_open_context *alloc_nfs_open_context(struct dentry *dentry,
>  	ctx->state = NULL;
>  	ctx->mode = f_mode;
>  	ctx->flags = 0;
> -	ctx->error = 0;
>  	ctx->flock_owner = (fl_owner_t)filp;
>  	nfs_init_lock_context(&ctx->lock_context);
>  	ctx->lock_context.open_context = ctx;
> @@ -1009,7 +1008,7 @@ void nfs_file_clear_open_context(struct file *filp)
>  		 * We fatal error on write before. Try to writeback
>  		 * every page again.
>  		 */
> -		if (ctx->error < 0)
> +		if (filemap_check_wb_err(inode->i_mapping, filp->f_wb_err))
>  			invalidate_inode_pages2(inode->i_mapping);
>  		filp->private_data = NULL;
>  		spin_lock(&inode->i_lock);
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b1af5dee5e0a..c2fcaf07cd24 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -149,7 +149,9 @@ static void nfs_io_completion_put(struct nfs_io_completion *ioc)
>  
>  static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
>  {
> -	ctx->error = error;
> +	struct inode *inode = d_inode(ctx->dentry);
> +
> +	mapping_set_error(inode->i_mapping, error);
>  	smp_wmb();
>  	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>  }
> @@ -628,6 +630,8 @@ static int nfs_page_async_flush(struct nfs_pageio_descriptor *pgio,
>  {
>  	struct nfs_page *req;
>  	int ret = 0;
> +	struct address_space *mapping = page_file_mapping(page);
> +	errseq_t since = filemap_sample_wb_err(mapping);
>  
>  	req = nfs_lock_and_join_requests(page, nonblock);
>  	if (!req)
> @@ -641,7 +645,7 @@ static int nfs_page_async_flush(struct nfs_pageio_descriptor *pgio,
>  
>  	ret = 0;
>  	/* If there is a fatal error that covers this write, just exit */
> -	if (nfs_error_is_fatal_on_server(req->wb_context->error))
> +	if (nfs_error_is_fatal_on_server(filemap_check_wb_err(mapping, since)))
>  		goto out_launder;
>  
>  	if (!nfs_pageio_add_request(pgio, req)) {
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index e52cc55ac300..a96b0bd52b32 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -77,7 +77,6 @@ struct nfs_open_context {
>  #define NFS_CONTEXT_RESEND_WRITES	(1)
>  #define NFS_CONTEXT_BAD			(2)
>  #define NFS_CONTEXT_UNLOCK	(3)
> -	int error;
>  
>  	struct list_head list;
>  	struct nfs4_threshold	*mdsthreshold;

Anna and Trond,

Ping? I haven't seen any word on this patch, and it hasn't shown up in
any branches. Do you have concerns with it, or is this good to go for
v4.14?

Thanks,
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-08-25 17:59 ` Jeff Layton
@ 2017-08-27 23:24   ` NeilBrown
  2017-08-28 11:47     ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2017-08-27 23:24 UTC (permalink / raw)
  To: Jeff Layton, Jeff Layton, trond.myklebust, anna.schumaker
  Cc: linux-nfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3232 bytes --]

On Fri, Aug 25 2017, Jeff Layton wrote:

> On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
>> From: Jeff Layton <jlayton@redhat.com>
>> 
>> There is some ambiguity in nfs about how writeback errors are tracked.
>> 
>> For instance, nfs_pageio_add_request calls mapping_set_error when the
>> add fails, but we track errors that occur after adding the request
>> with a dedicated int error in the open context.
>> 
>> Now that we have better infrastructure for the vfs layer, this
>> latter int is now unnecessary. Just have nfs_context_set_write_error set
>> the error in the mapping when one occurs.
>> 
>> Have NFS use file_write_and_wait_range to initiate and wait on writeback
>> of the data, and then check again after issuing the commit(s).
>> 
>> With this, we also don't need to pay attention to the ERROR_WRITE
>> flag for reporting, and just clear it to indicate to subsequent
>> writers that they should try to go asynchronous again.
>> 
>> In nfs_page_async_flush, sample the error before locking and joining
>> the requests, and check for errors since that point.
>> 
>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> ---
>>  fs/nfs/file.c          | 24 +++++++++++-------------
>>  fs/nfs/inode.c         |  3 +--
>>  fs/nfs/write.c         |  8 ++++++--
>>  include/linux/nfs_fs.h |  1 -
>>  4 files changed, 18 insertions(+), 18 deletions(-)
>> 
>> I have a baling wire and duct tape solution for testing this with
>> xfstests (using iptables REJECT targets and soft mounts). This seems to
>> make nfs do the right thing.
>> 
>> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> index 5713eb32a45e..15d3c6faafd3 100644
>> --- a/fs/nfs/file.c
>> +++ b/fs/nfs/file.c
>> @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end, int datasync)
>>  {
>>  	struct nfs_open_context *ctx = nfs_file_open_context(file);
>>  	struct inode *inode = file_inode(file);
>> -	int have_error, do_resend, status;
>> -	int ret = 0;
>> +	int do_resend, status;
>> +	int ret;
>>  
>>  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
>>  
>>  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>>  	do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
>> -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> -	status = nfs_commit_inode(inode, FLUSH_SYNC);
>> -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> -	if (have_error) {
>> -		ret = xchg(&ctx->error, 0);
>> -		if (ret)
>> -			goto out;
>> -	}
>> -	if (status < 0) {
>> +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
>> +
>> +	/* Recheck and advance after the commit */
>> +	status = file_check_and_advance_wb_err(file);

This change makes the code inconsistent with the comment above the
function, which still references ctx->error.  The intent of the comment
is still correct, but the details have changed.

Also, there is a call to mapping_set_error() in
nfs_pageio_add_request().
I wonder if that should be changed to
  nfs_context_set_write_error(req->wb_context, desc->pg_error)
??

Otherwise, patch looks good to me.
Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-08-27 23:24   ` NeilBrown
@ 2017-08-28 11:47     ` Jeff Layton
  2017-08-29  1:23       ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2017-08-28 11:47 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, trond.myklebust, anna.schumaker
  Cc: linux-nfs, linux-fsdevel

On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
> On Fri, Aug 25 2017, Jeff Layton wrote:
> 
> > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
> > > From: Jeff Layton <jlayton@redhat.com>
> > > 
> > > There is some ambiguity in nfs about how writeback errors are
> > > tracked.
> > > 
> > > For instance, nfs_pageio_add_request calls mapping_set_error when
> > > the
> > > add fails, but we track errors that occur after adding the
> > > request
> > > with a dedicated int error in the open context.
> > > 
> > > Now that we have better infrastructure for the vfs layer, this
> > > latter int is now unnecessary. Just have
> > > nfs_context_set_write_error set
> > > the error in the mapping when one occurs.
> > > 
> > > Have NFS use file_write_and_wait_range to initiate and wait on
> > > writeback
> > > of the data, and then check again after issuing the commit(s).
> > > 
> > > With this, we also don't need to pay attention to the ERROR_WRITE
> > > flag for reporting, and just clear it to indicate to subsequent
> > > writers that they should try to go asynchronous again.
> > > 
> > > In nfs_page_async_flush, sample the error before locking and
> > > joining
> > > the requests, and check for errors since that point.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > ---
> > >  fs/nfs/file.c          | 24 +++++++++++-------------
> > >  fs/nfs/inode.c         |  3 +--
> > >  fs/nfs/write.c         |  8 ++++++--
> > >  include/linux/nfs_fs.h |  1 -
> > >  4 files changed, 18 insertions(+), 18 deletions(-)
> > > 
> > > I have a baling wire and duct tape solution for testing this with
> > > xfstests (using iptables REJECT targets and soft mounts). This
> > > seems to
> > > make nfs do the right thing.
> > > 
> > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > index 5713eb32a45e..15d3c6faafd3 100644
> > > --- a/fs/nfs/file.c
> > > +++ b/fs/nfs/file.c
> > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file *file,
> > > loff_t start, loff_t end, int datasync)
> > >  {
> > >  	struct nfs_open_context *ctx =
> > > nfs_file_open_context(file);
> > >  	struct inode *inode = file_inode(file);
> > > -	int have_error, do_resend, status;
> > > -	int ret = 0;
> > > +	int do_resend, status;
> > > +	int ret;
> > >  
> > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file,
> > > datasync);
> > >  
> > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
> > >  	do_resend =
> > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
> > > -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
> > > &ctx->flags);
> > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
> > > -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > >flags);
> > > -	if (have_error) {
> > > -		ret = xchg(&ctx->error, 0);
> > > -		if (ret)
> > > -			goto out;
> > > -	}
> > > -	if (status < 0) {
> > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
> > > +
> > > +	/* Recheck and advance after the commit */
> > > +	status = file_check_and_advance_wb_err(file);
> 
> This change makes the code inconsistent with the comment above the
> function, which still references ctx->error.  The intent of the
> comment
> is still correct, but the details have changed.
> 

Good catch. I'll fix that up in a respin.

> Also, there is a call to mapping_set_error() in
> nfs_pageio_add_request().
> I wonder if that should be changed to
>   nfs_context_set_write_error(req->wb_context, desc->pg_error)
> ??
> 

Trickier question...

I'm not quite sure what semantics we're looking for with
NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
synchronous, but I'm not quite sure why it gets cleared the way it
does. It's set on any error but cleared before issuing a commit.

I added a similar flag to Ceph inodes recently, but only clear it when
a write succeeds. Wouldn't that make more sense here as well?

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-08-28 11:47     ` Jeff Layton
@ 2017-08-29  1:23       ` NeilBrown
  2017-08-29 10:54         ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2017-08-29  1:23 UTC (permalink / raw)
  To: Jeff Layton, Jeff Layton, trond.myklebust, anna.schumaker
  Cc: linux-nfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5412 bytes --]

On Mon, Aug 28 2017, Jeff Layton wrote:

> On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
>> On Fri, Aug 25 2017, Jeff Layton wrote:
>> 
>> > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
>> > > From: Jeff Layton <jlayton@redhat.com>
>> > > 
>> > > There is some ambiguity in nfs about how writeback errors are
>> > > tracked.
>> > > 
>> > > For instance, nfs_pageio_add_request calls mapping_set_error when
>> > > the
>> > > add fails, but we track errors that occur after adding the
>> > > request
>> > > with a dedicated int error in the open context.
>> > > 
>> > > Now that we have better infrastructure for the vfs layer, this
>> > > latter int is now unnecessary. Just have
>> > > nfs_context_set_write_error set
>> > > the error in the mapping when one occurs.
>> > > 
>> > > Have NFS use file_write_and_wait_range to initiate and wait on
>> > > writeback
>> > > of the data, and then check again after issuing the commit(s).
>> > > 
>> > > With this, we also don't need to pay attention to the ERROR_WRITE
>> > > flag for reporting, and just clear it to indicate to subsequent
>> > > writers that they should try to go asynchronous again.
>> > > 
>> > > In nfs_page_async_flush, sample the error before locking and
>> > > joining
>> > > the requests, and check for errors since that point.
>> > > 
>> > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > ---
>> > >  fs/nfs/file.c          | 24 +++++++++++-------------
>> > >  fs/nfs/inode.c         |  3 +--
>> > >  fs/nfs/write.c         |  8 ++++++--
>> > >  include/linux/nfs_fs.h |  1 -
>> > >  4 files changed, 18 insertions(+), 18 deletions(-)
>> > > 
>> > > I have a baling wire and duct tape solution for testing this with
>> > > xfstests (using iptables REJECT targets and soft mounts). This
>> > > seems to
>> > > make nfs do the right thing.
>> > > 
>> > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> > > index 5713eb32a45e..15d3c6faafd3 100644
>> > > --- a/fs/nfs/file.c
>> > > +++ b/fs/nfs/file.c
>> > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file *file,
>> > > loff_t start, loff_t end, int datasync)
>> > >  {
>> > >  	struct nfs_open_context *ctx =
>> > > nfs_file_open_context(file);
>> > >  	struct inode *inode = file_inode(file);
>> > > -	int have_error, do_resend, status;
>> > > -	int ret = 0;
>> > > +	int do_resend, status;
>> > > +	int ret;
>> > >  
>> > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file,
>> > > datasync);
>> > >  
>> > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>> > >  	do_resend =
>> > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
>> > > -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
>> > > &ctx->flags);
>> > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
>> > > >flags);
>> > > -	if (have_error) {
>> > > -		ret = xchg(&ctx->error, 0);
>> > > -		if (ret)
>> > > -			goto out;
>> > > -	}
>> > > -	if (status < 0) {
>> > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > +
>> > > +	/* Recheck and advance after the commit */
>> > > +	status = file_check_and_advance_wb_err(file);
>> 
>> This change makes the code inconsistent with the comment above the
>> function, which still references ctx->error.  The intent of the
>> comment
>> is still correct, but the details have changed.
>> 
>
> Good catch. I'll fix that up in a respin.
>
>> Also, there is a call to mapping_set_error() in
>> nfs_pageio_add_request().
>> I wonder if that should be changed to
>>   nfs_context_set_write_error(req->wb_context, desc->pg_error)
>> ??
>> 
>
> Trickier question...
>
> I'm not quite sure what semantics we're looking for with
> NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
> synchronous, but I'm not quite sure why it gets cleared the way it
> does. It's set on any error but cleared before issuing a commit.
>
> I added a similar flag to Ceph inodes recently, but only clear it when
> a write succeeds. Wouldn't that make more sense here as well?

It is a bit hard to wrap one's mind around.

In the original code (commit 7b159fc18d417980) it looks like:
 - test-and-clear bit
 - write and sync
 - test-bit

This does, I think, seem safer than "clear on successful write" as the
writes could complete out-of-order and I wouldn't be surprised if the
unsuccessful ones completed with an error before the successful one -
particularly with an error like EDQUOT.

However the current code does the writes before the test-and-clear, and
only does the commit afterwards.  That makes it less clear why the
current sequence is a good idea.

However ... nfs_file_fsync_commit() is only called if
filemap_write_and_wait_range() returned with success, so we only clear
the flag after successful writes(?).

Oh....
This patch from me:

Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error handling.")

seems to have been reverted by

Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if page writeback failed")

which probably isn't good.  It appears that this code is very fragile
and easily broken.
Maybe we need to work out exactly what is required, and document it - so
we can stop breaking it.
Or maybe we need some unit tests.....

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-08-29  1:23       ` NeilBrown
@ 2017-08-29 10:54         ` Jeff Layton
  2017-09-07  3:37           ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2017-08-29 10:54 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, trond.myklebust, anna.schumaker
  Cc: linux-nfs, linux-fsdevel

On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
> On Mon, Aug 28 2017, Jeff Layton wrote:
> 
> > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
> > > On Fri, Aug 25 2017, Jeff Layton wrote:
> > > 
> > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
> > > > > From: Jeff Layton <jlayton@redhat.com>
> > > > > 
> > > > > There is some ambiguity in nfs about how writeback errors are
> > > > > tracked.
> > > > > 
> > > > > For instance, nfs_pageio_add_request calls mapping_set_error when
> > > > > the
> > > > > add fails, but we track errors that occur after adding the
> > > > > request
> > > > > with a dedicated int error in the open context.
> > > > > 
> > > > > Now that we have better infrastructure for the vfs layer, this
> > > > > latter int is now unnecessary. Just have
> > > > > nfs_context_set_write_error set
> > > > > the error in the mapping when one occurs.
> > > > > 
> > > > > Have NFS use file_write_and_wait_range to initiate and wait on
> > > > > writeback
> > > > > of the data, and then check again after issuing the commit(s).
> > > > > 
> > > > > With this, we also don't need to pay attention to the ERROR_WRITE
> > > > > flag for reporting, and just clear it to indicate to subsequent
> > > > > writers that they should try to go asynchronous again.
> > > > > 
> > > > > In nfs_page_async_flush, sample the error before locking and
> > > > > joining
> > > > > the requests, and check for errors since that point.
> > > > > 
> > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > ---
> > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
> > > > >  fs/nfs/inode.c         |  3 +--
> > > > >  fs/nfs/write.c         |  8 ++++++--
> > > > >  include/linux/nfs_fs.h |  1 -
> > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
> > > > > 
> > > > > I have a baling wire and duct tape solution for testing this with
> > > > > xfstests (using iptables REJECT targets and soft mounts). This
> > > > > seems to
> > > > > make nfs do the right thing.
> > > > > 
> > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > > > index 5713eb32a45e..15d3c6faafd3 100644
> > > > > --- a/fs/nfs/file.c
> > > > > +++ b/fs/nfs/file.c
> > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file *file,
> > > > > loff_t start, loff_t end, int datasync)
> > > > >  {
> > > > >  	struct nfs_open_context *ctx =
> > > > > nfs_file_open_context(file);
> > > > >  	struct inode *inode = file_inode(file);
> > > > > -	int have_error, do_resend, status;
> > > > > -	int ret = 0;
> > > > > +	int do_resend, status;
> > > > > +	int ret;
> > > > >  
> > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file,
> > > > > datasync);
> > > > >  
> > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
> > > > >  	do_resend =
> > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
> > > > > -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
> > > > > &ctx->flags);
> > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > flags);
> > > > > 
> > > > > -	if (have_error) {
> > > > > -		ret = xchg(&ctx->error, 0);
> > > > > -		if (ret)
> > > > > -			goto out;
> > > > > -	}
> > > > > -	if (status < 0) {
> > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > +
> > > > > +	/* Recheck and advance after the commit */
> > > > > +	status = file_check_and_advance_wb_err(file);
> > > 
> > > This change makes the code inconsistent with the comment above the
> > > function, which still references ctx->error.  The intent of the
> > > comment
> > > is still correct, but the details have changed.
> > > 
> > 
> > Good catch. I'll fix that up in a respin.
> > 
> > > Also, there is a call to mapping_set_error() in
> > > nfs_pageio_add_request().
> > > I wonder if that should be changed to
> > >   nfs_context_set_write_error(req->wb_context, desc->pg_error)
> > > ??
> > > 
> > 
> > Trickier question...
> > 
> > I'm not quite sure what semantics we're looking for with
> > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
> > synchronous, but I'm not quite sure why it gets cleared the way it
> > does. It's set on any error but cleared before issuing a commit.
> > 
> > I added a similar flag to Ceph inodes recently, but only clear it when
> > a write succeeds. Wouldn't that make more sense here as well?
> 
> It is a bit hard to wrap one's mind around.
> 
> In the original code (commit 7b159fc18d417980) it looks like:
>  - test-and-clear bit
>  - write and sync
>  - test-bit
> 
> This does, I think, seem safer than "clear on successful write" as the
> writes could complete out-of-order and I wouldn't be surprised if the
> unsuccessful ones completed with an error before the successful one -
> particularly with an error like EDQUOT.
> 
> However the current code does the writes before the test-and-clear, and
> only does the commit afterwards.  That makes it less clear why the
> current sequence is a good idea.
> 
> However ... nfs_file_fsync_commit() is only called if
> filemap_write_and_wait_range() returned with success, so we only clear
> the flag after successful writes(?).
> 
> Oh....
> This patch from me:
> 
> Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error handling.")
> 
> seems to have been reverted by
> 
> Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if page writeback failed")
> 
> which probably isn't good.  It appears that this code is very fragile
> and easily broken.
> Maybe we need to work out exactly what is required, and document it - so
> we can stop breaking it.
> Or maybe we need some unit tests.....
> 

Yes, laying out what's necessary for this would be very helpful. We
clearly want to set the flag when an error occurs. Under what
circumstances should we be clearing it?

I'm not sure we can really do much better than clearing it on a
successful write. With Ceph, was that this is just a hint to the write
submission mechanism and we generally aren't too concerned if a few slip
past in either direction.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-08-29 10:54         ` Jeff Layton
@ 2017-09-07  3:37           ` NeilBrown
  2017-09-07 11:35             ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2017-09-07  3:37 UTC (permalink / raw)
  To: Jeff Layton, Jeff Layton, trond.myklebust, anna.schumaker
  Cc: linux-nfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 8507 bytes --]

On Tue, Aug 29 2017, Jeff Layton wrote:

> On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
>> On Mon, Aug 28 2017, Jeff Layton wrote:
>> 
>> > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
>> > > On Fri, Aug 25 2017, Jeff Layton wrote:
>> > > 
>> > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
>> > > > > From: Jeff Layton <jlayton@redhat.com>
>> > > > > 
>> > > > > There is some ambiguity in nfs about how writeback errors are
>> > > > > tracked.
>> > > > > 
>> > > > > For instance, nfs_pageio_add_request calls mapping_set_error when
>> > > > > the
>> > > > > add fails, but we track errors that occur after adding the
>> > > > > request
>> > > > > with a dedicated int error in the open context.
>> > > > > 
>> > > > > Now that we have better infrastructure for the vfs layer, this
>> > > > > latter int is now unnecessary. Just have
>> > > > > nfs_context_set_write_error set
>> > > > > the error in the mapping when one occurs.
>> > > > > 
>> > > > > Have NFS use file_write_and_wait_range to initiate and wait on
>> > > > > writeback
>> > > > > of the data, and then check again after issuing the commit(s).
>> > > > > 
>> > > > > With this, we also don't need to pay attention to the ERROR_WRITE
>> > > > > flag for reporting, and just clear it to indicate to subsequent
>> > > > > writers that they should try to go asynchronous again.
>> > > > > 
>> > > > > In nfs_page_async_flush, sample the error before locking and
>> > > > > joining
>> > > > > the requests, and check for errors since that point.
>> > > > > 
>> > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > > > ---
>> > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
>> > > > >  fs/nfs/inode.c         |  3 +--
>> > > > >  fs/nfs/write.c         |  8 ++++++--
>> > > > >  include/linux/nfs_fs.h |  1 -
>> > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
>> > > > > 
>> > > > > I have a baling wire and duct tape solution for testing this with
>> > > > > xfstests (using iptables REJECT targets and soft mounts). This
>> > > > > seems to
>> > > > > make nfs do the right thing.
>> > > > > 
>> > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> > > > > index 5713eb32a45e..15d3c6faafd3 100644
>> > > > > --- a/fs/nfs/file.c
>> > > > > +++ b/fs/nfs/file.c
>> > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file *file,
>> > > > > loff_t start, loff_t end, int datasync)
>> > > > >  {
>> > > > >  	struct nfs_open_context *ctx =
>> > > > > nfs_file_open_context(file);
>> > > > >  	struct inode *inode = file_inode(file);
>> > > > > -	int have_error, do_resend, status;
>> > > > > -	int ret = 0;
>> > > > > +	int do_resend, status;
>> > > > > +	int ret;
>> > > > >  
>> > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file,
>> > > > > datasync);
>> > > > >  
>> > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>> > > > >  	do_resend =
>> > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
>> > > > > -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
>> > > > > &ctx->flags);
>> > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > > > -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
>> > > > > > flags);
>> > > > > 
>> > > > > -	if (have_error) {
>> > > > > -		ret = xchg(&ctx->error, 0);
>> > > > > -		if (ret)
>> > > > > -			goto out;
>> > > > > -	}
>> > > > > -	if (status < 0) {
>> > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > > > +
>> > > > > +	/* Recheck and advance after the commit */
>> > > > > +	status = file_check_and_advance_wb_err(file);
>> > > 
>> > > This change makes the code inconsistent with the comment above the
>> > > function, which still references ctx->error.  The intent of the
>> > > comment
>> > > is still correct, but the details have changed.
>> > > 
>> > 
>> > Good catch. I'll fix that up in a respin.
>> > 
>> > > Also, there is a call to mapping_set_error() in
>> > > nfs_pageio_add_request().
>> > > I wonder if that should be changed to
>> > >   nfs_context_set_write_error(req->wb_context, desc->pg_error)
>> > > ??
>> > > 
>> > 
>> > Trickier question...
>> > 
>> > I'm not quite sure what semantics we're looking for with
>> > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
>> > synchronous, but I'm not quite sure why it gets cleared the way it
>> > does. It's set on any error but cleared before issuing a commit.
>> > 
>> > I added a similar flag to Ceph inodes recently, but only clear it when
>> > a write succeeds. Wouldn't that make more sense here as well?
>> 
>> It is a bit hard to wrap one's mind around.
>> 
>> In the original code (commit 7b159fc18d417980) it looks like:
>>  - test-and-clear bit
>>  - write and sync
>>  - test-bit
>> 
>> This does, I think, seem safer than "clear on successful write" as the
>> writes could complete out-of-order and I wouldn't be surprised if the
>> unsuccessful ones completed with an error before the successful one -
>> particularly with an error like EDQUOT.
>> 
>> However the current code does the writes before the test-and-clear, and
>> only does the commit afterwards.  That makes it less clear why the
>> current sequence is a good idea.
>> 
>> However ... nfs_file_fsync_commit() is only called if
>> filemap_write_and_wait_range() returned with success, so we only clear
>> the flag after successful writes(?).
>> 
>> Oh....
>> This patch from me:
>> 
>> Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error handling.")
>> 
>> seems to have been reverted by
>> 
>> Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if page writeback failed")
>> 
>> which probably isn't good.  It appears that this code is very fragile
>> and easily broken.

On further investigation, I think the problem that I fixed and then we
reintroduced will be fixed again - more permanently - by your patch.
The root problem is that nfs keeps error codes in a different way to the
MM core.  By unifying those, the problem goes.
(The specific problem is that writes which hit EDQUOT on the server can
 report EIO on the client).


>> Maybe we need to work out exactly what is required, and document it - so
>> we can stop breaking it.
>> Or maybe we need some unit tests.....
>> 
>
> Yes, laying out what's necessary for this would be very helpful. We
> clearly want to set the flag when an error occurs. Under what
> circumstances should we be clearing it?

Well.... looking back at  7b159fc18d417980f57ae which introduced the
flag, prior to that write errors (ctx->error) were only reported by
nfs_file_flush and nfs_fsync, so only one close() and fsync().

After that commit, setting the flag would mean that errors could be
returned by 'write'.  So clearing as part of returning the error makes
perfect sense.

As long as the error gets recorded, and gets returned when it is
recorded, it doesn't much matter when the flag is cleared.  With your
patches we don't need to flag any more to get errors reliably reported.

Leaving the flag set means that writes go more slowly - we don't get
large queue of background rights building up but destined for failure.
This is the main point made in the comment message when the flag was
introduced.
Of course, by the time we first get an error there could already
by a large queue, so we probably want that to drain completely before
allowing async writes again.

It might make sense to have 2 flags.  One which says "writes should be
synchronous", another that says "There was an error recently".
We clear the error flag before calling nfs_fsync, and if it is still
clear afterwards, we clear the sync-writes flag.  Maybe that is more
complex than needed though.

I'm leaning towards your suggestion that it doesn't matter very much
when it gets cleared, and clearing it on any successful write is
simplest.

So I'm still in favor of using nfs_context_set_write_error() in
nfs_pageio_add_request(), primarily because it is most consistent - we
don't need exceptions.

Thanks,
NeilBrown


>
> I'm not sure we can really do much better than clearing it on a
> successful write. With Ceph, was that this is just a hint to the write
> submission mechanism and we generally aren't too concerned if a few slip
> past in either direction.
> -- 
> Jeff Layton <jlayton@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-09-07  3:37           ` NeilBrown
@ 2017-09-07 11:35             ` Jeff Layton
  2017-09-07 14:54               ` Trond Myklebust
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2017-09-07 11:35 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, trond.myklebust, anna.schumaker
  Cc: linux-nfs, linux-fsdevel

On Thu, 2017-09-07 at 13:37 +1000, NeilBrown wrote:
> On Tue, Aug 29 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
> > > On Mon, Aug 28 2017, Jeff Layton wrote:
> > > 
> > > > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
> > > > > On Fri, Aug 25 2017, Jeff Layton wrote:
> > > > > 
> > > > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
> > > > > > > From: Jeff Layton <jlayton@redhat.com>
> > > > > > > 
> > > > > > > There is some ambiguity in nfs about how writeback errors are
> > > > > > > tracked.
> > > > > > > 
> > > > > > > For instance, nfs_pageio_add_request calls mapping_set_error when
> > > > > > > the
> > > > > > > add fails, but we track errors that occur after adding the
> > > > > > > request
> > > > > > > with a dedicated int error in the open context.
> > > > > > > 
> > > > > > > Now that we have better infrastructure for the vfs layer, this
> > > > > > > latter int is now unnecessary. Just have
> > > > > > > nfs_context_set_write_error set
> > > > > > > the error in the mapping when one occurs.
> > > > > > > 
> > > > > > > Have NFS use file_write_and_wait_range to initiate and wait on
> > > > > > > writeback
> > > > > > > of the data, and then check again after issuing the commit(s).
> > > > > > > 
> > > > > > > With this, we also don't need to pay attention to the ERROR_WRITE
> > > > > > > flag for reporting, and just clear it to indicate to subsequent
> > > > > > > writers that they should try to go asynchronous again.
> > > > > > > 
> > > > > > > In nfs_page_async_flush, sample the error before locking and
> > > > > > > joining
> > > > > > > the requests, and check for errors since that point.
> > > > > > > 
> > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > > > ---
> > > > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
> > > > > > >  fs/nfs/inode.c         |  3 +--
> > > > > > >  fs/nfs/write.c         |  8 ++++++--
> > > > > > >  include/linux/nfs_fs.h |  1 -
> > > > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
> > > > > > > 
> > > > > > > I have a baling wire and duct tape solution for testing this with
> > > > > > > xfstests (using iptables REJECT targets and soft mounts). This
> > > > > > > seems to
> > > > > > > make nfs do the right thing.
> > > > > > > 
> > > > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > > > > > index 5713eb32a45e..15d3c6faafd3 100644
> > > > > > > --- a/fs/nfs/file.c
> > > > > > > +++ b/fs/nfs/file.c
> > > > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file *file,
> > > > > > > loff_t start, loff_t end, int datasync)
> > > > > > >  {
> > > > > > >  	struct nfs_open_context *ctx =
> > > > > > > nfs_file_open_context(file);
> > > > > > >  	struct inode *inode = file_inode(file);
> > > > > > > -	int have_error, do_resend, status;
> > > > > > > -	int ret = 0;
> > > > > > > +	int do_resend, status;
> > > > > > > +	int ret;
> > > > > > >  
> > > > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file,
> > > > > > > datasync);
> > > > > > >  
> > > > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
> > > > > > >  	do_resend =
> > > > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
> > > > > > > -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
> > > > > > > &ctx->flags);
> > > > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > > > -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > > > flags);
> > > > > > > 
> > > > > > > -	if (have_error) {
> > > > > > > -		ret = xchg(&ctx->error, 0);
> > > > > > > -		if (ret)
> > > > > > > -			goto out;
> > > > > > > -	}
> > > > > > > -	if (status < 0) {
> > > > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > > > +
> > > > > > > +	/* Recheck and advance after the commit */
> > > > > > > +	status = file_check_and_advance_wb_err(file);
> > > > > 
> > > > > This change makes the code inconsistent with the comment above the
> > > > > function, which still references ctx->error.  The intent of the
> > > > > comment
> > > > > is still correct, but the details have changed.
> > > > > 
> > > > 
> > > > Good catch. I'll fix that up in a respin.
> > > > 
> > > > > Also, there is a call to mapping_set_error() in
> > > > > nfs_pageio_add_request().
> > > > > I wonder if that should be changed to
> > > > >   nfs_context_set_write_error(req->wb_context, desc->pg_error)
> > > > > ??
> > > > > 
> > > > 
> > > > Trickier question...
> > > > 
> > > > I'm not quite sure what semantics we're looking for with
> > > > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
> > > > synchronous, but I'm not quite sure why it gets cleared the way it
> > > > does. It's set on any error but cleared before issuing a commit.
> > > > 
> > > > I added a similar flag to Ceph inodes recently, but only clear it when
> > > > a write succeeds. Wouldn't that make more sense here as well?
> > > 
> > > It is a bit hard to wrap one's mind around.
> > > 
> > > In the original code (commit 7b159fc18d417980) it looks like:
> > >  - test-and-clear bit
> > >  - write and sync
> > >  - test-bit
> > > 
> > > This does, I think, seem safer than "clear on successful write" as the
> > > writes could complete out-of-order and I wouldn't be surprised if the
> > > unsuccessful ones completed with an error before the successful one -
> > > particularly with an error like EDQUOT.
> > > 
> > > However the current code does the writes before the test-and-clear, and
> > > only does the commit afterwards.  That makes it less clear why the
> > > current sequence is a good idea.
> > > 
> > > However ... nfs_file_fsync_commit() is only called if
> > > filemap_write_and_wait_range() returned with success, so we only clear
> > > the flag after successful writes(?).
> > > 
> > > Oh....
> > > This patch from me:
> > > 
> > > Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error handling.")
> > > 
> > > seems to have been reverted by
> > > 
> > > Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if page writeback failed")
> > > 
> > > which probably isn't good.  It appears that this code is very fragile
> > > and easily broken.
> 
> On further investigation, I think the problem that I fixed and then we
> reintroduced will be fixed again - more permanently - by your patch.
> The root problem is that nfs keeps error codes in a different way to the
> MM core.  By unifying those, the problem goes.
> (The specific problem is that writes which hit EDQUOT on the server can
>  report EIO on the client).
> 
> 
> > > Maybe we need to work out exactly what is required, and document it - so
> > > we can stop breaking it.
> > > Or maybe we need some unit tests.....
> > > 
> > 
> > Yes, laying out what's necessary for this would be very helpful. We
> > clearly want to set the flag when an error occurs. Under what
> > circumstances should we be clearing it?
> 
> Well.... looking back at  7b159fc18d417980f57ae which introduced the
> flag, prior to that write errors (ctx->error) were only reported by
> nfs_file_flush and nfs_fsync, so only one close() and fsync().
> 
> After that commit, setting the flag would mean that errors could be
> returned by 'write'.  So clearing as part of returning the error makes
> perfect sense.
> 
> As long as the error gets recorded, and gets returned when it is
> recorded, it doesn't much matter when the flag is cleared.  With your
> patches we don't need to flag any more to get errors reliably reported.
> 
> Leaving the flag set means that writes go more slowly - we don't get
> large queue of background rights building up but destined for failure.
> This is the main point made in the comment message when the flag was
> introduced.
> Of course, by the time we first get an error there could already
> by a large queue, so we probably want that to drain completely before
> allowing async writes again.
> 
> It might make sense to have 2 flags.  One which says "writes should be
> synchronous", another that says "There was an error recently".
> We clear the error flag before calling nfs_fsync, and if it is still
> clear afterwards, we clear the sync-writes flag.  Maybe that is more
> complex than needed though.
> 
> I'm leaning towards your suggestion that it doesn't matter very much
> when it gets cleared, and clearing it on any successful write is
> simplest.
> 
> So I'm still in favor of using nfs_context_set_write_error() in
> nfs_pageio_add_request(), primarily because it is most consistent - we
> don't need exceptions.

Thanks for taking a closer look. I can easily make the change above, and
I do think that keeping this mechanism as simple as possible will make
it easier to prevent bitrot.

That said... NFS_CONTEXT_ERROR_WRITE is a per ctx flag, and the ctx is a
per open file description object.

Is that the correct way to track this? All of the ctx's will share the
same inode. If we're getting writeback errors for one context, it's
quite likely that we'll be seeing them via others.

I suppose the counterargument is when we have things like expiring krb5
tickets. Write failures via an expiring set of creds may have no effect
on writeback via other creds.

Still, I think a per-inode flag might make more sense here.

Thoughts?
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-09-07 11:35             ` Jeff Layton
@ 2017-09-07 14:54               ` Trond Myklebust
  2017-09-11  3:24                 ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2017-09-07 14:54 UTC (permalink / raw)
  To: anna.schumaker, jlayton, neilb, jlayton; +Cc: linux-nfs, linux-fsdevel

On Thu, 2017-09-07 at 07:35 -0400, Jeff Layton wrote:
> On Thu, 2017-09-07 at 13:37 +1000, NeilBrown wrote:
> > On Tue, Aug 29 2017, Jeff Layton wrote:
> > 
> > > On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
> > > > On Mon, Aug 28 2017, Jeff Layton wrote:
> > > > 
> > > > > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
> > > > > > On Fri, Aug 25 2017, Jeff Layton wrote:
> > > > > > 
> > > > > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
> > > > > > > > From: Jeff Layton <jlayton@redhat.com>
> > > > > > > > 
> > > > > > > > There is some ambiguity in nfs about how writeback
> > > > > > > > errors are
> > > > > > > > tracked.
> > > > > > > > 
> > > > > > > > For instance, nfs_pageio_add_request calls
> > > > > > > > mapping_set_error when
> > > > > > > > the
> > > > > > > > add fails, but we track errors that occur after adding
> > > > > > > > the
> > > > > > > > request
> > > > > > > > with a dedicated int error in the open context.
> > > > > > > > 
> > > > > > > > Now that we have better infrastructure for the vfs
> > > > > > > > layer, this
> > > > > > > > latter int is now unnecessary. Just have
> > > > > > > > nfs_context_set_write_error set
> > > > > > > > the error in the mapping when one occurs.
> > > > > > > > 
> > > > > > > > Have NFS use file_write_and_wait_range to initiate and
> > > > > > > > wait on
> > > > > > > > writeback
> > > > > > > > of the data, and then check again after issuing the
> > > > > > > > commit(s).
> > > > > > > > 
> > > > > > > > With this, we also don't need to pay attention to the
> > > > > > > > ERROR_WRITE
> > > > > > > > flag for reporting, and just clear it to indicate to
> > > > > > > > subsequent
> > > > > > > > writers that they should try to go asynchronous again.
> > > > > > > > 
> > > > > > > > In nfs_page_async_flush, sample the error before
> > > > > > > > locking and
> > > > > > > > joining
> > > > > > > > the requests, and check for errors since that point.
> > > > > > > > 
> > > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > > > > ---
> > > > > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
> > > > > > > >  fs/nfs/inode.c         |  3 +--
> > > > > > > >  fs/nfs/write.c         |  8 ++++++--
> > > > > > > >  include/linux/nfs_fs.h |  1 -
> > > > > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
> > > > > > > > 
> > > > > > > > I have a baling wire and duct tape solution for testing
> > > > > > > > this with
> > > > > > > > xfstests (using iptables REJECT targets and soft
> > > > > > > > mounts). This
> > > > > > > > seems to
> > > > > > > > make nfs do the right thing.
> > > > > > > > 
> > > > > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > > > > > > index 5713eb32a45e..15d3c6faafd3 100644
> > > > > > > > --- a/fs/nfs/file.c
> > > > > > > > +++ b/fs/nfs/file.c
> > > > > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file
> > > > > > > > *file,
> > > > > > > > loff_t start, loff_t end, int datasync)
> > > > > > > >  {
> > > > > > > >  	struct nfs_open_context *ctx =
> > > > > > > > nfs_file_open_context(file);
> > > > > > > >  	struct inode *inode = file_inode(file);
> > > > > > > > -	int have_error, do_resend, status;
> > > > > > > > -	int ret = 0;
> > > > > > > > +	int do_resend, status;
> > > > > > > > +	int ret;
> > > > > > > >  
> > > > > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n",
> > > > > > > > file,
> > > > > > > > datasync);
> > > > > > > >  
> > > > > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
> > > > > > > >  	do_resend =
> > > > > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx-
> > > > > > > > >flags);
> > > > > > > > -	have_error =
> > > > > > > > test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
> > > > > > > > &ctx->flags);
> > > > > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > > > > -	have_error |=
> > > > > > > > test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > > > > flags);
> > > > > > > > 
> > > > > > > > -	if (have_error) {
> > > > > > > > -		ret = xchg(&ctx->error, 0);
> > > > > > > > -		if (ret)
> > > > > > > > -			goto out;
> > > > > > > > -	}
> > > > > > > > -	if (status < 0) {
> > > > > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > > > >flags);
> > > > > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > > > > +
> > > > > > > > +	/* Recheck and advance after the commit */
> > > > > > > > +	status = file_check_and_advance_wb_err(file);
> > > > > > 
> > > > > > This change makes the code inconsistent with the comment
> > > > > > above the
> > > > > > function, which still references ctx->error.  The intent of
> > > > > > the
> > > > > > comment
> > > > > > is still correct, but the details have changed.
> > > > > > 
> > > > > 
> > > > > Good catch. I'll fix that up in a respin.
> > > > > 
> > > > > > Also, there is a call to mapping_set_error() in
> > > > > > nfs_pageio_add_request().
> > > > > > I wonder if that should be changed to
> > > > > >   nfs_context_set_write_error(req->wb_context, desc-
> > > > > > >pg_error)
> > > > > > ??
> > > > > > 
> > > > > 
> > > > > Trickier question...
> > > > > 
> > > > > I'm not quite sure what semantics we're looking for with
> > > > > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
> > > > > synchronous, but I'm not quite sure why it gets cleared the
> > > > > way it
> > > > > does. It's set on any error but cleared before issuing a
> > > > > commit.
> > > > > 
> > > > > I added a similar flag to Ceph inodes recently, but only
> > > > > clear it when
> > > > > a write succeeds. Wouldn't that make more sense here as well?
> > > > 
> > > > It is a bit hard to wrap one's mind around.
> > > > 
> > > > In the original code (commit 7b159fc18d417980) it looks like:
> > > >  - test-and-clear bit
> > > >  - write and sync
> > > >  - test-bit
> > > > 
> > > > This does, I think, seem safer than "clear on successful write"
> > > > as the
> > > > writes could complete out-of-order and I wouldn't be surprised
> > > > if the
> > > > unsuccessful ones completed with an error before the successful
> > > > one -
> > > > particularly with an error like EDQUOT.
> > > > 
> > > > However the current code does the writes before the test-and-
> > > > clear, and
> > > > only does the commit afterwards.  That makes it less clear why
> > > > the
> > > > current sequence is a good idea.
> > > > 
> > > > However ... nfs_file_fsync_commit() is only called if
> > > > filemap_write_and_wait_range() returned with success, so we
> > > > only clear
> > > > the flag after successful writes(?).
> > > > 
> > > > Oh....
> > > > This patch from me:
> > > > 
> > > > Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error
> > > > handling.")
> > > > 
> > > > seems to have been reverted by
> > > > 
> > > > Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if
> > > > page writeback failed")
> > > > 
> > > > which probably isn't good.  It appears that this code is very
> > > > fragile
> > > > and easily broken.
> > 
> > On further investigation, I think the problem that I fixed and then
> > we
> > reintroduced will be fixed again - more permanently - by your
> > patch.
> > The root problem is that nfs keeps error codes in a different way
> > to the
> > MM core.  By unifying those, the problem goes.
> > (The specific problem is that writes which hit EDQUOT on the server
> > can
> >  report EIO on the client).
> > 
> > 
> > > > Maybe we need to work out exactly what is required, and
> > > > document it - so
> > > > we can stop breaking it.
> > > > Or maybe we need some unit tests.....
> > > > 
> > > 
> > > Yes, laying out what's necessary for this would be very helpful.
> > > We
> > > clearly want to set the flag when an error occurs. Under what
> > > circumstances should we be clearing it?
> > 
> > Well.... looking back at  7b159fc18d417980f57ae which introduced
> > the
> > flag, prior to that write errors (ctx->error) were only reported by
> > nfs_file_flush and nfs_fsync, so only one close() and fsync().
> > 
> > After that commit, setting the flag would mean that errors could be
> > returned by 'write'.  So clearing as part of returning the error
> > makes
> > perfect sense.
> > 
> > As long as the error gets recorded, and gets returned when it is
> > recorded, it doesn't much matter when the flag is cleared.  With
> > your
> > patches we don't need to flag any more to get errors reliably
> > reported.
> > 
> > Leaving the flag set means that writes go more slowly - we don't
> > get
> > large queue of background rights building up but destined for
> > failure.
> > This is the main point made in the comment message when the flag
> > was
> > introduced.
> > Of course, by the time we first get an error there could already
> > by a large queue, so we probably want that to drain completely
> > before
> > allowing async writes again.

We already have this functionality implemented in the existing code.

> > 
> > It might make sense to have 2 flags.  One which says "writes should
> > be
> > synchronous", another that says "There was an error recently".
> > We clear the error flag before calling nfs_fsync, and if it is
> > still
> > clear afterwards, we clear the sync-writes flag.  Maybe that is
> > more
> > complex than needed though.
> > 

We also need to preserve the NFS_CONTEXT_RESEND_WRITES flag. I don't
see any global mechanism that will replace that.

> > I'm leaning towards your suggestion that it doesn't matter very
> > much
> > when it gets cleared, and clearing it on any successful write is
> > simplest.
> > 
> > So I'm still in favor of using nfs_context_set_write_error() in
> > nfs_pageio_add_request(), primarily because it is most consistent -
> > we
> > don't need exceptions.
> 
> Thanks for taking a closer look. I can easily make the change above,
> and
> I do think that keeping this mechanism as simple as possible will
> make
> it easier to prevent bitrot.
> 
> That said... NFS_CONTEXT_ERROR_WRITE is a per ctx flag, and the ctx
> is a
> per open file description object.
> 
> Is that the correct way to track this? All of the ctx's will share
> the
> same inode. If we're getting writeback errors for one context, it's
> quite likely that we'll be seeing them via others.
> 
> I suppose the counterargument is when we have things like expiring
> krb5
> tickets. Write failures via an expiring set of creds may have no
> effect
> on writeback via other creds.
> 
> Still, I think a per-inode flag might make more sense here.
> 
> Thoughts?

As far as I'm concerned, that would be a regression. The most common
problem when flushing writeback data to the server aside from ENOSPC
(and possibly ESTALE) is EACCES, which is particular to the file
descriptor that opened the file.

File contexts, and NFS_CONTEXT_ERROR_WRITE solve that problem by being
private to the file descriptor.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-09-07 14:54               ` Trond Myklebust
@ 2017-09-11  3:24                 ` NeilBrown
  2017-09-11 10:46                   ` Jeff Layton
  2017-09-12  2:24                   ` [PATCH] nfs: track writeback errors with errseq_t Trond Myklebust
  0 siblings, 2 replies; 24+ messages in thread
From: NeilBrown @ 2017-09-11  3:24 UTC (permalink / raw)
  To: Trond Myklebust, anna.schumaker, jlayton, jlayton
  Cc: linux-nfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 16524 bytes --]

On Thu, Sep 07 2017, Trond Myklebust wrote:

> On Thu, 2017-09-07 at 07:35 -0400, Jeff Layton wrote:
>> On Thu, 2017-09-07 at 13:37 +1000, NeilBrown wrote:
>> > On Tue, Aug 29 2017, Jeff Layton wrote:
>> > 
>> > > On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
>> > > > On Mon, Aug 28 2017, Jeff Layton wrote:
>> > > > 
>> > > > > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
>> > > > > > On Fri, Aug 25 2017, Jeff Layton wrote:
>> > > > > > 
>> > > > > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
>> > > > > > > > From: Jeff Layton <jlayton@redhat.com>
>> > > > > > > > 
>> > > > > > > > There is some ambiguity in nfs about how writeback
>> > > > > > > > errors are
>> > > > > > > > tracked.
>> > > > > > > > 
>> > > > > > > > For instance, nfs_pageio_add_request calls
>> > > > > > > > mapping_set_error when
>> > > > > > > > the
>> > > > > > > > add fails, but we track errors that occur after adding
>> > > > > > > > the
>> > > > > > > > request
>> > > > > > > > with a dedicated int error in the open context.
>> > > > > > > > 
>> > > > > > > > Now that we have better infrastructure for the vfs
>> > > > > > > > layer, this
>> > > > > > > > latter int is now unnecessary. Just have
>> > > > > > > > nfs_context_set_write_error set
>> > > > > > > > the error in the mapping when one occurs.
>> > > > > > > > 
>> > > > > > > > Have NFS use file_write_and_wait_range to initiate and
>> > > > > > > > wait on
>> > > > > > > > writeback
>> > > > > > > > of the data, and then check again after issuing the
>> > > > > > > > commit(s).
>> > > > > > > > 
>> > > > > > > > With this, we also don't need to pay attention to the
>> > > > > > > > ERROR_WRITE
>> > > > > > > > flag for reporting, and just clear it to indicate to
>> > > > > > > > subsequent
>> > > > > > > > writers that they should try to go asynchronous again.
>> > > > > > > > 
>> > > > > > > > In nfs_page_async_flush, sample the error before
>> > > > > > > > locking and
>> > > > > > > > joining
>> > > > > > > > the requests, and check for errors since that point.
>> > > > > > > > 
>> > > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > > > > > > ---
>> > > > > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
>> > > > > > > >  fs/nfs/inode.c         |  3 +--
>> > > > > > > >  fs/nfs/write.c         |  8 ++++++--
>> > > > > > > >  include/linux/nfs_fs.h |  1 -
>> > > > > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
>> > > > > > > > 
>> > > > > > > > I have a baling wire and duct tape solution for testing
>> > > > > > > > this with
>> > > > > > > > xfstests (using iptables REJECT targets and soft
>> > > > > > > > mounts). This
>> > > > > > > > seems to
>> > > > > > > > make nfs do the right thing.
>> > > > > > > > 
>> > > > > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> > > > > > > > index 5713eb32a45e..15d3c6faafd3 100644
>> > > > > > > > --- a/fs/nfs/file.c
>> > > > > > > > +++ b/fs/nfs/file.c
>> > > > > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file
>> > > > > > > > *file,
>> > > > > > > > loff_t start, loff_t end, int datasync)
>> > > > > > > >  {
>> > > > > > > >  	struct nfs_open_context *ctx =
>> > > > > > > > nfs_file_open_context(file);
>> > > > > > > >  	struct inode *inode = file_inode(file);
>> > > > > > > > -	int have_error, do_resend, status;
>> > > > > > > > -	int ret = 0;
>> > > > > > > > +	int do_resend, status;
>> > > > > > > > +	int ret;
>> > > > > > > >  
>> > > > > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n",
>> > > > > > > > file,
>> > > > > > > > datasync);
>> > > > > > > >  
>> > > > > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>> > > > > > > >  	do_resend =
>> > > > > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx-
>> > > > > > > > >flags);
>> > > > > > > > -	have_error =
>> > > > > > > > test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
>> > > > > > > > &ctx->flags);
>> > > > > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > > > > > > -	have_error |=
>> > > > > > > > test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
>> > > > > > > > > flags);
>> > > > > > > > 
>> > > > > > > > -	if (have_error) {
>> > > > > > > > -		ret = xchg(&ctx->error, 0);
>> > > > > > > > -		if (ret)
>> > > > > > > > -			goto out;
>> > > > > > > > -	}
>> > > > > > > > -	if (status < 0) {
>> > > > > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
>> > > > > > > > >flags);
>> > > > > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > > > > > > +
>> > > > > > > > +	/* Recheck and advance after the commit */
>> > > > > > > > +	status = file_check_and_advance_wb_err(file);
>> > > > > > 
>> > > > > > This change makes the code inconsistent with the comment
>> > > > > > above the
>> > > > > > function, which still references ctx->error.  The intent of
>> > > > > > the
>> > > > > > comment
>> > > > > > is still correct, but the details have changed.
>> > > > > > 
>> > > > > 
>> > > > > Good catch. I'll fix that up in a respin.
>> > > > > 
>> > > > > > Also, there is a call to mapping_set_error() in
>> > > > > > nfs_pageio_add_request().
>> > > > > > I wonder if that should be changed to
>> > > > > >   nfs_context_set_write_error(req->wb_context, desc-
>> > > > > > >pg_error)
>> > > > > > ??
>> > > > > > 
>> > > > > 
>> > > > > Trickier question...
>> > > > > 
>> > > > > I'm not quite sure what semantics we're looking for with
>> > > > > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
>> > > > > synchronous, but I'm not quite sure why it gets cleared the
>> > > > > way it
>> > > > > does. It's set on any error but cleared before issuing a
>> > > > > commit.
>> > > > > 
>> > > > > I added a similar flag to Ceph inodes recently, but only
>> > > > > clear it when
>> > > > > a write succeeds. Wouldn't that make more sense here as well?
>> > > > 
>> > > > It is a bit hard to wrap one's mind around.
>> > > > 
>> > > > In the original code (commit 7b159fc18d417980) it looks like:
>> > > >  - test-and-clear bit
>> > > >  - write and sync
>> > > >  - test-bit
>> > > > 
>> > > > This does, I think, seem safer than "clear on successful write"
>> > > > as the
>> > > > writes could complete out-of-order and I wouldn't be surprised
>> > > > if the
>> > > > unsuccessful ones completed with an error before the successful
>> > > > one -
>> > > > particularly with an error like EDQUOT.
>> > > > 
>> > > > However the current code does the writes before the test-and-
>> > > > clear, and
>> > > > only does the commit afterwards.  That makes it less clear why
>> > > > the
>> > > > current sequence is a good idea.
>> > > > 
>> > > > However ... nfs_file_fsync_commit() is only called if
>> > > > filemap_write_and_wait_range() returned with success, so we
>> > > > only clear
>> > > > the flag after successful writes(?).
>> > > > 
>> > > > Oh....
>> > > > This patch from me:
>> > > > 
>> > > > Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error
>> > > > handling.")
>> > > > 
>> > > > seems to have been reverted by
>> > > > 
>> > > > Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if
>> > > > page writeback failed")
>> > > > 
>> > > > which probably isn't good.  It appears that this code is very
>> > > > fragile
>> > > > and easily broken.
>> > 
>> > On further investigation, I think the problem that I fixed and then
>> > we
>> > reintroduced will be fixed again - more permanently - by your
>> > patch.
>> > The root problem is that nfs keeps error codes in a different way
>> > to the
>> > MM core.  By unifying those, the problem goes.
>> > (The specific problem is that writes which hit EDQUOT on the server
>> > can
>> >  report EIO on the client).
>> > 
>> > 
>> > > > Maybe we need to work out exactly what is required, and
>> > > > document it - so
>> > > > we can stop breaking it.
>> > > > Or maybe we need some unit tests.....
>> > > > 
>> > > 
>> > > Yes, laying out what's necessary for this would be very helpful.
>> > > We
>> > > clearly want to set the flag when an error occurs. Under what
>> > > circumstances should we be clearing it?
>> > 
>> > Well.... looking back at  7b159fc18d417980f57ae which introduced
>> > the
>> > flag, prior to that write errors (ctx->error) were only reported by
>> > nfs_file_flush and nfs_fsync, so only one close() and fsync().
>> > 
>> > After that commit, setting the flag would mean that errors could be
>> > returned by 'write'.  So clearing as part of returning the error
>> > makes
>> > perfect sense.
>> > 
>> > As long as the error gets recorded, and gets returned when it is
>> > recorded, it doesn't much matter when the flag is cleared.  With
>> > your
>> > patches we don't need to flag any more to get errors reliably
>> > reported.
>> > 
>> > Leaving the flag set means that writes go more slowly - we don't
>> > get
>> > large queue of background rights building up but destined for
>> > failure.
>> > This is the main point made in the comment message when the flag
>> > was
>> > introduced.
>> > Of course, by the time we first get an error there could already
>> > by a large queue, so we probably want that to drain completely
>> > before
>> > allowing async writes again.
>
> We already have this functionality implemented in the existing code.
>
>> > 
>> > It might make sense to have 2 flags.  One which says "writes should
>> > be
>> > synchronous", another that says "There was an error recently".
>> > We clear the error flag before calling nfs_fsync, and if it is
>> > still
>> > clear afterwards, we clear the sync-writes flag.  Maybe that is
>> > more
>> > complex than needed though.
>> > 
>
> We also need to preserve the NFS_CONTEXT_RESEND_WRITES flag. I don't
> see any global mechanism that will replace that.
>
>> > I'm leaning towards your suggestion that it doesn't matter very
>> > much
>> > when it gets cleared, and clearing it on any successful write is
>> > simplest.
>> > 
>> > So I'm still in favor of using nfs_context_set_write_error() in
>> > nfs_pageio_add_request(), primarily because it is most consistent -
>> > we
>> > don't need exceptions.
>> 
>> Thanks for taking a closer look. I can easily make the change above,
>> and
>> I do think that keeping this mechanism as simple as possible will
>> make
>> it easier to prevent bitrot.
>> 
>> That said... NFS_CONTEXT_ERROR_WRITE is a per ctx flag, and the ctx
>> is a
>> per open file description object.
>> 
>> Is that the correct way to track this? All of the ctx's will share
>> the
>> same inode. If we're getting writeback errors for one context, it's
>> quite likely that we'll be seeing them via others.
>> 
>> I suppose the counterargument is when we have things like expiring
>> krb5
>> tickets. Write failures via an expiring set of creds may have no
>> effect
>> on writeback via other creds.
>> 
>> Still, I think a per-inode flag might make more sense here.
>> 
>> Thoughts?
>
> As far as I'm concerned, that would be a regression. The most common
> problem when flushing writeback data to the server aside from ENOSPC
> (and possibly ESTALE) is EACCES, which is particular to the file
> descriptor that opened the file.
>
> File contexts, and NFS_CONTEXT_ERROR_WRITE solve that problem by being
> private to the file descriptor.

Thanks for the reminder that errors are per-context and this patch drops
this.  The per-context nature of errors in NFS was the reason that I
nagged Jeff to make errseq_t a stand-alone type rather than just a part
of address_space.  I had envisaged that it would be embedded in the
open_context as well.
We still could do that, but as there is precisely one open-file for each
open_context, the gains are not great.

However, while looking over the code to make sure I really understood it
and all the possible consequences of changing to errseq_t I found a few
anomalies.  The patch below addresses them all.

Would you see if they may sense to you?

Thanks,
NeilBrown


From: NeilBrown <neilb@suse.com>
Date: Mon, 11 Sep 2017 13:15:50 +1000
Subject: [PATCH] NFS: various changes relating to reporting IO errors.

1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
   They aren't used.

2/ Make nfs_context_set_write_error() a "static inline" in internal.h
   so we can...

3/ Use nfs_context_set_write_error() instead of mapping_set_error()
   if nfs_pageio_add_request() fails before sending any request.
   NFS generally keeps errors in the open_context, not the mapping,
   so this is more consistent.

4/ If filemap_write_and_write_range() reports any error, still
   check ctx->error.  The value in ctx->error is likely to be
   more useful.  As part of this, NFS_CONTEXT_ERROR_WRITE is
   cleared slightly earlier, before nfs_file_fsync_commit() is called,
   rather than at the start of that function.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/nfs/file.c     | 16 ++++++++++------
 fs/nfs/internal.h |  7 +++++++
 fs/nfs/pagelist.c |  4 ++--
 fs/nfs/write.c    |  7 -------
 4 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index af330c31f627..ab324f14081f 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -208,21 +208,19 @@ EXPORT_SYMBOL_GPL(nfs_file_mmap);
  * fall back to doing a synchronous write.
  */
 static int
-nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end, int datasync)
+nfs_file_fsync_commit(struct file *file, int datasync)
 {
 	struct nfs_open_context *ctx = nfs_file_open_context(file);
 	struct inode *inode = file_inode(file);
-	int have_error, do_resend, status;
+	int do_resend, status;
 	int ret = 0;
 
 	dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
 
 	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
 	do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
-	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
 	status = nfs_commit_inode(inode, FLUSH_SYNC);
-	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
-	if (have_error) {
+	if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
 		ret = xchg(&ctx->error, 0);
 		if (ret)
 			goto out;
@@ -247,10 +245,16 @@ nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	trace_nfs_fsync_enter(inode);
 
 	do {
+		struct nfs_open_context *ctx = nfs_file_open_context(file);
 		ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
+		if (test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
+			int ret2 = xchg(&ctx->error, 0);
+			if (ret2)
+				ret = ret2;
+		}
 		if (ret != 0)
 			break;
-		ret = nfs_file_fsync_commit(file, start, end, datasync);
+		ret = nfs_file_fsync_commit(file, datasync);
 		if (!ret)
 			ret = pnfs_sync_inode(inode, !!datasync);
 		/*
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index dc456416d2be..44c8962fec91 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -769,3 +769,10 @@ static inline bool nfs_error_is_fatal(int err)
 		return false;
 	}
 }
+
+static inline void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
+{
+	ctx->error = error;
+	smp_wmb();
+	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
+}
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index de9066a92c0d..0ebd26b9a6bd 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -1198,8 +1198,8 @@ out_failed:
 
 		/* remember fatal errors */
 		if (nfs_error_is_fatal(desc->pg_error))
-			mapping_set_error(desc->pg_inode->i_mapping,
-					  desc->pg_error);
+			nfs_context_set_write_error(req->wb_context,
+						    desc->pg_error);
 
 		func = desc->pg_completion_ops->error_cleanup;
 		for (midx = 0; midx < desc->pg_mirror_count; midx++) {
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b1af5dee5e0a..f702bf2def79 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -147,13 +147,6 @@ static void nfs_io_completion_put(struct nfs_io_completion *ioc)
 		kref_put(&ioc->refcount, nfs_io_completion_release);
 }
 
-static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
-{
-	ctx->error = error;
-	smp_wmb();
-	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
-}
-
 /*
  * nfs_page_find_head_request_locked - find head request associated with @page
  *
-- 
2.14.0.rc0.dirty


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-09-11  3:24                 ` NeilBrown
@ 2017-09-11 10:46                   ` Jeff Layton
  2017-09-11 21:52                     ` NeilBrown
  2017-09-12  2:24                   ` [PATCH] nfs: track writeback errors with errseq_t Trond Myklebust
  1 sibling, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2017-09-11 10:46 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, anna.schumaker, jlayton
  Cc: linux-nfs, linux-fsdevel

On Mon, 2017-09-11 at 13:24 +1000, NeilBrown wrote:
> On Thu, Sep 07 2017, Trond Myklebust wrote:
> 
> > On Thu, 2017-09-07 at 07:35 -0400, Jeff Layton wrote:
> > > On Thu, 2017-09-07 at 13:37 +1000, NeilBrown wrote:
> > > > On Tue, Aug 29 2017, Jeff Layton wrote:
> > > > 
> > > > > On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
> > > > > > On Mon, Aug 28 2017, Jeff Layton wrote:
> > > > > > 
> > > > > > > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
> > > > > > > > On Fri, Aug 25 2017, Jeff Layton wrote:
> > > > > > > > 
> > > > > > > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
> > > > > > > > > > From: Jeff Layton <jlayton@redhat.com>
> > > > > > > > > > 
> > > > > > > > > > There is some ambiguity in nfs about how writeback
> > > > > > > > > > errors are
> > > > > > > > > > tracked.
> > > > > > > > > > 
> > > > > > > > > > For instance, nfs_pageio_add_request calls
> > > > > > > > > > mapping_set_error when
> > > > > > > > > > the
> > > > > > > > > > add fails, but we track errors that occur after adding
> > > > > > > > > > the
> > > > > > > > > > request
> > > > > > > > > > with a dedicated int error in the open context.
> > > > > > > > > > 
> > > > > > > > > > Now that we have better infrastructure for the vfs
> > > > > > > > > > layer, this
> > > > > > > > > > latter int is now unnecessary. Just have
> > > > > > > > > > nfs_context_set_write_error set
> > > > > > > > > > the error in the mapping when one occurs.
> > > > > > > > > > 
> > > > > > > > > > Have NFS use file_write_and_wait_range to initiate and
> > > > > > > > > > wait on
> > > > > > > > > > writeback
> > > > > > > > > > of the data, and then check again after issuing the
> > > > > > > > > > commit(s).
> > > > > > > > > > 
> > > > > > > > > > With this, we also don't need to pay attention to the
> > > > > > > > > > ERROR_WRITE
> > > > > > > > > > flag for reporting, and just clear it to indicate to
> > > > > > > > > > subsequent
> > > > > > > > > > writers that they should try to go asynchronous again.
> > > > > > > > > > 
> > > > > > > > > > In nfs_page_async_flush, sample the error before
> > > > > > > > > > locking and
> > > > > > > > > > joining
> > > > > > > > > > the requests, and check for errors since that point.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
> > > > > > > > > >  fs/nfs/inode.c         |  3 +--
> > > > > > > > > >  fs/nfs/write.c         |  8 ++++++--
> > > > > > > > > >  include/linux/nfs_fs.h |  1 -
> > > > > > > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
> > > > > > > > > > 
> > > > > > > > > > I have a baling wire and duct tape solution for testing
> > > > > > > > > > this with
> > > > > > > > > > xfstests (using iptables REJECT targets and soft
> > > > > > > > > > mounts). This
> > > > > > > > > > seems to
> > > > > > > > > > make nfs do the right thing.
> > > > > > > > > > 
> > > > > > > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > > > > > > > > index 5713eb32a45e..15d3c6faafd3 100644
> > > > > > > > > > --- a/fs/nfs/file.c
> > > > > > > > > > +++ b/fs/nfs/file.c
> > > > > > > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file
> > > > > > > > > > *file,
> > > > > > > > > > loff_t start, loff_t end, int datasync)
> > > > > > > > > >  {
> > > > > > > > > >  	struct nfs_open_context *ctx =
> > > > > > > > > > nfs_file_open_context(file);
> > > > > > > > > >  	struct inode *inode = file_inode(file);
> > > > > > > > > > -	int have_error, do_resend, status;
> > > > > > > > > > -	int ret = 0;
> > > > > > > > > > +	int do_resend, status;
> > > > > > > > > > +	int ret;
> > > > > > > > > >  
> > > > > > > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n",
> > > > > > > > > > file,
> > > > > > > > > > datasync);
> > > > > > > > > >  
> > > > > > > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
> > > > > > > > > >  	do_resend =
> > > > > > > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx-
> > > > > > > > > > > flags);
> > > > > > > > > > 
> > > > > > > > > > -	have_error =
> > > > > > > > > > test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
> > > > > > > > > > &ctx->flags);
> > > > > > > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > > > > > > -	have_error |=
> > > > > > > > > > test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > > > > > > flags);
> > > > > > > > > > 
> > > > > > > > > > -	if (have_error) {
> > > > > > > > > > -		ret = xchg(&ctx->error, 0);
> > > > > > > > > > -		if (ret)
> > > > > > > > > > -			goto out;
> > > > > > > > > > -	}
> > > > > > > > > > -	if (status < 0) {
> > > > > > > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > > > > > > flags);
> > > > > > > > > > 
> > > > > > > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > > > > > > +
> > > > > > > > > > +	/* Recheck and advance after the commit */
> > > > > > > > > > +	status = file_check_and_advance_wb_err(file);
> > > > > > > > 
> > > > > > > > This change makes the code inconsistent with the comment
> > > > > > > > above the
> > > > > > > > function, which still references ctx->error.  The intent of
> > > > > > > > the
> > > > > > > > comment
> > > > > > > > is still correct, but the details have changed.
> > > > > > > > 
> > > > > > > 
> > > > > > > Good catch. I'll fix that up in a respin.
> > > > > > > 
> > > > > > > > Also, there is a call to mapping_set_error() in
> > > > > > > > nfs_pageio_add_request().
> > > > > > > > I wonder if that should be changed to
> > > > > > > >   nfs_context_set_write_error(req->wb_context, desc-
> > > > > > > > > pg_error)
> > > > > > > > 
> > > > > > > > ??
> > > > > > > > 
> > > > > > > 
> > > > > > > Trickier question...
> > > > > > > 
> > > > > > > I'm not quite sure what semantics we're looking for with
> > > > > > > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
> > > > > > > synchronous, but I'm not quite sure why it gets cleared the
> > > > > > > way it
> > > > > > > does. It's set on any error but cleared before issuing a
> > > > > > > commit.
> > > > > > > 
> > > > > > > I added a similar flag to Ceph inodes recently, but only
> > > > > > > clear it when
> > > > > > > a write succeeds. Wouldn't that make more sense here as well?
> > > > > > 
> > > > > > It is a bit hard to wrap one's mind around.
> > > > > > 
> > > > > > In the original code (commit 7b159fc18d417980) it looks like:
> > > > > >  - test-and-clear bit
> > > > > >  - write and sync
> > > > > >  - test-bit
> > > > > > 
> > > > > > This does, I think, seem safer than "clear on successful write"
> > > > > > as the
> > > > > > writes could complete out-of-order and I wouldn't be surprised
> > > > > > if the
> > > > > > unsuccessful ones completed with an error before the successful
> > > > > > one -
> > > > > > particularly with an error like EDQUOT.
> > > > > > 
> > > > > > However the current code does the writes before the test-and-
> > > > > > clear, and
> > > > > > only does the commit afterwards.  That makes it less clear why
> > > > > > the
> > > > > > current sequence is a good idea.
> > > > > > 
> > > > > > However ... nfs_file_fsync_commit() is only called if
> > > > > > filemap_write_and_wait_range() returned with success, so we
> > > > > > only clear
> > > > > > the flag after successful writes(?).
> > > > > > 
> > > > > > Oh....
> > > > > > This patch from me:
> > > > > > 
> > > > > > Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error
> > > > > > handling.")
> > > > > > 
> > > > > > seems to have been reverted by
> > > > > > 
> > > > > > Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if
> > > > > > page writeback failed")
> > > > > > 
> > > > > > which probably isn't good.  It appears that this code is very
> > > > > > fragile
> > > > > > and easily broken.
> > > > 
> > > > On further investigation, I think the problem that I fixed and then
> > > > we
> > > > reintroduced will be fixed again - more permanently - by your
> > > > patch.
> > > > The root problem is that nfs keeps error codes in a different way
> > > > to the
> > > > MM core.  By unifying those, the problem goes.
> > > > (The specific problem is that writes which hit EDQUOT on the server
> > > > can
> > > >  report EIO on the client).
> > > > 
> > > > 
> > > > > > Maybe we need to work out exactly what is required, and
> > > > > > document it - so
> > > > > > we can stop breaking it.
> > > > > > Or maybe we need some unit tests.....
> > > > > > 
> > > > > 
> > > > > Yes, laying out what's necessary for this would be very helpful.
> > > > > We
> > > > > clearly want to set the flag when an error occurs. Under what
> > > > > circumstances should we be clearing it?
> > > > 
> > > > Well.... looking back at  7b159fc18d417980f57ae which introduced
> > > > the
> > > > flag, prior to that write errors (ctx->error) were only reported by
> > > > nfs_file_flush and nfs_fsync, so only one close() and fsync().
> > > > 
> > > > After that commit, setting the flag would mean that errors could be
> > > > returned by 'write'.  So clearing as part of returning the error
> > > > makes
> > > > perfect sense.
> > > > 
> > > > As long as the error gets recorded, and gets returned when it is
> > > > recorded, it doesn't much matter when the flag is cleared.  With
> > > > your
> > > > patches we don't need to flag any more to get errors reliably
> > > > reported.
> > > > 
> > > > Leaving the flag set means that writes go more slowly - we don't
> > > > get
> > > > large queue of background rights building up but destined for
> > > > failure.
> > > > This is the main point made in the comment message when the flag
> > > > was
> > > > introduced.
> > > > Of course, by the time we first get an error there could already
> > > > by a large queue, so we probably want that to drain completely
> > > > before
> > > > allowing async writes again.
> > 
> > We already have this functionality implemented in the existing code.
> > 
> > > > 
> > > > It might make sense to have 2 flags.  One which says "writes should
> > > > be
> > > > synchronous", another that says "There was an error recently".
> > > > We clear the error flag before calling nfs_fsync, and if it is
> > > > still
> > > > clear afterwards, we clear the sync-writes flag.  Maybe that is
> > > > more
> > > > complex than needed though.
> > > > 
> > 
> > We also need to preserve the NFS_CONTEXT_RESEND_WRITES flag. I don't
> > see any global mechanism that will replace that.
> > 
> > > > I'm leaning towards your suggestion that it doesn't matter very
> > > > much
> > > > when it gets cleared, and clearing it on any successful write is
> > > > simplest.
> > > > 
> > > > So I'm still in favor of using nfs_context_set_write_error() in
> > > > nfs_pageio_add_request(), primarily because it is most consistent -
> > > > we
> > > > don't need exceptions.
> > > 
> > > Thanks for taking a closer look. I can easily make the change above,
> > > and
> > > I do think that keeping this mechanism as simple as possible will
> > > make
> > > it easier to prevent bitrot.
> > > 
> > > That said... NFS_CONTEXT_ERROR_WRITE is a per ctx flag, and the ctx
> > > is a
> > > per open file description object.
> > > 
> > > Is that the correct way to track this? All of the ctx's will share
> > > the
> > > same inode. If we're getting writeback errors for one context, it's
> > > quite likely that we'll be seeing them via others.
> > > 
> > > I suppose the counterargument is when we have things like expiring
> > > krb5
> > > tickets. Write failures via an expiring set of creds may have no
> > > effect
> > > on writeback via other creds.
> > > 
> > > Still, I think a per-inode flag might make more sense here.
> > > 
> > > Thoughts?
> > 
> > As far as I'm concerned, that would be a regression. The most common
> > problem when flushing writeback data to the server aside from ENOSPC
> > (and possibly ESTALE) is EACCES, which is particular to the file
> > descriptor that opened the file.
> > 
> > File contexts, and NFS_CONTEXT_ERROR_WRITE solve that problem by being
> > private to the file descriptor.
> 
> Thanks for the reminder that errors are per-context and this patch drops
> this.  The per-context nature of errors in NFS was the reason that I
> nagged Jeff to make errseq_t a stand-alone type rather than just a part
> of address_space.  I had envisaged that it would be embedded in the
> open_context as well.
> We still could do that, but as there is precisely one open-file for each
> open_context, the gains are not great.
> 
> However, while looking over the code to make sure I really understood it
> and all the possible consequences of changing to errseq_t I found a few
> anomalies.  The patch below addresses them all.
> 
> Would you see if they may sense to you?
> 
> Thanks,
> NeilBrown
> 
> 
> From: NeilBrown <neilb@suse.com>
> Date: Mon, 11 Sep 2017 13:15:50 +1000
> Subject: [PATCH] NFS: various changes relating to reporting IO errors.
> 
> 1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
>    They aren't used.
> 
> 2/ Make nfs_context_set_write_error() a "static inline" in internal.h
>    so we can...
> 
> 3/ Use nfs_context_set_write_error() instead of mapping_set_error()
>    if nfs_pageio_add_request() fails before sending any request.
>    NFS generally keeps errors in the open_context, not the mapping,
>    so this is more consistent.
> 
> 4/ If filemap_write_and_write_range() reports any error, still
>    check ctx->error.  The value in ctx->error is likely to be
>    more useful.  As part of this, NFS_CONTEXT_ERROR_WRITE is
>    cleared slightly earlier, before nfs_file_fsync_commit() is called,
>    rather than at the start of that function.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/nfs/file.c     | 16 ++++++++++------
>  fs/nfs/internal.h |  7 +++++++
>  fs/nfs/pagelist.c |  4 ++--
>  fs/nfs/write.c    |  7 -------
>  4 files changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index af330c31f627..ab324f14081f 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -208,21 +208,19 @@ EXPORT_SYMBOL_GPL(nfs_file_mmap);
>   * fall back to doing a synchronous write.
>   */
>  static int
> -nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end, int datasync)
> +nfs_file_fsync_commit(struct file *file, int datasync)
>  {
>  	struct nfs_open_context *ctx = nfs_file_open_context(file);
>  	struct inode *inode = file_inode(file);
> -	int have_error, do_resend, status;
> +	int do_resend, status;
>  	int ret = 0;
>  
>  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
>  
>  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>  	do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
> -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>  	status = nfs_commit_inode(inode, FLUSH_SYNC);
> -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> -	if (have_error) {
> +	if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
>  		ret = xchg(&ctx->error, 0);
>  		if (ret)
>  			goto out;
> @@ -247,10 +245,16 @@ nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>  	trace_nfs_fsync_enter(inode);
>  
>  	do {
> +		struct nfs_open_context *ctx = nfs_file_open_context(file);
>  		ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> +		if (test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
> +			int ret2 = xchg(&ctx->error, 0);
> +			if (ret2)
> +				ret = ret2;
> +		}
>  		if (ret != 0)
>  			break;
> -		ret = nfs_file_fsync_commit(file, start, end, datasync);
> +		ret = nfs_file_fsync_commit(file, datasync);
>  		if (!ret)
>  			ret = pnfs_sync_inode(inode, !!datasync);
>  		/*
> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> index dc456416d2be..44c8962fec91 100644
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -769,3 +769,10 @@ static inline bool nfs_error_is_fatal(int err)
>  		return false;
>  	}
>  }
> +
> +static inline void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
> +{
> +	ctx->error = error;
> +	smp_wmb();
> +	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> +}
> diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
> index de9066a92c0d..0ebd26b9a6bd 100644
> --- a/fs/nfs/pagelist.c
> +++ b/fs/nfs/pagelist.c
> @@ -1198,8 +1198,8 @@ out_failed:
>  
>  		/* remember fatal errors */
>  		if (nfs_error_is_fatal(desc->pg_error))
> -			mapping_set_error(desc->pg_inode->i_mapping,
> -					  desc->pg_error);
> +			nfs_context_set_write_error(req->wb_context,
> +						    desc->pg_error);
>  
>  		func = desc->pg_completion_ops->error_cleanup;
>  		for (midx = 0; midx < desc->pg_mirror_count; midx++) {
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b1af5dee5e0a..f702bf2def79 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -147,13 +147,6 @@ static void nfs_io_completion_put(struct nfs_io_completion *ioc)
>  		kref_put(&ioc->refcount, nfs_io_completion_release);
>  }
>  
> -static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
> -{
> -	ctx->error = error;
> -	smp_wmb();
> -	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> -}
> -
>  /*
>   * nfs_page_find_head_request_locked - find head request associated with @page
>   *

This should probably be broken out into at least a 2-3 different
patches.

Ok, so to make sure I understand:

All writeback is done under the aegis of an open context, and writes
from different open contexts are not mergeable. We also flush to the
server in the case that a dirty page is written via an incompatible open
context. So with that we can always tie

In that case, yes...mixing in errseq_t doesn't really buy us much here,
and I agree with most of the changes above.

That said...I'm still not thrilled with how NFS_CONTEXT_ERROR_WRITE is
handled in this code. That flag is set when a write fails, but is only
cleared on fsync.

That seems wrong to me. Why wait for an fsync to start doing async
writes again once they start working? What if the application never does
an fsync? Clearing that flag on a successful WRITE seems like it'd make
more sense.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-09-11 10:46                   ` Jeff Layton
@ 2017-09-11 21:52                     ` NeilBrown
  2017-09-12 15:20                       ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2017-09-11 21:52 UTC (permalink / raw)
  To: Jeff Layton, Trond Myklebust, anna.schumaker, jlayton
  Cc: linux-nfs, linux-fsdevel


> On Mon, 2017-09-11 at 13:24 +1000, NeilBrown wrote:
>> On Thu, Sep 07 2017, Trond Myklebust wrote:
>> 
>> > On Thu, 2017-09-07 at 07:35 -0400, Jeff Layton wrote:
>> > > On Thu, 2017-09-07 at 13:37 +1000, NeilBrown wrote:
>> > > > On Tue, Aug 29 2017, Jeff Layton wrote:
>> > > > 
>> > > > > On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
>> > > > > > On Mon, Aug 28 2017, Jeff Layton wrote:
>> > > > > > 
>> > > > > > > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
>> > > > > > > > On Fri, Aug 25 2017, Jeff Layton wrote:
>> > > > > > > > 
>> > > > > > > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
>> > > > > > > > > > From: Jeff Layton <jlayton@redhat.com>
>> > > > > > > > > > 
>> > > > > > > > > > There is some ambiguity in nfs about how writeback
>> > > > > > > > > > errors are
>> > > > > > > > > > tracked.
>> > > > > > > > > > 
>> > > > > > > > > > For instance, nfs_pageio_add_request calls
>> > > > > > > > > > mapping_set_error when
>> > > > > > > > > > the
>> > > > > > > > > > add fails, but we track errors that occur after adding
>> > > > > > > > > > the
>> > > > > > > > > > request
>> > > > > > > > > > with a dedicated int error in the open context.
>> > > > > > > > > > 
>> > > > > > > > > > Now that we have better infrastructure for the vfs
>> > > > > > > > > > layer, this
>> > > > > > > > > > latter int is now unnecessary. Just have
>> > > > > > > > > > nfs_context_set_write_error set
>> > > > > > > > > > the error in the mapping when one occurs.
>> > > > > > > > > > 
>> > > > > > > > > > Have NFS use file_write_and_wait_range to initiate and
>> > > > > > > > > > wait on
>> > > > > > > > > > writeback
>> > > > > > > > > > of the data, and then check again after issuing the
>> > > > > > > > > > commit(s).
>> > > > > > > > > > 
>> > > > > > > > > > With this, we also don't need to pay attention to the
>> > > > > > > > > > ERROR_WRITE
>> > > > > > > > > > flag for reporting, and just clear it to indicate to
>> > > > > > > > > > subsequent
>> > > > > > > > > > writers that they should try to go asynchronous again.
>> > > > > > > > > > 
>> > > > > > > > > > In nfs_page_async_flush, sample the error before
>> > > > > > > > > > locking and
>> > > > > > > > > > joining
>> > > > > > > > > > the requests, and check for errors since that point.
>> > > > > > > > > > 
>> > > > > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > > > > > > > > ---
>> > > > > > > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
>> > > > > > > > > >  fs/nfs/inode.c         |  3 +--
>> > > > > > > > > >  fs/nfs/write.c         |  8 ++++++--
>> > > > > > > > > >  include/linux/nfs_fs.h |  1 -
>> > > > > > > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
>> > > > > > > > > > 
>> > > > > > > > > > I have a baling wire and duct tape solution for testing
>> > > > > > > > > > this with
>> > > > > > > > > > xfstests (using iptables REJECT targets and soft
>> > > > > > > > > > mounts). This
>> > > > > > > > > > seems to
>> > > > > > > > > > make nfs do the right thing.
>> > > > > > > > > > 
>> > > > > > > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> > > > > > > > > > index 5713eb32a45e..15d3c6faafd3 100644
>> > > > > > > > > > --- a/fs/nfs/file.c
>> > > > > > > > > > +++ b/fs/nfs/file.c
>> > > > > > > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file
>> > > > > > > > > > *file,
>> > > > > > > > > > loff_t start, loff_t end, int datasync)
>> > > > > > > > > >  {
>> > > > > > > > > >  	struct nfs_open_context *ctx =
>> > > > > > > > > > nfs_file_open_context(file);
>> > > > > > > > > >  	struct inode *inode = file_inode(file);
>> > > > > > > > > > -	int have_error, do_resend, status;
>> > > > > > > > > > -	int ret = 0;
>> > > > > > > > > > +	int do_resend, status;
>> > > > > > > > > > +	int ret;
>> > > > > > > > > >  
>> > > > > > > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n",
>> > > > > > > > > > file,
>> > > > > > > > > > datasync);
>> > > > > > > > > >  
>> > > > > > > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>> > > > > > > > > >  	do_resend =
>> > > > > > > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx-
>> > > > > > > > > > > flags);
>> > > > > > > > > > 
>> > > > > > > > > > -	have_error =
>> > > > > > > > > > test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
>> > > > > > > > > > &ctx->flags);
>> > > > > > > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > > > > > > > > -	have_error |=
>> > > > > > > > > > test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
>> > > > > > > > > > > flags);
>> > > > > > > > > > 
>> > > > > > > > > > -	if (have_error) {
>> > > > > > > > > > -		ret = xchg(&ctx->error, 0);
>> > > > > > > > > > -		if (ret)
>> > > > > > > > > > -			goto out;
>> > > > > > > > > > -	}
>> > > > > > > > > > -	if (status < 0) {
>> > > > > > > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
>> > > > > > > > > > > flags);
>> > > > > > > > > > 
>> > > > > > > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > > > > > > > > +
>> > > > > > > > > > +	/* Recheck and advance after the commit */
>> > > > > > > > > > +	status = file_check_and_advance_wb_err(file);
>> > > > > > > > 
>> > > > > > > > This change makes the code inconsistent with the comment
>> > > > > > > > above the
>> > > > > > > > function, which still references ctx->error.  The intent of
>> > > > > > > > the
>> > > > > > > > comment
>> > > > > > > > is still correct, but the details have changed.
>> > > > > > > > 
>> > > > > > > 
>> > > > > > > Good catch. I'll fix that up in a respin.
>> > > > > > > 
>> > > > > > > > Also, there is a call to mapping_set_error() in
>> > > > > > > > nfs_pageio_add_request().
>> > > > > > > > I wonder if that should be changed to
>> > > > > > > >   nfs_context_set_write_error(req->wb_context, desc-
>> > > > > > > > > pg_error)
>> > > > > > > > 
>> > > > > > > > ??
>> > > > > > > > 
>> > > > > > > 
>> > > > > > > Trickier question...
>> > > > > > > 
>> > > > > > > I'm not quite sure what semantics we're looking for with
>> > > > > > > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
>> > > > > > > synchronous, but I'm not quite sure why it gets cleared the
>> > > > > > > way it
>> > > > > > > does. It's set on any error but cleared before issuing a
>> > > > > > > commit.
>> > > > > > > 
>> > > > > > > I added a similar flag to Ceph inodes recently, but only
>> > > > > > > clear it when
>> > > > > > > a write succeeds. Wouldn't that make more sense here as well?
>> > > > > > 
>> > > > > > It is a bit hard to wrap one's mind around.
>> > > > > > 
>> > > > > > In the original code (commit 7b159fc18d417980) it looks like:
>> > > > > >  - test-and-clear bit
>> > > > > >  - write and sync
>> > > > > >  - test-bit
>> > > > > > 
>> > > > > > This does, I think, seem safer than "clear on successful write"
>> > > > > > as the
>> > > > > > writes could complete out-of-order and I wouldn't be surprised
>> > > > > > if the
>> > > > > > unsuccessful ones completed with an error before the successful
>> > > > > > one -
>> > > > > > particularly with an error like EDQUOT.
>> > > > > > 
>> > > > > > However the current code does the writes before the test-and-
>> > > > > > clear, and
>> > > > > > only does the commit afterwards.  That makes it less clear why
>> > > > > > the
>> > > > > > current sequence is a good idea.
>> > > > > > 
>> > > > > > However ... nfs_file_fsync_commit() is only called if
>> > > > > > filemap_write_and_wait_range() returned with success, so we
>> > > > > > only clear
>> > > > > > the flag after successful writes(?).
>> > > > > > 
>> > > > > > Oh....
>> > > > > > This patch from me:
>> > > > > > 
>> > > > > > Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error
>> > > > > > handling.")
>> > > > > > 
>> > > > > > seems to have been reverted by
>> > > > > > 
>> > > > > > Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if
>> > > > > > page writeback failed")
>> > > > > > 
>> > > > > > which probably isn't good.  It appears that this code is very
>> > > > > > fragile
>> > > > > > and easily broken.
>> > > > 
>> > > > On further investigation, I think the problem that I fixed and then
>> > > > we
>> > > > reintroduced will be fixed again - more permanently - by your
>> > > > patch.
>> > > > The root problem is that nfs keeps error codes in a different way
>> > > > to the
>> > > > MM core.  By unifying those, the problem goes.
>> > > > (The specific problem is that writes which hit EDQUOT on the server
>> > > > can
>> > > >  report EIO on the client).
>> > > > 
>> > > > 
>> > > > > > Maybe we need to work out exactly what is required, and
>> > > > > > document it - so
>> > > > > > we can stop breaking it.
>> > > > > > Or maybe we need some unit tests.....
>> > > > > > 
>> > > > > 
>> > > > > Yes, laying out what's necessary for this would be very helpful.
>> > > > > We
>> > > > > clearly want to set the flag when an error occurs. Under what
>> > > > > circumstances should we be clearing it?
>> > > > 
>> > > > Well.... looking back at  7b159fc18d417980f57ae which introduced
>> > > > the
>> > > > flag, prior to that write errors (ctx->error) were only reported by
>> > > > nfs_file_flush and nfs_fsync, so only one close() and fsync().
>> > > > 
>> > > > After that commit, setting the flag would mean that errors could be
>> > > > returned by 'write'.  So clearing as part of returning the error
>> > > > makes
>> > > > perfect sense.
>> > > > 
>> > > > As long as the error gets recorded, and gets returned when it is
>> > > > recorded, it doesn't much matter when the flag is cleared.  With
>> > > > your
>> > > > patches we don't need to flag any more to get errors reliably
>> > > > reported.
>> > > > 
>> > > > Leaving the flag set means that writes go more slowly - we don't
>> > > > get
>> > > > large queue of background rights building up but destined for
>> > > > failure.
>> > > > This is the main point made in the comment message when the flag
>> > > > was
>> > > > introduced.
>> > > > Of course, by the time we first get an error there could already
>> > > > by a large queue, so we probably want that to drain completely
>> > > > before
>> > > > allowing async writes again.
>> > 
>> > We already have this functionality implemented in the existing code.
>> > 
>> > > > 
>> > > > It might make sense to have 2 flags.  One which says "writes should
>> > > > be
>> > > > synchronous", another that says "There was an error recently".
>> > > > We clear the error flag before calling nfs_fsync, and if it is
>> > > > still
>> > > > clear afterwards, we clear the sync-writes flag.  Maybe that is
>> > > > more
>> > > > complex than needed though.
>> > > > 
>> > 
>> > We also need to preserve the NFS_CONTEXT_RESEND_WRITES flag. I don't
>> > see any global mechanism that will replace that.
>> > 
>> > > > I'm leaning towards your suggestion that it doesn't matter very
>> > > > much
>> > > > when it gets cleared, and clearing it on any successful write is
>> > > > simplest.
>> > > > 
>> > > > So I'm still in favor of using nfs_context_set_write_error() in
>> > > > nfs_pageio_add_request(), primarily because it is most consistent -
>> > > > we
>> > > > don't need exceptions.
>> > > 
>> > > Thanks for taking a closer look. I can easily make the change above,
>> > > and
>> > > I do think that keeping this mechanism as simple as possible will
>> > > make
>> > > it easier to prevent bitrot.
>> > > 
>> > > That said... NFS_CONTEXT_ERROR_WRITE is a per ctx flag, and the ctx
>> > > is a
>> > > per open file description object.
>> > > 
>> > > Is that the correct way to track this? All of the ctx's will share
>> > > the
>> > > same inode. If we're getting writeback errors for one context, it's
>> > > quite likely that we'll be seeing them via others.
>> > > 
>> > > I suppose the counterargument is when we have things like expiring
>> > > krb5
>> > > tickets. Write failures via an expiring set of creds may have no
>> > > effect
>> > > on writeback via other creds.
>> > > 
>> > > Still, I think a per-inode flag might make more sense here.
>> > > 
>> > > Thoughts?
>> > 
>> > As far as I'm concerned, that would be a regression. The most common
>> > problem when flushing writeback data to the server aside from ENOSPC
>> > (and possibly ESTALE) is EACCES, which is particular to the file
>> > descriptor that opened the file.
>> > 
>> > File contexts, and NFS_CONTEXT_ERROR_WRITE solve that problem by being
>> > private to the file descriptor.
>> 
>> Thanks for the reminder that errors are per-context and this patch drops
>> this.  The per-context nature of errors in NFS was the reason that I
>> nagged Jeff to make errseq_t a stand-alone type rather than just a part
>> of address_space.  I had envisaged that it would be embedded in the
>> open_context as well.
>> We still could do that, but as there is precisely one open-file for each
>> open_context, the gains are not great.
>> 
>> However, while looking over the code to make sure I really understood it
>> and all the possible consequences of changing to errseq_t I found a few
>> anomalies.  The patch below addresses them all.
>> 
>> Would you see if they may sense to you?
>> 
>> Thanks,
>> NeilBrown
>> 
>> 
>> From: NeilBrown <neilb@suse.com>
>> Date: Mon, 11 Sep 2017 13:15:50 +1000
>> Subject: [PATCH] NFS: various changes relating to reporting IO errors.
>> 
>> 1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
>>    They aren't used.
>> 
>> 2/ Make nfs_context_set_write_error() a "static inline" in internal.h
>>    so we can...
>> 
>> 3/ Use nfs_context_set_write_error() instead of mapping_set_error()
>>    if nfs_pageio_add_request() fails before sending any request.
>>    NFS generally keeps errors in the open_context, not the mapping,
>>    so this is more consistent.
>> 
>> 4/ If filemap_write_and_write_range() reports any error, still
>>    check ctx->error.  The value in ctx->error is likely to be
>>    more useful.  As part of this, NFS_CONTEXT_ERROR_WRITE is
>>    cleared slightly earlier, before nfs_file_fsync_commit() is called,
>>    rather than at the start of that function.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  fs/nfs/file.c     | 16 ++++++++++------
>>  fs/nfs/internal.h |  7 +++++++
>>  fs/nfs/pagelist.c |  4 ++--
>>  fs/nfs/write.c    |  7 -------
>>  4 files changed, 19 insertions(+), 15 deletions(-)
>> 
>> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> index af330c31f627..ab324f14081f 100644
>> --- a/fs/nfs/file.c
>> +++ b/fs/nfs/file.c
>> @@ -208,21 +208,19 @@ EXPORT_SYMBOL_GPL(nfs_file_mmap);
>>   * fall back to doing a synchronous write.
>>   */
>>  static int
>> -nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end, int datasync)
>> +nfs_file_fsync_commit(struct file *file, int datasync)
>>  {
>>  	struct nfs_open_context *ctx = nfs_file_open_context(file);
>>  	struct inode *inode = file_inode(file);
>> -	int have_error, do_resend, status;
>> +	int do_resend, status;
>>  	int ret = 0;
>>  
>>  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
>>  
>>  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>>  	do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
>> -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>>  	status = nfs_commit_inode(inode, FLUSH_SYNC);
>> -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> -	if (have_error) {
>> +	if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
>>  		ret = xchg(&ctx->error, 0);
>>  		if (ret)
>>  			goto out;
>> @@ -247,10 +245,16 @@ nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>>  	trace_nfs_fsync_enter(inode);
>>  
>>  	do {
>> +		struct nfs_open_context *ctx = nfs_file_open_context(file);
>>  		ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
>> +		if (test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
>> +			int ret2 = xchg(&ctx->error, 0);
>> +			if (ret2)
>> +				ret = ret2;
>> +		}
>>  		if (ret != 0)
>>  			break;
>> -		ret = nfs_file_fsync_commit(file, start, end, datasync);
>> +		ret = nfs_file_fsync_commit(file, datasync);
>>  		if (!ret)
>>  			ret = pnfs_sync_inode(inode, !!datasync);
>>  		/*
>> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
>> index dc456416d2be..44c8962fec91 100644
>> --- a/fs/nfs/internal.h
>> +++ b/fs/nfs/internal.h
>> @@ -769,3 +769,10 @@ static inline bool nfs_error_is_fatal(int err)
>>  		return false;
>>  	}
>>  }
>> +
>> +static inline void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
>> +{
>> +	ctx->error = error;
>> +	smp_wmb();
>> +	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> +}
>> diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
>> index de9066a92c0d..0ebd26b9a6bd 100644
>> --- a/fs/nfs/pagelist.c
>> +++ b/fs/nfs/pagelist.c
>> @@ -1198,8 +1198,8 @@ out_failed:
>>  
>>  		/* remember fatal errors */
>>  		if (nfs_error_is_fatal(desc->pg_error))
>> -			mapping_set_error(desc->pg_inode->i_mapping,
>> -					  desc->pg_error);
>> +			nfs_context_set_write_error(req->wb_context,
>> +						    desc->pg_error);
>>  
>>  		func = desc->pg_completion_ops->error_cleanup;
>>  		for (midx = 0; midx < desc->pg_mirror_count; midx++) {
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index b1af5dee5e0a..f702bf2def79 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>> @@ -147,13 +147,6 @@ static void nfs_io_completion_put(struct nfs_io_completion *ioc)
>>  		kref_put(&ioc->refcount, nfs_io_completion_release);
>>  }
>>  
>> -static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
>> -{
>> -	ctx->error = error;
>> -	smp_wmb();
>> -	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> -}
>> -
>>  /*
>>   * nfs_page_find_head_request_locked - find head request associated with @page
>>   *
>
> This should probably be broken out into at least a 2-3 different
> patches.
>
> Ok, so to make sure I understand:
>
> All writeback is done under the aegis of an open context, and writes
> from different open contexts are not mergeable. We also flush to the
> server in the case that a dirty page is written via an incompatible open
> context. So with that we can always tie

Not quite.  Writes from different open contexts are sometimes mergeable,
providing the credential is the same and there are no locks that might
get in the way. (nfs_flush_incompatible() gets rid of conflicts writes
to the same page as part of nfs_write_begin().
When writes are merged, all contexts remain reachable from the request
through an 'nfs_page'. nfs_write_completion() iterates over all the
nfs_pages attached to the nfs_pgio_header, and sets the context
write_error from the hdr->error.

>
> In that case, yes...mixing in errseq_t doesn't really buy us much here,
> and I agree with most of the changes above.
>
> That said...I'm still not thrilled with how NFS_CONTEXT_ERROR_WRITE is
> handled in this code. That flag is set when a write fails, but is only
> cleared on fsync.
>
> That seems wrong to me. Why wait for an fsync to start doing async
> writes again once they start working? What if the application never does
> an fsync? Clearing that flag on a successful WRITE seems like it'd make
> more sense.

We don't really 'wait' for an fsync.  Having NFS_CONTEXT_ERROR_WRITE
means that the very next write will force an fsync
(nfs_need_check_write()).  So we really just wait for the next write.
The current code doesn't seem "obviously right" to me, but it isn't
"obviously wrong" either, and I can only make it obviously right to me
by making it more complex, and I don't think I can justify that.

Thanks,
NeilBrown

> -- 
> Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-09-11  3:24                 ` NeilBrown
  2017-09-11 10:46                   ` Jeff Layton
@ 2017-09-12  2:24                   ` Trond Myklebust
  2017-09-12  5:29                     ` NeilBrown
  1 sibling, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2017-09-12  2:24 UTC (permalink / raw)
  To: anna.schumaker, jlayton, neilb, jlayton; +Cc: linux-nfs, linux-fsdevel

On Mon, 2017-09-11 at 13:24 +1000, NeilBrown wrote:
> However, while looking over the code to make sure I really understood
> it
> and all the possible consequences of changing to errseq_t I found a
> few
> anomalies.  The patch below addresses them all.
> 
> Would you see if they may sense to you?
> 
> Thanks,
> NeilBrown
> 
> 
> From: NeilBrown <neilb@suse.com>
> Date: Mon, 11 Sep 2017 13:15:50 +1000
> Subject: [PATCH] NFS: various changes relating to reporting IO
> errors.
> 
> 1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
>    They aren't used.
> 
> 2/ Make nfs_context_set_write_error() a "static inline" in internal.h
>    so we can...
> 
> 3/ Use nfs_context_set_write_error() instead of mapping_set_error()
>    if nfs_pageio_add_request() fails before sending any request.
>    NFS generally keeps errors in the open_context, not the mapping,
>    so this is more consistent.
> 
> 4/ If filemap_write_and_write_range() reports any error, still
>    check ctx->error.  The value in ctx->error is likely to be
>    more useful.  As part of this, NFS_CONTEXT_ERROR_WRITE is
>    cleared slightly earlier, before nfs_file_fsync_commit() is
> called,
>    rather than at the start of that function.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/nfs/file.c     | 16 ++++++++++------
>  fs/nfs/internal.h |  7 +++++++
>  fs/nfs/pagelist.c |  4 ++--
>  fs/nfs/write.c    |  7 -------
>  4 files changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index af330c31f627..ab324f14081f 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -208,21 +208,19 @@ EXPORT_SYMBOL_GPL(nfs_file_mmap);
>   * fall back to doing a synchronous write.
>   */
>  static int
> -nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end,
> int datasync)
> +nfs_file_fsync_commit(struct file *file, int datasync)
>  {
>  	struct nfs_open_context *ctx = nfs_file_open_context(file);
>  	struct inode *inode = file_inode(file);
> -	int have_error, do_resend, status;
> +	int do_resend, status;
>  	int ret = 0;
>  
>  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file,
> datasync);
>  
>  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>  	do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES,
> &ctx->flags);
> -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
> &ctx->flags);
>  	status = nfs_commit_inode(inode, FLUSH_SYNC);
> -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> >flags);
> -	if (have_error) {
> +	if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
>  		ret = xchg(&ctx->error, 0);
>  		if (ret)
>  			goto out;
> @@ -247,10 +245,16 @@ nfs_file_fsync(struct file *file, loff_t start,
> loff_t end, int datasync)
>  	trace_nfs_fsync_enter(inode);
>  
>  	do {
> +		struct nfs_open_context *ctx =
> nfs_file_open_context(file);
>  		ret = filemap_write_and_wait_range(inode->i_mapping, 
> start, end);
> +		if (test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
> &ctx->flags)) {
> +			int ret2 = xchg(&ctx->error, 0);
> +			if (ret2)
> +				ret = ret2;
> +		}
>  		if (ret != 0)
>  			break;
> -		ret = nfs_file_fsync_commit(file, start, end,
> datasync);
> +		ret = nfs_file_fsync_commit(file, datasync);
>  		if (!ret)
>  			ret = pnfs_sync_inode(inode, !!datasync);
>  		/*
> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> index dc456416d2be..44c8962fec91 100644
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -769,3 +769,10 @@ static inline bool nfs_error_is_fatal(int err)
>  		return false;
>  	}
>  }
> +
> +static inline void nfs_context_set_write_error(struct
> nfs_open_context *ctx, int error)
> +{
> +	ctx->error = error;
> +	smp_wmb();
> +	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> +}
> diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
> index de9066a92c0d..0ebd26b9a6bd 100644
> --- a/fs/nfs/pagelist.c
> +++ b/fs/nfs/pagelist.c
> @@ -1198,8 +1198,8 @@ out_failed:
>  
>  		/* remember fatal errors */
>  		if (nfs_error_is_fatal(desc->pg_error))
> -			mapping_set_error(desc->pg_inode->i_mapping,
> -					  desc->pg_error);
> +			nfs_context_set_write_error(req->wb_context,
> +						    desc->pg_error);
>  
>  		func = desc->pg_completion_ops->error_cleanup;
>  		for (midx = 0; midx < desc->pg_mirror_count; midx++)
> {
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b1af5dee5e0a..f702bf2def79 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -147,13 +147,6 @@ static void nfs_io_completion_put(struct
> nfs_io_completion *ioc)
>  		kref_put(&ioc->refcount, nfs_io_completion_release);
>  }
>  
> -static void nfs_context_set_write_error(struct nfs_open_context
> *ctx, int error)
> -{
> -	ctx->error = error;
> -	smp_wmb();
> -	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> -}
> -
>  /*
>   * nfs_page_find_head_request_locked - find head request associated
> with @page
>   *

That makes sense to me. I'm applying and will send as an update to
4.14...

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-09-12  2:24                   ` [PATCH] nfs: track writeback errors with errseq_t Trond Myklebust
@ 2017-09-12  5:29                     ` NeilBrown
  0 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2017-09-12  5:29 UTC (permalink / raw)
  To: Trond Myklebust, anna.schumaker, jlayton, jlayton
  Cc: linux-nfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 146 bytes --]

On Tue, Sep 12 2017, Trond Myklebust wrote:
>
> That makes sense to me. I'm applying and will send as an update to
> 4.14...
>

Thanks!
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-09-11 21:52                     ` NeilBrown
@ 2017-09-12 15:20                       ` Jeff Layton
  2017-09-12 21:47                         ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2017-09-12 15:20 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, anna.schumaker, jlayton
  Cc: linux-nfs, linux-fsdevel

On Tue, 2017-09-12 at 07:52 +1000, NeilBrown wrote:
> > On Mon, 2017-09-11 at 13:24 +1000, NeilBrown wrote:
> > > On Thu, Sep 07 2017, Trond Myklebust wrote:
> > > 
> > > > On Thu, 2017-09-07 at 07:35 -0400, Jeff Layton wrote:
> > > > > On Thu, 2017-09-07 at 13:37 +1000, NeilBrown wrote:
> > > > > > On Tue, Aug 29 2017, Jeff Layton wrote:
> > > > > > 
> > > > > > > On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
> > > > > > > > On Mon, Aug 28 2017, Jeff Layton wrote:
> > > > > > > > 
> > > > > > > > > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
> > > > > > > > > > On Fri, Aug 25 2017, Jeff Layton wrote:
> > > > > > > > > > 
> > > > > > > > > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
> > > > > > > > > > > > From: Jeff Layton <jlayton@redhat.com>
> > > > > > > > > > > > 
> > > > > > > > > > > > There is some ambiguity in nfs about how writeback
> > > > > > > > > > > > errors are
> > > > > > > > > > > > tracked.
> > > > > > > > > > > > 
> > > > > > > > > > > > For instance, nfs_pageio_add_request calls
> > > > > > > > > > > > mapping_set_error when
> > > > > > > > > > > > the
> > > > > > > > > > > > add fails, but we track errors that occur after adding
> > > > > > > > > > > > the
> > > > > > > > > > > > request
> > > > > > > > > > > > with a dedicated int error in the open context.
> > > > > > > > > > > > 
> > > > > > > > > > > > Now that we have better infrastructure for the vfs
> > > > > > > > > > > > layer, this
> > > > > > > > > > > > latter int is now unnecessary. Just have
> > > > > > > > > > > > nfs_context_set_write_error set
> > > > > > > > > > > > the error in the mapping when one occurs.
> > > > > > > > > > > > 
> > > > > > > > > > > > Have NFS use file_write_and_wait_range to initiate and
> > > > > > > > > > > > wait on
> > > > > > > > > > > > writeback
> > > > > > > > > > > > of the data, and then check again after issuing the
> > > > > > > > > > > > commit(s).
> > > > > > > > > > > > 
> > > > > > > > > > > > With this, we also don't need to pay attention to the
> > > > > > > > > > > > ERROR_WRITE
> > > > > > > > > > > > flag for reporting, and just clear it to indicate to
> > > > > > > > > > > > subsequent
> > > > > > > > > > > > writers that they should try to go asynchronous again.
> > > > > > > > > > > > 
> > > > > > > > > > > > In nfs_page_async_flush, sample the error before
> > > > > > > > > > > > locking and
> > > > > > > > > > > > joining
> > > > > > > > > > > > the requests, and check for errors since that point.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
> > > > > > > > > > > >  fs/nfs/inode.c         |  3 +--
> > > > > > > > > > > >  fs/nfs/write.c         |  8 ++++++--
> > > > > > > > > > > >  include/linux/nfs_fs.h |  1 -
> > > > > > > > > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
> > > > > > > > > > > > 
> > > > > > > > > > > > I have a baling wire and duct tape solution for testing
> > > > > > > > > > > > this with
> > > > > > > > > > > > xfstests (using iptables REJECT targets and soft
> > > > > > > > > > > > mounts). This
> > > > > > > > > > > > seems to
> > > > > > > > > > > > make nfs do the right thing.
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > > > > > > > > > > index 5713eb32a45e..15d3c6faafd3 100644
> > > > > > > > > > > > --- a/fs/nfs/file.c
> > > > > > > > > > > > +++ b/fs/nfs/file.c
> > > > > > > > > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file
> > > > > > > > > > > > *file,
> > > > > > > > > > > > loff_t start, loff_t end, int datasync)
> > > > > > > > > > > >  {
> > > > > > > > > > > >  	struct nfs_open_context *ctx =
> > > > > > > > > > > > nfs_file_open_context(file);
> > > > > > > > > > > >  	struct inode *inode = file_inode(file);
> > > > > > > > > > > > -	int have_error, do_resend, status;
> > > > > > > > > > > > -	int ret = 0;
> > > > > > > > > > > > +	int do_resend, status;
> > > > > > > > > > > > +	int ret;
> > > > > > > > > > > >  
> > > > > > > > > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n",
> > > > > > > > > > > > file,
> > > > > > > > > > > > datasync);
> > > > > > > > > > > >  
> > > > > > > > > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
> > > > > > > > > > > >  	do_resend =
> > > > > > > > > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx-
> > > > > > > > > > > > > flags);
> > > > > > > > > > > > 
> > > > > > > > > > > > -	have_error =
> > > > > > > > > > > > test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
> > > > > > > > > > > > &ctx->flags);
> > > > > > > > > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > > > > > > > > -	have_error |=
> > > > > > > > > > > > test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > > > > > > > > flags);
> > > > > > > > > > > > 
> > > > > > > > > > > > -	if (have_error) {
> > > > > > > > > > > > -		ret = xchg(&ctx->error, 0);
> > > > > > > > > > > > -		if (ret)
> > > > > > > > > > > > -			goto out;
> > > > > > > > > > > > -	}
> > > > > > > > > > > > -	if (status < 0) {
> > > > > > > > > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > > > > > > > > flags);
> > > > > > > > > > > > 
> > > > > > > > > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > > > > > > > > +
> > > > > > > > > > > > +	/* Recheck and advance after the commit */
> > > > > > > > > > > > +	status = file_check_and_advance_wb_err(file);
> > > > > > > > > > 
> > > > > > > > > > This change makes the code inconsistent with the comment
> > > > > > > > > > above the
> > > > > > > > > > function, which still references ctx->error.  The intent of
> > > > > > > > > > the
> > > > > > > > > > comment
> > > > > > > > > > is still correct, but the details have changed.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Good catch. I'll fix that up in a respin.
> > > > > > > > > 
> > > > > > > > > > Also, there is a call to mapping_set_error() in
> > > > > > > > > > nfs_pageio_add_request().
> > > > > > > > > > I wonder if that should be changed to
> > > > > > > > > >   nfs_context_set_write_error(req->wb_context, desc-
> > > > > > > > > > > pg_error)
> > > > > > > > > > 
> > > > > > > > > > ??
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Trickier question...
> > > > > > > > > 
> > > > > > > > > I'm not quite sure what semantics we're looking for with
> > > > > > > > > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
> > > > > > > > > synchronous, but I'm not quite sure why it gets cleared the
> > > > > > > > > way it
> > > > > > > > > does. It's set on any error but cleared before issuing a
> > > > > > > > > commit.
> > > > > > > > > 
> > > > > > > > > I added a similar flag to Ceph inodes recently, but only
> > > > > > > > > clear it when
> > > > > > > > > a write succeeds. Wouldn't that make more sense here as well?
> > > > > > > > 
> > > > > > > > It is a bit hard to wrap one's mind around.
> > > > > > > > 
> > > > > > > > In the original code (commit 7b159fc18d417980) it looks like:
> > > > > > > >  - test-and-clear bit
> > > > > > > >  - write and sync
> > > > > > > >  - test-bit
> > > > > > > > 
> > > > > > > > This does, I think, seem safer than "clear on successful write"
> > > > > > > > as the
> > > > > > > > writes could complete out-of-order and I wouldn't be surprised
> > > > > > > > if the
> > > > > > > > unsuccessful ones completed with an error before the successful
> > > > > > > > one -
> > > > > > > > particularly with an error like EDQUOT.
> > > > > > > > 
> > > > > > > > However the current code does the writes before the test-and-
> > > > > > > > clear, and
> > > > > > > > only does the commit afterwards.  That makes it less clear why
> > > > > > > > the
> > > > > > > > current sequence is a good idea.
> > > > > > > > 
> > > > > > > > However ... nfs_file_fsync_commit() is only called if
> > > > > > > > filemap_write_and_wait_range() returned with success, so we
> > > > > > > > only clear
> > > > > > > > the flag after successful writes(?).
> > > > > > > > 
> > > > > > > > Oh....
> > > > > > > > This patch from me:
> > > > > > > > 
> > > > > > > > Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error
> > > > > > > > handling.")
> > > > > > > > 
> > > > > > > > seems to have been reverted by
> > > > > > > > 
> > > > > > > > Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if
> > > > > > > > page writeback failed")
> > > > > > > > 
> > > > > > > > which probably isn't good.  It appears that this code is very
> > > > > > > > fragile
> > > > > > > > and easily broken.
> > > > > > 
> > > > > > On further investigation, I think the problem that I fixed and then
> > > > > > we
> > > > > > reintroduced will be fixed again - more permanently - by your
> > > > > > patch.
> > > > > > The root problem is that nfs keeps error codes in a different way
> > > > > > to the
> > > > > > MM core.  By unifying those, the problem goes.
> > > > > > (The specific problem is that writes which hit EDQUOT on the server
> > > > > > can
> > > > > >  report EIO on the client).
> > > > > > 
> > > > > > 
> > > > > > > > Maybe we need to work out exactly what is required, and
> > > > > > > > document it - so
> > > > > > > > we can stop breaking it.
> > > > > > > > Or maybe we need some unit tests.....
> > > > > > > > 
> > > > > > > 
> > > > > > > Yes, laying out what's necessary for this would be very helpful.
> > > > > > > We
> > > > > > > clearly want to set the flag when an error occurs. Under what
> > > > > > > circumstances should we be clearing it?
> > > > > > 
> > > > > > Well.... looking back at  7b159fc18d417980f57ae which introduced
> > > > > > the
> > > > > > flag, prior to that write errors (ctx->error) were only reported by
> > > > > > nfs_file_flush and nfs_fsync, so only one close() and fsync().
> > > > > > 
> > > > > > After that commit, setting the flag would mean that errors could be
> > > > > > returned by 'write'.  So clearing as part of returning the error
> > > > > > makes
> > > > > > perfect sense.
> > > > > > 
> > > > > > As long as the error gets recorded, and gets returned when it is
> > > > > > recorded, it doesn't much matter when the flag is cleared.  With
> > > > > > your
> > > > > > patches we don't need to flag any more to get errors reliably
> > > > > > reported.
> > > > > > 
> > > > > > Leaving the flag set means that writes go more slowly - we don't
> > > > > > get
> > > > > > large queue of background rights building up but destined for
> > > > > > failure.
> > > > > > This is the main point made in the comment message when the flag
> > > > > > was
> > > > > > introduced.
> > > > > > Of course, by the time we first get an error there could already
> > > > > > by a large queue, so we probably want that to drain completely
> > > > > > before
> > > > > > allowing async writes again.
> > > > 
> > > > We already have this functionality implemented in the existing code.
> > > > 
> > > > > > 
> > > > > > It might make sense to have 2 flags.  One which says "writes should
> > > > > > be
> > > > > > synchronous", another that says "There was an error recently".
> > > > > > We clear the error flag before calling nfs_fsync, and if it is
> > > > > > still
> > > > > > clear afterwards, we clear the sync-writes flag.  Maybe that is
> > > > > > more
> > > > > > complex than needed though.
> > > > > > 
> > > > 
> > > > We also need to preserve the NFS_CONTEXT_RESEND_WRITES flag. I don't
> > > > see any global mechanism that will replace that.
> > > > 
> > > > > > I'm leaning towards your suggestion that it doesn't matter very
> > > > > > much
> > > > > > when it gets cleared, and clearing it on any successful write is
> > > > > > simplest.
> > > > > > 
> > > > > > So I'm still in favor of using nfs_context_set_write_error() in
> > > > > > nfs_pageio_add_request(), primarily because it is most consistent -
> > > > > > we
> > > > > > don't need exceptions.
> > > > > 
> > > > > Thanks for taking a closer look. I can easily make the change above,
> > > > > and
> > > > > I do think that keeping this mechanism as simple as possible will
> > > > > make
> > > > > it easier to prevent bitrot.
> > > > > 
> > > > > That said... NFS_CONTEXT_ERROR_WRITE is a per ctx flag, and the ctx
> > > > > is a
> > > > > per open file description object.
> > > > > 
> > > > > Is that the correct way to track this? All of the ctx's will share
> > > > > the
> > > > > same inode. If we're getting writeback errors for one context, it's
> > > > > quite likely that we'll be seeing them via others.
> > > > > 
> > > > > I suppose the counterargument is when we have things like expiring
> > > > > krb5
> > > > > tickets. Write failures via an expiring set of creds may have no
> > > > > effect
> > > > > on writeback via other creds.
> > > > > 
> > > > > Still, I think a per-inode flag might make more sense here.
> > > > > 
> > > > > Thoughts?
> > > > 
> > > > As far as I'm concerned, that would be a regression. The most common
> > > > problem when flushing writeback data to the server aside from ENOSPC
> > > > (and possibly ESTALE) is EACCES, which is particular to the file
> > > > descriptor that opened the file.
> > > > 
> > > > File contexts, and NFS_CONTEXT_ERROR_WRITE solve that problem by being
> > > > private to the file descriptor.
> > > 
> > > Thanks for the reminder that errors are per-context and this patch drops
> > > this.  The per-context nature of errors in NFS was the reason that I
> > > nagged Jeff to make errseq_t a stand-alone type rather than just a part
> > > of address_space.  I had envisaged that it would be embedded in the
> > > open_context as well.
> > > We still could do that, but as there is precisely one open-file for each
> > > open_context, the gains are not great.
> > > 
> > > However, while looking over the code to make sure I really understood it
> > > and all the possible consequences of changing to errseq_t I found a few
> > > anomalies.  The patch below addresses them all.
> > > 
> > > Would you see if they may sense to you?
> > > 
> > > Thanks,
> > > NeilBrown
> > > 
> > > 
> > > From: NeilBrown <neilb@suse.com>
> > > Date: Mon, 11 Sep 2017 13:15:50 +1000
> > > Subject: [PATCH] NFS: various changes relating to reporting IO errors.
> > > 
> > > 1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
> > >    They aren't used.
> > > 
> > > 2/ Make nfs_context_set_write_error() a "static inline" in internal.h
> > >    so we can...
> > > 
> > > 3/ Use nfs_context_set_write_error() instead of mapping_set_error()
> > >    if nfs_pageio_add_request() fails before sending any request.
> > >    NFS generally keeps errors in the open_context, not the mapping,
> > >    so this is more consistent.
> > > 
> > > 4/ If filemap_write_and_write_range() reports any error, still
> > >    check ctx->error.  The value in ctx->error is likely to be
> > >    more useful.  As part of this, NFS_CONTEXT_ERROR_WRITE is
> > >    cleared slightly earlier, before nfs_file_fsync_commit() is called,
> > >    rather than at the start of that function.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.com>
> > > ---
> > >  fs/nfs/file.c     | 16 ++++++++++------
> > >  fs/nfs/internal.h |  7 +++++++
> > >  fs/nfs/pagelist.c |  4 ++--
> > >  fs/nfs/write.c    |  7 -------
> > >  4 files changed, 19 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > index af330c31f627..ab324f14081f 100644
> > > --- a/fs/nfs/file.c
> > > +++ b/fs/nfs/file.c
> > > @@ -208,21 +208,19 @@ EXPORT_SYMBOL_GPL(nfs_file_mmap);
> > >   * fall back to doing a synchronous write.
> > >   */
> > >  static int
> > > -nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end, int datasync)
> > > +nfs_file_fsync_commit(struct file *file, int datasync)
> > >  {
> > >  	struct nfs_open_context *ctx = nfs_file_open_context(file);
> > >  	struct inode *inode = file_inode(file);
> > > -	int have_error, do_resend, status;
> > > +	int do_resend, status;
> > >  	int ret = 0;
> > >  
> > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
> > >  
> > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
> > >  	do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
> > > -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > >  	status = nfs_commit_inode(inode, FLUSH_SYNC);
> > > -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > -	if (have_error) {
> > > +	if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
> > >  		ret = xchg(&ctx->error, 0);
> > >  		if (ret)
> > >  			goto out;
> > > @@ -247,10 +245,16 @@ nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> > >  	trace_nfs_fsync_enter(inode);
> > >  
> > >  	do {
> > > +		struct nfs_open_context *ctx = nfs_file_open_context(file);
> > >  		ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > > +		if (test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
> > > +			int ret2 = xchg(&ctx->error, 0);
> > > +			if (ret2)
> > > +				ret = ret2;
> > > +		}
> > >  		if (ret != 0)
> > >  			break;
> > > -		ret = nfs_file_fsync_commit(file, start, end, datasync);
> > > +		ret = nfs_file_fsync_commit(file, datasync);
> > >  		if (!ret)
> > >  			ret = pnfs_sync_inode(inode, !!datasync);
> > >  		/*
> > > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> > > index dc456416d2be..44c8962fec91 100644
> > > --- a/fs/nfs/internal.h
> > > +++ b/fs/nfs/internal.h
> > > @@ -769,3 +769,10 @@ static inline bool nfs_error_is_fatal(int err)
> > >  		return false;
> > >  	}
> > >  }
> > > +
> > > +static inline void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
> > > +{
> > > +	ctx->error = error;
> > > +	smp_wmb();
> > > +	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > +}
> > > diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
> > > index de9066a92c0d..0ebd26b9a6bd 100644
> > > --- a/fs/nfs/pagelist.c
> > > +++ b/fs/nfs/pagelist.c
> > > @@ -1198,8 +1198,8 @@ out_failed:
> > >  
> > >  		/* remember fatal errors */
> > >  		if (nfs_error_is_fatal(desc->pg_error))
> > > -			mapping_set_error(desc->pg_inode->i_mapping,
> > > -					  desc->pg_error);
> > > +			nfs_context_set_write_error(req->wb_context,
> > > +						    desc->pg_error);
> > >  
> > >  		func = desc->pg_completion_ops->error_cleanup;
> > >  		for (midx = 0; midx < desc->pg_mirror_count; midx++) {
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index b1af5dee5e0a..f702bf2def79 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -147,13 +147,6 @@ static void nfs_io_completion_put(struct nfs_io_completion *ioc)
> > >  		kref_put(&ioc->refcount, nfs_io_completion_release);
> > >  }
> > >  
> > > -static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
> > > -{
> > > -	ctx->error = error;
> > > -	smp_wmb();
> > > -	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > -}
> > > -
> > >  /*
> > >   * nfs_page_find_head_request_locked - find head request associated with @page
> > >   *
> > 
> > This should probably be broken out into at least a 2-3 different
> > patches.
> > 
> > Ok, so to make sure I understand:
> > 
> > All writeback is done under the aegis of an open context, and writes
> > from different open contexts are not mergeable. We also flush to the
> > server in the case that a dirty page is written via an incompatible open
> > context. So with that we can always tie
> 
> Not quite.  Writes from different open contexts are sometimes mergeable,
> providing the credential is the same and there are no locks that might
> get in the way. (nfs_flush_incompatible() gets rid of conflicts writes
> to the same page as part of nfs_write_begin().
> When writes are merged, all contexts remain reachable from the request
> through an 'nfs_page'. nfs_write_completion() iterates over all the
> nfs_pages attached to the nfs_pgio_header, and sets the context
> write_error from the hdr->error.
> 

Ok, by this account, NFS should already have "correct" error reporting
semantics on fsync. i.e. when the file is written via multiple fds, you
should get back an error on all fds if those writebacks failed.

I have a test for nfs for the new-style error reporting:

    https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/xfstests-dev.git/log/?h=wberr

The nfs test is still pretty rickety, using soft mounts and iptables to
cause requests to fail. With the patch I originally proposed, this test
would pass. When I run this test on normal mainline kernels, it fails:

-------------------------------8<------------------------------
FSTYP         -- nfs
PLATFORM      -- Linux/x86_64 wberr 4.12.11-300.fc26.x86_64
MKFS_OPTIONS  -- knfsdsrv:/export/scratch
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 knfsdsrv:/export/scratch /mnt/scratch

nfs/002	 - output mismatch (see /home/jlayton/git/xfstests/results//nfs/002.out.bad)
    --- tests/nfs/002.out	2017-07-19 13:12:59.354561869 -0400
    +++ /home/jlayton/git/xfstests/results//nfs/002.out.bad	2017-09-12 11:17:17.335943539 -0400
    @@ -1,3 +1,3 @@
     QA output created by 002
     Format and mount
    -Test passed!
    +Success on second fsync on fd[1]!
    ...
    (Run 'diff -u tests/nfs/002.out /home/jlayton/git/xfstests/results//nfs/002.out.bad'  to see the entire diff)
Ran: nfs/002
Failures: nfs/002
Failed 1 of 1 tests
-------------------------------8<------------------------------

I'm not sure that errors are really propagated to all struct files like
you suggest above. I'll plan to look a little more closely at what's
happening here, when I get some time.

> > 
> > In that case, yes...mixing in errseq_t doesn't really buy us much here,
> > and I agree with most of the changes above.
> > 
> > That said...I'm still not thrilled with how NFS_CONTEXT_ERROR_WRITE is
> > handled in this code. That flag is set when a write fails, but is only
> > cleared on fsync.
> > 
> > That seems wrong to me. Why wait for an fsync to start doing async
> > writes again once they start working? What if the application never does
> > an fsync? Clearing that flag on a successful WRITE seems like it'd make
> > more sense.
> 
> We don't really 'wait' for an fsync.  Having NFS_CONTEXT_ERROR_WRITE
> means that the very next write will force an fsync
> (nfs_need_check_write()).  So we really just wait for the next write.
> The current code doesn't seem "obviously right" to me, but it isn't
> "obviously wrong" either, and I can only make it obviously right to me
> by making it more complex, and I don't think I can justify that.

Thanks for pointing this out. I missed the bit about it forcing the
fsync when this fails. I agree that that should be fine.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-09-12 15:20                       ` Jeff Layton
@ 2017-09-12 21:47                         ` NeilBrown
  2017-09-13 12:23                           ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2017-09-12 21:47 UTC (permalink / raw)
  To: Jeff Layton, Trond Myklebust, anna.schumaker, jlayton
  Cc: linux-nfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 24786 bytes --]

On Tue, Sep 12 2017, Jeff Layton wrote:

> On Tue, 2017-09-12 at 07:52 +1000, NeilBrown wrote:
>> > On Mon, 2017-09-11 at 13:24 +1000, NeilBrown wrote:
>> > > On Thu, Sep 07 2017, Trond Myklebust wrote:
>> > > 
>> > > > On Thu, 2017-09-07 at 07:35 -0400, Jeff Layton wrote:
>> > > > > On Thu, 2017-09-07 at 13:37 +1000, NeilBrown wrote:
>> > > > > > On Tue, Aug 29 2017, Jeff Layton wrote:
>> > > > > > 
>> > > > > > > On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
>> > > > > > > > On Mon, Aug 28 2017, Jeff Layton wrote:
>> > > > > > > > 
>> > > > > > > > > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
>> > > > > > > > > > On Fri, Aug 25 2017, Jeff Layton wrote:
>> > > > > > > > > > 
>> > > > > > > > > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
>> > > > > > > > > > > > From: Jeff Layton <jlayton@redhat.com>
>> > > > > > > > > > > > 
>> > > > > > > > > > > > There is some ambiguity in nfs about how writeback
>> > > > > > > > > > > > errors are
>> > > > > > > > > > > > tracked.
>> > > > > > > > > > > > 
>> > > > > > > > > > > > For instance, nfs_pageio_add_request calls
>> > > > > > > > > > > > mapping_set_error when
>> > > > > > > > > > > > the
>> > > > > > > > > > > > add fails, but we track errors that occur after adding
>> > > > > > > > > > > > the
>> > > > > > > > > > > > request
>> > > > > > > > > > > > with a dedicated int error in the open context.
>> > > > > > > > > > > > 
>> > > > > > > > > > > > Now that we have better infrastructure for the vfs
>> > > > > > > > > > > > layer, this
>> > > > > > > > > > > > latter int is now unnecessary. Just have
>> > > > > > > > > > > > nfs_context_set_write_error set
>> > > > > > > > > > > > the error in the mapping when one occurs.
>> > > > > > > > > > > > 
>> > > > > > > > > > > > Have NFS use file_write_and_wait_range to initiate and
>> > > > > > > > > > > > wait on
>> > > > > > > > > > > > writeback
>> > > > > > > > > > > > of the data, and then check again after issuing the
>> > > > > > > > > > > > commit(s).
>> > > > > > > > > > > > 
>> > > > > > > > > > > > With this, we also don't need to pay attention to the
>> > > > > > > > > > > > ERROR_WRITE
>> > > > > > > > > > > > flag for reporting, and just clear it to indicate to
>> > > > > > > > > > > > subsequent
>> > > > > > > > > > > > writers that they should try to go asynchronous again.
>> > > > > > > > > > > > 
>> > > > > > > > > > > > In nfs_page_async_flush, sample the error before
>> > > > > > > > > > > > locking and
>> > > > > > > > > > > > joining
>> > > > > > > > > > > > the requests, and check for errors since that point.
>> > > > > > > > > > > > 
>> > > > > > > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > > > > > > > > > > ---
>> > > > > > > > > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
>> > > > > > > > > > > >  fs/nfs/inode.c         |  3 +--
>> > > > > > > > > > > >  fs/nfs/write.c         |  8 ++++++--
>> > > > > > > > > > > >  include/linux/nfs_fs.h |  1 -
>> > > > > > > > > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
>> > > > > > > > > > > > 
>> > > > > > > > > > > > I have a baling wire and duct tape solution for testing
>> > > > > > > > > > > > this with
>> > > > > > > > > > > > xfstests (using iptables REJECT targets and soft
>> > > > > > > > > > > > mounts). This
>> > > > > > > > > > > > seems to
>> > > > > > > > > > > > make nfs do the right thing.
>> > > > > > > > > > > > 
>> > > > > > > > > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> > > > > > > > > > > > index 5713eb32a45e..15d3c6faafd3 100644
>> > > > > > > > > > > > --- a/fs/nfs/file.c
>> > > > > > > > > > > > +++ b/fs/nfs/file.c
>> > > > > > > > > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file
>> > > > > > > > > > > > *file,
>> > > > > > > > > > > > loff_t start, loff_t end, int datasync)
>> > > > > > > > > > > >  {
>> > > > > > > > > > > >  	struct nfs_open_context *ctx =
>> > > > > > > > > > > > nfs_file_open_context(file);
>> > > > > > > > > > > >  	struct inode *inode = file_inode(file);
>> > > > > > > > > > > > -	int have_error, do_resend, status;
>> > > > > > > > > > > > -	int ret = 0;
>> > > > > > > > > > > > +	int do_resend, status;
>> > > > > > > > > > > > +	int ret;
>> > > > > > > > > > > >  
>> > > > > > > > > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n",
>> > > > > > > > > > > > file,
>> > > > > > > > > > > > datasync);
>> > > > > > > > > > > >  
>> > > > > > > > > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>> > > > > > > > > > > >  	do_resend =
>> > > > > > > > > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx-
>> > > > > > > > > > > > > flags);
>> > > > > > > > > > > > 
>> > > > > > > > > > > > -	have_error =
>> > > > > > > > > > > > test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
>> > > > > > > > > > > > &ctx->flags);
>> > > > > > > > > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > > > > > > > > > > -	have_error |=
>> > > > > > > > > > > > test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
>> > > > > > > > > > > > > flags);
>> > > > > > > > > > > > 
>> > > > > > > > > > > > -	if (have_error) {
>> > > > > > > > > > > > -		ret = xchg(&ctx->error, 0);
>> > > > > > > > > > > > -		if (ret)
>> > > > > > > > > > > > -			goto out;
>> > > > > > > > > > > > -	}
>> > > > > > > > > > > > -	if (status < 0) {
>> > > > > > > > > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
>> > > > > > > > > > > > > flags);
>> > > > > > > > > > > > 
>> > > > > > > > > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > > > > > > > > > > +
>> > > > > > > > > > > > +	/* Recheck and advance after the commit */
>> > > > > > > > > > > > +	status = file_check_and_advance_wb_err(file);
>> > > > > > > > > > 
>> > > > > > > > > > This change makes the code inconsistent with the comment
>> > > > > > > > > > above the
>> > > > > > > > > > function, which still references ctx->error.  The intent of
>> > > > > > > > > > the
>> > > > > > > > > > comment
>> > > > > > > > > > is still correct, but the details have changed.
>> > > > > > > > > > 
>> > > > > > > > > 
>> > > > > > > > > Good catch. I'll fix that up in a respin.
>> > > > > > > > > 
>> > > > > > > > > > Also, there is a call to mapping_set_error() in
>> > > > > > > > > > nfs_pageio_add_request().
>> > > > > > > > > > I wonder if that should be changed to
>> > > > > > > > > >   nfs_context_set_write_error(req->wb_context, desc-
>> > > > > > > > > > > pg_error)
>> > > > > > > > > > 
>> > > > > > > > > > ??
>> > > > > > > > > > 
>> > > > > > > > > 
>> > > > > > > > > Trickier question...
>> > > > > > > > > 
>> > > > > > > > > I'm not quite sure what semantics we're looking for with
>> > > > > > > > > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
>> > > > > > > > > synchronous, but I'm not quite sure why it gets cleared the
>> > > > > > > > > way it
>> > > > > > > > > does. It's set on any error but cleared before issuing a
>> > > > > > > > > commit.
>> > > > > > > > > 
>> > > > > > > > > I added a similar flag to Ceph inodes recently, but only
>> > > > > > > > > clear it when
>> > > > > > > > > a write succeeds. Wouldn't that make more sense here as well?
>> > > > > > > > 
>> > > > > > > > It is a bit hard to wrap one's mind around.
>> > > > > > > > 
>> > > > > > > > In the original code (commit 7b159fc18d417980) it looks like:
>> > > > > > > >  - test-and-clear bit
>> > > > > > > >  - write and sync
>> > > > > > > >  - test-bit
>> > > > > > > > 
>> > > > > > > > This does, I think, seem safer than "clear on successful write"
>> > > > > > > > as the
>> > > > > > > > writes could complete out-of-order and I wouldn't be surprised
>> > > > > > > > if the
>> > > > > > > > unsuccessful ones completed with an error before the successful
>> > > > > > > > one -
>> > > > > > > > particularly with an error like EDQUOT.
>> > > > > > > > 
>> > > > > > > > However the current code does the writes before the test-and-
>> > > > > > > > clear, and
>> > > > > > > > only does the commit afterwards.  That makes it less clear why
>> > > > > > > > the
>> > > > > > > > current sequence is a good idea.
>> > > > > > > > 
>> > > > > > > > However ... nfs_file_fsync_commit() is only called if
>> > > > > > > > filemap_write_and_wait_range() returned with success, so we
>> > > > > > > > only clear
>> > > > > > > > the flag after successful writes(?).
>> > > > > > > > 
>> > > > > > > > Oh....
>> > > > > > > > This patch from me:
>> > > > > > > > 
>> > > > > > > > Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error
>> > > > > > > > handling.")
>> > > > > > > > 
>> > > > > > > > seems to have been reverted by
>> > > > > > > > 
>> > > > > > > > Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if
>> > > > > > > > page writeback failed")
>> > > > > > > > 
>> > > > > > > > which probably isn't good.  It appears that this code is very
>> > > > > > > > fragile
>> > > > > > > > and easily broken.
>> > > > > > 
>> > > > > > On further investigation, I think the problem that I fixed and then
>> > > > > > we
>> > > > > > reintroduced will be fixed again - more permanently - by your
>> > > > > > patch.
>> > > > > > The root problem is that nfs keeps error codes in a different way
>> > > > > > to the
>> > > > > > MM core.  By unifying those, the problem goes.
>> > > > > > (The specific problem is that writes which hit EDQUOT on the server
>> > > > > > can
>> > > > > >  report EIO on the client).
>> > > > > > 
>> > > > > > 
>> > > > > > > > Maybe we need to work out exactly what is required, and
>> > > > > > > > document it - so
>> > > > > > > > we can stop breaking it.
>> > > > > > > > Or maybe we need some unit tests.....
>> > > > > > > > 
>> > > > > > > 
>> > > > > > > Yes, laying out what's necessary for this would be very helpful.
>> > > > > > > We
>> > > > > > > clearly want to set the flag when an error occurs. Under what
>> > > > > > > circumstances should we be clearing it?
>> > > > > > 
>> > > > > > Well.... looking back at  7b159fc18d417980f57ae which introduced
>> > > > > > the
>> > > > > > flag, prior to that write errors (ctx->error) were only reported by
>> > > > > > nfs_file_flush and nfs_fsync, so only one close() and fsync().
>> > > > > > 
>> > > > > > After that commit, setting the flag would mean that errors could be
>> > > > > > returned by 'write'.  So clearing as part of returning the error
>> > > > > > makes
>> > > > > > perfect sense.
>> > > > > > 
>> > > > > > As long as the error gets recorded, and gets returned when it is
>> > > > > > recorded, it doesn't much matter when the flag is cleared.  With
>> > > > > > your
>> > > > > > patches we don't need to flag any more to get errors reliably
>> > > > > > reported.
>> > > > > > 
>> > > > > > Leaving the flag set means that writes go more slowly - we don't
>> > > > > > get
>> > > > > > large queue of background rights building up but destined for
>> > > > > > failure.
>> > > > > > This is the main point made in the comment message when the flag
>> > > > > > was
>> > > > > > introduced.
>> > > > > > Of course, by the time we first get an error there could already
>> > > > > > by a large queue, so we probably want that to drain completely
>> > > > > > before
>> > > > > > allowing async writes again.
>> > > > 
>> > > > We already have this functionality implemented in the existing code.
>> > > > 
>> > > > > > 
>> > > > > > It might make sense to have 2 flags.  One which says "writes should
>> > > > > > be
>> > > > > > synchronous", another that says "There was an error recently".
>> > > > > > We clear the error flag before calling nfs_fsync, and if it is
>> > > > > > still
>> > > > > > clear afterwards, we clear the sync-writes flag.  Maybe that is
>> > > > > > more
>> > > > > > complex than needed though.
>> > > > > > 
>> > > > 
>> > > > We also need to preserve the NFS_CONTEXT_RESEND_WRITES flag. I don't
>> > > > see any global mechanism that will replace that.
>> > > > 
>> > > > > > I'm leaning towards your suggestion that it doesn't matter very
>> > > > > > much
>> > > > > > when it gets cleared, and clearing it on any successful write is
>> > > > > > simplest.
>> > > > > > 
>> > > > > > So I'm still in favor of using nfs_context_set_write_error() in
>> > > > > > nfs_pageio_add_request(), primarily because it is most consistent -
>> > > > > > we
>> > > > > > don't need exceptions.
>> > > > > 
>> > > > > Thanks for taking a closer look. I can easily make the change above,
>> > > > > and
>> > > > > I do think that keeping this mechanism as simple as possible will
>> > > > > make
>> > > > > it easier to prevent bitrot.
>> > > > > 
>> > > > > That said... NFS_CONTEXT_ERROR_WRITE is a per ctx flag, and the ctx
>> > > > > is a
>> > > > > per open file description object.
>> > > > > 
>> > > > > Is that the correct way to track this? All of the ctx's will share
>> > > > > the
>> > > > > same inode. If we're getting writeback errors for one context, it's
>> > > > > quite likely that we'll be seeing them via others.
>> > > > > 
>> > > > > I suppose the counterargument is when we have things like expiring
>> > > > > krb5
>> > > > > tickets. Write failures via an expiring set of creds may have no
>> > > > > effect
>> > > > > on writeback via other creds.
>> > > > > 
>> > > > > Still, I think a per-inode flag might make more sense here.
>> > > > > 
>> > > > > Thoughts?
>> > > > 
>> > > > As far as I'm concerned, that would be a regression. The most common
>> > > > problem when flushing writeback data to the server aside from ENOSPC
>> > > > (and possibly ESTALE) is EACCES, which is particular to the file
>> > > > descriptor that opened the file.
>> > > > 
>> > > > File contexts, and NFS_CONTEXT_ERROR_WRITE solve that problem by being
>> > > > private to the file descriptor.
>> > > 
>> > > Thanks for the reminder that errors are per-context and this patch drops
>> > > this.  The per-context nature of errors in NFS was the reason that I
>> > > nagged Jeff to make errseq_t a stand-alone type rather than just a part
>> > > of address_space.  I had envisaged that it would be embedded in the
>> > > open_context as well.
>> > > We still could do that, but as there is precisely one open-file for each
>> > > open_context, the gains are not great.
>> > > 
>> > > However, while looking over the code to make sure I really understood it
>> > > and all the possible consequences of changing to errseq_t I found a few
>> > > anomalies.  The patch below addresses them all.
>> > > 
>> > > Would you see if they may sense to you?
>> > > 
>> > > Thanks,
>> > > NeilBrown
>> > > 
>> > > 
>> > > From: NeilBrown <neilb@suse.com>
>> > > Date: Mon, 11 Sep 2017 13:15:50 +1000
>> > > Subject: [PATCH] NFS: various changes relating to reporting IO errors.
>> > > 
>> > > 1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
>> > >    They aren't used.
>> > > 
>> > > 2/ Make nfs_context_set_write_error() a "static inline" in internal.h
>> > >    so we can...
>> > > 
>> > > 3/ Use nfs_context_set_write_error() instead of mapping_set_error()
>> > >    if nfs_pageio_add_request() fails before sending any request.
>> > >    NFS generally keeps errors in the open_context, not the mapping,
>> > >    so this is more consistent.
>> > > 
>> > > 4/ If filemap_write_and_write_range() reports any error, still
>> > >    check ctx->error.  The value in ctx->error is likely to be
>> > >    more useful.  As part of this, NFS_CONTEXT_ERROR_WRITE is
>> > >    cleared slightly earlier, before nfs_file_fsync_commit() is called,
>> > >    rather than at the start of that function.
>> > > 
>> > > Signed-off-by: NeilBrown <neilb@suse.com>
>> > > ---
>> > >  fs/nfs/file.c     | 16 ++++++++++------
>> > >  fs/nfs/internal.h |  7 +++++++
>> > >  fs/nfs/pagelist.c |  4 ++--
>> > >  fs/nfs/write.c    |  7 -------
>> > >  4 files changed, 19 insertions(+), 15 deletions(-)
>> > > 
>> > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> > > index af330c31f627..ab324f14081f 100644
>> > > --- a/fs/nfs/file.c
>> > > +++ b/fs/nfs/file.c
>> > > @@ -208,21 +208,19 @@ EXPORT_SYMBOL_GPL(nfs_file_mmap);
>> > >   * fall back to doing a synchronous write.
>> > >   */
>> > >  static int
>> > > -nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end, int datasync)
>> > > +nfs_file_fsync_commit(struct file *file, int datasync)
>> > >  {
>> > >  	struct nfs_open_context *ctx = nfs_file_open_context(file);
>> > >  	struct inode *inode = file_inode(file);
>> > > -	int have_error, do_resend, status;
>> > > +	int do_resend, status;
>> > >  	int ret = 0;
>> > >  
>> > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
>> > >  
>> > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
>> > >  	do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
>> > > -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> > >  	status = nfs_commit_inode(inode, FLUSH_SYNC);
>> > > -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> > > -	if (have_error) {
>> > > +	if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
>> > >  		ret = xchg(&ctx->error, 0);
>> > >  		if (ret)
>> > >  			goto out;
>> > > @@ -247,10 +245,16 @@ nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>> > >  	trace_nfs_fsync_enter(inode);
>> > >  
>> > >  	do {
>> > > +		struct nfs_open_context *ctx = nfs_file_open_context(file);
>> > >  		ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
>> > > +		if (test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
>> > > +			int ret2 = xchg(&ctx->error, 0);
>> > > +			if (ret2)
>> > > +				ret = ret2;
>> > > +		}
>> > >  		if (ret != 0)
>> > >  			break;
>> > > -		ret = nfs_file_fsync_commit(file, start, end, datasync);
>> > > +		ret = nfs_file_fsync_commit(file, datasync);
>> > >  		if (!ret)
>> > >  			ret = pnfs_sync_inode(inode, !!datasync);
>> > >  		/*
>> > > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
>> > > index dc456416d2be..44c8962fec91 100644
>> > > --- a/fs/nfs/internal.h
>> > > +++ b/fs/nfs/internal.h
>> > > @@ -769,3 +769,10 @@ static inline bool nfs_error_is_fatal(int err)
>> > >  		return false;
>> > >  	}
>> > >  }
>> > > +
>> > > +static inline void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
>> > > +{
>> > > +	ctx->error = error;
>> > > +	smp_wmb();
>> > > +	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> > > +}
>> > > diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
>> > > index de9066a92c0d..0ebd26b9a6bd 100644
>> > > --- a/fs/nfs/pagelist.c
>> > > +++ b/fs/nfs/pagelist.c
>> > > @@ -1198,8 +1198,8 @@ out_failed:
>> > >  
>> > >  		/* remember fatal errors */
>> > >  		if (nfs_error_is_fatal(desc->pg_error))
>> > > -			mapping_set_error(desc->pg_inode->i_mapping,
>> > > -					  desc->pg_error);
>> > > +			nfs_context_set_write_error(req->wb_context,
>> > > +						    desc->pg_error);
>> > >  
>> > >  		func = desc->pg_completion_ops->error_cleanup;
>> > >  		for (midx = 0; midx < desc->pg_mirror_count; midx++) {
>> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> > > index b1af5dee5e0a..f702bf2def79 100644
>> > > --- a/fs/nfs/write.c
>> > > +++ b/fs/nfs/write.c
>> > > @@ -147,13 +147,6 @@ static void nfs_io_completion_put(struct nfs_io_completion *ioc)
>> > >  		kref_put(&ioc->refcount, nfs_io_completion_release);
>> > >  }
>> > >  
>> > > -static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
>> > > -{
>> > > -	ctx->error = error;
>> > > -	smp_wmb();
>> > > -	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
>> > > -}
>> > > -
>> > >  /*
>> > >   * nfs_page_find_head_request_locked - find head request associated with @page
>> > >   *
>> > 
>> > This should probably be broken out into at least a 2-3 different
>> > patches.
>> > 
>> > Ok, so to make sure I understand:
>> > 
>> > All writeback is done under the aegis of an open context, and writes
>> > from different open contexts are not mergeable. We also flush to the
>> > server in the case that a dirty page is written via an incompatible open
>> > context. So with that we can always tie
>> 
>> Not quite.  Writes from different open contexts are sometimes mergeable,
>> providing the credential is the same and there are no locks that might
>> get in the way. (nfs_flush_incompatible() gets rid of conflicts writes
>> to the same page as part of nfs_write_begin().
>> When writes are merged, all contexts remain reachable from the request
>> through an 'nfs_page'. nfs_write_completion() iterates over all the
>> nfs_pages attached to the nfs_pgio_header, and sets the context
>> write_error from the hdr->error.
>> 
>
> Ok, by this account, NFS should already have "correct" error reporting
> semantics on fsync. i.e. when the file is written via multiple fds, you
> should get back an error on all fds if those writebacks failed.
>
> I have a test for nfs for the new-style error reporting:
>
>     https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/xfstests-dev.git/log/?h=wberr
>
> The nfs test is still pretty rickety, using soft mounts and iptables to
> cause requests to fail. With the patch I originally proposed, this test
> would pass. When I run this test on normal mainline kernels, it fails:
>
> -------------------------------8<------------------------------
> FSTYP         -- nfs
> PLATFORM      -- Linux/x86_64 wberr 4.12.11-300.fc26.x86_64
> MKFS_OPTIONS  -- knfsdsrv:/export/scratch
> MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 knfsdsrv:/export/scratch /mnt/scratch
>
> nfs/002	 - output mismatch (see /home/jlayton/git/xfstests/results//nfs/002.out.bad)
>     --- tests/nfs/002.out	2017-07-19 13:12:59.354561869 -0400
>     +++ /home/jlayton/git/xfstests/results//nfs/002.out.bad	2017-09-12 11:17:17.335943539 -0400
>     @@ -1,3 +1,3 @@
>      QA output created by 002
>      Format and mount
>     -Test passed!
>     +Success on second fsync on fd[1]!

Interesting.
I haven't reviewed the kernel code yet, but having looked at the test
code I see that the one file is opened 10 time and that the *same* block
in the file (65k?) is written via each fd.
If I write to a file, and then someone else over-writes all the bytes
that I wrote, then the page is written to the server and gets an error,
then you could argue that as none of the bytes I wrote were involved in
the error, I don't need to see the error status - someone else has taken
on responsibility for that range.

Maybe the test should/could write a few bytes with a different offset for each
fd ??

NeilBrown


>     ...
>     (Run 'diff -u tests/nfs/002.out /home/jlayton/git/xfstests/results//nfs/002.out.bad'  to see the entire diff)
> Ran: nfs/002
> Failures: nfs/002
> Failed 1 of 1 tests
> -------------------------------8<------------------------------
>
> I'm not sure that errors are really propagated to all struct files like
> you suggest above. I'll plan to look a little more closely at what's
> happening here, when I get some time.
>
>> > 
>> > In that case, yes...mixing in errseq_t doesn't really buy us much here,
>> > and I agree with most of the changes above.
>> > 
>> > That said...I'm still not thrilled with how NFS_CONTEXT_ERROR_WRITE is
>> > handled in this code. That flag is set when a write fails, but is only
>> > cleared on fsync.
>> > 
>> > That seems wrong to me. Why wait for an fsync to start doing async
>> > writes again once they start working? What if the application never does
>> > an fsync? Clearing that flag on a successful WRITE seems like it'd make
>> > more sense.
>> 
>> We don't really 'wait' for an fsync.  Having NFS_CONTEXT_ERROR_WRITE
>> means that the very next write will force an fsync
>> (nfs_need_check_write()).  So we really just wait for the next write.
>> The current code doesn't seem "obviously right" to me, but it isn't
>> "obviously wrong" either, and I can only make it obviously right to me
>> by making it more complex, and I don't think I can justify that.
>
> Thanks for pointing this out. I missed the bit about it forcing the
> fsync when this fails. I agree that that should be fine.
>
> -- 
> Jeff Layton <jlayton@redhat.com>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] nfs: track writeback errors with errseq_t
  2017-09-12 21:47                         ` NeilBrown
@ 2017-09-13 12:23                           ` Jeff Layton
  2017-09-13 23:50                             ` [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2017-09-13 12:23 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, anna.schumaker, jlayton
  Cc: linux-nfs, linux-fsdevel

On Wed, 2017-09-13 at 07:47 +1000, NeilBrown wrote:
> On Tue, Sep 12 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-09-12 at 07:52 +1000, NeilBrown wrote:
> > > > On Mon, 2017-09-11 at 13:24 +1000, NeilBrown wrote:
> > > > > On Thu, Sep 07 2017, Trond Myklebust wrote:
> > > > > 
> > > > > > On Thu, 2017-09-07 at 07:35 -0400, Jeff Layton wrote:
> > > > > > > On Thu, 2017-09-07 at 13:37 +1000, NeilBrown wrote:
> > > > > > > > On Tue, Aug 29 2017, Jeff Layton wrote:
> > > > > > > > 
> > > > > > > > > On Tue, 2017-08-29 at 11:23 +1000, NeilBrown wrote:
> > > > > > > > > > On Mon, Aug 28 2017, Jeff Layton wrote:
> > > > > > > > > > 
> > > > > > > > > > > On Mon, 2017-08-28 at 09:24 +1000, NeilBrown wrote:
> > > > > > > > > > > > On Fri, Aug 25 2017, Jeff Layton wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > > On Thu, 2017-07-20 at 15:42 -0400, Jeff Layton wrote:
> > > > > > > > > > > > > > From: Jeff Layton <jlayton@redhat.com>
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > There is some ambiguity in nfs about how writeback
> > > > > > > > > > > > > > errors are
> > > > > > > > > > > > > > tracked.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > For instance, nfs_pageio_add_request calls
> > > > > > > > > > > > > > mapping_set_error when
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > add fails, but we track errors that occur after adding
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > request
> > > > > > > > > > > > > > with a dedicated int error in the open context.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Now that we have better infrastructure for the vfs
> > > > > > > > > > > > > > layer, this
> > > > > > > > > > > > > > latter int is now unnecessary. Just have
> > > > > > > > > > > > > > nfs_context_set_write_error set
> > > > > > > > > > > > > > the error in the mapping when one occurs.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Have NFS use file_write_and_wait_range to initiate and
> > > > > > > > > > > > > > wait on
> > > > > > > > > > > > > > writeback
> > > > > > > > > > > > > > of the data, and then check again after issuing the
> > > > > > > > > > > > > > commit(s).
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > With this, we also don't need to pay attention to the
> > > > > > > > > > > > > > ERROR_WRITE
> > > > > > > > > > > > > > flag for reporting, and just clear it to indicate to
> > > > > > > > > > > > > > subsequent
> > > > > > > > > > > > > > writers that they should try to go asynchronous again.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > In nfs_page_async_flush, sample the error before
> > > > > > > > > > > > > > locking and
> > > > > > > > > > > > > > joining
> > > > > > > > > > > > > > the requests, and check for errors since that point.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >  fs/nfs/file.c          | 24 +++++++++++-------------
> > > > > > > > > > > > > >  fs/nfs/inode.c         |  3 +--
> > > > > > > > > > > > > >  fs/nfs/write.c         |  8 ++++++--
> > > > > > > > > > > > > >  include/linux/nfs_fs.h |  1 -
> > > > > > > > > > > > > >  4 files changed, 18 insertions(+), 18 deletions(-)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I have a baling wire and duct tape solution for testing
> > > > > > > > > > > > > > this with
> > > > > > > > > > > > > > xfstests (using iptables REJECT targets and soft
> > > > > > > > > > > > > > mounts). This
> > > > > > > > > > > > > > seems to
> > > > > > > > > > > > > > make nfs do the right thing.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > > > > > > > > > > > > index 5713eb32a45e..15d3c6faafd3 100644
> > > > > > > > > > > > > > --- a/fs/nfs/file.c
> > > > > > > > > > > > > > +++ b/fs/nfs/file.c
> > > > > > > > > > > > > > @@ -212,25 +212,23 @@ nfs_file_fsync_commit(struct file
> > > > > > > > > > > > > > *file,
> > > > > > > > > > > > > > loff_t start, loff_t end, int datasync)
> > > > > > > > > > > > > >  {
> > > > > > > > > > > > > >  	struct nfs_open_context *ctx =
> > > > > > > > > > > > > > nfs_file_open_context(file);
> > > > > > > > > > > > > >  	struct inode *inode = file_inode(file);
> > > > > > > > > > > > > > -	int have_error, do_resend, status;
> > > > > > > > > > > > > > -	int ret = 0;
> > > > > > > > > > > > > > +	int do_resend, status;
> > > > > > > > > > > > > > +	int ret;
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n",
> > > > > > > > > > > > > > file,
> > > > > > > > > > > > > > datasync);
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
> > > > > > > > > > > > > >  	do_resend =
> > > > > > > > > > > > > > test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx-
> > > > > > > > > > > > > > > flags);
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > -	have_error =
> > > > > > > > > > > > > > test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE,
> > > > > > > > > > > > > > &ctx->flags);
> > > > > > > > > > > > > > -	status = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > > > > > > > > > > -	have_error |=
> > > > > > > > > > > > > > test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > > > > > > > > > > flags);
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > -	if (have_error) {
> > > > > > > > > > > > > > -		ret = xchg(&ctx->error, 0);
> > > > > > > > > > > > > > -		if (ret)
> > > > > > > > > > > > > > -			goto out;
> > > > > > > > > > > > > > -	}
> > > > > > > > > > > > > > -	if (status < 0) {
> > > > > > > > > > > > > > +	clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx-
> > > > > > > > > > > > > > > flags);
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > +	ret = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +	/* Recheck and advance after the commit */
> > > > > > > > > > > > > > +	status = file_check_and_advance_wb_err(file);
> > > > > > > > > > > > 
> > > > > > > > > > > > This change makes the code inconsistent with the comment
> > > > > > > > > > > > above the
> > > > > > > > > > > > function, which still references ctx->error.  The intent of
> > > > > > > > > > > > the
> > > > > > > > > > > > comment
> > > > > > > > > > > > is still correct, but the details have changed.
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Good catch. I'll fix that up in a respin.
> > > > > > > > > > > 
> > > > > > > > > > > > Also, there is a call to mapping_set_error() in
> > > > > > > > > > > > nfs_pageio_add_request().
> > > > > > > > > > > > I wonder if that should be changed to
> > > > > > > > > > > >   nfs_context_set_write_error(req->wb_context, desc-
> > > > > > > > > > > > > pg_error)
> > > > > > > > > > > > 
> > > > > > > > > > > > ??
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Trickier question...
> > > > > > > > > > > 
> > > > > > > > > > > I'm not quite sure what semantics we're looking for with
> > > > > > > > > > > NFS_CONTEXT_ERROR_WRITE. I know that it forces writes to be
> > > > > > > > > > > synchronous, but I'm not quite sure why it gets cleared the
> > > > > > > > > > > way it
> > > > > > > > > > > does. It's set on any error but cleared before issuing a
> > > > > > > > > > > commit.
> > > > > > > > > > > 
> > > > > > > > > > > I added a similar flag to Ceph inodes recently, but only
> > > > > > > > > > > clear it when
> > > > > > > > > > > a write succeeds. Wouldn't that make more sense here as well?
> > > > > > > > > > 
> > > > > > > > > > It is a bit hard to wrap one's mind around.
> > > > > > > > > > 
> > > > > > > > > > In the original code (commit 7b159fc18d417980) it looks like:
> > > > > > > > > >  - test-and-clear bit
> > > > > > > > > >  - write and sync
> > > > > > > > > >  - test-bit
> > > > > > > > > > 
> > > > > > > > > > This does, I think, seem safer than "clear on successful write"
> > > > > > > > > > as the
> > > > > > > > > > writes could complete out-of-order and I wouldn't be surprised
> > > > > > > > > > if the
> > > > > > > > > > unsuccessful ones completed with an error before the successful
> > > > > > > > > > one -
> > > > > > > > > > particularly with an error like EDQUOT.
> > > > > > > > > > 
> > > > > > > > > > However the current code does the writes before the test-and-
> > > > > > > > > > clear, and
> > > > > > > > > > only does the commit afterwards.  That makes it less clear why
> > > > > > > > > > the
> > > > > > > > > > current sequence is a good idea.
> > > > > > > > > > 
> > > > > > > > > > However ... nfs_file_fsync_commit() is only called if
> > > > > > > > > > filemap_write_and_wait_range() returned with success, so we
> > > > > > > > > > only clear
> > > > > > > > > > the flag after successful writes(?).
> > > > > > > > > > 
> > > > > > > > > > Oh....
> > > > > > > > > > This patch from me:
> > > > > > > > > > 
> > > > > > > > > > Commit: 2edb6bc3852c ("NFS - fix recent breakage to NFS error
> > > > > > > > > > handling.")
> > > > > > > > > > 
> > > > > > > > > > seems to have been reverted by
> > > > > > > > > > 
> > > > > > > > > > Commit: 7b281ee02655 ("NFS: fsync() must exit with an error if
> > > > > > > > > > page writeback failed")
> > > > > > > > > > 
> > > > > > > > > > which probably isn't good.  It appears that this code is very
> > > > > > > > > > fragile
> > > > > > > > > > and easily broken.
> > > > > > > > 
> > > > > > > > On further investigation, I think the problem that I fixed and then
> > > > > > > > we
> > > > > > > > reintroduced will be fixed again - more permanently - by your
> > > > > > > > patch.
> > > > > > > > The root problem is that nfs keeps error codes in a different way
> > > > > > > > to the
> > > > > > > > MM core.  By unifying those, the problem goes.
> > > > > > > > (The specific problem is that writes which hit EDQUOT on the server
> > > > > > > > can
> > > > > > > >  report EIO on the client).
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > Maybe we need to work out exactly what is required, and
> > > > > > > > > > document it - so
> > > > > > > > > > we can stop breaking it.
> > > > > > > > > > Or maybe we need some unit tests.....
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Yes, laying out what's necessary for this would be very helpful.
> > > > > > > > > We
> > > > > > > > > clearly want to set the flag when an error occurs. Under what
> > > > > > > > > circumstances should we be clearing it?
> > > > > > > > 
> > > > > > > > Well.... looking back at  7b159fc18d417980f57ae which introduced
> > > > > > > > the
> > > > > > > > flag, prior to that write errors (ctx->error) were only reported by
> > > > > > > > nfs_file_flush and nfs_fsync, so only one close() and fsync().
> > > > > > > > 
> > > > > > > > After that commit, setting the flag would mean that errors could be
> > > > > > > > returned by 'write'.  So clearing as part of returning the error
> > > > > > > > makes
> > > > > > > > perfect sense.
> > > > > > > > 
> > > > > > > > As long as the error gets recorded, and gets returned when it is
> > > > > > > > recorded, it doesn't much matter when the flag is cleared.  With
> > > > > > > > your
> > > > > > > > patches we don't need to flag any more to get errors reliably
> > > > > > > > reported.
> > > > > > > > 
> > > > > > > > Leaving the flag set means that writes go more slowly - we don't
> > > > > > > > get
> > > > > > > > large queue of background rights building up but destined for
> > > > > > > > failure.
> > > > > > > > This is the main point made in the comment message when the flag
> > > > > > > > was
> > > > > > > > introduced.
> > > > > > > > Of course, by the time we first get an error there could already
> > > > > > > > by a large queue, so we probably want that to drain completely
> > > > > > > > before
> > > > > > > > allowing async writes again.
> > > > > > 
> > > > > > We already have this functionality implemented in the existing code.
> > > > > > 
> > > > > > > > 
> > > > > > > > It might make sense to have 2 flags.  One which says "writes should
> > > > > > > > be
> > > > > > > > synchronous", another that says "There was an error recently".
> > > > > > > > We clear the error flag before calling nfs_fsync, and if it is
> > > > > > > > still
> > > > > > > > clear afterwards, we clear the sync-writes flag.  Maybe that is
> > > > > > > > more
> > > > > > > > complex than needed though.
> > > > > > > > 
> > > > > > 
> > > > > > We also need to preserve the NFS_CONTEXT_RESEND_WRITES flag. I don't
> > > > > > see any global mechanism that will replace that.
> > > > > > 
> > > > > > > > I'm leaning towards your suggestion that it doesn't matter very
> > > > > > > > much
> > > > > > > > when it gets cleared, and clearing it on any successful write is
> > > > > > > > simplest.
> > > > > > > > 
> > > > > > > > So I'm still in favor of using nfs_context_set_write_error() in
> > > > > > > > nfs_pageio_add_request(), primarily because it is most consistent -
> > > > > > > > we
> > > > > > > > don't need exceptions.
> > > > > > > 
> > > > > > > Thanks for taking a closer look. I can easily make the change above,
> > > > > > > and
> > > > > > > I do think that keeping this mechanism as simple as possible will
> > > > > > > make
> > > > > > > it easier to prevent bitrot.
> > > > > > > 
> > > > > > > That said... NFS_CONTEXT_ERROR_WRITE is a per ctx flag, and the ctx
> > > > > > > is a
> > > > > > > per open file description object.
> > > > > > > 
> > > > > > > Is that the correct way to track this? All of the ctx's will share
> > > > > > > the
> > > > > > > same inode. If we're getting writeback errors for one context, it's
> > > > > > > quite likely that we'll be seeing them via others.
> > > > > > > 
> > > > > > > I suppose the counterargument is when we have things like expiring
> > > > > > > krb5
> > > > > > > tickets. Write failures via an expiring set of creds may have no
> > > > > > > effect
> > > > > > > on writeback via other creds.
> > > > > > > 
> > > > > > > Still, I think a per-inode flag might make more sense here.
> > > > > > > 
> > > > > > > Thoughts?
> > > > > > 
> > > > > > As far as I'm concerned, that would be a regression. The most common
> > > > > > problem when flushing writeback data to the server aside from ENOSPC
> > > > > > (and possibly ESTALE) is EACCES, which is particular to the file
> > > > > > descriptor that opened the file.
> > > > > > 
> > > > > > File contexts, and NFS_CONTEXT_ERROR_WRITE solve that problem by being
> > > > > > private to the file descriptor.
> > > > > 
> > > > > Thanks for the reminder that errors are per-context and this patch drops
> > > > > this.  The per-context nature of errors in NFS was the reason that I
> > > > > nagged Jeff to make errseq_t a stand-alone type rather than just a part
> > > > > of address_space.  I had envisaged that it would be embedded in the
> > > > > open_context as well.
> > > > > We still could do that, but as there is precisely one open-file for each
> > > > > open_context, the gains are not great.
> > > > > 
> > > > > However, while looking over the code to make sure I really understood it
> > > > > and all the possible consequences of changing to errseq_t I found a few
> > > > > anomalies.  The patch below addresses them all.
> > > > > 
> > > > > Would you see if they may sense to you?
> > > > > 
> > > > > Thanks,
> > > > > NeilBrown
> > > > > 
> > > > > 
> > > > > From: NeilBrown <neilb@suse.com>
> > > > > Date: Mon, 11 Sep 2017 13:15:50 +1000
> > > > > Subject: [PATCH] NFS: various changes relating to reporting IO errors.
> > > > > 
> > > > > 1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
> > > > >    They aren't used.
> > > > > 
> > > > > 2/ Make nfs_context_set_write_error() a "static inline" in internal.h
> > > > >    so we can...
> > > > > 
> > > > > 3/ Use nfs_context_set_write_error() instead of mapping_set_error()
> > > > >    if nfs_pageio_add_request() fails before sending any request.
> > > > >    NFS generally keeps errors in the open_context, not the mapping,
> > > > >    so this is more consistent.
> > > > > 
> > > > > 4/ If filemap_write_and_write_range() reports any error, still
> > > > >    check ctx->error.  The value in ctx->error is likely to be
> > > > >    more useful.  As part of this, NFS_CONTEXT_ERROR_WRITE is
> > > > >    cleared slightly earlier, before nfs_file_fsync_commit() is called,
> > > > >    rather than at the start of that function.
> > > > > 
> > > > > Signed-off-by: NeilBrown <neilb@suse.com>
> > > > > ---
> > > > >  fs/nfs/file.c     | 16 ++++++++++------
> > > > >  fs/nfs/internal.h |  7 +++++++
> > > > >  fs/nfs/pagelist.c |  4 ++--
> > > > >  fs/nfs/write.c    |  7 -------
> > > > >  4 files changed, 19 insertions(+), 15 deletions(-)
> > > > > 
> > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > > > index af330c31f627..ab324f14081f 100644
> > > > > --- a/fs/nfs/file.c
> > > > > +++ b/fs/nfs/file.c
> > > > > @@ -208,21 +208,19 @@ EXPORT_SYMBOL_GPL(nfs_file_mmap);
> > > > >   * fall back to doing a synchronous write.
> > > > >   */
> > > > >  static int
> > > > > -nfs_file_fsync_commit(struct file *file, loff_t start, loff_t end, int datasync)
> > > > > +nfs_file_fsync_commit(struct file *file, int datasync)
> > > > >  {
> > > > >  	struct nfs_open_context *ctx = nfs_file_open_context(file);
> > > > >  	struct inode *inode = file_inode(file);
> > > > > -	int have_error, do_resend, status;
> > > > > +	int do_resend, status;
> > > > >  	int ret = 0;
> > > > >  
> > > > >  	dprintk("NFS: fsync file(%pD2) datasync %d\n", file, datasync);
> > > > >  
> > > > >  	nfs_inc_stats(inode, NFSIOS_VFSFSYNC);
> > > > >  	do_resend = test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags);
> > > > > -	have_error = test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > > >  	status = nfs_commit_inode(inode, FLUSH_SYNC);
> > > > > -	have_error |= test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > > > -	if (have_error) {
> > > > > +	if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
> > > > >  		ret = xchg(&ctx->error, 0);
> > > > >  		if (ret)
> > > > >  			goto out;
> > > > > @@ -247,10 +245,16 @@ nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> > > > >  	trace_nfs_fsync_enter(inode);
> > > > >  
> > > > >  	do {
> > > > > +		struct nfs_open_context *ctx = nfs_file_open_context(file);
> > > > >  		ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > > > > +		if (test_and_clear_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags)) {
> > > > > +			int ret2 = xchg(&ctx->error, 0);
> > > > > +			if (ret2)
> > > > > +				ret = ret2;
> > > > > +		}
> > > > >  		if (ret != 0)
> > > > >  			break;
> > > > > -		ret = nfs_file_fsync_commit(file, start, end, datasync);
> > > > > +		ret = nfs_file_fsync_commit(file, datasync);
> > > > >  		if (!ret)
> > > > >  			ret = pnfs_sync_inode(inode, !!datasync);
> > > > >  		/*
> > > > > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> > > > > index dc456416d2be..44c8962fec91 100644
> > > > > --- a/fs/nfs/internal.h
> > > > > +++ b/fs/nfs/internal.h
> > > > > @@ -769,3 +769,10 @@ static inline bool nfs_error_is_fatal(int err)
> > > > >  		return false;
> > > > >  	}
> > > > >  }
> > > > > +
> > > > > +static inline void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
> > > > > +{
> > > > > +	ctx->error = error;
> > > > > +	smp_wmb();
> > > > > +	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > > > +}
> > > > > diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
> > > > > index de9066a92c0d..0ebd26b9a6bd 100644
> > > > > --- a/fs/nfs/pagelist.c
> > > > > +++ b/fs/nfs/pagelist.c
> > > > > @@ -1198,8 +1198,8 @@ out_failed:
> > > > >  
> > > > >  		/* remember fatal errors */
> > > > >  		if (nfs_error_is_fatal(desc->pg_error))
> > > > > -			mapping_set_error(desc->pg_inode->i_mapping,
> > > > > -					  desc->pg_error);
> > > > > +			nfs_context_set_write_error(req->wb_context,
> > > > > +						    desc->pg_error);
> > > > >  
> > > > >  		func = desc->pg_completion_ops->error_cleanup;
> > > > >  		for (midx = 0; midx < desc->pg_mirror_count; midx++) {
> > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > index b1af5dee5e0a..f702bf2def79 100644
> > > > > --- a/fs/nfs/write.c
> > > > > +++ b/fs/nfs/write.c
> > > > > @@ -147,13 +147,6 @@ static void nfs_io_completion_put(struct nfs_io_completion *ioc)
> > > > >  		kref_put(&ioc->refcount, nfs_io_completion_release);
> > > > >  }
> > > > >  
> > > > > -static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
> > > > > -{
> > > > > -	ctx->error = error;
> > > > > -	smp_wmb();
> > > > > -	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
> > > > > -}
> > > > > -
> > > > >  /*
> > > > >   * nfs_page_find_head_request_locked - find head request associated with @page
> > > > >   *
> > > > 
> > > > This should probably be broken out into at least a 2-3 different
> > > > patches.
> > > > 
> > > > Ok, so to make sure I understand:
> > > > 
> > > > All writeback is done under the aegis of an open context, and writes
> > > > from different open contexts are not mergeable. We also flush to the
> > > > server in the case that a dirty page is written via an incompatible open
> > > > context. So with that we can always tie
> > > 
> > > Not quite.  Writes from different open contexts are sometimes mergeable,
> > > providing the credential is the same and there are no locks that might
> > > get in the way. (nfs_flush_incompatible() gets rid of conflicts writes
> > > to the same page as part of nfs_write_begin().
> > > When writes are merged, all contexts remain reachable from the request
> > > through an 'nfs_page'. nfs_write_completion() iterates over all the
> > > nfs_pages attached to the nfs_pgio_header, and sets the context
> > > write_error from the hdr->error.
> > > 
> > 
> > Ok, by this account, NFS should already have "correct" error reporting
> > semantics on fsync. i.e. when the file is written via multiple fds, you
> > should get back an error on all fds if those writebacks failed.
> > 
> > I have a test for nfs for the new-style error reporting:
> > 
> >     https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/xfstests-dev.git/log/?h=wberr
> > 
> > The nfs test is still pretty rickety, using soft mounts and iptables to
> > cause requests to fail. With the patch I originally proposed, this test
> > would pass. When I run this test on normal mainline kernels, it fails:
> > 
> > -------------------------------8<------------------------------
> > FSTYP         -- nfs
> > PLATFORM      -- Linux/x86_64 wberr 4.12.11-300.fc26.x86_64
> > MKFS_OPTIONS  -- knfsdsrv:/export/scratch
> > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 knfsdsrv:/export/scratch /mnt/scratch
> > 
> > nfs/002	 - output mismatch (see /home/jlayton/git/xfstests/results//nfs/002.out.bad)
> >     --- tests/nfs/002.out	2017-07-19 13:12:59.354561869 -0400
> >     +++ /home/jlayton/git/xfstests/results//nfs/002.out.bad	2017-09-12 11:17:17.335943539 -0400
> >     @@ -1,3 +1,3 @@
> >      QA output created by 002
> >      Format and mount
> >     -Test passed!
> >     +Success on second fsync on fd[1]!
> 
> Interesting.
> I haven't reviewed the kernel code yet, but having looked at the test
> code I see that the one file is opened 10 time and that the *same* block
> in the file (65k?) is written via each fd.
> If I write to a file, and then someone else over-writes all the bytes
> that I wrote, then the page is written to the server and gets an error,
> then you could argue that as none of the bytes I wrote were involved in
> the error, I don't need to see the error status - someone else has taken
> on responsibility for that range.
> 
> Maybe the test should/could write a few bytes with a different offset for each
> fd ??
> 
> NeilBrown
> 

Yes, that seems to do the trick! I'll send out a patch to xfstests to
change that in the test there as it's probably more representative of a
real workload.

Ok, so I guess from both this result and starting at the code for a bit
that the nfs client will replace one nfs_page with another if the second
one covers the entire range of the first.

We should note that that behavior is a little different from other
filesystems that use errseq_t now. If two tasks are writing to the same
file, and one attempts to overwrite the other's range, then the first
one will not see an error on writeback.

Maybe the above is worth a blurb in vfs.txt?

I do (idly) wonder if that behavior could be exploited in some fashion.
A process running in a different container that has write access to a
file could potentially mask writeback errors for a task running in a
different container. It's probably not useful on its own to an attacker,
but could be part of a wider exploit? In any case, it doesn't seem like
a big deal.

Thanks for all of the patience here. I think your patch looks fine as
well. You can add:

Reviewed-by: Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes
  2017-09-13 12:23                           ` Jeff Layton
@ 2017-09-13 23:50                             ` NeilBrown
  2017-09-14 10:48                               ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2017-09-13 23:50 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages),
	Jeff Layton, Trond Myklebust, anna.schumaker, jlayton
  Cc: linux-man, linux-nfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 4344 bytes --]


Since 4.13, errors from writeback are more reliably reported
to all file descriptors that might be relevant.

Add notes to this effect, and also add details about ENOSPC and EDQUOT
which can be delayed in a similar manner to EIO - for NFS in particular.

Signed-off-by: NeilBrown <neilb@suse.com>
---

This is my summary of recent changes, and details that have been made
clear during the exploration of those changes.

I haven't mentioned the fact that EPERM can be returned by
write/fsync/close on NFS if the permissions on the server are changed.
We probably should ... are there other errors that are worth mentioning
along with EPERM, ENOSPC, EDQUOT ??

Thanks,
NeilBronw


 man2/close.2 |  9 +++++++++
 man2/fsync.2 | 19 ++++++++++++++++++-
 man2/write.2 | 20 +++++++++++++++++---
 3 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/man2/close.2 b/man2/close.2
index 751ec322b1f1..9776c839b8b6 100644
--- a/man2/close.2
+++ b/man2/close.2
@@ -82,6 +82,15 @@ call was interrupted by a signal; see
 .TP
 .B EIO
 An I/O error occurred.
+.TP
+.BR ENOSPC ", " EDQUOT
+On NFS, these errors are not normally reported against the first write
+which exceeds the available storage space, but instead against a
+subsequent
+.BR write (2),
+.BR fsync (2),
+or
+.BR close (2).
 .PP
 See NOTES for a discussion of why
 .BR close ()
diff --git a/man2/fsync.2 b/man2/fsync.2
index f1a01301da0f..e706a08d360d 100644
--- a/man2/fsync.2
+++ b/man2/fsync.2
@@ -120,12 +120,29 @@ is set appropriately.
 is not a valid open file descriptor.
 .TP
 .B EIO
-An error occurred during synchronization.
+An error occurred during synchronization.  This error may relate
+to data written to some other file descriptor on the same file.
+.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
+Since Linux 4.13 errors from write-back will be reported to
+all file descriptors that might have written the data which triggered
+the error, and which are still open.  Some filesystems (e.g. NFS)
+keep close track of which data came through which file descriptor,
+and give more precise reporting.  Other filesystems (e.g. most local
+filesystems) will report errors to all file descriptors on the same
+file.
 .TP
 .BR EROFS ", " EINVAL
 .I fd
 is bound to a special file (e.g., a pipe, FIFO, or socket)
 which does not support synchronization.
+.TP
+.BR ENOSPC ", " EDQUOT
+.I fd
+is bound to a file on NFS or another filesystem which does not allocate
+space at the time of a
+.BR write (2)
+system call, and some previous write failed due to insufficient
+storage space.
 .SH CONFORMING TO
 POSIX.1-2001, POSIX.1-2008, 4.3BSD.
 .SH AVAILABILITY
diff --git a/man2/write.2 b/man2/write.2
index 6a39b5b5541d..1a9a86b03b04 100644
--- a/man2/write.2
+++ b/man2/write.2
@@ -47,7 +47,7 @@ write \- write to a file descriptor
 .BR write ()
 writes up to
 .I count
-bytes from the buffer pointed
+bytes from the buffer starting at
 .I buf
 to the file referred to by the file descriptor
 .IR fd .
@@ -181,6 +181,14 @@ or the file offset is not suitably aligned.
 .TP
 .B EIO
 A low-level I/O error occurred while modifying the inode.
+This error may relate to data written by an earlier
+.BR write (2),
+which may have been issued to a different file descriptor on
+the same file.  Since Linux 4.13 errors from write-back will
+be reported to all file descriptors that might have
+written the data which triggered the error, and which are still
+open.
+.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
 .TP
 .B ENOSPC
 The device containing the file referred to by
@@ -222,8 +230,14 @@ unsigned and signed integer data types specified by POSIX.1.
 A successful return from
 .BR write ()
 does not make any guarantee that data has been committed to disk.
-In fact, on some buggy implementations, it does not even guarantee
-that space has successfully been reserved for the data.
+On some filesystems, including NFS, it does not even guarantee
+that space has successfully been reserved for the data.  In the case,
+some errors might be delayed to a future
+.BR write (2)
+or to
+.BR fsync (2)
+or even
+.BR close (2).
 The only way to be sure is to call
 .BR fsync (2)
 after you are done writing all your data.
-- 
2.14.0.rc0.dirty


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes
  2017-09-13 23:50                             ` [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes NeilBrown
@ 2017-09-14 10:48                               ` Jeff Layton
  2017-09-15  7:50                                 ` Michael Kerrisk (man-pages)
  2017-09-28  3:01                                 ` NeilBrown
  0 siblings, 2 replies; 24+ messages in thread
From: Jeff Layton @ 2017-09-14 10:48 UTC (permalink / raw)
  To: NeilBrown, Michael Kerrisk (man-pages),
	Trond Myklebust, anna.schumaker, jlayton
  Cc: linux-man, linux-nfs, linux-fsdevel

On Thu, 2017-09-14 at 09:50 +1000, NeilBrown wrote:
> Since 4.13, errors from writeback are more reliably reported
> to all file descriptors that might be relevant.
> 
> Add notes to this effect, and also add details about ENOSPC and EDQUOT
> which can be delayed in a similar manner to EIO - for NFS in particular.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> 
> This is my summary of recent changes, and details that have been made
> clear during the exploration of those changes.
> 
> I haven't mentioned the fact that EPERM can be returned by
> write/fsync/close on NFS if the permissions on the server are changed.
> We probably should ... are there other errors that are worth mentioning
> along with EPERM, ENOSPC, EDQUOT ??
> 
> Thanks,
> NeilBronw
> 

Many thanks for doing this! It was on my to-do list. Comments below:

> 
>  man2/close.2 |  9 +++++++++
>  man2/fsync.2 | 19 ++++++++++++++++++-
>  man2/write.2 | 20 +++++++++++++++++---
>  3 files changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/man2/close.2 b/man2/close.2
> index 751ec322b1f1..9776c839b8b6 100644
> --- a/man2/close.2
> +++ b/man2/close.2
> @@ -82,6 +82,15 @@ call was interrupted by a signal; see
>  .TP
>  .B EIO
>  An I/O error occurred.
> +.TP
> +.BR ENOSPC ", " EDQUOT
> +On NFS, these errors are not normally reported against the first write
> +which exceeds the available storage space, but instead against a
> +subsequent
> +.BR write (2),
> +.BR fsync (2),
> +or
> +.BR close (2).
>  .PP
>  See NOTES for a discussion of why
>  .BR close ()
> diff --git a/man2/fsync.2 b/man2/fsync.2
> index f1a01301da0f..e706a08d360d 100644
> --- a/man2/fsync.2
> +++ b/man2/fsync.2
> @@ -120,12 +120,29 @@ is set appropriately.
>  is not a valid open file descriptor.
>  .TP
>  .B EIO
> -An error occurred during synchronization.
> +An error occurred during synchronization.  This error may relate
> +to data written to some other file descriptor on the same file.
> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
> +Since Linux 4.13 errors from write-back will be reported to
> +all file descriptors that might have written the data which triggered
> +the error, and which are still open.

This is a little awkward. How could we report to a fd that was no longer
open? How about:

"Since Linux 4.13, errors from write-back will be reported to all file
descriptors that were open at the time that the error was recorded."

>   Some filesystems (e.g. NFS)
> +keep close track of which data came through which file descriptor,
> +and give more precise reporting.  Other filesystems (e.g. most local
> +filesystems) will report errors to all file descriptors on the same
> +file.
>  .TP
>  .BR EROFS ", " EINVAL
>  .I fd
>  is bound to a special file (e.g., a pipe, FIFO, or socket)
>  which does not support synchronization.
> +.TP
> +.BR ENOSPC ", " EDQUOT
> +.I fd
> +is bound to a file on NFS or another filesystem which does not allocate
> +space at the time of a
> +.BR write (2)
> +system call, and some previous write failed due to insufficient
> +storage space.
>  .SH CONFORMING TO
>  POSIX.1-2001, POSIX.1-2008, 4.3BSD.
>  .SH AVAILABILITY
> diff --git a/man2/write.2 b/man2/write.2
> index 6a39b5b5541d..1a9a86b03b04 100644
> --- a/man2/write.2
> +++ b/man2/write.2
> @@ -47,7 +47,7 @@ write \- write to a file descriptor
>  .BR write ()
>  writes up to
>  .I count
> -bytes from the buffer pointed
> +bytes from the buffer starting at
>  .I buf
>  to the file referred to by the file descriptor
>  .IR fd .
> @@ -181,6 +181,14 @@ or the file offset is not suitably aligned.
>  .TP
>  .B EIO
>  A low-level I/O error occurred while modifying the inode.
> +This error may relate to data written by an earlier
> +.BR write (2),
> +which may have been issued to a different file descriptor on
> +the same file.  Since Linux 4.13 errors from write-back will
> +be reported to all file descriptors that might have
> +written the data which triggered the error, and which are still
> +open.


This is where things get a little more vague.

Some filesystems will return errors on a subsequent write(2) when
previous writeback has failed -- some don't. In either case though,
write(2) should never advance your errseq_t cursor, so only an fsync
will "clear" an earlier error.

I'm not sure how best to convey that in the manpages though.

> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
>  .TP
>  .B ENOSPC
>  The device containing the file referred to by
> @@ -222,8 +230,14 @@ unsigned and signed integer data types specified by POSIX.1.
>  A successful return from
>  .BR write ()
>  does not make any guarantee that data has been committed to disk.
> -In fact, on some buggy implementations, it does not even guarantee
> -that space has successfully been reserved for the data.
> +On some filesystems, including NFS, it does not even guarantee
> +that space has successfully been reserved for the data.  In the case,
> +some errors might be delayed to a future
> +.BR write (2)
> +or to
> +.BR fsync (2)
> +or even
> +.BR close (2).
>  The only way to be sure is to call
>  .BR fsync (2)
>  after you are done writing all your data.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes
  2017-09-14 10:48                               ` Jeff Layton
@ 2017-09-15  7:50                                 ` Michael Kerrisk (man-pages)
  2017-09-15  8:25                                   ` NeilBrown
  2017-09-28  3:01                                 ` NeilBrown
  1 sibling, 1 reply; 24+ messages in thread
From: Michael Kerrisk (man-pages) @ 2017-09-15  7:50 UTC (permalink / raw)
  To: Jeff Layton, NeilBrown, Trond Myklebust, anna.schumaker, jlayton
  Cc: mtk.manpages, linux-man, linux-nfs, linux-fsdevel

Hi Neil,

Will you revise this patch to incorporate Jeff's comments, or
should I try manually editing them in. (I'd prefer the former.)

Cheers,

Michael


On 09/14/2017 12:48 PM, Jeff Layton wrote:
> On Thu, 2017-09-14 at 09:50 +1000, NeilBrown wrote:
>> Since 4.13, errors from writeback are more reliably reported
>> to all file descriptors that might be relevant.
>>
>> Add notes to this effect, and also add details about ENOSPC and EDQUOT
>> which can be delayed in a similar manner to EIO - for NFS in particular.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>
>> This is my summary of recent changes, and details that have been made
>> clear during the exploration of those changes.
>>
>> I haven't mentioned the fact that EPERM can be returned by
>> write/fsync/close on NFS if the permissions on the server are changed.
>> We probably should ... are there other errors that are worth mentioning
>> along with EPERM, ENOSPC, EDQUOT ??
>>
>> Thanks,
>> NeilBronw
>>
> 
> Many thanks for doing this! It was on my to-do list. Comments below:
> 
>>
>>  man2/close.2 |  9 +++++++++
>>  man2/fsync.2 | 19 ++++++++++++++++++-
>>  man2/write.2 | 20 +++++++++++++++++---
>>  3 files changed, 44 insertions(+), 4 deletions(-)
>>
>> diff --git a/man2/close.2 b/man2/close.2
>> index 751ec322b1f1..9776c839b8b6 100644
>> --- a/man2/close.2
>> +++ b/man2/close.2
>> @@ -82,6 +82,15 @@ call was interrupted by a signal; see
>>  .TP
>>  .B EIO
>>  An I/O error occurred.
>> +.TP
>> +.BR ENOSPC ", " EDQUOT
>> +On NFS, these errors are not normally reported against the first write
>> +which exceeds the available storage space, but instead against a
>> +subsequent
>> +.BR write (2),
>> +.BR fsync (2),
>> +or
>> +.BR close (2).
>>  .PP
>>  See NOTES for a discussion of why
>>  .BR close ()
>> diff --git a/man2/fsync.2 b/man2/fsync.2
>> index f1a01301da0f..e706a08d360d 100644
>> --- a/man2/fsync.2
>> +++ b/man2/fsync.2
>> @@ -120,12 +120,29 @@ is set appropriately.
>>  is not a valid open file descriptor.
>>  .TP
>>  .B EIO
>> -An error occurred during synchronization.
>> +An error occurred during synchronization.  This error may relate
>> +to data written to some other file descriptor on the same file.
>> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
>> +Since Linux 4.13 errors from write-back will be reported to
>> +all file descriptors that might have written the data which triggered
>> +the error, and which are still open.
> 
> This is a little awkward. How could we report to a fd that was no longer
> open? How about:
> 
> "Since Linux 4.13, errors from write-back will be reported to all file
> descriptors that were open at the time that the error was recorded."
> 
>>   Some filesystems (e.g. NFS)
>> +keep close track of which data came through which file descriptor,
>> +and give more precise reporting.  Other filesystems (e.g. most local
>> +filesystems) will report errors to all file descriptors on the same
>> +file.
>>  .TP
>>  .BR EROFS ", " EINVAL
>>  .I fd
>>  is bound to a special file (e.g., a pipe, FIFO, or socket)
>>  which does not support synchronization.
>> +.TP
>> +.BR ENOSPC ", " EDQUOT
>> +.I fd
>> +is bound to a file on NFS or another filesystem which does not allocate
>> +space at the time of a
>> +.BR write (2)
>> +system call, and some previous write failed due to insufficient
>> +storage space.
>>  .SH CONFORMING TO
>>  POSIX.1-2001, POSIX.1-2008, 4.3BSD.
>>  .SH AVAILABILITY
>> diff --git a/man2/write.2 b/man2/write.2
>> index 6a39b5b5541d..1a9a86b03b04 100644
>> --- a/man2/write.2
>> +++ b/man2/write.2
>> @@ -47,7 +47,7 @@ write \- write to a file descriptor
>>  .BR write ()
>>  writes up to
>>  .I count
>> -bytes from the buffer pointed
>> +bytes from the buffer starting at
>>  .I buf
>>  to the file referred to by the file descriptor
>>  .IR fd .
>> @@ -181,6 +181,14 @@ or the file offset is not suitably aligned.
>>  .TP
>>  .B EIO
>>  A low-level I/O error occurred while modifying the inode.
>> +This error may relate to data written by an earlier
>> +.BR write (2),
>> +which may have been issued to a different file descriptor on
>> +the same file.  Since Linux 4.13 errors from write-back will
>> +be reported to all file descriptors that might have
>> +written the data which triggered the error, and which are still
>> +open.
> 
> 
> This is where things get a little more vague.
> 
> Some filesystems will return errors on a subsequent write(2) when
> previous writeback has failed -- some don't. In either case though,
> write(2) should never advance your errseq_t cursor, so only an fsync
> will "clear" an earlier error.
> 
> I'm not sure how best to convey that in the manpages though.
> 
>> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
>>  .TP
>>  .B ENOSPC
>>  The device containing the file referred to by
>> @@ -222,8 +230,14 @@ unsigned and signed integer data types specified by POSIX.1.
>>  A successful return from
>>  .BR write ()
>>  does not make any guarantee that data has been committed to disk.
>> -In fact, on some buggy implementations, it does not even guarantee
>> -that space has successfully been reserved for the data.
>> +On some filesystems, including NFS, it does not even guarantee
>> +that space has successfully been reserved for the data.  In the case,
>> +some errors might be delayed to a future
>> +.BR write (2)
>> +or to
>> +.BR fsync (2)
>> +or even
>> +.BR close (2).
>>  The only way to be sure is to call
>>  .BR fsync (2)
>>  after you are done writing all your data.
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes
  2017-09-15  7:50                                 ` Michael Kerrisk (man-pages)
@ 2017-09-15  8:25                                   ` NeilBrown
  0 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2017-09-15  8:25 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages),
	Jeff Layton, Trond Myklebust, anna.schumaker, jlayton
  Cc: mtk.manpages, linux-man, linux-nfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 6201 bytes --]


I'll do something, but maybe not for a few days (24 hour sardine
impersonation pending - Europe, here I come...)

NeilBrown

On Fri, Sep 15 2017, Michael Kerrisk (man-pages) wrote:

> Hi Neil,
>
> Will you revise this patch to incorporate Jeff's comments, or
> should I try manually editing them in. (I'd prefer the former.)
>
> Cheers,
>
> Michael
>
>
> On 09/14/2017 12:48 PM, Jeff Layton wrote:
>> On Thu, 2017-09-14 at 09:50 +1000, NeilBrown wrote:
>>> Since 4.13, errors from writeback are more reliably reported
>>> to all file descriptors that might be relevant.
>>>
>>> Add notes to this effect, and also add details about ENOSPC and EDQUOT
>>> which can be delayed in a similar manner to EIO - for NFS in particular.
>>>
>>> Signed-off-by: NeilBrown <neilb@suse.com>
>>> ---
>>>
>>> This is my summary of recent changes, and details that have been made
>>> clear during the exploration of those changes.
>>>
>>> I haven't mentioned the fact that EPERM can be returned by
>>> write/fsync/close on NFS if the permissions on the server are changed.
>>> We probably should ... are there other errors that are worth mentioning
>>> along with EPERM, ENOSPC, EDQUOT ??
>>>
>>> Thanks,
>>> NeilBronw
>>>
>> 
>> Many thanks for doing this! It was on my to-do list. Comments below:
>> 
>>>
>>>  man2/close.2 |  9 +++++++++
>>>  man2/fsync.2 | 19 ++++++++++++++++++-
>>>  man2/write.2 | 20 +++++++++++++++++---
>>>  3 files changed, 44 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/man2/close.2 b/man2/close.2
>>> index 751ec322b1f1..9776c839b8b6 100644
>>> --- a/man2/close.2
>>> +++ b/man2/close.2
>>> @@ -82,6 +82,15 @@ call was interrupted by a signal; see
>>>  .TP
>>>  .B EIO
>>>  An I/O error occurred.
>>> +.TP
>>> +.BR ENOSPC ", " EDQUOT
>>> +On NFS, these errors are not normally reported against the first write
>>> +which exceeds the available storage space, but instead against a
>>> +subsequent
>>> +.BR write (2),
>>> +.BR fsync (2),
>>> +or
>>> +.BR close (2).
>>>  .PP
>>>  See NOTES for a discussion of why
>>>  .BR close ()
>>> diff --git a/man2/fsync.2 b/man2/fsync.2
>>> index f1a01301da0f..e706a08d360d 100644
>>> --- a/man2/fsync.2
>>> +++ b/man2/fsync.2
>>> @@ -120,12 +120,29 @@ is set appropriately.
>>>  is not a valid open file descriptor.
>>>  .TP
>>>  .B EIO
>>> -An error occurred during synchronization.
>>> +An error occurred during synchronization.  This error may relate
>>> +to data written to some other file descriptor on the same file.
>>> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
>>> +Since Linux 4.13 errors from write-back will be reported to
>>> +all file descriptors that might have written the data which triggered
>>> +the error, and which are still open.
>> 
>> This is a little awkward. How could we report to a fd that was no longer
>> open? How about:
>> 
>> "Since Linux 4.13, errors from write-back will be reported to all file
>> descriptors that were open at the time that the error was recorded."
>> 
>>>   Some filesystems (e.g. NFS)
>>> +keep close track of which data came through which file descriptor,
>>> +and give more precise reporting.  Other filesystems (e.g. most local
>>> +filesystems) will report errors to all file descriptors on the same
>>> +file.
>>>  .TP
>>>  .BR EROFS ", " EINVAL
>>>  .I fd
>>>  is bound to a special file (e.g., a pipe, FIFO, or socket)
>>>  which does not support synchronization.
>>> +.TP
>>> +.BR ENOSPC ", " EDQUOT
>>> +.I fd
>>> +is bound to a file on NFS or another filesystem which does not allocate
>>> +space at the time of a
>>> +.BR write (2)
>>> +system call, and some previous write failed due to insufficient
>>> +storage space.
>>>  .SH CONFORMING TO
>>>  POSIX.1-2001, POSIX.1-2008, 4.3BSD.
>>>  .SH AVAILABILITY
>>> diff --git a/man2/write.2 b/man2/write.2
>>> index 6a39b5b5541d..1a9a86b03b04 100644
>>> --- a/man2/write.2
>>> +++ b/man2/write.2
>>> @@ -47,7 +47,7 @@ write \- write to a file descriptor
>>>  .BR write ()
>>>  writes up to
>>>  .I count
>>> -bytes from the buffer pointed
>>> +bytes from the buffer starting at
>>>  .I buf
>>>  to the file referred to by the file descriptor
>>>  .IR fd .
>>> @@ -181,6 +181,14 @@ or the file offset is not suitably aligned.
>>>  .TP
>>>  .B EIO
>>>  A low-level I/O error occurred while modifying the inode.
>>> +This error may relate to data written by an earlier
>>> +.BR write (2),
>>> +which may have been issued to a different file descriptor on
>>> +the same file.  Since Linux 4.13 errors from write-back will
>>> +be reported to all file descriptors that might have
>>> +written the data which triggered the error, and which are still
>>> +open.
>> 
>> 
>> This is where things get a little more vague.
>> 
>> Some filesystems will return errors on a subsequent write(2) when
>> previous writeback has failed -- some don't. In either case though,
>> write(2) should never advance your errseq_t cursor, so only an fsync
>> will "clear" an earlier error.
>> 
>> I'm not sure how best to convey that in the manpages though.
>> 
>>> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
>>>  .TP
>>>  .B ENOSPC
>>>  The device containing the file referred to by
>>> @@ -222,8 +230,14 @@ unsigned and signed integer data types specified by POSIX.1.
>>>  A successful return from
>>>  .BR write ()
>>>  does not make any guarantee that data has been committed to disk.
>>> -In fact, on some buggy implementations, it does not even guarantee
>>> -that space has successfully been reserved for the data.
>>> +On some filesystems, including NFS, it does not even guarantee
>>> +that space has successfully been reserved for the data.  In the case,
>>> +some errors might be delayed to a future
>>> +.BR write (2)
>>> +or to
>>> +.BR fsync (2)
>>> +or even
>>> +.BR close (2).
>>>  The only way to be sure is to call
>>>  .BR fsync (2)
>>>  after you are done writing all your data.
>> 
>
>
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes
  2017-09-14 10:48                               ` Jeff Layton
  2017-09-15  7:50                                 ` Michael Kerrisk (man-pages)
@ 2017-09-28  3:01                                 ` NeilBrown
  2017-09-28 12:20                                   ` Jeff Layton
  2017-09-28 16:19                                   ` Michael Kerrisk (man-opages)
  1 sibling, 2 replies; 24+ messages in thread
From: NeilBrown @ 2017-09-28  3:01 UTC (permalink / raw)
  To: Jeff Layton, Michael Kerrisk (man-pages),
	Trond Myklebust, anna.schumaker, jlayton
  Cc: linux-man, linux-nfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 7147 bytes --]

On Thu, Sep 14 2017, Jeff Layton wrote:

>>  .TP
>>  .B EIO
>> -An error occurred during synchronization.
>> +An error occurred during synchronization.  This error may relate
>> +to data written to some other file descriptor on the same file.
>> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
>> +Since Linux 4.13 errors from write-back will be reported to
>> +all file descriptors that might have written the data which triggered
>> +the error, and which are still open.
>
> This is a little awkward. How could we report to a fd that was no longer
> open? How about:
>
> "Since Linux 4.13, errors from write-back will be reported to all file
> descriptors that were open at the time that the error was recorded."

That might be simpler, but it is less correct.  As I go on to say, NFS
*doesn't* report on all file descriptors that were open at that time.

I've changed it to

-------------------
Since Linux 4.13, errors from write-back will be reported to
all file descriptors that might have written the data which triggered
the error.  Some filesystems (e.g. NFS) keep close track of which data
came through which file descriptor, and give precise reporting.
Other filesystems (e.g. most local filesystems) will report errors to
all file descriptors that where open on the file when the error was recorded.
------------------

which includes some of your text, and removes the "that are still open"
which probably doesn't help.

>>  .TP
>>  .B EIO
>>  A low-level I/O error occurred while modifying the inode.
>> +This error may relate to data written by an earlier
>> +.BR write (2),
>> +which may have been issued to a different file descriptor on
>> +the same file.  Since Linux 4.13 errors from write-back will
>> +be reported to all file descriptors that might have
>> +written the data which triggered the error, and which are still
>> +open.
>
>
> This is where things get a little more vague.
>
> Some filesystems will return errors on a subsequent write(2) when
> previous writeback has failed -- some don't. In either case though,
> write(2) should never advance your errseq_t cursor, so only an fsync
> will "clear" an earlier error.
>
> I'm not sure how best to convey that in the manpages though.

How about:

-------------
This error may relate to the write-back of data written by an
earlier
.BR write (2),
which may have been issued to a different file descriptor on
the same file.  Since Linux 4.13, errors from write-back come
with a promise that they
.I may
be reported by subsequent.
.BR write (2)
requests, and
.I will
be reported by a subsequent
.BR fsync (2)
(whether or not they were also reported by
.BR write (2)).
------------
??

Those changes are included in the following.

Thanks,
NeilBrown

From: NeilBrown <neilb@suse.com>
Date: Thu, 14 Sep 2017 09:44:43 +1000
Subject: [PATCH] write.2, fsync.2, close.2: update description of error codes

Since 4.13, errors from writeback are more reliably reported
to all file descriptors that might be relevant.

Add notes to this effect, and also add detail about ENOSPC and EDQUOT
which can be delayed in a similar many to EIO - for NFS in particular.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 man2/close.2 |  9 +++++++++
 man2/fsync.2 | 18 +++++++++++++++++-
 man2/write.2 | 28 +++++++++++++++++++++++++---
 3 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/man2/close.2 b/man2/close.2
index 55d89ed3dbc7..136bd0be3f67 100644
--- a/man2/close.2
+++ b/man2/close.2
@@ -82,6 +82,15 @@ call was interrupted by a signal; see
 .TP
 .B EIO
 An I/O error occurred.
+.TP
+.BR ENOSPC ", " EDQUOT
+On NFS, these errors are not normally reported against the first write
+which exceeds the available storage space, but instead against a
+subsequent
+.BR write (2),
+.BR fsync (2),
+or
+.BR close (2).
 .PP
 See NOTES for a discussion of why
 .BR close ()
diff --git a/man2/fsync.2 b/man2/fsync.2
index eed3c460bea9..c7878bf3496f 100644
--- a/man2/fsync.2
+++ b/man2/fsync.2
@@ -121,7 +121,15 @@ is set appropriately.
 is not a valid open file descriptor.
 .TP
 .B EIO
-An error occurred during synchronization.
+An error occurred during synchronization.  This error may relate
+to data written to some other file descriptor on the same file.
+.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
+Since Linux 4.13, errors from write-back will be reported to
+all file descriptors that might have written the data which triggered
+the error.  Some filesystems (e.g. NFS) keep close track of which data
+came through which file descriptor, and give more precise reporting.
+Other filesystems (e.g. most local filesystems) will report errors to
+all file descriptors that where open on the file when the error was recorded.
 .TP
 .B ENOSPC
 Disk space was exhausted while synchronizing.
@@ -130,6 +138,14 @@ Disk space was exhausted while synchronizing.
 .I fd
 is bound to a special file (e.g., a pipe, FIFO, or socket)
 which does not support synchronization.
+.TP
+.BR ENOSPC ", " EDQUOT
+.I fd
+is bound to a file on NFS or another filesystem which does not allocate
+space at the time of a
+.BR write (2)
+system call, and some previous write failed due to insufficient
+storage space.
 .SH CONFORMING TO
 POSIX.1-2001, POSIX.1-2008, 4.3BSD.
 .SH AVAILABILITY
diff --git a/man2/write.2 b/man2/write.2
index 061aa70cf590..b1cc3a2cfb17 100644
--- a/man2/write.2
+++ b/man2/write.2
@@ -47,7 +47,7 @@ write \- write to a file descriptor
 .BR write ()
 writes up to
 .I count
-bytes from the buffer pointed
+bytes from the buffer starting at
 .I buf
 to the file referred to by the file descriptor
 .IR fd .
@@ -181,6 +181,22 @@ or the file offset is not suitably aligned.
 .TP
 .B EIO
 A low-level I/O error occurred while modifying the inode.
+This error may relate to the write-back of data written by an
+earlier
+.BR write (2),
+which may have been issued to a different file descriptor on
+the same file.  Since Linux 4.13, errors from write-back come
+with a promise that they
+.I may
+be reported by subsequent.
+.BR write (2)
+requests, and
+.I will
+be reported by a subsequent
+.BR fsync (2)
+(whether or not they were also reported by
+.BR write (2)).
+.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
 .TP
 .B ENOSPC
 The device containing the file referred to by
@@ -222,8 +238,14 @@ unsigned and signed integer data types specified by POSIX.1.
 A successful return from
 .BR write ()
 does not make any guarantee that data has been committed to disk.
-In fact, on some buggy implementations, it does not even guarantee
-that space has successfully been reserved for the data.
+On some filesystems, including NFS, it does not even guarantee
+that space has successfully been reserved for the data.  In the case,
+some errors might be delayed to a future
+.BR write (2)
+or to
+.BR fsync (2)
+or even
+.BR close (2).
 The only way to be sure is to call
 .BR fsync (2)
 after you are done writing all your data.
-- 
2.14.0.rc0.dirty


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes
  2017-09-28  3:01                                 ` NeilBrown
@ 2017-09-28 12:20                                   ` Jeff Layton
  2017-09-28 16:19                                   ` Michael Kerrisk (man-opages)
  1 sibling, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2017-09-28 12:20 UTC (permalink / raw)
  To: NeilBrown, Michael Kerrisk (man-pages),
	Trond Myklebust, anna.schumaker, jlayton
  Cc: linux-man, linux-nfs, linux-fsdevel

On Thu, 2017-09-28 at 13:01 +1000, NeilBrown wrote:
> On Thu, Sep 14 2017, Jeff Layton wrote:
> 
> > >  .TP
> > >  .B EIO
> > > -An error occurred during synchronization.
> > > +An error occurred during synchronization.  This error may relate
> > > +to data written to some other file descriptor on the same file.
> > > +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
> > > +Since Linux 4.13 errors from write-back will be reported to
> > > +all file descriptors that might have written the data which
> > > triggered
> > > +the error, and which are still open.
> > 
> > This is a little awkward. How could we report to a fd that was no
> > longer
> > open? How about:
> > 
> > "Since Linux 4.13, errors from write-back will be reported to all
> > file
> > descriptors that were open at the time that the error was
> > recorded."
> 
> That might be simpler, but it is less correct.  As I go on to say,
> NFS
> *doesn't* report on all file descriptors that were open at that time.
> 
> I've changed it to
> 
> -------------------
> Since Linux 4.13, errors from write-back will be reported to
> all file descriptors that might have written the data which triggered
> the error.  Some filesystems (e.g. NFS) keep close track of which
> data
> came through which file descriptor, and give precise reporting.
> Other filesystems (e.g. most local filesystems) will report errors to
> all file descriptors that where open on the file when the error was
> recorded.
> ------------------
> 
> which includes some of your text, and removes the "that are still
> open"
> which probably doesn't help.
> 
> > >  .TP
> > >  .B EIO
> > >  A low-level I/O error occurred while modifying the inode.
> > > +This error may relate to data written by an earlier
> > > +.BR write (2),
> > > +which may have been issued to a different file descriptor on
> > > +the same file.  Since Linux 4.13 errors from write-back will
> > > +be reported to all file descriptors that might have
> > > +written the data which triggered the error, and which are still
> > > +open.
> > 
> > 
> > This is where things get a little more vague.
> > 
> > Some filesystems will return errors on a subsequent write(2) when
> > previous writeback has failed -- some don't. In either case though,
> > write(2) should never advance your errseq_t cursor, so only an
> > fsync
> > will "clear" an earlier error.
> > 
> > I'm not sure how best to convey that in the manpages though.
> 
> How about:
> 
> -------------
> This error may relate to the write-back of data written by an
> earlier
> .BR write (2),
> which may have been issued to a different file descriptor on
> the same file.  Since Linux 4.13, errors from write-back come
> with a promise that they
> .I may
> be reported by subsequent.
> .BR write (2)
> requests, and
> .I will
> be reported by a subsequent
> .BR fsync (2)
> (whether or not they were also reported by
> .BR write (2)).
> ------------
> ??
> 
> Those changes are included in the following.
> 
> Thanks,
> NeilBrown
> 
> From: NeilBrown <neilb@suse.com>
> Date: Thu, 14 Sep 2017 09:44:43 +1000
> Subject: [PATCH] write.2, fsync.2, close.2: update description of
> error codes
> 
> Since 4.13, errors from writeback are more reliably reported
> to all file descriptors that might be relevant.
> 
> Add notes to this effect, and also add detail about ENOSPC and EDQUOT
> which can be delayed in a similar many to EIO - for NFS in
> particular.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  man2/close.2 |  9 +++++++++
>  man2/fsync.2 | 18 +++++++++++++++++-
>  man2/write.2 | 28 +++++++++++++++++++++++++---
>  3 files changed, 51 insertions(+), 4 deletions(-)
> 
> diff --git a/man2/close.2 b/man2/close.2
> index 55d89ed3dbc7..136bd0be3f67 100644
> --- a/man2/close.2
> +++ b/man2/close.2
> @@ -82,6 +82,15 @@ call was interrupted by a signal; see
>  .TP
>  .B EIO
>  An I/O error occurred.
> +.TP
> +.BR ENOSPC ", " EDQUOT
> +On NFS, these errors are not normally reported against the first
> write
> +which exceeds the available storage space, but instead against a
> +subsequent
> +.BR write (2),
> +.BR fsync (2),
> +or
> +.BR close (2).
>  .PP
>  See NOTES for a discussion of why
>  .BR close ()
> diff --git a/man2/fsync.2 b/man2/fsync.2
> index eed3c460bea9..c7878bf3496f 100644
> --- a/man2/fsync.2
> +++ b/man2/fsync.2
> @@ -121,7 +121,15 @@ is set appropriately.
>  is not a valid open file descriptor.
>  .TP
>  .B EIO
> -An error occurred during synchronization.
> +An error occurred during synchronization.  This error may relate
> +to data written to some other file descriptor on the same file.
> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
> +Since Linux 4.13, errors from write-back will be reported to
> +all file descriptors that might have written the data which
> triggered
> +the error.  Some filesystems (e.g. NFS) keep close track of which
> data
> +came through which file descriptor, and give more precise reporting.
> +Other filesystems (e.g. most local filesystems) will report errors
> to
> +all file descriptors that where open on the file when the error was
> recorded.
>  .TP
>  .B ENOSPC
>  Disk space was exhausted while synchronizing.
> @@ -130,6 +138,14 @@ Disk space was exhausted while synchronizing.
>  .I fd
>  is bound to a special file (e.g., a pipe, FIFO, or socket)
>  which does not support synchronization.
> +.TP
> +.BR ENOSPC ", " EDQUOT
> +.I fd
> +is bound to a file on NFS or another filesystem which does not
> allocate
> +space at the time of a
> +.BR write (2)
> +system call, and some previous write failed due to insufficient
> +storage space.
>  .SH CONFORMING TO
>  POSIX.1-2001, POSIX.1-2008, 4.3BSD.
>  .SH AVAILABILITY
> diff --git a/man2/write.2 b/man2/write.2
> index 061aa70cf590..b1cc3a2cfb17 100644
> --- a/man2/write.2
> +++ b/man2/write.2
> @@ -47,7 +47,7 @@ write \- write to a file descriptor
>  .BR write ()
>  writes up to
>  .I count
> -bytes from the buffer pointed
> +bytes from the buffer starting at
>  .I buf
>  to the file referred to by the file descriptor
>  .IR fd .
> @@ -181,6 +181,22 @@ or the file offset is not suitably aligned.
>  .TP
>  .B EIO
>  A low-level I/O error occurred while modifying the inode.
> +This error may relate to the write-back of data written by an
> +earlier
> +.BR write (2),
> +which may have been issued to a different file descriptor on
> +the same file.  Since Linux 4.13, errors from write-back come
> +with a promise that they
> +.I may
> +be reported by subsequent.
> +.BR write (2)
> +requests, and
> +.I will
> +be reported by a subsequent
> +.BR fsync (2)
> +(whether or not they were also reported by
> +.BR write (2)).
> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
>  .TP
>  .B ENOSPC
>  The device containing the file referred to by
> @@ -222,8 +238,14 @@ unsigned and signed integer data types specified
> by POSIX.1.
>  A successful return from
>  .BR write ()
>  does not make any guarantee that data has been committed to disk.
> -In fact, on some buggy implementations, it does not even guarantee
> -that space has successfully been reserved for the data.
> +On some filesystems, including NFS, it does not even guarantee
> +that space has successfully been reserved for the data.  In the
> case,
> +some errors might be delayed to a future
> +.BR write (2)
> +or to
> +.BR fsync (2)
> +or even
> +.BR close (2).
>  The only way to be sure is to call
>  .BR fsync (2)
>  after you are done writing all your data.

Looks good to me!

Reviewed-by: Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes
  2017-09-28  3:01                                 ` NeilBrown
  2017-09-28 12:20                                   ` Jeff Layton
@ 2017-09-28 16:19                                   ` Michael Kerrisk (man-opages)
  1 sibling, 0 replies; 24+ messages in thread
From: Michael Kerrisk (man-opages) @ 2017-09-28 16:19 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, anna.schumaker, jlayton
  Cc: mtk.manpages, linux-man, linux-nfs, linux-fsdevel

Hi Neil,

On 09/28/2017 05:01 AM, NeilBrown wrote:
> On Thu, Sep 14 2017, Jeff Layton wrote:
> 
>>>   .TP
>>>   .B EIO
>>> -An error occurred during synchronization.
>>> +An error occurred during synchronization.  This error may relate
>>> +to data written to some other file descriptor on the same file.
>>> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
>>> +Since Linux 4.13 errors from write-back will be reported to
>>> +all file descriptors that might have written the data which triggered
>>> +the error, and which are still open.
>>
>> This is a little awkward. How could we report to a fd that was no longer
>> open? How about:
>>
>> "Since Linux 4.13, errors from write-back will be reported to all file
>> descriptors that were open at the time that the error was recorded."
> 
> That might be simpler, but it is less correct.  As I go on to say, NFS
> *doesn't* report on all file descriptors that were open at that time.
> 
> I've changed it to
> 
> -------------------
> Since Linux 4.13, errors from write-back will be reported to
> all file descriptors that might have written the data which triggered
> the error.  Some filesystems (e.g. NFS) keep close track of which data
> came through which file descriptor, and give precise reporting.
> Other filesystems (e.g. most local filesystems) will report errors to
> all file descriptors that where open on the file when the error was recorded.
> ------------------
> 
> which includes some of your text, and removes the "that are still open"
> which probably doesn't help.
> 
>>>   .TP
>>>   .B EIO
>>>   A low-level I/O error occurred while modifying the inode.
>>> +This error may relate to data written by an earlier
>>> +.BR write (2),
>>> +which may have been issued to a different file descriptor on
>>> +the same file.  Since Linux 4.13 errors from write-back will
>>> +be reported to all file descriptors that might have
>>> +written the data which triggered the error, and which are still
>>> +open.
>>
>>
>> This is where things get a little more vague.
>>
>> Some filesystems will return errors on a subsequent write(2) when
>> previous writeback has failed -- some don't. In either case though,
>> write(2) should never advance your errseq_t cursor, so only an fsync
>> will "clear" an earlier error.
>>
>> I'm not sure how best to convey that in the manpages though.
> 
> How about:
> 
> -------------
> This error may relate to the write-back of data written by an
> earlier
> .BR write (2),
> which may have been issued to a different file descriptor on
> the same file.  Since Linux 4.13, errors from write-back come
> with a promise that they
> .I may
> be reported by subsequent.
> .BR write (2)
> requests, and
> .I will
> be reported by a subsequent
> .BR fsync (2)
> (whether or not they were also reported by
> .BR write (2)).
> ------------
> ??
> 
> Those changes are included in the following.
> 
> Thanks,
> NeilBrown
> 
> From: NeilBrown <neilb@suse.com>
> Date: Thu, 14 Sep 2017 09:44:43 +1000
> Subject: [PATCH] write.2, fsync.2, close.2: update description of error codes
> 
> Since 4.13, errors from writeback are more reliably reported
> to all file descriptors that might be relevant.
> 
> Add notes to this effect, and also add detail about ENOSPC and EDQUOT
> which can be delayed in a similar many to EIO - for NFS in particular.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

Thanks! I've applied, and added Jeff's Reviewed-by.

Cheers,

Michael


> ---
>   man2/close.2 |  9 +++++++++
>   man2/fsync.2 | 18 +++++++++++++++++-
>   man2/write.2 | 28 +++++++++++++++++++++++++---
>   3 files changed, 51 insertions(+), 4 deletions(-)
> 
> diff --git a/man2/close.2 b/man2/close.2
> index 55d89ed3dbc7..136bd0be3f67 100644
> --- a/man2/close.2
> +++ b/man2/close.2
> @@ -82,6 +82,15 @@ call was interrupted by a signal; see
>   .TP
>   .B EIO
>   An I/O error occurred.
> +.TP
> +.BR ENOSPC ", " EDQUOT
> +On NFS, these errors are not normally reported against the first write
> +which exceeds the available storage space, but instead against a
> +subsequent
> +.BR write (2),
> +.BR fsync (2),
> +or
> +.BR close (2).
>   .PP
>   See NOTES for a discussion of why
>   .BR close ()
> diff --git a/man2/fsync.2 b/man2/fsync.2
> index eed3c460bea9..c7878bf3496f 100644
> --- a/man2/fsync.2
> +++ b/man2/fsync.2
> @@ -121,7 +121,15 @@ is set appropriately.
>   is not a valid open file descriptor.
>   .TP
>   .B EIO
> -An error occurred during synchronization.
> +An error occurred during synchronization.  This error may relate
> +to data written to some other file descriptor on the same file.
> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
> +Since Linux 4.13, errors from write-back will be reported to
> +all file descriptors that might have written the data which triggered
> +the error.  Some filesystems (e.g. NFS) keep close track of which data
> +came through which file descriptor, and give more precise reporting.
> +Other filesystems (e.g. most local filesystems) will report errors to
> +all file descriptors that where open on the file when the error was recorded.
>   .TP
>   .B ENOSPC
>   Disk space was exhausted while synchronizing.
> @@ -130,6 +138,14 @@ Disk space was exhausted while synchronizing.
>   .I fd
>   is bound to a special file (e.g., a pipe, FIFO, or socket)
>   which does not support synchronization.
> +.TP
> +.BR ENOSPC ", " EDQUOT
> +.I fd
> +is bound to a file on NFS or another filesystem which does not allocate
> +space at the time of a
> +.BR write (2)
> +system call, and some previous write failed due to insufficient
> +storage space.
>   .SH CONFORMING TO
>   POSIX.1-2001, POSIX.1-2008, 4.3BSD.
>   .SH AVAILABILITY
> diff --git a/man2/write.2 b/man2/write.2
> index 061aa70cf590..b1cc3a2cfb17 100644
> --- a/man2/write.2
> +++ b/man2/write.2
> @@ -47,7 +47,7 @@ write \- write to a file descriptor
>   .BR write ()
>   writes up to
>   .I count
> -bytes from the buffer pointed
> +bytes from the buffer starting at
>   .I buf
>   to the file referred to by the file descriptor
>   .IR fd .
> @@ -181,6 +181,22 @@ or the file offset is not suitably aligned.
>   .TP
>   .B EIO
>   A low-level I/O error occurred while modifying the inode.
> +This error may relate to the write-back of data written by an
> +earlier
> +.BR write (2),
> +which may have been issued to a different file descriptor on
> +the same file.  Since Linux 4.13, errors from write-back come
> +with a promise that they
> +.I may
> +be reported by subsequent.
> +.BR write (2)
> +requests, and
> +.I will
> +be reported by a subsequent
> +.BR fsync (2)
> +(whether or not they were also reported by
> +.BR write (2)).
> +.\" commit 088737f44bbf6378745f5b57b035e57ee3dc4750
>   .TP
>   .B ENOSPC
>   The device containing the file referred to by
> @@ -222,8 +238,14 @@ unsigned and signed integer data types specified by POSIX.1.
>   A successful return from
>   .BR write ()
>   does not make any guarantee that data has been committed to disk.
> -In fact, on some buggy implementations, it does not even guarantee
> -that space has successfully been reserved for the data.
> +On some filesystems, including NFS, it does not even guarantee
> +that space has successfully been reserved for the data.  In the case,
> +some errors might be delayed to a future
> +.BR write (2)
> +or to
> +.BR fsync (2)
> +or even
> +.BR close (2).
>   The only way to be sure is to call
>   .BR fsync (2)
>   after you are done writing all your data.
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-09-28 16:19 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-20 19:42 [PATCH] nfs: track writeback errors with errseq_t Jeff Layton
2017-08-25 17:59 ` Jeff Layton
2017-08-27 23:24   ` NeilBrown
2017-08-28 11:47     ` Jeff Layton
2017-08-29  1:23       ` NeilBrown
2017-08-29 10:54         ` Jeff Layton
2017-09-07  3:37           ` NeilBrown
2017-09-07 11:35             ` Jeff Layton
2017-09-07 14:54               ` Trond Myklebust
2017-09-11  3:24                 ` NeilBrown
2017-09-11 10:46                   ` Jeff Layton
2017-09-11 21:52                     ` NeilBrown
2017-09-12 15:20                       ` Jeff Layton
2017-09-12 21:47                         ` NeilBrown
2017-09-13 12:23                           ` Jeff Layton
2017-09-13 23:50                             ` [RFC PATCH manpages] write.2, fsync.2, close.2: update description of error codes NeilBrown
2017-09-14 10:48                               ` Jeff Layton
2017-09-15  7:50                                 ` Michael Kerrisk (man-pages)
2017-09-15  8:25                                   ` NeilBrown
2017-09-28  3:01                                 ` NeilBrown
2017-09-28 12:20                                   ` Jeff Layton
2017-09-28 16:19                                   ` Michael Kerrisk (man-opages)
2017-09-12  2:24                   ` [PATCH] nfs: track writeback errors with errseq_t Trond Myklebust
2017-09-12  5:29                     ` NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).