linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Trond Myklebust <trondmy@hammerspace.com>
To: "neilb@suse.de" <neilb@suse.de>,
	"Anna.Schumaker@Netapp.com" <Anna.Schumaker@Netapp.com>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH/RFC] NFS: Minimize COMMIT calls during writeback.
Date: Mon, 16 Mar 2020 01:22:58 +0000	[thread overview]
Message-ID: <b557a1d43105b42b304a2682ce8e2c31637e1989.camel@hammerspace.com> (raw)
In-Reply-To: <87y2s1rwof.fsf@notabene.neil.brown.name>

On Mon, 2020-03-16 at 11:05 +1100, NeilBrown wrote:
> Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request for
> every
> .writepages() call it received - when the writes submitted for that
> call
> have all completed.
> 
> This works well enough when COMMIT is fast, but if COMMIT is slow
> these
> calls can overlap each other, overload the server, and substantially
> reduce throughput.
> 
> I looked at this due to a report of regression when writing to a
> Ganesha
> NFS server with tracing showing multiple overlapping COMMITs, and
> individual commits typically taking 200ms to 300ms.  This resulted in
> 2
> orders of magnitude slow down when writing 1000x1M from /dev/zero
> with dd.
> This is easily reproduced by adding 'msleep(300)' to nfsd_commit() in
> the
> Linux NFS server.


When there is overlap of writeback to the same file, then it is almost
always because we're hitting memory reclaim on the client. In that case
we want to ensure that memory reclaim completes, which it won't do if
we delay COMMIT.
IOW: this behaviour is very much intentional.

> This patch changes the details of when COMMIT is sent without
> changing
> the general approach.
> 
> Where previously the writes from a single .writepages() call were
> accounted together in a nfs_io_completion, now the writes from
> multiple
> writepages() calls are accounted together.  The set of writepages
> calls
> that are grouped together are all those from when one COMMIT call
> ends
> to when the next COMMIT call ends.  This automatically reduces the
> rate
> of COMMIT requests when COMMIT itself is slow.
> (If there are no COMMIT calls pending, the first .writepages will get
>  an nfs_io_completion to itself, then subsequenct writepages requests
>  until the first COMMIT completes will go in the next
> nfs_io_completion)
> 
> There are typically at most two nfs_io_completion structures
> allocated
> for writeback to a given inode.
> 
> - If there was been any writepages calls since the last time a COMMIT
>   completed, there will be an nfs_io_completion stored in the inode
> (in
>   nfs_mds_commit_info) which new writepages requests are accounted
> it.
> 
> - If there is no pending COMMIT request, but there are pending
> writeback
>   WRITES, there will be another nfs_io_completion which is not
> attached
>   and which is draining.  When it fully drains a COMMIT will be sent.
>   When that COMMIT completes, the attached nfs_io_completion will be
>   detached and allowed to drain.
> 
> The rpcs_out counter now counts the unattached nfs_io_completion as
> well
> as any pending COMMIT requests.  As an unattached nfs_io_completion
> will
> soon turn into a COMMIT request, this seems reasonable.  It allows us
> to
> clearly detect when there are no longer any COMMITs expected to
> complete, so we know to detach any nfs_io_completion from the inode
> and
> allow it to drain.
> 
> As we use atomic accesses (e.g.  xchg and kref_get_unless_zero()) to
> access the attached nfs_io_completion, we now use kfree_rcu() to free
> it, to ensure it is not accessed after it is freed.
> 
> With 300ms commits, this improves throught of a 1G dd by 2 orders of
> magnitude.  Even without the 300ms delay, this noticeably improves
> throughput to a Linux NFS server is a simple VM.
> 
> Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is
> complete")
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfs/inode.c          |  1 +
>  fs/nfs/write.c          | 50 ++++++++++++++++++++++++++++++++++++---
> --
>  include/linux/nfs_xdr.h |  7 ++++++
>  3 files changed, 53 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 11bf15800ac9..c00b54491949 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -2110,6 +2110,7 @@ static void init_once(void *foo)
>  	INIT_LIST_HEAD(&nfsi->commit_info.list);
>  	atomic_long_set(&nfsi->nrequests, 0);
>  	atomic_long_set(&nfsi->commit_info.ncommit, 0);
> +	nfsi->commit_info.ioc = NULL;
>  	atomic_set(&nfsi->commit_info.rpcs_out, 0);
>  	init_rwsem(&nfsi->rmdir_sem);
>  	mutex_init(&nfsi->commit_mutex);
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index c478b772cc49..57e209f964e4 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -47,6 +47,7 @@ struct nfs_io_completion {
>  	void (*complete)(void *data);
>  	void *data;
>  	struct kref refcount;
> +	struct rcu_head rcu;
>  };
>  
>  /*
> @@ -134,7 +135,7 @@ static void nfs_io_completion_release(struct kref
> *kref)
>  	struct nfs_io_completion *ioc = container_of(kref,
>  			struct nfs_io_completion, refcount);
>  	ioc->complete(ioc->data);
> -	kfree(ioc);
> +	kfree_rcu(ioc, rcu);
>  }
>  
>  static void nfs_io_completion_get(struct nfs_io_completion *ioc)
> @@ -720,6 +721,8 @@ static void nfs_io_completion_commit(void *inode)
>  	nfs_commit_inode(inode, 0);
>  }
>  
> +static void nfs_commit_end(struct nfs_mds_commit_info *cinfo);
> +
>  int nfs_writepages(struct address_space *mapping, struct
> writeback_control *wbc)
>  {
>  	struct inode *inode = mapping->host;
> @@ -729,9 +732,37 @@ int nfs_writepages(struct address_space
> *mapping, struct writeback_control *wbc)
>  
>  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
>  
> -	ioc = nfs_io_completion_alloc(GFP_KERNEL);
> -	if (ioc)
> -		nfs_io_completion_init(ioc, nfs_io_completion_commit,
> inode);
> +	rcu_read_lock();
> +	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
> +	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
> +		rcu_read_unlock();
> +		/* We've successfully piggybacked on the attached ioc
> */
> +	} else {
> +		rcu_read_unlock();
> +		ioc = nfs_io_completion_alloc(GFP_KERNEL);
> +		if (ioc) {
> +			struct nfs_io_completion *ioc_prev;
> +
> +			nfs_io_completion_init(ioc,
> nfs_io_completion_commit,
> +					       inode);
> +			/* Temporarily elevate rpcs_out to ensure a
> commit
> +			 * completion *will* happen after we attach
> this ioc,
> +			 * so it *will* get a chance to drain.
> +			 */
> +			atomic_inc(&NFS_I(inode)-
> >commit_info.rpcs_out);
> +			nfs_io_completion_get(ioc);
> +			ioc_prev = xchg(&NFS_I(inode)->commit_info.ioc,
> +				    ioc);
> +			/* ioc_prev is normally NULL, but racing
> writepages
> +			 * calls might result in it being non-NULL.
> +			 * In either case it is safe to put it - worst
> case
> +			 * we get an extra COMMIT.
> +			 */
> +			nfs_io_completion_put(ioc_prev);
> +			/* release temporary ref on rpcs_out */
> +			nfs_commit_end(&NFS_I(inode)->commit_info);

So won't this normally trigger the xchg(NULL) in nfs_commmit_end()? For
most cases, I'd expect commit_info.rpcs_out to be zero until we start
actually sending out the COMMITs.

> +		}
> +	}
>  
>  	nfs_pageio_init_write(&pgio, inode, wb_priority(wbc), false,
>  				&nfs_async_write_completion_ops);
> @@ -1677,8 +1708,17 @@ static void nfs_commit_begin(struct
> nfs_mds_commit_info *cinfo)
>  
>  static void nfs_commit_end(struct nfs_mds_commit_info *cinfo)
>  {
> -	if (atomic_dec_and_test(&cinfo->rpcs_out))
> +	if (atomic_dec_and_test(&cinfo->rpcs_out)) {
> +		/* When a commit finishes, we must release any attached
> +		 * nfs_io_completion so that it can drain and then
> request
> +		 * another commit.
> +		 */
> +		struct nfs_io_completion *ioc;
> +		ioc = xchg(&cinfo->ioc, NULL);
> +		nfs_io_completion_put(ioc);
> +
>  		wake_up_var(&cinfo->rpcs_out);
> +	}
>  }
>  
>  void nfs_commitdata_release(struct nfs_commit_data *data)
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index 94c77ed55ce1..89db1e9d461d 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -1557,9 +1557,16 @@ struct nfs_pgio_header {
>  };
>  
>  struct nfs_mds_commit_info {
> +	/* rpcs_out counts pending COMMIT rpcs plus pendng
> nfs_io_completions
> +	 * which are *not* attached at 'ioc' below.  Such
> nfs_io_compleions
> +	 * (normally at most one) will drain as writes complete and
> then trigger
> +	 * a COMMIT, so they can be considered as pending COMMITs which
> haven't
> +	 * been sent yet
> +	 */
>  	atomic_t rpcs_out;
>  	atomic_long_t		ncommit;
>  	struct list_head	list;
> +	struct nfs_io_completion *ioc;
>  };
>  
>  struct nfs_commit_info;
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



  reply	other threads:[~2020-03-16  1:23 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-16  0:05 [PATCH/RFC] NFS: Minimize COMMIT calls during writeback NeilBrown
2020-03-16  1:22 ` Trond Myklebust [this message]
2020-03-16  3:39   ` NeilBrown
2020-03-16  4:13     ` Trond Myklebust
2020-03-17 10:41       ` NeilBrown
2020-03-17 14:22         ` Trond Myklebust
2020-03-17 23:48           ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b557a1d43105b42b304a2682ce8e2c31637e1989.camel@hammerspace.com \
    --to=trondmy@hammerspace.com \
    --cc=Anna.Schumaker@Netapp.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).