Re: Performance regression with random IO pattern from the client

From: Jan Kara <jack@suse.cz>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: "jack@suse.cz" <jack@suse.cz>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"nfbrown@suse.com" <nfbrown@suse.com>,
	"bfields@redhat.com" <bfields@redhat.com>
Subject: Re: Performance regression with random IO pattern from the client
Date: Thu, 31 Mar 2022 15:09:35 +0200	[thread overview]
Message-ID: <20220331130935.ejqu3kxsjm2tvlly@quack3.lan> (raw)
In-Reply-To: <e5dcbff2ad43304af5039315c00316f2ee5a4e25.camel@hammerspace.com>

On Wed 30-03-22 22:02:39, Trond Myklebust wrote:
> On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
> > 
> > 
> > > On Mar 30, 2022, at 12:19 PM, Jan Kara <jack@suse.cz> wrote:
> > > 
> > > On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
> > > > > On Mar 30, 2022, at 11:03 AM, Trond Myklebust
> > > > > <trondmy@hammerspace.com> wrote:
> > > > > 
> > > > > On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > during our performance testing we have noticed that commit
> > > > > > b6669305d35a
> > > > > > ("nfsd: Reduce the number of calls to nfsd_file_gc()") has
> > > > > > introduced
> > > > > > a
> > > > > > performance regression when a client does random buffered
> > > > > > writes. The
> > > > > > workload on NFS client is fio running 4 processed doing
> > > > > > random
> > > > > > buffered writes to 4
> > > > > > different files and the files are large enough to hit dirty
> > > > > > limits
> > > > > > and
> > > > > > force writeback from the client. In particular the invocation
> > > > > > is
> > > > > > like:
> > > > > > 
> > > > > > fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1
> > > > > > --
> > > > > > invalidate=1 --group_reporting=1 --runtime=300 --
> > > > > > fallocate=posix --
> > > > > > ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
> > > > > > rw=randwrite --size=4000m --numjobs=4 --bs=4k --
> > > > > > filename_format=FioWorkloads.\$jobnum --end_fsync=1
> > > > > > 
> > > > > > The reason why commit b6669305d35a regresses performance is
> > > > > > the
> > > > > > filemap_flush() call it adds into nfsd_file_put(). Before
> > > > > > this commit
> > > > > > writeback on the server happened from nfsd_commit() code
> > > > > > resulting in
> > > > > > rather long semisequential streams of 4k writes. After commit
> > > > > > b6669305d35a
> > > > > > all the writeback happens from filemap_flush() calls
> > > > > > resulting in
> > > > > > much
> > > > > > longer average seek distance (IO from different files is more
> > > > > > interleaved)
> > > > > > and about 16-20% regression in the achieved writeback
> > > > > > throughput when
> > > > > > the
> > > > > > backing store is rotational storage.
> > > > > > 
> > > > > > I think the filemap_flush() from nfsd_file_put() is indeed
> > > > > > rather
> > > > > > aggressive and I think we'd be better off to just leave
> > > > > > writeback to
> > > > > > either
> > > > > > nfsd_commit() or standard dirty page cleaning happening on
> > > > > > the
> > > > > > system. I
> > > > > > assume the rationale for the filemap_flush() call was to make
> > > > > > it more
> > > > > > likely the file can be evicted during the garbage collection
> > > > > > run? Was
> > > > > > there
> > > > > > any particular problem leading to addition of this call or
> > > > > > was it
> > > > > > just "it
> > > > > > seemed like a good idea" thing?
> > > > > > 
> > > > > > Thanks in advance for ideas.
> > > > > > 
> > > > > >                                                              
> > > > > >   Honza
> > > > > 
> > > > > It was mainly introduced to reduce the amount of work that
> > > > > nfsd_file_free() needs to do. In particular when re-exporting
> > > > > NFS, the
> > > > > call to filp_close() can be expensive because it synchronously
> > > > > flushes
> > > > > out dirty pages. That again means that some of the calls to
> > > > > nfsd_file_dispose_list() can end up being very expensive
> > > > > (particularly
> > > > > the ones run by the garbage collector itself).
> > > > 
> > > > The "no regressions" rule suggests that some kind of action needs
> > > > to be taken. I don't have a sense of whether Jan's workload or
> > > > NFS
> > > > re-export is the more common use case, however.
> > > > 
> > > > I can see that filp_close() on a file on an NFS mount could be
> > > > costly if that file has dirty pages, due to the NFS client's
> > > > "flush on close" semantic.
> > > > 
> > > > Trond, are there alternatives to flushing in the nfsd_file_put()
> > > > path? I'm willing to entertain bug fix patches rather than a
> > > > mechanical revert of b6669305d35a.
> > > 
> > > Yeah, I don't think we need to rush fixing this with a revert.
> > 
> > Sorry I wasn't clear: I would prefer to apply a bug fix over
> > sending a revert commit, and I do not have enough information
> > yet to make that choice. Waiting a bit is OK with me.
> > 
> > 
> > > Also because
> > > just removing the filemap_flush() from nfsd_file_put() would keep
> > > other
> > > benefits of that commit while fixing the regression AFAIU. But I
> > > think
> > > making flushing less aggressive is desirable because as I wrote in
> > > my other
> > > reply, currently it is preventing pagecache from accumulating
> > > enough dirty
> > > data for a good IO pattern.
> > 
> > I might even go so far as to say that a small write workload
> > isn't especially good for solid state storage either.
> > 
> > I know Trond is trying to address NFS re-export performance, but
> > there appear to be some palpable effects outside of that narrow
> > use case that need to be considered. Thus a server-side fix,
> > rather than a fix in the NFS client used to do the re-export,
> > seems appropriate to consider.
> 
> Turns out it is not just the NFS client that is the problem. It is
> rather that we need in general to be able to detect flush errors and
> either report them directly (through commit) or we need to change the
> boot verifier to force clients to resend the unstable writes.
> 
> Hence, I think we're looking at something like this:

Thanks for the fix! I've run the patch through some testing and your fix
indeed restores the good IO pattern and returns the performance back to
original levels. So feel free to add:

Tested-by: Jan Kara <jack@suse.cz>

								Honza

> 
> 8<--------------------------------------------------------------------
> From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> Date: Wed, 30 Mar 2022 16:55:38 -0400
> Subject: [PATCH] nfsd: Fix a write performance regression
> 
> The call to filemap_flush() in nfsd_file_put() is there to ensure that
> we clear out any writes belonging to a NFSv3 client relatively quickly
> and avoid situations where the file can't be evicted by the garbage
> collector. It also ensures that we detect write errors quickly.
> 
> The problem is this causes a regression in performance for some
> workloads.
> 
> So try to improve matters by deferring writeback until we're ready to
> close the file, and need to detect errors so that we can force the
> client to resend.
> 
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  fs/nfsd/filecache.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 8bc807c5fea4..9578a6317709 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
>  	return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
>  }
>  
> +static void
> +nfsd_file_flush(struct nfsd_file *nf)
> +{
> +	if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
> +		nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> +}
> +
>  static void
>  nfsd_file_do_unhash(struct nfsd_file *nf)
>  {
> @@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
>  		return;
>  	}
>  
> -	filemap_flush(nf->nf_file->f_mapping);
>  	is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
> -	nfsd_file_put_noref(nf);
> -	if (is_hashed)
> +	if (!is_hashed) {
> +		nfsd_file_flush(nf);
> +		nfsd_file_put_noref(nf);
> +	} else {
> +		nfsd_file_put_noref(nf);
>  		nfsd_file_schedule_laundrette();
> +	}
>  	if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
>  		nfsd_file_gc();
>  }
> @@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
>  	while(!list_empty(dispose)) {
>  		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>  		list_del(&nf->nf_lru);
> +		nfsd_file_flush(nf);
>  		nfsd_file_put_noref(nf);
>  	}
>  }
> @@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
>  	while(!list_empty(dispose)) {
>  		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>  		list_del(&nf->nf_lru);
> +		nfsd_file_flush(nf);
>  		if (!refcount_dec_and_test(&nf->nf_ref))
>  			continue;
>  		if (nfsd_file_free(nf))
> -- 
> 2.35.1
> 
> 
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR