Re: Performance regression with random IO pattern from the client

From: Chuck Lever III <chuck.lever@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Trond Myklebust <trondmy@hammerspace.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	"nfbrown@suse.com" <nfbrown@suse.com>,
	Bruce Fields <bfields@redhat.com>
Subject: Re: Performance regression with random IO pattern from the client
Date: Thu, 31 Mar 2022 14:20:11 +0000	[thread overview]
Message-ID: <E275D6B3-9F15-4002-9967-B35820A01937@oracle.com> (raw)
In-Reply-To: <20220331130935.ejqu3kxsjm2tvlly@quack3.lan>

> On Mar 31, 2022, at 9:09 AM, Jan Kara <jack@suse.cz> wrote:
> 
> On Wed 30-03-22 22:02:39, Trond Myklebust wrote:
>> On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
>>> 
>>> 
>>>> On Mar 30, 2022, at 12:19 PM, Jan Kara <jack@suse.cz> wrote:
>>>> 
>>>> On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
>>>>>> On Mar 30, 2022, at 11:03 AM, Trond Myklebust
>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>> 
>>>>>> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> during our performance testing we have noticed that commit
>>>>>>> b6669305d35a
>>>>>>> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has
>>>>>>> introduced
>>>>>>> a
>>>>>>> performance regression when a client does random buffered
>>>>>>> writes. The
>>>>>>> workload on NFS client is fio running 4 processed doing
>>>>>>> random
>>>>>>> buffered writes to 4
>>>>>>> different files and the files are large enough to hit dirty
>>>>>>> limits
>>>>>>> and
>>>>>>> force writeback from the client. In particular the invocation
>>>>>>> is
>>>>>>> like:
>>>>>>> 
>>>>>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1
>>>>>>> --
>>>>>>> invalidate=1 --group_reporting=1 --runtime=300 --
>>>>>>> fallocate=posix --
>>>>>>> ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
>>>>>>> rw=randwrite --size=4000m --numjobs=4 --bs=4k --
>>>>>>> filename_format=FioWorkloads.\$jobnum --end_fsync=1
>>>>>>> 
>>>>>>> The reason why commit b6669305d35a regresses performance is
>>>>>>> the
>>>>>>> filemap_flush() call it adds into nfsd_file_put(). Before
>>>>>>> this commit
>>>>>>> writeback on the server happened from nfsd_commit() code
>>>>>>> resulting in
>>>>>>> rather long semisequential streams of 4k writes. After commit
>>>>>>> b6669305d35a
>>>>>>> all the writeback happens from filemap_flush() calls
>>>>>>> resulting in
>>>>>>> much
>>>>>>> longer average seek distance (IO from different files is more
>>>>>>> interleaved)
>>>>>>> and about 16-20% regression in the achieved writeback
>>>>>>> throughput when
>>>>>>> the
>>>>>>> backing store is rotational storage.
>>>>>>> 
>>>>>>> I think the filemap_flush() from nfsd_file_put() is indeed
>>>>>>> rather
>>>>>>> aggressive and I think we'd be better off to just leave
>>>>>>> writeback to
>>>>>>> either
>>>>>>> nfsd_commit() or standard dirty page cleaning happening on
>>>>>>> the
>>>>>>> system. I
>>>>>>> assume the rationale for the filemap_flush() call was to make
>>>>>>> it more
>>>>>>> likely the file can be evicted during the garbage collection
>>>>>>> run? Was
>>>>>>> there
>>>>>>> any particular problem leading to addition of this call or
>>>>>>> was it
>>>>>>> just "it
>>>>>>> seemed like a good idea" thing?
>>>>>>> 
>>>>>>> Thanks in advance for ideas.
>>>>>>> 
>>>>>>>                                                              
>>>>>>>   Honza
>>>>>> 
>>>>>> It was mainly introduced to reduce the amount of work that
>>>>>> nfsd_file_free() needs to do. In particular when re-exporting
>>>>>> NFS, the
>>>>>> call to filp_close() can be expensive because it synchronously
>>>>>> flushes
>>>>>> out dirty pages. That again means that some of the calls to
>>>>>> nfsd_file_dispose_list() can end up being very expensive
>>>>>> (particularly
>>>>>> the ones run by the garbage collector itself).
>>>>> 
>>>>> The "no regressions" rule suggests that some kind of action needs
>>>>> to be taken. I don't have a sense of whether Jan's workload or
>>>>> NFS
>>>>> re-export is the more common use case, however.
>>>>> 
>>>>> I can see that filp_close() on a file on an NFS mount could be
>>>>> costly if that file has dirty pages, due to the NFS client's
>>>>> "flush on close" semantic.
>>>>> 
>>>>> Trond, are there alternatives to flushing in the nfsd_file_put()
>>>>> path? I'm willing to entertain bug fix patches rather than a
>>>>> mechanical revert of b6669305d35a.
>>>> 
>>>> Yeah, I don't think we need to rush fixing this with a revert.
>>> 
>>> Sorry I wasn't clear: I would prefer to apply a bug fix over
>>> sending a revert commit, and I do not have enough information
>>> yet to make that choice. Waiting a bit is OK with me.
>>> 
>>> 
>>>> Also because
>>>> just removing the filemap_flush() from nfsd_file_put() would keep
>>>> other
>>>> benefits of that commit while fixing the regression AFAIU. But I
>>>> think
>>>> making flushing less aggressive is desirable because as I wrote in
>>>> my other
>>>> reply, currently it is preventing pagecache from accumulating
>>>> enough dirty
>>>> data for a good IO pattern.
>>> 
>>> I might even go so far as to say that a small write workload
>>> isn't especially good for solid state storage either.
>>> 
>>> I know Trond is trying to address NFS re-export performance, but
>>> there appear to be some palpable effects outside of that narrow
>>> use case that need to be considered. Thus a server-side fix,
>>> rather than a fix in the NFS client used to do the re-export,
>>> seems appropriate to consider.
>> 
>> Turns out it is not just the NFS client that is the problem. It is
>> rather that we need in general to be able to detect flush errors and
>> either report them directly (through commit) or we need to change the
>> boot verifier to force clients to resend the unstable writes.
>> 
>> Hence, I think we're looking at something like this:
> 
> Thanks for the fix! I've run the patch through some testing and your fix
> indeed restores the good IO pattern and returns the performance back to
> original levels. So feel free to add:
> 
> Tested-by: Jan Kara <jack@suse.cz>

Excellent. I'll queue this up in the NFSD tree for 5.17-rc.

> 
> 								Honza
> 
>> 
>> 8<--------------------------------------------------------------------
>> From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>> Date: Wed, 30 Mar 2022 16:55:38 -0400
>> Subject: [PATCH] nfsd: Fix a write performance regression
>> 
>> The call to filemap_flush() in nfsd_file_put() is there to ensure that
>> we clear out any writes belonging to a NFSv3 client relatively quickly
>> and avoid situations where the file can't be evicted by the garbage
>> collector. It also ensures that we detect write errors quickly.
>> 
>> The problem is this causes a regression in performance for some
>> workloads.
>> 
>> So try to improve matters by deferring writeback until we're ready to
>> close the file, and need to detect errors so that we can force the
>> client to resend.
>> 
>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>> ---
>> fs/nfsd/filecache.c | 18 +++++++++++++++---
>> 1 file changed, 15 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>> index 8bc807c5fea4..9578a6317709 100644
>> --- a/fs/nfsd/filecache.c
>> +++ b/fs/nfsd/filecache.c
>> @@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
>> 	return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
>> }
>> 
>> +static void
>> +nfsd_file_flush(struct nfsd_file *nf)
>> +{
>> +	if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
>> +		nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>> +}
>> +
>> static void
>> nfsd_file_do_unhash(struct nfsd_file *nf)
>> {
>> @@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
>> 		return;
>> 	}
>> 
>> -	filemap_flush(nf->nf_file->f_mapping);
>> 	is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
>> -	nfsd_file_put_noref(nf);
>> -	if (is_hashed)
>> +	if (!is_hashed) {
>> +		nfsd_file_flush(nf);
>> +		nfsd_file_put_noref(nf);
>> +	} else {
>> +		nfsd_file_put_noref(nf);
>> 		nfsd_file_schedule_laundrette();
>> +	}
>> 	if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
>> 		nfsd_file_gc();
>> }
>> @@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
>> 	while(!list_empty(dispose)) {
>> 		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>> 		list_del(&nf->nf_lru);
>> +		nfsd_file_flush(nf);
>> 		nfsd_file_put_noref(nf);
>> 	}
>> }
>> @@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
>> 	while(!list_empty(dispose)) {
>> 		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>> 		list_del(&nf->nf_lru);
>> +		nfsd_file_flush(nf);
>> 		if (!refcount_dec_and_test(&nf->nf_ref))
>> 			continue;
>> 		if (nfsd_file_free(nf))
>> -- 
>> 2.35.1
>> 
>> 
>> 
>> -- 
>> Trond Myklebust
>> Linux NFS client maintainer, Hammerspace
>> trond.myklebust@hammerspace.com
>> 
>> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

--
Chuck Lever