From: Chuck Lever III <chuck.lever@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Trond Myklebust <trondmy@hammerspace.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	"nfbrown@suse.com" <nfbrown@suse.com>,
	Bruce Fields <bfields@redhat.com>
Subject: Re: Performance regression with random IO pattern from the client
Date: Thu, 31 Mar 2022 14:22:38 +0000
Message-ID: <020930EF-E866-40CE-83BA-5539C55C091C@oracle.com>
In-Reply-To: <E275D6B3-9F15-4002-9967-B35820A01937@oracle.com>

 

> On Mar 31, 2022, at 10:20 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
> 
> 
>> On Mar 31, 2022, at 9:09 AM, Jan Kara <jack@suse.cz> wrote:
>> 
>> On Wed 30-03-22 22:02:39, Trond Myklebust wrote:
>>> On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
>>>> 
>>>> 
>>>>> On Mar 30, 2022, at 12:19 PM, Jan Kara <jack@suse.cz> wrote:
>>>>> 
>>>>> On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
>>>>>>> On Mar 30, 2022, at 11:03 AM, Trond Myklebust
>>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>>> 
>>>>>>> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> during our performance testing we have noticed that commit
>>>>>>>> b6669305d35a ("nfsd: Reduce the number of calls to nfsd_file_gc()")
>>>>>>>> has introduced a performance regression when a client does random
>>>>>>>> buffered writes. The workload on the NFS client is fio running four
>>>>>>>> processes doing random buffered writes to four different files; the
>>>>>>>> files are large enough to hit the dirty limits and force writeback
>>>>>>>> from the client. In particular, the invocation looks like:
>>>>>>>> 
>>>>>>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 \
>>>>>>>>     --invalidate=1 --group_reporting=1 --runtime=300 \
>>>>>>>>     --fallocate=posix --ramp_time=10 --name=RandomReads-128000-4k-4 \
>>>>>>>>     --new_group --rw=randwrite --size=4000m --numjobs=4 --bs=4k \
>>>>>>>>     --filename_format=FioWorkloads.\$jobnum --end_fsync=1
>>>>>>>> 
>>>>>>>> The reason why commit b6669305d35a regresses performance is the
>>>>>>>> filemap_flush() call it adds to nfsd_file_put(). Before this commit,
>>>>>>>> writeback on the server happened from the nfsd_commit() code,
>>>>>>>> resulting in rather long, semi-sequential streams of 4k writes.
>>>>>>>> After commit b6669305d35a, all the writeback happens from
>>>>>>>> filemap_flush() calls, resulting in a much longer average seek
>>>>>>>> distance (IO from different files is more interleaved) and about a
>>>>>>>> 16-20% regression in the achieved writeback throughput when the
>>>>>>>> backing store is rotational storage.
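>>>>>>>>
>>>>>>>> For reference, after b6669305d35a the tail of nfsd_file_put() looks
>>>>>>>> roughly like this (simplified excerpt; the filemap_flush() line is
>>>>>>>> the one that commit added):
>>>>>>>>
>>>>>>>> void
>>>>>>>> nfsd_file_put(struct nfsd_file *nf)
>>>>>>>> {
>>>>>>>> 	bool is_hashed;
>>>>>>>>
>>>>>>>> 	/* ... refcount handling elided ... */
>>>>>>>>
>>>>>>>> 	/* kick off async writeback of this file on every put */
>>>>>>>> 	filemap_flush(nf->nf_file->f_mapping);
>>>>>>>> 	is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
>>>>>>>> 	nfsd_file_put_noref(nf);
>>>>>>>> 	if (is_hashed)
>>>>>>>> 		nfsd_file_schedule_laundrette();
>>>>>>>> 	if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
>>>>>>>> 		nfsd_file_gc();
>>>>>>>> }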
>>>>>>>> 
>>>>>>>> I think the filemap_flush() from nfsd_file_put() is indeed rather
>>>>>>>> aggressive, and we'd be better off just leaving writeback to either
>>>>>>>> nfsd_commit() or the standard dirty page cleaning happening on the
>>>>>>>> system. I assume the rationale for the filemap_flush() call was to
>>>>>>>> make it more likely that the file can be evicted during the garbage
>>>>>>>> collection run? Was there any particular problem that led to the
>>>>>>>> addition of this call, or was it just an "it seemed like a good
>>>>>>>> idea" thing?
>>>>>>>> 
>>>>>>>> Thanks in advance for ideas.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  Honza
>>>>>>> 
>>>>>>> It was mainly introduced to reduce the amount of work that
>>>>>>> nfsd_file_free() needs to do. In particular when re-exporting NFS,
>>>>>>> the call to filp_close() can be expensive because it synchronously
>>>>>>> flushes out dirty pages. That again means that some of the calls to
>>>>>>> nfsd_file_dispose_list() can end up being very expensive
>>>>>>> (particularly the ones run by the garbage collector itself).
>>>>>> 
>>>>>> The "no regressions" rule suggests that some kind of action needs
>>>>>> to be taken. I don't have a sense of whether Jan's workload or
>>>>>> NFS
>>>>>> re-export is the more common use case, however.
>>>>>> 
>>>>>> I can see that filp_close() on a file on an NFS mount could be
>>>>>> costly if that file has dirty pages, due to the NFS client's
>>>>>> "flush on close" semantic.
>>>>>> 
>>>>>> Trond, are there alternatives to flushing in the nfsd_file_put()
>>>>>> path? I'm willing to entertain bug fix patches rather than a
>>>>>> mechanical revert of b6669305d35a.
>>>>> 
>>>>> Yeah, I don't think we need to rush fixing this with a revert.
>>>> 
>>>> Sorry I wasn't clear: I would prefer to apply a bug fix over
>>>> sending a revert commit, and I do not have enough information
>>>> yet to make that choice. Waiting a bit is OK with me.
>>>> 
>>>> 
>>>>> Also because just removing the filemap_flush() from nfsd_file_put()
>>>>> would keep the other benefits of that commit while fixing the
>>>>> regression, AFAIU. But I think making flushing less aggressive is
>>>>> desirable because, as I wrote in my other reply, it is currently
>>>>> preventing the pagecache from accumulating enough dirty data for a
>>>>> good IO pattern.
>>>> 
>>>> I might even go so far as to say that a small write workload
>>>> isn't especially good for solid state storage either.
>>>> 
>>>> I know Trond is trying to address NFS re-export performance, but
>>>> there appear to be some palpable effects outside of that narrow
>>>> use case that need to be considered. Thus a server-side fix,
>>>> rather than a fix in the NFS client used to do the re-export,
>>>> seems appropriate to consider.
>>> 
>>> Turns out it is not just the NFS client that is the problem. Rather, we
>>> generally need to be able to detect flush errors and either report them
>>> directly (through commit) or change the boot verifier to force clients
>>> to resend their unstable writes.
>>> 
>>> Hence, I think we're looking at something like this:
>> 
>> Thanks for the fix! I've run the patch through some testing and your fix
>> indeed restores the good IO pattern and returns the performance back to
>> original levels. So feel free to add:
>> 
>> Tested-by: Jan Kara <jack@suse.cz>
> 
> Excellent. I'll queue this up in the NFSD tree for 5.17-rc.

Belay that. 5.18-rc.


>> 
>> 								Honza
>> 
>>> 
>>> 8<--------------------------------------------------------------------
>>> From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>>> Date: Wed, 30 Mar 2022 16:55:38 -0400
>>> Subject: [PATCH] nfsd: Fix a write performance regression
>>> 
>>> The call to filemap_flush() in nfsd_file_put() is there to ensure that
>>> we clear out any writes belonging to a NFSv3 client relatively quickly
>>> and avoid situations where the file can't be evicted by the garbage
>>> collector. It also ensures that we detect write errors quickly.
>>> 
>>> The problem is this causes a regression in performance for some
>>> workloads.
>>> 
>>> So try to improve matters by deferring writeback until we're ready to
>>> close the file, and need to detect errors so that we can force the
>>> client to resend.
>>> 
>>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>>> ---
>>> fs/nfsd/filecache.c | 18 +++++++++++++++---
>>> 1 file changed, 15 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>> index 8bc807c5fea4..9578a6317709 100644
>>> --- a/fs/nfsd/filecache.c
>>> +++ b/fs/nfsd/filecache.c
>>> @@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
>>> 	return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
>>> }
>>> 
>>> +static void
>>> +nfsd_file_flush(struct nfsd_file *nf)
>>> +{
>>> +	if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
>>> +		nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>>> +}
>>> +
>>> static void
>>> nfsd_file_do_unhash(struct nfsd_file *nf)
>>> {
>>> @@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
>>> 		return;
>>> 	}
>>> 
>>> -	filemap_flush(nf->nf_file->f_mapping);
>>> 	is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
>>> -	nfsd_file_put_noref(nf);
>>> -	if (is_hashed)
>>> +	if (!is_hashed) {
>>> +		nfsd_file_flush(nf);
>>> +		nfsd_file_put_noref(nf);
>>> +	} else {
>>> +		nfsd_file_put_noref(nf);
>>> 		nfsd_file_schedule_laundrette();
>>> +	}
>>> 	if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
>>> 		nfsd_file_gc();
>>> }
>>> @@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
>>> 	while(!list_empty(dispose)) {
>>> 		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>>> 		list_del(&nf->nf_lru);
>>> +		nfsd_file_flush(nf);
>>> 		nfsd_file_put_noref(nf);
>>> 	}
>>> }
>>> @@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
>>> 	while(!list_empty(dispose)) {
>>> 		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>>> 		list_del(&nf->nf_lru);
>>> +		nfsd_file_flush(nf);
>>> 		if (!refcount_dec_and_test(&nf->nf_ref))
>>> 			continue;
>>> 		if (nfsd_file_free(nf))
>>> -- 
>>> 2.35.1
>>> 
>>> 
>>> 
>>> -- 
>>> Trond Myklebust
>>> Linux NFS client maintainer, Hammerspace
>>> trond.myklebust@hammerspace.com
>>> 
>>> 
>> -- 
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
> 
> --
> Chuck Lever

--
Chuck Lever



