From: Chuck Lever III <chuck.lever@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Trond Myklebust <trondmy@hammerspace.com>,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
"nfbrown@suse.com" <nfbrown@suse.com>,
Bruce Fields <bfields@redhat.com>
Subject: Re: Performance regression with random IO pattern from the client
Date: Thu, 31 Mar 2022 14:20:11 +0000 [thread overview]
Message-ID: <E275D6B3-9F15-4002-9967-B35820A01937@oracle.com> (raw)
In-Reply-To: <20220331130935.ejqu3kxsjm2tvlly@quack3.lan>
> On Mar 31, 2022, at 9:09 AM, Jan Kara <jack@suse.cz> wrote:
>
> On Wed 30-03-22 22:02:39, Trond Myklebust wrote:
>> On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
>>>
>>>
>>>> On Mar 30, 2022, at 12:19 PM, Jan Kara <jack@suse.cz> wrote:
>>>>
>>>> On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
>>>>>> On Mar 30, 2022, at 11:03 AM, Trond Myklebust
>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>>
>>>>>> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> during our performance testing we have noticed that commit
>>>>>>> b6669305d35a
>>>>>>> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has
>>>>>>> introduced
>>>>>>> a
>>>>>>> performance regression when a client does random buffered
>>>>>>> writes. The
>>>>>>> workload on NFS client is fio running 4 processed doing
>>>>>>> random
>>>>>>> buffered writes to 4
>>>>>>> different files and the files are large enough to hit dirty
>>>>>>> limits
>>>>>>> and
>>>>>>> force writeback from the client. In particular the invocation
>>>>>>> is
>>>>>>> like:
>>>>>>>
>>>>>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1
>>>>>>> --
>>>>>>> invalidate=1 --group_reporting=1 --runtime=300 --
>>>>>>> fallocate=posix --
>>>>>>> ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
>>>>>>> rw=randwrite --size=4000m --numjobs=4 --bs=4k --
>>>>>>> filename_format=FioWorkloads.\$jobnum --end_fsync=1
>>>>>>>
>>>>>>> The reason why commit b6669305d35a regresses performance is
>>>>>>> the
>>>>>>> filemap_flush() call it adds into nfsd_file_put(). Before
>>>>>>> this commit
>>>>>>> writeback on the server happened from nfsd_commit() code
>>>>>>> resulting in
>>>>>>> rather long semisequential streams of 4k writes. After commit
>>>>>>> b6669305d35a
>>>>>>> all the writeback happens from filemap_flush() calls
>>>>>>> resulting in
>>>>>>> much
>>>>>>> longer average seek distance (IO from different files is more
>>>>>>> interleaved)
>>>>>>> and about 16-20% regression in the achieved writeback
>>>>>>> throughput when
>>>>>>> the
>>>>>>> backing store is rotational storage.
>>>>>>>
>>>>>>> I think the filemap_flush() from nfsd_file_put() is indeed
>>>>>>> rather
>>>>>>> aggressive and I think we'd be better off to just leave
>>>>>>> writeback to
>>>>>>> either
>>>>>>> nfsd_commit() or standard dirty page cleaning happening on
>>>>>>> the
>>>>>>> system. I
>>>>>>> assume the rationale for the filemap_flush() call was to make
>>>>>>> it more
>>>>>>> likely the file can be evicted during the garbage collection
>>>>>>> run? Was
>>>>>>> there
>>>>>>> any particular problem leading to addition of this call or
>>>>>>> was it
>>>>>>> just "it
>>>>>>> seemed like a good idea" thing?
>>>>>>>
>>>>>>> Thanks in advance for ideas.
>>>>>>>
>>>>>>>
>>>>>>> Honza
>>>>>>
>>>>>> It was mainly introduced to reduce the amount of work that
>>>>>> nfsd_file_free() needs to do. In particular when re-exporting
>>>>>> NFS, the
>>>>>> call to filp_close() can be expensive because it synchronously
>>>>>> flushes
>>>>>> out dirty pages. That again means that some of the calls to
>>>>>> nfsd_file_dispose_list() can end up being very expensive
>>>>>> (particularly
>>>>>> the ones run by the garbage collector itself).
>>>>>
>>>>> The "no regressions" rule suggests that some kind of action needs
>>>>> to be taken. I don't have a sense of whether Jan's workload or
>>>>> NFS
>>>>> re-export is the more common use case, however.
>>>>>
>>>>> I can see that filp_close() on a file on an NFS mount could be
>>>>> costly if that file has dirty pages, due to the NFS client's
>>>>> "flush on close" semantic.
>>>>>
>>>>> Trond, are there alternatives to flushing in the nfsd_file_put()
>>>>> path? I'm willing to entertain bug fix patches rather than a
>>>>> mechanical revert of b6669305d35a.
>>>>
>>>> Yeah, I don't think we need to rush fixing this with a revert.
>>>
>>> Sorry I wasn't clear: I would prefer to apply a bug fix over
>>> sending a revert commit, and I do not have enough information
>>> yet to make that choice. Waiting a bit is OK with me.
>>>
>>>
>>>> Also because
>>>> just removing the filemap_flush() from nfsd_file_put() would keep
>>>> other
>>>> benefits of that commit while fixing the regression AFAIU. But I
>>>> think
>>>> making flushing less aggressive is desirable because as I wrote in
>>>> my other
>>>> reply, currently it is preventing pagecache from accumulating
>>>> enough dirty
>>>> data for a good IO pattern.
>>>
>>> I might even go so far as to say that a small write workload
>>> isn't especially good for solid state storage either.
>>>
>>> I know Trond is trying to address NFS re-export performance, but
>>> there appear to be some palpable effects outside of that narrow
>>> use case that need to be considered. Thus a server-side fix,
>>> rather than a fix in the NFS client used to do the re-export,
>>> seems appropriate to consider.
>>
>> Turns out it is not just the NFS client that is the problem. It is
>> rather that we need in general to be able to detect flush errors and
>> either report them directly (through commit) or we need to change the
>> boot verifier to force clients to resend the unstable writes.
>>
>> Hence, I think we're looking at something like this:
>
> Thanks for the fix! I've run the patch through some testing and your fix
> indeed restores the good IO pattern and returns the performance back to
> original levels. So feel free to add:
>
> Tested-by: Jan Kara <jack@suse.cz>
Excellent. I'll queue this up in the NFSD tree for 5.17-rc.
>
> Honza
>
>>
>> 8<--------------------------------------------------------------------
>> From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>> Date: Wed, 30 Mar 2022 16:55:38 -0400
>> Subject: [PATCH] nfsd: Fix a write performance regression
>>
>> The call to filemap_flush() in nfsd_file_put() is there to ensure that
>> we clear out any writes belonging to a NFSv3 client relatively quickly
>> and avoid situations where the file can't be evicted by the garbage
>> collector. It also ensures that we detect write errors quickly.
>>
>> The problem is this causes a regression in performance for some
>> workloads.
>>
>> So try to improve matters by deferring writeback until we're ready to
>> close the file, and need to detect errors so that we can force the
>> client to resend.
>>
>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>> ---
>> fs/nfsd/filecache.c | 18 +++++++++++++++---
>> 1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>> index 8bc807c5fea4..9578a6317709 100644
>> --- a/fs/nfsd/filecache.c
>> +++ b/fs/nfsd/filecache.c
>> @@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
>> return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
>> }
>>
>> +static void
>> +nfsd_file_flush(struct nfsd_file *nf)
>> +{
>> + if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
>> + nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>> +}
>> +
>> static void
>> nfsd_file_do_unhash(struct nfsd_file *nf)
>> {
>> @@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
>> return;
>> }
>>
>> - filemap_flush(nf->nf_file->f_mapping);
>> is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
>> - nfsd_file_put_noref(nf);
>> - if (is_hashed)
>> + if (!is_hashed) {
>> + nfsd_file_flush(nf);
>> + nfsd_file_put_noref(nf);
>> + } else {
>> + nfsd_file_put_noref(nf);
>> nfsd_file_schedule_laundrette();
>> + }
>> if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
>> nfsd_file_gc();
>> }
>> @@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
>> while(!list_empty(dispose)) {
>> nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>> list_del(&nf->nf_lru);
>> + nfsd_file_flush(nf);
>> nfsd_file_put_noref(nf);
>> }
>> }
>> @@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
>> while(!list_empty(dispose)) {
>> nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>> list_del(&nf->nf_lru);
>> + nfsd_file_flush(nf);
>> if (!refcount_dec_and_test(&nf->nf_ref))
>> continue;
>> if (nfsd_file_free(nf))
>> --
>> 2.35.1
>>
>>
>>
>> --
>> Trond Myklebust
>> Linux NFS client maintainer, Hammerspace
>> trond.myklebust@hammerspace.com
>>
>>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
--
Chuck Lever
next prev parent reply other threads:[~2022-03-31 14:20 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-03-30 10:34 Performance regression with random IO pattern from the client Jan Kara
2022-03-30 15:03 ` Trond Myklebust
2022-03-30 15:38 ` Chuck Lever III
2022-03-30 16:19 ` Trond Myklebust
2022-03-30 16:19 ` Jan Kara
2022-03-30 17:56 ` Chuck Lever III
2022-03-30 22:02 ` Trond Myklebust
2022-03-31 13:09 ` Jan Kara
2022-03-31 14:20 ` Chuck Lever III [this message]
2022-03-31 14:22 ` Chuck Lever III
2022-03-30 16:14 ` Jan Kara
2022-03-31 8:43 ` Thorsten Leemhuis
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=E275D6B3-9F15-4002-9967-B35820A01937@oracle.com \
--to=chuck.lever@oracle.com \
--cc=bfields@redhat.com \
--cc=jack@suse.cz \
--cc=linux-nfs@vger.kernel.org \
--cc=nfbrown@suse.com \
--cc=trondmy@hammerspace.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).