* Performance regression with random IO pattern from the client
@ 2022-03-30 10:34 Jan Kara
  2022-03-30 15:03 ` Trond Myklebust
  2022-03-31  8:43 ` Thorsten Leemhuis
  0 siblings, 2 replies; 12+ messages in thread
From: Jan Kara @ 2022-03-30 10:34 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: J. Bruce Fields, Chuck Lever, linux-nfs, nfbrown

Hello,

during our performance testing we have noticed that commit b6669305d35a
("nfsd: Reduce the number of calls to nfsd_file_gc()") has introduced a
performance regression when a client does random buffered writes. The
workload on the NFS client is fio running 4 processes doing random buffered
writes to 4 different files; the files are large enough to hit dirty limits
and force writeback from the client. In particular, the invocation is like:

fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 --invalidate=1 --group_reporting=1 --runtime=300 --fallocate=posix --ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --rw=randwrite --size=4000m --numjobs=4 --bs=4k --filename_format=FioWorkloads.\$jobnum --end_fsync=1

The reason why commit b6669305d35a regresses performance is the
filemap_flush() call it adds to nfsd_file_put(). Before this commit,
writeback on the server happened from the nfsd_commit() code, resulting in
rather long, semi-sequential streams of 4k writes. After commit
b6669305d35a, all the writeback happens from filemap_flush() calls,
resulting in a much longer average seek distance (IO from different files
is more interleaved) and an about 16-20% regression in the achieved
writeback throughput when the backing store is rotational storage.
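
For illustration, the tail of nfsd_file_put() with that commit applied
boils down to roughly the following (a simplified sketch, not the exact
upstream code):

	void nfsd_file_put(struct nfsd_file *nf)
	{
		bool is_hashed;

		/* ... refcount handling elided ... */

		/* The call in question: kicks off async writeback of this
		 * file's dirty pages on every put. */
		filemap_flush(nf->nf_file->f_mapping);

		is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
		nfsd_file_put_noref(nf);
		if (is_hashed)
			nfsd_file_schedule_laundrette();
		if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
			nfsd_file_gc();
	}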

I think the filemap_flush() from nfsd_file_put() is indeed rather
aggressive, and we'd be better off just leaving writeback to either
nfsd_commit() or the standard dirty page cleaning happening on the system.
I assume the rationale for the filemap_flush() call was to make it more
likely that the file can be evicted during the garbage collection run? Was
there any particular problem leading to the addition of this call, or was
it just an "it seemed like a good idea" thing?

Thanks in advance for ideas.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: Performance regression with random IO pattern from the client
  2022-03-30 10:34 Performance regression with random IO pattern from the client Jan Kara
@ 2022-03-30 15:03 ` Trond Myklebust
  2022-03-30 15:38   ` Chuck Lever III
  2022-03-30 16:14   ` Jan Kara
  2022-03-31  8:43 ` Thorsten Leemhuis
  1 sibling, 2 replies; 12+ messages in thread
From: Trond Myklebust @ 2022-03-30 15:03 UTC (permalink / raw)
  To: jack; +Cc: linux-nfs, nfbrown, bfields, chuck.lever

On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> Hello,
> 
> during our performance testing we have noticed that commit
> b6669305d35a
> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has introduced
> a
> performance regression when a client does random buffered writes. The
> workload on the NFS client is fio running 4 processes doing random
> buffered writes to 4
> different files and the files are large enough to hit dirty limits
> and
> force writeback from the client. In particular the invocation is
> like:
> 
> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 --
> invalidate=1 --group_reporting=1 --runtime=300 --fallocate=posix --
> ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
> rw=randwrite --size=4000m --numjobs=4 --bs=4k --
> filename_format=FioWorkloads.\$jobnum --end_fsync=1
> 
> The reason why commit b6669305d35a regresses performance is the
> filemap_flush() call it adds into nfsd_file_put(). Before this commit
> writeback on the server happened from nfsd_commit() code resulting in
> rather long semisequential streams of 4k writes. After commit
> b6669305d35a
> all the writeback happens from filemap_flush() calls resulting in
> much
> longer average seek distance (IO from different files is more
> interleaved)
> and about 16-20% regression in the achieved writeback throughput when
> the
> backing store is rotational storage.
> 
> I think the filemap_flush() from nfsd_file_put() is indeed rather
> aggressive and I think we'd be better off to just leave writeback to
> either
> nfsd_commit() or standard dirty page cleaning happening on the
> system. I
> assume the rationale for the filemap_flush() call was to make it more
> likely the file can be evicted during the garbage collection run? Was
> there
> any particular problem leading to addition of this call or was it
> just "it
> seemed like a good idea" thing?
> 
> Thanks in advance for ideas.
> 
>                                                                 Honza

It was mainly introduced to reduce the amount of work that
nfsd_file_free() needs to do. In particular when re-exporting NFS, the
call to filp_close() can be expensive because it synchronously flushes
out dirty pages. That again means that some of the calls to
nfsd_file_dispose_list() can end up being very expensive (particularly
the ones run by the garbage collector itself).
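
For context, the expensive part is the client's flush-on-close: on an NFS
mount, filp_close() ends up in the ->flush() method, which looks roughly
like this (a simplified sketch, not the exact upstream nfs_file_flush()):

	static int nfs_file_flush(struct file *file, fl_owner_t id)
	{
		struct inode *inode = file_inode(file);

		if ((file->f_mode & FMODE_WRITE) == 0)
			return 0;

		/* Write back all dirty pages and wait for them; this is
		 * what makes filp_close() synchronous and expensive. */
		return nfs_wb_all(inode);
	}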

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Performance regression with random IO pattern from the client
  2022-03-30 15:03 ` Trond Myklebust
@ 2022-03-30 15:38   ` Chuck Lever III
  2022-03-30 16:19     ` Trond Myklebust
  2022-03-30 16:19     ` Jan Kara
  2022-03-30 16:14   ` Jan Kara
  1 sibling, 2 replies; 12+ messages in thread
From: Chuck Lever III @ 2022-03-30 15:38 UTC (permalink / raw)
  To: Trond Myklebust, jack; +Cc: Linux NFS Mailing List, nfbrown, Bruce Fields



> On Mar 30, 2022, at 11:03 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
>> Hello,
>> 
>> during our performance testing we have noticed that commit
>> b6669305d35a
>> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has introduced
>> a
>> performance regression when a client does random buffered writes. The
>> workload on the NFS client is fio running 4 processes doing random
>> buffered writes to 4
>> different files and the files are large enough to hit dirty limits
>> and
>> force writeback from the client. In particular the invocation is
>> like:
>> 
>> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 --
>> invalidate=1 --group_reporting=1 --runtime=300 --fallocate=posix --
>> ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
>> rw=randwrite --size=4000m --numjobs=4 --bs=4k --
>> filename_format=FioWorkloads.\$jobnum --end_fsync=1
>> 
>> The reason why commit b6669305d35a regresses performance is the
>> filemap_flush() call it adds into nfsd_file_put(). Before this commit
>> writeback on the server happened from nfsd_commit() code resulting in
>> rather long semisequential streams of 4k writes. After commit
>> b6669305d35a
>> all the writeback happens from filemap_flush() calls resulting in
>> much
>> longer average seek distance (IO from different files is more
>> interleaved)
>> and about 16-20% regression in the achieved writeback throughput when
>> the
>> backing store is rotational storage.
>> 
>> I think the filemap_flush() from nfsd_file_put() is indeed rather
>> aggressive and I think we'd be better off to just leave writeback to
>> either
>> nfsd_commit() or standard dirty page cleaning happening on the
>> system. I
>> assume the rationale for the filemap_flush() call was to make it more
>> likely the file can be evicted during the garbage collection run? Was
>> there
>> any particular problem leading to addition of this call or was it
>> just "it
>> seemed like a good idea" thing?
>> 
>> Thanks in advance for ideas.
>> 
>>                                                                 Honza
> 
> It was mainly introduced to reduce the amount of work that
> nfsd_file_free() needs to do. In particular when re-exporting NFS, the
> call to filp_close() can be expensive because it synchronously flushes
> out dirty pages. That again means that some of the calls to
> nfsd_file_dispose_list() can end up being very expensive (particularly
> the ones run by the garbage collector itself).

The "no regressions" rule suggests that some kind of action needs
to be taken. I don't have a sense of whether Jan's workload or NFS
re-export is the more common use case, however.

I can see that filp_close() on a file on an NFS mount could be
costly if that file has dirty pages, due to the NFS client's
"flush on close" semantic.

Trond, are there alternatives to flushing in the nfsd_file_put()
path? I'm willing to entertain bug fix patches rather than a
mechanical revert of b6669305d35a.


--
Chuck Lever




* Re: Performance regression with random IO pattern from the client
  2022-03-30 15:03 ` Trond Myklebust
  2022-03-30 15:38   ` Chuck Lever III
@ 2022-03-30 16:14   ` Jan Kara
  1 sibling, 0 replies; 12+ messages in thread
From: Jan Kara @ 2022-03-30 16:14 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: jack, linux-nfs, nfbrown, bfields, chuck.lever

On Wed 30-03-22 15:03:30, Trond Myklebust wrote:
> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> > Hello,
> > 
> > during our performance testing we have noticed that commit
> > b6669305d35a
> > ("nfsd: Reduce the number of calls to nfsd_file_gc()") has introduced
> > a
> > performance regression when a client does random buffered writes. The
> > workload on the NFS client is fio running 4 processes doing random
> > buffered writes to 4
> > different files and the files are large enough to hit dirty limits
> > and
> > force writeback from the client. In particular the invocation is
> > like:
> > 
> > fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 --
> > invalidate=1 --group_reporting=1 --runtime=300 --fallocate=posix --
> > ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
> > rw=randwrite --size=4000m --numjobs=4 --bs=4k --
> > filename_format=FioWorkloads.\$jobnum --end_fsync=1
> > 
> > The reason why commit b6669305d35a regresses performance is the
> > filemap_flush() call it adds into nfsd_file_put(). Before this commit
> > writeback on the server happened from nfsd_commit() code resulting in
> > rather long semisequential streams of 4k writes. After commit
> > b6669305d35a
> > all the writeback happens from filemap_flush() calls resulting in
> > much
> > longer average seek distance (IO from different files is more
> > interleaved)
> > and about 16-20% regression in the achieved writeback throughput when
> > the
> > backing store is rotational storage.
> > 
> > I think the filemap_flush() from nfsd_file_put() is indeed rather
> > aggressive and I think we'd be better off to just leave writeback to
> > either
> > nfsd_commit() or standard dirty page cleaning happening on the
> > system. I
> > assume the rationale for the filemap_flush() call was to make it more
> > likely the file can be evicted during the garbage collection run? Was
> > there
> > any particular problem leading to addition of this call or was it
> > just "it
> > seemed like a good idea" thing?
> > 
> > Thanks in advance for ideas.
> 
> It was mainly introduced to reduce the amount of work that
> nfsd_file_free() needs to do. In particular when re-exporting NFS, the
> call to filp_close() can be expensive because it synchronously flushes
> out dirty pages. That again means that some of the calls to
> nfsd_file_dispose_list() can end up being very expensive (particularly
> the ones run by the garbage collector itself).

I see, thanks for the info. So I'm pondering what options we have for
fixing the performance regression, because the filemap_flush() call in
nfsd_file_put() is just too aggressive and doesn't allow enough dirty data
to accumulate in the page cache for a reasonable IO pattern.

E.g., if the concern is just an overly long nfsd_file_dispose_list()
runtime when there are many files in the dispose list, we could do two
iterations there: a first one that walks all the files and starts async
writeback for all of them, and a second one that drops the references,
which among other things may end up in ->flush() doing the synchronous
writeback (but that will now not have much left to do). This is how
generic writeback actually handles synchronous writeback, because it is
much faster than a submit-one-file, wait-for-one-file loop when there are
multiple files to write. Would something like this (sketched below) be
acceptable for you?
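
Roughly, as an untested sketch reusing the helpers already in
fs/nfsd/filecache.c:

	static void
	nfsd_file_dispose_list(struct list_head *dispose)
	{
		struct nfsd_file *nf;

		/* Pass 1: start async writeback for every file on the list. */
		list_for_each_entry(nf, dispose, nf_lru) {
			if (nf->nf_file)
				filemap_flush(nf->nf_file->f_mapping);
		}

		/* Pass 2: drop the references; a synchronous writeback in
		 * ->flush() now finds most pages already under IO. */
		while (!list_empty(dispose)) {
			nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
			list_del(&nf->nf_lru);
			nfsd_file_put_noref(nf);
		}
	}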

If something like this is not enough, we could also add another delayed
work that walks unused files and starts writeback for them.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: Performance regression with random IO pattern from the client
  2022-03-30 15:38   ` Chuck Lever III
@ 2022-03-30 16:19     ` Trond Myklebust
  2022-03-30 16:19     ` Jan Kara
  1 sibling, 0 replies; 12+ messages in thread
From: Trond Myklebust @ 2022-03-30 16:19 UTC (permalink / raw)
  To: jack, chuck.lever; +Cc: linux-nfs, nfbrown, bfields

On Wed, 2022-03-30 at 15:38 +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 30, 2022, at 11:03 AM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > 
> > On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> > > Hello,
> > > 
> > > during our performance testing we have noticed that commit
> > > b6669305d35a
> > > ("nfsd: Reduce the number of calls to nfsd_file_gc()") has
> > > introduced
> > > a
> > > performance regression when a client does random buffered writes.
> > > The
> > > workload on the NFS client is fio running 4 processes doing random
> > > buffered writes to 4
> > > different files and the files are large enough to hit dirty
> > > limits
> > > and
> > > force writeback from the client. In particular the invocation is
> > > like:
> > > 
> > > fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 --
> > > invalidate=1 --group_reporting=1 --runtime=300 --fallocate=posix
> > > --
> > > ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
> > > rw=randwrite --size=4000m --numjobs=4 --bs=4k --
> > > filename_format=FioWorkloads.\$jobnum --end_fsync=1
> > > 
> > > The reason why commit b6669305d35a regresses performance is the
> > > filemap_flush() call it adds into nfsd_file_put(). Before this
> > > commit
> > > writeback on the server happened from nfsd_commit() code
> > > resulting in
> > > rather long semisequential streams of 4k writes. After commit
> > > b6669305d35a
> > > all the writeback happens from filemap_flush() calls resulting in
> > > much
> > > longer average seek distance (IO from different files is more
> > > interleaved)
> > > and about 16-20% regression in the achieved writeback throughput
> > > when
> > > the
> > > backing store is rotational storage.
> > > 
> > > I think the filemap_flush() from nfsd_file_put() is indeed rather
> > > aggressive and I think we'd be better off to just leave writeback
> > > to
> > > either
> > > nfsd_commit() or standard dirty page cleaning happening on the
> > > system. I
> > > assume the rationale for the filemap_flush() call was to make it
> > > more
> > > likely the file can be evicted during the garbage collection run?
> > > Was
> > > there
> > > any particular problem leading to addition of this call or was it
> > > just "it
> > > seemed like a good idea" thing?
> > > 
> > > Thanks in advance for ideas.
> > > 
> > >                                                                
> > > Honza
> > 
> > It was mainly introduced to reduce the amount of work that
> > nfsd_file_free() needs to do. In particular when re-exporting NFS,
> > the
> > call to filp_close() can be expensive because it synchronously
> > flushes
> > out dirty pages. That again means that some of the calls to
> > nfsd_file_dispose_list() can end up being very expensive
> > (particularly
> > the ones run by the garbage collector itself).
> 
> The "no regressions" rule suggests that some kind of action needs
> to be taken. I don't have a sense of whether Jan's workload or NFS
> re-export is the more common use case, however.
> 
> I can see that filp_close() on a file on an NFS mount could be
> costly if that file has dirty pages, due to the NFS client's
> "flush on close" semantic.
> 
> Trond, are there alternatives to flushing in the nfsd_file_put()
> path? I'm willing to entertain bug fix patches rather than a
> mechanical revert of b6669305d35a.

Hang on a bit. This commit went into Linux 5.6 more than two years ago,
and this is the only report so far of a regression. We can afford the
time to give it some thought, particularly given the fact that the
laundrette currently runs on the system workqueue and so really isn't a
good choice for writeback.

One option might be to simply exit early in nfs_file_flush() if the
caller is a workqueue thread (i.e. enforce close-to-open only for
userspace processes).
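
Something like the following untested sketch (the PF_WQ_WORKER test is
just one way to detect a workqueue caller):

	static int nfs_file_flush(struct file *file, fl_owner_t id)
	{
		struct inode *inode = file_inode(file);

		if ((file->f_mode & FMODE_WRITE) == 0)
			return 0;

		/* Skip flush-on-close when called from a workqueue worker,
		 * e.g. nfsd's filecache laundrette on a re-export. */
		if (current->flags & PF_WQ_WORKER)
			return 0;

		return nfs_wb_all(inode);
	}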

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Performance regression with random IO pattern from the client
  2022-03-30 15:38   ` Chuck Lever III
  2022-03-30 16:19     ` Trond Myklebust
@ 2022-03-30 16:19     ` Jan Kara
  2022-03-30 17:56       ` Chuck Lever III
  1 sibling, 1 reply; 12+ messages in thread
From: Jan Kara @ 2022-03-30 16:19 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Trond Myklebust, jack, Linux NFS Mailing List, nfbrown, Bruce Fields

On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
> > On Mar 30, 2022, at 11:03 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> > 
> > On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> >> Hello,
> >> 
> >> during our performance testing we have noticed that commit
> >> b6669305d35a
> >> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has introduced
> >> a
> >> performance regression when a client does random buffered writes. The
> >> workload on the NFS client is fio running 4 processes doing random
> >> buffered writes to 4
> >> different files and the files are large enough to hit dirty limits
> >> and
> >> force writeback from the client. In particular the invocation is
> >> like:
> >> 
> >> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 --
> >> invalidate=1 --group_reporting=1 --runtime=300 --fallocate=posix --
> >> ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
> >> rw=randwrite --size=4000m --numjobs=4 --bs=4k --
> >> filename_format=FioWorkloads.\$jobnum --end_fsync=1
> >> 
> >> The reason why commit b6669305d35a regresses performance is the
> >> filemap_flush() call it adds into nfsd_file_put(). Before this commit
> >> writeback on the server happened from nfsd_commit() code resulting in
> >> rather long semisequential streams of 4k writes. After commit
> >> b6669305d35a
> >> all the writeback happens from filemap_flush() calls resulting in
> >> much
> >> longer average seek distance (IO from different files is more
> >> interleaved)
> >> and about 16-20% regression in the achieved writeback throughput when
> >> the
> >> backing store is rotational storage.
> >> 
> >> I think the filemap_flush() from nfsd_file_put() is indeed rather
> >> aggressive and I think we'd be better off to just leave writeback to
> >> either
> >> nfsd_commit() or standard dirty page cleaning happening on the
> >> system. I
> >> assume the rationale for the filemap_flush() call was to make it more
> >> likely the file can be evicted during the garbage collection run? Was
> >> there
> >> any particular problem leading to addition of this call or was it
> >> just "it
> >> seemed like a good idea" thing?
> >> 
> >> Thanks in advance for ideas.
> >> 
> >>                                                                 Honza
> > 
> > It was mainly introduced to reduce the amount of work that
> > nfsd_file_free() needs to do. In particular when re-exporting NFS, the
> > call to filp_close() can be expensive because it synchronously flushes
> > out dirty pages. That again means that some of the calls to
> > nfsd_file_dispose_list() can end up being very expensive (particularly
> > the ones run by the garbage collector itself).
> 
> The "no regressions" rule suggests that some kind of action needs
> to be taken. I don't have a sense of whether Jan's workload or NFS
> re-export is the more common use case, however.
> 
> I can see that filp_close() on a file on an NFS mount could be
> costly if that file has dirty pages, due to the NFS client's
> "flush on close" semantic.
> 
> Trond, are there alternatives to flushing in the nfsd_file_put()
> path? I'm willing to entertain bug fix patches rather than a
> mechanical revert of b6669305d35a.

Yeah, I don't think we need to rush fixing this with a revert, also
because just removing the filemap_flush() from nfsd_file_put() would keep
the other benefits of that commit while fixing the regression, AFAIU. But
I think making the flushing less aggressive is desirable because, as I
wrote in my other reply, it currently prevents the pagecache from
accumulating enough dirty data for a good IO pattern.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: Performance regression with random IO pattern from the client
  2022-03-30 16:19     ` Jan Kara
@ 2022-03-30 17:56       ` Chuck Lever III
  2022-03-30 22:02         ` Trond Myklebust
  0 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever III @ 2022-03-30 17:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: Trond Myklebust, Linux NFS Mailing List, nfbrown, Bruce Fields



> On Mar 30, 2022, at 12:19 PM, Jan Kara <jack@suse.cz> wrote:
> 
> On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
>>> On Mar 30, 2022, at 11:03 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>>> 
>>> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
>>>> Hello,
>>>> 
>>>> during our performance testing we have noticed that commit
>>>> b6669305d35a
>>>> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has introduced
>>>> a
>>>> performance regression when a client does random buffered writes. The
>>>> workload on the NFS client is fio running 4 processes doing random
>>>> buffered writes to 4
>>>> different files and the files are large enough to hit dirty limits
>>>> and
>>>> force writeback from the client. In particular the invocation is
>>>> like:
>>>> 
>>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 --
>>>> invalidate=1 --group_reporting=1 --runtime=300 --fallocate=posix --
>>>> ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
>>>> rw=randwrite --size=4000m --numjobs=4 --bs=4k --
>>>> filename_format=FioWorkloads.\$jobnum --end_fsync=1
>>>> 
>>>> The reason why commit b6669305d35a regresses performance is the
>>>> filemap_flush() call it adds into nfsd_file_put(). Before this commit
>>>> writeback on the server happened from nfsd_commit() code resulting in
>>>> rather long semisequential streams of 4k writes. After commit
>>>> b6669305d35a
>>>> all the writeback happens from filemap_flush() calls resulting in
>>>> much
>>>> longer average seek distance (IO from different files is more
>>>> interleaved)
>>>> and about 16-20% regression in the achieved writeback throughput when
>>>> the
>>>> backing store is rotational storage.
>>>> 
>>>> I think the filemap_flush() from nfsd_file_put() is indeed rather
>>>> aggressive and I think we'd be better off to just leave writeback to
>>>> either
>>>> nfsd_commit() or standard dirty page cleaning happening on the
>>>> system. I
>>>> assume the rationale for the filemap_flush() call was to make it more
>>>> likely the file can be evicted during the garbage collection run? Was
>>>> there
>>>> any particular problem leading to addition of this call or was it
>>>> just "it
>>>> seemed like a good idea" thing?
>>>> 
>>>> Thanks in advance for ideas.
>>>> 
>>>>                                                                Honza
>>> 
>>> It was mainly introduced to reduce the amount of work that
>>> nfsd_file_free() needs to do. In particular when re-exporting NFS, the
>>> call to filp_close() can be expensive because it synchronously flushes
>>> out dirty pages. That again means that some of the calls to
>>> nfsd_file_dispose_list() can end up being very expensive (particularly
>>> the ones run by the garbage collector itself).
>> 
>> The "no regressions" rule suggests that some kind of action needs
>> to be taken. I don't have a sense of whether Jan's workload or NFS
>> re-export is the more common use case, however.
>> 
>> I can see that filp_close() on a file on an NFS mount could be
>> costly if that file has dirty pages, due to the NFS client's
>> "flush on close" semantic.
>> 
>> Trond, are there alternatives to flushing in the nfsd_file_put()
>> path? I'm willing to entertain bug fix patches rather than a
>> mechanical revert of b6669305d35a.
> 
> Yeah, I don't think we need to rush fixing this with a revert.

Sorry I wasn't clear: I would prefer to apply a bug fix over
sending a revert commit, and I do not have enough information
yet to make that choice. Waiting a bit is OK with me.


> Also because
> just removing the filemap_flush() from nfsd_file_put() would keep other
> benefits of that commit while fixing the regression AFAIU. But I think
> making flushing less aggressive is desirable because as I wrote in my other
> reply, currently it is preventing pagecache from accumulating enough dirty
> data for a good IO pattern.

I might even go so far as to say that a small write workload
isn't especially good for solid state storage either.

I know Trond is trying to address NFS re-export performance, but
there appear to be some palpable effects outside of that narrow
use case that need to be considered. Thus a server-side fix,
rather than a fix in the NFS client used to do the re-export,
seems appropriate to consider.


--
Chuck Lever




* Re: Performance regression with random IO pattern from the client
  2022-03-30 17:56       ` Chuck Lever III
@ 2022-03-30 22:02         ` Trond Myklebust
  2022-03-31 13:09           ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2022-03-30 22:02 UTC (permalink / raw)
  To: jack, chuck.lever; +Cc: linux-nfs, nfbrown, bfields

On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 30, 2022, at 12:19 PM, Jan Kara <jack@suse.cz> wrote:
> > 
> > On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
> > > > On Mar 30, 2022, at 11:03 AM, Trond Myklebust
> > > > <trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> > > > > Hello,
> > > > > 
> > > > > during our performance testing we have noticed that commit
> > > > > b6669305d35a
> > > > > ("nfsd: Reduce the number of calls to nfsd_file_gc()") has
> > > > > introduced
> > > > > a
> > > > > performance regression when a client does random buffered
> > > > > writes. The
> > > > > workload on the NFS client is fio running 4 processes doing
> > > > > random
> > > > > buffered writes to 4
> > > > > different files and the files are large enough to hit dirty
> > > > > limits
> > > > > and
> > > > > force writeback from the client. In particular the invocation
> > > > > is
> > > > > like:
> > > > > 
> > > > > fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1
> > > > > --
> > > > > invalidate=1 --group_reporting=1 --runtime=300 --
> > > > > fallocate=posix --
> > > > > ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
> > > > > rw=randwrite --size=4000m --numjobs=4 --bs=4k --
> > > > > filename_format=FioWorkloads.\$jobnum --end_fsync=1
> > > > > 
> > > > > The reason why commit b6669305d35a regresses performance is
> > > > > the
> > > > > filemap_flush() call it adds into nfsd_file_put(). Before
> > > > > this commit
> > > > > writeback on the server happened from nfsd_commit() code
> > > > > resulting in
> > > > > rather long semisequential streams of 4k writes. After commit
> > > > > b6669305d35a
> > > > > all the writeback happens from filemap_flush() calls
> > > > > resulting in
> > > > > much
> > > > > longer average seek distance (IO from different files is more
> > > > > interleaved)
> > > > > and about 16-20% regression in the achieved writeback
> > > > > throughput when
> > > > > the
> > > > > backing store is rotational storage.
> > > > > 
> > > > > I think the filemap_flush() from nfsd_file_put() is indeed
> > > > > rather
> > > > > aggressive and I think we'd be better off to just leave
> > > > > writeback to
> > > > > either
> > > > > nfsd_commit() or standard dirty page cleaning happening on
> > > > > the
> > > > > system. I
> > > > > assume the rationale for the filemap_flush() call was to make
> > > > > it more
> > > > > likely the file can be evicted during the garbage collection
> > > > > run? Was
> > > > > there
> > > > > any particular problem leading to addition of this call or
> > > > > was it
> > > > > just "it
> > > > > seemed like a good idea" thing?
> > > > > 
> > > > > Thanks in advance for ideas.
> > > > > 
> > > > >                                                              
> > > > >   Honza
> > > > 
> > > > It was mainly introduced to reduce the amount of work that
> > > > nfsd_file_free() needs to do. In particular when re-exporting
> > > > NFS, the
> > > > call to filp_close() can be expensive because it synchronously
> > > > flushes
> > > > out dirty pages. That again means that some of the calls to
> > > > nfsd_file_dispose_list() can end up being very expensive
> > > > (particularly
> > > > the ones run by the garbage collector itself).
> > > 
> > > The "no regressions" rule suggests that some kind of action needs
> > > to be taken. I don't have a sense of whether Jan's workload or
> > > NFS
> > > re-export is the more common use case, however.
> > > 
> > > I can see that filp_close() on a file on an NFS mount could be
> > > costly if that file has dirty pages, due to the NFS client's
> > > "flush on close" semantic.
> > > 
> > > Trond, are there alternatives to flushing in the nfsd_file_put()
> > > path? I'm willing to entertain bug fix patches rather than a
> > > mechanical revert of b6669305d35a.
> > 
> > Yeah, I don't think we need to rush fixing this with a revert.
> 
> Sorry I wasn't clear: I would prefer to apply a bug fix over
> sending a revert commit, and I do not have enough information
> yet to make that choice. Waiting a bit is OK with me.
> 
> 
> > Also because
> > just removing the filemap_flush() from nfsd_file_put() would keep
> > other
> > benefits of that commit while fixing the regression AFAIU. But I
> > think
> > making flushing less aggressive is desirable because as I wrote in
> > my other
> > reply, currently it is preventing pagecache from accumulating
> > enough dirty
> > data for a good IO pattern.
> 
> I might even go so far as to say that a small write workload
> isn't especially good for solid state storage either.
> 
> I know Trond is trying to address NFS re-export performance, but
> there appear to be some palpable effects outside of that narrow
> use case that need to be considered. Thus a server-side fix,
> rather than a fix in the NFS client used to do the re-export,
> seems appropriate to consider.

It turns out it is not just the NFS client that is the problem. Rather,
we need in general to be able to detect flush errors and either report
them directly (through commit) or change the boot verifier to force
clients to resend the unstable writes.

Hence, I think we're looking at something like this:

8<--------------------------------------------------------------------
From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust@hammerspace.com>
Date: Wed, 30 Mar 2022 16:55:38 -0400
Subject: [PATCH] nfsd: Fix a write performance regression

The call to filemap_flush() in nfsd_file_put() is there to ensure that
we clear out any writes belonging to a NFSv3 client relatively quickly
and avoid situations where the file can't be evicted by the garbage
collector. It also ensures that we detect write errors quickly.

The problem is this causes a regression in performance for some
workloads.

So try to improve matters by deferring writeback until we're ready to
close the file, and need to detect errors so that we can force the
client to resend.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfsd/filecache.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 8bc807c5fea4..9578a6317709 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
 	return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
 }
 
+static void
+nfsd_file_flush(struct nfsd_file *nf)
+{
+	if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
+		nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
+}
+
 static void
 nfsd_file_do_unhash(struct nfsd_file *nf)
 {
@@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
 		return;
 	}
 
-	filemap_flush(nf->nf_file->f_mapping);
 	is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
-	nfsd_file_put_noref(nf);
-	if (is_hashed)
+	if (!is_hashed) {
+		nfsd_file_flush(nf);
+		nfsd_file_put_noref(nf);
+	} else {
+		nfsd_file_put_noref(nf);
 		nfsd_file_schedule_laundrette();
+	}
 	if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
 		nfsd_file_gc();
 }
@@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
 	while(!list_empty(dispose)) {
 		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
 		list_del(&nf->nf_lru);
+		nfsd_file_flush(nf);
 		nfsd_file_put_noref(nf);
 	}
 }
@@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
 	while(!list_empty(dispose)) {
 		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
 		list_del(&nf->nf_lru);
+		nfsd_file_flush(nf);
 		if (!refcount_dec_and_test(&nf->nf_ref))
 			continue;
 		if (nfsd_file_free(nf))
-- 
2.35.1



-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Performance regression with random IO pattern from the client
  2022-03-30 10:34 Performance regression with random IO pattern from the client Jan Kara
  2022-03-30 15:03 ` Trond Myklebust
@ 2022-03-31  8:43 ` Thorsten Leemhuis
  1 sibling, 0 replies; 12+ messages in thread
From: Thorsten Leemhuis @ 2022-03-31  8:43 UTC (permalink / raw)
  To: Jan Kara, Trond Myklebust
  Cc: J. Bruce Fields, Chuck Lever, linux-nfs, nfbrown, regressions

[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text you find below is compiled from
a few template paragraphs you might have encountered already from
similar mails.]

Hi, this is your Linux kernel regression tracker.  CCing the regression
mailing list, as it should be in the loop for all regressions, as
explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

On 30.03.22 12:34, Jan Kara wrote:
> Hello,
> 
> during our performance testing we have noticed that commit b6669305d35a
> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has introduced a
> performance regression when a client does random buffered writes. The
> workload on the NFS client is fio running 4 processes doing random buffered writes to 4
> different files and the files are large enough to hit dirty limits and
> force writeback from the client. In particular the invocation is like:
> 
> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 --invalidate=1 --group_reporting=1 --runtime=300 --fallocate=posix --ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --rw=randwrite --size=4000m --numjobs=4 --bs=4k --filename_format=FioWorkloads.\$jobnum --end_fsync=1
> 
> The reason why commit b6669305d35a regresses performance is the
> filemap_flush() call it adds into nfsd_file_put(). Before this commit
> writeback on the server happened from nfsd_commit() code resulting in
> rather long semisequential streams of 4k writes. After commit b6669305d35a
> all the writeback happens from filemap_flush() calls resulting in much
> longer average seek distance (IO from different files is more interleaved)
> and about 16-20% regression in the achieved writeback throughput when the
> backing store is rotational storage.
> 
> I think the filemap_flush() from nfsd_file_put() is indeed rather
> aggressive and I think we'd be better off to just leave writeback to either
> nfsd_commit() or standard dirty page cleaning happening on the system. I
> assume the rationale for the filemap_flush() call was to make it more
> likely the file can be evicted during the garbage collection run? Was there
> any particular problem leading to addition of this call or was it just "it
> seemed like a good idea" thing?
> 
> Thanks in advance for ideas.
> 

To be sure the issue below doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced b6669305d35a
#regzbot title nfs: Performance regression with random IO pattern from
the client
#regzbot backburner: introduced more than two years ago, first report,
hence developers want to give it some thought
#regzbot ignore-activity

If it turns out this isn't a regression, feel free to remove it from the
tracking by sending a reply to this thread containing a paragraph like
"#regzbot invalid: reason why this is invalid" (without the quotes).

Reminder for developers: when fixing the issue, please add 'Link:'
tags pointing to the report (the mail quoted above) using
lore.kernel.org/r/, as explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'. Regzbot needs them to
automatically connect reports with fixes, but they are useful in
general, too.

I'm sending this to everyone that got the initial report, to make
everyone aware of the tracking. I also hope that messages like this
motivate people to get at least the regression mailing list, and ideally
even regzbot, directly involved when dealing with regressions, as
messages like this wouldn't be needed then. And don't worry: if I need
to send other mails regarding this regression that are only relevant for
regzbot, I'll send them to the regressions list only (with a tag in the
subject so people can filter them away). With a bit of luck, no such
messages will be needed anyway.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

-- 
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface, the
getting start guide, and the references documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression, it's in your interest to
CC the regression list and tell regzbot about the issue, as that ensures
the regression makes it onto the radar of the Linux kernel's regression
tracker and your report won't fall through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include 'Link:' tag in the patch descriptions pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.

* Re: Performance regression with random IO pattern from the client
  2022-03-30 22:02         ` Trond Myklebust
@ 2022-03-31 13:09           ` Jan Kara
  2022-03-31 14:20             ` Chuck Lever III
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2022-03-31 13:09 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: jack, chuck.lever, linux-nfs, nfbrown, bfields

On Wed 30-03-22 22:02:39, Trond Myklebust wrote:
> On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
> > 
> > 
> > > On Mar 30, 2022, at 12:19 PM, Jan Kara <jack@suse.cz> wrote:
> > > 
> > > On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
> > > > > On Mar 30, 2022, at 11:03 AM, Trond Myklebust
> > > > > <trondmy@hammerspace.com> wrote:
> > > > > 
> > > > > On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > during our performance testing we have noticed that commit
> > > > > > b6669305d35a
> > > > > > ("nfsd: Reduce the number of calls to nfsd_file_gc()") has
> > > > > > introduced
> > > > > > a
> > > > > > performance regression when a client does random buffered
> > > > > > writes. The
> > > > > > workload on the NFS client is fio running 4 processes doing
> > > > > > random
> > > > > > buffered writes to 4
> > > > > > different files and the files are large enough to hit dirty
> > > > > > limits
> > > > > > and
> > > > > > force writeback from the client. In particular the invocation
> > > > > > is
> > > > > > like:
> > > > > > 
> > > > > > fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1
> > > > > > --
> > > > > > invalidate=1 --group_reporting=1 --runtime=300 --
> > > > > > fallocate=posix --
> > > > > > ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
> > > > > > rw=randwrite --size=4000m --numjobs=4 --bs=4k --
> > > > > > filename_format=FioWorkloads.\$jobnum --end_fsync=1
> > > > > > 
> > > > > > The reason why commit b6669305d35a regresses performance is
> > > > > > the
> > > > > > filemap_flush() call it adds into nfsd_file_put(). Before
> > > > > > this commit
> > > > > > writeback on the server happened from nfsd_commit() code
> > > > > > resulting in
> > > > > > rather long semisequential streams of 4k writes. After commit
> > > > > > b6669305d35a
> > > > > > all the writeback happens from filemap_flush() calls
> > > > > > resulting in
> > > > > > much
> > > > > > longer average seek distance (IO from different files is more
> > > > > > interleaved)
> > > > > > and about 16-20% regression in the achieved writeback
> > > > > > throughput when
> > > > > > the
> > > > > > backing store is rotational storage.
> > > > > > 
> > > > > > I think the filemap_flush() from nfsd_file_put() is indeed
> > > > > > rather
> > > > > > aggressive and I think we'd be better off to just leave
> > > > > > writeback to
> > > > > > either
> > > > > > nfsd_commit() or standard dirty page cleaning happening on
> > > > > > the
> > > > > > system. I
> > > > > > assume the rationale for the filemap_flush() call was to make
> > > > > > it more
> > > > > > likely the file can be evicted during the garbage collection
> > > > > > run? Was
> > > > > > there
> > > > > > any particular problem leading to addition of this call or
> > > > > > was it
> > > > > > just "it
> > > > > > seemed like a good idea" thing?
> > > > > > 
> > > > > > Thanks in advance for ideas.
> > > > > > 
> > > > > >                                                              
> > > > > >   Honza
> > > > > 
> > > > > It was mainly introduced to reduce the amount of work that
> > > > > nfsd_file_free() needs to do. In particular when re-exporting
> > > > > NFS, the
> > > > > call to filp_close() can be expensive because it synchronously
> > > > > flushes
> > > > > out dirty pages. That again means that some of the calls to
> > > > > nfsd_file_dispose_list() can end up being very expensive
> > > > > (particularly
> > > > > the ones run by the garbage collector itself).
> > > > 
> > > > The "no regressions" rule suggests that some kind of action needs
> > > > to be taken. I don't have a sense of whether Jan's workload or
> > > > NFS
> > > > re-export is the more common use case, however.
> > > > 
> > > > I can see that filp_close() on a file on an NFS mount could be
> > > > costly if that file has dirty pages, due to the NFS client's
> > > > "flush on close" semantic.
> > > > 
> > > > Trond, are there alternatives to flushing in the nfsd_file_put()
> > > > path? I'm willing to entertain bug fix patches rather than a
> > > > mechanical revert of b6669305d35a.
> > > 
> > > Yeah, I don't think we need to rush fixing this with a revert.
> > 
> > Sorry I wasn't clear: I would prefer to apply a bug fix over
> > sending a revert commit, and I do not have enough information
> > yet to make that choice. Waiting a bit is OK with me.
> > 
> > 
> > > Also because
> > > just removing the filemap_flush() from nfsd_file_put() would keep
> > > other
> > > benefits of that commit while fixing the regression AFAIU. But I
> > > think
> > > making flushing less aggressive is desirable because as I wrote in
> > > my other
> > > reply, currently it is preventing pagecache from accumulating
> > > enough dirty
> > > data for a good IO pattern.
> > 
> > I might even go so far as to say that a small write workload
> > isn't especially good for solid state storage either.
> > 
> > I know Trond is trying to address NFS re-export performance, but
> > there appear to be some palpable effects outside of that narrow
> > use case that need to be considered. Thus a server-side fix,
> > rather than a fix in the NFS client used to do the re-export,
> > seems appropriate to consider.
> 
> Turns out it is not just the NFS client that is the problem. It is
> rather that we need in general to be able to detect flush errors and
> either report them directly (through commit) or we need to change the
> boot verifier to force clients to resend the unstable writes.
> 
> Hence, I think we're looking at something like this:

Thanks for the fix! I've run the patch through some testing and your fix
indeed restores the good IO pattern and brings performance back to its
original level. So feel free to add:

Tested-by: Jan Kara <jack@suse.cz>

								Honza

> 
> 8<--------------------------------------------------------------------
> From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> Date: Wed, 30 Mar 2022 16:55:38 -0400
> Subject: [PATCH] nfsd: Fix a write performance regression
> 
> The call to filemap_flush() in nfsd_file_put() is there to ensure that
> we clear out any writes belonging to a NFSv3 client relatively quickly
> and avoid situations where the file can't be evicted by the garbage
> collector. It also ensures that we detect write errors quickly.
> 
> The problem is this causes a regression in performance for some
> workloads.
> 
> So try to improve matters by deferring writeback until we're ready to
> close the file, and need to detect errors so that we can force the
> client to resend.
> 
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  fs/nfsd/filecache.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 8bc807c5fea4..9578a6317709 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
>  	return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
>  }
>  
> +static void
> +nfsd_file_flush(struct nfsd_file *nf)
> +{
> +	if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
> +		nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> +}
> +
>  static void
>  nfsd_file_do_unhash(struct nfsd_file *nf)
>  {
> @@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
>  		return;
>  	}
>  
> -	filemap_flush(nf->nf_file->f_mapping);
>  	is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
> -	nfsd_file_put_noref(nf);
> -	if (is_hashed)
> +	if (!is_hashed) {
> +		nfsd_file_flush(nf);
> +		nfsd_file_put_noref(nf);
> +	} else {
> +		nfsd_file_put_noref(nf);
>  		nfsd_file_schedule_laundrette();
> +	}
>  	if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
>  		nfsd_file_gc();
>  }
> @@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
>  	while(!list_empty(dispose)) {
>  		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>  		list_del(&nf->nf_lru);
> +		nfsd_file_flush(nf);
>  		nfsd_file_put_noref(nf);
>  	}
>  }
> @@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
>  	while(!list_empty(dispose)) {
>  		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>  		list_del(&nf->nf_lru);
> +		nfsd_file_flush(nf);
>  		if (!refcount_dec_and_test(&nf->nf_ref))
>  			continue;
>  		if (nfsd_file_free(nf))
> -- 
> 2.35.1
> 
> 
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: Performance regression with random IO pattern from the client
  2022-03-31 13:09           ` Jan Kara
@ 2022-03-31 14:20             ` Chuck Lever III
  2022-03-31 14:22               ` Chuck Lever III
  0 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever III @ 2022-03-31 14:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: Trond Myklebust, Linux NFS Mailing List, nfbrown, Bruce Fields



> On Mar 31, 2022, at 9:09 AM, Jan Kara <jack@suse.cz> wrote:
> 
> On Wed 30-03-22 22:02:39, Trond Myklebust wrote:
>> On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
>>> 
>>> 
>>>> On Mar 30, 2022, at 12:19 PM, Jan Kara <jack@suse.cz> wrote:
>>>> 
>>>> On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
>>>>>> On Mar 30, 2022, at 11:03 AM, Trond Myklebust
>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>> 
>>>>>> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> during our performance testing we have noticed that commit
>>>>>>> b6669305d35a
>>>>>>> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has
>>>>>>> introduced
>>>>>>> a
>>>>>>> performance regression when a client does random buffered
>>>>>>> writes. The
>>>>>>> workload on the NFS client is fio running 4 processes doing
>>>>>>> random
>>>>>>> buffered writes to 4
>>>>>>> different files and the files are large enough to hit dirty
>>>>>>> limits
>>>>>>> and
>>>>>>> force writeback from the client. In particular the invocation
>>>>>>> is
>>>>>>> like:
>>>>>>> 
>>>>>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1
>>>>>>> --
>>>>>>> invalidate=1 --group_reporting=1 --runtime=300 --
>>>>>>> fallocate=posix --
>>>>>>> ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
>>>>>>> rw=randwrite --size=4000m --numjobs=4 --bs=4k --
>>>>>>> filename_format=FioWorkloads.\$jobnum --end_fsync=1
>>>>>>> 
>>>>>>> The reason why commit b6669305d35a regresses performance is
>>>>>>> the
>>>>>>> filemap_flush() call it adds into nfsd_file_put(). Before
>>>>>>> this commit
>>>>>>> writeback on the server happened from nfsd_commit() code
>>>>>>> resulting in
>>>>>>> rather long semisequential streams of 4k writes. After commit
>>>>>>> b6669305d35a
>>>>>>> all the writeback happens from filemap_flush() calls
>>>>>>> resulting in
>>>>>>> much
>>>>>>> longer average seek distance (IO from different files is more
>>>>>>> interleaved)
>>>>>>> and about 16-20% regression in the achieved writeback
>>>>>>> throughput when
>>>>>>> the
>>>>>>> backing store is rotational storage.
>>>>>>> 
>>>>>>> I think the filemap_flush() from nfsd_file_put() is indeed
>>>>>>> rather
>>>>>>> aggressive and I think we'd be better off to just leave
>>>>>>> writeback to
>>>>>>> either
>>>>>>> nfsd_commit() or standard dirty page cleaning happening on
>>>>>>> the
>>>>>>> system. I
>>>>>>> assume the rationale for the filemap_flush() call was to make
>>>>>>> it more
>>>>>>> likely the file can be evicted during the garbage collection
>>>>>>> run? Was
>>>>>>> there
>>>>>>> any particular problem leading to addition of this call or
>>>>>>> was it
>>>>>>> just "it
>>>>>>> seemed like a good idea" thing?
>>>>>>> 
>>>>>>> Thanks in advance for ideas.
>>>>>>> 
>>>>>>>                                                              
>>>>>>>   Honza
>>>>>> 
>>>>>> It was mainly introduced to reduce the amount of work that
>>>>>> nfsd_file_free() needs to do. In particular when re-exporting
>>>>>> NFS, the
>>>>>> call to filp_close() can be expensive because it synchronously
>>>>>> flushes
>>>>>> out dirty pages. That again means that some of the calls to
>>>>>> nfsd_file_dispose_list() can end up being very expensive
>>>>>> (particularly
>>>>>> the ones run by the garbage collector itself).
>>>>> 
>>>>> The "no regressions" rule suggests that some kind of action needs
>>>>> to be taken. I don't have a sense of whether Jan's workload or
>>>>> NFS
>>>>> re-export is the more common use case, however.
>>>>> 
>>>>> I can see that filp_close() on a file on an NFS mount could be
>>>>> costly if that file has dirty pages, due to the NFS client's
>>>>> "flush on close" semantic.
>>>>> 
>>>>> Trond, are there alternatives to flushing in the nfsd_file_put()
>>>>> path? I'm willing to entertain bug fix patches rather than a
>>>>> mechanical revert of b6669305d35a.
>>>> 
>>>> Yeah, I don't think we need to rush fixing this with a revert.
>>> 
>>> Sorry I wasn't clear: I would prefer to apply a bug fix over
>>> sending a revert commit, and I do not have enough information
>>> yet to make that choice. Waiting a bit is OK with me.
>>> 
>>> 
>>>> Also because
>>>> just removing the filemap_flush() from nfsd_file_put() would keep
>>>> other
>>>> benefits of that commit while fixing the regression AFAIU. But I
>>>> think
>>>> making flushing less aggressive is desirable because as I wrote in
>>>> my other
>>>> reply, currently it is preventing pagecache from accumulating
>>>> enough dirty
>>>> data for a good IO pattern.
>>> 
>>> I might even go so far as to say that a small write workload
>>> isn't especially good for solid state storage either.
>>> 
>>> I know Trond is trying to address NFS re-export performance, but
>>> there appear to be some palpable effects outside of that narrow
>>> use case that need to be considered. Thus a server-side fix,
>>> rather than a fix in the NFS client used to do the re-export,
>>> seems appropriate to consider.
>> 
>> Turns out it is not just the NFS client that is the problem. It is
>> rather that we need in general to be able to detect flush errors and
>> either report them directly (through commit) or we need to change the
>> boot verifier to force clients to resend the unstable writes.
>> 
>> Hence, I think we're looking at something like this:
> 
> Thanks for the fix! I've run the patch through some testing and your fix
> indeed restores the good IO pattern and returns the performance back to
> original levels. So feel free to add:
> 
> Tested-by: Jan Kara <jack@suse.cz>

Excellent. I'll queue this up in the NFSD tree for 5.18-rc.


> 
> 								Honza
> 
>> 
>> 8<--------------------------------------------------------------------
>> From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>> Date: Wed, 30 Mar 2022 16:55:38 -0400
>> Subject: [PATCH] nfsd: Fix a write performance regression
>> 
>> The call to filemap_flush() in nfsd_file_put() is there to ensure that
>> we clear out any writes belonging to a NFSv3 client relatively quickly
>> and avoid situations where the file can't be evicted by the garbage
>> collector. It also ensures that we detect write errors quickly.
>> 
>> The problem is this causes a regression in performance for some
>> workloads.
>> 
>> So try to improve matters by deferring writeback until we're ready to
>> close the file, and need to detect errors so that we can force the
>> client to resend.
>> 
>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>> ---
>> fs/nfsd/filecache.c | 18 +++++++++++++++---
>> 1 file changed, 15 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>> index 8bc807c5fea4..9578a6317709 100644
>> --- a/fs/nfsd/filecache.c
>> +++ b/fs/nfsd/filecache.c
>> @@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
>> 	return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
>> }
>> 
>> +static void
>> +nfsd_file_flush(struct nfsd_file *nf)
>> +{
>> +	if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
>> +		nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>> +}
>> +
>> static void
>> nfsd_file_do_unhash(struct nfsd_file *nf)
>> {
>> @@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
>> 		return;
>> 	}
>> 
>> -	filemap_flush(nf->nf_file->f_mapping);
>> 	is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
>> -	nfsd_file_put_noref(nf);
>> -	if (is_hashed)
>> +	if (!is_hashed) {
>> +		nfsd_file_flush(nf);
>> +		nfsd_file_put_noref(nf);
>> +	} else {
>> +		nfsd_file_put_noref(nf);
>> 		nfsd_file_schedule_laundrette();
>> +	}
>> 	if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
>> 		nfsd_file_gc();
>> }
>> @@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
>> 	while(!list_empty(dispose)) {
>> 		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>> 		list_del(&nf->nf_lru);
>> +		nfsd_file_flush(nf);
>> 		nfsd_file_put_noref(nf);
>> 	}
>> }
>> @@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
>> 	while(!list_empty(dispose)) {
>> 		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>> 		list_del(&nf->nf_lru);
>> +		nfsd_file_flush(nf);
>> 		if (!refcount_dec_and_test(&nf->nf_ref))
>> 			continue;
>> 		if (nfsd_file_free(nf))
>> -- 
>> 2.35.1
>> 
>> 
>> 
>> -- 
>> Trond Myklebust
>> Linux NFS client maintainer, Hammerspace
>> trond.myklebust@hammerspace.com
>> 
>> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

--
Chuck Lever




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Performance regression with random IO pattern from the client
  2022-03-31 14:20             ` Chuck Lever III
@ 2022-03-31 14:22               ` Chuck Lever III
  0 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever III @ 2022-03-31 14:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Trond Myklebust, Linux NFS Mailing List, nfbrown, Bruce Fields

 

> On Mar 31, 2022, at 10:20 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
> [...]
> 
> Excellent. I'll queue this up in the NFSD tree for 5.17-rc.

Belay that. 5.18-rc.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-03-31 14:22 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-30 10:34 Performance regression with random IO pattern from the client Jan Kara
2022-03-30 15:03 ` Trond Myklebust
2022-03-30 15:38   ` Chuck Lever III
2022-03-30 16:19     ` Trond Myklebust
2022-03-30 16:19     ` Jan Kara
2022-03-30 17:56       ` Chuck Lever III
2022-03-30 22:02         ` Trond Myklebust
2022-03-31 13:09           ` Jan Kara
2022-03-31 14:20             ` Chuck Lever III
2022-03-31 14:22               ` Chuck Lever III
2022-03-30 16:14   ` Jan Kara
2022-03-31  8:43 ` Thorsten Leemhuis

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).