From: Chuck Lever <chuck.lever@oracle.com>
To: Jeff Layton <jlayton@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>, NeilBrown <neilb@suse.de>,
	Al Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nfs@vger.kernel.org
Subject: Re: [PATCH/RFC] core/nfsd: allow kernel threads to use task_work.
Date: Tue, 28 Nov 2023 10:22:33 -0500
Message-ID: <ZWYFuWqCmX87C0ve@tissot.1015granger.net>
In-Reply-To: <518775f9f9bd3ad1afec0bde4d0a6bee3370bdd4.camel@kernel.org>

On Tue, Nov 28, 2023 at 09:15:39AM -0500, Jeff Layton wrote:
> On Tue, 2023-11-28 at 14:51 +0100, Christian Brauner wrote:
> > [Reusing the trimmed Cc]
> > 
> > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > > > 
> > > > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > > > delayed_fput_lists nearly twice as fast as they are retired by a single
> > > > > work-queue thread running delayed_fput().  As you might imagine this
> > > > > does not end well (20 million files in the queue at the time a snapshot
> > > > > was taken for analysis).
> > > > > 
> > > > > While this might point to a problem with the filesystem not handling the
> > > > > final close efficiently, such problems should only hurt throughput, not
> > > > > lead to memory exhaustion.
> > > > 
> > > > I have this patch queued for v6.8:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> > > > 
> > > 
> > > Thanks....
> > > I think that change is good, but I don't think it addresses the problem
> > > mentioned in the description, and it is not directly relevant to the
> > > problem I saw ... though it is complicated.
> > > 
> > > The problem "workqueue ...  hogged cpu..." probably means that
> > > nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> > > That will stop it from hogging the CPU whether it is tied to one CPU or
> > > free to roam.
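
(For illustration, the suggested change would amount to something like the
following; this is a hypothetical sketch, not necessarily the exact
nfsd_file_dispose_list() body in fs/nfsd/filecache.c:

    static void nfsd_file_dispose_list(struct list_head *dispose)
    {
            struct nfsd_file *nf;

            while (!list_empty(dispose)) {
                    nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
                    list_del_init(&nf->nf_lru);
                    nfsd_file_free(nf);     /* filp_close() and friends */
                    cond_resched();         /* yield between closes rather than hogging the CPU */
            }
    }
)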
> > > 
> > > Also that work is calling filp_close() which primarily calls
> > > filp_flush().
> > > It also calls fput() but that does minimal work.  If there is much work
> > > to do then that is offloaded to another work-item.  *That* is the
> > > workitem that I had problems with.
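
(For reference, filp_close() in recent mainline kernels is essentially a
flush followed by an fput; lightly paraphrased from fs/open.c, and the exact
helpers vary by kernel version:

    int filp_close(struct file *filp, fl_owner_t id)
    {
            int retval = filp_flush(filp, id);  /* ->flush(), dnotify, POSIX lock removal */

            fput(filp);     /* usually cheap; heavy teardown is deferred elsewhere */
            return retval;
    }
)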
> > > 
> > > The problem I saw was with an older kernel which didn't have the nfsd
> > > file cache and so was probably calling filp_close() more often.  So maybe
> > > my patch isn't so important now.  Particularly as nfsd now doesn't close
> > > most files in-task but instead offloads that to another task.  So the
> > > final fput will not be handled by the nfsd task either.
> > > 
> > > But I think there is room for improvement.  Gathering lots of files
> > > together into a list and closing them sequentially is not going to be as
> > > efficient as closing them in parallel.
> > > 
> > > > 
> > > > > For normal threads, the thread that closes the file also calls the
> > > > > final fput so there is natural rate limiting preventing excessive growth
> > > > > in the list of delayed fputs.  For kernel threads, and particularly for
> > > > > nfsd, delays in the final fput do not impose any throttling to prevent
> > > > > the thread from closing more files.
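
(The two paths being contrasted here are visible in fput() itself. Roughly,
paraphrased from fs/file_table.c; field names vary across kernel versions:

    void fput(struct file *file)
    {
            if (atomic_long_dec_and_test(&file->f_count)) {
                    struct task_struct *task = current;

                    if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
                            /* Normal task: the final __fput() runs as task_work,
                             * in this task's context before it returns to
                             * userspace, which provides natural rate limiting. */
                            init_task_work(&file->f_task_work, ____fput);
                            if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
                                    return;
                            /* task_work_add() can fail late in task exit;
                             * fall through to the delayed path. */
                    }
                    /* Kernel threads (e.g. nfsd): punt to the global
                     * delayed_fput workqueue, with no back-pressure on the
                     * thread that dropped the last reference. */
                    if (llist_add(&file->f_llist, &delayed_fput_list))
                            schedule_delayed_work(&delayed_fput_work, 1);
            }
    }
)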
> > > > 
> > > > I don't think we want to block nfsd threads waiting for files to
> > > > close. Won't that be a potential denial of service?
> > > 
> > > Not as much as the denial of service caused by memory exhaustion due to
> > > an indefinitely growing list of files waiting to be closed by a single
> > > workqueue thread.
> > 
> > It seems less likely that you run into memory exhaustion than a DoS
> > because nfsd() is busy closing fds, especially because you default to a
> > single nfsd thread afaict.

I would expect a DoS too: the system should start pushing out dirty
file data itself well before exhausting memory.


> The default is currently 8 threads (which is ridiculously low for most
> uses, but that's another discussion). That policy is usually set by
> userland nfs-utils though.

With only 8 threads, it might be /more/ difficult for clients to
generate enough workload to cause an overwhelming flood of closes.
As Neil said in the cover letter text, he observed this issue with
256 nfsd threads.


> This is another place where we might want to reserve a "rescuer" thread
> that avoids doing work that can end up blocked. Maybe we could switch
> back to queuing them to the list when we're below a certain threshold of
> available threads (1? 2? 4?).
> 
> > > I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> > > the nfsd thread should completely handle that request including all the
> > > flush and ->release etc.  If that causes any denial of service, then
> > > simply increase the number of nfsd threads.
> > 
> > But isn't that a significant behavioral change? So I would expect to
> > make this configurable via a module or Kconfig option?
> 
> I struggle to think about how we would document a new option like this. 

I think NFSv4 CLOSE can close files synchronously without an
observable behavior change. NFSv4 clients almost always COMMIT dirty
data before they CLOSE, so there should only rarely be a significant
flush component to an fput done here.

The problem is the garbage-collected (NFSv3) case, where the server
frequently closes files well before a client might have COMMITted
its dirty data.


> > > For NFSv3 it is more complex.  On the kernel where I saw a problem the
> > > filp_close happened after each READ or WRITE (though I think the customer
> > > was using NFSv4...).  With the file cache there is no thread that is
> > > obviously responsible for the close.
> > > To get the sort of throttling that I think is needed, we could possibly
> > > have each "nfsd_open" check if there are pending closes, and wait for
> > > some small amount of progress.
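
(A hypothetical sketch of that kind of throttle; every name below is invented
for illustration and none of it exists in nfsd today:

    /* Hypothetical: call before opening yet another file, so that nfsd
     * threads producing closes cannot outrun the close workqueue
     * indefinitely.  pending_closes, close_waitq and PENDING_CLOSE_LIMIT
     * are invented names, not existing nfsd symbols. */
    static void nfsd_throttle_opens(struct nfsd_net *nn)
    {
            if (atomic_read(&nn->pending_closes) < PENDING_CLOSE_LIMIT)
                    return;

            /* Wait until the close path has made a little progress. */
            wait_event(nn->close_waitq,
                       atomic_read(&nn->pending_closes) < PENDING_CLOSE_LIMIT);
    }
)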
> > > 
> > > But I don't think it is reasonable for the nfsd threads to take none of
> > > the burden of closing files as that can result in imbalance.
> > 
> > It feels like this really needs to be tested under a workload similar to
> > the one in question to see whether this is a viable solution.
> 
> Definitely. I'd also like to see how this behaves with NFS or Ceph
> reexport. Closing can be quite slow on those filesystems, so that might
> be a good place to try and break this.

-- 
Chuck Lever
