Re: 5.4.188 and later: massive performance regression with nfsd

From: Chuck Lever III <chuck.lever@oracle.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	"linux@stwm.de" <linux@stwm.de>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"jlayton@kernel.org" <jlayton@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: 5.4.188 and later: massive performance regression with nfsd
Date: Sat, 21 May 2022 20:41:08 +0000	[thread overview]
Message-ID: <31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com> (raw)
In-Reply-To: <2117abef403c8b6e5e1672031bf8f36c8eafedc9.camel@hammerspace.com>

> On May 21, 2022, at 3:49 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Sat, 2022-05-21 at 19:11 +0000, Chuck Lever III wrote:
>> 
>> 
>>> On May 21, 2022, at 2:10 PM, Trond Myklebust
>>> <trondmy@hammerspace.com> wrote:
>>> 
>>> On Sat, 2022-05-21 at 17:22 +0000, Chuck Lever III wrote:
>>>> 
>>>> 
>>>>> On May 20, 2022, at 7:43 PM, Chuck Lever III
>>>>> <chuck.lever@oracle.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> On May 20, 2022, at 6:24 PM, Trond Myklebust
>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>> 
>>>>>> On Fri, 2022-05-20 at 21:52 +0000, Chuck Lever III wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On May 20, 2022, at 12:40 PM, Trond Myklebust
>>>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>>>> 
>>>>>>>> On Fri, 2022-05-20 at 15:36 +0000, Chuck Lever III wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On May 11, 2022, at 10:36 AM, Chuck Lever III
>>>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On May 11, 2022, at 10:23 AM, Greg KH
>>>>>>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck
>>>>>>>>>>> Lever
>>>>>>>>>>> III
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On May 11, 2022, at 8:38 AM, Greg KH
>>>>>>>>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, May 11, 2022 at 12:03:13PM +0200,
>>>>>>>>>>>>> Wolfgang
>>>>>>>>>>>>> Walter
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> starting with 5.4.188 wie see a massive
>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>> regression on our
>>>>>>>>>>>>>> nfs-server. It basically is serving requests
>>>>>>>>>>>>>> very
>>>>>>>>>>>>>> very
>>>>>>>>>>>>>> slowly with cpu
>>>>>>>>>>>>>> utilization of 100% (with 5.4.187 and earlier
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>> 10%) so
>>>>>>>>>>>>>> that it is
>>>>>>>>>>>>>> unusable as a fileserver.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The culprit are commits (or one of it):
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> c32f1041382a88b17da5736886da4a492353a1bb
>>>>>>>>>>>>>> "nfsd:
>>>>>>>>>>>>>> cleanup
>>>>>>>>>>>>>> nfsd_file_lru_dispose()"
>>>>>>>>>>>>>> 628adfa21815f74c04724abc85847f24b5dd1645
>>>>>>>>>>>>>> "nfsd:
>>>>>>>>>>>>>> Containerise filecache
>>>>>>>>>>>>>> laundrette"
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> (upstream
>>>>>>>>>>>>>> 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and
>>>>>>>>>>>>>> 9542e6a643fc69d528dfb3303f145719c61d3050)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If I revert them in v5.4.192 the kernel works
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>> before
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> performance is
>>>>>>>>>>>>>> ok again.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I did not try to revert them one by one as
>>>>>>>>>>>>>> any
>>>>>>>>>>>>>> disruption
>>>>>>>>>>>>>> of our nfs-server
>>>>>>>>>>>>>> is a severe problem for us and I'm not sure
>>>>>>>>>>>>>> if
>>>>>>>>>>>>>> they are
>>>>>>>>>>>>>> related.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 5.10 and 5.15 both always performed very
>>>>>>>>>>>>>> badly on
>>>>>>>>>>>>>> our
>>>>>>>>>>>>>> nfs-
>>>>>>>>>>>>>> server in a
>>>>>>>>>>>>>> similar way so we were stuck with 5.4.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I now think this is because of
>>>>>>>>>>>>>> 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63
>>>>>>>>>>>>>> and/or
>>>>>>>>>>>>>> 9542e6a643fc69d528dfb3303f145719c61d3050
>>>>>>>>>>>>>> though
>>>>>>>>>>>>>> I
>>>>>>>>>>>>>> didn't tried to
>>>>>>>>>>>>>> revert them in 5.15 yet.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Odds are 5.18-rc6 is also a problem?
>>>>>>>>>>>> 
>>>>>>>>>>>> We believe that
>>>>>>>>>>>> 
>>>>>>>>>>>> 6b8a94332ee4 ("nfsd: Fix a write performance
>>>>>>>>>>>> regression")
>>>>>>>>>>>> 
>>>>>>>>>>>> addresses the performance regression. It was
>>>>>>>>>>>> merged
>>>>>>>>>>>> into
>>>>>>>>>>>> 5.18-
>>>>>>>>>>>> rc.
>>>>>>>>>>> 
>>>>>>>>>>> And into 5.17.4 if someone wants to try that
>>>>>>>>>>> release.
>>>>>>>>>> 
>>>>>>>>>> I don't have a lot of time to backport this one
>>>>>>>>>> myself,
>>>>>>>>>> so
>>>>>>>>>> I welcome anyone who wants to apply that commit to
>>>>>>>>>> their
>>>>>>>>>> favorite LTS kernel and test it for us.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>>> If so, I'll just wait for the fix to get into
>>>>>>>>>>>>> Linus's
>>>>>>>>>>>>> tree as
>>>>>>>>>>>>> this does
>>>>>>>>>>>>> not seem to be a stable-tree-only issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> Unfortunately I've received a recent report that
>>>>>>>>>>>> the
>>>>>>>>>>>> fix
>>>>>>>>>>>> introduces
>>>>>>>>>>>> a "sleep while spinlock is held" for NFSv4.0 in
>>>>>>>>>>>> rare
>>>>>>>>>>>> cases.
>>>>>>>>>>> 
>>>>>>>>>>> Ick, not good, any potential fixes for that?
>>>>>>>>>> 
>>>>>>>>>> Not yet. I was at LSF last week, so I've just started
>>>>>>>>>> digging
>>>>>>>>>> into this one. I've confirmed that the report is a
>>>>>>>>>> real
>>>>>>>>>> bug,
>>>>>>>>>> but we still don't know how hard it is to hit it with
>>>>>>>>>> real
>>>>>>>>>> workloads.
>>>>>>>>> 
>>>>>>>>> We believe the following, which should be part of the
>>>>>>>>> first
>>>>>>>>> NFSD pull request for 5.19, will properly address the
>>>>>>>>> splat.
>>>>>>>>> 
>>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=for-next&id=556082f5e5d7ecfd0ee45c3641e2b364bff9ee44
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> Uh... What happens if you have 2 simultaneous calls to
>>>>>>>> nfsd4_release_lockowner() for the same file? i.e. 2
>>>>>>>> separate
>>>>>>>> processes
>>>>>>>> owned by the same user, both locking the same file.
>>>>>>>> 
>>>>>>>> Can't that cause the 'putlist' to get corrupted when both
>>>>>>>> callers
>>>>>>>> add
>>>>>>>> the same nf->nf_putfile to two separate lists?
>>>>>>> 
>>>>>>> IIUC, cl_lock serializes the two RELEASE_LOCKOWNER calls.
>>>>>>> 
>>>>>>> The first call finds the lockowner in cl_ownerstr_hashtbl
>>>>>>> and
>>>>>>> unhashes it before releasing cl_lock.
>>>>>>> 
>>>>>>> Then the second cannot find that lockowner, thus it can't
>>>>>>> requeue it for bulk_put.
>>>>>>> 
>>>>>>> Am I missing something?
>>>>>> 
>>>>>> In the example I quoted, there are 2 separate processes
>>>>>> running
>>>>>> on the
>>>>>> client. Those processes could share the same open owner +
>>>>>> open
>>>>>> stateid,
>>>>>> and hence the same struct nfs4_file, since that depends only
>>>>>> on
>>>>>> the
>>>>>> process credentials matching. However they will not normally
>>>>>> share a
>>>>>> lock owner, since POSIX does not expect different processes
>>>>>> to
>>>>>> share
>>>>>> locks.
>>>>>> 
>>>>>> IOW: The point is that one can relatively easily create 2
>>>>>> different
>>>>>> lock owners with different lock stateids that share the same
>>>>>> underlying
>>>>>> struct nfs4_file.
>>>>> 
>>>>> Is there a similar exposure if two different clients are
>>>>> locking
>>>>> the same file? If so, then we can't use a per-nfs4_client
>>>>> semaphore
>>>>> to serialize access to the nf_putfile field.
>>>> 
>>>> I had a thought about an alternate approach.
>>>> 
>>>> Create a second nfsd_file_put API that is not allowed to sleep.
>>>> Let's call it "nfsd_file_put_async()". Teach check_for_locked()
>>>> to use that instead of nfsd_file_put().
>>>> 
>>>> Here's where I'm a little fuzzy: nfsd_file_put_async() could do
>>>> something like:
>>>> 
>>>> void nfsd_file_put_async(struct nfsd_file *nf)
>>>> {
>>>>         if (refcount_dec_and_test(&nf->nf_ref))
>>>>                 nfsd_file_close_inode(nf->nf_inode);
>>>> }
>>>> 
>>>> 
>>> 
>>> That approach moves the sync to the garbage collector, which was
>>> exactly what we're trying to avoid in the first place.
>> 
>> Totally understood.
>> 
>> My thought was that "put" for RELEASE_LOCKOWNER/FREE_STATEID
>> would be unlikely to have any data to sync -- callers that
>> actually have data to flush are elsewhere, and those would
>> continue to use the synchronous nfsd_file_put() API.
>> 
>> Do you have a workload where we can test this assumption?
>> 
>> 
>>> Why not just do this "check_for_locks()" thing differently?
>>> 
>>> It really shouldn't be too hard to add something to
>>> nfsd4_lm_get_owner()/nfsd4_lm_put_owner() that bumps a counter in
>>> the
>>> lockowner in order to tell you whether or not locks are still held
>>> instead of doing this bone headed walk through the list of locks.
>> 
>> I thought of that a couple weeks ago. That doesn't work
>> because you can lock or unlock by range. That means the
>> symmetry of LOCK and LOCKU is not guaranteed, and I don't
>> believe these calls are used that way anyway. So I
>> abandoned the idea of using get_owner / put_owner.
>> 
> 
> Then you're misunderstanding how it works. lm_get_owner() is called
> when a lock is initialised from another one. The whole point is to
> ensure that each and every object representing a range lock on the
> inode's list maintains its own private reference to the knfsd lockowner
> (i.e. the fl->fl_owner).
> 
> For instance when a LOCK call calls posix_lock_inode(), then that
> function uses locks_copy_conflock() (which calls lm_get_owner) to
> initialise the range lock object that is being put on the inode list.
> If the new lock causes multiple existing locks to be replaced, they all
> call lm_put_owner to release their references to fl->fl_owner as part
> of the process of being freed.
> 
> Conversely, when LOCKU causes a range to get split, the two locks that
> replace the old one are both initialised using locks_copy_conflock(),
> so they both call lm_get_owner. The lock that represents the range
> being replaced is then made to call lm_put_owner() when it is freed.
> 
> etc, etc...

That is definitely not what it looked like when I traced it. The
reference count managed by get_owner / put_owner did not seem
to be usable for tracking whether a lockowner had locks or not.
The reference count was put pretty quickly after the lm_get_owner
call.

But I'm not interested in an argument. I'll go back and look at
get_owner / put_owner again, because I agree that not having to
traverse the inode's flc_posix list during check_for_locks() would
be awesome sauce.

--
Chuck Lever