All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chuck Lever III <chuck.lever@oracle.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	"linux@stwm.de" <linux@stwm.de>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"jlayton@kernel.org" <jlayton@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: 5.4.188 and later: massive performance regression with nfsd
Date: Sat, 21 May 2022 20:41:08 +0000	[thread overview]
Message-ID: <31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com> (raw)
In-Reply-To: <2117abef403c8b6e5e1672031bf8f36c8eafedc9.camel@hammerspace.com>



> On May 21, 2022, at 3:49 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Sat, 2022-05-21 at 19:11 +0000, Chuck Lever III wrote:
>> 
>> 
>>> On May 21, 2022, at 2:10 PM, Trond Myklebust
>>> <trondmy@hammerspace.com> wrote:
>>> 
>>> On Sat, 2022-05-21 at 17:22 +0000, Chuck Lever III wrote:
>>>> 
>>>> 
>>>>> On May 20, 2022, at 7:43 PM, Chuck Lever III
>>>>> <chuck.lever@oracle.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> On May 20, 2022, at 6:24 PM, Trond Myklebust
>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>> 
>>>>>> On Fri, 2022-05-20 at 21:52 +0000, Chuck Lever III wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On May 20, 2022, at 12:40 PM, Trond Myklebust
>>>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>>>> 
>>>>>>>> On Fri, 2022-05-20 at 15:36 +0000, Chuck Lever III wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On May 11, 2022, at 10:36 AM, Chuck Lever III
>>>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On May 11, 2022, at 10:23 AM, Greg KH
>>>>>>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck
>>>>>>>>>>> Lever
>>>>>>>>>>> III
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On May 11, 2022, at 8:38 AM, Greg KH
>>>>>>>>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, May 11, 2022 at 12:03:13PM +0200,
>>>>>>>>>>>>> Wolfgang
>>>>>>>>>>>>> Walter
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> starting with 5.4.188 wie see a massive
>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>> regression on our
>>>>>>>>>>>>>> nfs-server. It basically is serving requests
>>>>>>>>>>>>>> very
>>>>>>>>>>>>>> very
>>>>>>>>>>>>>> slowly with cpu
>>>>>>>>>>>>>> utilization of 100% (with 5.4.187 and earlier
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>> 10%) so
>>>>>>>>>>>>>> that it is
>>>>>>>>>>>>>> unusable as a fileserver.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The culprit are commits (or one of it):
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> c32f1041382a88b17da5736886da4a492353a1bb
>>>>>>>>>>>>>> "nfsd:
>>>>>>>>>>>>>> cleanup
>>>>>>>>>>>>>> nfsd_file_lru_dispose()"
>>>>>>>>>>>>>> 628adfa21815f74c04724abc85847f24b5dd1645
>>>>>>>>>>>>>> "nfsd:
>>>>>>>>>>>>>> Containerise filecache
>>>>>>>>>>>>>> laundrette"
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> (upstream
>>>>>>>>>>>>>> 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and
>>>>>>>>>>>>>> 9542e6a643fc69d528dfb3303f145719c61d3050)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If I revert them in v5.4.192 the kernel works
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>> before
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> performance is
>>>>>>>>>>>>>> ok again.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I did not try to revert them one by one as
>>>>>>>>>>>>>> any
>>>>>>>>>>>>>> disruption
>>>>>>>>>>>>>> of our nfs-server
>>>>>>>>>>>>>> is a severe problem for us and I'm not sure
>>>>>>>>>>>>>> if
>>>>>>>>>>>>>> they are
>>>>>>>>>>>>>> related.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 5.10 and 5.15 both always performed very
>>>>>>>>>>>>>> badly on
>>>>>>>>>>>>>> our
>>>>>>>>>>>>>> nfs-
>>>>>>>>>>>>>> server in a
>>>>>>>>>>>>>> similar way so we were stuck with 5.4.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I now think this is because of
>>>>>>>>>>>>>> 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63
>>>>>>>>>>>>>> and/or
>>>>>>>>>>>>>> 9542e6a643fc69d528dfb3303f145719c61d3050
>>>>>>>>>>>>>> though
>>>>>>>>>>>>>> I
>>>>>>>>>>>>>> didn't tried to
>>>>>>>>>>>>>> revert them in 5.15 yet.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Odds are 5.18-rc6 is also a problem?
>>>>>>>>>>>> 
>>>>>>>>>>>> We believe that
>>>>>>>>>>>> 
>>>>>>>>>>>> 6b8a94332ee4 ("nfsd: Fix a write performance
>>>>>>>>>>>> regression")
>>>>>>>>>>>> 
>>>>>>>>>>>> addresses the performance regression. It was
>>>>>>>>>>>> merged
>>>>>>>>>>>> into
>>>>>>>>>>>> 5.18-
>>>>>>>>>>>> rc.
>>>>>>>>>>> 
>>>>>>>>>>> And into 5.17.4 if someone wants to try that
>>>>>>>>>>> release.
>>>>>>>>>> 
>>>>>>>>>> I don't have a lot of time to backport this one
>>>>>>>>>> myself,
>>>>>>>>>> so
>>>>>>>>>> I welcome anyone who wants to apply that commit to
>>>>>>>>>> their
>>>>>>>>>> favorite LTS kernel and test it for us.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>>> If so, I'll just wait for the fix to get into
>>>>>>>>>>>>> Linus's
>>>>>>>>>>>>> tree as
>>>>>>>>>>>>> this does
>>>>>>>>>>>>> not seem to be a stable-tree-only issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> Unfortunately I've received a recent report that
>>>>>>>>>>>> the
>>>>>>>>>>>> fix
>>>>>>>>>>>> introduces
>>>>>>>>>>>> a "sleep while spinlock is held" for NFSv4.0 in
>>>>>>>>>>>> rare
>>>>>>>>>>>> cases.
>>>>>>>>>>> 
>>>>>>>>>>> Ick, not good, any potential fixes for that?
>>>>>>>>>> 
>>>>>>>>>> Not yet. I was at LSF last week, so I've just started
>>>>>>>>>> digging
>>>>>>>>>> into this one. I've confirmed that the report is a
>>>>>>>>>> real
>>>>>>>>>> bug,
>>>>>>>>>> but we still don't know how hard it is to hit it with
>>>>>>>>>> real
>>>>>>>>>> workloads.
>>>>>>>>> 
>>>>>>>>> We believe the following, which should be part of the
>>>>>>>>> first
>>>>>>>>> NFSD pull request for 5.19, will properly address the
>>>>>>>>> splat.
>>>>>>>>> 
>>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=for-next&id=556082f5e5d7ecfd0ee45c3641e2b364bff9ee44
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> Uh... What happens if you have 2 simultaneous calls to
>>>>>>>> nfsd4_release_lockowner() for the same file? i.e. 2
>>>>>>>> separate
>>>>>>>> processes
>>>>>>>> owned by the same user, both locking the same file.
>>>>>>>> 
>>>>>>>> Can't that cause the 'putlist' to get corrupted when both
>>>>>>>> callers
>>>>>>>> add
>>>>>>>> the same nf->nf_putfile to two separate lists?
>>>>>>> 
>>>>>>> IIUC, cl_lock serializes the two RELEASE_LOCKOWNER calls.
>>>>>>> 
>>>>>>> The first call finds the lockowner in cl_ownerstr_hashtbl
>>>>>>> and
>>>>>>> unhashes it before releasing cl_lock.
>>>>>>> 
>>>>>>> Then the second cannot find that lockowner, thus it can't
>>>>>>> requeue it for bulk_put.
>>>>>>> 
>>>>>>> Am I missing something?
>>>>>> 
>>>>>> In the example I quoted, there are 2 separate processes
>>>>>> running
>>>>>> on the
>>>>>> client. Those processes could share the same open owner +
>>>>>> open
>>>>>> stateid,
>>>>>> and hence the same struct nfs4_file, since that depends only
>>>>>> on
>>>>>> the
>>>>>> process credentials matching. However they will not normally
>>>>>> share a
>>>>>> lock owner, since POSIX does not expect different processes
>>>>>> to
>>>>>> share
>>>>>> locks.
>>>>>> 
>>>>>> IOW: The point is that one can relatively easily create 2
>>>>>> different
>>>>>> lock owners with different lock stateids that share the same
>>>>>> underlying
>>>>>> struct nfs4_file.
>>>>> 
>>>>> Is there a similar exposure if two different clients are
>>>>> locking
>>>>> the same file? If so, then we can't use a per-nfs4_client
>>>>> semaphore
>>>>> to serialize access to the nf_putfile field.
>>>> 
>>>> I had a thought about an alternate approach.
>>>> 
>>>> Create a second nfsd_file_put API that is not allowed to sleep.
>>>> Let's call it "nfsd_file_put_async()". Teach check_for_locked()
>>>> to use that instead of nfsd_file_put().
>>>> 
>>>> Here's where I'm a little fuzzy: nfsd_file_put_async() could do
>>>> something like:
>>>> 
>>>> void nfsd_file_put_async(struct nfsd_file *nf)
>>>> {
>>>>         if (refcount_dec_and_test(&nf->nf_ref))
>>>>                 nfsd_file_close_inode(nf->nf_inode);
>>>> }
>>>> 
>>>> 
>>> 
>>> That approach moves the sync to the garbage collector, which was
>>> exactly what we're trying to avoid in the first place.
>> 
>> Totally understood.
>> 
>> My thought was that "put" for RELEASE_LOCKOWNER/FREE_STATEID
>> would be unlikely to have any data to sync -- callers that
>> actually have data to flush are elsewhere, and those would
>> continue to use the synchronous nfsd_file_put() API.
>> 
>> Do you have a workload where we can test this assumption?
>> 
>> 
>>> Why not just do this "check_for_locks()" thing differently?
>>> 
>>> It really shouldn't be too hard to add something to
>>> nfsd4_lm_get_owner()/nfsd4_lm_put_owner() that bumps a counter in
>>> the
>>> lockowner in order to tell you whether or not locks are still held
>>> instead of doing this bone headed walk through the list of locks.
>> 
>> I thought of that a couple weeks ago. That doesn't work
>> because you can lock or unlock by range. That means the
>> symmetry of LOCK and LOCKU is not guaranteed, and I don't
>> believe these calls are used that way anyway. So I
>> abandoned the idea of using get_owner / put_owner.
>> 
> 
> Then you're misunderstanding how it works. lm_get_owner() is called
> when a lock is initialised from another one. The whole point is to
> ensure that each and every object representing a range lock on the
> inode's list maintains its own private reference to the knfsd lockowner
> (i.e. the fl->fl_owner).
> 
> For instance when a LOCK call calls posix_lock_inode(), then that
> function uses locks_copy_conflock() (which calls lm_get_owner) to
> initialise the range lock object that is being put on the inode list.
> If the new lock causes multiple existing locks to be replaced, they all
> call lm_put_owner to release their references to fl->fl_owner as part
> of the process of being freed.
> 
> Conversely, when LOCKU causes a range to get split, the two locks that
> replace the old one are both initialised using locks_copy_conflock(),
> so they both call lm_get_owner. The lock that represents the range
> being replaced is then made to call lm_put_owner() when it is freed.
> 
> etc, etc...

That is definitely not what it looked like when I traced it. The
reference count managed by get_owner / put_owner did not seem
to be usable for tracking whether a lockowner had locks or not.
The reference count was put pretty quickly after the lm_get_owner
call.

But I'm not interested in an argument. I'll go back and look at
get_owner / put_owner again, because I agree that not having to
traverse the inode's flc_posix list during check_for_locks() would
be awesome sauce.


--
Chuck Lever




      reply	other threads:[~2022-05-21 20:43 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-11 10:03 5.4.188 and later: massive performance regression with nfsd Wolfgang Walter
2022-05-11 12:38 ` Greg KH
2022-05-11 14:16   ` Chuck Lever III
2022-05-11 14:23     ` Greg KH
2022-05-11 14:36       ` Chuck Lever III
2022-05-11 15:30         ` Wolfgang Walter
2022-05-11 15:50           ` Chuck Lever III
2022-05-11 17:24             ` Wolfgang Walter
2022-05-20 15:36         ` Chuck Lever III
2022-05-20 16:40           ` Trond Myklebust
2022-05-20 17:06             ` dai.ngo
2022-05-20 21:52             ` Chuck Lever III
2022-05-20 22:24               ` Trond Myklebust
2022-05-20 23:43                 ` Chuck Lever III
2022-05-21 17:22                   ` Chuck Lever III
2022-05-21 18:10                     ` Trond Myklebust
2022-05-21 19:11                       ` Chuck Lever III
2022-05-21 19:49                         ` Trond Myklebust
2022-05-21 20:41                           ` Chuck Lever III [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com \
    --to=chuck.lever@oracle.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=jlayton@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux@stwm.de \
    --cc=stable@vger.kernel.org \
    --cc=trondmy@hammerspace.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.