From: "Linux regression tracking #update (Thorsten Leemhuis)" <regressions@leemhuis.info>
To: Daire Byrne <daire@dneg.com>,
	Linux regressions mailing list <regressions@lists.linux.dev>
Cc: linux-nfs <linux-nfs@vger.kernel.org>,
	Anna Schumaker <anna@kernel.org>,
	Trond Myklebust <trond.myklebust@hammerspace.com>
Subject: Re: v6.2 client behaviour change (repeat access calls)?
Date: Sun, 16 Apr 2023 15:14:11 +0200
Message-ID: <e25e2e81-47e6-63a0-9c0d-00258397e47b@leemhuis.info>
In-Reply-To: <CAPt2mGNaNTD=a+UsDLDeRWqd+PVHk6Rh-OX0+0BxGc+xgovraA@mail.gmail.com>

On 23.03.23 13:02, Daire Byrne wrote:
> Yes, sorry, I did see that patch was posted not long after I reported
> the excessive repeat ACCESS calls but I have not had a chance to test
> it.
> 
> It certainly looks promising. I will try to get to it next week and
> update my original email.

Hmm, no update afaics (or I missed it). No worries, but for now I will
assume this was fixed. Please correct me if it later turns out that I
was wrong.

#regzbot fix: 21fd9e8700de86d1169
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


> On Thu, 23 Mar 2023 at 11:54, Thorsten Leemhuis
> <regressions@leemhuis.info> wrote:
>>
>> On 20.03.23 13:34, Thorsten Leemhuis wrote:
>>> On 10.03.23 11:43, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>> [CCing the regression list, as it should be in the loop for regressions:
>>>> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>>>>
>>>> [also adding the author of the culprit (Trond) and the second NFS client
>>>> maintainer (Anna) to the list of recipients]
>>>
>>> Trond, sorry to bother you, but do you still have this regression report
>>> on your radar? It looks a bit like it fell through the cracks, as I
>>> don't see any reply since it was posted nearly two weeks ago. Or has
>>> some progress been made to address this and I just missed it?
>>
>> Daire, it seems the NFS developers are ignoring my inquiries, sorry.
>>
>> But well, I noticed there is a patch that references the culprit you
>> found in the bisection:
>>
>> https://lore.kernel.org/all/20230308080327.33906-1-chengen.du@canonical.com/
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=master&id=21fd9e8700de86d1169f6336e97d7a74916ed04a
>>
>> I wonder if you are aware of it or maybe have even tested it already. To
>> me it sounds like it could fix your problem, but this is not my area of
>> expertise, so I might be totally wrong.
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>> #regzbot ignore-activity
>>
>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>>> regressions; the text you find below is based on a few templates
>>>> paragraphs you might have encountered already in similar form.
>>>> See link in footer if these mails annoy you.]
>>>>
>>>> On 07.03.23 18:38, Daire Byrne wrote:
>>>>> I noticed a change in behaviour in the v6.2.x client versus v6.1.12 (and below).
>>>>>
>>>>> We have some servers that mount Netapps from different locations many
>>>>> milliseconds away, and these contain apps and libs that get added to
>>>>> the LD_LIBRARY_PATH and PATH on remote login.
>>>>>
>>>>> I then noticed that when I ssh'd into a remote server that had these
>>>>> mounts and the shell was starting, the first login was normal and I
>>>>> observed an expected flurry of lookups, getattrs and access calls for
>>>>> a grand total of only ~120 packets to the Netapp.
>>>>>
>>>>> But when I disconnect and reconnect (ssh), I now see a flood of access
>>>>> calls to the Netapp for a handful of repeating filehandles, which look
>>>>> something like:
>>>>>
>>>>>  2700 85.942563180 10.23.112.10 → 10.23.21.11  NFS 254 V3 ACCESS Call,
>>>>> FH: 0x7f36addc, [Check: RD LU MD XT DL]
>>>>>  2701 85.999838796  10.23.21.11 → 10.23.112.10 NFS 190 V3 ACCESS Reply
>>>>> (Call In 2700), [Allowed: RD LU MD XT DL]
>>>>>  2702 85.999970825 10.23.112.10 → 10.23.21.11  NFS 254 V3 ACCESS Call,
>>>>> FH: 0x7f36addc, [Check: RD LU MD XT DL]
>>>>>  2703 86.055340946  10.23.21.11 → 10.23.112.10 NFS 190 V3 ACCESS Reply
>>>>> (Call In 2702), [Allowed: RD LU MD XT DL]
>>>>>  2704 86.056865308 10.23.112.10 → 10.23.21.11  NFS 254 V3 ACCESS Call,
>>>>> FH: 0x7f36addc, [Check: RD LU MD XT DL]
>>>>>  2705 86.112233415  10.23.21.11 → 10.23.112.10 NFS 190 V3 ACCESS Reply
>>>>> (Call In 2704), [Allowed: RD LU MD XT DL]
>>>>>
>>>>> This time the login totals 5000+ packets, which becomes very
>>>>> noticeable when the Netapp is 50ms away.
>>>>>
>>>>> I didn't understand why the first login was fine but the second went
>>>>> into this repeating access pattern. I set actimeo=3600 (long), but it
>>>>> does not seem to affect it.
>>>>>
>>>>> I do not see this prior to v6.2, where repeated logins are equally
>>>>> fast and there are no repeating access calls.
>>>>>
>>>>> So after a bit of digging through the v6.2 changes, these looked like
>>>>> the relevant changes:
>>>>>
>>>>> commit 0eb43812c027 ("NFS: Clear the file access cache upon login")
>>>>> [PATCH] NFS: Judge the file access cache's timestamp in rcu path?
>>>>>
>>>>> I reverted those and got the prior (v6.1) performance.
>>>>>
>>>>> What constitutes a login exactly? I also have services like "sysstat"
>>>>> or pcp that cause systemd-logind to trigger regularly on our
>>>>> machines... does that count and invalidate the cache?
>>>>>
>>>>> Do the repeated access calls on the same handful of filehandles make
>>>>> sense? Even prior to those patches (i.e. on v6.1) there are only a
>>>>> couple of ACCESS calls to the Netapp on login.
>>>>>
>>>>> We are a bit unusual in that we run quite a few high-latency WAN NFS
>>>>> workflows, so we are happy to use long-lived caches (e.g. actimeo and,
>>>>> on occasion, even nocto) to lower the op count at the expense of total
>>>>> correctness.
>>>>
>>>> Thanks for the report. To be sure the issue doesn't fall through the
>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>>> tracking bot:
>>>>
>>>> #regzbot ^introduced 0eb43812c027
>>>> #regzbot title nfs: flood of access on second log-in (first is fine)
>>>> #regzbot ignore-activity
>>>>
>>>> This isn't a regression? This issue or a fix for it is already being
>>>> discussed somewhere else? It was fixed already? You want to clarify when
>>>> the regression started to happen? Or point out I got the title or
>>>> something else totally wrong? Then just reply and tell me -- ideally
>>>> while also telling regzbot about it, as explained by the page listed in
>>>> the footer of this mail.
>>>>
>>>> Developers: When fixing the issue, remember to add 'Link:' tags pointing
>>>> to the report (the parent of this mail). See page linked in footer for
>>>> details.
>>>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>> --
>>>> Everything you wanna know about Linux kernel regression tracking:
>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>> That page also explains what to do if mails like this annoy you.
> 
> 
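
Side note on the mechanism the two commit titles above point at: "NFS:
Clear the file access cache upon login" and "NFS: Judge the file access
cache's timestamp in rcu path" suggest that each cached ACCESS result
carries a timestamp that is compared against some notion of the caller's
login time, and entries that predate the login are treated as misses so
that a fresh ACCESS call goes to the server. Below is a minimal,
self-contained C sketch of just that idea; the type, function and field
names (access_cache_entry, access_cache_get, ...) and the numbers are
made up for illustration and are not the kernel implementation.

/*
 * Simplified userspace model of "clear the access cache upon login":
 * every cached ACCESS result carries a timestamp, and a lookup only
 * trusts entries that are newer than the caller's "login time".
 * All names here are illustrative; this is not the kernel code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct access_cache_entry {
	uint64_t timestamp;	/* when the ACCESS reply was cached */
	uint32_t mask;		/* cached permission bits (RD, LU, MD, ...) */
};

/*
 * Return true (cache hit) only if the entry was created after the
 * caller's login; otherwise the caller must issue a new ACCESS call.
 */
static bool access_cache_get(const struct access_cache_entry *entry,
			     uint64_t login_time, uint32_t *mask)
{
	if (entry->timestamp < login_time)
		return false;	/* cached before this login: treat as a miss */
	*mask = entry->mask;
	return true;
}

int main(void)
{
	struct access_cache_entry entry = { .timestamp = 100, .mask = 0x1f };
	uint32_t mask;

	/* Login at t=50: the entry (t=100) is newer, so it is trusted. */
	printf("login@50:  %s\n",
	       access_cache_get(&entry, 50, &mask) ? "cache hit" : "cache miss");

	/* Later login at t=200: the entry is now older than the login,
	 * so it is discarded and a new ACCESS call would go to the server. */
	printf("login@200: %s\n",
	       access_cache_get(&entry, 200, &mask) ? "cache hit" : "cache miss");
	return 0;
}

If that comparison keeps classifying entries for the same filehandle as
stale (or is skipped in one of the lookup paths), every lookup ends in
the miss branch and triggers another ACCESS call, which would match the
repeated calls for FH 0x7f36addc in the trace quoted above.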
