From: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
To: Chuck Lever III <chuck.lever@oracle.com>
Cc: Trond Myklebust <trondmy@hammerspace.com>,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: NFS regression between 5.17 and 5.18
Date: Fri, 13 May 2022 07:58:30 -0400 [thread overview]
Message-ID: <9d3055f2-f751-71f4-1fc0-927817a07d99@cornelisnetworks.com> (raw)
In-Reply-To: <46beb079-fb43-a9c1-d9a0-9b66d5a36163@cornelisnetworks.com>
On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
> On 4/29/22 9:37 AM, Chuck Lever III wrote:
>>
>>
>>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote:
>>>
>>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>>> <dennis.dalessandro@cornelisnetworks.com> wrote:
>>>>>>>>
>>>>>>>> Hi NFS folks,
>>>>>>>>
>>>>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>>>>> between 5.17 and 5.18-rc1. I've tried to bisect but am not having
>>>>>>>> any luck. The problem I'm seeing is that it takes 3 minutes to
>>>>>>>> copy a file from NFS to the local disk, when it should take less
>>>>>>>> than half a second, as it did up through 5.17.
>>>>>>>>
>>>>>>>> It doesn't seem to be network related, but I can't rule that out
>>>>>>>> completely.
>>>>>>>>
>>>>>>>> I tried to bisect, but the problem is intermittent. Some runs
>>>>>>>> I'll see the problem in 3 out of 100 cycles, sometimes in 0 out
>>>>>>>> of 100, and sometimes in 100 out of 100.
>>>>>>>
>>>>>>> It's not clear from your problem report whether the problem appears
>>>>>>> when the server is running v5.18-rc or the client.
>>>>>>
>>>>>> That's because I don't know which it is. I'll do a quick test and
>>>>>> find out. I
>>>>>> was testing the same kernel across both nodes.
>>>>>
>>>>> Looks like it is the client.
>>>>>
>>>>> server client result
>>>>> ------ ------ ------
>>>>> 5.17 5.17 Pass
>>>>> 5.17 5.18 Fail
>>>>> 5.18 5.18 Fail
>>>>> 5.18 5.17 Pass
>>>>>
>>>>> Is there a patch for the client issue you mentioned that I could try?
>>>>>
>>>>> -Denny
>>>>
>>>> Try this one
>>>
>>> Thanks for the patch. Unfortunately it doesn't seem to solve the issue, still
>>> see intermittent hangs. I applied it on top of -rc4:
>>>
>>> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>>>
>>> real 2m6.072s
>>> user 0m0.002s
>>> sys 0m0.263s
>>> Done
>>>
>>> While it was hung I checked the mem usage on the machine:
>>>
>>> # free -h
>>> total used free shared buff/cache available
>>> Mem: 62Gi 871Mi 61Gi 342Mi 889Mi 61Gi
>>> Swap: 4.0Gi 0B 4.0Gi
>>>
>>> Doesn't appear to be under memory pressure.
>>
>> Hi, since you know now that it is the client, perhaps a bisect
>> would be more successful?
>
> I've been testing all week. I pulled the nfs-rdma tree that was sent to Linus
> for 5.18 and tested. I see the problem on pretty much all the patches; however,
> it's the frequency at which it hits that changes.
>
> I'll see 1-5 cycles out of 2500 where the copy takes minutes up to:
> "NFS: Convert readdir page cache to use a cookie based index"
>
> After this patch I start seeing it around 10 times in 500, and by the last
> patch 10 times in fewer than 100.
>
> Is there any kind of tracing/debugging I could turn on to get more insight
> into what is taking so long when it does go bad?
>
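The intermittent reproduction described above can be scripted as a simple timing loop that counts how many cycles exceed a "slow" cutoff. A minimal sketch — the paths, cycle count, and 2-second threshold below are illustrative assumptions, not values taken from this report:

```shell
#!/bin/sh
# Reproduction-loop sketch: repeatedly time a copy and count cycles that
# exceed a threshold. Paths and the cutoff are assumptions for illustration.

# time_copies SRC DST CYCLES THRESHOLD
# Copies SRC to DST CYCLES times, logs any cycle that takes at least
# THRESHOLD seconds to stderr, and prints the total number of slow cycles.
time_copies() {
    src=$1 dst=$2 cycles=$3 threshold=$4
    slow=0
    i=1
    while [ "$i" -le "$cycles" ]; do
        start=$(date +%s)
        cp "$src" "$dst"
        elapsed=$(( $(date +%s) - start ))
        if [ "$elapsed" -ge "$threshold" ]; then
            slow=$((slow + 1))
            echo "cycle $i: ${elapsed}s (slow)" >&2
        fi
        i=$((i + 1))
    done
    echo "$slow"
}
```

For the setup described in the thread this might be invoked as `time_copies /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp 100 2`, flagging any copy that takes 2 seconds or more.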
Ran a test with -rc6 and this time saw a hung task trace on the console as well
as an NFS RPC error.
[32719.991175] nfs: RPC call returned error 512
.
.
.
[32933.285126] INFO: task kworker/u145:23:886141 blocked for more than 122 seconds.
[32933.293543]       Tainted: G S                5.18.0-rc6 #1
[32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[32933.308740] task:kworker/u145:23 state:D stack:    0 pid:886141 ppid:     2 flags:0x00004000
[32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
[32933.324524] Call Trace:
[32933.327347] <TASK>
[32933.329785] __schedule+0x3dd/0x970
[32933.333783] schedule+0x41/0xa0
[32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
[32933.343639] ? prepare_to_wait+0xd0/0xd0
[32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
[32933.354183] xprt_release+0x26/0x140 [sunrpc]
[32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
[32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc]
[32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc]
[32933.376564] ? __switch_to_asm+0x42/0x70
[32933.381046] ? finish_task_switch+0xb2/0x2c0
[32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc]
[32933.391391] process_one_work+0x1c8/0x390
[32933.395975] worker_thread+0x30/0x360
[32933.400162] ? process_one_work+0x390/0x390
[32933.404931] kthread+0xd9/0x100
[32933.408536] ? kthread_complete_and_exit+0x20/0x20
[32933.413984] ret_from_fork+0x22/0x30
[32933.418074] </TASK>
The call trace shows up again at 245, 368, and 491 seconds. Same task, same trace.
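On the open question above about tracing: a hedged sketch of turning on client-side NFS/RPC instrumentation while reproducing. The rpcdebug flags and tracefs paths are the standard ones, but verify they exist on your kernel and distro; this needs root.

```shell
#!/bin/sh
# Sketch: enable client-side NFS/RPC debugging while reproducing the hang.
# Run as root on the NFS client under test.

TRACEFS=/sys/kernel/debug/tracing
if [ "$(id -u)" -eq 0 ] && [ -d "$TRACEFS" ]; then
    # Classic dprintk-style debugging (very verbose, lands in dmesg):
    rpcdebug -m rpc -s trans || true   # rpcdebug may not be installed
    rpcdebug -m nfs -s all   || true
    # Tracepoint-based capture (lighter weight than dprintk):
    echo 1 > "$TRACEFS/events/sunrpc/enable"
    echo 1 > "$TRACEFS/events/nfs/enable" 2>/dev/null || true
    echo "tracing enabled; read $TRACEFS/trace_pipe while reproducing"
else
    echo "needs root and tracefs; run this on the NFS client under test"
fi
```

The sunrpc tracepoints in particular cover xprt queueing and task scheduling, which is where the trace above is stuck.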