* NFS regression between 5.17 and 5.18
@ 2022-04-28 13:05 Dennis Dalessandro
  2022-04-28 14:57 ` Chuck Lever III
  2022-05-04  9:45 ` Thorsten Leemhuis
  0 siblings, 2 replies; 25+ messages in thread

From: Dennis Dalessandro @ 2022-04-28 13:05 UTC (permalink / raw)
To: linux-nfs; +Cc: Chuck Lever

Hi NFS folks,

I've noticed a pretty nasty regression in our NFS capability between 5.17 and
5.18-rc1. I've tried to bisect but not having any luck. The problem I'm seeing
is it takes 3 minutes to copy a file from NFS to the local disk. When it should
take less than half a second, which it did up through 5.17.

It doesn't seem to be network related, but can't rule that out completely.

I tried to bisect but the problem can be intermittent. Some runs I'll see a
problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it
100 out of 100.

The latest attempt I have is at 5.18-rc4 and here's what I see when I 1) create
file with dd, 2) copy local disk to NFS mount, 3) copy NFS to local disk.

Test 2
Creating /dev/shm/run_nfs_test.junk...
262144+0 records in
262144+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.29037 s, 208 MB/s
Done
copy /dev/shm/run_nfs_test.junk to /mnt/nfs_test/run_nfs_test.junk...

real    0m0.502s
user    0m0.001s
sys     0m0.250s
Done
copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...

real    4m11.835s
user    0m0.001s
sys     0m0.277s
Done
Comparing files...
Done
Remove /dev/shm/run_nfs_test.tmp...

real    0m0.031s
user    0m0.000s
sys     0m0.031s
Done
Remove /dev/shm/run_nfs_test.junk...

real    0m0.031s
user    0m0.001s
sys     0m0.030s
Done
Remove /mnt/nfs_test/run_nfs_test.junk...

real    0m0.024s
user    0m0.001s
sys     0m0.022s
Done

Create the server with a 4G RAM disk:
mount -t tmpfs -o size=4096M tmpfs /mnt/nfs_test
exportfs -o fsid=0,rw,async,insecure,no_root_squash 192.168.254.0/255.255.255.0:/mnt/nfs_test

Mount on the client side with:
mount -o rdma,port=20049 192.168.254.1:/mnt/nfs_test /mnt/nfs_test

Any advice on tracking this down further would be greatly appreciated.

Thanks

-Denny

^ permalink raw reply	[flat|nested] 25+ messages in thread
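[Editorial note: the dd/copy/compare cycle described above can be sketched as a small POSIX shell function. This is an illustrative reconstruction, not the reporter's actual script; the function name, argument order, and defaults are assumptions.]

```shell
# One cycle of the reported test: create a file with dd, copy it to the
# NFS mount, copy it back, verify the contents, and report how long the
# NFS-to-local copy (the slow step in the report) took.
nfs_copy_cycle() {
    src_dir=$1      # fast local storage, e.g. /dev/shm
    nfs_dir=$2      # the NFS mount point, e.g. /mnt/nfs_test
    size_mb=$3      # file size in MiB, e.g. 256

    junk=$src_dir/run_nfs_test.junk
    tmp=$src_dir/run_nfs_test.tmp

    dd if=/dev/zero of="$junk" bs=1M count="$size_mb" 2>/dev/null
    cp "$junk" "$nfs_dir/run_nfs_test.junk"        # local disk -> NFS

    start=$(date +%s)
    cp "$nfs_dir/run_nfs_test.junk" "$tmp"         # NFS -> local disk
    elapsed=$(( $(date +%s) - start ))

    cmp "$junk" "$tmp" || return 1                 # compare files
    echo "NFS-to-local copy took ${elapsed}s"
    rm -f "$junk" "$tmp" "$nfs_dir/run_nfs_test.junk"
}
```

Running `nfs_copy_cycle /dev/shm /mnt/nfs_test 256` in a loop of 100 iterations would correspond to one run of the test as described.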
* Re: NFS regression between 5.17 and 5.18
  2022-04-28 13:05 NFS regression between 5.17 and 5.18 Dennis Dalessandro
@ 2022-04-28 14:57 ` Chuck Lever III
  2022-04-28 15:42   ` Dennis Dalessandro
  2022-05-04  9:45 ` Thorsten Leemhuis
  1 sibling, 1 reply; 25+ messages in thread

From: Chuck Lever III @ 2022-04-28 14:57 UTC (permalink / raw)
To: Dennis Dalessandro; +Cc: Linux NFS Mailing List

> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote:
>
> Hi NFS folks,
>
> I've noticed a pretty nasty regression in our NFS capability between 5.17 and
> 5.18-rc1. I've tried to bisect but not having any luck. The problem I'm seeing
> is it takes 3 minutes to copy a file from NFS to the local disk. When it should
> take less than half a second, which it did up through 5.17.
>
> It doesn't seem to be network related, but can't rule that out completely.
>
> I tried to bisect but the problem can be intermittent. Some runs I'll see a
> problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it
> 100 out of 100.

It's not clear from your problem report whether the problem appears
when it's the server running v5.18-rc or the client. It looks vaguely
like a recent client issue where it encounters a memory shortage while
preparing an RPC and a delay is needed.

> The latest attempt I have is at 5.18-rc4 and here's what I see when I 1) create
> file with dd, 2) copy local disk to NFS mount, 3) copy NFS to local disk.
>
> Test 2
> Creating /dev/shm/run_nfs_test.junk...
> 262144+0 records in
> 262144+0 records out
> 268435456 bytes (268 MB, 256 MiB) copied, 1.29037 s, 208 MB/s
> Done
> copy /dev/shm/run_nfs_test.junk to /mnt/nfs_test/run_nfs_test.junk...
>
> real    0m0.502s
> user    0m0.001s
> sys     0m0.250s
> Done
> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>
> real    4m11.835s
> user    0m0.001s
> sys     0m0.277s
> Done
> Comparing files...
> Done
> Remove /dev/shm/run_nfs_test.tmp...
>
> real    0m0.031s
> user    0m0.000s
> sys     0m0.031s
> Done
> Remove /dev/shm/run_nfs_test.junk...
>
> real    0m0.031s
> user    0m0.001s
> sys     0m0.030s
> Done
> Remove /mnt/nfs_test/run_nfs_test.junk...
>
> real    0m0.024s
> user    0m0.001s
> sys     0m0.022s
> Done
>
> Create the server with a 4G RAM disk:
> mount -t tmpfs -o size=4096M tmpfs /mnt/nfs_test
>
> exportfs -o fsid=0,rw,async,insecure,no_root_squash
> 192.168.254.0/255.255.255.0:/mnt/nfs_test
>
> Mount on the client side with:
> mount -o rdma,port=20049 192.168.254.1:/mnt/nfs_test /mnt/nfs_test
>
> Any advice on tracking this down further would be greatly appreciated.
>
> Thanks
>
> -Denny

--
Chuck Lever

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18
  2022-04-28 14:57 ` Chuck Lever III
@ 2022-04-28 15:42   ` Dennis Dalessandro
  2022-04-28 19:47     ` Dennis Dalessandro
  0 siblings, 1 reply; 25+ messages in thread

From: Dennis Dalessandro @ 2022-04-28 15:42 UTC (permalink / raw)
To: Chuck Lever III; +Cc: Linux NFS Mailing List

On 4/28/22 10:57 AM, Chuck Lever III wrote:
>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote:
>>
>> Hi NFS folks,
>>
>> I've noticed a pretty nasty regression in our NFS capability between 5.17 and
>> 5.18-rc1. I've tried to bisect but not having any luck. The problem I'm seeing
>> is it takes 3 minutes to copy a file from NFS to the local disk. When it should
>> take less than half a second, which it did up through 5.17.
>>
>> It doesn't seem to be network related, but can't rule that out completely.
>>
>> I tried to bisect but the problem can be intermittent. Some runs I'll see a
>> problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it
>> 100 out of 100.
>
> It's not clear from your problem report whether the problem appears
> when it's the server running v5.18-rc or the client.

That's because I don't know which it is. I'll do a quick test and find out. I
was testing the same kernel across both nodes.

-Denny

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18
  2022-04-28 15:42 ` Dennis Dalessandro
@ 2022-04-28 19:47   ` Dennis Dalessandro
  2022-04-28 19:56     ` Trond Myklebust
  0 siblings, 1 reply; 25+ messages in thread

From: Dennis Dalessandro @ 2022-04-28 19:47 UTC (permalink / raw)
To: Chuck Lever III; +Cc: Linux NFS Mailing List

On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote:
>>>
>>> Hi NFS folks,
>>>
>>> I've noticed a pretty nasty regression in our NFS capability between 5.17 and
>>> 5.18-rc1. I've tried to bisect but not having any luck. The problem I'm seeing
>>> is it takes 3 minutes to copy a file from NFS to the local disk. When it should
>>> take less than half a second, which it did up through 5.17.
>>>
>>> It doesn't seem to be network related, but can't rule that out completely.
>>>
>>> I tried to bisect but the problem can be intermittent. Some runs I'll see a
>>> problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it
>>> 100 out of 100.
>>
>> It's not clear from your problem report whether the problem appears
>> when it's the server running v5.18-rc or the client.
>
> That's because I don't know which it is. I'll do a quick test and find out. I
> was testing the same kernel across both nodes.

Looks like it is the client.

server  client  result
------  ------  ------
5.17    5.17    Pass
5.17    5.18    Fail
5.18    5.18    Fail
5.18    5.17    Pass

Is there a patch for the client issue you mentioned that I could try?

-Denny

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18
  2022-04-28 19:47 ` Dennis Dalessandro
@ 2022-04-28 19:56   ` Trond Myklebust
  2022-04-29 12:54     ` Dennis Dalessandro
  0 siblings, 1 reply; 25+ messages in thread

From: Trond Myklebust @ 2022-04-28 19:56 UTC (permalink / raw)
To: dennis.dalessandro, chuck.lever; +Cc: linux-nfs

[-- Attachment #1: Type: text/plain, Size: 1731 bytes --]

On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
> > On 4/28/22 10:57 AM, Chuck Lever III wrote:
> > > > On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
> > > > <dennis.dalessandro@cornelisnetworks.com> wrote:
> > > >
> > > > Hi NFS folks,
> > > >
> > > > I've noticed a pretty nasty regression in our NFS capability
> > > > between 5.17 and
> > > > 5.18-rc1. I've tried to bisect but not having any luck. The
> > > > problem I'm seeing
> > > > is it takes 3 minutes to copy a file from NFS to the local
> > > > disk. When it should
> > > > take less than half a second, which it did up through 5.17.
> > > >
> > > > It doesn't seem to be network related, but can't rule that out
> > > > completely.
> > > >
> > > > I tried to bisect but the problem can be intermittent. Some
> > > > runs I'll see a
> > > > problem in 3 out of 100 cycles, sometimes 0 out of 100.
> > > > Sometimes I'll see it
> > > > 100 out of 100.
> > >
> > > It's not clear from your problem report whether the problem
> > > appears
> > > when it's the server running v5.18-rc or the client.
> >
> > That's because I don't know which it is. I'll do a quick test and
> > find out. I
> > was testing the same kernel across both nodes.
>
> Looks like it is the client.
>
> server  client  result
> ------  ------  ------
> 5.17    5.17    Pass
> 5.17    5.18    Fail
> 5.18    5.18    Fail
> 5.18    5.17    Pass
>
> Is there a patch for the client issue you mentioned that I could try?
>
> -Denny

Try this one

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-SUNRPC-Fix-an-RPC-RDMA-performance-regression.patch --]
[-- Type: text/x-patch; name="0001-SUNRPC-Fix-an-RPC-RDMA-performance-regression.patch", Size: 2444 bytes --]

From 0505ecf76f5131525e696d3dae3bd456a8b8b3ad Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust@hammerspace.com>
Date: Thu, 28 Apr 2022 15:46:01 -0400
Subject: [PATCH] SUNRPC: Fix an RPC/RDMA performance regression

Use the standard gfp mask instead of using GFP_NOWAIT. The latter causes
issues when under memory pressure.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 net/sunrpc/auth_gss/auth_gss.c  | 11 ++++-------
 net/sunrpc/sched.c              |  1 +
 net/sunrpc/xprtrdma/transport.c |  6 +-----
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index de7e5b41ab8f..a31a27816cc0 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1340,14 +1340,11 @@ gss_hash_cred(struct auth_cred *acred, unsigned int hashbits)
 /*
  * Lookup RPCSEC_GSS cred for the current process
  */
-static struct rpc_cred *
-gss_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
+static struct rpc_cred *gss_lookup_cred(struct rpc_auth *auth,
+					struct auth_cred *acred, int flags)
 {
-	gfp_t gfp = GFP_KERNEL;
-
-	if (flags & RPCAUTH_LOOKUP_ASYNC)
-		gfp = GFP_NOWAIT | __GFP_NOWARN;
-	return rpcauth_lookup_credcache(auth, acred, flags, gfp);
+	return rpcauth_lookup_credcache(auth, acred, flags,
+					rpc_task_gfp_mask());
 }
 
 static struct rpc_cred *
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 7f70c1e608b7..25b9221950ff 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -63,6 +63,7 @@ gfp_t rpc_task_gfp_mask(void)
 		return GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
 	return GFP_KERNEL;
 }
+EXPORT_SYMBOL_GPL(rpc_task_gfp_mask);
 
 unsigned long rpc_task_timeout(const struct rpc_task *task)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 6b7e10e5a141..bcb37b51adf6 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -571,11 +571,7 @@ xprt_rdma_allocate(struct rpc_task *task)
 	struct rpc_rqst *rqst = task->tk_rqstp;
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt);
 	struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
-	gfp_t flags;
-
-	flags = RPCRDMA_DEF_GFP;
-	if (RPC_IS_ASYNC(task))
-		flags = GFP_NOWAIT | __GFP_NOWARN;
+	gfp_t flags = rpc_task_gfp_mask();
 
 	if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
 				  flags))
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18
  2022-04-28 19:56 ` Trond Myklebust
@ 2022-04-29 12:54   ` Dennis Dalessandro
  2022-04-29 13:37     ` Chuck Lever III
  0 siblings, 1 reply; 25+ messages in thread

From: Dennis Dalessandro @ 2022-04-29 12:54 UTC (permalink / raw)
To: Trond Myklebust, chuck.lever; +Cc: linux-nfs

On 4/28/22 3:56 PM, Trond Myklebust wrote:
> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>> <dennis.dalessandro@cornelisnetworks.com> wrote:
>>>>>
>>>>> Hi NFS folks,
>>>>>
>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>> between 5.17 and
>>>>> 5.18-rc1. I've tried to bisect but not having any luck. The
>>>>> problem I'm seeing
>>>>> is it takes 3 minutes to copy a file from NFS to the local
>>>>> disk. When it should
>>>>> take less than half a second, which it did up through 5.17.
>>>>>
>>>>> It doesn't seem to be network related, but can't rule that out
>>>>> completely.
>>>>>
>>>>> I tried to bisect but the problem can be intermittent. Some
>>>>> runs I'll see a
>>>>> problem in 3 out of 100 cycles, sometimes 0 out of 100.
>>>>> Sometimes I'll see it
>>>>> 100 out of 100.
>>>>
>>>> It's not clear from your problem report whether the problem
>>>> appears
>>>> when it's the server running v5.18-rc or the client.
>>>
>>> That's because I don't know which it is. I'll do a quick test and
>>> find out. I
>>> was testing the same kernel across both nodes.
>>
>> Looks like it is the client.
>>
>> server  client  result
>> ------  ------  ------
>> 5.17    5.17    Pass
>> 5.17    5.18    Fail
>> 5.18    5.18    Fail
>> 5.18    5.17    Pass
>>
>> Is there a patch for the client issue you mentioned that I could try?
>>
>> -Denny
>
> Try this one

Thanks for the patch. Unfortunately it doesn't seem to solve the issue, still
see intermittent hangs. I applied it on top of -rc4:

copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...

real    2m6.072s
user    0m0.002s
sys     0m0.263s
Done

While it was hung I checked the mem usage on the machine:

# free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
Swap:         4.0Gi          0B       4.0Gi

Doesn't appear to be under memory pressure.

-Denny

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18
  2022-04-29 12:54 ` Dennis Dalessandro
@ 2022-04-29 13:37   ` Chuck Lever III
  2022-05-06 13:24     ` Dennis Dalessandro
  0 siblings, 1 reply; 25+ messages in thread

From: Chuck Lever III @ 2022-04-29 13:37 UTC (permalink / raw)
To: Dennis Dalessandro; +Cc: Trond Myklebust, Linux NFS Mailing List

> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote:
>
> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>> <dennis.dalessandro@cornelisnetworks.com> wrote:
>>>>>>
>>>>>> Hi NFS folks,
>>>>>>
>>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>>> between 5.17 and
>>>>>> 5.18-rc1. I've tried to bisect but not having any luck. The
>>>>>> problem I'm seeing
>>>>>> is it takes 3 minutes to copy a file from NFS to the local
>>>>>> disk. When it should
>>>>>> take less than half a second, which it did up through 5.17.
>>>>>>
>>>>>> It doesn't seem to be network related, but can't rule that out
>>>>>> completely.
>>>>>>
>>>>>> I tried to bisect but the problem can be intermittent. Some
>>>>>> runs I'll see a
>>>>>> problem in 3 out of 100 cycles, sometimes 0 out of 100.
>>>>>> Sometimes I'll see it
>>>>>> 100 out of 100.
>>>>>
>>>>> It's not clear from your problem report whether the problem
>>>>> appears
>>>>> when it's the server running v5.18-rc or the client.
>>>>
>>>> That's because I don't know which it is. I'll do a quick test and
>>>> find out. I
>>>> was testing the same kernel across both nodes.
>>>
>>> Looks like it is the client.
>>>
>>> server  client  result
>>> ------  ------  ------
>>> 5.17    5.17    Pass
>>> 5.17    5.18    Fail
>>> 5.18    5.18    Fail
>>> 5.18    5.17    Pass
>>>
>>> Is there a patch for the client issue you mentioned that I could try?
>>>
>>> -Denny
>>
>> Try this one
>
> Thanks for the patch. Unfortunately it doesn't seem to solve the issue, still
> see intermittent hangs. I applied it on top of -rc4:
>
> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>
> real    2m6.072s
> user    0m0.002s
> sys     0m0.263s
> Done
>
> While it was hung I checked the mem usage on the machine:
>
> # free -h
>               total        used        free      shared  buff/cache   available
> Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
> Swap:         4.0Gi          0B       4.0Gi
>
> Doesn't appear to be under memory pressure.

Hi, since you know now that it is the client, perhaps a bisect
would be more successful?

--
Chuck Lever

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18
  2022-04-29 13:37 ` Chuck Lever III
@ 2022-05-06 13:24   ` Dennis Dalessandro
  2022-05-13 11:58     ` Dennis Dalessandro
  0 siblings, 1 reply; 25+ messages in thread

From: Dennis Dalessandro @ 2022-05-06 13:24 UTC (permalink / raw)
To: Chuck Lever III; +Cc: Trond Myklebust, Linux NFS Mailing List

On 4/29/22 9:37 AM, Chuck Lever III wrote:
>
>
>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote:
>>
>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>> <dennis.dalessandro@cornelisnetworks.com> wrote:
>>>>>>>
>>>>>>> Hi NFS folks,
>>>>>>>
>>>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>>>> between 5.17 and
>>>>>>> 5.18-rc1. I've tried to bisect but not having any luck. The
>>>>>>> problem I'm seeing
>>>>>>> is it takes 3 minutes to copy a file from NFS to the local
>>>>>>> disk. When it should
>>>>>>> take less than half a second, which it did up through 5.17.
>>>>>>>
>>>>>>> It doesn't seem to be network related, but can't rule that out
>>>>>>> completely.
>>>>>>>
>>>>>>> I tried to bisect but the problem can be intermittent. Some
>>>>>>> runs I'll see a
>>>>>>> problem in 3 out of 100 cycles, sometimes 0 out of 100.
>>>>>>> Sometimes I'll see it
>>>>>>> 100 out of 100.
>>>>>>
>>>>>> It's not clear from your problem report whether the problem
>>>>>> appears
>>>>>> when it's the server running v5.18-rc or the client.
>>>>>
>>>>> That's because I don't know which it is. I'll do a quick test and
>>>>> find out. I
>>>>> was testing the same kernel across both nodes.
>>>>
>>>> Looks like it is the client.
>>>>
>>>> server  client  result
>>>> ------  ------  ------
>>>> 5.17    5.17    Pass
>>>> 5.17    5.18    Fail
>>>> 5.18    5.18    Fail
>>>> 5.18    5.17    Pass
>>>>
>>>> Is there a patch for the client issue you mentioned that I could try?
>>>>
>>>> -Denny
>>>
>>> Try this one
>>
>> Thanks for the patch. Unfortunately it doesn't seem to solve the issue, still
>> see intermittent hangs. I applied it on top of -rc4:
>>
>> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>>
>> real    2m6.072s
>> user    0m0.002s
>> sys     0m0.263s
>> Done
>>
>> While it was hung I checked the mem usage on the machine:
>>
>> # free -h
>>               total        used        free      shared  buff/cache   available
>> Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
>> Swap:         4.0Gi          0B       4.0Gi
>>
>> Doesn't appear to be under memory pressure.
>
> Hi, since you know now that it is the client, perhaps a bisect
> would be more successful?

I've been testing all week. I pulled the nfs-rdma tree that was sent to Linus
for 5.18 and tested. I see the problem on pretty much all the patches. However
it's the frequency that it hits which changes.

I'll see 1-5 cycles out of 2500 where the copy takes minutes up to:
"NFS: Convert readdir page cache to use a cookie based index"

After this I start seeing it around 10 times in 500 and by the last patch 10
times in less than 100.

Is there any kind of tracing/debugging I could turn on to get more insight on
what is taking so long when it does go bad?

-Denny

^ permalink raw reply	[flat|nested] 25+ messages in thread
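[Editorial note: on the tracing question above, one option (a suggestion, not something established later in this thread) is to capture the client-side sunrpc and rpcrdma tracepoints through tracefs while the slow copy is reproduced. The tracefs paths below are the standard locations, but whether the rpcrdma event group exists depends on the kernel configuration, so treat that as an assumption to verify.]

```shell
# Enable all sunrpc/rpcrdma tracepoints, reproduce the hang, then save
# the trace buffer. Needs root and a mounted debugfs/tracefs.
TRACEFS=/sys/kernel/debug/tracing
if [ -w "$TRACEFS/tracing_on" ]; then
    for grp in sunrpc rpcrdma; do
        # skip event groups this kernel does not provide
        [ -d "$TRACEFS/events/$grp" ] && echo 1 > "$TRACEFS/events/$grp/enable"
    done
    echo 1 > "$TRACEFS/tracing_on"
    # ... run the copy test here, then capture the buffer:
    cat "$TRACEFS/trace" > /tmp/nfs_slow_copy.trace
    echo "trace saved to /tmp/nfs_slow_copy.trace"
else
    echo "tracefs not writable; run as root with debugfs mounted" >&2
fi
```

The legacy `rpcdebug -m rpc -s all` interface is an alternative, at the cost of much noisier kernel-log output.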
* Re: NFS regression between 5.17 and 5.18
  2022-05-06 13:24 ` Dennis Dalessandro
@ 2022-05-13 11:58   ` Dennis Dalessandro
  2022-05-13 13:30     ` Trond Myklebust
  0 siblings, 1 reply; 25+ messages in thread

From: Dennis Dalessandro @ 2022-05-13 11:58 UTC (permalink / raw)
To: Chuck Lever III; +Cc: Trond Myklebust, Linux NFS Mailing List

On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
> On 4/29/22 9:37 AM, Chuck Lever III wrote:
>>
>>
>>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote:
>>>
>>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>>> <dennis.dalessandro@cornelisnetworks.com> wrote:
>>>>>>>>
>>>>>>>> Hi NFS folks,
>>>>>>>>
>>>>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>>>>> between 5.17 and
>>>>>>>> 5.18-rc1. I've tried to bisect but not having any luck. The
>>>>>>>> problem I'm seeing
>>>>>>>> is it takes 3 minutes to copy a file from NFS to the local
>>>>>>>> disk. When it should
>>>>>>>> take less than half a second, which it did up through 5.17.
>>>>>>>>
>>>>>>>> It doesn't seem to be network related, but can't rule that out
>>>>>>>> completely.
>>>>>>>>
>>>>>>>> I tried to bisect but the problem can be intermittent. Some
>>>>>>>> runs I'll see a
>>>>>>>> problem in 3 out of 100 cycles, sometimes 0 out of 100.
>>>>>>>> Sometimes I'll see it
>>>>>>>> 100 out of 100.
>>>>>>>
>>>>>>> It's not clear from your problem report whether the problem
>>>>>>> appears
>>>>>>> when it's the server running v5.18-rc or the client.
>>>>>>
>>>>>> That's because I don't know which it is. I'll do a quick test and
>>>>>> find out. I
>>>>>> was testing the same kernel across both nodes.
>>>>>
>>>>> Looks like it is the client.
>>>>>
>>>>> server  client  result
>>>>> ------  ------  ------
>>>>> 5.17    5.17    Pass
>>>>> 5.17    5.18    Fail
>>>>> 5.18    5.18    Fail
>>>>> 5.18    5.17    Pass
>>>>>
>>>>> Is there a patch for the client issue you mentioned that I could try?
>>>>>
>>>>> -Denny
>>>>
>>>> Try this one
>>>
>>> Thanks for the patch. Unfortunately it doesn't seem to solve the issue, still
>>> see intermittent hangs. I applied it on top of -rc4:
>>>
>>> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>>>
>>> real    2m6.072s
>>> user    0m0.002s
>>> sys     0m0.263s
>>> Done
>>>
>>> While it was hung I checked the mem usage on the machine:
>>>
>>> # free -h
>>>               total        used        free      shared  buff/cache   available
>>> Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
>>> Swap:         4.0Gi          0B       4.0Gi
>>>
>>> Doesn't appear to be under memory pressure.
>>
>> Hi, since you know now that it is the client, perhaps a bisect
>> would be more successful?
>
> I've been testing all week. I pulled the nfs-rdma tree that was sent to Linus
> for 5.18 and tested. I see the problem on pretty much all the patches. However
> it's the frequency that it hits which changes.
>
> I'll see 1-5 cycles out of 2500 where the copy takes minutes up to:
> "NFS: Convert readdir page cache to use a cookie based index"
>
> After this I start seeing it around 10 times in 500 and by the last patch 10
> times in less than 100.
>
> Is there any kind of tracing/debugging I could turn on to get more insight on
> what is taking so long when it does go bad?

Ran a test with -rc6 and this time see a hung task trace on the console as well
as an NFS RPC error.

[32719.991175] nfs: RPC call returned error 512
.
.
.
[32933.285126] INFO: task kworker/u145:23:886141 blocked for more than 122 seconds.
[32933.293543]       Tainted: G S                5.18.0-rc6 #1
[32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[32933.308740] task:kworker/u145:23 state:D stack:    0 pid:886141 ppid:     2 flags:0x00004000
[32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
[32933.324524] Call Trace:
[32933.327347]  <TASK>
[32933.329785]  __schedule+0x3dd/0x970
[32933.333783]  schedule+0x41/0xa0
[32933.337388]  xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
[32933.343639]  ? prepare_to_wait+0xd0/0xd0
[32933.348123]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
[32933.354183]  xprt_release+0x26/0x140 [sunrpc]
[32933.359168]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
[32933.365225]  rpc_release_resources_task+0xe/0x50 [sunrpc]
[32933.371381]  __rpc_execute+0x2c5/0x4e0 [sunrpc]
[32933.376564]  ? __switch_to_asm+0x42/0x70
[32933.381046]  ? finish_task_switch+0xb2/0x2c0
[32933.385918]  rpc_async_schedule+0x29/0x40 [sunrpc]
[32933.391391]  process_one_work+0x1c8/0x390
[32933.395975]  worker_thread+0x30/0x360
[32933.400162]  ? process_one_work+0x390/0x390
[32933.404931]  kthread+0xd9/0x100
[32933.408536]  ? kthread_complete_and_exit+0x20/0x20
[32933.413984]  ret_from_fork+0x22/0x30
[32933.418074]  </TASK>

The call trace shows up again at 245, 368, and 491 seconds. Same task, same
trace.

^ permalink raw reply	[flat|nested] 25+ messages in thread
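[Editorial aside, not advice given in the thread: a blocked-task dump like the one above can also be requested on demand through the magic SysRq interface, rather than waiting out the 122-second hung-task watchdog. This assumes SysRq is enabled on the system.]

```shell
# 'w' asks the kernel to dump all tasks in uninterruptible (D) sleep;
# the dump lands in the kernel ring buffer. Requires root.
dump_blocked_tasks() {
    if [ -w /proc/sysrq-trigger ]; then
        echo w > /proc/sysrq-trigger
        dmesg | tail -n 60          # show the freshly logged dump
    else
        echo "need root to write /proc/sysrq-trigger" >&2
    fi
}
```

Calling `dump_blocked_tasks` while the copy is stuck would show whether the rpciod kworker is parked in the same `xprt_request_dequeue_xprt` wait each time.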
* Re: NFS regression between 5.17 and 5.18
  2022-05-13 11:58 ` Dennis Dalessandro
@ 2022-05-13 13:30   ` Trond Myklebust
  2022-05-13 14:59     ` Chuck Lever III
  0 siblings, 1 reply; 25+ messages in thread

From: Trond Myklebust @ 2022-05-13 13:30 UTC (permalink / raw)
To: aglo, dennis.dalessandro, chuck.lever; +Cc: linux-nfs

On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
> On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
> > On 4/29/22 9:37 AM, Chuck Lever III wrote:
> > >
> > >
> > > > On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro
> > > > <dennis.dalessandro@cornelisnetworks.com> wrote:
> > > >
> > > > On 4/28/22 3:56 PM, Trond Myklebust wrote:
> > > > > On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
> > > > > > On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
> > > > > > > On 4/28/22 10:57 AM, Chuck Lever III wrote:
> > > > > > > > > On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
> > > > > > > > > <dennis.dalessandro@cornelisnetworks.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi NFS folks,
> > > > > > > > >
> > > > > > > > > I've noticed a pretty nasty regression in our NFS
> > > > > > > > > capability
> > > > > > > > > between 5.17 and
> > > > > > > > > 5.18-rc1. I've tried to bisect but not having any
> > > > > > > > > luck. The
> > > > > > > > > problem I'm seeing
> > > > > > > > > is it takes 3 minutes to copy a file from NFS to the
> > > > > > > > > local
> > > > > > > > > disk. When it should
> > > > > > > > > take less than half a second, which it did up through
> > > > > > > > > 5.17.
> > > > > > > > >
> > > > > > > > > It doesn't seem to be network related, but can't rule
> > > > > > > > > that out
> > > > > > > > > completely.
> > > > > > > > >
> > > > > > > > > I tried to bisect but the problem can be
> > > > > > > > > intermittent. Some
> > > > > > > > > runs I'll see a
> > > > > > > > > problem in 3 out of 100 cycles, sometimes 0 out of
> > > > > > > > > 100.
> > > > > > > > > Sometimes I'll see it
> > > > > > > > > 100 out of 100.
> > > > > > > >
> > > > > > > > It's not clear from your problem report whether the
> > > > > > > > problem
> > > > > > > > appears
> > > > > > > > when it's the server running v5.18-rc or the client.
> > > > > > >
> > > > > > > That's because I don't know which it is. I'll do a quick
> > > > > > > test and
> > > > > > > find out. I
> > > > > > > was testing the same kernel across both nodes.
> > > > > >
> > > > > > Looks like it is the client.
> > > > > >
> > > > > > server  client  result
> > > > > > ------  ------  ------
> > > > > > 5.17    5.17    Pass
> > > > > > 5.17    5.18    Fail
> > > > > > 5.18    5.18    Fail
> > > > > > 5.18    5.17    Pass
> > > > > >
> > > > > > Is there a patch for the client issue you mentioned that I
> > > > > > could try?
> > > > > >
> > > > > > -Denny
> > > > >
> > > > > Try this one
> > > >
> > > > Thanks for the patch. Unfortunately it doesn't seem to solve
> > > > the issue, still
> > > > see intermittent hangs. I applied it on top of -rc4:
> > > >
> > > > copy /mnt/nfs_test/run_nfs_test.junk to
> > > > /dev/shm/run_nfs_test.tmp...
> > > >
> > > > real    2m6.072s
> > > > user    0m0.002s
> > > > sys     0m0.263s
> > > > Done
> > > >
> > > > While it was hung I checked the mem usage on the machine:
> > > >
> > > > # free -h
> > > >               total        used        free      shared
> > > > buff/cache   available
> > > > Mem:           62Gi       871Mi        61Gi       342Mi
> > > > 889Mi        61Gi
> > > > Swap:         4.0Gi          0B       4.0Gi
> > > >
> > > > Doesn't appear to be under memory pressure.
> > >
> > > Hi, since you know now that it is the client, perhaps a bisect
> > > would be more successful?
> >
> > I've been testing all week. I pulled the nfs-rdma tree that was
> > sent to Linus
> > for 5.18 and tested. I see the problem on pretty much all the
> > patches. However
> > it's the frequency that it hits which changes.
> >
> > I'll see 1-5 cycles out of 2500 where the copy takes minutes up to:
> > "NFS: Convert readdir page cache to use a cookie based index"
> >
> > After this I start seeing it around 10 times in 500 and by the last
> > patch 10
> > times in less than 100.
> >
> > Is there any kind of tracing/debugging I could turn on to get more
> > insight on
> > what is taking so long when it does go bad?
>
> Ran a test with -rc6 and this time see a hung task trace on the
> console as well
> as an NFS RPC error.
>
> [32719.991175] nfs: RPC call returned error 512
> .
> .
> .
> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more
> than 122 seconds.
> [32933.293543]       Tainted: G S                5.18.0-rc6 #1
> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this
> message.
> [32933.308740] task:kworker/u145:23 state:D stack:    0 pid:886141
> ppid:     2
> flags:0x00004000
> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [32933.324524] Call Trace:
> [32933.327347]  <TASK>
> [32933.329785]  __schedule+0x3dd/0x970
> [32933.333783]  schedule+0x41/0xa0
> [32933.337388]  xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
> [32933.343639]  ? prepare_to_wait+0xd0/0xd0
> [32933.348123]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> [32933.354183]  xprt_release+0x26/0x140 [sunrpc]
> [32933.359168]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> [32933.365225]  rpc_release_resources_task+0xe/0x50 [sunrpc]
> [32933.371381]  __rpc_execute+0x2c5/0x4e0 [sunrpc]
> [32933.376564]  ? __switch_to_asm+0x42/0x70
> [32933.381046]  ? finish_task_switch+0xb2/0x2c0
> [32933.385918]  rpc_async_schedule+0x29/0x40 [sunrpc]
> [32933.391391]  process_one_work+0x1c8/0x390
> [32933.395975]  worker_thread+0x30/0x360
> [32933.400162]  ? process_one_work+0x390/0x390
> [32933.404931]  kthread+0xd9/0x100
> [32933.408536]  ? kthread_complete_and_exit+0x20/0x20
> [32933.413984]  ret_from_fork+0x22/0x30
> [32933.418074]  </TASK>
>
> The call trace shows up again at 245, 368, and 491 seconds. Same
> task, same trace.

That's very helpful. The above trace suggests that the RDMA code is
leaking a call to xprt_unpin_rqst().

From what I can see, rpcrdma_reply_handler() will always call
xprt_pin_rqst(), and that call is required to be matched by a call to
xprt_unpin_rqst() before the handler exits. However those calls to
xprt_unpin_rqst() appear to be scattered around the code, and it is
not at all obvious that the requirement is being fulfilled.

Chuck, why does that code need to be so complex?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com

^ permalink raw reply	[flat|nested] 25+ messages in thread
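[Editorial note: the invariant Trond describes (every xprt_pin_rqst() in rpcrdma_reply_handler() matched by an xprt_unpin_rqst() before the handler exits) can at least be surveyed mechanically. A rough sketch, assuming a kernel source tree is available; the file path is the real location of rpcrdma_reply_handler(), while the function name and default tree location are illustrative.]

```shell
# Print every pin/unpin call site in the RPC/RDMA reply path so the
# pairing can be audited by eye. The tree location is an assumption.
audit_pin_unpin() {
    ksrc=${1:-/usr/src/linux}
    f=$ksrc/net/sunrpc/xprtrdma/rpc_rdma.c
    if [ -f "$f" ]; then
        grep -nE 'xprt_(un)?pin_rqst' "$f"
    else
        echo "no kernel source at $ksrc (pass the tree as an argument)" >&2
        return 1
    fi
}
```

Comparing the line numbers of the pin against the unpin sites on each error path is one way to spot the exit that leaks the pin.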
* Re: NFS regression between 5.17 and 5.18 2022-05-13 13:30 ` Trond Myklebust @ 2022-05-13 14:59 ` Chuck Lever III 2022-05-13 15:19 ` Trond Myklebust 2022-05-17 13:40 ` Dennis Dalessandro 0 siblings, 2 replies; 25+ messages in thread From: Chuck Lever III @ 2022-05-13 14:59 UTC (permalink / raw) To: Trond Myklebust Cc: Olga Kornievskaia, dennis.dalessandro, Linux NFS Mailing List > On May 13, 2022, at 9:30 AM, Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote: >> On 5/6/22 9:24 AM, Dennis Dalessandro wrote: >>> On 4/29/22 9:37 AM, Chuck Lever III wrote: >>>> >>>> >>>>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro >>>>> <dennis.dalessandro@cornelisnetworks.com> wrote: >>>>> >>>>> On 4/28/22 3:56 PM, Trond Myklebust wrote: >>>>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote: >>>>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote: >>>>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote: >>>>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro >>>>>>>>>> <dennis.dalessandro@cornelisnetworks.com> wrote: >>>>>>>>>> >>>>>>>>>> Hi NFS folks, >>>>>>>>>> >>>>>>>>>> I've noticed a pretty nasty regression in our NFS >>>>>>>>>> capability >>>>>>>>>> between 5.17 and >>>>>>>>>> 5.18-rc1. I've tried to bisect but not having any >>>>>>>>>> luck. The >>>>>>>>>> problem I'm seeing >>>>>>>>>> is it takes 3 minutes to copy a file from NFS to the >>>>>>>>>> local >>>>>>>>>> disk. When it should >>>>>>>>>> take less than half a second, which it did up through >>>>>>>>>> 5.17. >>>>>>>>>> >>>>>>>>>> It doesn't seem to be network related, but can't rule >>>>>>>>>> that out >>>>>>>>>> completely. >>>>>>>>>> >>>>>>>>>> I tried to bisect but the problem can be >>>>>>>>>> intermittent. Some >>>>>>>>>> runs I'll see a >>>>>>>>>> problem in 3 out of 100 cycles, sometimes 0 out of >>>>>>>>>> 100. >>>>>>>>>> Sometimes I'll see it >>>>>>>>>> 100 out of 100. 
>>>>>>>>> >>>>>>>>> It's not clear from your problem report whether the >>>>>>>>> problem >>>>>>>>> appears >>>>>>>>> when it's the server running v5.18-rc or the client. >>>>>>>> >>>>>>>> That's because I don't know which it is. I'll do a quick >>>>>>>> test and >>>>>>>> find out. I >>>>>>>> was testing the same kernel across both nodes. >>>>>>> >>>>>>> Looks like it is the client. >>>>>>> >>>>>>> server client result >>>>>>> ------ ------ ------ >>>>>>> 5.17 5.17 Pass >>>>>>> 5.17 5.18 Fail >>>>>>> 5.18 5.18 Fail >>>>>>> 5.18 5.17 Pass >>>>>>> >>>>>>> Is there a patch for the client issue you mentioned that I >>>>>>> could try? >>>>>>> >>>>>>> -Denny >>>>>> >>>>>> Try this one >>>>> >>>>> Thanks for the patch. Unfortunately it doesn't seem to solve >>>>> the issue, still >>>>> see intermittent hangs. I applied it on top of -rc4: >>>>> >>>>> copy /mnt/nfs_test/run_nfs_test.junk to >>>>> /dev/shm/run_nfs_test.tmp... >>>>> >>>>> real 2m6.072s >>>>> user 0m0.002s >>>>> sys 0m0.263s >>>>> Done >>>>> >>>>> While it was hung I checked the mem usage on the machine: >>>>> >>>>> # free -h >>>>> total used free shared >>>>> buff/cache available >>>>> Mem: 62Gi 871Mi 61Gi 342Mi >>>>> 889Mi 61Gi >>>>> Swap: 4.0Gi 0B 4.0Gi >>>>> >>>>> Doesn't appear to be under memory pressure. >>>> >>>> Hi, since you know now that it is the client, perhaps a bisect >>>> would be more successful? >>> >>> I've been testing all week. I pulled the nfs-rdma tree that was >>> sent to Linus >>> for 5.18 and tested. I see the problem on pretty much all the >>> patches. However >>> it's the frequency that it hits which changes. >>> >>> I'll see 1-5 cycles out of 2500 where the copy takes minutes up to: >>> "NFS: Convert readdir page cache to use a cookie based index" >>> >>> After this I start seeing it around 10 times in 500 and by the last >>> patch 10 >>> times in less than 100. 
>>> >>> Is there any kind of tracing/debugging I could turn on to get more >>> insight on >>> what is taking so long when it does go bad? >>> >> >> Ran a test with -rc6 and this time see a hung task trace on the >> console as well >> as an NFS RPC error. >> >> [32719.991175] nfs: RPC call returned error 512 >> . >> . >> . >> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more >> than 122 seconds. >> [32933.293543] Tainted: G S 5.18.0-rc6 #1 >> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> disables this >> message. >> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141 >> ppid: 2 >> flags:0x00004000 >> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc] >> [32933.324524] Call Trace: >> [32933.327347] <TASK> >> [32933.329785] __schedule+0x3dd/0x970 >> [32933.333783] schedule+0x41/0xa0 >> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc] >> [32933.343639] ? prepare_to_wait+0xd0/0xd0 >> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >> [32933.354183] xprt_release+0x26/0x140 [sunrpc] >> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc] >> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc] >> [32933.376564] ? __switch_to_asm+0x42/0x70 >> [32933.381046] ? finish_task_switch+0xb2/0x2c0 >> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc] >> [32933.391391] process_one_work+0x1c8/0x390 >> [32933.395975] worker_thread+0x30/0x360 >> [32933.400162] ? process_one_work+0x390/0x390 >> [32933.404931] kthread+0xd9/0x100 >> [32933.408536] ? kthread_complete_and_exit+0x20/0x20 >> [32933.413984] ret_from_fork+0x22/0x30 >> [32933.418074] </TASK> >> >> The call trace shows up again at 245, 368, and 491 seconds. Same >> task, same trace. >> >> >> >> > > That's very helpful. The above trace suggests that the RDMA code is > leaking a call to xprt_unpin_rqst(). 
IMHO this is unlikely to be related to the performance regression -- none of this code has changed in the past 5 kernel releases. Could be a different issue, though. As is often the case in these situations, the INFO trace above happens long after the issue that caused the missing unpin. So... unless Dennis has a reproducer that can trigger the issue frequently, I don't think there's much that can be extracted from that. Also "nfs: RPC call returned error 512" suggests someone hit ^C at some point. It's always possible that the xprt_rdma_free() path is missing an unpin. But again, that's not likely to be related to performance. > From what I can see, rpcrdma_reply_handler() will always call > xprt_pin_rqst(), and that call is required to be matched by a call to > xprt_unpin_rqst() before the handler exits. However those calls to > xprt_unpin_rqst() appear to be scattered around the code, and it is not > at all obvious that the requirement is being fulfilled. I don't think unpin is required before rpcrdma_reply_handler() exits. It /is/ required after xprt_complete_rqst(), and sometimes that call has to be deferred until the NIC is fully done with the RPC's buffers. The NIC is often not finished with the buffers in the Receive completion handler. > Chuck, why does that code need to be so complex? The overall goal is that the upper layer cannot be awoken until the NIC has finished with the RPC's header and payload buffers. These buffers are treated differently depending on how the send path chose to marshal the RPC. And, all of these are hot paths, so it's important to ensure that we are waiting/context-switching as infrequently as possible. We wake the upper layer waiter during an RDMA completion to reduce context switching. The first completion does not always mean there's nothing left for the NIC to do. Ignoring the server backchannel, there is a single unpin in the "complete without error" case in rpcrdma_complete_rqst(). 
There are three call sites for rpcrdma_complete_rqst:

1. rpcrdma_reply_done - called when no explicit invalidation is needed and the Receive completion fired after the Send completion.

2. rpcrdma_sendctx_done - called when no explicit invalidation is needed and the Send completion fired after the Receive completion.

3. frwr_wc_localinv_done - called when explicit invalidation was needed and that invalidation has completed. SQ ordering guarantees that the Send has also completed at this point, so a separate wait for Send completion is not needed.

These are the points when the NIC has finished with all of the memory resources associated with an RPC, so that's when it is safe to wake the waiter and unpin the RPC.

-- Chuck Lever

^ permalink raw reply [flat|nested] 25+ messages in thread
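The "whichever completion fires last" pattern behind call sites 1 and 2 can be modeled as a two-event countdown: both the Send and the Receive completion decrement a counter, and only the handler that brings it to zero wakes the upper layer, so the first completion adds no extra context switch. A hedged sketch of that pattern (illustrative names, not the xprtrdma source):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of an RPC that must observe both a Send and a Receive
 * completion before the upper layer may be woken. */
struct rdma_req {
    atomic_int outstanding;  /* completions still expected */
    bool completed;          /* models rpcrdma_complete_rqst() having run */
};

static void req_init(struct rdma_req *req)
{
    atomic_store(&req->outstanding, 2);  /* one Send + one Receive */
    req->completed = false;
}

/* Called from either completion handler, in either order; only the
 * last one completes the request. */
static void completion_fired(struct rdma_req *req)
{
    if (atomic_fetch_sub(&req->outstanding, 1) == 1)
        req->completed = true;  /* NIC is now done with all buffers */
}
```

atomic_fetch_sub returns the prior value, so exactly one caller observes 1 and performs the completion, regardless of whether Send or Receive finishes first.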
* Re: NFS regression between 5.17 and 5.18 2022-05-13 14:59 ` Chuck Lever III @ 2022-05-13 15:19 ` Trond Myklebust 2022-05-13 18:53 ` Chuck Lever III 2022-05-17 13:40 ` Dennis Dalessandro 1 sibling, 1 reply; 25+ messages in thread From: Trond Myklebust @ 2022-05-13 15:19 UTC (permalink / raw) To: chuck.lever; +Cc: linux-nfs, aglo, dennis.dalessandro On Fri, 2022-05-13 at 14:59 +0000, Chuck Lever III wrote: > > > > On May 13, 2022, at 9:30 AM, Trond Myklebust > > <trondmy@hammerspace.com> wrote: > > > > On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote: > > > On 5/6/22 9:24 AM, Dennis Dalessandro wrote: > > > > On 4/29/22 9:37 AM, Chuck Lever III wrote: > > > > > > > > > > > > > > > > On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro > > > > > > <dennis.dalessandro@cornelisnetworks.com> wrote: > > > > > > > > > > > > On 4/28/22 3:56 PM, Trond Myklebust wrote: > > > > > > > On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro > > > > > > > wrote: > > > > > > > > On 4/28/22 11:42 AM, Dennis Dalessandro wrote: > > > > > > > > > On 4/28/22 10:57 AM, Chuck Lever III wrote: > > > > > > > > > > > On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro > > > > > > > > > > > <dennis.dalessandro@cornelisnetworks.com> wrote: > > > > > > > > > > > > > > > > > > > > > > Hi NFS folks, > > > > > > > > > > > > > > > > > > > > > > I've noticed a pretty nasty regression in our NFS > > > > > > > > > > > capability > > > > > > > > > > > between 5.17 and > > > > > > > > > > > 5.18-rc1. I've tried to bisect but not having any > > > > > > > > > > > luck. The > > > > > > > > > > > problem I'm seeing > > > > > > > > > > > is it takes 3 minutes to copy a file from NFS to > > > > > > > > > > > the > > > > > > > > > > > local > > > > > > > > > > > disk. When it should > > > > > > > > > > > take less than half a second, which it did up > > > > > > > > > > > through > > > > > > > > > > > 5.17. 
> > > > > > > > > > > > > > > > > > > > > > It doesn't seem to be network related, but can't > > > > > > > > > > > rule > > > > > > > > > > > that out > > > > > > > > > > > completely. > > > > > > > > > > > > > > > > > > > > > > I tried to bisect but the problem can be > > > > > > > > > > > intermittent. Some > > > > > > > > > > > runs I'll see a > > > > > > > > > > > problem in 3 out of 100 cycles, sometimes 0 out > > > > > > > > > > > of > > > > > > > > > > > 100. > > > > > > > > > > > Sometimes I'll see it > > > > > > > > > > > 100 out of 100. > > > > > > > > > > > > > > > > > > > > It's not clear from your problem report whether the > > > > > > > > > > problem > > > > > > > > > > appears > > > > > > > > > > when it's the server running v5.18-rc or the > > > > > > > > > > client. > > > > > > > > > > > > > > > > > > That's because I don't know which it is. I'll do a > > > > > > > > > quick > > > > > > > > > test and > > > > > > > > > find out. I > > > > > > > > > was testing the same kernel across both nodes. > > > > > > > > > > > > > > > > Looks like it is the client. > > > > > > > > > > > > > > > > server client result > > > > > > > > ------ ------ ------ > > > > > > > > 5.17 5.17 Pass > > > > > > > > 5.17 5.18 Fail > > > > > > > > 5.18 5.18 Fail > > > > > > > > 5.18 5.17 Pass > > > > > > > > > > > > > > > > Is there a patch for the client issue you mentioned > > > > > > > > that I > > > > > > > > could try? > > > > > > > > > > > > > > > > -Denny > > > > > > > > > > > > > > Try this one > > > > > > > > > > > > Thanks for the patch. Unfortunately it doesn't seem to > > > > > > solve > > > > > > the issue, still > > > > > > see intermittent hangs. I applied it on top of -rc4: > > > > > > > > > > > > copy /mnt/nfs_test/run_nfs_test.junk to > > > > > > /dev/shm/run_nfs_test.tmp... 
> > > > > > > > > > > > real 2m6.072s > > > > > > user 0m0.002s > > > > > > sys 0m0.263s > > > > > > Done > > > > > > > > > > > > While it was hung I checked the mem usage on the machine: > > > > > > > > > > > > # free -h > > > > > > total used free shared > > > > > > buff/cache available > > > > > > Mem: 62Gi 871Mi 61Gi > > > > > > 342Mi > > > > > > 889Mi 61Gi > > > > > > Swap: 4.0Gi 0B 4.0Gi > > > > > > > > > > > > Doesn't appear to be under memory pressure. > > > > > > > > > > Hi, since you know now that it is the client, perhaps a > > > > > bisect > > > > > would be more successful? > > > > > > > > I've been testing all week. I pulled the nfs-rdma tree that was > > > > sent to Linus > > > > for 5.18 and tested. I see the problem on pretty much all the > > > > patches. However > > > > it's the frequency that it hits which changes. > > > > > > > > I'll see 1-5 cycles out of 2500 where the copy takes minutes up > > > > to: > > > > "NFS: Convert readdir page cache to use a cookie based index" > > > > > > > > After this I start seeing it around 10 times in 500 and by the > > > > last > > > > patch 10 > > > > times in less than 100. > > > > > > > > Is there any kind of tracing/debugging I could turn on to get > > > > more > > > > insight on > > > > what is taking so long when it does go bad? > > > > > > > > > > Ran a test with -rc6 and this time see a hung task trace on the > > > console as well > > > as an NFS RPC error. > > > > > > [32719.991175] nfs: RPC call returned error 512 > > > . > > > . > > > . > > > [32933.285126] INFO: task kworker/u145:23:886141 blocked for more > > > than 122 seconds. > > > [32933.293543] Tainted: G S 5.18.0-rc6 #1 > > > [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > > disables this > > > message. 
> > > [32933.308740] task:kworker/u145:23 state:D stack: 0 > > > pid:886141 > > > ppid: 2 > > > flags:0x00004000 > > > [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc] > > > [32933.324524] Call Trace: > > > [32933.327347] <TASK> > > > [32933.329785] __schedule+0x3dd/0x970 > > > [32933.333783] schedule+0x41/0xa0 > > > [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc] > > > [32933.343639] ? prepare_to_wait+0xd0/0xd0 > > > [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] > > > [32933.354183] xprt_release+0x26/0x140 [sunrpc] > > > [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] > > > [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc] > > > [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc] > > > [32933.376564] ? __switch_to_asm+0x42/0x70 > > > [32933.381046] ? finish_task_switch+0xb2/0x2c0 > > > [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc] > > > [32933.391391] process_one_work+0x1c8/0x390 > > > [32933.395975] worker_thread+0x30/0x360 > > > [32933.400162] ? process_one_work+0x390/0x390 > > > [32933.404931] kthread+0xd9/0x100 > > > [32933.408536] ? kthread_complete_and_exit+0x20/0x20 > > > [32933.413984] ret_from_fork+0x22/0x30 > > > [32933.418074] </TASK> > > > > > > The call trace shows up again at 245, 368, and 491 seconds. Same > > > task, same trace. > > > > > > > > > > > > > > > > That's very helpful. The above trace suggests that the RDMA code is > > leaking a call to xprt_unpin_rqst(). > > IMHO this is unlikely to be related to the performance > regression -- none of this code has changed in the past 5 > kernel releases. Could be a different issue, though. > > As is often the case in these situations, the INFO trace > above happens long after the issue that caused the missing > unpin. So... unless Dennis has a reproducer that can trigger > the issue frequently, I don't think there's much that can > be extracted from that. 
> > Also "nfs: RPC call returned error 512" suggests someone > hit ^C at some point. It's always possible that the > xprt_rdma_free() path is missing an unpin. But again, > that's not likely to be related to performance. > > > > From what I can see, rpcrdma_reply_handler() will always call > > xprt_pin_rqst(), and that call is required to be matched by a call > > to > > xprt_unpin_rqst() before the handler exits. However those calls to > > xprt_unpin_rqst() appear to be scattered around the code, and it is > > not > > at all obvious that the requirement is being fulfilled. > > I don't think unpin is required before rpcrdma_reply_handler() > exits. It /is/ required after xprt_complete_rqst(), and > sometimes that call has to be deferred until the NIC is > fully done with the RPC's buffers. The NIC is often not > finished with the buffers in the Receive completion handler. > > > > Chuck, why does that code need to be so complex? > > The overall goal is that the upper layer cannot be awoken > until the NIC has finished with the RPC's header and > payload buffers. These buffers are treated differently > depending on how the send path chose to marshal the RPC. > > And, all of these are hot paths, so it's important to > ensure that we are waiting/context-switching as > infrequently as possible. We wake the upper layer waiter > during an RDMA completion to reduce context switching. > The first completion does not always mean there's nothing > left for the NIC to do. > > Ignoring the server backchannel, there is a single unpin > in the "complete without error" case in > rpcrdma_complete_rqst(). > > There are three call sites for rpcrdma_complete_rqst: > > 1. rpcrdma_reply_done - called when no explicit invalidation > is needed and the Receive completion fired after the Send > completion. > > 2. rpcrdma_sendctx_done - called when no explicit invalidation > is needed and the Send completion fired after the Receive > completion. > > 3. 
frwr_wc_localinv_done - called when explicit invalidation > was needed and that invalidation has completed. SQ ordering > guarantees that the Send has also completed at this point, > so a separate wait for Send completion is not needed. > > These are the points when the NIC has finished with all of > the memory resources associated with an RPC, so that's > when it is safe to wake the waiter and unpin the RPC. >

From what I can see, what's important in the case of RDMA is to ensure that we call exactly 1 of either frwr_unmap_async(), frwr_unmap_sync() or rpcrdma_complete_rqst(), and that whichever operation we select must complete before we call xprt->ops->free_slot().

xprt_pin_rqst()/xprt_unpin_rqst() can help to ensure exclusion between those 3 operations, but it is not required if there are other ways to do so. In fact, since xprt_wait_on_pinned_rqst() forces a synchronous sleep in xprt_release(), it would be much preferable to find a better solution than to pin the request if your concern is performance.

-- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com

^ permalink raw reply [flat|nested] 25+ messages in thread
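Trond's requirement above is an exactly-once hand-off: precisely one of the three teardown paths may run for a given request, and it must finish before free_slot(). One common lock-free way to get the "exactly one" part without pinning is a single atomic claim that each candidate path must win before proceeding. A sketch under that assumption (illustrative only, not a proposed patch):

```c
#include <stdatomic.h>

/* The three candidate teardown paths named in the discussion. */
enum teardown { TEARDOWN_NONE = 0, UNMAP_ASYNC, UNMAP_SYNC, COMPLETE_RQST };

struct slot {
    _Atomic enum teardown chosen;  /* which path claimed the request */
};

/* Each candidate path calls this before doing any work; only the
 * first claim succeeds, so exactly one of the three operations ever
 * runs for a given request. Returns nonzero on a successful claim. */
static int claim_teardown(struct slot *s, enum teardown path)
{
    enum teardown expect = TEARDOWN_NONE;
    return atomic_compare_exchange_strong(&s->chosen, &expect, path);
}
```

This only models the mutual exclusion; the second half of the requirement, completing the chosen operation before free_slot() runs, still needs an ordering mechanism such as a completion wait.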
* Re: NFS regression between 5.17 and 5.18 2022-05-13 15:19 ` Trond Myklebust @ 2022-05-13 18:53 ` Chuck Lever III 0 siblings, 0 replies; 25+ messages in thread From: Chuck Lever III @ 2022-05-13 18:53 UTC (permalink / raw) To: Trond Myklebust Cc: Linux NFS Mailing List, Olga Kornievskaia, dennis.dalessandro > On May 13, 2022, at 11:19 AM, Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Fri, 2022-05-13 at 14:59 +0000, Chuck Lever III wrote: >> >> >>> On May 13, 2022, at 9:30 AM, Trond Myklebust >>> <trondmy@hammerspace.com> wrote: >>> >>> On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote: >>>> On 5/6/22 9:24 AM, Dennis Dalessandro wrote: >>>>> On 4/29/22 9:37 AM, Chuck Lever III wrote: >>>>>> >>>>>> >>>>>>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro >>>>>>> <dennis.dalessandro@cornelisnetworks.com> wrote: >>>>>>> >>>>>>> On 4/28/22 3:56 PM, Trond Myklebust wrote: >>>>>>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro >>>>>>>> wrote: >>>>>>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote: >>>>>>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote: >>>>>>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro >>>>>>>>>>>> <dennis.dalessandro@cornelisnetworks.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi NFS folks, >>>>>>>>>>>> >>>>>>>>>>>> I've noticed a pretty nasty regression in our NFS >>>>>>>>>>>> capability >>>>>>>>>>>> between 5.17 and >>>>>>>>>>>> 5.18-rc1. I've tried to bisect but not having any >>>>>>>>>>>> luck. The >>>>>>>>>>>> problem I'm seeing >>>>>>>>>>>> is it takes 3 minutes to copy a file from NFS to >>>>>>>>>>>> the >>>>>>>>>>>> local >>>>>>>>>>>> disk. When it should >>>>>>>>>>>> take less than half a second, which it did up >>>>>>>>>>>> through >>>>>>>>>>>> 5.17. >>>>>>>>>>>> >>>>>>>>>>>> It doesn't seem to be network related, but can't >>>>>>>>>>>> rule >>>>>>>>>>>> that out >>>>>>>>>>>> completely. >>>>>>>>>>>> >>>>>>>>>>>> I tried to bisect but the problem can be >>>>>>>>>>>> intermittent. 
Some >>>>>>>>>>>> runs I'll see a >>>>>>>>>>>> problem in 3 out of 100 cycles, sometimes 0 out >>>>>>>>>>>> of >>>>>>>>>>>> 100. >>>>>>>>>>>> Sometimes I'll see it >>>>>>>>>>>> 100 out of 100. >>>>>>>>>>> >>>>>>>>>>> It's not clear from your problem report whether the >>>>>>>>>>> problem >>>>>>>>>>> appears >>>>>>>>>>> when it's the server running v5.18-rc or the >>>>>>>>>>> client. >>>>>>>>>> >>>>>>>>>> That's because I don't know which it is. I'll do a >>>>>>>>>> quick >>>>>>>>>> test and >>>>>>>>>> find out. I >>>>>>>>>> was testing the same kernel across both nodes. >>>>>>>>> >>>>>>>>> Looks like it is the client. >>>>>>>>> >>>>>>>>> server client result >>>>>>>>> ------ ------ ------ >>>>>>>>> 5.17 5.17 Pass >>>>>>>>> 5.17 5.18 Fail >>>>>>>>> 5.18 5.18 Fail >>>>>>>>> 5.18 5.17 Pass >>>>>>>>> >>>>>>>>> Is there a patch for the client issue you mentioned >>>>>>>>> that I >>>>>>>>> could try? >>>>>>>>> >>>>>>>>> -Denny >>>>>>>> >>>>>>>> Try this one >>>>>>> >>>>>>> Thanks for the patch. Unfortunately it doesn't seem to >>>>>>> solve >>>>>>> the issue, still >>>>>>> see intermittent hangs. I applied it on top of -rc4: >>>>>>> >>>>>>> copy /mnt/nfs_test/run_nfs_test.junk to >>>>>>> /dev/shm/run_nfs_test.tmp... >>>>>>> >>>>>>> real 2m6.072s >>>>>>> user 0m0.002s >>>>>>> sys 0m0.263s >>>>>>> Done >>>>>>> >>>>>>> While it was hung I checked the mem usage on the machine: >>>>>>> >>>>>>> # free -h >>>>>>> total used free shared >>>>>>> buff/cache available >>>>>>> Mem: 62Gi 871Mi 61Gi >>>>>>> 342Mi >>>>>>> 889Mi 61Gi >>>>>>> Swap: 4.0Gi 0B 4.0Gi >>>>>>> >>>>>>> Doesn't appear to be under memory pressure. >>>>>> >>>>>> Hi, since you know now that it is the client, perhaps a >>>>>> bisect >>>>>> would be more successful? >>>>> >>>>> I've been testing all week. I pulled the nfs-rdma tree that was >>>>> sent to Linus >>>>> for 5.18 and tested. I see the problem on pretty much all the >>>>> patches. However >>>>> it's the frequency that it hits which changes. 
>>>>> >>>>> I'll see 1-5 cycles out of 2500 where the copy takes minutes up >>>>> to: >>>>> "NFS: Convert readdir page cache to use a cookie based index" >>>>> >>>>> After this I start seeing it around 10 times in 500 and by the >>>>> last >>>>> patch 10 >>>>> times in less than 100. >>>>> >>>>> Is there any kind of tracing/debugging I could turn on to get >>>>> more >>>>> insight on >>>>> what is taking so long when it does go bad? >>>>> >>>> >>>> Ran a test with -rc6 and this time see a hung task trace on the >>>> console as well >>>> as an NFS RPC error. >>>> >>>> [32719.991175] nfs: RPC call returned error 512 >>>> . >>>> . >>>> . >>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more >>>> than 122 seconds. >>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1 >>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this >>>> message. >>>> [32933.308740] task:kworker/u145:23 state:D stack: 0 >>>> pid:886141 >>>> ppid: 2 >>>> flags:0x00004000 >>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc] >>>> [32933.324524] Call Trace: >>>> [32933.327347] <TASK> >>>> [32933.329785] __schedule+0x3dd/0x970 >>>> [32933.333783] schedule+0x41/0xa0 >>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc] >>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0 >>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc] >>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc] >>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc] >>>> [32933.376564] ? __switch_to_asm+0x42/0x70 >>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0 >>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc] >>>> [32933.391391] process_one_work+0x1c8/0x390 >>>> [32933.395975] worker_thread+0x30/0x360 >>>> [32933.400162] ? process_one_work+0x390/0x390 >>>> [32933.404931] kthread+0xd9/0x100 >>>> [32933.408536] ? 
kthread_complete_and_exit+0x20/0x20 >>>> [32933.413984] ret_from_fork+0x22/0x30 >>>> [32933.418074] </TASK> >>>> >>>> The call trace shows up again at 245, 368, and 491 seconds. Same >>>> task, same trace. >>>> >>>> >>>> >>>> >>> >>> That's very helpful. The above trace suggests that the RDMA code is >>> leaking a call to xprt_unpin_rqst(). >> >> IMHO this is unlikely to be related to the performance >> regression -- none of this code has changed in the past 5 >> kernel releases. Could be a different issue, though. >> >> As is often the case in these situations, the INFO trace >> above happens long after the issue that caused the missing >> unpin. So... unless Dennis has a reproducer that can trigger >> the issue frequently, I don't think there's much that can >> be extracted from that. >> >> Also "nfs: RPC call returned error 512" suggests someone >> hit ^C at some point. It's always possible that the >> xprt_rdma_free() path is missing an unpin. But again, >> that's not likely to be related to performance. >> >> >>> From what I can see, rpcrdma_reply_handler() will always call >>> xprt_pin_rqst(), and that call is required to be matched by a call >>> to >>> xprt_unpin_rqst() before the handler exits. However those calls to >>> xprt_unpin_rqst() appear to be scattered around the code, and it is >>> not >>> at all obvious that the requirement is being fulfilled. >> >> I don't think unpin is required before rpcrdma_reply_handler() >> exits. It /is/ required after xprt_complete_rqst(), and >> sometimes that call has to be deferred until the NIC is >> fully done with the RPC's buffers. The NIC is often not >> finished with the buffers in the Receive completion handler. >> >> >>> Chuck, why does that code need to be so complex? >> >> The overall goal is that the upper layer cannot be awoken >> until the NIC has finished with the RPC's header and >> payload buffers. These buffers are treated differently >> depending on how the send path chose to marshal the RPC. 
>> >> And, all of these are hot paths, so it's important to >> ensure that we are waiting/context-switching as >> infrequently as possible. We wake the upper layer waiter >> during an RDMA completion to reduce context switching. >> The first completion does not always mean there's nothing >> left for the NIC to do. >> >> Ignoring the server backchannel, there is a single unpin >> in the "complete without error" case in >> rpcrdma_complete_rqst(). >> >> There are three call sites for rpcrdma_complete_rqst: >> >> 1. rpcrdma_reply_done - called when no explicit invalidation >> is needed and the Receive completion fired after the Send >> completion. >> >> 2. rpcrdma_sendctx_done - called when no explicit invalidation >> is needed and the Send completion fired after the Receive >> completion. >> >> 3. frwr_wc_localinv_done - called when explicit invalidation >> was needed and that invalidation has completed. SQ ordering >> guarantees that the Send has also completed at this point, >> so a separate wait for Send completion is not needed. >> >> These are the points when the NIC has finished with all of >> the memory resources associated with an RPC, so that's >> when it is safe to wake the waiter and unpin the RPC. >> > > From what I can see, what's important in the case of RDMA is to ensure > that we call exactly 1 of either frwr_unmap_async(), frwr_unmap_sync() > or rpcrdma_complete_rqst(), No, all of these mechanisms are supposed to end up at rpcrdma_complete_rqst(). That's where the RPC/RDMA transport header is parsed. Reply completion processing is kicked off by a Receive completion. If local invalidation is unneeded, then we have to wait explicitly for the Send completion too. Otherwise, a LOCAL_INV is posted and we wait for that to complete; that completion means the previous Send is complete, so waiting for Send completion in that case is unnecessary. When all of that dust settles, rpcrdma_complete_rqst() is called to wake up the upper layer. 
That's basically how xprtsock did it when I wrote all this years ago. (The Send completion is needed because the NIC is free to continue to read from sndbuf until then. We can't release that buffer until completion because otherwise a NIC retransmission could expose the different contents on the wire, or worse, the receiving end would see garbage. Works a bit like TCP retransmission). > and that whichever operation we select must > complete before we call xprt->ops->free_slot(). The NIC has to tell xprtrdma that its memory operations are complete before RPC can XDR decode either the RPC header or the NFS request itself. That's well before ->free_slot(), IIRC. > xprt_pin_rqst()/xprt_unpin_rqst() can help to ensure exclusion between > those 3 operations, but it is not required if there are other ways to > do so. > In fact since xprt_wait_on_pinned_rqst() forces a synchronous sleep in > xprt_release(), it would be much preferable to find a better solution > than to pin the request if you concern is performance. In particular, my performance concern is to avoid waits and context switches, because those add latency, sometimes quite a bit. I've tried to build the completion process on only the unavoidable context switches. I don't immediately see how to do this by adding wait_on_pinned to the fray. frwr_unmap_sync() is used only to deal with ^C, so it can be ugly and/or slow. I am open to discussion about the structure of this code. The simpler the better, but there are lots of moving parts. So feedback is always welcome. -- Chuck Lever ^ permalink raw reply [flat|nested] 25+ messages in thread
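The flow Chuck describes - process the Receive completion, then gate the wakeup on either an explicit Send completion or on a LOCAL_INV completion that, by SQ ordering, implies the Send has also finished, and only then run rpcrdma_complete_rqst() - reduces to a small decision about which event is the final gate. A sketch of just that decision (illustrative, not the driver code):

```c
#include <stdbool.h>

/* Which event finally permits rpcrdma_complete_rqst() to run, per the
 * flow described above. */
enum final_event { WAIT_SEND_COMPLETION, WAIT_LOCALINV_COMPLETION };

/* After the Receive completion has been processed: if registered
 * memory must be explicitly invalidated, a LOCAL_INV is posted and
 * its completion is the gate (SQ ordering means the Send is also done
 * by then); otherwise the Send completion itself is the gate. */
static enum final_event completion_gate(bool needs_invalidation)
{
    return needs_invalidation ? WAIT_LOCALINV_COMPLETION
                              : WAIT_SEND_COMPLETION;
}
```

Either way, the waiter is woken exactly once, at the first moment the NIC can no longer touch the request's buffers.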
* Re: NFS regression between 5.17 and 5.18 2022-05-13 14:59 ` Chuck Lever III 2022-05-13 15:19 ` Trond Myklebust @ 2022-05-17 13:40 ` Dennis Dalessandro 2022-05-17 14:02 ` Chuck Lever III 1 sibling, 1 reply; 25+ messages in thread From: Dennis Dalessandro @ 2022-05-17 13:40 UTC (permalink / raw) To: Chuck Lever III, Trond Myklebust Cc: Olga Kornievskaia, Linux NFS Mailing List On 5/13/22 10:59 AM, Chuck Lever III wrote: >>> >>> Ran a test with -rc6 and this time see a hung task trace on the >>> console as well >>> as an NFS RPC error. >>> >>> [32719.991175] nfs: RPC call returned error 512 >>> . >>> . >>> . >>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more >>> than 122 seconds. >>> [32933.293543] Tainted: G S 5.18.0-rc6 #1 >>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>> disables this >>> message. >>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141 >>> ppid: 2 >>> flags:0x00004000 >>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc] >>> [32933.324524] Call Trace: >>> [32933.327347] <TASK> >>> [32933.329785] __schedule+0x3dd/0x970 >>> [32933.333783] schedule+0x41/0xa0 >>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc] >>> [32933.343639] ? prepare_to_wait+0xd0/0xd0 >>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>> [32933.354183] xprt_release+0x26/0x140 [sunrpc] >>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc] >>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc] >>> [32933.376564] ? __switch_to_asm+0x42/0x70 >>> [32933.381046] ? finish_task_switch+0xb2/0x2c0 >>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc] >>> [32933.391391] process_one_work+0x1c8/0x390 >>> [32933.395975] worker_thread+0x30/0x360 >>> [32933.400162] ? process_one_work+0x390/0x390 >>> [32933.404931] kthread+0xd9/0x100 >>> [32933.408536] ? 
kthread_complete_and_exit+0x20/0x20 >>> [32933.413984] ret_from_fork+0x22/0x30 >>> [32933.418074] </TASK> >>> >>> The call trace shows up again at 245, 368, and 491 seconds. Same >>> task, same trace. >>> >>> >>> >>> >> >> That's very helpful. The above trace suggests that the RDMA code is >> leaking a call to xprt_unpin_rqst(). > > IMHO this is unlikely to be related to the performance > regression -- none of this code has changed in the past 5 > kernel releases. Could be a different issue, though. > > As is often the case in these situations, the INFO trace > above happens long after the issue that caused the missing > unpin. So... unless Dennis has a reproducer that can trigger > the issue frequently, I don't think there's much that can > be extracted from that. To be fair, I've only seen this one time and have had the performance regression since -rc1. > Also "nfs: RPC call returned error 512" suggests someone > hit ^C at some point. It's always possible that the > xprt_rdma_free() path is missing an unpin. But again, > that's not likely to be related to performance. I've checked our test code and after 10 minutes it does give up trying to do the NFS copies and aborts (SIG_INT) the test. So in all my tests and bisect attempts it seems the possibility to hit a slow NFS operation that hangs for minutes has been possible for quite some time. However in 5.18 it gets much worse. Any likely places I should add traces to try and find out what's stuck or taking time? -Denny ^ permalink raw reply [flat|nested] 25+ messages in thread
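Short of adding new tracepoints, the existing sunrpc and rpcrdma trace events can be switched on around a single slow copy. A minimal sketch (assumptions: root privileges, tracefs mounted at /sys/kernel/tracing, and that both event group names exist on this kernel — verify with `ls /sys/kernel/tracing/events`):

```shell
# Sketch: enable the existing sunrpc/rpcrdma tracepoints around one
# slow copy. Needs root; the tracefs path and group names are
# assumptions to check against the running kernel.
enable_rpc_tracing() {
    tracefs=${1:-/sys/kernel/tracing}
    for grp in sunrpc rpcrdma; do
        if [ -d "$tracefs/events/$grp" ]; then
            echo 1 > "$tracefs/events/$grp/enable"
        fi
    done
    echo 1 > "$tracefs/tracing_on"
}

# Usage (as root), then read back the log once the copy stalls:
#   enable_rpc_tracing
#   cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
#   cat /sys/kernel/tracing/trace > /tmp/rpc-trace.txt
```

A gap of minutes between consecutive events in the resulting log localizes the stall without adding any new instrumentation.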
* Re: NFS regression between 5.17 and 5.18 2022-05-17 13:40 ` Dennis Dalessandro @ 2022-05-17 14:02 ` Chuck Lever III 2022-06-20 7:46 ` Thorsten Leemhuis 0 siblings, 1 reply; 25+ messages in thread From: Chuck Lever III @ 2022-05-17 14:02 UTC (permalink / raw) To: Dennis Dalessandro Cc: Trond Myklebust, Olga Kornievskaia, Linux NFS Mailing List > On May 17, 2022, at 9:40 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote: > > On 5/13/22 10:59 AM, Chuck Lever III wrote: >>>> >>>> Ran a test with -rc6 and this time see a hung task trace on the >>>> console as well >>>> as an NFS RPC error. >>>> >>>> [32719.991175] nfs: RPC call returned error 512 >>>> . >>>> . >>>> . >>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more >>>> than 122 seconds. >>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1 >>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this >>>> message. >>>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141 >>>> ppid: 2 >>>> flags:0x00004000 >>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc] >>>> [32933.324524] Call Trace: >>>> [32933.327347] <TASK> >>>> [32933.329785] __schedule+0x3dd/0x970 >>>> [32933.333783] schedule+0x41/0xa0 >>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc] >>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0 >>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc] >>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc] >>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc] >>>> [32933.376564] ? __switch_to_asm+0x42/0x70 >>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0 >>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc] >>>> [32933.391391] process_one_work+0x1c8/0x390 >>>> [32933.395975] worker_thread+0x30/0x360 >>>> [32933.400162] ? 
process_one_work+0x390/0x390 >>>> [32933.404931] kthread+0xd9/0x100 >>>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20 >>>> [32933.413984] ret_from_fork+0x22/0x30 >>>> [32933.418074] </TASK> >>>> >>>> The call trace shows up again at 245, 368, and 491 seconds. Same >>>> task, same trace. >>>> >>>> >>>> >>>> >>> >>> That's very helpful. The above trace suggests that the RDMA code is >>> leaking a call to xprt_unpin_rqst(). >> >> IMHO this is unlikely to be related to the performance >> regression -- none of this code has changed in the past 5 >> kernel releases. Could be a different issue, though. >> >> As is often the case in these situations, the INFO trace >> above happens long after the issue that caused the missing >> unpin. So... unless Dennis has a reproducer that can trigger >> the issue frequently, I don't think there's much that can >> be extracted from that. > > To be fair, I've only seen this one time and have had the performance regression > since -rc1. > >> Also "nfs: RPC call returned error 512" suggests someone >> hit ^C at some point. It's always possible that the >> xprt_rdma_free() path is missing an unpin. But again, >> that's not likely to be related to performance. > > I've checked our test code and after 10 minutes it does give up trying to do the > NFS copies and aborts (SIG_INT) the test. After sleeping on it, I'm fairly certain the stack trace above is a result of a gap in how xprtrdma handles a signaled RPC. Signal handling in that code is pretty hard to test, so not surprising that there's a lingering bug or two. One idea I had was to add a fault injector in the RPC scheduler to throw signals at random. I think it can be done without perturbing the hot path. Maybe I'll post an RFC patch. > So in all my tests and bisect attempts it seems the possibility to hit a slow > NFS operation that hangs for minutes has been possible for quite some time. > However in 5.18 it gets much worse. 
> > Any likely places I should add traces to try and find out what's stuck or taking > time? There's been a lot of churn in that area in recent releases, so I'm not familiar with the existing tracepoints. Maybe Ben or Trond could provide some guidance. -- Chuck Lever ^ permalink raw reply [flat|nested] 25+ messages in thread
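Until such a fault injector exists, the signaled-RPC teardown path can be exercised from userspace by interrupting copies at varying points. A rough sketch, assuming the file paths from this thread and a `sleep` that accepts fractional seconds (GNU coreutils does):

```shell
# Sketch: a userspace stand-in for the proposed signal fault injector --
# repeatedly interrupt an in-flight NFS copy so the ^C handling in the
# transport gets exercised at varying points. Paths are placeholders
# taken from this thread's reproducer.
interrupt_copy() {
    src=$1; dst=$2; cycles=${3:-100}
    i=0
    while [ "$i" -lt "$cycles" ]; do
        cp "$src" "$dst" &
        pid=$!
        sleep "0.$(( (i % 9) + 1 ))"    # vary when the signal lands
        kill -INT "$pid" 2>/dev/null
        wait "$pid" 2>/dev/null || true # nonzero status means it was killed
        rm -f "$dst"
        i=$((i + 1))
    done
}

# Usage against the mount from this thread:
#   interrupt_copy /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
```

If the hung-task trace above stems from a missing unpin on the signal path, a loop like this should reproduce it far more often than the test suite's 10-minute timeout does.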
* Re: NFS regression between 5.17 and 5.18 2022-05-17 14:02 ` Chuck Lever III @ 2022-06-20 7:46 ` Thorsten Leemhuis 2022-06-20 14:11 ` Chuck Lever III 0 siblings, 1 reply; 25+ messages in thread From: Thorsten Leemhuis @ 2022-06-20 7:46 UTC (permalink / raw) To: Chuck Lever III, Dennis Dalessandro Cc: Trond Myklebust, Olga Kornievskaia, Linux NFS Mailing List Dennis, Chuck, I have below issue on the list of tracked regressions. What's the status? Has any progress been made? Or is this not really a regression and can be ignored? Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) P.S.: As the Linux kernel's regression tracker I deal with a lot of reports and sometimes miss something important when writing mails like this. If that's the case here, don't hesitate to tell me in a public reply, it's in everyone's interest to set the public record straight. #regzbot poke ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890 On 17.05.22 16:02, Chuck Lever III wrote: >> On May 17, 2022, at 9:40 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote: >> >> On 5/13/22 10:59 AM, Chuck Lever III wrote: >>>>> >>>>> Ran a test with -rc6 and this time see a hung task trace on the >>>>> console as well >>>>> as an NFS RPC error. >>>>> >>>>> [32719.991175] nfs: RPC call returned error 512 >>>>> . >>>>> . >>>>> . >>>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more >>>>> than 122 seconds. >>>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1 >>>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this >>>>> message. 
>>>>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141 >>>>> ppid: 2 >>>>> flags:0x00004000 >>>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc] >>>>> [32933.324524] Call Trace: >>>>> [32933.327347] <TASK> >>>>> [32933.329785] __schedule+0x3dd/0x970 >>>>> [32933.333783] schedule+0x41/0xa0 >>>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc] >>>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0 >>>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc] >>>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc] >>>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc] >>>>> [32933.376564] ? __switch_to_asm+0x42/0x70 >>>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0 >>>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc] >>>>> [32933.391391] process_one_work+0x1c8/0x390 >>>>> [32933.395975] worker_thread+0x30/0x360 >>>>> [32933.400162] ? process_one_work+0x390/0x390 >>>>> [32933.404931] kthread+0xd9/0x100 >>>>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20 >>>>> [32933.413984] ret_from_fork+0x22/0x30 >>>>> [32933.418074] </TASK> >>>>> >>>>> The call trace shows up again at 245, 368, and 491 seconds. Same >>>>> task, same trace. >>>>> >>>>> >>>>> >>>>> >>>> >>>> That's very helpful. The above trace suggests that the RDMA code is >>>> leaking a call to xprt_unpin_rqst(). >>> >>> IMHO this is unlikely to be related to the performance >>> regression -- none of this code has changed in the past 5 >>> kernel releases. Could be a different issue, though. >>> >>> As is often the case in these situations, the INFO trace >>> above happens long after the issue that caused the missing >>> unpin. So... unless Dennis has a reproducer that can trigger >>> the issue frequently, I don't think there's much that can >>> be extracted from that. 
>> >> To be fair, I've only seen this one time and have had the performance regression >> since -rc1. >> >>> Also "nfs: RPC call returned error 512" suggests someone >>> hit ^C at some point. It's always possible that the >>> xprt_rdma_free() path is missing an unpin. But again, >>> that's not likely to be related to performance. >> >> I've checked our test code and after 10 minutes it does give up trying to do the >> NFS copies and aborts (SIG_INT) the test. > > After sleeping on it, I'm fairly certain the stack trace > above is a result of a gap in how xprtrdma handles a > signaled RPC. > > Signal handling in that code is pretty hard to test, so not > surprising that there's a lingering bug or two. One idea I > had was to add a fault injector in the RPC scheduler to > throw signals at random. I think it can be done without > perturbing the hot path. Maybe I'll post an RFC patch. > > >> So in all my tests and bisect attempts it seems the possibility to hit a slow >> NFS operation that hangs for minutes has been possible for quite some time. >> However in 5.18 it gets much worse. >> >> Any likely places I should add traces to try and find out what's stuck or taking >> time? > > There's been a lot of churn in that area in recent releases, > so I'm not familiar with the existing tracepoints. Maybe > Ben or Trond could provide some guidance. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18 2022-06-20 7:46 ` Thorsten Leemhuis @ 2022-06-20 14:11 ` Chuck Lever III 2022-06-20 14:29 ` Thorsten Leemhuis 0 siblings, 1 reply; 25+ messages in thread From: Chuck Lever III @ 2022-06-20 14:11 UTC (permalink / raw) To: Thorsten Leemhuis Cc: Dennis Dalessandro, Trond Myklebust, Olga Kornievskaia, Linux NFS Mailing List > On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote: > > Dennis, Chuck, I have below issue on the list of tracked regressions. > What's the status? Has any progress been made? Or is this not really a > regression and can be ignored? > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > > P.S.: As the Linux kernel's regression tracker I deal with a lot of > reports and sometimes miss something important when writing mails like > this. If that's the case here, don't hesitate to tell me in a public > reply, it's in everyone's interest to set the public record straight. > > #regzbot poke > ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890 The above link points to an Apple trackpad bug. The bug described all the way at the bottom was the origin problem report. I believe this is an NFS client issue. We are waiting for a response from the NFS client maintainers to help Dennis track this down. > On 17.05.22 16:02, Chuck Lever III wrote: >>> On May 17, 2022, at 9:40 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote: >>> >>> On 5/13/22 10:59 AM, Chuck Lever III wrote: >>>>>> >>>>>> Ran a test with -rc6 and this time see a hung task trace on the >>>>>> console as well >>>>>> as an NFS RPC error. >>>>>> >>>>>> [32719.991175] nfs: RPC call returned error 512 >>>>>> . >>>>>> . >>>>>> . >>>>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more >>>>>> than 122 seconds. 
>>>>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1 >>>>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>>> disables this >>>>>> message. >>>>>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141 >>>>>> ppid: 2 >>>>>> flags:0x00004000 >>>>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc] >>>>>> [32933.324524] Call Trace: >>>>>> [32933.327347] <TASK> >>>>>> [32933.329785] __schedule+0x3dd/0x970 >>>>>> [32933.333783] schedule+0x41/0xa0 >>>>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc] >>>>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0 >>>>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>>>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc] >>>>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>>>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc] >>>>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc] >>>>>> [32933.376564] ? __switch_to_asm+0x42/0x70 >>>>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0 >>>>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc] >>>>>> [32933.391391] process_one_work+0x1c8/0x390 >>>>>> [32933.395975] worker_thread+0x30/0x360 >>>>>> [32933.400162] ? process_one_work+0x390/0x390 >>>>>> [32933.404931] kthread+0xd9/0x100 >>>>>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20 >>>>>> [32933.413984] ret_from_fork+0x22/0x30 >>>>>> [32933.418074] </TASK> >>>>>> >>>>>> The call trace shows up again at 245, 368, and 491 seconds. Same >>>>>> task, same trace. >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> That's very helpful. The above trace suggests that the RDMA code is >>>>> leaking a call to xprt_unpin_rqst(). >>>> >>>> IMHO this is unlikely to be related to the performance >>>> regression -- none of this code has changed in the past 5 >>>> kernel releases. Could be a different issue, though. >>>> >>>> As is often the case in these situations, the INFO trace >>>> above happens long after the issue that caused the missing >>>> unpin. So... 
unless Dennis has a reproducer that can trigger >>>> the issue frequently, I don't think there's much that can >>>> be extracted from that. >>> >>> To be fair, I've only seen this one time and have had the performance regression >>> since -rc1. >>> >>>> Also "nfs: RPC call returned error 512" suggests someone >>>> hit ^C at some point. It's always possible that the >>>> xprt_rdma_free() path is missing an unpin. But again, >>>> that's not likely to be related to performance. >>> >>> I've checked our test code and after 10 minutes it does give up trying to do the >>> NFS copies and aborts (SIG_INT) the test. >> >> After sleeping on it, I'm fairly certain the stack trace >> above is a result of a gap in how xprtrdma handles a >> signaled RPC. >> >> Signal handling in that code is pretty hard to test, so not >> surprising that there's a lingering bug or two. One idea I >> had was to add a fault injector in the RPC scheduler to >> throw signals at random. I think it can be done without >> perturbing the hot path. Maybe I'll post an RFC patch. >> >> >>> So in all my tests and bisect attempts it seems the possibility to hit a slow >>> NFS operation that hangs for minutes has been possible for quite some time. >>> However in 5.18 it gets much worse. >>> >>> Any likely places I should add traces to try and find out what's stuck or taking >>> time? >> >> There's been a lot of churn in that area in recent releases, >> so I'm not familiar with the existing tracepoints. Maybe >> Ben or Trond could provide some guidance. > -- Chuck Lever ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18 2022-06-20 14:11 ` Chuck Lever III @ 2022-06-20 14:29 ` Thorsten Leemhuis 2022-06-20 14:40 ` Chuck Lever III 0 siblings, 1 reply; 25+ messages in thread From: Thorsten Leemhuis @ 2022-06-20 14:29 UTC (permalink / raw) To: Chuck Lever III Cc: Dennis Dalessandro, Trond Myklebust, Olga Kornievskaia, Linux NFS Mailing List On 20.06.22 16:11, Chuck Lever III wrote: > > >> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote: >> >> Dennis, Chuck, I have below issue on the list of tracked regressions. >> What's the status? Has any progress been made? Or is this not really a >> regression and can be ignored? >> >> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) >> >> P.S.: As the Linux kernel's regression tracker I deal with a lot of >> reports and sometimes miss something important when writing mails like >> this. If that's the case here, don't hesitate to tell me in a public >> reply, it's in everyone's interest to set the public record straight. >> >> #regzbot poke >> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890 > > The above link points to an Apple trackpad bug. Yeah, I know, sorry, should have mentioned: either I or my bot did something stupid and associated that report with this regression, that's why I deassociated it with the "unlink" command. > The bug described all the way at the bottom was the origin problem > report. I believe this is an NFS client issue. We are waiting for > a response from the NFS client maintainers to help Dennis track > this down. Many thx for the status update. Can anything be done to speed things up? 
This is taken quite a long time already -- way longer that outlined in "Prioritize work on fixing regressions" here: https://docs.kernel.org/process/handling-regressions.html Ciao, Thorsten >> On 17.05.22 16:02, Chuck Lever III wrote: >>>> On May 17, 2022, at 9:40 AM, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote: >>>> >>>> On 5/13/22 10:59 AM, Chuck Lever III wrote: >>>>>>> >>>>>>> Ran a test with -rc6 and this time see a hung task trace on the >>>>>>> console as well >>>>>>> as an NFS RPC error. >>>>>>> >>>>>>> [32719.991175] nfs: RPC call returned error 512 >>>>>>> . >>>>>>> . >>>>>>> . >>>>>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more >>>>>>> than 122 seconds. >>>>>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1 >>>>>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>>>> disables this >>>>>>> message. >>>>>>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141 >>>>>>> ppid: 2 >>>>>>> flags:0x00004000 >>>>>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc] >>>>>>> [32933.324524] Call Trace: >>>>>>> [32933.327347] <TASK> >>>>>>> [32933.329785] __schedule+0x3dd/0x970 >>>>>>> [32933.333783] schedule+0x41/0xa0 >>>>>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc] >>>>>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0 >>>>>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>>>>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc] >>>>>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc] >>>>>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc] >>>>>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc] >>>>>>> [32933.376564] ? __switch_to_asm+0x42/0x70 >>>>>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0 >>>>>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc] >>>>>>> [32933.391391] process_one_work+0x1c8/0x390 >>>>>>> [32933.395975] worker_thread+0x30/0x360 >>>>>>> [32933.400162] ? 
process_one_work+0x390/0x390 >>>>>>> [32933.404931] kthread+0xd9/0x100 >>>>>>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20 >>>>>>> [32933.413984] ret_from_fork+0x22/0x30 >>>>>>> [32933.418074] </TASK> >>>>>>> >>>>>>> The call trace shows up again at 245, 368, and 491 seconds. Same >>>>>>> task, same trace. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> That's very helpful. The above trace suggests that the RDMA code is >>>>>> leaking a call to xprt_unpin_rqst(). >>>>> >>>>> IMHO this is unlikely to be related to the performance >>>>> regression -- none of this code has changed in the past 5 >>>>> kernel releases. Could be a different issue, though. >>>>> >>>>> As is often the case in these situations, the INFO trace >>>>> above happens long after the issue that caused the missing >>>>> unpin. So... unless Dennis has a reproducer that can trigger >>>>> the issue frequently, I don't think there's much that can >>>>> be extracted from that. >>>> >>>> To be fair, I've only seen this one time and have had the performance regression >>>> since -rc1. >>>> >>>>> Also "nfs: RPC call returned error 512" suggests someone >>>>> hit ^C at some point. It's always possible that the >>>>> xprt_rdma_free() path is missing an unpin. But again, >>>>> that's not likely to be related to performance. >>>> >>>> I've checked our test code and after 10 minutes it does give up trying to do the >>>> NFS copies and aborts (SIG_INT) the test. >>> >>> After sleeping on it, I'm fairly certain the stack trace >>> above is a result of a gap in how xprtrdma handles a >>> signaled RPC. >>> >>> Signal handling in that code is pretty hard to test, so not >>> surprising that there's a lingering bug or two. One idea I >>> had was to add a fault injector in the RPC scheduler to >>> throw signals at random. I think it can be done without >>> perturbing the hot path. Maybe I'll post an RFC patch. 
>>> >>> >>>> So in all my tests and bisect attempts it seems the possibility to hit a slow >>>> NFS operation that hangs for minutes has been possible for quite some time. >>>> However in 5.18 it gets much worse. >>>> >>>> Any likely places I should add traces to try and find out what's stuck or taking >>>> time? >>> >>> There's been a lot of churn in that area in recent releases, >>> so I'm not familiar with the existing tracepoints. Maybe >>> Ben or Trond could provide some guidance. >> > > -- > Chuck Lever > > > > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18 2022-06-20 14:29 ` Thorsten Leemhuis @ 2022-06-20 14:40 ` Chuck Lever III 2022-06-20 17:06 ` Dennis Dalessandro 0 siblings, 1 reply; 25+ messages in thread From: Chuck Lever III @ 2022-06-20 14:40 UTC (permalink / raw) To: Thorsten Leemhuis Cc: Dennis Dalessandro, Trond Myklebust, Olga Kornievskaia, Linux NFS Mailing List Hi Thorsten- > On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote: > > On 20.06.22 16:11, Chuck Lever III wrote: >> >> >>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote: >>> >>> Dennis, Chuck, I have below issue on the list of tracked regressions. >>> What's the status? Has any progress been made? Or is this not really a >>> regression and can be ignored? >>> >>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) >>> >>> P.S.: As the Linux kernel's regression tracker I deal with a lot of >>> reports and sometimes miss something important when writing mails like >>> this. If that's the case here, don't hesitate to tell me in a public >>> reply, it's in everyone's interest to set the public record straight. >>> >>> #regzbot poke >>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890 >> >> The above link points to an Apple trackpad bug. > > Yeah, I know, sorry, should have mentioned: either I or my bot did > something stupid and associated that report with this regression, that's > why I deassociated it with the "unlink" command. Is there an open bugzilla for the original regression? >> The bug described all the way at the bottom was the origin problem >> report. I believe this is an NFS client issue. We are waiting for >> a response from the NFS client maintainers to help Dennis track >> this down. > > Many thx for the status update. Can anything be done to speed things up? 
> This is taken quite a long time already -- way longer that outlined in > "Prioritize work on fixing regressions" here: > https://docs.kernel.org/process/handling-regressions.html ENOTMYMONKEYS ;-) I was involved to help with the ^C issue that happened while Dennis was troubleshooting. It's not related to the original regression, which needs to be pursued by the NFS client maintainers. The correct people to poke are Trond, Olga (both cc'd) and Anna Schumaker. -- Chuck Lever ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18 2022-06-20 14:40 ` Chuck Lever III @ 2022-06-20 17:06 ` Dennis Dalessandro 2022-06-21 16:04 ` Olga Kornievskaia 0 siblings, 1 reply; 25+ messages in thread From: Dennis Dalessandro @ 2022-06-20 17:06 UTC (permalink / raw) To: Chuck Lever III, Thorsten Leemhuis Cc: Trond Myklebust, Olga Kornievskaia, Linux NFS Mailing List On 6/20/22 10:40 AM, Chuck Lever III wrote: > Hi Thorsten- > >> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote: >> >> On 20.06.22 16:11, Chuck Lever III wrote: >>> >>> >>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote: >>>> >>>> Dennis, Chuck, I have below issue on the list of tracked regressions. >>>> What's the status? Has any progress been made? Or is this not really a >>>> regression and can be ignored? >>>> >>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) >>>> >>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of >>>> reports and sometimes miss something important when writing mails like >>>> this. If that's the case here, don't hesitate to tell me in a public >>>> reply, it's in everyone's interest to set the public record straight. >>>> >>>> #regzbot poke >>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890 >>> >>> The above link points to an Apple trackpad bug. >> >> Yeah, I know, sorry, should have mentioned: either I or my bot did >> something stupid and associated that report with this regression, that's >> why I deassociated it with the "unlink" command. > > Is there an open bugzilla for the original regression? > > >>> The bug described all the way at the bottom was the origin problem >>> report. I believe this is an NFS client issue. We are waiting for >>> a response from the NFS client maintainers to help Dennis track >>> this down. >> >> Many thx for the status update. Can anything be done to speed things up? 
>> This is taken quite a long time already -- way longer that outlined in >> "Prioritize work on fixing regressions" here: >> https://docs.kernel.org/process/handling-regressions.html > > ENOTMYMONKEYS ;-) > > I was involved to help with the ^C issue that happened while > Dennis was troubleshooting. It's not related to the original > regression, which needs to be pursued by the NFS client > maintainers. > > The correct people to poke are Trond, Olga (both cc'd) and > Anna Schumaker. Perhaps I should open a bugzilla for the regression. The Ctrl+C issue was a result of the test we were running taking too long. It times out after 10 minutes or so and kills the process. So a downstream effect of the regression. The test is still continuing to fail as of 5.19-rc2. I'll double check that it's the same issue and open a bugzilla. Thanks for poking at this. -Denny ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18 2022-06-20 17:06 ` Dennis Dalessandro @ 2022-06-21 16:04 ` Olga Kornievskaia 2022-06-21 16:58 ` Dennis Dalessandro 0 siblings, 1 reply; 25+ messages in thread From: Olga Kornievskaia @ 2022-06-21 16:04 UTC (permalink / raw) To: Dennis Dalessandro Cc: Chuck Lever III, Thorsten Leemhuis, Trond Myklebust, Linux NFS Mailing List Hi Dennis, Can I ask some basic questions? Have you tried to get any kind of profiling done to see where the client is spending time (using perf perhaps)? real 4m11.835s user 0m0.001s sys 0m0.277s sounds like ~4 minutes are spent sleeping somewhere? Did it take 4 minutes to do the network transfer (if we had a network trace we could see how long the network transfers were)? Do you have one (along with something that can tell us approximately when the request began from cp's perspective, like a date beforehand)? I see that there were no rdma changes that went into the 5.18 kernel, so whatever changed is either generic nfs behaviour or perhaps something in the rdma core code (is a Mellanox card being used here?) I wonder if the slowdown only happens on rdma or is it visible on a tcp mount as well, have you tried? On Mon, Jun 20, 2022 at 1:06 PM Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> wrote: > > On 6/20/22 10:40 AM, Chuck Lever III wrote: > > Hi Thorsten- > > > >> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote: > >> > >> On 20.06.22 16:11, Chuck Lever III wrote: > >>> > >>> > >>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote: > >>>> > >>>> Dennis, Chuck, I have below issue on the list of tracked regressions. > >>>> What's the status? Has any progress been made? Or is this not really a > >>>> regression and can be ignored? 
> >>>> > >>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > >>>> > >>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of > >>>> reports and sometimes miss something important when writing mails like > >>>> this. If that's the case here, don't hesitate to tell me in a public > >>>> reply, it's in everyone's interest to set the public record straight. > >>>> > >>>> #regzbot poke > >>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890 > >>> > >>> The above link points to an Apple trackpad bug. > >> > >> Yeah, I know, sorry, should have mentioned: either I or my bot did > >> something stupid and associated that report with this regression, that's > >> why I deassociated it with the "unlink" command. > > > > Is there an open bugzilla for the original regression? > > > > > >>> The bug described all the way at the bottom was the origin problem > >>> report. I believe this is an NFS client issue. We are waiting for > >>> a response from the NFS client maintainers to help Dennis track > >>> this down. > >> > >> Many thx for the status update. Can anything be done to speed things up? > >> This is taken quite a long time already -- way longer that outlined in > >> "Prioritize work on fixing regressions" here: > >> https://docs.kernel.org/process/handling-regressions.html > > > > ENOTMYMONKEYS ;-) > > > > I was involved to help with the ^C issue that happened while > > Dennis was troubleshooting. It's not related to the original > > regression, which needs to be pursued by the NFS client > > maintainers. > > > > The correct people to poke are Trond, Olga (both cc'd) and > > Anna Schumaker. > > Perhaps I should open a bugzilla for the regression. The Ctrl+C issue was a > result of the test we were running taking too long. It times out after 10 > minutes or so and kills the process. So a downstream effect of the regression. > > The test is still continuing to fail as of 5.19-rc2. 
I'll double check that it's > the same issue and open a bugzilla. > > Thanks for poking at this. > > -Denny ^ permalink raw reply [flat|nested] 25+ messages in thread
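The timestamp and timing data Olga asks for could be collected with a wrapper like the following (a sketch: the paths are the ones from the reproducer, and the perf invocation in the comments is a generic starting point, not a known-good recipe for this workload):

```shell
# Sketch: print a wall-clock mark before each copy so a stall can be
# lined up against the kernel log, and report elapsed time afterward.
timed_copy() {
    src=$1; dst=$2
    start=$(date +%s)
    echo "copy started at: $(date -u '+%F %T') ($start)"
    cp "$src" "$dst"
    echo "elapsed: $(( $(date +%s) - start ))s"
}

# To see where the client spends the ~4 minutes, profile the same copy
# (as root):
#   perf record -a -g -- cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
#   perf report --stdio | head -40
```

A profile dominated by scheduler/idle time rather than NFS or RDMA code would support the theory that the client is sleeping on something rather than burning CPU.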
* Re: NFS regression between 5.17 and 5.18 2022-06-21 16:04 ` Olga Kornievskaia @ 2022-06-21 16:58 ` Dennis Dalessandro 2022-06-21 17:51 ` Olga Kornievskaia 0 siblings, 1 reply; 25+ messages in thread From: Dennis Dalessandro @ 2022-06-21 16:58 UTC (permalink / raw) To: Olga Kornievskaia Cc: Chuck Lever III, Thorsten Leemhuis, Trond Myklebust, Linux NFS Mailing List On 6/21/22 12:04 PM, Olga Kornievskaia wrote: > Hi Dennis, > > Can I ask some basic questions? Have you tried to get any kinds of > profiling done to see where the client is spending time (using perf > perhaps)? > > real 4m11.835s > user 0m0.001s > sys 0m0.277s > > sounds like 4ms are spent sleeping somewhere? Did it take 4mins to do > a network transfer (if we had a network trace we could see how long > network transfer were)? Do you have one (that goes along with > something that can tell us approximately when the request began from > the cp's perspective, like a date before hand)? > > I see that there were no rdma changes that went into 5.18 kernel so > whatever changed either a generic nfs behaviour or perhaps something > in the rdma core code (is an mellonax card being used here?) > > I wonder if the slowdown only happens on rdma or is it visible on the > tcp mount as well, have you tried? > Hi Olga, I have opened a Kernel Bugzilla if you would rather log future responses there: https://bugzilla.kernel.org/show_bug.cgi?id=216160 To answer your above questions: This is on Omni-Path hardware, not Mellanox. I have not tried the TCP mount, but I can. I don't have a network trace per se, or a profile; we don't have tcpdump support or anything like that on this fabric. However, I can tell you there is nothing going over the network while it appears to be hung -- I can monitor the packet counters. If you have ideas for where I could put tracepoints that would tell us something, I can certainly add them. 
-Denny > > > On Mon, Jun 20, 2022 at 1:06 PM Dennis Dalessandro > <dennis.dalessandro@cornelisnetworks.com> wrote: >> >> On 6/20/22 10:40 AM, Chuck Lever III wrote: >>> Hi Thorsten- >>> >>>> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote: >>>> >>>> On 20.06.22 16:11, Chuck Lever III wrote: >>>>> >>>>> >>>>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote: >>>>>> >>>>>> Dennis, Chuck, I have below issue on the list of tracked regressions. >>>>>> What's the status? Has any progress been made? Or is this not really a >>>>>> regression and can be ignored? >>>>>> >>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) >>>>>> >>>>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of >>>>>> reports and sometimes miss something important when writing mails like >>>>>> this. If that's the case here, don't hesitate to tell me in a public >>>>>> reply, it's in everyone's interest to set the public record straight. >>>>>> >>>>>> #regzbot poke >>>>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890 >>>>> >>>>> The above link points to an Apple trackpad bug. >>>> >>>> Yeah, I know, sorry, should have mentioned: either I or my bot did >>>> something stupid and associated that report with this regression, that's >>>> why I deassociated it with the "unlink" command. >>> >>> Is there an open bugzilla for the original regression? >>> >>> >>>>> The bug described all the way at the bottom was the origin problem >>>>> report. I believe this is an NFS client issue. We are waiting for >>>>> a response from the NFS client maintainers to help Dennis track >>>>> this down. >>>> >>>> Many thx for the status update. Can anything be done to speed things up? 
>>>> This is taken quite a long time already -- way longer that outlined in >>>> "Prioritize work on fixing regressions" here: >>>> https://docs.kernel.org/process/handling-regressions.html >>> >>> ENOTMYMONKEYS ;-) >>> >>> I was involved to help with the ^C issue that happened while >>> Dennis was troubleshooting. It's not related to the original >>> regression, which needs to be pursued by the NFS client >>> maintainers. >>> >>> The correct people to poke are Trond, Olga (both cc'd) and >>> Anna Schumaker. >> >> Perhaps I should open a bugzilla for the regression. The Ctrl+C issue was a >> result of the test we were running taking too long. It times out after 10 >> minutes or so and kills the process. So a downstream effect of the regression. >> >> The test is still continuing to fail as of 5.19-rc2. I'll double check that it's >> the same issue and open a bugzilla. >> >> Thanks for poking at this. >> >> -Denny ^ permalink raw reply [flat|nested] 25+ messages in thread
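Olga's reading of the `time` output above can be made explicit: `user` and `sys` together account for well under a second of CPU time, so nearly all of the elapsed (`real`) time must be spent blocked in the kernel rather than computing. A quick sketch of that arithmetic, using the figures from the report:

```shell
# Figures from the report's `time` output: real 4m11.835s, user 0m0.001s,
# sys 0m0.277s, converted to seconds.
real_s=251.835 user_s=0.001 sys_s=0.277

# Whatever elapsed time is not covered by CPU time is time spent blocked
# (sleeping in the kernel, e.g. waiting on I/O or an RPC completion).
blocked=$(awk -v r="$real_s" -v u="$user_s" -v s="$sys_s" \
    'BEGIN { printf "%.3f", r - (u + s) }')
echo "blocked: ${blocked}s of ${real_s}s elapsed"
```

That roughly 251 s of blocked time is why a network trace or tracepoint timestamps, rather than a CPU profile alone, are the natural next step.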
* Re: NFS regression between 5.17 and 5.18
  2022-06-21 16:58 ` Dennis Dalessandro
@ 2022-06-21 17:51 ` Olga Kornievskaia
  2022-06-21 17:53 ` Chuck Lever III
  0 siblings, 1 reply; 25+ messages in thread
From: Olga Kornievskaia @ 2022-06-21 17:51 UTC (permalink / raw)
To: Dennis Dalessandro
Cc: Chuck Lever III, Thorsten Leemhuis, Trond Myklebust, Linux NFS Mailing List

On Tue, Jun 21, 2022 at 12:58 PM Dennis Dalessandro
<dennis.dalessandro@cornelisnetworks.com> wrote:
>
> On 6/21/22 12:04 PM, Olga Kornievskaia wrote:
> > Hi Dennis,
> >
> > Can I ask some basic questions? Have you tried to get any kind of
> > profiling done to see where the client is spending time (using perf,
> > perhaps)?
> >
> > real    4m11.835s
> > user    0m0.001s
> > sys     0m0.277s
> >
> > sounds like 4 min are spent sleeping somewhere? Did it take 4 min to do
> > the network transfer (if we had a network trace, we could see how long
> > the network transfers were)? Do you have one (along with something that
> > can tell us approximately when the request began from cp's perspective,
> > like a date beforehand)?
> >
> > I see that there were no RDMA changes that went into the 5.18 kernel, so
> > whatever changed is either generic NFS behaviour or perhaps something
> > in the RDMA core code (is a Mellanox card being used here?).
> >
> > I wonder if the slowdown only happens on RDMA or is it visible on a
> > TCP mount as well? Have you tried?
>
> Hi Olga,
>
> I have opened a kernel bugzilla if you would rather log future responses there:
> https://bugzilla.kernel.org/show_bug.cgi?id=216160
>
> To answer your questions: this is on Omni-Path hardware. I have not tried
> the TCP mount, but I can. I don't have a network trace per se, or a profile.
> We don't support anything like a tcpdump. However, I can tell you there is
> nothing going over the network while it appears to be hung; I can monitor
> the packet counters.

In this thread there are 2 problems raised: (1) a performance regression
and (2) a single run that hit a hung client.

For #1, given that there were no RDMA changes added to 5.18, it seems
like something in generic NFS is causing issues for you, so I recommend
first using a Linux profiler to get some information about the time
spent in the kernel functions triggered by the cp command. If you can't
run the profiler (which I think you should be able to), then perhaps
just enable the nfs4 and rpcrdma tracepoints, which also carry
timestamps; looking at the differences can give some clue where the time
is being spent. 4 min is a significant chunk of time and should be
visible somewhere in those timestamps.

For #2, I have personally run into that stack trace while investigating
a hang using soft-iWARP as the RDMA provider. It was an unpinned
request, but I think it was due to a soft-iWARP failure that caused it
not to deliver a completion to RDMA, which led to the request never
getting unpinned. Thus I would recommend looking into failures in your
RDMA provider for clues on that problem.

> If you have some ideas where I could put some trace points that could tell us
> something, I can certainly add those.
>
> -Denny
>
> >
> > On Mon, Jun 20, 2022 at 1:06 PM Dennis Dalessandro
> > <dennis.dalessandro@cornelisnetworks.com> wrote:
> >>
> >> On 6/20/22 10:40 AM, Chuck Lever III wrote:
> >>> Hi Thorsten-
> >>>
> >>>> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote:
> >>>>
> >>>> On 20.06.22 16:11, Chuck Lever III wrote:
> >>>>>
> >>>>>
> >>>>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <regressions@leemhuis.info> wrote:
> >>>>>>
> >>>>>> Dennis, Chuck, I have the below issue on the list of tracked regressions.
> >>>>>> What's the status? Has any progress been made? Or is this not really a
> >>>>>> regression and can be ignored?
> >>>>>>
> >>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >>>>>>
> >>>>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> >>>>>> reports and sometimes miss something important when writing mails like
> >>>>>> this. If that's the case here, don't hesitate to tell me in a public
> >>>>>> reply, it's in everyone's interest to set the public record straight.
> >>>>>>
> >>>>>> #regzbot poke
> >>>>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890
> >>>>>
> >>>>> The above link points to an Apple trackpad bug.
> >>>>
> >>>> Yeah, I know, sorry, should have mentioned: either I or my bot did
> >>>> something stupid and associated that report with this regression; that's
> >>>> why I deassociated it with the "unlink" command.
> >>>
> >>> Is there an open bugzilla for the original regression?
> >>>
> >>>
> >>>>> The bug described all the way at the bottom was the original problem
> >>>>> report. I believe this is an NFS client issue. We are waiting for
> >>>>> a response from the NFS client maintainers to help Dennis track
> >>>>> this down.
> >>>>
> >>>> Many thx for the status update. Can anything be done to speed things up?
> >>>> This has taken quite a long time already -- way longer than outlined in
> >>>> "Prioritize work on fixing regressions" here:
> >>>> https://docs.kernel.org/process/handling-regressions.html
> >>>
> >>> ENOTMYMONKEYS ;-)
> >>>
> >>> I was involved to help with the ^C issue that happened while
> >>> Dennis was troubleshooting. It's not related to the original
> >>> regression, which needs to be pursued by the NFS client
> >>> maintainers.
> >>>
> >>> The correct people to poke are Trond, Olga (both cc'd) and
> >>> Anna Schumaker.
> >>
> >> Perhaps I should open a bugzilla for the regression. The Ctrl+C issue was a
> >> result of the test we were running taking too long. It times out after 10
> >> minutes or so and kills the process. So it's a downstream effect of the
> >> regression.
> >>
> >> The test is still continuing to fail as of 5.19-rc2. I'll double check that
> >> it's the same issue and open a bugzilla.
> >>
> >> Thanks for poking at this.
> >>
> >> -Denny

^ permalink raw reply [flat|nested] 25+ messages in thread
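Olga's fallback suggestion (enable the nfs4 and rpcrdma tracepoints and compare timestamps) can be sketched as below. The `trace-cmd` commands are shown as comments since they need root on the client under test; the sample trace lines and event names are invented placeholders, there only to demonstrate scanning a report for the largest inter-event gap, which is where a four-minute stall would stand out:

```shell
# On the real client (as root), something like:
#   trace-cmd record -e nfs4 -e rpcrdma -e sunrpc -- \
#       cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
#   trace-cmd report > nfs_trace.txt
#
# A few hypothetical lines in ftrace's default layout, just to exercise the
# gap-finder below; real event names and fields will differ.
cat > nfs_trace.sample <<'EOF'
cp-1234 [000] 100.000100: nfs4_read_enter: fileid=42
cp-1234 [000] 100.000250: rpcrdma_marshal: req=0xabc
cp-1234 [000] 351.900000: rpcrdma_reply: req=0xabc
cp-1234 [000] 351.900400: nfs4_read_done: fileid=42
EOF

# Report the largest gap between consecutive events; field 3 is the
# timestamp (awk's numeric conversion drops the trailing colon).
largest_gap=$(awk '
    { t = $3 + 0; e = $4; sub(/:$/, "", e) }
    NR > 1 && t - prev > max { max = t - prev; ev = e }
    { prev = t }
    END { printf "%.3fs before %s", max, ev }
' nfs_trace.sample)
echo "largest gap: $largest_gap"
rm -f nfs_trace.sample
```

On the sample data this points at the (made-up) `rpcrdma_reply` event, i.e. the stall sits between sending the request and seeing the reply; run over a real report it should localize the 4 min the same way.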
* Re: NFS regression between 5.17 and 5.18
  2022-06-21 17:51 ` Olga Kornievskaia
@ 2022-06-21 17:53 ` Chuck Lever III
  0 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever III @ 2022-06-21 17:53 UTC (permalink / raw)
To: Olga Kornievskaia
Cc: Dennis Dalessandro, Thorsten Leemhuis, Trond Myklebust, Linux NFS Mailing List

> On Jun 21, 2022, at 1:51 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>
> For #2, I have personally run into that stack trace while
> investigating a hang using soft-iWARP as the RDMA provider. It was an
> unpinned request, but I think it was due to a soft-iWARP failure that
> caused it not to deliver a completion to RDMA, which led to the
> request never getting unpinned. Thus I would recommend looking into
> failures in your RDMA provider for clues on that problem.

#2 is a separate problem entirely, and is a known bug. I plan to add
some fault injection so we can reproduce and track it down. I think you
should prune this issue and focus just on #1 for now.

--
Chuck Lever

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: NFS regression between 5.17 and 5.18
  2022-04-28 13:05 NFS regression between 5.17 and 5.18 Dennis Dalessandro
  2022-04-28 14:57 ` Chuck Lever III
@ 2022-05-04 9:45 ` Thorsten Leemhuis
  1 sibling, 0 replies; 25+ messages in thread
From: Thorsten Leemhuis @ 2022-05-04 9:45 UTC (permalink / raw)
To: Dennis Dalessandro; +Cc: Chuck Lever, regressions, linux-nfs

[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text below is compiled from a few
template paragraphs you might already have encountered in similar mails.]

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

To be sure the below issue doesn't fall through the cracks unnoticed,
I'm adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced v5.17..v5.18
#regzbot title nfs: copy from NFS to the local disk suddenly takes 3 min instead of half a second
#regzbot ignore-activity

If it turns out this isn't a regression, feel free to remove it from the
tracking by sending a reply to this thread containing a paragraph like
"#regzbot invalid: reason why this is invalid" (without the quotes).

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

On 28.04.22 15:05, Dennis Dalessandro wrote:
> Hi NFS folks,
>
> I've noticed a pretty nasty regression in our NFS capability between 5.17 and
> 5.18-rc1. I've tried to bisect but I'm not having any luck. The problem I'm
> seeing is that it takes 3 minutes to copy a file from NFS to the local disk,
> when it should take less than half a second, which it did up through 5.17.
>
> It doesn't seem to be network related, but I can't rule that out completely.
>
> I tried to bisect but the problem can be intermittent. Some runs I'll see a
> problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it
> 100 out of 100.
>
> The latest attempt I have is at 5.18-rc4 and here's what I see when I 1) create
> a file with dd, 2) copy local disk to NFS mount, 3) copy NFS to local disk.
>
> Test 2
> Creating /dev/shm/run_nfs_test.junk...
> 262144+0 records in
> 262144+0 records out
> 268435456 bytes (268 MB, 256 MiB) copied, 1.29037 s, 208 MB/s
> Done
> copy /dev/shm/run_nfs_test.junk to /mnt/nfs_test/run_nfs_test.junk...
>
> real    0m0.502s
> user    0m0.001s
> sys     0m0.250s
> Done
> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>
> real    4m11.835s
> user    0m0.001s
> sys     0m0.277s
> Done
> Comparing files...
> Done
> Remove /dev/shm/run_nfs_test.tmp...
>
> real    0m0.031s
> user    0m0.000s
> sys     0m0.031s
> Done
> Remove /dev/shm/run_nfs_test.junk...
>
> real    0m0.031s
> user    0m0.001s
> sys     0m0.030s
> Done
> Remove /mnt/nfs_test/run_nfs_test.junk...
>
> real    0m0.024s
> user    0m0.001s
> sys     0m0.022s
> Done
>
> Create the server with a 4G RAM disk:
> mount -t tmpfs -o size=4096M tmpfs /mnt/nfs_test
>
> exportfs -o fsid=0,rw,async,insecure,no_root_squash
> 192.168.254.0/255.255.255.0:/mnt/nfs_test
>
> Mount on the client side with:
> mount -o rdma,port=20049 192.168.254.1:/mnt/nfs_test /mnt/nfs_test
>
> Any advice on tracking this down further would be greatly appreciated.
>
> Thanks
>
> -Denny

--
Additional information about regzbot:

If you want to know more about regzbot, check out its web interface, the
getting started guide, and the reference documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression it's in your interest to
CC the regression list and tell regzbot about the issue, as that ensures
the regression makes it onto the radar of the Linux kernel's regression
tracker and your report won't fall through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would; just remember to
include a 'Link:' tag in the patch description pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up, for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.

^ permalink raw reply [flat|nested] 25+ messages in thread
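Olga asked earlier in the thread whether the slowdown also shows up over TCP. A minimal back-to-back comparison against the repro steps above might look like the sketch below; the server address, export path, and rdma mount line come from the report, while the `proto=tcp` line is the suggested comparison (NFS version left at the default). `DRY_RUN=1` only prints the plan, so it can be sanity-checked before running it as root on the client:

```shell
# Print-or-execute helper: with DRY_RUN set, commands are only echoed.
DRY_RUN=1
run() { echo "+ $*"; [ -n "$DRY_RUN" ] || "$@"; }

plan=$(
    for opts in "rdma,port=20049" "proto=tcp"; do
        run mount -o "$opts" 192.168.254.1:/mnt/nfs_test /mnt/nfs_test
        run sh -c "time cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp"
        run umount /mnt/nfs_test
    done
)
printf '%s\n' "$plan"
```

If the ~4 min `real` time reproduces only with the rdma mount, that points back at the xprtrdma path; if both transports stall, the generic NFS client side is the more likely suspect.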
end of thread, other threads:[~2022-06-21 17:54 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-28 13:05 NFS regression between 5.17 and 5.18 Dennis Dalessandro
2022-04-28 14:57 ` Chuck Lever III
2022-04-28 15:42   ` Dennis Dalessandro
2022-04-28 19:47     ` Dennis Dalessandro
2022-04-28 19:56       ` Trond Myklebust
2022-04-29 12:54         ` Dennis Dalessandro
2022-04-29 13:37           ` Chuck Lever III
2022-05-06 13:24             ` Dennis Dalessandro
2022-05-13 11:58               ` Dennis Dalessandro
2022-05-13 13:30                 ` Trond Myklebust
2022-05-13 14:59                   ` Chuck Lever III
2022-05-13 15:19                     ` Trond Myklebust
2022-05-13 18:53                       ` Chuck Lever III
2022-05-17 13:40                         ` Dennis Dalessandro
2022-05-17 14:02                           ` Chuck Lever III
2022-06-20  7:46                             ` Thorsten Leemhuis
2022-06-20 14:11                               ` Chuck Lever III
2022-06-20 14:29                                 ` Thorsten Leemhuis
2022-06-20 14:40                                   ` Chuck Lever III
2022-06-20 17:06                                     ` Dennis Dalessandro
2022-06-21 16:04                                       ` Olga Kornievskaia
2022-06-21 16:58                                         ` Dennis Dalessandro
2022-06-21 17:51                                           ` Olga Kornievskaia
2022-06-21 17:53                                             ` Chuck Lever III
2022-05-04  9:45 ` Thorsten Leemhuis