linux-rdma.vger.kernel.org archive mirror
* Spurious instability with NFSoRDMA under moderate load
@ 2021-05-16 17:29 Timo Rothenpieler
  2021-05-17 16:27 ` Chuck Lever III
  0 siblings, 1 reply; 10+ messages in thread
From: Timo Rothenpieler @ 2021-05-16 17:29 UTC (permalink / raw)
  To: Linux NFS Mailing List, linux-rdma


This has happened 3 times so far over the last couple months, and I do 
not have a clear way to reproduce it.
It happens under moderate load, when lots of nodes read from and write to 
the server, though not in any particularly intense way: just normal 
program execution, writing of light logs, and other routine tasks.

The issues on the clients manifest in a multitude of ways. Most of the 
time, random I/O operations simply fail; occasionally they hang 
indefinitely and leave the process unkillable.
Another example would be: "Failed to remove 
'.../.nfs00000000007b03af00000001': Device or resource busy"

Once a client is in that state, the only way to get it back into order 
is a reboot.

On the server side, a single error cqe is dumped each time this problem 
happens. So far, I have always rebooted the server as well, to make sure 
everything is back in order; I'm not sure that is strictly necessary.

> [561889.198889] infiniband mlx5_0: dump_cqe:272:(pid 709): dump error cqe
> [561889.198945] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [561889.198984] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [561889.199023] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [561889.199061] 00000030: 00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2

> [985074.602880] infiniband mlx5_0: dump_cqe:272:(pid 599): dump error cqe
> [985074.602921] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [985074.602946] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [985074.602970] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [985074.602994] 00000030: 00 00 00 00 00 00 88 13 08 00 01 46 f2 93 0b d3

> [1648894.168819] infiniband ibp1s0: dump_cqe:272:(pid 696): dump error cqe
> [1648894.168853] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [1648894.168878] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [1648894.168903] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [1648894.168928] 00000030: 00 00 00 00 00 00 88 13 08 00 01 08 6b d2 b9 d3

These all happened under different versions of the 5.10 kernel; the 
latest one occurred under 5.10.32 today.

Switching all clients to TCP seems to make NFS work perfectly reliably.

I'm not sure how to read those error dumps, so help there would be 
appreciated.

Could this be similar to the spurious issues you get with UDP, where 
dropped packets cause havoc? Though I would not expect heavy load on IB to 
cause an error cqe to be logged.




Thanks,
Timo




* Re: Spurious instability with NFSoRDMA under moderate load
  2021-05-16 17:29 Spurious instability with NFSoRDMA under moderate load Timo Rothenpieler
@ 2021-05-17 16:27 ` Chuck Lever III
  2021-05-17 17:37   ` Timo Rothenpieler
  2021-05-19 15:20   ` Leon Romanovsky
  0 siblings, 2 replies; 10+ messages in thread
From: Chuck Lever III @ 2021-05-17 16:27 UTC (permalink / raw)
  To: Timo Rothenpieler; +Cc: Linux NFS Mailing List, linux-rdma

Hello Timo-

> On May 16, 2021, at 1:29 PM, Timo Rothenpieler <timo@rothenpieler.org> wrote:
> 
> This has happened 3 times so far over the last couple months, and I do not have a clear way to reproduce it.
> It happens under moderate load, when lots of nodes read from and write to the server, though not in any particularly intense way: just normal program execution, writing of light logs, and other routine tasks.
> 
> The issues on the clients manifest in a multitude of ways. Most of the time, random I/O operations simply fail; occasionally they hang indefinitely and leave the process unkillable.
> Another example would be: "Failed to remove '.../.nfs00000000007b03af00000001': Device or resource busy"
> 
> Once a client is in that state, the only way to get it back into order is a reboot.
> 
> On the server side, a single error cqe is dumped each time this problem happens. So far, I have always rebooted the server as well, to make sure everything is back in order; I'm not sure that is strictly necessary.
> 
>> [561889.198889] infiniband mlx5_0: dump_cqe:272:(pid 709): dump error cqe
>> [561889.198945] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [561889.198984] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [561889.199023] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [561889.199061] 00000030: 00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2
> 
>> [985074.602880] infiniband mlx5_0: dump_cqe:272:(pid 599): dump error cqe
>> [985074.602921] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [985074.602946] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [985074.602970] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [985074.602994] 00000030: 00 00 00 00 00 00 88 13 08 00 01 46 f2 93 0b d3
> 
>> [1648894.168819] infiniband ibp1s0: dump_cqe:272:(pid 696): dump error cqe
>> [1648894.168853] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [1648894.168878] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [1648894.168903] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [1648894.168928] 00000030: 00 00 00 00 00 00 88 13 08 00 01 08 6b d2 b9 d3

I'm hoping Leon can get out his magic decoder ring and tell us if
these CQE dumps contain a useful WC status code.
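
In the meantime, here is a rough sketch of how those 64-byte dumps could 
be picked apart, assuming they are laid out like struct mlx5_err_cqe in 
include/linux/mlx5/device.h. That layout is an assumption on my part, so 
check the field offsets against the driver source for the running kernel 
before trusting the output:

# Rough decoder for the 64-byte "dump error cqe" hex dumps above.
# ASSUMPTION: the bytes follow struct mlx5_err_cqe, i.e. byte 54 is
# vendor_err_synd, byte 55 is syndrome, bytes 60-61 are wqe_counter and
# byte 63 is op_own with the opcode in the upper nibble.
import re
import sys

def decode(dump_text):
    # Collect the hex bytes from the "<offset>: xx xx ..." lines, skipping
    # everything before the first ':' (quote markers, dmesg timestamps).
    raw = bytes(int(tok, 16)
                for line in dump_text.splitlines() if ":" in line
                for tok in re.findall(r"\b[0-9a-f]{2}\b",
                                      line.split(":", 1)[1]))
    if len(raw) != 64:
        raise ValueError("expected 64 bytes, got %d" % len(raw))
    return {
        "vendor_err_synd": hex(raw[54]),
        "syndrome": hex(raw[55]),
        "wqe_counter": hex(int.from_bytes(raw[60:62], "big")),
        "opcode": raw[63] >> 4,   # upper nibble of op_own
    }

if __name__ == "__main__":
    # Paste one four-line dump on stdin.
    print(decode(sys.stdin.read()))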

Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if
the problem persists.


> These all happened under different versions of the 5.10 kernel; the latest one occurred under 5.10.32 today.
> 
> Switching all clients to TCP seems to make NFS work perfectly reliably.
> 
> I'm not sure how to read those error dumps, so help there would be appreciated.
> 
> Could this be similar to the spurious issues you get with UDP, where dropped packets cause havoc? Though I would not expect heavy load on IB to cause an error cqe to be logged.

The underlying RDMA connection for RPC/RDMA is reliable (lossless),
so I don't see an immediate relation to how RPC/UDP might behave.


--
Chuck Lever





* Re: Spurious instability with NFSoRDMA under moderate load
  2021-05-17 16:27 ` Chuck Lever III
@ 2021-05-17 17:37   ` Timo Rothenpieler
  2021-06-21 16:06     ` Timo Rothenpieler
  2021-05-19 15:20   ` Leon Romanovsky
  1 sibling, 1 reply; 10+ messages in thread
From: Timo Rothenpieler @ 2021-05-17 17:37 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Linux NFS Mailing List, linux-rdma


On 17.05.2021 18:27, Chuck Lever III wrote:
> Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if
> the problem persists.

I updated the NFS server to 5.12.4 now and will observe and maybe try to 
cause some mixed load.
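
If it helps, the kind of mixed load I have in mind is roughly what the 
sketch below generates; the mount point path and the thread and 
iteration counts are placeholders rather than my actual setup:

# Rough mixed read/write load against the NFS export (placeholder path).
import os
import random
import threading

MOUNT = "/mnt/nfs/loadtest"   # placeholder: a directory on the NFS mount

def worker(n, iterations=1000):
    path = os.path.join(MOUNT, "worker-%d.log" % n)
    for i in range(iterations):
        with open(path, "a") as f:        # light, log-style appends
            f.write("iteration %d\n" % i)
        if random.random() < 0.5:         # occasional read-back
            with open(path, "rb") as f:
                f.read()

os.makedirs(MOUNT, exist_ok=True)
threads = [threading.Thread(target=worker, args=(n,)) for n in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()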




* Re: Spurious instability with NFSoRDMA under moderate load
  2021-05-17 16:27 ` Chuck Lever III
  2021-05-17 17:37   ` Timo Rothenpieler
@ 2021-05-19 15:20   ` Leon Romanovsky
  1 sibling, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2021-05-19 15:20 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Timo Rothenpieler, Linux NFS Mailing List, linux-rdma

On Mon, May 17, 2021 at 04:27:29PM +0000, Chuck Lever III wrote:
> Hello Timo-
> 
> > On May 16, 2021, at 1:29 PM, Timo Rothenpieler <timo@rothenpieler.org> wrote:
> > 
> > This has happened 3 times so far over the last couple months, and I do not have a clear way to reproduce it.
> > It happens under moderate load, when lots of nodes read from and write to the server, though not in any particularly intense way: just normal program execution, writing of light logs, and other routine tasks.
> > 
> > The issues on the clients manifest in a multitude of ways. Most of the time, random I/O operations simply fail; occasionally they hang indefinitely and leave the process unkillable.
> > Another example would be: "Failed to remove '.../.nfs00000000007b03af00000001': Device or resource busy"
> > 
> > Once a client is in that state, the only way to get it back into order is a reboot.
> > 
> > On the server side, a single error cqe is dumped each time this problem happens. So far, I have always rebooted the server as well, to make sure everything is back in order; I'm not sure that is strictly necessary.
> > 
> >> [561889.198889] infiniband mlx5_0: dump_cqe:272:(pid 709): dump error cqe
> >> [561889.198945] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [561889.198984] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [561889.199023] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [561889.199061] 00000030: 00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2
> > 
> >> [985074.602880] infiniband mlx5_0: dump_cqe:272:(pid 599): dump error cqe
> >> [985074.602921] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [985074.602946] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [985074.602970] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [985074.602994] 00000030: 00 00 00 00 00 00 88 13 08 00 01 46 f2 93 0b d3
> > 
> >> [1648894.168819] infiniband ibp1s0: dump_cqe:272:(pid 696): dump error cqe
> >> [1648894.168853] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [1648894.168878] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [1648894.168903] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [1648894.168928] 00000030: 00 00 00 00 00 00 88 13 08 00 01 08 6b d2 b9 d3
> 
> I'm hoping Leon can get out his magic decoder ring and tell us if
> these CQE dumps contain a useful WC status code.

Unfortunately, no. I failed to parse it; if I read the dump correctly,
it is not marked as an error and the opcode is 0.

Thanks


* Re: Spurious instability with NFSoRDMA under moderate load
  2021-05-17 17:37   ` Timo Rothenpieler
@ 2021-06-21 16:06     ` Timo Rothenpieler
  2021-06-21 16:28       ` Chuck Lever III
  2021-08-10 12:49       ` Timo Rothenpieler
  0 siblings, 2 replies; 10+ messages in thread
From: Timo Rothenpieler @ 2021-06-21 16:06 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Linux NFS Mailing List, linux-rdma


On 17.05.2021 19:37, Timo Rothenpieler wrote:
> On 17.05.2021 18:27, Chuck Lever III wrote:
>> Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if
>> the problem persists.
> 
> I updated the NFS server to 5.12.4 now and will observe and maybe try to 
> cause some mixed load.
> 

Had this running on 5.12 for a month now, and haven't observed any kind 
of instability so far.
So I'd carefully assume that something that went in after 5.10 and made 
it into 5.11 or 5.12 fixed or greatly improved the situation.

Sadly, I can't really bisect this in any sensible way, given the 
extremely long and uncertain turnaround times.




* Re: Spurious instability with NFSoRDMA under moderate load
  2021-06-21 16:06     ` Timo Rothenpieler
@ 2021-06-21 16:28       ` Chuck Lever III
  2021-08-10 12:49       ` Timo Rothenpieler
  1 sibling, 0 replies; 10+ messages in thread
From: Chuck Lever III @ 2021-06-21 16:28 UTC (permalink / raw)
  To: Timo Rothenpieler; +Cc: Linux NFS Mailing List, linux-rdma



> On Jun 21, 2021, at 12:06 PM, Timo Rothenpieler <timo@rothenpieler.org> wrote:
> 
> On 17.05.2021 19:37, Timo Rothenpieler wrote:
>> On 17.05.2021 18:27, Chuck Lever III wrote:
>>> Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if
>>> the problem persists.
>> I updated the NFS server to 5.12.4 now and will observe and maybe try to cause some mixed load.
> 
> Had this running on 5.12 for a month now, and haven't observed any kind of instability so far.
> So I'd carefully assume that something that went in after 5.10 and made it into 5.11 or 5.12 fixed or greatly improved the situation.
> 
> Sadly, I can't really bisect this in any sensible way, given the extremely long and uncertain turnaround times.

Thank you for the update!


--
Chuck Lever





* Re: Spurious instability with NFSoRDMA under moderate load
  2021-06-21 16:06     ` Timo Rothenpieler
  2021-06-21 16:28       ` Chuck Lever III
@ 2021-08-10 12:49       ` Timo Rothenpieler
       [not found]         ` <a28b403e-42cf-3189-a4db-86d20da1b7aa@rothenpieler.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Timo Rothenpieler @ 2021-08-10 12:49 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Linux NFS Mailing List, linux-rdma


On 21.06.2021 18:06, Timo Rothenpieler wrote:
> On 17.05.2021 19:37, Timo Rothenpieler wrote:
>> On 17.05.2021 18:27, Chuck Lever III wrote:
>>> Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if
>>> the problem persists.
>>
>> I updated the NFS server to 5.12.4 now and will observe and maybe try 
>> to cause some mixed load.
>>
> 
> Had this running on 5.12 for a month now, and haven't observed any kind 
> of instability so far.
> So I'd carefully assume that something that went in after 5.10 and made 
> it into 5.11 or 5.12 fixed or greatly improved the situation.
> 
> Sadly, I can't really bisect this in any sensible way, given the 
> extremely long and uncertain turnaround times.
> 

OK, so this just happened again, for the first time since upgrading to 5.12.
It was the exact same thing, except this time no error cqe was dumped 
simultaneously (one still appeared in dmesg, but over a week before the 
issue showed up), so I'll assume the cqe dump is unrelated to this issue.

I had no issues while running 5.12.12 and below. I recently (14 days ago 
or so) updated to 5.12.19, and now it's happening again.
Unfortunately, given how rarely this issue happens, this could either be 
a regression between those two versions, or the problem was present all 
along and just never triggered for several months.

Makes me wonder if this is somehow related to the problem described in 
"NFS server regression in kernel 5.13 (tested w/ 5.13.9)". But the 
pattern of the issue does not look all that similar, given that for me, 
the hangs never recover, and I have RDMA in the mix.




* Re: Spurious instability with NFSoRDMA under moderate load
       [not found]         ` <a28b403e-42cf-3189-a4db-86d20da1b7aa@rothenpieler.org>
@ 2021-08-10 17:17           ` Chuck Lever III
  2021-08-10 21:40             ` Timo Rothenpieler
  0 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever III @ 2021-08-10 17:17 UTC (permalink / raw)
  To: Timo Rothenpieler; +Cc: Linux NFS Mailing List, linux-rdma



> On Aug 10, 2021, at 10:06 AM, Timo Rothenpieler <timo@rothenpieler.org> wrote:
> 
> On 10.08.2021 14:49, Timo Rothenpieler wrote:
>> OK, so this just happened again, for the first time since upgrading to 5.12.
>> It was the exact same thing, except this time no error cqe was dumped simultaneously (one still appeared in dmesg, but over a week before the issue showed up), so I'll assume the cqe dump is unrelated to this issue.
>> I had no issues while running 5.12.12 and below. I recently (14 days ago or so) updated to 5.12.19, and now it's happening again.
>> Unfortunately, given how rarely this issue happens, this could either be a regression between those two versions, or the problem was present all along and just never triggered for several months.
>> Makes me wonder if this is somehow related to the problem described in "NFS server regression in kernel 5.13 (tested w/ 5.13.9)". But the pattern of the issue does not look all that similar, given that for me, the hangs never recover, and I have RDMA in the mix.
> 
> Here's some trace and debug data I collected while it was in a state where everything using copy_range would get stuck.
> The file(s) in question are named "tmp-copy-*", with a bunch of numbers at the end.
> 
> This also seems to be entirely a client side issue, since other NFS clients on the same server work just fine the whole time, and a reboot of that one affected client makes it work normally again, without touching the server.
> 
> In this instance right now, I was even unable to cleanly reboot the client machine, since it got completely stuck, seemingly unable to unmount the NFS shares, hanging the same way other operations would.
> 
> What confuses me the most is that while it is in this messed up state, it generally still works fine, and it's only a few very specific operations (the only one I know of for sure is copy_range) that get stuck.
> The processes also aren't unkillable: a SIGTERM or SIGKILL will end them, and that also flushes NFS. I captured that in dmesg_output2.

What I see in this data is that the server is reporting

   SEQ4_STATUS_CB_PATH_DOWN

and the client is attempting to recover (repeatedly) using
BIND_CONN_TO_SESSION. But apparently the recovery didn't
actually work, because the server continues to report a
callback path problem.

[1712389.125641] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.129264] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

[1712389.171953] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.178361] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

[1712389.195606] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.203891] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
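
For reference, that flags word can be decoded directly: 0x00000001 is 
SEQ4_STATUS_CB_PATH_DOWN. Here is a minimal sketch, assuming the 
sr_status_flags bit values from RFC 5661, section 18.46 (the kernel's 
include/linux/nfs4.h should agree, but verify there if in doubt):

# Name the bits in the SEQUENCE sr_status_flags word logged above.
# Flag values as defined in RFC 5661, section 18.46.
SEQ4_STATUS_FLAGS = {
    0x00000001: "SEQ4_STATUS_CB_PATH_DOWN",
    0x00000002: "SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING",
    0x00000004: "SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED",
    0x00000008: "SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED",
    0x00000010: "SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED",
    0x00000020: "SEQ4_STATUS_ADMIN_STATE_REVOKED",
    0x00000040: "SEQ4_STATUS_RECALLABLE_STATE_REVOKED",
    0x00000080: "SEQ4_STATUS_LEASE_MOVED",
    0x00000100: "SEQ4_STATUS_RESTART_RECLAIM_NEEDED",
    0x00000200: "SEQ4_STATUS_CB_PATH_DOWN_SESSION",
    0x00000400: "SEQ4_STATUS_BACKCHANNEL_FAULT",
}

def decode_status_flags(flags):
    # Return the names of all set status bits.
    return [name for bit, name in SEQ4_STATUS_FLAGS.items() if flags & bit]

print(decode_status_flags(0x00000001))  # ['SEQ4_STATUS_CB_PATH_DOWN']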

I guess it's time to switch to tracing on the server side
to see if you can nail down why the server's callback
requests are failing. On your NFS server, run:

 # trace-cmd record -e nfsd -e sunrpc -e rpcgss -e rpcrdma

at roughly the same point during your test that you captured
the previous client-side trace.

--
Chuck Lever





* Re: Spurious instability with NFSoRDMA under moderate load
  2021-08-10 17:17           ` Chuck Lever III
@ 2021-08-10 21:40             ` Timo Rothenpieler
       [not found]               ` <141fdf51-2aa1-6614-fe4e-96f168cbe6cf@rothenpieler.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Timo Rothenpieler @ 2021-08-10 21:40 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Linux NFS Mailing List, linux-rdma


On 10.08.2021 19:17, Chuck Lever III wrote:
> 
> What I see in this data is that the server is reporting
> 
>     SEQ4_STATUS_CB_PATH_DOWN
> 
> and the client is attempting to recover (repeatedly) using
> BIND_CONN_TO_SESSION. But apparently the recovery didn't
> actually work, because the server continues to report a
> callback path problem.
> 
> [1712389.125641] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
> [1712389.129264] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
> 
> [1712389.171953] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
> [1712389.178361] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
> 
> [1712389.195606] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
> [1712389.203891] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
> 
> I guess it's time to switch to tracing on the server side
> to see if you can nail down why the server's callback
> requests are failing. On your NFS server, run:
> 
>   # trace-cmd record -e nfsd -e sunrpc -e rpcgss -e rpcrdma
> 
> at roughly the same point during your test that you captured
> the previous client-side trace.

I wonder if reverting 6820bf77864d on the server, to have an easier way 
to reproduce this state, would be worth it.

Because it seems like the actual underlying issue is the inability of the 
NFS server (and/or client) to re-establish the backchannel if it gets 
disconnected for whatever reason?

Right now I have already rebooted the client and everything is working 
again, so otherwise I'll have to wait a potentially long time for this to 
happen again.




* Re: Spurious instability with NFSoRDMA under moderate load
       [not found]               ` <141fdf51-2aa1-6614-fe4e-96f168cbe6cf@rothenpieler.org>
@ 2021-08-11  0:19                 ` Chuck Lever III
  0 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever III @ 2021-08-11  0:19 UTC (permalink / raw)
  To: Timo Rothenpieler; +Cc: Linux NFS Mailing List, linux-rdma


> On Aug 10, 2021, at 8:07 PM, Timo Rothenpieler <timo@rothenpieler.org> wrote:
> 
> On 10.08.2021 23:40, Timo Rothenpieler wrote:
>>> On 10.08.2021 19:17, Chuck Lever III wrote:
>>> 
>>> What I see in this data is that the server is reporting
>>> 
>>>     SEQ4_STATUS_CB_PATH_DOWN
>>> 
>>> and the client is attempting to recover (repeatedly) using
>>> BIND_CONN_TO_SESSION. But apparently the recovery didn't
>>> actually work, because the server continues to report a
>>> callback path problem.
>>> 
>>> [1712389.125641] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
>>> [1712389.129264] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
>>> 
>>> [1712389.171953] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
>>> [1712389.178361] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
>>> 
>>> [1712389.195606] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
>>> [1712389.203891] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
>>> 
>>> I guess it's time to switch to tracing on the server side
>>> to see if you can nail down why the server's callback
>>> requests are failing. On your NFS server, run:
>>> 
>>>   # trace-cmd record -e nfsd -e sunrpc -e rpcgss -e rpcrdma
>>> 
>>> at roughly the same point during your test that you captured
>>> the previous client-side trace.
>> I wonder if reverting 6820bf77864d on the server, to have an easier way to reproduce this state, would be worth it.
>> Because it seems like the actual underlying issue is the inability of the NFS server (and/or client) to re-establish the backchannel if it gets disconnected for whatever reason?
>> Right now I have already rebooted the client and everything is working again, so otherwise I'll have to wait a potentially long time for this to happen again.
> 
> I actually ended up doing that, and sure enough, on 5.12.19, with 6820bf77864d reverted, it ends up getting stuck very quickly.
> Using xfs_io with copy_range this time, to have a more isolated test case.
> 
> Though the trace logs I'm getting from that look different; see the attached traces and rpcdebug-enabled dmesg from both client and server.
> There is no appearance of "nfs41_handle_sequence_flag_errors" or "nfs4_bind_conn_to_session" whatsoever, and it does not seem like it's trying to recover the timed-out callback channel at all.
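
A minimal stand-in for that xfs_io copy_range test, sketched with 
Python's os.copy_file_range (3.8+), which issues the same 
copy_file_range(2) syscall; the paths below are placeholders for files 
on the affected NFS mount:

# Minimal copy_range exercise; a hang here would reproduce the stuck
# state described above. SRC/DST are placeholder paths.
import os

SRC = "/mnt/nfs/tmp-copy-src"   # placeholder: file on the affected mount
DST = "/mnt/nfs/tmp-copy-dst"   # placeholder: file on the affected mount

def copy_once(size=64 * 1024 * 1024):
    # Write a source file, then copy it with copy_file_range(2), which an
    # NFSv4.2 client can turn into a server-side COPY.
    with open(SRC, "wb") as f:
        f.write(os.urandom(size))
    with open(SRC, "rb") as src, open(DST, "wb") as dst:
        remaining = size
        while remaining > 0:
            copied = os.copy_file_range(src.fileno(), dst.fileno(),
                                        remaining)
            if copied == 0:
                break
            remaining -= copied

for i in range(100):
    copy_once()
    print("copy %d completed" % i)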

Thanks, I’ll take a look at the new information tomorrow.


> So I'm not sure if the other issue that spuriously happens is related.
> But there definitely is an issue with it not re-establishing the backchannel, so fixing one might fix the other as well.
> <rpcdebug.tar.xz>



Thread overview: 10+ messages
2021-05-16 17:29 Spurious instability with NFSoRDMA under moderate load Timo Rothenpieler
2021-05-17 16:27 ` Chuck Lever III
2021-05-17 17:37   ` Timo Rothenpieler
2021-06-21 16:06     ` Timo Rothenpieler
2021-06-21 16:28       ` Chuck Lever III
2021-08-10 12:49       ` Timo Rothenpieler
     [not found]         ` <a28b403e-42cf-3189-a4db-86d20da1b7aa@rothenpieler.org>
2021-08-10 17:17           ` Chuck Lever III
2021-08-10 21:40             ` Timo Rothenpieler
     [not found]               ` <141fdf51-2aa1-6614-fe4e-96f168cbe6cf@rothenpieler.org>
2021-08-11  0:19                 ` Chuck Lever III
2021-05-19 15:20   ` Leon Romanovsky
