From: dai.ngo@oracle.com
To: Jeff Layton <jlayton@kernel.org>,
	Chuck Lever III <chuck.lever@oracle.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Helen Chao <helen.chao@oracle.com>,
	Anna Schumaker <anna@kernel.org>,
	Trond Myklebust <trondmy@hammerspace.com>
Subject: Re: [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status
Date: Thu, 6 Apr 2023 13:58:17 -0700
Message-ID: <b50336cf-09a3-60a9-0100-0a25adf4ee55@oracle.com>
In-Reply-To: <c659b24274539086c3adad8af4c7d30cf87ee83a.camel@kernel.org>


On 4/6/23 12:59 PM, Jeff Layton wrote:
> On Thu, 2023-04-06 at 19:43 +0000, Chuck Lever III wrote:
>>> On Apr 6, 2023, at 3:36 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>
>>> Hi Jeff,
>>>
>>> Thank you for taking a look at the patch.
>>>
>>> On 4/6/23 11:10 AM, Jeff Layton wrote:
>>>> On Thu, 2023-04-06 at 13:33 -0400, Jeff Layton wrote:
>>>>> On Tue, 2023-03-14 at 09:19 -0700, dai.ngo@oracle.com wrote:
>>>>>> On 3/8/23 11:03 AM, dai.ngo@oracle.com wrote:
>>>>>>> On 3/8/23 10:50 AM, Chuck Lever III wrote:
>>>>>>>>> On Mar 8, 2023, at 1:45 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>>>>>>>
>>>>>>>>> Currently call_bind_status places a hard limit of 3 on the number of
>>>>>>>>> retries on an EACCES error. This limit was added to accommodate the
>>>>>>>>> behavior of a buggy server that keeps returning garbage when the NLM
>>>>>>>>> daemon is killed on the NFS server. However, this change causes problems
>>>>>>>>> for other servers that take a little longer than 9 seconds for the port
>>>>>>>>> mapper to
>>>>>>>>>
>>>>>>>>>
>>>> Actually, the EACCES error means that the host doesn't have the port
>>>> registered.
>>> Yes, our SQA team runs a stress lock test and restarts the NFS server.
>>> Sometimes the NFS server starts up and registers with the port mapper in
>>> time and there is no problem, but occasionally it comes up late, causing
>>> this EACCES error.
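
For context, the rebind retry logic in question looks roughly like this
(paraphrasing the EACCES arm of call_bind_status() in net/sunrpc/clnt.c
from memory; the exact code differs a bit between kernel versions):

	case -EACCES:
		/* tk_rebind_retry starts at 2 when the task is set up,
		 * giving 3 bind attempts total, 3 seconds apart (~9s) */
		if (task->tk_rebind_retry == 0)
			break;			/* give up; fail the bind */
		task->tk_rebind_retry--;
		rpc_delay(task, 3*HZ);		/* back off, then rebind */
		goto retry_timeout;
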
>>>
>>>>   That could happen if (e.g.) the host had an NFSv3 mount up
>>>> with an NLM connection and then crashed and rebooted and didn't remount
>>>> it.
>>> Can you please explain this scenario? I don't quite follow it. If the v3
>>> client crashed and did not remount the export, then how can the client try
>>> to access/lock anything on the server? I must be missing something here.
>>>
>>>>   
> Suppose you have a client with an admin that mounts an NFSv3 mount "by
> hand" (and doesn't set up statd to run at boot time). The client requests
> an NLM lock and then reboots.
>
> When it comes up, there's no notification to the server that the client
> rebooted. Later, the lock becomes free and the server tries to grant it
> to the client. It talks to rpcbind but lockd is never started and the
> server keeps querying the client's rpcbind forever.
>
> Maybe a more likely situation: the client crashes and loses its DHCP
> address when it comes back up, and the old address gets reassigned to
> another host that has rpcbind running but no NFS.
>
> Either way, it'd keep trying to call the client back indefinitely that
> way.

Got it, Jeff, thank you for the explanation. This is the case where NLM
requests originate from the NFS server.

>
>>>>>>>>> become ready when the NFS server is restarted.
>>>>>>>>>
>>>>>>>>> This patch removes this hard-coded limit and lets the RPC layer handle
>>>>>>>>> the retry according to whether the export is soft or hard mounted.
>>>>>>>>>
>>>>>>>>> To avoid the hang with a buggy server, the client can use a soft mount
>>>>>>>>> for the export.
>>>>>>>>>
>>>>>>>>> Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
>>>>>>>>> Reported-by: Helen Chao <helen.chao@oracle.com>
>>>>>>>>> Tested-by: Helen Chao <helen.chao@oracle.com>
>>>>>>>>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>>>>>>>> Helen is the royal queen of ^C  ;-)
>>>>>>>>
>>>>>>>> Did you try ^C on a mount while it waits for a rebind?
>>>>>>> She uses a test script that restarts the NFS server while an NLM lock
>>>>>>> test is running. The failure is random: sometimes it fails and sometimes
>>>>>>> it passes, depending on when the LOCK/UNLOCK requests come in, so I think
>>>>>>> it's hard to time the ^C, but I will ask.
>>>>>> We did the test with ^C and here is what we found.
>>>>>>
>>>>>> For a synchronous RPC task, the signal was delivered to the RPC task and
>>>>>> the task exited with -ERESTARTSYS from __rpc_execute, as expected.
>>>>>>
>>>>>> For an asynchronous RPC task, the process that invokes the RPC task to
>>>>>> send the request detected the signal in rpc_wait_for_completion_task and
>>>>>> exited with -ERESTARTSYS. However, the async RPC task was allowed to
>>>>>> continue running to completion. So if the async RPC task was retrying an
>>>>>> operation and the NFS server was down, it would retry forever on a hard
>>>>>> mount, or until the NFS server came back up.
>>>>>>
>>>>>> The question for the list is: should we propagate the signal to the
>>>>>> async task via rpc_signal_task to stop its execution, or just leave
>>>>>> it alone?
>>>>>>
>>>>>>
>>>>> That is a good question.
>>> Maybe we should defer the propagation of the signal until later. Chuck has
>>> some concerns, since this could change the existing behavior of async tasks
>>> for other v4 operations.
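
(If we do revisit signal propagation later, I'd expect it to look
something like this hypothetical, untested sketch, where the process
waiting on the async task forwards its signal instead of just
returning:

	status = rpc_wait_for_completion_task(task);
	if (status < 0) {
		/* we caught a signal while waiting (-ERESTARTSYS):
		 * also tell the async task to stop, instead of
		 * letting it keep retrying in the background */
		rpc_signal_task(task);
	}

rpc_wait_for_completion_task() and rpc_signal_task() already exist in
net/sunrpc/sched.c; the open question is whether wiring them together
like this is safe for the other async v4 users.)
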
>>>
> Fair enough.
>
>>>>> I like the patch overall, as it gets rid of a special one-off retry
>>>>> counter, but I too share some concerns about retrying indefinitely when
>>>>> a server goes missing.
>>>>>
>>>>> Propagating a signal seems like the right thing to do. Looks like
>>>>> rpcb_getport_done would also need to grow a check for RPC_SIGNALLED?
>>>>>
>>>>> It sounds pretty straightforward otherwise.
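
For what it's worth, I'd expect that to be a small check near the top
of the completion callback, along these lines (hypothetical sketch
against net/sunrpc/rpcb_clnt.c, untested; a real patch would still
need to release the rpcbind resources properly):

	static void rpcb_getport_done(struct rpc_task *child, void *data)
	{
		/* hypothetical addition: a signalled task should stop
		 * rebinding rather than treat this as an ordinary
		 * bind failure */
		if (RPC_SIGNALLED(child)) {
			/* surface -ERESTARTSYS to the parent task;
			 * cleanup details elided in this sketch */
			return;
		}

		/* ... existing result handling continues here ... */
	}
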
>>>> Erm, except that some of these xprts are in the context of the server.
>>>>
>>>> For instance: server-side lockd sometimes has to send messages to the
>>>> clients (e.g. GRANTED callbacks). Suppose we're trying to send a message
>>>> to the client, but it has crashed and rebooted...or maybe the client's
>>>> address got handed to another host that isn't doing NFS at all so the
>>>> NLM service will never come back.
>>>>
>>>> This will mean that those RPCs will retry forever now in this situation.
>>>> I'm not sure that's what we want. Maybe we need some way to distinguish
>>>> between "user-driven" RPCs and those that aren't?
>>>>
>>>> As a simpler workaround, would it work to just increase the number of
>>>> retries here? There's nothing magical about 3 tries. ISTM that having it
>>>> retry enough times to cover at least a minute or two would be
>>>> acceptable.
>>> I'm happy with the simple workaround of just increasing the number of
>>> retries. This would fix the immediate problem that we're facing without
>>> the potential of breaking things in other areas. If you agree, then I will
>>> prepare the simpler patch for this.
>> A longer retry period is a short-term solution. I can get behind
>> that as long as the patch description makes clear some of the
>> concerns that have been brought up in this email thread.
>>
> Sounds good to me too -- 9s is a very short amount of time, IMO. We
> generally wait on the order of minutes for RPC timeouts and this seems
> like a similar situation.

I'll prepare the patch to increase the timeout.
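
For the record, the simple version would be something along these
lines (illustrative values only; and if memory serves, tk_rebind_retry
is a 2-bit field in struct rpc_task, so it would also need to be
widened to hold a larger count):

	-	task->tk_rebind_retry = 2;	/* 3 tries x 3s delay = ~9s  */
	+	task->tk_rebind_retry = 40;	/* at 3s per try, ~2 minutes */
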

Thanks,
-Dai


Thread overview: 16+ messages
2023-03-08 18:45 [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status Dai Ngo
2023-03-08 18:50 ` Chuck Lever III
2023-03-08 19:03   ` dai.ngo
2023-03-14 16:19     ` dai.ngo
2023-03-16 13:38       ` Fwd: " Chuck Lever III
2023-03-27 16:05         ` dai.ngo
2023-04-06 17:33       ` Jeff Layton
2023-04-06 18:10         ` Jeff Layton
2023-04-06 19:36           ` dai.ngo
2023-04-06 19:43             ` Chuck Lever III
2023-04-06 19:59               ` Jeff Layton
2023-04-06 20:58                 ` dai.ngo [this message]
2023-04-06 21:51                   ` Jeff Layton
2023-04-06 20:58               ` dai.ngo
2023-04-18 20:19 Dai Ngo
2023-04-19 10:06 ` Jeff Layton
