linux-nfs.vger.kernel.org archive mirror
From: dai.ngo@oracle.com
To: Chuck Lever III <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Helen Chao <helen.chao@oracle.com>,
	Anna Schumaker <anna@kernel.org>,
	Trond Myklebust <trondmy@hammerspace.com>
Subject: Re: [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status
Date: Thu, 6 Apr 2023 13:58:09 -0700	[thread overview]
Message-ID: <4b60c364-2086-c07d-2ed0-f296af90bd30@oracle.com> (raw)
In-Reply-To: <F925BB7E-4E1C-48CD-A0C0-A058058E55BD@oracle.com>


On 4/6/23 12:43 PM, Chuck Lever III wrote:
>
>> On Apr 6, 2023, at 3:36 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>
>> Hi Jeff,
>>
>> Thank you for taking a look at the patch.
>>
>> On 4/6/23 11:10 AM, Jeff Layton wrote:
>>> On Thu, 2023-04-06 at 13:33 -0400, Jeff Layton wrote:
>>>> On Tue, 2023-03-14 at 09:19 -0700, dai.ngo@oracle.com wrote:
>>>>> On 3/8/23 11:03 AM, dai.ngo@oracle.com wrote:
>>>>>> On 3/8/23 10:50 AM, Chuck Lever III wrote:
>>>>>>>> On Mar 8, 2023, at 1:45 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>>>>>>
>>>>>>>> Currently call_bind_status places a hard limit of 3 on the number of
>>>>>>>> retries on an EACCES error. This limit was added to accommodate the
>>>>>>>> behavior of a buggy server that keeps returning garbage when the NLM
>>>>>>>> daemon is killed on the NFS server. However, this change causes
>>>>>>>> problems for other servers that take a little longer than 9 seconds
>>>>>>>> for the port mapper to
>>> Actually, the EACCES error means that the host doesn't have the port
>>> registered.
>> Yes, our SQA team runs a stress lock test and restarts the NFS server.
>> Sometimes the NFS server starts up and registers with the port mapper in
>> time and there is no problem, but occasionally it comes up late, causing
>> this EACCES error.
>>
>>>   That could happen if (e.g.) the host had a NFSv3 mount up
>>> with an NLM connection and then crashed and rebooted and didn't remount
>>> it.
>> Can you please explain this scenario? I don't quite follow it. If the v3
>> client crashed and did not remount the export, then how can the client try
>> to access/lock anything on the server? I must be missing something here.
>>
>>>   
>>>>>>>> become ready when the NFS server is restarted.
>>>>>>>>
>>>>>>>> This patch removes this hard-coded limit and lets the RPC layer handle
>>>>>>>> the retry according to whether the export is soft or hard mounted.
>>>>>>>>
>>>>>>>> To avoid the hang with buggy server, the client can use soft mount for
>>>>>>>> the export.
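
For reference, a soft NFSv3 mount of the kind suggested above might look
like this (server name, export path, and option values are illustrative,
not taken from this thread):

```shell
# Illustrative only: with "soft", RPC retries eventually fail with an
# error (typically EIO) instead of hanging forever on an unresponsive
# server; "hard" (the default) retries indefinitely.
mount -t nfs -o vers=3,soft,timeo=100,retrans=3 server:/export /mnt
```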
>>>>>>>>
>>>>>>>> Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
>>>>>>>> Reported-by: Helen Chao <helen.chao@oracle.com>
>>>>>>>> Tested-by: Helen Chao <helen.chao@oracle.com>
>>>>>>>> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
>>>>>>> Helen is the royal queen of ^C  ;-)
>>>>>>>
>>>>>>> Did you try ^C on a mount while it waits for a rebind?
>>>>>> She uses a test script that restarts the NFS server while an NLM lock
>>>>>> test is running. The failure is random: sometimes it fails and sometimes
>>>>>> it passes, depending on when the LOCK/UNLOCK requests come in, so I think
>>>>>> it's hard to time the ^C, but I will ask.
>>>>> We did the test with ^C and here is what we found.
>>>>>
>>>>> For a synchronous RPC task, the signal was delivered to the RPC task and
>>>>> the task exited with -ERESTARTSYS from __rpc_execute, as expected.
>>>>>
>>>>> For an asynchronous RPC task, the process that invoked the RPC task to
>>>>> send the request detected the signal in rpc_wait_for_completion_task and
>>>>> exited with -ERESTARTSYS. However, the async RPC task was allowed to
>>>>> continue running to completion. So if the async RPC task was retrying an
>>>>> operation and the NFS server was down, it will retry forever if this is a
>>>>> hard mount, or until the NFS server comes back up.
>>>>>
>>>>> The question for the list is: should we propagate the signal to the async
>>>>> task via rpc_signal_task to stop its execution, or just leave it alone as is?
>>>>>
>>>>>
>>>> That is a good question.
>> Maybe we should defer propagating the signal until later. Chuck has
>> some concerns, since this can change the existing behavior of async tasks
>> for other v4 operations.
>>
>>>> I like the patch overall, as it gets rid of a special one-off retry
>>>> counter, but I too share some concerns about retrying indefinitely when
>>>> a server goes missing.
>>>>
>>>> Propagating a signal seems like the right thing to do. Looks like
>>>> rpcb_getport_done would also need to grow a check for RPC_SIGNALLED ?
>>>>
>>>> It sounds pretty straightforward otherwise.
>>> Erm, except that some of these xprts are in the context of the server.
>>>
>>> For instance: server-side lockd sometimes has to send messages to the
>>> clients (e.g. GRANTED callbacks). Suppose we're trying to send a message
>>> to the client, but it has crashed and rebooted...or maybe the client's
>>> address got handed to another host that isn't doing NFS at all so the
>>> NLM service will never come back.
>>>
>>> This will mean that those RPCs will retry forever now in this situation.
>>> I'm not sure that's what we want. Maybe we need some way to distinguish
>>> between "user-driven" RPCs and those that aren't?
>>>
>>> As a simpler workaround, would it work to just increase the number of
>>> retries here? There's nothing magical about 3 tries. ISTM that having it
>>> retry enough times to cover at least a minute or two would be
>>> acceptable.
>> I'm happy with the simple workaround of just increasing the number of
>> retries. This would fix the immediate problem that we're facing without
>> the potential of breaking things in other areas. If you agree, then I will
>> prepare the simpler patch for this.
> A longer retry period is a short-term solution. I can get behind
> that as long as the patch description makes clear some of the
> concerns that have been brought up in this email thread.

Thank you Chuck, will do.

-Dai

>
>
> --
> Chuck Lever
>
>


Thread overview: 16+ messages
2023-03-08 18:45 [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status Dai Ngo
2023-03-08 18:50 ` Chuck Lever III
2023-03-08 19:03   ` dai.ngo
2023-03-14 16:19     ` dai.ngo
2023-03-16 13:38       ` Fwd: " Chuck Lever III
2023-03-27 16:05         ` dai.ngo
2023-04-06 17:33       ` Jeff Layton
2023-04-06 18:10         ` Jeff Layton
2023-04-06 19:36           ` dai.ngo
2023-04-06 19:43             ` Chuck Lever III
2023-04-06 19:59               ` Jeff Layton
2023-04-06 20:58                 ` dai.ngo
2023-04-06 21:51                   ` Jeff Layton
2023-04-06 20:58               ` dai.ngo [this message]
2023-04-18 20:19 Dai Ngo
2023-04-19 10:06 ` Jeff Layton
