linux-nfs.vger.kernel.org archive mirror
From: Jeff Layton <jlayton@kernel.org>
To: dai.ngo@oracle.com, Chuck Lever III <chuck.lever@oracle.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Helen Chao <helen.chao@oracle.com>,
	Anna Schumaker <anna@kernel.org>,
	Trond Myklebust <trondmy@hammerspace.com>
Subject: Re: [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status
Date: Thu, 06 Apr 2023 17:51:58 -0400
Message-ID: <b849b95d225cd89d43d2b094d183455249309838.camel@kernel.org>
In-Reply-To: <b50336cf-09a3-60a9-0100-0a25adf4ee55@oracle.com>

On Thu, 2023-04-06 at 13:58 -0700, dai.ngo@oracle.com wrote:
> On 4/6/23 12:59 PM, Jeff Layton wrote:
> > On Thu, 2023-04-06 at 19:43 +0000, Chuck Lever III wrote:
> > > > On Apr 6, 2023, at 3:36 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> > > > 
> > > > Hi Jeff,
> > > > 
> > > > Thank you for taking a look at the patch.
> > > > 
> > > > On 4/6/23 11:10 AM, Jeff Layton wrote:
> > > > > On Thu, 2023-04-06 at 13:33 -0400, Jeff Layton wrote:
> > > > > > On Tue, 2023-03-14 at 09:19 -0700, dai.ngo@oracle.com wrote:
> > > > > > > On 3/8/23 11:03 AM, dai.ngo@oracle.com wrote:
> > > > > > > > On 3/8/23 10:50 AM, Chuck Lever III wrote:
> > > > > > > > > > On Mar 8, 2023, at 1:45 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > Currently call_bind_status places a hard limit of 3 to the number of
> > > > > > > > > > retries on EACCES error. This limit was done to accommodate the
> > > > > > > > > > behavior
> > > > > > > > > > of a buggy server that keeps returning garbage when the NLM daemon is
> > > > > > > > > > killed on the NFS server. However this change causes problem for other
> > > > > > > > > > servers that take a little longer than 9 seconds for the port mapper to
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > Actually, the EACCES error means that the host doesn't have the port
> > > > > registered.
> > > > Yes, our SQA team runs stress lock test and restart the NFS server.
> > > > Sometimes the NFS server starts up and register to the port mapper in
> > > > time and there is no problem but occasionally it's late coming up causing
> > > > this EACCES error.
> > > > 
> > > > >   That could happen if (e.g.) the host had a NFSv3 mount up
> > > > > with an NLM connection and then crashed and rebooted and didn't remount
> > > > > it.
> > > > can you please explain this scenario I don't quite follow it. If the v3
> > > > client crashed and did not remount the export then how can the client try
> > > > to access/lock anything on the server? I must have missing something here.
> > > > 
> > > > >   
> > Suppose you have a client with an admin that mounts a NFSv3 mount "by
> > hand" (and doesn't set up statd to run at boot time). Client requests an
> > NLM lock and then reboots.
> > 
> > When it comes up, there's no notification to the server that the client
> > rebooted. Later, the lock becomes free and the server tries to grant it
> > to the client. It talks to rpcbind but lockd is never started and the
> > server keeps querying the client's rpcbind forever.
> > 
> > Maybe more likely situation: the client crashes and loses its DHCP
> > address when it comes back up, and the old addr gets reassigned to
> > another host that has rpcbind running but no NFS.
> > 
> > Either way, it'd keep trying to call the client back indefinitely that
> > way.
> 
> Got it Jeff, thank you for the explanation. This is when NLM requests
> are originated from the NFS server.
> 

Mostly, yes. The old, stateless NFS v2/v3 server code didn't have much
in the way of callbacks, and v4 (for the most part) doesn't rely on
rpcbind.

That said, there may be some RPC calls done by the v2/v3 NFS client that
don't have a direct connection to a client task. Consider stuff like
writeback requests. Signals won't do anything there, since there is no
user task to deliver them to.

I think keeping a hard timeout of some sort is probably prudent.
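
To make that concrete, here is a rough userspace sketch of what I mean by
a time-based cap, as opposed to the fixed retry count the patch removes.
This is not the code in call_bind_status; the struct, the function names
and the 3-second delay are all made up for illustration (the delay is only
inferred from "3 retries" taking about 9 seconds, as described earlier in
the thread).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a rebind budget; not a kernel data structure. */
struct bind_retry {
	unsigned int elapsed_secs;   /* time already spent on retries */
	unsigned int deadline_secs;  /* hard cap on total retry time */
	unsigned int delay_secs;     /* assumed pause before each rebind */
};

/*
 * Called after rpcbind reports that the service is not registered.
 * Returns true if the caller may wait delay_secs and rebind again,
 * false once the next wait would exceed the deadline.
 */
static bool bind_retry_allowed(struct bind_retry *r)
{
	assert(r->delay_secs > 0);	/* a zero delay would never terminate */
	if (r->elapsed_secs + r->delay_secs > r->deadline_secs)
		return false;
	r->elapsed_secs += r->delay_secs;
	return true;
}

/* Count how many rebind attempts a given budget permits. */
static unsigned int bind_retry_count(unsigned int deadline_secs,
				     unsigned int delay_secs)
{
	struct bind_retry r = { 0, deadline_secs, delay_secs };
	unsigned int n = 0;

	while (bind_retry_allowed(&r))
		n++;
	return n;
}
```

With a 9-second deadline and a 3-second delay this permits the same 3
retries as the current limit; raising the deadline gives a slow server
more time to register while still bounding callbacks to a client that
will never answer.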

> 
-- 
Jeff Layton <jlayton@kernel.org>


Thread overview: 16+ messages
2023-03-08 18:45 [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status Dai Ngo
2023-03-08 18:50 ` Chuck Lever III
2023-03-08 19:03   ` dai.ngo
2023-03-14 16:19     ` dai.ngo
2023-03-16 13:38       ` Fwd: " Chuck Lever III
2023-03-27 16:05         ` dai.ngo
2023-04-06 17:33       ` Jeff Layton
2023-04-06 18:10         ` Jeff Layton
2023-04-06 19:36           ` dai.ngo
2023-04-06 19:43             ` Chuck Lever III
2023-04-06 19:59               ` Jeff Layton
2023-04-06 20:58                 ` dai.ngo
2023-04-06 21:51                   ` Jeff Layton [this message]
2023-04-06 20:58               ` dai.ngo
2023-04-18 20:19 Dai Ngo
2023-04-19 10:06 ` Jeff Layton
