* Listen backlog set to 64
@ 2010-11-15 18:43 Mark Hills
  2010-11-16 18:20 ` J. Bruce Fields
  0 siblings, 1 reply; 21+ messages in thread
From: Mark Hills @ 2010-11-15 18:43 UTC (permalink / raw)
  To: linux-nfs

I am looking into an issue of hanging clients to a set of NFS servers, on 
a large HPC cluster.

My investigation took me to the RPC code, svc_create_socket().

	if (protocol == IPPROTO_TCP) {
		if ((error = kernel_listen(sock, 64)) < 0)
			goto bummer;
	}

A fixed backlog of 64 connections at the server seems like it could be too 
low on a cluster like this, particularly when the protocol opens and 
closes the TCP connection.

I wondered what the rationale is behind this number, particularly as it 
is a fixed value. Perhaps there is a reason why this has no effect on 
nfsd, or is this a FAQ for people on large systems?

The servers show overflow of a listening queue, which I imagine is 
related.

  $ netstat -s
  [...]
  TcpExt:
    6475 times the listen queue of a socket overflowed
    6475 SYNs to LISTEN sockets ignored

The affected servers are old, kernel 2.6.9. But this limit of 64 is 
consistent across that and the latest kernel source.

Thanks

-- 
Mark

* Re: Listen backlog set to 64
  2010-11-15 18:43 Listen backlog set to 64 Mark Hills
@ 2010-11-16 18:20 ` J. Bruce Fields
  2010-11-16 19:05   ` Mark Hills
  2010-11-16 22:08   ` Neil Brown
  0 siblings, 2 replies; 21+ messages in thread
From: J. Bruce Fields @ 2010-11-16 18:20 UTC (permalink / raw)
  To: Mark Hills; +Cc: linux-nfs, neilb

On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> I am looking into an issue of hanging clients to a set of NFS servers, on 
> a large HPC cluster.
> 
> My investigation took me to the RPC code, svc_create_socket().
> 
> 	if (protocol == IPPROTO_TCP) {
> 		if ((error = kernel_listen(sock, 64)) < 0)
> 			goto bummer;
> 	}
> 
> A fixed backlog of 64 connections at the server seems like it could be too 
> low on a cluster like this, particularly when the protocol opens and 
> closes the TCP connection.
> 
> I wondered what the rationale is behind this number, particularly as it 
> is a fixed value. Perhaps there is a reason why this has no effect on 
> nfsd, or is this a FAQ for people on large systems?
> 
> The servers show overflow of a listening queue, which I imagine is 
> related.
> 
>   $ netstat -s
>   [...]
>   TcpExt:
>     6475 times the listen queue of a socket overflowed
>     6475 SYNs to LISTEN sockets ignored
> 
> The affected servers are old, kernel 2.6.9. But this limit of 64 is 
> consistent across that and the latest kernel source.

Looks like the last time that was touched was 8 years ago, by Neil (below, from
historical git archive).

I'd be inclined to just keep doubling it until people don't complain,
unless it's very expensive.  (How much memory (or whatever else) does a
pending connection tie up?)

The clients should be retrying, though, shouldn't they?

--b.

commit df0afc51f2f74756135c8bc08ec01134eb6de287
Author: Neil Brown <neilb@cse.unsw.edu.au>
Date:   Thu Aug 22 21:21:39 2002 -0700

    [PATCH] Fix two problems with multiple concurrent nfs/tcp connects.
    
    1/ connect requests would be get lost...
      As the comment at the top of svcsock.c says when discussing
      SK_CONN:
     *          after a set, svc_sock_enqueue must be called.
    
      We didn't and so lost conneciton requests.
    
    2/ set the max accept backlog to a more reasonable number to cope
       with bursts of lots of connection requests.

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index ab28937..bbeee09 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -679,6 +679,8 @@ svc_tcp_accept(struct svc_sock *svsk)
                goto failed;            /* aborted connection or whatever */
        }
        set_bit(SK_CONN, &svsk->sk_flags);
+       svc_sock_enqueue(svsk);
+
        slen = sizeof(sin);
        err = ops->getname(newsock, (struct sockaddr *) &sin, &slen, 1);
        if (err < 0) {
@@ -1220,7 +1222,7 @@ svc_create_socket(struct svc_serv *serv, int protocol, str
        }
 
        if (protocol == IPPROTO_TCP) {
-               if ((error = sock->ops->listen(sock, 5)) < 0)
+               if ((error = sock->ops->listen(sock, 64)) < 0)
                        goto bummer;
        }


* Re: Listen backlog set to 64
  2010-11-16 18:20 ` J. Bruce Fields
@ 2010-11-16 19:05   ` Mark Hills
  2010-11-16 22:08   ` Neil Brown
  1 sibling, 0 replies; 21+ messages in thread
From: Mark Hills @ 2010-11-16 19:05 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs, neilb

On Tue, 16 Nov 2010, J. Bruce Fields wrote:

> On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > I am looking into an issue of hanging clients to a set of NFS servers, on 
> > a large HPC cluster.
> > 
> > My investigation took me to the RPC code, svc_create_socket().
> > 
> > 	if (protocol == IPPROTO_TCP) {
> > 		if ((error = kernel_listen(sock, 64)) < 0)
> > 			goto bummer;
> > 	}
> > 
> > A fixed backlog of 64 connections at the server seems like it could be too 
> > low on a cluster like this, particularly when the protocol opens and 
> > closes the TCP connection.
> > 
> > I wondered what the rationale is behind this number, particularly as it 
> > is a fixed value. Perhaps there is a reason why this has no effect on 
> > nfsd, or is this a FAQ for people on large systems?
> > 
> > The servers show overflow of a listening queue, which I imagine is 
> > related.
> > 
> >   $ netstat -s
> >   [...]
> >   TcpExt:
> >     6475 times the listen queue of a socket overflowed
> >     6475 SYNs to LISTEN sockets ignored
> > 
> > The affected servers are old, kernel 2.6.9. But this limit of 64 is 
> > consistent across that and the latest kernel source.
> 
> Looks like the last time that was touched was 8 years ago, by Neil (below, from
> historical git archive).
> 
> I'd be inclined to just keep doubling it until people don't complain,
> unless it's very expensive.  (How much memory (or whatever else) does a
> pending connection tie up?)

Perhaps SOMAXCONN could also be appropriate.
 
> The clients should be retrying, though, shouldn't they?

I think so, but a quick glance at net/sunrpc/clnt.c suggests that the 
timeouts are fixed, not randomised. With nothing to smooth out the load 
from a large number of (identical) clients, they could potentially 
continue this process for some time.

I may be looking at the wrong client code for a TCP connection, though; 
perhaps someone with more experience can comment. I hope to investigate 
further tomorrow.
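
Purely to illustrate what I mean by smoothing -- this is not code from 
net/sunrpc/clnt.c, and the names are made up -- a retry interval with 
some jitter might be computed like:

	/*
	 * Hypothetical sketch: add +/-25% of random jitter to a fixed
	 * retry interval so that many identical clients do not keep
	 * retrying in lock-step.  random32() is used here only as a
	 * convenient source of pseudo-random numbers.
	 */
	static unsigned long jittered_interval(unsigned long base)
	{
		unsigned long spread = base / 2;	/* 50% window */

		return base - spread / 2 + random32() % (spread + 1);
	}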

-- 
Mark

* Re: Listen backlog set to 64
  2010-11-16 18:20 ` J. Bruce Fields
  2010-11-16 19:05   ` Mark Hills
@ 2010-11-16 22:08   ` Neil Brown
  2010-11-29 20:59     ` J. Bruce Fields
  1 sibling, 1 reply; 21+ messages in thread
From: Neil Brown @ 2010-11-16 22:08 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Mark Hills, linux-nfs

On Tue, 16 Nov 2010 13:20:26 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > I am looking into an issue of hanging clients to a set of NFS servers, on 
> > a large HPC cluster.
> > 
> > My investigation took me to the RPC code, svc_create_socket().
> > 
> > 	if (protocol == IPPROTO_TCP) {
> > 		if ((error = kernel_listen(sock, 64)) < 0)
> > 			goto bummer;
> > 	}
> > 
> > A fixed backlog of 64 connections at the server seems like it could be too 
> > low on a cluster like this, particularly when the protocol opens and 
> > closes the TCP connection.
> > 
> > I wondered what the rationale is behind this number, particularly as it 
> > is a fixed value. Perhaps there is a reason why this has no effect on 
> > nfsd, or is this a FAQ for people on large systems?
> > 
> > The servers show overflow of a listening queue, which I imagine is 
> > related.
> > 
> >   $ netstat -s
> >   [...]
> >   TcpExt:
> >     6475 times the listen queue of a socket overflowed
> >     6475 SYNs to LISTEN sockets ignored
> > 
> > The affected servers are old, kernel 2.6.9. But this limit of 64 is 
> > consistent across that and the latest kernel source.
> 
> Looks like the last time that was touched was 8 years ago, by Neil (below, from
> historical git archive).
> 
> I'd be inclined to just keep doubling it until people don't complain,
> unless it's very expensive.  (How much memory (or whatever else) does a
> pending connection tie up?)

Surely we should "keep multiplying by 13" as that is what I did :-)

There is a sysctl 'somaxconn' which limits what a process can ask for in the
listen() system call, but as we bypass this syscall it doesn't directly
affect nfsd.
It defaults to SOMAXCONN == 128 but can be raised arbitrarily by the sysadmin.
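
For reference, the clamp happens in the listen() syscall path, which the
svc code never goes through (kernel_listen() calls sock->ops->listen()
directly) -- roughly, abridged from net/socket.c:

	SYSCALL_DEFINE2(listen, int, fd, int, backlog)
	{
		struct socket *sock;
		int err, fput_needed;
		int somaxconn;

		sock = sockfd_lookup_light(fd, &err, &fput_needed);
		if (sock) {
			/* userspace-requested backlog is capped here */
			somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
			if ((unsigned)backlog > somaxconn)
				backlog = somaxconn;

			err = security_socket_listen(sock, backlog);
			if (!err)
				err = sock->ops->listen(sock, backlog);

			fput_light(sock->file, fput_needed);
		}
		return err;
	}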

There is another sysctl 'max_syn_backlog' which looks like a system-wide
limit to the connect backlog.  This defaults to 256.  The comment says it is
adjusted between 128 and 1024 based on memory size, though that isn't clear
in the code (to me at least).

So we could:
  - hard code a new number
  - make this another sysctl configurable
  - auto-adjust it so that it "just works".

I would prefer the latter if it is possible.   Possibly we could adjust it
based on the number of nfsd threads, like we do for receive buffer space.
Maybe something arbitrary like:
   min(16 + 2 * number of threads, sock_net(sk)->core.sysctl_somaxconn)

which would get the current 64 at 24 threads, and can easily push up to 128
and beyond with more threads.
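
For concreteness, a sketch of what that might look like in
svc_create_socket() -- illustrative only, not a tested patch, and assuming
the thread count is reachable there as serv->sv_nrthreads:

	/* Illustrative: derive the accept backlog from the number of
	 * server threads, capped by net.core.somaxconn. */
	static int svc_listen_backlog(struct svc_serv *serv, struct sock *sk)
	{
		int backlog = 16 + 2 * serv->sv_nrthreads;
		int somaxconn = sock_net(sk)->core.sysctl_somaxconn;

		return backlog < somaxconn ? backlog : somaxconn;
	}

	...
	if (protocol == IPPROTO_TCP) {
		if ((error = kernel_listen(sock,
				svc_listen_backlog(serv, sock->sk))) < 0)
			goto bummer;
	}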

Or is that too arbitrary?

NeilBrown


> 
> The clients should be retrying, though, shouldn't they?
> 
> --b.
> 
> commit df0afc51f2f74756135c8bc08ec01134eb6de287
> Author: Neil Brown <neilb@cse.unsw.edu.au>
> Date:   Thu Aug 22 21:21:39 2002 -0700
> 
>     [PATCH] Fix two problems with multiple concurrent nfs/tcp connects.
>     
>     1/ connect requests would be get lost...
>       As the comment at the top of svcsock.c says when discussing
>       SK_CONN:
>      *          after a set, svc_sock_enqueue must be called.
>     
>       We didn't and so lost conneciton requests.
>     
>     2/ set the max accept backlog to a more reasonable number to cope
>        with bursts of lots of connection requests.
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index ab28937..bbeee09 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -679,6 +679,8 @@ svc_tcp_accept(struct svc_sock *svsk)
>                 goto failed;            /* aborted connection or whatever */
>         }
>         set_bit(SK_CONN, &svsk->sk_flags);
> +       svc_sock_enqueue(svsk);
> +
>         slen = sizeof(sin);
>         err = ops->getname(newsock, (struct sockaddr *) &sin, &slen, 1);
>         if (err < 0) {
> @@ -1220,7 +1222,7 @@ svc_create_socket(struct svc_serv *serv, int protocol, str
>         }
>  
>         if (protocol == IPPROTO_TCP) {
> -               if ((error = sock->ops->listen(sock, 5)) < 0)
> +               if ((error = sock->ops->listen(sock, 64)) < 0)
>                         goto bummer;
>         }
> 


* Re: Listen backlog set to 64
  2010-11-16 22:08   ` Neil Brown
@ 2010-11-29 20:59     ` J. Bruce Fields
  2010-11-30 17:50       ` Mark Hills
  0 siblings, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2010-11-29 20:59 UTC (permalink / raw)
  To: Neil Brown; +Cc: Mark Hills, linux-nfs

On Wed, Nov 17, 2010 at 09:08:26AM +1100, Neil Brown wrote:
> On Tue, 16 Nov 2010 13:20:26 -0500
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > > I am looking into an issue of hanging clients to a set of NFS servers, on 
> > > a large HPC cluster.
> > > 
> > > My investigation took me to the RPC code, svc_create_socket().
> > > 
> > > 	if (protocol == IPPROTO_TCP) {
> > > 		if ((error = kernel_listen(sock, 64)) < 0)
> > > 			goto bummer;
> > > 	}
> > > 
> > > A fixed backlog of 64 connections at the server seems like it could be too 
> > > low on a cluster like this, particularly when the protocol opens and 
> > > closes the TCP connection.
> > > 
> > > I wondered what the rationale is behind this number, particularly as it 
> > > is a fixed value. Perhaps there is a reason why this has no effect on 
> > > nfsd, or is this a FAQ for people on large systems?
> > > 
> > > The servers show overflow of a listening queue, which I imagine is 
> > > related.
> > > 
> > >   $ netstat -s
> > >   [...]
> > >   TcpExt:
> > >     6475 times the listen queue of a socket overflowed
> > >     6475 SYNs to LISTEN sockets ignored
> > > 
> > > The affected servers are old, kernel 2.6.9. But this limit of 64 is 
> > > consistent across that and the latest kernel source.
> > 
> > Looks like the last time that was touched was 8 years ago, by Neil (below, from
> > historical git archive).
> > 
> > I'd be inclined to just keep doubling it until people don't complain,
> > unless it's very expensive.  (How much memory (or whatever else) does a
> > pending connection tie up?)
> 
> Surely we should "keep multiplying by 13" as that is what I did :-)
> 
> There is a sysctl 'somaxconn' which limits what a process can ask for in the
> listen() system call, but as we bypass this syscall it doesn't directly
> affect nfsd.
> It defaults to SOMAXCONN == 128 but can be raised arbitrarily by the sysadmin.
> 
> There is another sysctl 'max_syn_backlog' which looks like a system-wide
> limit to the connect backlog.
> This defaults to 256.  The comment says it is
> adjusted between 128 and 1024 based on memory size, though that isn't clear
> in the code (to me at least).

This comment?:

/*
 * Maximum number of SYN_RECV sockets in queue per LISTEN socket.
 * One SYN_RECV socket costs about 80bytes on a 32bit machine.
 * It would be better to replace it with a global counter for all sockets
 * but then some measure against one socket starving all other sockets
 * would be needed.
 *
 * It was 128 by default. Experiments with real servers show, that
 * it is absolutely not enough even at 100conn/sec. 256 cures most
 * of problems. This value is adjusted to 128 for very small machines
 * (<=32Mb of memory) and to 1024 on normal or better ones (>=256Mb).
 * Note : Dont forget somaxconn that may limit backlog too.
 */
int sysctl_max_syn_backlog = 256;

Looks like net/ipv4/tcp.c:tcp_init() does the memory-based calculation.

80 bytes sounds small.

> So we could:
>   - hard code a new number
>   - make this another sysctl configurable
>   - auto-adjust it so that it "just works".
> 
> I would prefer the latter if it is possible.   Possibly we could adjust it
> based on the number of nfsd threads, like we do for receive buffer space.
> Maybe something arbitrary like:
>    min(16 + 2 * number of threads, sock_net(sk)->core.sysctl_somaxconn)
> 
> which would get the current 64 at 24 threads, and can easily push up to 128
> and beyond with more threads.
> 
> Or is that too arbitrary?

I kinda like the idea of piggybacking on an existing constant like
sysctl_max_syn_backlog.  Somebody else hopefully keeps it set to something
reasonable, and as a last resort it gives you a knob to twiddle.

But number of threads would work OK too.

At a minimum we should make sure we solve the original problem....
Mark, have you had a chance to check whether increasing that number to
128 or more is enough to solve your problem?
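
For a quick test, the minimal change would just be to bump the constant in
svc_create_socket(), along the lines of (untested, for the experiment only):

	if (protocol == IPPROTO_TCP) {
		/* testing only: raise the fixed backlog from 64 to 128 */
		if ((error = kernel_listen(sock, 128)) < 0)
			goto bummer;
	}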

--b.

* Re: Listen backlog set to 64
  2010-11-29 20:59     ` J. Bruce Fields
@ 2010-11-30 17:50       ` Mark Hills
  2010-11-30 20:00         ` J. Bruce Fields
  0 siblings, 1 reply; 21+ messages in thread
From: Mark Hills @ 2010-11-30 17:50 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Neil Brown, linux-nfs

On Mon, 29 Nov 2010, J. Bruce Fields wrote:

> On Wed, Nov 17, 2010 at 09:08:26AM +1100, Neil Brown wrote:
> > On Tue, 16 Nov 2010 13:20:26 -0500
> > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > 
> > > On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > > > I am looking into an issue of hanging clients to a set of NFS servers, on 
> > > > a large HPC cluster.
> > > > 
> > > > My investigation took me to the RPC code, svc_create_socket().
> > > > 
> > > > 	if (protocol == IPPROTO_TCP) {
> > > > 		if ((error = kernel_listen(sock, 64)) < 0)
> > > > 			goto bummer;
> > > > 	}
> > > > 
> > > > A fixed backlog of 64 connections at the server seems like it could be too 
> > > > low on a cluster like this, particularly when the protocol opens and 
> > > > closes the TCP connection.
[...] 
> > So we could:
> >   - hard code a new number
> >   - make this another sysctl configurable
> >   - auto-adjust it so that it "just works".
> > 
> > I would prefer the latter if it is possible.   Possibly we could adjust it
> > based on the number of nfsd threads, like we do for receive buffer space.
> > Maybe something arbitrary like:
> >    min(16 + 2 * number of threads, sock_net(sk)->core.sysctl_somaxconn)
> > 
> > which would get the current 64 at 24 threads, and can easily push up to 128
> > and beyond with more threads.
> > 
> > Or is that too arbitrary?
> 
> I kinda like the idea of piggybacking on an existing constant like
> sysctl_max_syn_backlog.  Somebody else hopefully keeps it set to something
> reasonable, and as a last resort it gives you a knob to twiddle.
> 
> But number of threads would work OK too.
> 
> At a minimum we should make sure we solve the original problem....
> Mark, have you had a chance to check whether increasing that number to
> 128 or more is enough to solve your problem?

I think we can hold off changing the queue size, for now at least. We 
reduced the reported queue overflows by increasing the number of mountd 
threads, allowing it to service the queue more quickly. However this did 
not fix the common problem, and I was hoping to have more information in 
this follow-up email.

Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
the client log we see messages like:

  Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up

Using strace and isolating one of these, I can see a non-blocking connect 
has already managed to make a connection and even send/receive some data. 

But soon a timeout of 9999 milliseconds in poll() causes a problem in 
mount.nfs when waiting for a response of some sort. The socket in question 
is a connection to mountd:

  26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
  26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
  26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>

When it returns:

  26512 <... poll resumed> )              = 0 (Timeout)
  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
  26512 close(3)                          = 0
  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100

There's no re-try from here, just a failed mount.

What is the source of this 9999 millisecond timeout used by poll() in 
mount.nfs? It was not clear in an initial search of nfs-utils and glibc, 
but I need more time to investigate.

If the server is being too slow to respond, what could the cause of this 
be? Multiple threads are already in use, but it seems like they are not 
all in use because a thread is able to accept() the connection. I haven't 
been able to pin this on the forward/reverse DNS lookup used by 
authentication and logging.

Thanks

-- 
Mark

* Re: Listen backlog set to 64
  2010-11-30 17:50       ` Mark Hills
@ 2010-11-30 20:00         ` J. Bruce Fields
  2010-11-30 22:09           ` Mark Hills
  2010-12-01 18:18           ` Mark Hills
  0 siblings, 2 replies; 21+ messages in thread
From: J. Bruce Fields @ 2010-11-30 20:00 UTC (permalink / raw)
  To: Mark Hills; +Cc: Neil Brown, linux-nfs

On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> On Mon, 29 Nov 2010, J. Bruce Fields wrote:
> 
> > On Wed, Nov 17, 2010 at 09:08:26AM +1100, Neil Brown wrote:
> > > On Tue, 16 Nov 2010 13:20:26 -0500
> > > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > > 
> > > > On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > > > > I am looking into an issue of hanging clients to a set of NFS servers, on 
> > > > > a large HPC cluster.
> > > > > 
> > > > > My investigation took me to the RPC code, svc_create_socket().
> > > > > 
> > > > > 	if (protocol == IPPROTO_TCP) {
> > > > > 		if ((error = kernel_listen(sock, 64)) < 0)
> > > > > 			goto bummer;
> > > > > 	}
> > > > > 
> > > > > A fixed backlog of 64 connections at the server seems like it could be too 
> > > > > low on a cluster like this, particularly when the protocol opens and 
> > > > > closes the TCP connection.
> [...] 
> > > So we could:
> > >   - hard code a new number
> > >   - make this another sysctl configurable
> > >   - auto-adjust it so that it "just works".
> > > 
> > > I would prefer the latter if it is possible.   Possibly we could adjust it
> > > based on the number of nfsd threads, like we do for receive buffer space.
> > > Maybe something arbitrary like:
> > >    min(16 + 2 * number of threads, sock_net(sk)->core.sysctl_somaxconn)
> > > 
> > > which would get the current 64 at 24 threads, and can easily push up to 128
> > > and beyond with more threads.
> > > 
> > > Or is that too arbitrary?
> > 
> > I kinda like the idea of piggybacking on an existing constant like
> > sysctl_max_syn_backlog.  Somebody else hopefully keeps it set to something
> > reasonable, and as a last resort it gives you a knob to twiddle.
> > 
> > But number of threads would work OK too.
> > 
> > At a minimum we should make sure we solve the original problem....
> > Mark, have you had a chance to check whether increasing that number to
> > 128 or more is enough to solve your problem?
> 
> I think we can hold off changing the queue size, for now at least. We 
> reduced the reported queue overflows by increasing the number of mountd 
> threads, allowing it to service the queue more quickly.

Apologies, I should have thought to suggest that at the start.

> However this did 
> not fix the common problem, and I was hoping to have more information in 
> this follow-up email.
> 
> Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
> the client log we see messages like:
> 
>   Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> 
> Using strace and isolating one of these, I can see a non-blocking connect 
> has already managed to make a connection and even send/receive some data. 
> 
> But soon a timeout of 9999 milliseconds in poll() causes a problem in 
> mount.nfs when waiting for a response of some sort. The socket in question 
> is a connection to mountd:
> 
>   26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
>   26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
>   26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> 
> When it returns:
> 
>   26512 <... poll resumed> )              = 0 (Timeout)
>   26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>   26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
>   26512 close(3)                          = 0
>   26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>   26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> 
> There's no re-try from here, just a failed mount.

That does sound wrong.  I'm not at all familiar with automount,
unfortunately; how is it invoking mount.nfs?

> What is the source of this 9999 millisecond timeout used by poll() in 
> mount.nfs? It was not clear in an initial search of nfs-utils and glibc, 
> but I need more time to investigate.
> 
> If the server is being too slow to respond, what could the cause of this 
> be? Multiple threads are already in use, but it seems like they are not 
> all in use because a thread is able to accept() the connection. I haven't 
> been able to pin this on the forward/reverse DNS lookup used by 
> authentication and logging.

Can you tell where the mountd threads are typically waiting?

--b.

* Re: Listen backlog set to 64
  2010-11-30 20:00         ` J. Bruce Fields
@ 2010-11-30 22:09           ` Mark Hills
  2010-12-01 18:18           ` Mark Hills
  1 sibling, 0 replies; 21+ messages in thread
From: Mark Hills @ 2010-11-30 22:09 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Neil Brown, linux-nfs

On Tue, 30 Nov 2010, J. Bruce Fields wrote:

> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> > On Mon, 29 Nov 2010, J. Bruce Fields wrote:
> > 
> > > On Wed, Nov 17, 2010 at 09:08:26AM +1100, Neil Brown wrote:
> > > > On Tue, 16 Nov 2010 13:20:26 -0500
> > > > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > > > 
> > > > > On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > > > > > I am looking into an issue of hanging clients to a set of NFS servers, on 
> > > > > > a large HPC cluster.
> > > > > > 
> > > > > > My investigation took me to the RPC code, svc_create_socket().
> > > > > > 
> > > > > > 	if (protocol == IPPROTO_TCP) {
> > > > > > 		if ((error = kernel_listen(sock, 64)) < 0)
> > > > > > 			goto bummer;
> > > > > > 	}
[...]
> > I think we can hold off changing the queue size, for now at least. We 
> > reduced the reported queue overflows by increasing the number of mountd 
> > threads, allowing it to service the queue more quickly.
> 
> Apologies, I should have thought to suggest that at the start.

No apology necessary. I actually thought we had ruled out the possibility 
that mountd is at fault, but I cannot remember the exact reasons. I think 
it is because we see _some_ failures even on filesystems that are already 
mounted. So maybe we will find ourselves back here again later.

> > However this did not fix the common problem, and I was hoping to have 
> > more information in this follow-up email.
> > 
> > Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
> > the client log we see messages like:
> > 
> >   Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> > 
> > Using strace and isolating one of these, I can see a non-blocking connect 
> > has already managed to make a connection and even send/receive some data. 
> > 
> > But soon a timeout of 9999 milliseconds in poll() causes a problem in 
> > mount.nfs when waiting for a response of some sort. The socket in question 
> > is a connection to mountd:
> > 
> >   26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> >   26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> >   26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> > 
> > When it returns:
> > 
> >   26512 <... poll resumed> )              = 0 (Timeout)
> >   26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >   26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> >   26512 close(3)                          = 0
> >   26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >   26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> > 
> > There's no re-try from here, just a failed mount.
> 
> That does sound wrong.  I'm not at all familiar with automount,
> unfortunately; how is it invoking mount.nfs?

It invokes 'mount' as a command using exec(). A more detailed log:

  Nov 22 12:07:36 nyrf014 automount[7883]: mount_mount: mount(nfs): calling mount -t nfs -s -o proto=tcp ss1a:/mnt/raid1/banana /proj/banana
  Nov 22 12:07:48 nyrf014 automount[7883]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up

The process trying to access the filesystem will block whilst automount 
does this. The failure of the mount command leaves an empty directory 
until the negative timeout expires, after which automount invokes 'mount' 
again. 
It's this period that causes problems; applications fail when they see an 
empty directory.

It looks like we are more likely to _see_ the problem when a large number 
of hosts are mounting at the same time, but that does not seem to be the 
direct cause.
 
> > What is the source of this 9999 millisecond timeout used by poll() in 
> > mount.nfs? It was not clear in an initial search of nfs-utils and glibc, 
> > but I need more time to investigate.
> > 
> > If the server is being too slow to respond, what could the cause of this 
> > be? Multiple threads are already in use, but it seems like they are not 
> > all in use because a thread is able to accept() the connection. I haven't 
> > been able to pin this on the forward/reverse DNS lookup used by 
> > authentication and logging.
> 
> Can you tell where the mountd threads are typically waiting?

I intend to investigate this, possibly by attaching strace to rpc.mountd 
and trying to correlate the log with one of the dropouts. Perhaps somebody 
has a better suggestion.

-- 
Mark

* Re: Listen backlog set to 64
  2010-11-30 20:00         ` J. Bruce Fields
  2010-11-30 22:09           ` Mark Hills
@ 2010-12-01 18:18           ` Mark Hills
  2010-12-01 18:28             ` Chuck Lever
  2010-12-01 18:36             ` Listen backlog set to 64 J. Bruce Fields
  1 sibling, 2 replies; 21+ messages in thread
From: Mark Hills @ 2010-12-01 18:18 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Neil Brown, linux-nfs

On Tue, 30 Nov 2010, J. Bruce Fields wrote:

> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
[...]
> > Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
> > the client log we see messages like:
> > 
> >   Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> > 
> > Using strace and isolating one of these, I can see a non-blocking connect 
> > has already managed to make a connection and even send/receive some data. 
> > 
> > But soon a timeout of 9999 milliseconds in poll() causes a problem in 
> > mount.nfs when waiting for a response of some sort. The socket in question 
> > is a connection to mountd:
> > 
> >   26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> >   26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> >   26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> > 
> > When it returns:
> > 
> >   26512 <... poll resumed> )              = 0 (Timeout)
> >   26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >   26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> >   26512 close(3)                          = 0
> >   26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >   26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> > 
> > There's no re-try from here, just a failed mount.
> 
> That does sound wrong.  I'm not at all familiar with automount,
> unfortunately; how is it invoking mount.nfs?
> 
> > What is the source of this 9999 millisecond timeout used by poll() in 
> > mount.nfs? It was not clear in an initial search of nfs-utils and glibc, 
> > but I need more time to investigate.
> > 
> > If the server is being too slow to respond, what could the cause of this 
> > be? Multiple threads are already in use, but it seems like they are not 
> > all in use because a thread is able to accept() the connection. I haven't 
> > been able to pin this on the forward/reverse DNS lookup used by 
> > authentication and logging.
> 
> Can you tell where the mountd threads are typically waiting?

Here's a trace from mountd. Note the long pause after fdatasync():

31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
31156 12:09:40 fdatasync(14)            = 0 <15.413401>
31156 12:09:55 close(14)                = 0 <0.000008>
31156 12:09:55 munmap(0x2af00c929000, 4096) = 0 <0.000083>
31156 12:09:55 close(13)                = 0 <0.000008>

This pause exceeds the mysterious 9999 millisecond timeout allowed at the 
client.

fdatasync() is happening at every write of rmtab -- a lot. Occasionally it 
takes a long time. The syncing of the rmtab seems a reasonable explanation 
why more threads does not help -- all mountd threads lock up.

The strace I am analysing here contains 670 calls to fdatasync() in a 16 
minute period. All of them take < 1 second, except for this burst:

31156 12:09:11 fdatasync(14)            = 0 <0.000010>
31156 12:09:11 fdatasync(15)            = 0 <0.136285>
31156 12:09:12 fdatasync(14)            = 0 <0.000010>
31156 12:09:12 fdatasync(15)            = 0 <14.911323>
31156 12:09:27 fdatasync(14)            = 0 <1.298896>
31156 12:09:29 fdatasync(14)            = 0 <1.159262>
31156 12:09:33 fdatasync(14)            = 0 <1.447169>
31156 12:09:35 fdatasync(14)            = 0 <1.345562>
31156 12:09:40 fdatasync(14)            = 0 <0.000010>
31156 12:09:40 fdatasync(14)            = 0 <15.413401>
31156 12:09:55 fdatasync(17)            = 0 <0.000010>
31156 12:09:55 fdatasync(17)            = 0 <1.068503>
31156 12:09:57 fdatasync(18)            = 0 <0.000010>
31156 12:09:57 fdatasync(18)            = 0 <0.089682>

/var/lib/nfs/rmtab[.tmp] is on an ext3 filesystem, the system disk.

The server is only acting as a 'regular' and dedicated file server, but 
presumably some other process is causing /var to become erratic. Unless 
continual writing and syncing to a small (64KiB) file could cause this to 
happen periodically?

In the meantime we have symlink'd /var/lib/nfs/rmtab[.tmp] to a tmpfs 
mount to see if that helps, and will leave it running tonight.

-- 
Mark

* Re: Listen backlog set to 64
  2010-12-01 18:18           ` Mark Hills
@ 2010-12-01 18:28             ` Chuck Lever
  2010-12-01 18:46               ` J. Bruce Fields
  2010-12-08 14:45               ` mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64) Mark Hills
  2010-12-01 18:36             ` Listen backlog set to 64 J. Bruce Fields
  1 sibling, 2 replies; 21+ messages in thread
From: Chuck Lever @ 2010-12-01 18:28 UTC (permalink / raw)
  To: Mark Hills; +Cc: J. Bruce Fields, Neil Brown, linux-nfs


On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:

> On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> 
>> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> [...]
>>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
>>> the client log we see messages like:
>>> 
>>>  Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
>>> 
>>> Using strace and isolating one of these, I can see a non-blocking connect 
>>> has already managed to make a connection and even send/receive some data. 
>>> 
>>> But soon a timeout of 9999 milliseconds in poll() causes a problem in 
>>> mount.nfs when waiting for a response of some sort. The socket in question 
>>> is a connection to mountd:
>>> 
>>>  26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
>>>  26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
>>>  26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
>>> 
>>> When it returns:
>>> 
>>>  26512 <... poll resumed> )              = 0 (Timeout)
>>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>  26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
>>>  26512 close(3)                          = 0
>>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>  26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
>>> 
>>> There's no re-try from here, just a failed mount.
>> 
>> That does sound wrong.  I'm not at all familiar with automount,
>> unfortunately; how is it invoking mount.nfs?
>> 
>>> What is the source of this 9999 millisecond timeout used by poll() in 
>>> mount.nfs? It was not clear in an initial search of nfs-utils and glibc, 
>>> but I need more time to investigate.
>>> 
>>> If the server is being too slow to respond, what could the cause of this 
>>> be? Multiple threads are already in use, but it seems like they are not 
>>> all in use because a thread is able to accept() the connection. I haven't 
>>> been able to pin this on the forward/reverse DNS lookup used by 
>>> authentication and logging.
>> 
>> Can you tell where the mountd threads are typically waiting?
> 
> Here's a trace from mountd. Note the long pause after fdatasync():
> 
> 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> 31156 12:09:55 close(14)                = 0 <0.000008>
> 31156 12:09:55 munmap(0x2af00c929000, 4096) = 0 <0.000083>
> 31156 12:09:55 close(13)                = 0 <0.000008>
> 
> This pause exceeds the mysterious 9999 millisecond timeout allowed at the 
> client.
> 
> fdatasync() is happening at every write of rmtab -- a lot. Occasionally it 
> takes a long time. The syncing of the rmtab seems a reasonable explanation 
> why more threads does not help -- all mountd threads lock up.
> 
> The strace I am analysing here contains 670 calls to fdatasync() in a 16 
> minute period. All of them take < 1 second, except for this burst:
> 
> 31156 12:09:11 fdatasync(14)            = 0 <0.000010>
> 31156 12:09:11 fdatasync(15)            = 0 <0.136285>
> 31156 12:09:12 fdatasync(14)            = 0 <0.000010>
> 31156 12:09:12 fdatasync(15)            = 0 <14.911323>
> 31156 12:09:27 fdatasync(14)            = 0 <1.298896>
> 31156 12:09:29 fdatasync(14)            = 0 <1.159262>
> 31156 12:09:33 fdatasync(14)            = 0 <1.447169>
> 31156 12:09:35 fdatasync(14)            = 0 <1.345562>
> 31156 12:09:40 fdatasync(14)            = 0 <0.000010>
> 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> 31156 12:09:55 fdatasync(17)            = 0 <0.000010>
> 31156 12:09:55 fdatasync(17)            = 0 <1.068503>
> 31156 12:09:57 fdatasync(18)            = 0 <0.000010>
> 31156 12:09:57 fdatasync(18)            = 0 <0.089682>
> 
> /var/lib/nfs/rmtab[.tmp] is on an ext3 filesystem, the system disk.
> 
> The server is only acting as a 'regular' and dedicated file server, but 
> presumably some other process is causing /var to become erratic. Unless 
> continual writing and syncing to a small (64KiB) file could cause this to 
> happen periodically?

fdatasync() will write all outstanding dirty data on that partition to persistent storage, not just the dirty data for that file.  If /var is on a partition that has a lot of other activity, then this is the expected outcome, unfortunately.  I would expect such delays to be due to a large amount of outstanding writes, not to periodic synchronous writes.

This problem may be addressed in recent versions of nfs-utils.  I see a commit from Ben Myers on February 12 of this year that seems on point.  nfs-utils-1.2.2 has it, I think?

> In the meantime we have symlink'd /var/lib/nfs/rmtab[.tmp] to a tmpfs 
> mount to see if that helps, and will leave it running tonight.
> 
> -- 
> Mark

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





* Re: Listen backlog set to 64
  2010-12-01 18:18           ` Mark Hills
  2010-12-01 18:28             ` Chuck Lever
@ 2010-12-01 18:36             ` J. Bruce Fields
  1 sibling, 0 replies; 21+ messages in thread
From: J. Bruce Fields @ 2010-12-01 18:36 UTC (permalink / raw)
  To: Mark Hills; +Cc: Neil Brown, linux-nfs

On Wed, Dec 01, 2010 at 06:18:17PM +0000, Mark Hills wrote:
> Here's a trace from mountd. Note the long pause after fdatasync():
> 
> 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> 31156 12:09:55 close(14)                = 0 <0.000008>
> 31156 12:09:55 munmap(0x2af00c929000, 4096) = 0 <0.000083>
> 31156 12:09:55 close(13)                = 0 <0.000008>
> 
> This pause exceeds the mysterious 9999 millisecond timeout allowed at the 
> client.
> 
> fdatasync() is happening at every write of rmtab -- a lot. Occasionally it 
> takes a long time. The syncing of the rmtab seems a reasonable explanation 
> why more threads does not help -- all mountd threads lock up.
> 
> The strace I am analysing here contains 670 calls to fdatasync() in a 16 
> minute period. All of them take < 1 second, except for this burst:
> 
> 31156 12:09:11 fdatasync(14)            = 0 <0.000010>
> 31156 12:09:11 fdatasync(15)            = 0 <0.136285>
> 31156 12:09:12 fdatasync(14)            = 0 <0.000010>
> 31156 12:09:12 fdatasync(15)            = 0 <14.911323>
> 31156 12:09:27 fdatasync(14)            = 0 <1.298896>
> 31156 12:09:29 fdatasync(14)            = 0 <1.159262>
> 31156 12:09:33 fdatasync(14)            = 0 <1.447169>
> 31156 12:09:35 fdatasync(14)            = 0 <1.345562>
> 31156 12:09:40 fdatasync(14)            = 0 <0.000010>
> 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> 31156 12:09:55 fdatasync(17)            = 0 <0.000010>
> 31156 12:09:55 fdatasync(17)            = 0 <1.068503>
> 31156 12:09:57 fdatasync(18)            = 0 <0.000010>
> 31156 12:09:57 fdatasync(18)            = 0 <0.089682>
> 
> /var/lib/nfs/rmtab[.tmp] is on an ext3 filesystem, the system disk.
> 
> The server is only acting as a 'regular' and dedicated file server, but 
> presumably some other process is causing /var to become erratic. Unless 
> continual writing and syncing to a small (64KiB) file could cause this to 
> happen periodically?
> 
> In the meantime we have symlink'd /var/lib/nfs/rmtab[.tmp] to a tmpfs 
> mount to see if that helps, and will leave it running tonight.

I didn't really follow the controversy earlier this year about fsync
performance on ext3--is that fixed to some degree on newer kernels?

But in any case, the rmtab file is, as the rpc.mountd man page says,
"mostly ornamental".  We shouldn't be taking so much care over it.

Hm, the code in question:

	if (!have_new_cache) {
		/*
		 * If we are using the old caching interface: exportfs
		 * uses the rmtab to determine what should be exported,
		 * so it is important that it be up-to-date.
		 *
		 * If we are using the new caching interface: the rmtab
		 * is ignored by exportfs and the fdatasync only serves
		 * to slow us down.
		 */
		fflush(fp);
		fdatasync(fileno(fp));
	}

It's a little odd that you'd be hitting this case.  Do you not have "nfsd"
mounted?  (What distro is this again?)

--b.

* Re: Listen backlog set to 64
  2010-12-01 18:28             ` Chuck Lever
@ 2010-12-01 18:46               ` J. Bruce Fields
  2010-12-08 14:45               ` mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64) Mark Hills
  1 sibling, 0 replies; 21+ messages in thread
From: J. Bruce Fields @ 2010-12-01 18:46 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Mark Hills, Neil Brown, linux-nfs

On Wed, Dec 01, 2010 at 01:28:40PM -0500, Chuck Lever wrote:
> 
> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
> 
> > On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> > 
> >> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> > [...]
> >>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
> >>> the client log we see messages like:
> >>> 
> >>>  Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> >>> 
> >>> Using strace and isolating one of these, I can see a non-blocking connect 
> >>> has already managed to make a connection and even send/receive some data. 
> >>> 
> >>> But soon a timeout of 9999 milliseconds in poll() causes a problem in 
> >>> mount.nfs when waiting for a response of some sort. The socket in question 
> >>> is a connection to mountd:
> >>> 
> >>>  26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> >>>  26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> >>>  26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> >>> 
> >>> When it returns:
> >>> 
> >>>  26512 <... poll resumed> )              = 0 (Timeout)
> >>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>>  26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> >>>  26512 close(3)                          = 0
> >>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>>  26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> >>> 
> >>> There's no re-try from here, just a failed mount.
> >> 
> >> That does sound wrong.  I'm not at all familiar with automount,
> >> unfortunately; how is it invoking mount.nfs?
> >> 
> >>> What is the source of this 9999 millisecond timeout used by poll() in 
> >>> mount.nfs? It was not clear in an initial search of nfs-utils and glibc, 
> >>> but I need more time to investigate.
> >>> 
> >>> If the server is being too slow to respond, what could the cause of this 
> >>> be? Multiple threads are already in use, but it seems like they are not 
> >>> all in use because a thread is able to accept() the connection. I haven't 
> >>> been able to pin this on the forward/reverse DNS lookup used by 
> >>> authentication and logging.
> >> 
> >> Can you tell where the mountd threads are typically waiting?
> > 
> > Here's a trace from mountd. Note the long pause after fdatasync():
> > 
> > 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> > 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> > 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> > 31156 12:09:55 close(14)                = 0 <0.000008>
> > 31156 12:09:55 munmap(0x2af00c929000, 4096) = 0 <0.000083>
> > 31156 12:09:55 close(13)                = 0 <0.000008>
> > 
> > This pause exceeds the mysterious 9999 millisecond timeout allowed at the 
> > client.
> > 
> > fdatasync() is happening at every write of rmtab -- a lot. Occasionally it 
> > takes a long time. The syncing of the rmtab seems a reasonable explanation 
> > why more threads does not help -- all mountd threads lock up.
> > 
> > The strace I am analysing here contains 670 calls to fdatasync() in a 16 
> > minute period. All of them take < 1 second, except for this burst:
> > 
> > 31156 12:09:11 fdatasync(14)            = 0 <0.000010>
> > 31156 12:09:11 fdatasync(15)            = 0 <0.136285>
> > 31156 12:09:12 fdatasync(14)            = 0 <0.000010>
> > 31156 12:09:12 fdatasync(15)            = 0 <14.911323>
> > 31156 12:09:27 fdatasync(14)            = 0 <1.298896>
> > 31156 12:09:29 fdatasync(14)            = 0 <1.159262>
> > 31156 12:09:33 fdatasync(14)            = 0 <1.447169>
> > 31156 12:09:35 fdatasync(14)            = 0 <1.345562>
> > 31156 12:09:40 fdatasync(14)            = 0 <0.000010>
> > 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> > 31156 12:09:55 fdatasync(17)            = 0 <0.000010>
> > 31156 12:09:55 fdatasync(17)            = 0 <1.068503>
> > 31156 12:09:57 fdatasync(18)            = 0 <0.000010>
> > 31156 12:09:57 fdatasync(18)            = 0 <0.089682>
> > 
> > /var/lib/nfs/rmtab[.tmp] is on an ext3 filesystem, the system disk.
> > 
> > The server is only acting as a 'regular' and dedicated file server, but 
> > presumably some other process is causing /var to become erratic. Unless 
> > continual writing and syncing to a small (64KiB) file could cause this to 
> > happen periodically?
> 
> fdatasync() will write all outstanding dirty data on that partition to persistent storage, not just the dirty data for that file.  If /var is on a partition that has a lot of other activity, then this is the expected outcome, unfortunately.  I would expect such delays to be due to a large amount of outstanding writes, not to periodic synchronous writes.
> 
> This problem may be addressed in recent versions of nfs-utils.  I see a commit from Ben Myers on February 12 of this year that seems on point.  nfs-utils-1.2.2 has it, I think?

Doh, yes, looks like the code I quoted went into nfs-utils 1.2.2 with
Ben's commit.

--b.

* mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
  2010-12-01 18:28             ` Chuck Lever
  2010-12-01 18:46               ` J. Bruce Fields
@ 2010-12-08 14:45               ` Mark Hills
  2010-12-08 15:38                 ` J. Bruce Fields
  2010-12-08 16:45                 ` Chuck Lever
  1 sibling, 2 replies; 21+ messages in thread
From: Mark Hills @ 2010-12-08 14:45 UTC (permalink / raw)
  To: Chuck Lever; +Cc: J. Bruce Fields, Neil Brown, linux-nfs

On Wed, 1 Dec 2010, Chuck Lever wrote:

> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
> 
> > On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> > 
> >> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> > [...]
> >>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
> >>> the client log we see messages like:
> >>> 
> >>>  Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> >>> 
> >>> Using strace and isolating one of these, I can see a non-blocking connect 
> >>> has already managed to make a connection and even send/receive some data. 
> >>> 
> >>> But soon a timeout of 9999 milliseconds in poll() causes a problem in 
> >>> mount.nfs when waiting for a response of some sort. The socket in question 
> >>> is a connection to mountd:
> >>> 
> >>>  26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> >>>  26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> >>>  26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> >>> 
> >>> When it returns:
> >>> 
> >>>  26512 <... poll resumed> )              = 0 (Timeout)
> >>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>>  26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> >>>  26512 close(3)                          = 0
> >>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>>  26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> >>> 
> >>> There's no re-try from here, just a failed mount.
[...]
> >> Can you tell where the mountd threads are typically waiting?
> > 
> > Here's a trace from mountd. Note the long pause after fdatasync():
> > 
> > 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> > 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> > 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
[...]
> fdatasync() will write all outstanding dirty data on that partition to 
> persistent storage, not just the dirty data for that file.  If /var is 
> on a partition that has a lot of other activity, then this is the 
> expected outcome, unfortunately.  I would expect such delays to be due 
> to a large amount of outstanding writes, not to periodic synchronous 
> writes.

Moving rmtab to a tmpfs shifted the problem to another bottleneck 
writing to /var, this time via syslog.

We identified a crontab entry that uses RPM to apply updates to the 
server OS at randomised times, blocking /var for 10 seconds or so.

This returns the issue to giving up after the 9999 millisecond timeout 
above, which I think is making the client overly sensitive.

The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12. 
We are using mount.nfs for a v3 mount over TCP.

Locating this timeout wasn't easy. There's no '9999' in any of the 
nfs-utils or glibc's rpc code.

It turns out that the timeout originates in 
nfs-utils/support/nfs/rpc_socket.c as

  #define NFSRPC_TIMEOUT_TCP	(10)

This is used to populate a struct timeval *timeout which is used and 
updated by nfs_connect_nb(). Then the remaining timeout of 9999 
milliseconds is used in an RPC clnt_call operation.
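
As a simplified sketch of that flow (the argument lists here are 
indicative only, not the exact nfs-utils prototypes):

	struct timeval timeout = {
		.tv_sec  = NFSRPC_TIMEOUT_TCP,	/* 10 second total budget */
		.tv_usec = 0,
	};

	/* the non-blocking connect consumes part of the budget and
	 * updates 'timeout' with whatever remains (~9999 ms here) */
	nfs_connect_nb(sock, sap, salen, &timeout);

	/* the remainder then bounds the RPC itself -- this is the
	 * poll([{fd=3, ...}], 1, 9999) seen in the strace -- and on
	 * expiry mount.nfs gives up without retrying */
	status = clnt_call(client, procedure,
			   xdr_argtype, (caddr_t)&arg,
			   xdr_restype, (caddr_t)&res, timeout);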

10 seconds over TCP is simply too short to be reliable here; it's not a 
long period of time for a fatal timeout. I will attempt to test on our 
cluster with an extended timeout.

I presume that the bug is that the timeout is too short, and not that the 
timeout should cause a retry in a layer above -- because re-sending the 
request over reliable TCP would not make sense.

For UDP, the retries are handled inside the RPC call, clntudp_call(). In 
that case, why does the UDP timeout differ from the TCP one? And is it 
also too small?

Thanks

-- 
Mark

* Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
  2010-12-08 14:45               ` mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64) Mark Hills
@ 2010-12-08 15:38                 ` J. Bruce Fields
  2010-12-08 16:45                 ` Chuck Lever
  1 sibling, 0 replies; 21+ messages in thread
From: J. Bruce Fields @ 2010-12-08 15:38 UTC (permalink / raw)
  To: Mark Hills; +Cc: Chuck Lever, Neil Brown, linux-nfs

On Wed, Dec 08, 2010 at 02:45:40PM +0000, Mark Hills wrote:
> On Wed, 1 Dec 2010, Chuck Lever wrote:
> 
> > On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
> > 
> > > On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> > > 
> > >> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> > > [...]
> > >>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
> > >>> the client log we see messages like:
> > >>> 
> > >>>  Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> > >>> 
> > >>> Using strace and isolating one of these, I can see a non-blocking connect 
> > >>> has already managed to make a connection and even send/receive some data. 
> > >>> 
> > >>> But soon a timeout of 9999 milliseconds in poll() causes a problem in 
> > >>> mount.nfs when waiting for a response of some sort. The socket in question 
> > >>> is a connection to mountd:
> > >>> 
> > >>>  26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> > >>>  26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> > >>>  26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> > >>> 
> > >>> When it returns:
> > >>> 
> > >>>  26512 <... poll resumed> )              = 0 (Timeout)
> > >>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > >>>  26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> > >>>  26512 close(3)                          = 0
> > >>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > >>>  26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> > >>> 
> > >>> There's no re-try from here, just a failed mount.
> [...]
> > >> Can you tell where the mountd threads are typically waiting?
> > > 
> > > Here's a trace from mountd. Note the long pause after fdatasync():
> > > 
> > > 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> > > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > > 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> > > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > > 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> > > 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> > > 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> [...]
> > fdatasync() will write all outstanding dirty data on that partition to 
> > persistent storage, not just the dirty data for that file.  If /var is 
> > on a partition that has a lot of other activity, then this is the 
> > expected outcome, unfortunately.  I would expect such delays to be due 
> > to a large amount of outstanding writes, not to periodic synchronous 
> > writes.
> 
> Moving rmtab to a tmpfs, this moved the problem to another bottleneck 
> writing to /var, this time via syslog.
> 
> We identified a crontab entry which uses RPM to apply updates to the 
> server OS at randomised times which was blocking /var for 10 seconds or 
> so.

So is the problem that mountd ends up waiting on a synchronous write of
one of those "authenticated mount request" log messages?

Maybe we should downgrade those messages?  On a server with a lot of
clients a log on every mount call seems obnoxious.  Doing it
synchronously even more so (though I guess that's a problem for the
syslog configuration.)
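
(As an aside: with traditional sysklogd the fsync-per-message behaviour
can be turned off for a particular log file by prefixing its path with
'-' in syslog.conf.  Assuming mountd's messages arrive at the daemon
facility, which is an assumption to check against the actual
configuration, something like:

  daemon.notice			-/var/log/daemon.log

would at least stop syslogd from syncing /var after every mount
message.)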

> This returns the issue to giving up after the 9999 millisecond timeout 
> above, which I think is making the client overly sensitive.
> 
> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12. 
> We are using mount.nfs for a v3 mount over TCP.
> 
> Locating this timeout wasn't easy. There's no '9999' in any of the 
> nfs-utils or glibc's rpc code.
> 
> It turns out that the timeout originates in 
> nfs-utils/support/nfs/rpc_socket.c as
> 
>   #define NFSRPC_TIMEOUT_TCP	(10)
> 
> This is used to populate a struct timeval *timeout which is used and 
> updated by nfs_connect_nb(). Then the remaining timeout of 9999 
> milliseconds is used in an RPC clnt_call operation.
> 
> 10 seconds over TCP is simply too short to be reliable here; it's not a 
> long period of time for a fatal timeout. I will attempt to test on our 
> cluster with an extended timeout.

Should this be tied to some mount option?

Chuck probably knows how this is supposed to work....

--b.

> 
> I presume that the bug is that the timeout is too short, and not that the 
> timeout should cause a retry in a layer above -- because re-sending the 
> request over reliable TCP would not make sense.
> 
> For UDP the retries are handled inside the RPC call, clntudp_call(). In 
> which case, what is the reason the UDP timeout differs from TCP? And is it 
> also too small?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
  2010-12-08 14:45               ` mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64) Mark Hills
  2010-12-08 15:38                 ` J. Bruce Fields
@ 2010-12-08 16:45                 ` Chuck Lever
  2010-12-08 17:31                   ` Mark Hills
  1 sibling, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2010-12-08 16:45 UTC (permalink / raw)
  To: Mark Hills; +Cc: J. Bruce Fields, Neil Brown, linux-nfs


On Dec 8, 2010, at 9:45 AM, Mark Hills wrote:

> On Wed, 1 Dec 2010, Chuck Lever wrote:
> 
>> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
>> 
>>> On Tue, 30 Nov 2010, J. Bruce Fields wrote:
>>> 
>>>> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
>>> [...]
>>>>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
>>>>> the client log we see messages like:
>>>>> 
>>>>> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
>>>>> 
>>>>> Using strace and isolating one of these, I can see a non-blocking connect 
>>>>> has already managed to make a connection and even send/receive some data. 
>>>>> 
>>>>> But soon a timeout of 9999 milliseconds in poll() causes a problem in 
>>>>> mount.nfs when waiting for a response of some sort. The socket in question 
>>>>> is a connection to mountd:
>>>>> 
>>>>> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
>>>>> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
>>>>> 
>>>>> When it returns:
>>>>> 
>>>>> 26512 <... poll resumed> )              = 0 (Timeout)
>>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
>>>>> 26512 close(3)                          = 0
>>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
>>>>> 
>>>>> There's no re-try from here, just a failed mount.
> [...]
>>>> Can you tell where the mountd threads are typically waiting?
>>> 
>>> Here's a trace from mountd. Note the long pause after fdatasync():
>>> 
>>> 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
>>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
>>> 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
>>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
>>> 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
>>> 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
>>> 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> [...]
>> fdatasync() will write all outstanding dirty data on that partition to 
>> persistent storage, not just the dirty data for that file.  If /var is 
>> on a partition that has a lot of other activity, then this is the 
>> expected outcome, unfortunately.  I would expect such delays to be due 
>> to a large amount of outstanding writes, not to periodic synchronous 
>> writes.
> 
> Moving rmtab to a tmpfs, this moved the problem to another bottleneck 
> writing to /var, this time via syslog.
> 
> We identified a crontab entry which uses RPM to apply updates to the 
> server OS at randomised times which was blocking /var for 10 seconds or 
> so.

I think this is closer to the root cause than the mount.nfs connect timeout.  I might consider RPM or syslog to be the final culprit, but I wonder why a multi-threaded mountd would stop servicing TCP connects because it couldn't write to /var or to syslog?

> This returns the issue to giving up after the 9999 millisecond timeout 
> above, which I think is making the client overly sensitive.
> 
> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12. 
> We are using mount.nfs for a v3 mount over TCP.
> 
> Locating this timeout wasn't easy. There's no '9999' in any of the 
> nfs-utils or glibc's rpc code.
> 
> It turns out that the timeout originates in 
> nfs-utils/support/nfs/rpc_socket.c as
> 
>  #define NFSRPC_TIMEOUT_TCP	(10)
> 
> This is used to populate a struct timeval *timeout which is used and 
> updated by nfs_connect_nb(). Then the remaining timeout of 9999 
> milliseconds is used in an RPC clnt_call operation.
> 
> 10 seconds over TCP is simply too short to be reliable here; it's not a 
> long period of time for a fatal timeout. I will attempt to test on our 
> cluster with an extended timeout.

This is a TCP connect timeout, not a RPC retransmit timeout.  Waiting 10 seconds for a TCP connect is really not that obnoxious.  The reason this is short is because we want mount.nfs to realize quickly that it's making no progress on stuck servers.
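
For reference, a bounded ("non-blocking") TCP connect usually has roughly this shape; the following is a generic sketch rather than the nfs-utils implementation, and connect_bounded() is a made-up name:

  #include <errno.h>
  #include <fcntl.h>
  #include <poll.h>
  #include <sys/socket.h>

  /* Start a non-blocking connect, then wait at most timeout_ms for it. */
  static int connect_bounded(int fd, const struct sockaddr *sa,
                             socklen_t len, int timeout_ms)
  {
          struct pollfd pfd = { .fd = fd, .events = POLLOUT };
          int err = 0;
          socklen_t errlen = sizeof(err);

          fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
          if (connect(fd, sa, len) == 0)
                  return 0;                    /* connected immediately */
          if (errno != EINPROGRESS)
                  return -1;
          if (poll(&pfd, 1, timeout_ms) <= 0)
                  return -1;                   /* timed out or poll error */
          getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen);
          return err ? -1 : 0;
  }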

> I presume that the bug is that the timeout is too short, and not that the 
> timeout should cause a retry in a layer above -- because re-sending the 
> request over reliable TCP would not make sense.

The upper layer wouldn't resend the request, since the underlying transport never connected.

> For UDP the retries are handled inside the RPC call, clntudp_call(). In 
> which case, what is the reason the UDP timeout differs from TCP? And is it 
> also too small?

RPC over UDP doesn't have to connect to the server.  I wonder what would happen if you specified "mountproto=udp" when mounting exports on this overloaded server.

We could consider making the TCP connect timeout value larger, but the price would be a mount.nfs command that is less responsive to hung servers.  For example, the legacy get_socket() code uses a 20 second TCP connect timeout for the pmap query, and a 30 second TCP connect timeout for the mount request.

But how large would be large enough?  If we chose a TCP connect timeout of 25 seconds, what would happen if your server required, say, 30 seconds to accept a new TCP connection?  I think we would agree that the server side should be fixed in that case.

IMO adding another mount option would be mistaken.  It adds clutter to the mount user interface to work around what is clearly a problem on the server.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
  2010-12-08 16:45                 ` Chuck Lever
@ 2010-12-08 17:31                   ` Mark Hills
  2010-12-08 18:28                     ` Chuck Lever
  0 siblings, 1 reply; 21+ messages in thread
From: Mark Hills @ 2010-12-08 17:31 UTC (permalink / raw)
  To: Chuck Lever; +Cc: J. Bruce Fields, Neil Brown, linux-nfs

On Wed, 8 Dec 2010, Chuck Lever wrote:

> On Dec 8, 2010, at 9:45 AM, Mark Hills wrote:
> 
> > On Wed, 1 Dec 2010, Chuck Lever wrote:
> > 
> >> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
> >> 
> >>> On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> >>> 
> >>>> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> >>> [...]
> >>>>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
> >>>>> the client log we see messages like:
> >>>>> 
> >>>>> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> >>>>> 
> >>>>> Using strace and isolating one of these, I can see a non-blocking connect 
> >>>>> has already managed to make a connection and even send/receive some data. 
> >>>>> 
> >>>>> But soon a timeout of 9999 milliseconds in poll() causes a problem in 
> >>>>> mount.nfs when waiting for a response of some sort. The socket in question 
> >>>>> is a connection to mountd:
> >>>>> 
> >>>>> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> >>>>> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> >>>>> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> >>>>> 
> >>>>> When it returns:
> >>>>> 
> >>>>> 26512 <... poll resumed> )              = 0 (Timeout)
> >>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>>>> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> >>>>> 26512 close(3)                          = 0
> >>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>>>> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> >>>>> 
> >>>>> There's no re-try from here, just a failed mount.
> > [...]
> >>>> Can you tell where the mountd threads are typically waiting?
> >>> 
> >>> Here's a trace from mountd. Note the long pause after fdatasync():
> >>> 
> >>> 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> >>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> >>> 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> >>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> >>> 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> >>> 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> >>> 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> > [...]
> >> fdatasync() will write all outstanding dirty data on that partition to 
> >> persistent storage, not just the dirty data for that file.  If /var is 
> >> on a partition that has a lot of other activity, then this is the 
> >> expected outcome, unfortunately.  I would expect such delays to be due 
> >> to a large amount of outstanding writes, not to periodic synchronous 
> >> writes.
> > 
> > Moving rmtab to a tmpfs, this moved the problem to another bottleneck 
> > writing to /var, this time via syslog.
> > 
> > We identified a crontab entry which uses RPM to apply updates to the 
> > server OS at randomised times which was blocking /var for 10 seconds or 
> > so.
> 
> I think this is closer to the root cause than the mount.nfs connect 
> timeout.  I might consider RPM or syslog to be the final culprit, but I 
> wonder why a multi-threaded mountd would stop servicing TCP connects 
> because it couldn't write to /var or to syslog?

Of course this is the root cause, but a momentary bottleneck at the server 
isn't such a rare occurrence, particularly when servicing bursts on a large 
HPC cluster.

> > This returns the issue to giving up after the 9999 millisecond timeout 
> > above, which I think is making the client overly sensitive.
> > 
> > The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12. 
> > We are using mount.nfs for a v3 mount over TCP.
> > 
> > Locating this timeout wasn't easy. There's no '9999' in any of the 
> > nfs-utils or glibc's rpc code.
> > 
> > It turns out that the timeout originates in 
> > nfs-utils/support/nfs/rpc_socket.c as
> > 
> >  #define NFSRPC_TIMEOUT_TCP	(10)
> > 
> > This is used to populate a struct timeval *timeout which is used and 
> > updated by nfs_connect_nb(). Then the remaining timeout of 9999 
> > milliseconds is used in an RPC clnt_call operation.
> > 
> > 10 seconds over TCP is simply too short to be reliable here; it's not a 
> > long period of time for a fatal timeout. I will attempt to test on our 
> > cluster with an extended timeout.
> 
> This is a TCP connect timeout, not a RPC retransmit timeout.  Waiting 10 
> seconds for a TCP connect is really not that obnoxious.  The reason this 
> is short is because we want mount.nfs to realize quickly that it's 
> making no progress on stuck servers.

If my understanding of the code is correct, the 10 second timeout is not 
just for the connect, but also for making an RPC call and receiving the 
response (the struct timeval is carried through). Hence the diagnosis in 
the previous mail.

> > I presume that the bug is that the timeout is too short, and not that 
> > the timeout should cause a retry in a layer above -- because 
> > re-sending the request over reliable TCP would not make sense.
> 
> The upper layer wouldn't resend the request, since the underlying 
> transport never connected.
> 
> > For UDP the retries are handled inside the RPC call, clntudp_call(). 
> > In which case, what is the reason the UDP timeout differs from TCP? 
> > And is it also too small?
> 
> RPC over UDP doesn't have to connect to the server.  I wonder what would 
> happen if you specified "mountproto=udp" when mounting exports on this 
> overloaded server.

I think we'd still see the same issue over UDP, perhaps even more brutally 
because of the shorter 3 second UDP timeout.

To clarify, we've deviated from the problem described in the original 
email; it is not opening a TCP connection to mountd which is the problem, 
but mountd's ability to send a message in response.
 
> We could consider making the TCP connect timeout value larger, but the 
> price would be a mount.nfs command that is less responsive to hung 
> servers.  For example, the legacy get_socket() code uses a 20 second TCP 
> connect timeout for the pmap query, and a 30 second TCP connect timeout 
> for the mount request.
> 
> But how large would be large enough?  If we chose a TCP connect timeout 
> of 25 seconds, what would happen if your server required, say, 30 
> seconds to accept a new TCP connection?  I think we would agree that the 
> server side should be fixed in that case.

Whilst these timeouts could have resolved our problem, they are perhaps 
too short a basis on which to conclude that a server is fatally down, 
particularly if we have already managed to open a TCP connection.

We've certainly managed to see the problem in the wild on a fairly 
'standard' RHEL 5.4 Linux server. In this case it was rpm, so what if the 
sysadmin needs to run 'rpm' to perform some administration? They can't 
do this without high risk to the clients.

> IMO adding another mount option would be mistaken.  It adds clutter to 
> the mount user interface to work around what is clearly a problem on the 
> server.

I agree that new options can be cluttered.

But I think the problem needs attention at both the client and server side. 
In this case the cause of our load spike was something that we hope can be 
avoided, but I can't eliminate the risk. When serving many hundreds of 
clients we expect hot-spots of load, especially when a single user has 
control over large groups of client hosts.

Furthermore the chain of events of a mount failure is so severe, because 
autofs caches the failed mount and presents an empty directory for some 
time after.

-- 
Mark

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
  2010-12-08 17:31                   ` Mark Hills
@ 2010-12-08 18:28                     ` Chuck Lever
  2010-12-08 18:37                       ` J. Bruce Fields
  2010-12-13 16:19                       ` Chuck Lever
  0 siblings, 2 replies; 21+ messages in thread
From: Chuck Lever @ 2010-12-08 18:28 UTC (permalink / raw)
  To: Mark Hills; +Cc: J. Bruce Fields, Neil Brown, linux-nfs


On Dec 8, 2010, at 12:31 PM, Mark Hills wrote:

> On Wed, 8 Dec 2010, Chuck Lever wrote:
> 
>> On Dec 8, 2010, at 9:45 AM, Mark Hills wrote:
>> 
>>> On Wed, 1 Dec 2010, Chuck Lever wrote:
>>> 
>>>> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
>>>> 
>>>>> On Tue, 30 Nov 2010, J. Bruce Fields wrote:
>>>>> 
>>>>>> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
>>>>> [...]
>>>>>>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
>>>>>>> the client log we see messages like:
>>>>>>> 
>>>>>>> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
>>>>>>> 
>>>>>>> Using strace and isolating one of these, I can see a non-blocking connect 
>>>>>>> has already managed to make a connection and even send/receive some data. 
>>>>>>> 
>>>>>>> But soon a timeout of 9999 milliseconds in poll() causes a problem in 
>>>>>>> mount.nfs when waiting for a response of some sort. The socket in question 
>>>>>>> is a connection to mountd:
>>>>>>> 
>>>>>>> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
>>>>>>> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
>>>>>>> 
>>>>>>> When it returns:
>>>>>>> 
>>>>>>> 26512 <... poll resumed> )              = 0 (Timeout)
>>>>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>>>> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
>>>>>>> 26512 close(3)                          = 0
>>>>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>>>> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
>>>>>>> 
>>>>>>> There's no re-try from here, just a failed mount.
>>> [...]
>>>>>> Can you tell where the mountd threads are typically waiting?
>>>>> 
>>>>> Here's a trace from mountd. Note the long pause after fdatasync():
>>>>> 
>>>>> 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
>>>>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
>>>>> 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
>>>>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
>>>>> 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
>>>>> 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
>>>>> 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
>>> [...]
>>>> fdatasync() will write all outstanding dirty data on that partition to 
>>>> persistent storage, not just the dirty data for that file.  If /var is 
>>>> on a partition that has a lot of other activity, then this is the 
>>>> expected outcome, unfortunately.  I would expect such delays to be due 
>>>> to a large amount of outstanding writes, not to periodic synchronous 
>>>> writes.
>>> 
>>> Moving rmtab to a tmpfs, this moved the problem to another bottleneck 
>>> writing to /var, this time via syslog.
>>> 
>>> We identified a crontab entry which uses RPM to apply updates to the 
>>> server OS at randomised times which was blocking /var for 10 seconds or 
>>> so.
>> 
>> I think this is closer to the root cause than the mount.nfs connect 
>> timeout.  I might consider RPM or syslog to be the final culprit, but I 
>> wonder why a multi-threaded mountd would stop servicing TCP connects 
>> because it couldn't write to /var or to syslog?
> 
> Of course this is the root cause, but a momentary bottleneck at the server 
> isn't such a rare occurrence, particularly when servicing bursts on a large 
> HPC cluster.

Fwiw, the solution on enterprise storage servers (often typically used in HPC cluster environments) has been to improve mountd so that momentary server system bottlenecks do not cause sporadic mountd service.  That seems like the larger part of an appropriate response to this problem on Linux.

>>> This returns the issue to giving up after the 9999 millisecond timeout 
>>> above, which I think is making the client overly sensitive.
>>> 
>>> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12. 
>>> We are using mount.nfs for a v3 mount over TCP.
>>> 
>>> Locating this timeout wasn't easy. There's no '9999' in any of the 
>>> nfs-utils or glibc's rpc code.
>>> 
>>> It turns out that the timeout originates in 
>>> nfs-utils/support/nfs/rpc_socket.c as
>>> 
>>> #define NFSRPC_TIMEOUT_TCP	(10)
>>> 
>>> This is used to populate a struct timeval *timeout which is used and 
>>> updated by nfs_connect_nb(). Then the remaining timeout of 9999 
>>> milliseconds is used in an RPC clnt_call operation.
>>> 
>>> 10 seconds over TCP is simply too short to be reliable here; it's not a 
>>> long period of time for a fatal timeout. I will attempt to test on our 
>>> cluster with an extended timeout.
>> 
>> This is a TCP connect timeout, not a RPC retransmit timeout.  Waiting 10 
>> seconds for a TCP connect is really not that obnoxious.  The reason this 
>> is short is because we want mount.nfs to realize quickly that it's 
>> making no progress on stuck servers.
> 
> If my understanding of the code is correct, the 10 second timeout is not 
> just for the connect, but also for making an RPC call and receiving the 
> response (the struct timeval is carried through). Hence the diagnosis in 
> the previous mail.
> 
>>> I presume that the bug is that the timeout is too short, and not that 
>>> the timeout should cause a retry in a layer above -- because 
>>> re-sending the request over reliable TCP would not make sense.
>> 
>> The upper layer wouldn't resend the request, since the underlying 
>> transport never connected.
>> 
>>> For UDP the retries are handled inside the RPC call, clntudp_call(). 
>>> In which case, what is the reason the UDP timeout differs from TCP? 
>>> And is it also too small?
>> 
>> RPC over UDP doesn't have to connect to the server.  I wonder what would 
>> happen if you specified "mountproto=udp" when mounting exports on this 
>> overloaded server.
> 
> I think we'd still see the same issue over UDP, perhaps even more brutally 
> because of the shorter 3 second UDP timeout.
> 
> To clarify, we've deviated from the problem described in the original 
> email; it is not opening a TCP connection to mountd which is the problem, 
> but mountd's ability to send a message in response.

So, you've verified (with, say, a network trace) that the clients are connecting successfully, but mountd is not sending a reply to the MNT request within ten seconds?

>> We could consider making the TCP connect timeout value larger, but the 
>> price would be a mount.nfs command that is less responsive to hung 
>> servers.  For example, the legacy get_socket() code uses a 20 second TCP 
>> connect timeout for the pmap query, and a 30 second TCP connect timeout 
>> for the mount request.
>> 
>> But how large would be large enough?  If we chose a TCP connect timeout 
>> of 25 seconds, what would happen if your server required, say, 30 
>> seconds to accept a new TCP connection?  I think we would agree that the 
>> server side should be fixed in that case.
> 
> Whilst these timeouts could have resolved our problem, they are perhaps 
> too short to consider that a server is fatally down. Particularly if we 
> have already managed to open a TCP connection.

Perhaps.  If mountd is hung or otherwise unresponsive, a TCP connection could be open but the service may still be unresponsive for some reason.  But that's a nit.

> We've certainly managed to see the problem in the wild on a fairly 
> 'standard' RHEL 5.4 Linux server. In this case it was rpm, so what if the 
> sysadmin needs to run 'rpm' to perform some administration? They can't 
> do this without high risk to the clients.

From what I have read, the server's configuration is the immediate problem here: namely that the disk that /var is on is slow, or there are bugs (or not-very-useful behavior) in the file system implementation used on /var, or that rpm should not cause such latency for other applications, or that the multi-threading architecture of mountd is incorrect.  Or some combination of these.

A ten second latency for file system operations is the real issue.

>> IMO adding another mount option would be mistaken.  It adds clutter to 
>> the mount user interface to work around what is clearly a problem on the 
>> server.
> 
> I agree that new options can be cluttered.
> 
> But I think the problem needs attention at both the client and server side.

It seems reasonable to relax mount.nfs's timeout settings used when performing mount and rpcbind requests.  The question I have is what would be reasonable values to use instead of what we have?  (That question is directed to everyone who is still reading).

> In this case the cause of our load spike was something that we hope can be 
> avoided, but I can't eliminate the risk. When serving many hundreds of 
> clients we expect hot-spots of load, especially when a single user has 
> control over large groups of client hosts.

Yes, that's a typical problem in compute clusters that use NFS.

> Furthermore the chain of events of a mount failure is so severe, because 
> autofs caches the failed mount and presents an empty directory for some 
> time after.

That's a good argument to get mountd changed so that it can better handle such mount bursts.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
  2010-12-08 18:28                     ` Chuck Lever
@ 2010-12-08 18:37                       ` J. Bruce Fields
  2010-12-08 20:34                         ` Chuck Lever
  2010-12-08 21:04                         ` Chuck Lever
  2010-12-13 16:19                       ` Chuck Lever
  1 sibling, 2 replies; 21+ messages in thread
From: J. Bruce Fields @ 2010-12-08 18:37 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Mark Hills, Neil Brown, linux-nfs

On Wed, Dec 08, 2010 at 01:28:06PM -0500, Chuck Lever wrote:
> From what I have read, the server's configuration is the immediate problem here: namely that the disk that /var is on is slow, or there are bugs (or not-very-useful behavior) in the file system implementation used on /var, or that rpm should not cause such latency for other applications, or that the multi-threading architecture of mountd is incorrect.  Or some combination of these.
> 
> A ten second latency for file system operations is the real issue.
> 
> >> IMO adding another mount option would be mistaken.  It adds clutter to 
> >> the mount user interface to work around what is clearly a problem on the 
> >> server.
> > 
> > I agree that new options can be cluttered.
> > 
> > But I think the problem needs attention at both the client and server side.
> 
> It seems reasonable to relax mount.nfs's timeout settings used when performing mount and rpcbind requests.  The question I have is what would be reasonable values to use instead of what we have?  (That question is directed to everyone who is still reading).

What's the behavior you want when the server is just off or unreachable?

I'd've thought that there are cases where you just want the mount to
hang till it's interrupted or the problem is fixed.  (Autofs possibly
being one of them.)

--b.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
  2010-12-08 18:37                       ` J. Bruce Fields
@ 2010-12-08 20:34                         ` Chuck Lever
  2010-12-08 21:04                         ` Chuck Lever
  1 sibling, 0 replies; 21+ messages in thread
From: Chuck Lever @ 2010-12-08 20:34 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Mark Hills, Neil Brown, linux-nfs


On Dec 8, 2010, at 1:37 PM, J. Bruce Fields wrote:

> On Wed, Dec 08, 2010 at 01:28:06PM -0500, Chuck Lever wrote:
>> From what I have read, the server's configuration is the immediate problem here: namely that the disk that /var is on is slow, or there are bugs (or not-very-useful behavior) in the file system implementation used on /var, or that rpm should not cause such latency for other applications, or that the multi-threading architecture of mountd is incorrect.  Or some combination of these.
>> 
>> A ten second latency for file system operations is the real issue.
>> 
>>>> IMO adding another mount option would be mistaken.  It adds clutter to 
>>>> the mount user interface to work around what is clearly a problem on the 
>>>> server.
>>> 
>>> I agree that new options can be cluttered.
>>> 
>> But I think the problem needs attention at both the client and server side.
>> 
>> It seems reasonable to relax mount.nfs's timeout settings used when performing mount and rpcbind requests.  The question I have is what would be reasonable values to use instead of what we have?  (That question is directed to everyone who is still reading).
> 
> What's the behavior you want when the server is just off or unreachable?
> 
> I'd've thought that there are cases where you just want the mount to
> hang till it's interrupted or the problem is fixed.  (Autofs possibly
> being one of them.)

Perhaps we should retry in this case until the mount succeeds or the retry= timeout expires.  Then the value of NFSRPC_TIMEOUT_TCP would be inconsequential.
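
Roughly the shape I have in mind, as pseudo-code rather than a patch; try_mount_once() stands in for the existing connect-plus-MNT attempt, and retry_minutes for whatever retry= works out to:

  time_t deadline = time(NULL) + retry_minutes * 60;   /* from retry= */

  for (;;) {
          if (try_mount_once(server) == 0)
                  return 0;                     /* mounted */
          if (time(NULL) >= deadline)
                  return -1;                    /* now give up for real */
          sleep(10);                            /* back off, then retry */
  }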

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
  2010-12-08 18:37                       ` J. Bruce Fields
  2010-12-08 20:34                         ` Chuck Lever
@ 2010-12-08 21:04                         ` Chuck Lever
  1 sibling, 0 replies; 21+ messages in thread
From: Chuck Lever @ 2010-12-08 21:04 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Mark Hills, Neil Brown, linux-nfs


On Dec 8, 2010, at 1:37 PM, J. Bruce Fields wrote:

> On Wed, Dec 08, 2010 at 01:28:06PM -0500, Chuck Lever wrote:
>> From what I have read, the server's configuration is the immediate problem here: namely that the disk that /var is on is slow, or there are bugs (or not-very-useful behavior) in the file system implementation used on /var, or that rpm should not cause such latency for other applications, or that the multi-threading architecture of mountd is incorrect.  Or some combination of these.
>> 
>> A ten second latency for file system operations is the real issue.
>> 
>>>> IMO adding another mount option would be mistaken.  It adds clutter to 
>>>> the mount user interface to work around what is clearly a problem on the 
>>>> server.
>>> 
>>> I agree that new options can be cluttered.
>>> 
>>> But I think the problem needs attention at both the client and server side.
>> 
>> It seems reasonable to relax mount.nfs's timeout settings used when performing mount and rpcbind requests.  The question I have is what would be reasonable values to use instead of what we have?  (That question is directed to everyone who is still reading).
> 
> What's the behavior you want when the server is just off or unreachable?
> 
> I'd've thought that there are cases where you just want the mount to
> hang till it's interrupted or the problem is fixed.  (Autofs possibly
> being one of them.)

Be it noted however that autofs usually wants a quick yes or no answer in these cases.  Auto mounts that take a long time to succeed usually drive users crazy, especially if the automounter is single-threaded.  That's why this timeout is kept short.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
  2010-12-08 18:28                     ` Chuck Lever
  2010-12-08 18:37                       ` J. Bruce Fields
@ 2010-12-13 16:19                       ` Chuck Lever
  1 sibling, 0 replies; 21+ messages in thread
From: Chuck Lever @ 2010-12-13 16:19 UTC (permalink / raw)
  To: Mark Hills; +Cc: J. Bruce Fields, Neil Brown, Linux NFS Mailing List


On Dec 8, 2010, at 1:28 PM, Chuck Lever wrote:

> 
> On Dec 8, 2010, at 12:31 PM, Mark Hills wrote:
> 
>> On Wed, 8 Dec 2010, Chuck Lever wrote:
>> 
>>> On Dec 8, 2010, at 9:45 AM, Mark Hills wrote:
>>> 
>>>> On Wed, 1 Dec 2010, Chuck Lever wrote:
>>>> 
>>>>> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
>>>>> 
>>>> This returns the issue to giving up after the 9999 millisecond timeout 
>>>> above, which I think is making the client overly sensitive.
>>>> 
>>>> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12. 
>>>> We are using mount.nfs for a v3 mount over TCP.
>>>> 
>>>> Locating this timeout wasn't easy. There's no '9999' in any of the 
>>>> nfs-utils or glibc's rpc code.
>>>> 
>>>> It turns out that the timeout originates in 
>>>> nfs-utils/support/nfs/rpc_socket.c as
>>>> 
>>>> #define NFSRPC_TIMEOUT_TCP	(10)
>>>> 
>>>> This is used to populate a struct timeval *timeout which is used and 
>>>> updated by nfs_connect_nb(). Then the remaining timeout of 9999 
>>>> milliseconds is used in an RPC clnt_call operation.
>>>> 
>>>> 10 seconds over TCP is simply too short to be reliable here; it's not a 
>>>> long period of time for a fatal timeout. I will attempt to test on our 
>>>> cluster with an extended timeout.
>>> 
>>> This is a TCP connect timeout, not a RPC retransmit timeout.  Waiting 10 
>>> seconds for a TCP connect is really not that obnoxious.  The reason this 
>>> is short is because we want mount.nfs to realize quickly that it's 
>>> making no progress on stuck servers.
>> 
>> If my understanding of the code is correct, the 10 second timeout is not 
>> just for the connect, but also for making an RPC call and receiving the 
>> response (the struct timeval is carried through). Hence the diagnosis in 
>> the previous mail.
>> 
>>>> I presume that the bug is that the timeout is too short, and not that 
>>>> the timeout should cause a retry in a layer above -- because 
>>>> re-sending the request over reliable TCP would not make sense.
>>> 
>>> The upper layer wouldn't resend the request, since the underlying 
>>> transport never connected.
>>> 
>>>> For UDP the retries are handled inside the RPC call, clntudp_call(). 
>>>> In which case, what is the reason the UDP timeout differs from TCP? 
>>>> And is it also too small?
>>> 
>>> RPC over UDP doesn't have to connect to the server.  I wonder what would 
>>> happen if you specified "mountproto=udp" when mounting exports on this 
>>> overloaded server.
>> 
>> I think we'd still see the same issue over UDP, perhaps even more brutally 
>> because of the shorter 3 second UDP timeout.
>> 
>> To clarify, we've deviated from the problem described in the original 
>> email; it is not opening a TCP connection to mountd which is the problem, 
>> but mountd's ability to send a message in response.
> 
> So, you've verified (with, say, a network trace) that the clients are connecting successfully, but mountd is not sending a reply to the MNT request within ten seconds?

I've opened https://bugzilla.linux-nfs.org/show_bug.cgi?id=196 to track the mount.nfs part of this issue.

It seems to me, however, that under normal circumstances nfs-utils 1.2.1 on Fedora 12 should be doing a kernel mount, and mount.nfs should not be attempting to send a MNT request.  Would it be possible to enable some trace debugging to capture the sequence of kernel events during a failed mount?

On the clients:

  # rpcdebug -m nfs -s mount

To disable:

  # rpcdebug -m nfs -c

Debugging messages will appear in /var/log/messages.

A full network trace could also be useful.  Just needed while your jobs are starting so we can capture the network traffic at mount time.

  # tcpdump -s 1600 -w /tmp/raw

Doing this on your clients would likely be more reliable than on the server.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2010-12-13 16:19 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-15 18:43 Listen backlog set to 64 Mark Hills
2010-11-16 18:20 ` J. Bruce Fields
2010-11-16 19:05   ` Mark Hills
2010-11-16 22:08   ` Neil Brown
2010-11-29 20:59     ` J. Bruce Fields
2010-11-30 17:50       ` Mark Hills
2010-11-30 20:00         ` J. Bruce Fields
2010-11-30 22:09           ` Mark Hills
2010-12-01 18:18           ` Mark Hills
2010-12-01 18:28             ` Chuck Lever
2010-12-01 18:46               ` J. Bruce Fields
2010-12-08 14:45               ` mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64) Mark Hills
2010-12-08 15:38                 ` J. Bruce Fields
2010-12-08 16:45                 ` Chuck Lever
2010-12-08 17:31                   ` Mark Hills
2010-12-08 18:28                     ` Chuck Lever
2010-12-08 18:37                       ` J. Bruce Fields
2010-12-08 20:34                         ` Chuck Lever
2010-12-08 21:04                         ` Chuck Lever
2010-12-13 16:19                       ` Chuck Lever
2010-12-01 18:36             ` Listen backlog set to 64 J. Bruce Fields
