All of lore.kernel.org
 help / color / mirror / Atom feed
From: "J. Bruce Fields" <bfields@fieldses.org>
To: Mark Hills <mark@pogo.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>, Neil Brown <neilb@suse.de>,
	linux-nfs@vger.kernel.org
Subject: Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)
Date: Wed, 8 Dec 2010 10:38:38 -0500	[thread overview]
Message-ID: <20101208153838.GD32305@fieldses.org> (raw)
In-Reply-To: <alpine.NEB.2.01.1012071143080.28772@jrf.vwaro.pbz>

On Wed, Dec 08, 2010 at 02:45:40PM +0000, Mark Hills wrote:
> On Wed, 1 Dec 2010, Chuck Lever wrote:
> 
> > On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
> > 
> > > On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> > > 
> > >> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> > > [...]
> > >>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In 
> > >>> the client log we see messages like:
> > >>> 
> > >>>  Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> > >>> 
> > >>> Using strace and isolating one of these, I can see a non-blocking connect 
> > >>> has already managed to make a connection and even send/receive some data. 
> > >>> 
> > >>> But soon a timeout of 9999 milliseconds in poll() causes a problem in 
> > >>> mount.nfs when waiting for a response of some sort. The socket in question 
> > >>> is a connection to mountd:
> > >>> 
> > >>>  26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> > >>>  26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> > >>>  26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> > >>> 
> > >>> When it returns:
> > >>> 
> > >>>  26512 <... poll resumed> )              = 0 (Timeout)
> > >>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > >>>  26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> > >>>  26512 close(3)                          = 0
> > >>>  26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > >>>  26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> > >>> 
> > >>> There's no re-try from here, just a failed mount.
> [...]
> > >> Can you tell where the mountd threads are typically waiting?
> > > 
> > > Here's a trace from mountd. Note the long pause after fdatasync():
> > > 
> > > 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> > > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > > 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> > > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > > 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> > > 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> > > 31156 12:09:40 fdatasync(14)            = 0 <15.413401>
> [...]
> > fdatasync() will write all outstanding dirty data on that partition to 
> > persistent storage, not just the dirty data for that file.  If /var is 
> > on a partition that has a lot of other activity, then this is the 
> > expected outcome, unfortunately.  I would expect such delays to be due 
> > to a large amount of outstanding writes, not to periodic synchronous 
> > writes.
> 
> Moving rmtab to a tmpfs, this moved the problem to another bottleneck 
> writing to /var, this time via syslog.
> 
> We identified a crontab entry which uses RPM to apply updates to the 
> server OS at randomised times which was blocking /var for 10 seconds or 
> so.

So is the problem that mountd ends up waiting on a synchronous write of
one of those "authenticated mount request" log messages?

Maybe we should downgrade those messages?  On a server with a lot of
clients a log on every mount call seems obnoxious.  Doing it
synchronously even more so (though I guess that's a problem for the
syslog configuration.)

> This returns the issue to giving up after the 9999 millisecond timeout 
> above, which I think is making the client overly sensitive.
> 
> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12. 
> We are using mount.nfs for a v3 mount over TCP.
> 
> Locating this timeout wasn't easy. There's no '9999' in any of the 
> nfs-utils or glibc's rpc code.
> 
> It turns out that the timeout originates in 
> nfs-utils/support/nfs/rpc_socket.c as
> 
>   #define NFSRPC_TIMEOUT_TCP	(10)
> 
> This is used to populate a struct timeval *timeout which is used and 
> updated by nfs_connect_nb(). Then the remaining timeout of 9999 
> milliseconds is used in an RPC clnt_call operation.
> 
> 10 seconds over TCP is simply too short to be reliable here; it's not a 
> long period of time for a fatal timeout. I will attempt to test on our 
> cluster with an extended timeout.

Should this be tied to some mount option?

Chuck probably knows how this is supposed to work....

--b.

> 
> I presume that the bug is that the timeout is too short, and not that the 
> timeout should cause a retry in a layer above -- because re-sending the 
> request over reliable TCP would not make sense.
> 
> For UDP the retries are handled inside the RPC call, clntudp_call(). In 
> which case, what is the reason the UDP timeout differs from TCP? And is it 
> also too small?

  reply	other threads:[~2010-12-08 15:38 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-15 18:43 Listen backlog set to 64 Mark Hills
2010-11-16 18:20 ` J. Bruce Fields
2010-11-16 19:05   ` Mark Hills
2010-11-16 22:08   ` Neil Brown
2010-11-29 20:59     ` J. Bruce Fields
2010-11-30 17:50       ` Mark Hills
2010-11-30 20:00         ` J. Bruce Fields
2010-11-30 22:09           ` Mark Hills
2010-12-01 18:18           ` Mark Hills
2010-12-01 18:28             ` Chuck Lever
2010-12-01 18:46               ` J. Bruce Fields
2010-12-08 14:45               ` mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64) Mark Hills
2010-12-08 15:38                 ` J. Bruce Fields [this message]
2010-12-08 16:45                 ` Chuck Lever
2010-12-08 17:31                   ` Mark Hills
2010-12-08 18:28                     ` Chuck Lever
2010-12-08 18:37                       ` J. Bruce Fields
2010-12-08 20:34                         ` Chuck Lever
2010-12-08 21:04                         ` Chuck Lever
2010-12-13 16:19                       ` Chuck Lever
2010-12-01 18:36             ` Listen backlog set to 64 J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101208153838.GD32305@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=mark@pogo.org.uk \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.