NFS4 client loop (10025 / BAD_STATEID)

From: Mike Grant @ 2012-04-05 17:26 UTC
  To: linux-nfs

Hi,

We've recently had some issues with NFS clients hammering servers until
they slow to a crawl, due to a loop condition on NFS4 BAD_STATEID (error
10025).  After trawling the archives, I found something similar:
 http://www.spinics.net/lists/linux-nfs/msg25012.html
  ("RE: NFS4ERR_STALE_CLIENTID loop" Oct 2011)

I believe the outcome was that this was probably a Solaris server bug,
but the archive search makes it tricky to be sure.

Our issue is similar, albeit with BAD_STATEID.  A couple of tcpdumps can
be found at:
 http://rsg.pml.ac.uk/staff/mggr/linux-nfs/
The clients are a bit outdated (Fedora 14, running
2.6.35.14-106.fc14.x86_64).

This is also against a Solaris server and, while not reproducible on
demand, happens about once every 2 days.  There are three machines in
this loop as I write ;)  Anyway, I'm assuming that's Oracle's (and our)
problem.

However, we saw the same situation against a Linux server (RHEL 6,
2.6.32-71.el6.x86_64) about two weeks ago.  The server was rebooted, and
2 workstations (out of 40) that were active at the time went into the
same sort of loop when it came back up.  Unfortunately the workstations
were quickly rebooted before any info was gathered, and it hasn't
recurred yet.

We're likely to do another reboot sometime after Easter, so I have my
fingers crossed we'll get a repeat of the issue.  If so, what info would
you ideally want us to capture, and under what conditions, bearing in
mind this is a core operational fileserver?  (i.e. we'd rather not run
development kernels on it)

Cheers,

Mike Grant.
