* Stale NFS file handle
@ 2012-02-13 23:32 Székelyi Szabolcs
  2012-02-13 23:34 ` Sage Weil
  2012-02-14  1:04 ` Tommi Virtanen
  0 siblings, 2 replies; 22+ messages in thread
From: Székelyi Szabolcs @ 2012-02-13 23:32 UTC (permalink / raw)
  To: ceph-devel

Hi,

I'm using Ceph 0.41 with the FUSE client. After a while I get "stale NFS
file handle" errors when trying to read a file or list a directory. Logs
and scrubbing don't show any errors or suspicious entries. After remounting
the filesystem, either by restarting the cluster (thus forcing the clients
to reconnect) or by umount+mount, files and directories either show up
again or seem lost forever.

Can you give me any hint on what to check?

Thanks,
-- 
cc



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Stale NFS file handle
  2012-02-13 23:32 Stale NFS file handle Székelyi Szabolcs
@ 2012-02-13 23:34 ` Sage Weil
  2012-02-13 23:51   ` Székelyi Szabolcs
  2012-02-14  1:04 ` Tommi Virtanen
  1 sibling, 1 reply; 22+ messages in thread
From: Sage Weil @ 2012-02-13 23:34 UTC (permalink / raw)
  To: Székelyi Szabolcs; +Cc: ceph-devel

On Tue, 14 Feb 2012, Székelyi Szabolcs wrote:
> I'm using Ceph 0.41 with the FUSE client. After a while I get "stale NFS
> file handle" errors when trying to read a file or list a directory. Logs
> and scrubbing don't show any errors or suspicious entries. After
> remounting the filesystem, either by restarting the cluster (thus forcing
> the clients to reconnect) or by umount+mount, files and directories
> either show up again or seem lost forever.
> 
> Can you give me any hint on what to check?

Are you reexporting NFS, or are you getting ESTALE from the fuse mount 
itself?

sage


* Re: Stale NFS file handle
  2012-02-13 23:34 ` Sage Weil
@ 2012-02-13 23:51   ` Székelyi Szabolcs
  2012-02-13 23:54     ` Sage Weil
  0 siblings, 1 reply; 22+ messages in thread
From: Székelyi Szabolcs @ 2012-02-13 23:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 2012. February 13. 15:34:13 Sage Weil wrote:
> On Tue, 14 Feb 2012, Székelyi Szabolcs wrote:
> > I'm using Ceph 0.41 with the FUSE client. After a while I get "stale NFS
> > file handle" errors when trying to read a file or list a directory. Logs
> > and scrubbing don't show any errors or suspicious entries. After
> > remounting the filesystem, either by restarting the cluster (thus forcing
> > the clients to reconnect) or by umount+mount, files and directories
> > either show up again or seem lost forever.
> > 
> > Can you give me any hint on what to check?
> 
> Are you reexporting NFS, or are you getting ESTALE from the fuse mount
> itself?

No, there's no NFS in the picture. The OSDs' backend storage is on a local 
filesystem. I think it's the FUSE client telling me this.

-- 
cc


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Stale NFS file handle
  2012-02-13 23:51   ` Székelyi Szabolcs
@ 2012-02-13 23:54     ` Sage Weil
  2012-02-14  0:51       ` Székelyi Szabolcs
  0 siblings, 1 reply; 22+ messages in thread
From: Sage Weil @ 2012-02-13 23:54 UTC (permalink / raw)
  To: Székelyi Szabolcs; +Cc: ceph-devel

On Tue, 14 Feb 2012, Székelyi Szabolcs wrote:
> On 2012. February 13. 15:34:13 Sage Weil wrote:
> > On Tue, 14 Feb 2012, Székelyi Szabolcs wrote:
> > > I'm using Ceph 0.41 with the FUSE client. After a while I get "stale
> > > NFS file handle" errors when trying to read a file or list a directory.
> > > Logs and scrubbing don't show any errors or suspicious entries. After
> > > remounting the filesystem, either by restarting the cluster (thus
> > > forcing the clients to reconnect) or by umount+mount, files and
> > > directories either show up again or seem lost forever.
> > > 
> > > Can you give me any hint on what to check?
> > 
> > Are you reexporting NFS, or are you getting ESTALE from the fuse mount
> > itself?
> 
> No, there's no NFS in the picture. The OSDs' backend storage is on a local 
> filesystem. I think it's the FUSE client telling me this.

Okay, that sounds like a bug then.  The two interesting things would be a 
ceph-fuse log (--debug-client 10 --debug-ms 1 --log-file /path/to/log) and 
an mds log (debug mds = 20, debug ms = 1 in [mds] section of ceph.conf).
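Spelled out as a concrete sketch (the log path and mount point below are placeholders, not values given in this thread), that corresponds to a ceph.conf fragment:

```
[mds]
    debug mds = 20
    debug ms = 1
```

plus a client invocation along these lines:

```
ceph-fuse --debug-client 10 --debug-ms 1 \
    --log-file /var/log/ceph/ceph-fuse.log /mnt/ceph
```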

sage 


* Re: Stale NFS file handle
  2012-02-13 23:54     ` Sage Weil
@ 2012-02-14  0:51       ` Székelyi Szabolcs
  2012-02-23 18:43         ` Tommi Virtanen
  0 siblings, 1 reply; 22+ messages in thread
From: Székelyi Szabolcs @ 2012-02-14  0:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 2012. February 13. 15:54:39 Sage Weil wrote:
> On Tue, 14 Feb 2012, Székelyi Szabolcs wrote:
> > No, there's no NFS in the picture. The OSDs' backend storage is on a
> > local filesystem. I think it's the FUSE client telling me this.
> 
> Okay, that sounds like a bug then.  The two interesting things would be a
> ceph-fuse log (--debug-client 10 --debug-ms 1 --log-file /path/to/log) and
> an mds log (debug mds = 20, debug ms = 1 in [mds] section of ceph.conf).

Thanks, I've set it up, now waiting for it to break. ;)

-- 
cc



* Re: Stale NFS file handle
  2012-02-13 23:32 Stale NFS file handle Székelyi Szabolcs
  2012-02-13 23:34 ` Sage Weil
@ 2012-02-14  1:04 ` Tommi Virtanen
  2012-02-14 13:20   ` Székelyi Szabolcs
  1 sibling, 1 reply; 22+ messages in thread
From: Tommi Virtanen @ 2012-02-14  1:04 UTC (permalink / raw)
  To: Székelyi Szabolcs; +Cc: ceph-devel

2012/2/13 Székelyi Szabolcs <szekelyi@niif.hu>:
> I'm using Ceph 0.41 with the FUSE client. After a while I get "stale NFS
> file handle" errors when trying to read a file or list a directory. Logs
> and scrubbing don't show any errors or suspicious entries. After
> remounting the filesystem, either by restarting the cluster (thus forcing
> the clients to reconnect) or by umount+mount, files and directories
> either show up again or seem lost forever.
>
> Can you give me any hint on what to check?

Is this on a 32-bit machine, by any chance?

* Re: Stale NFS file handle
  2012-02-14  1:04 ` Tommi Virtanen
@ 2012-02-14 13:20   ` Székelyi Szabolcs
  0 siblings, 0 replies; 22+ messages in thread
From: Székelyi Szabolcs @ 2012-02-14 13:20 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On 2012. February 13. 17:04:27 Tommi Virtanen wrote:
> 2012/2/13 Székelyi Szabolcs <szekelyi@niif.hu>:
> > I'm using Ceph 0.41 with the FUSE client. After a while I get "stale NFS
> > file handle" errors when trying to read a file or list a directory. Logs
> > and scrubbing don't show any errors or suspicious entries. After
> > remounting the filesystem, either by restarting the cluster (thus forcing
> > the clients to reconnect) or by umount+mount, files and directories
> > either show up again or seem lost forever.
> > 
> > Can you give me any hint on what to check?
> 
> Is this on a 32-bit machine, by any chance?

No, all machines run 64-bit kernel and 64-bit userspace.

-- 
cc



* Re: Stale NFS file handle
  2012-02-14  0:51       ` Székelyi Szabolcs
@ 2012-02-23 18:43         ` Tommi Virtanen
  2012-02-24 12:25           ` Székelyi Szabolcs
  0 siblings, 1 reply; 22+ messages in thread
From: Tommi Virtanen @ 2012-02-23 18:43 UTC (permalink / raw)
  To: Székelyi Szabolcs; +Cc: Sage Weil, ceph-devel

2012/2/13 Székelyi Szabolcs <szekelyi@niif.hu>:
>> Okay, that sounds like a bug then.  The two interesting things would be a
>> ceph-fuse log (--debug-client 10 --debug-ms 1 --log-file /path/to/log) and
>> an mds log (debug mds = 20, debug ms = 1 in [mds] section of ceph.conf).
>
> Thanks, I've set it up, now waiting for it to break. ;)

Anything new here? Did you manage to capture logs of the problem?

* Re: Stale NFS file handle
  2012-02-23 18:43         ` Tommi Virtanen
@ 2012-02-24 12:25           ` Székelyi Szabolcs
  0 siblings, 0 replies; 22+ messages in thread
From: Székelyi Szabolcs @ 2012-02-24 12:25 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Sage Weil, ceph-devel

On 2012. February 23. 10:43:02 Tommi Virtanen wrote:
> 2012/2/13 Székelyi Szabolcs <szekelyi@niif.hu>:
> >> Okay, that sounds like a bug then.  The two interesting things would
> >> be a ceph-fuse log (--debug-client 10 --debug-ms 1 --log-file
> >> /path/to/log) and an mds log (debug mds = 20, debug ms = 1 in [mds]
> >> section of ceph.conf).
> 
> > Thanks, I've set it up, now waiting for it to break. ;)
> 
> Anything new here? Did you manage to capture logs of the problem?

No, not yet. I suspect it was not the result of steady-state operation, but 
was related to some cluster change (nodes going up and down), which hasn't 
happened since then. I'm still keeping an eye on it.

-- 
cc



* Re: Stale NFS file handle
  2016-12-24  9:48 Xen
@ 2017-01-03 19:41 ` J. Bruce Fields
  0 siblings, 0 replies; 22+ messages in thread
From: J. Bruce Fields @ 2017-01-03 19:41 UTC (permalink / raw)
  To: Xen; +Cc: linux-nfs

On Sat, Dec 24, 2016 at 10:48:29AM +0100, Xen wrote:
> Hi,
> 
> On a Debian server I have mounted several snapshots daily that I
> export with NFS.
> 
> At the end of the day the nfs-kernel-server service is shut down,
> the snapshots are renewed, remounted, and the server is brought
> online again.
> 
> In the beginning (I haven't been doing this for long) it all worked
> fine and I could mount the shares on the client, which is an older
> NAS unit, running an old kernel (2.6.32).
> 
> Yet one of the shares now refuses to get mounted and I don't know
> why. The only thing I haven't tried is actually renaming the mount
> points.
> 
> mount: mounting island.vpn:/srv/root on /mnt/remote/root failed:
> Stale NFS file handle
> 
> This "island.vpn" simply translates to 10.8.20.25, in this case.
> 
> This is one of 5 mounts and one of 5 snapshots. The other snapshots
> simply succeed.
> 
> I have rebooted both servers.
> 
> I have removed the mount points on both places: the mount points for
> the snapshots, and the mount points for the shares on the client.
> 
> I have run exportfs -r and exportfs -f.
> 
> Oh, apologies, I see the issue, or at least part of it.
> 
> Dec 24 02:45:35 island rpc.mountd[3217]: / and /srv/root have same
> filehandle for diskstation.vpn, using first

Huh.  That message is from utils/mountd/cache.c:nfsd_fh().

> I really wanted to find out whether it uses NFSv3 or NFSv4, but I think
> it uses NFSv4.
> 
> The above message does not always repeat itself:
> 
> Dec 24 02:56:35 island rpc.mountd[3217]: authenticated mount request
> from 10.8.20.1:944 for /srv/root (/srv/root)
> Dec 24 02:58:09 island rpc.mountd[3217]: authenticated mount request
> from 10.8.20.1:638 for /srv/boot (/srv/boot)
> 
> The site uses LVM snapshots, root (and boot) are regular, non-thin
> snapshots.
> 
> These are my exports:
> 
> /srv/home       diskstation(ro,no_subtree_check,no_root_squash)
> /srv/data       diskstation(ro,no_subtree_check,no_root_squash)
> /srv/sites      diskstation(ro,no_subtree_check,no_root_squash)
> /srv/boot       diskstation(ro,no_subtree_check,no_root_squash)
> /srv/root       diskstation(ro,no_subtree_check,no_root_squash)
> 
> All other mounts succeed without issue. Root did fine at first as well.
> 
> Edit: adding fsid=22 to the root line fixed it:
> 
> /srv/home       diskstation(ro,no_subtree_check,no_root_squash)
> /srv/data       diskstation(ro,no_subtree_check,no_root_squash)
> /srv/sites      diskstation(ro,no_subtree_check,no_root_squash)
> /srv/boot       diskstation(ro,no_subtree_check,no_root_squash)
> /srv/root       diskstation(ro,fsid=22,no_subtree_check,no_root_squash)
> 
> All snapshots are independently mounted and hence do not contain
> other mounts on them.
> 
> Well I'm glad that's sorted. I don't know why the NFS server would
> pick a filesystem to export that wasn't even mentioned. Of course
> the snapshot and the root (original) will have the same UUID.

At a guess, this may be the v4 pseudoroot code: it exports (under heavy
restrictions) all the directories required to reach any exported
filesystem.

So maybe mountd, when searching for a filesystem matching the given
filehandle, found that pseudoroot export for "/", then later found the
real export for "/srv/root", and resolved the conflict by sticking with
the first one.

We could tell mountd to resolve such conflicts in favor of non-pseudoroot
filesystems.  I'm not sure that would work.

Is there some way we could make sure a new uuid is generated for the
snapshots, so we avoid this kind of conflict even when explicitly
exporting multiple snapshots of the same filesystem?

Requiring admins to add explicit fsid= settings all over seems unhelpful.
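Until something like that exists, one workaround the admin can apply by hand is to regenerate the filesystem UUID on the snapshot volume before mounting and exporting it. This is an illustrative sketch, not something prescribed in this thread: the device path is a placeholder, tune2fs applies to ext2/3/4, and xfs_admin is the XFS equivalent.

```
# Give the snapshot its own filesystem UUID so it no longer collides
# with the origin filesystem (run while the snapshot is unmounted).
tune2fs -U random /dev/vg0/root-snap          # ext2/3/4
# xfs_admin -U generate /dev/vg0/root-snap    # XFS
```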

--b.

> Not its partition, but its filesystem will.
> 
> So I apologize for this message ;-).
> 
> Regards.

* Stale NFS file handle
@ 2016-12-24  9:48 Xen
  2017-01-03 19:41 ` J. Bruce Fields
  0 siblings, 1 reply; 22+ messages in thread
From: Xen @ 2016-12-24  9:48 UTC (permalink / raw)
  To: linux-nfs

Hi,

On a Debian server I have mounted several snapshots daily that I export 
with NFS.

At the end of the day the nfs-kernel-server service is shut down, the 
snapshots are renewed, remounted, and the server is brought online 
again.

In the beginning (I haven't been doing this for long) it all worked fine 
and I could mount the shares on the client, which is an older NAS unit, 
running an old kernel (2.6.32).

Yet one of the shares now refuses to get mounted and I don't know why. 
The only thing I haven't tried is actually renaming the mount points.

mount: mounting island.vpn:/srv/root on /mnt/remote/root failed: Stale 
NFS file handle

This "island.vpn" simply translates to 10.8.20.25, in this case.

This is one of 5 mounts and one of 5 snapshots. The other snapshots 
simply succeed.

I have rebooted both servers.

I have removed the mount points on both places: the mount points for the 
snapshots, and the mount points for the shares on the client.

I have run exportfs -r and exportfs -f.

Oh, apologies, I see the issue, or at least part of it.

Dec 24 02:45:35 island rpc.mountd[3217]: / and /srv/root have same 
filehandle for diskstation.vpn, using first

I really wanted to find out whether it uses NFSv3 or NFSv4, but I think it 
uses NFSv4.

The above message does not always repeat itself:

Dec 24 02:56:35 island rpc.mountd[3217]: authenticated mount request 
from 10.8.20.1:944 for /srv/root (/srv/root)
Dec 24 02:58:09 island rpc.mountd[3217]: authenticated mount request 
from 10.8.20.1:638 for /srv/boot (/srv/boot)

The site uses LVM snapshots, root (and boot) are regular, non-thin 
snapshots.

These are my exports:

/srv/home       diskstation(ro,no_subtree_check,no_root_squash)
/srv/data       diskstation(ro,no_subtree_check,no_root_squash)
/srv/sites      diskstation(ro,no_subtree_check,no_root_squash)
/srv/boot       diskstation(ro,no_subtree_check,no_root_squash)
/srv/root       diskstation(ro,no_subtree_check,no_root_squash)

All other mounts succeed without issue. Root did fine at first as well.

Edit: adding fsid=22 to the root line fixed it:

/srv/home       diskstation(ro,no_subtree_check,no_root_squash)
/srv/data       diskstation(ro,no_subtree_check,no_root_squash)
/srv/sites      diskstation(ro,no_subtree_check,no_root_squash)
/srv/boot       diskstation(ro,no_subtree_check,no_root_squash)
/srv/root       diskstation(ro,fsid=22,no_subtree_check,no_root_squash)

All snapshots are independently mounted and hence do not contain other 
mounts on them.

Well I'm glad that's sorted. I don't know why the NFS server would pick 
a filesystem to export that wasn't even mentioned. Of course the 
snapshot and the root (original) will have the same UUID.

Not its partition, but its filesystem will.

So I apologize for this message ;-).

Regards.


* RE: Stale NFS File Handle
  2006-02-03 19:09 ` Trond Myklebust
@ 2006-02-03 19:28   ` Roger Heflin
  0 siblings, 0 replies; 22+ messages in thread
From: Roger Heflin @ 2006-02-03 19:28 UTC (permalink / raw)
  To: 'Trond Myklebust', 'Brian D. McGrew'; +Cc: linux-kernel

 

> 
> Kernels prior to 2.6.12 (if memory serves me correctly) had a 
> series of errors in the code that converts filehandles into 
> valid dentries on the server. Upgrading to the FC4 kernel, 
> which I believe to be 2.6.14 based, is therefore very likely 
> to solve your problem.
> 
> Cheers,
>   Trond

Default FC4 is 2.6.11... so he would need to install one of the
updated kernels on FC4.

                           Roger



* RE: Stale NFS File Handle
  2006-02-03 18:05 Stale NFS File Handle Brian D. McGrew
  2006-02-03 19:09 ` Trond Myklebust
@ 2006-02-03 19:24 ` Roger Heflin
  1 sibling, 0 replies; 22+ messages in thread
From: Roger Heflin @ 2006-02-03 19:24 UTC (permalink / raw)
  To: 'Brian D. McGrew', linux-kernel

 

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of 
> Brian D. McGrew
> Sent: Friday, February 03, 2006 12:06 PM
> To: linux-kernel@vger.kernel.org
> Subject: Stale NFS File Handle
> 
> Good morning all (kind of a long winded mail, please have patience!)
> 
> I've got an FC3 server running a 2.6.9 kernel and sharing 
> about 500GB of disk space on a RAID5 array via NFS.  This box 
> has been running fine for over a year now but in the last 
> three weeks or so I'm seeing a ton of Stale NFS File Handle 
> errors; especially in my overnight builds.
> 
> Most of my clients are FC3 and a couple of Solaris boxes 
> running a stock configuration.  All we're doing is serving up 
> NFS and compiling with GCC.  We're seeing this error more and 
> more and the harder I try to track it down, the more we're 
> seeing it (ok, maybe that's my imagination).
> 
> I'm guessing that the problem has to be somewhere in the FC3 
> server because I've still got some Solaris NFS servers that 
> have been running for years with no problems.
> 
> What should I be looking for in tracking this error down?  
> Should I upgrade my kernel?  Should I throw away FC3 and go 
> to Enterprise Linux?
> I'm at the end of my rope here because this is now causing a 
> major set back to our development team!
> 
> Please help!


Brian,

That is an ancient kernel, well over a year old; I would try a
later one.

At a minimum, put on a later kernel, and maybe move to FC4, as there
are several different kernels to choose from there, some of which may
have issues, others of which may work.

You might also check when and how you are doing "exportfs -r"
and other exportfs-type commands, because I have seen this command
cause interesting race conditions (i.e. there is a window where
the clients apparently get a failure response).  My setup that
produced those messages required a busy machine, updating
/etc/exports from cron, and rerunning exportfs often; even with
all of that the failures were pretty rare, and only affected
some nodes on a given failure.

I don't know if the bug is still around, but it is something
to check.

                              Roger





* Re: Stale NFS File Handle
  2006-02-03 18:05 Stale NFS File Handle Brian D. McGrew
@ 2006-02-03 19:09 ` Trond Myklebust
  2006-02-03 19:28   ` Roger Heflin
  2006-02-03 19:24 ` Roger Heflin
  1 sibling, 1 reply; 22+ messages in thread
From: Trond Myklebust @ 2006-02-03 19:09 UTC (permalink / raw)
  To: Brian D. McGrew; +Cc: linux-kernel

On Fri, 2006-02-03 at 10:05 -0800, Brian D. McGrew wrote:
> Good morning all (kind of a long winded mail, please have patience!)
> 
> I've got an FC3 server running a 2.6.9 kernel and sharing about 500GB of
> disk space on a RAID5 array via NFS.  This box has been running fine for
> over a year now but in the last three weeks or so I'm seeing a ton of
> Stale NFS File Handle errors; especially in my overnight builds.
> 
> Most of my clients are FC3 and a couple of Solaris boxes running a stock
> configuration.  All we're doing is serving up NFS and compiling with
> GCC.  We're seeing this error more and more and the harder I try to
> track it down, the more we're seeing it (ok, maybe that's my
> imagination).
> 
> I'm guessing that the problem has to be somewhere in the FC3 server
> because I've still got some Solaris NFS servers that have been running
> for years with no problems.
> 
> What should I be looking for in tracking this error down?  Should I
> upgrade my kernel?  Should I throw away FC3 and go to Enterprise Linux?
> I'm at the end of my rope here because this is now causing a major set
> back to our development team!

Kernels prior to 2.6.12 (if memory serves me correctly) had a series of
errors in the code that converts filehandles into valid dentries on the
server. Upgrading to the FC4 kernel, which I believe to be 2.6.14 based,
is therefore very likely to solve your problem.

Cheers,
  Trond



* Stale NFS File Handle
@ 2006-02-03 18:05 Brian D. McGrew
  2006-02-03 19:09 ` Trond Myklebust
  2006-02-03 19:24 ` Roger Heflin
  0 siblings, 2 replies; 22+ messages in thread
From: Brian D. McGrew @ 2006-02-03 18:05 UTC (permalink / raw)
  To: linux-kernel

Good morning all (kind of a long winded mail, please have patience!)

I've got an FC3 server running a 2.6.9 kernel and sharing about 500GB of
disk space on a RAID5 array via NFS.  This box has been running fine for
over a year now but in the last three weeks or so I'm seeing a ton of
Stale NFS File Handle errors; especially in my overnight builds.

Most of my clients are FC3 and a couple of Solaris boxes running a stock
configuration.  All we're doing is serving up NFS and compiling with
GCC.  We're seeing this error more and more and the harder I try to
track it down, the more we're seeing it (ok, maybe that's my
imagination).

I'm guessing that the problem has to be somewhere in the FC3 server
because I've still got some Solaris NFS servers that have been running
for years with no problems.

What should I be looking for in tracking this error down?  Should I
upgrade my kernel?  Should I throw away FC3 and go to Enterprise Linux?
I'm at the end of my rope here because this is now causing a major set
back to our development team!

Please help!

-brian

Brian D. McGrew { brian@visionpro.com || brian@doubledimension.com }
--
> Those of you who think you know it all,
  really annoy those of us who do! 



* RE: Stale NFS file handle
@ 2005-03-23 18:59 Lever, Charles
  0 siblings, 0 replies; 22+ messages in thread
From: Lever, Charles @ 2005-03-23 18:59 UTC (permalink / raw)
  To: Filipe Brandenburger; +Cc: nfs

filipe-

in general the kernel patches i referred to earlier will prevent most
issues when using rsync and serving web pages.  an occasional ESTALE is
unavoidable because no NFS client can recover from an ESTALE during a
read operation.  however, the patches do allow a subsequent open(2)
operation on that pathname to find the new file.
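as an illustration of that open-time recovery (this is an editor's sketch under stated assumptions, not code from the patches discussed here): an application can retry open(2) when it gets ESTALE, since a fresh open re-resolves the pathname to the replacement file's new filehandle. the helper name, retry policy, and *opener* hook are all invented for the example.

```python
import errno
import os

def open_retrying_estale(path, flags=os.O_RDONLY, retries=3, opener=os.open):
    """Open *path*, retrying when the server answers ESTALE.

    A read() on an already-open descriptor cannot be rescued this way;
    only a fresh pathname lookup (a new open) can find the new file.
    The *opener* parameter exists purely so the sketch can be exercised
    without an NFS mount.
    """
    last_err = None
    for _ in range(retries):
        try:
            return opener(path, flags)
        except OSError as err:
            if err.errno != errno.ESTALE:
                raise          # any other error is not retryable here
            last_err = err     # stale handle: retry with a fresh lookup
    raise last_err
```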


> -----Original Message-----
> From: Filipe Brandenburger [mailto:branden@terra.com.br]
> Sent: Wednesday, March 23, 2005 12:15 PM
> To: Trond Myklebust
> Cc: Steve Dickson; nfs@lists.sourceforge.net
> Subject: Re: [NFS] Stale NFS file handle
> 
> * Wed, 23 Mar 2005 08:57:15 -0500, Trond Myklebust
> <trond.myklebust@fys.uio.no>:
> > He was running
> > 
> > while :; do cat test.txt; done >/dev/null
> > 
> > on a client, then deleting the file on the server. Even if the call to
> > open() is successful, you both can and will get ESTALEs on the
> > subsequent call to read().
> 
> Ok,
> 
> But then, how do you suggest I should change applications to do it? The
> applications that publish content to the NFS run on one host and are
> based on rsync, the applications that deliver content are web servers
> (Apache) reading from this same NFS on another pool of hosts (these are
> the ones that get the ESTALE error).
> 
> Where is the problem? On the applications that publish? Should they open
> the file and update it in-place instead of creating a new one and
> renaming? I don't think so! This would lead to content that is a mix of
> the old and the new, that is corrupt.
> 
> Or is the webserver? Should the application protect itself from ESTALE
> errors and retry? Somehow that seems wrong to me also. Then I would have
> to change all applications that read this content to do it. Why doesn't
> the NFS client recover from this kind of errors?
> 
> If it's really not possible to change it on the NFS client (the kernel),
> what workaround would you suggest me to use?
> 
> Thanks,
> Filipe


-------------------------------------------------------
This SF.net email is sponsored by Microsoft Mobile & Embedded DevCon 2005
Attend MEDC 2005 May 9-12 in Vegas. Learn more about the latest Windows
Embedded(r) & Windows Mobile(tm) platforms, applications & content.  Register
by 3/29 & save $300 http://ads.osdn.com/?ad_id=6883&alloc_id=15149&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs


* Re: Stale NFS file handle
  2005-03-23 17:15     ` Filipe Brandenburger
@ 2005-03-23 17:26       ` Trond Myklebust
  0 siblings, 0 replies; 22+ messages in thread
From: Trond Myklebust @ 2005-03-23 17:26 UTC (permalink / raw)
  To: Filipe Brandenburger; +Cc: Steve Dickson, nfs

On 23.03.2005 at 14:15 (-0300), Filipe Brandenburger wrote:

> If it's really not possible to change it on the NFS client (the kernel),
> what workaround would you suggest me to use?

Rename the file, then delete it once you know that the clients no longer
have it open. That's the obvious and standard way of dealing with this
sort of problem over NFS.
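The pattern described above can be sketched as follows (file names, layout, and the split into two steps are illustrative, not prescribed by this thread): write a complete new copy, rename the live file aside so clients holding its filehandle keep reading a consistent old version, expose the new copy under the live name, and unlink the set-aside copy only once clients are done with it.

```python
import os

def publish(path, data):
    """Replace *path* using the rename-then-delete-later pattern.

    Returns the name of the set-aside old file (or None); the caller
    removes it once clients are known to have closed it.
    """
    tmp = path + ".new"
    with open(tmp, "w") as f:
        f.write(data)              # complete new version, written first
    old = None
    if os.path.exists(path):
        old = path + ".old"
        os.replace(path, old)      # old inode stays alive under a new name,
                                   # so its NFS filehandle remains valid
    os.replace(tmp, path)          # atomically expose the new version
    return old

def retire(old):
    """Call once clients no longer hold the old file open."""
    if old is not None:
        os.remove(old)
```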

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>




* Re: Stale NFS file handle
  2005-03-23 13:57   ` Trond Myklebust
@ 2005-03-23 17:15     ` Filipe Brandenburger
  2005-03-23 17:26       ` Trond Myklebust
  0 siblings, 1 reply; 22+ messages in thread
From: Filipe Brandenburger @ 2005-03-23 17:15 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Steve Dickson, nfs


* Wed, 23 Mar 2005 08:57:15 -0500, Trond Myklebust <trond.myklebust@fys.uio.no>:
> He was running
> 
> while :; do cat test.txt; done >/dev/null
> 
> on a client, then deleting the file on the server. Even if the call to
> open() is successful, you both can and will get ESTALEs on the
> subsequent call to read().

Ok,

But then, how do you suggest I should change applications to do it? The
applications that publish content to the NFS run on one host and are
based on rsync, the applications that deliver content are web servers 
(Apache) reading from this same NFS on another pool of hosts (these are
the ones that get the ESTALE error).

Where is the problem? On the applications that publish? Should they open
the file and update it in-place instead of creating a new one and
renaming? I don't think so! This would lead to content that is a mix of
the old and the new, that is corrupt.

Or is the webserver? Should the application protect itself from ESTALE
errors and retry? Somehow that seems wrong to me also. Then I would have
to change all applications that read this content to do it. Why doesn't
the NFS client recover from this kind of errors?

If it's really not possible to change it on the NFS client (the kernel),
what workaround would you suggest me to use?

Thanks,
Filipe




* RE: Stale NFS file handle
@ 2005-03-23 14:42 Lever, Charles
  0 siblings, 0 replies; 22+ messages in thread
From: Lever, Charles @ 2005-03-23 14:42 UTC (permalink / raw)
  To: Filipe Brandenburger; +Cc: nfs

i saw trond's post.  he is correct, your use case is broken.  the
patches will fix ESTALE for open(2), but not for read(2).  i don't know
of any NFS implementation that will recover from an ESTALE on a read
operation.

you need to understand that NFS is not a cluster file system.  it does
not provide single-system semantics.  for a better understanding of the
limitations of NFS's caching model, take a look at Callaghan's "NFS
Illustrated."

> -----Original Message-----
> From: Filipe Brandenburger [mailto:branden@terra.com.br]
> Sent: Wednesday, March 23, 2005 9:35 AM
> To: Lever, Charles
> Subject: Re: [NFS] Stale NFS file handle
> 
> Hi, there.
> 
> Thanks for your answer. Do you know of such a patch that would solve
> this issue at the Linux Kernel level? I'm using kernel 2.4, do you know
> if 2.6 is any better on that? Do you know if other client
> implementations actually recover from these errors? I googled around
> and found out that this may be an issue on Solaris as well...
> 
> Thanks,
> Filipe
> 
> * Wed, 23 Mar 2005 05:53:39 -0800, "Lever, Charles"
> <Charles.Lever@netapp.com>:
> > when you replaced the file, client 2 still had the old file handle
> > cached.  when it used that old file handle again, the server reported
> > the file no longer existed with an ESTALE error.
> > 
> > the problem is that Linux NFS clients don't recover from ESTALE errors.
> > it's a deficiency in the client implementation that, at this point, is
> > fixed only by patches.  at some point soon the patches will be
> > integrated into the mainline and distributions.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Stale NFS file handle
  2005-03-23 13:12 ` Steve Dickson
@ 2005-03-23 13:57   ` Trond Myklebust
  2005-03-23 17:15     ` Filipe Brandenburger
  0 siblings, 1 reply; 22+ messages in thread
From: Trond Myklebust @ 2005-03-23 13:57 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Filipe Brandenburger, nfs

On Wednesday 23.03.2005 at 08:12 (-0500), Steve Dickson wrote:
> Filipe Brandenburger wrote:
> > I have a problem where I'm getting "Stale NFS file handle" errors when a
> > file is updated. I can easily reproduce the problem if I run a sequence
> > of commands in two different hosts.
> > 
> > My environment is:
> > 
> > 1) Server: Netapp FAS940
> > 2) Client 1: Linux RedHat 9 with kernel 2.4.21-4.ELsmp (kernel of RHAS3)
> > 3) Client 2: exactly the same as client 1.
> Some time back there was a Netapp issue that was causing
> ESTALEs with mostly 2.4 kernels... I upgraded the OS on
> our toaster and the problem went away....

... yeah, but this was a pretty obvious case of user error.

He was running

while :; do cat test.txt; done >/dev/null

on a client, then deleting the file on the server. Even if the call to
open() is successful, you both can and will get ESTALEs on the
subsequent call to read().

Cheers,
  Trond

-- 
Trond Myklebust <trond.myklebust@fys.uio.no>




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Stale NFS file handle
  2005-03-23  0:19 Filipe Brandenburger
@ 2005-03-23 13:12 ` Steve Dickson
  2005-03-23 13:57   ` Trond Myklebust
  0 siblings, 1 reply; 22+ messages in thread
From: Steve Dickson @ 2005-03-23 13:12 UTC (permalink / raw)
  To: Filipe Brandenburger; +Cc: nfs

Filipe Brandenburger wrote:
> I have a problem where I'm getting "Stale NFS file handle" errors when a
> file is updated. I can easily reproduce the problem if I run a sequence
> of commands in two different hosts.
> 
> My environment is:
> 
> 1) Server: Netapp FAS940
> 2) Client 1: Linux RedHat 9 with kernel 2.4.21-4.ELsmp (kernel of RHAS3)
> 3) Client 2: exactly the same as client 1.
Some time back there was a Netapp issue that was causing
ESTALEs with mostly 2.4 kernels... I upgraded the OS on
our toaster and the problem went away....

steved.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Stale NFS file handle
@ 2005-03-23  0:19 Filipe Brandenburger
  2005-03-23 13:12 ` Steve Dickson
  0 siblings, 1 reply; 22+ messages in thread
From: Filipe Brandenburger @ 2005-03-23  0:19 UTC (permalink / raw)
  To: nfs


Hello,

I have a problem where I'm getting "Stale NFS file handle" errors when a
file is updated. I can easily reproduce the problem if I run a sequence
of commands in two different hosts.

My environment is:

1) Server: Netapp FAS940
2) Client 1: Linux RedHat 9 with kernel 2.4.21-4.ELsmp (kernel of RHAS3)
3) Client 2: exactly the same as client 1.

The file system is mounted on both clients with the options
rsize=8192,wsize=8192,timeo=28,intr; additionally, it's mounted read-only
on client 2 (it also gives me a stale file handle if it's mounted
read-write, so that doesn't really matter).

My test setup is:

On client 2, I setup a loop to read a file:

# while :; do cat test.txt; done >/dev/null

Then, on client 1, I create a new file and rename it over the original
file:

# date >new.txt; mv -f new.txt test.txt

Whenever I execute this on client 1, I get the following error message
on client 2:

cat: test.txt: Stale NFS file handle



Why is this happening? Is there a way to fix this problem? I tried the
mount options "noac" and "nocto" on client 2, and used "mount -o remount"
on it; after that the output of "mount" showed these options, but it
didn't solve the issue.

Is there a way to solve this issue without changing the applications that
access this file? Although my test environment consists of only "cat"
and "mv", my real production environment runs proprietary applications
that are harder to fix; "cat" and "mv" were just the way I reproduced
the problem in a controlled environment...

Thanks a lot,
Filipe




^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-01-03 19:41 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-13 23:32 Stale NFS file handle Székelyi Szabolcs
2012-02-13 23:34 ` Sage Weil
2012-02-13 23:51   ` Székelyi Szabolcs
2012-02-13 23:54     ` Sage Weil
2012-02-14  0:51       ` Székelyi Szabolcs
2012-02-23 18:43         ` Tommi Virtanen
2012-02-24 12:25           ` Székelyi Szabolcs
2012-02-14  1:04 ` Tommi Virtanen
2012-02-14 13:20   ` Székelyi Szabolcs
  -- strict thread matches above, loose matches on Subject: below --
2016-12-24  9:48 Xen
2017-01-03 19:41 ` J. Bruce Fields
2006-02-03 18:05 Stale NFS File Handle Brian D. McGrew
2006-02-03 19:09 ` Trond Myklebust
2006-02-03 19:28   ` Roger Heflin
2006-02-03 19:24 ` Roger Heflin
2005-03-23 18:59 Stale NFS file handle Lever, Charles
2005-03-23 14:42 Lever, Charles
2005-03-23  0:19 Filipe Brandenburger
2005-03-23 13:12 ` Steve Dickson
2005-03-23 13:57   ` Trond Myklebust
2005-03-23 17:15     ` Filipe Brandenburger
2005-03-23 17:26       ` Trond Myklebust

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.