linux-kernel.vger.kernel.org archive mirror
* Re: NFS client locking hangs for period
@ 2003-04-25  4:57 Christian Reis
  0 siblings, 0 replies; 7+ messages in thread
From: Christian Reis @ 2003-04-25  4:57 UTC (permalink / raw)
  To: NFS; +Cc: linux-kernel


Well, since I've more or less moved on from my original problems, I
should probably post a summary of what was going on, and what I did to
work around it.

Details can be found in [1]: after a certain amount of time, a
number of diskless clients, all mounting everything from the same
NFS server, started seeing their lock requests to the server hang. The
server ran 2.4.20, with reiserfs over RAID-1 on 2 SCSI disks attached to
an Adaptec 29160. The clients were Debian woody boxes running 2.4.20.

Our diskless setup is a bit unusual: all the clients mount the same root
partition. I tried to be very careful to make sure no files were written
to on /, but I never got to the point where the clients could mount the
directory read-only. I used devfs to make sure that the /dev directories
were `localized' and syslog/console ownership and permissions kept sane.

The locking problem, however, was not related to the root filesystem --
it seems to have happened with files on the /var/log mount, which is
separate for each box (but still coming from a shared filesystem
/export/root on the server, which contains all the client directories).
If I mounted /var/log with the nolock option, they ran fine. This took
me a very long time to figure out, and I'd advise anyone with locking
problems to give it a go.
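
For anyone who wants to check which NFS mounts still request locking,
here is a minimal sketch in Python. It only reads fstab-style text and
reports whether "nolock" appears in the options. The function name is
made up for illustration, and the /var/log entry in the sample is
hypothetical (the thread never shows the real line). Note that it says
nothing about kernel defaults such as the nolock default for NFS root.

```python
def nfs_lock_report(fstab_text):
    """Return {mount_point: uses_locking} for each NFS entry in fstab-style text."""
    report = {}
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        # fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue
        options = fields[3].split(",")
        report[fields[1]] = "nolock" not in options
    return report


# The first two entries are the ones posted in this thread;
# the /var/log entry is a hypothetical example of the workaround.
SAMPLE = """\
anthem:/export/root/    /   nfs defaults,rw,rsize=8192,wsize=8192,nfsvers=2 0 0
anthem:/home    /home   nfs defaults,rw,rsize=8192,wsize=8192,nfsvers=3 0 0
anthem:/export/root/var/log/client1  /var/log  nfs defaults,rw,nolock 0 0
"""

if __name__ == "__main__":
    for mount_point, locking in nfs_lock_report(SAMPLE).items():
        print(mount_point, "locking" if locking else "nolock")
```

Pointing it at the contents of a client's /etc/fstab should list every
NFS mount that would still go through the lock manager.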

I should point out that this *does* seem to be a bug in the NFS server
code. I think it is associated with reiserfs, since I haven't seen
it happen on other partition types. Rebooting the server cleared up the
problem. Erasing or changing files in /var/lib/nfs did not. While I was
initially using a volatile /var/lib/nfs directory on the *clients*, I
changed this on Trond's suggestion [2]. It did not fix the problem.

However, since I know little about the code itself, and it's not very
clear how one should debug, I was unable to pinpoint the exact source of
the problem, which very much saddens me.  The workaround, however, was
quite effective.

[1] http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&th=9db70994c3458f46&rnum=1
[2] http://groups.google.com/groups?q=christian+reis+nfs+locking&hl=en&lr=&ie=UTF-8&scoring=d&selm=20030126231006%246e11%40gated-at.bofh.it&rnum=3

Take care,
--
Christian Reis, Senior Engineer, Async Open Source, Brazil.
http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL


* Re: NFS client locking hangs for period
  2003-01-28  8:00     ` Denis Vlasenko
  2003-01-28 16:44       ` Christian Reis
@ 2003-01-29 21:53       ` Daniel Egger
  1 sibling, 0 replies; 7+ messages in thread
From: Daniel Egger @ 2003-01-29 21:53 UTC (permalink / raw)
  To: vda; +Cc: linux-kernel, NFS

[-- Attachment #1: Type: text/plain, Size: 695 bytes --]

On Tue, 2003-01-28 at 09:00, Denis Vlasenko wrote:

> It was not really *that* difficult for me. I used devfs and symlinks.
> /etc, /var, /tmp are different directories per client,
> /home, /usr are shared. The rest stays on root fs readonly.
> ssh to NFS server if you want to modify some files on root fs.

This will only work dandy if the server runs the same OS on the 
same architecture and its own system is well enough equipped to
do software installations and bootstraps. Although I'm using Linux on
my server as well as the same architecture as most of the clients
I sometimes experience trouble working in the chrooted client
environment.

-- 
Servus,
       Daniel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: NFS client locking hangs for period
  2003-01-28  8:00     ` Denis Vlasenko
@ 2003-01-28 16:44       ` Christian Reis
  2003-01-29 21:53       ` Daniel Egger
  1 sibling, 0 replies; 7+ messages in thread
From: Christian Reis @ 2003-01-28 16:44 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Neil Brown, linux-kernel, NFS

On Tue, Jan 28, 2003 at 10:00:05AM +0200, Denis Vlasenko wrote:
> > Well, mounting root read-only is a good idea but it sacrifices being
> > able to administer the system from any station, and it also puts a
> > lot of burden on me to fix *all* programs to not write to anywhere on
> > it. This shouldn't be too hard, but we're still just working around
> > the bug, which I would really like to identify and fix.
> 
> It was not really *that* difficult for me. I used devfs and symlinks.
> /etc, /var, /tmp are different directories per client,
> /home, /usr are shared. The rest stays on root fs readonly.
> ssh to NFS server if you want to modify some files on root fs.
> 
> Separate etc/var/tmp files for each client = no concurrent rw access.

I agree it is a lot simpler; however, you have to give up the ability to
install and upgrade system software seamlessly. When Debian reports a
security issue, all I do is apt-get -u upgrade and skim through it - all
boxes are magically updated. No need to update the individual /etc files
for the changes, and no messy links either.

It does require that you take care, though. The most important issue is
finding out which files are written to in these directories (in violation
of the LFS/FHS, I must say). The current culprit I am after is
/sbin/init, which writes to /etc/ioctl.save (why, I wonder). After a lot
of cleanup, I've managed to pare this down to the minimum, and I'm going
after some of the last culprits now.
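
One way to hunt for such writers is to look for files whose
modification time changed after a known point, for example after a
client finished booting. The Python sketch below is only an
illustration of that idea, not the method actually used here; the
function names are made up. It stays on the filesystem that holds the
starting directory, so separately mounted trees are skipped.

```python
import os
import time


def _on_device(path, dev):
    """True if path exists and lives on the filesystem with device id dev."""
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False


def written_since(root, since_epoch):
    """Yield files under root whose mtime is newer than since_epoch."""
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that sit on another filesystem (other mounts).
        dirnames[:] = [d for d in dirnames
                       if _on_device(os.path.join(dirpath, d), root_dev)]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime > since_epoch:
                    yield path
            except OSError:
                continue  # file vanished between listing and stat


if __name__ == "__main__":
    one_hour_ago = time.time() - 3600
    for path in written_since("/etc", one_hour_ago):
        print(path)
```

Run against /etc on the server shortly after a client boots, a file
like /etc/ioctl.save would be expected to show up in the output.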

> File locking over the network is hard to do reliably.
> I have no experience with that in NFS, but presume there
> can be problems in some situations (statd or portmap
> crashed on a client, client hung/disconnected from the net,
> etc etc etc...)
> 
> Anyway, such corner cases are painful, thank you for
> your efforts to nail it down.

It seems Trond has given us the answer to the problem: the persistence
of /var/lib/nfs seems to be essential to a healthy diskless client. One
of our co-workers who was an expert at triggering the problems is at the
beach this week, so I can't tell for sure, but next Tuesday or so I hope
to post to NFS-list with [SUMMARY] in the Subject line <wink>

Take care,
--
Christian Reis, Senior Engineer, Async Open Source, Brazil.
http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL


* Re: NFS client locking hangs for period
  2003-01-26 16:02   ` Christian Reis
@ 2003-01-28  8:00     ` Denis Vlasenko
  2003-01-28 16:44       ` Christian Reis
  2003-01-29 21:53       ` Daniel Egger
  0 siblings, 2 replies; 7+ messages in thread
From: Denis Vlasenko @ 2003-01-28  8:00 UTC (permalink / raw)
  To: Christian Reis, Neil Brown; +Cc: linux-kernel, NFS

On 26 January 2003 18:02, Christian Reis wrote:
> On Sat, Jan 25, 2003 at 02:54:09PM +1100, Neil Brown wrote:
> > Hmmm.  So you have several clients all mounting the same root
> > filesystem, and mounting it writable?  That doesn't sound like a
> > plan for success.  How do you make sure the clients don't tread
> > over each other when using /etc files?
>
> The truth is few (broken wrt the FHS) programs actually write to
> /etc. I have set up everything so nothing is written to in /etc, and
> it actually works very well (have to use a special init(8) that
> doesn't write to /etc/ioctl.save). This setup has been running for
> almost a year now, with the locking problem being the only one left
> to fix.

My root fs is RO. Works wonders. Clients simply CANNOT trash their
/bin, /lib etc ;)

> > I suspect that what you really want is to mount root read-only, or
> > mount separate roots for each client, and then in either case to
> > mount with the "nolock" flag.
>
> Well, mounting root read-only is a good idea but it sacrifices being
> able to administer the system from any station, and it also puts a
> lot of burden on me to fix *all* programs to not write to anywhere on
> it. This shouldn't be too hard, but we're still just working around
> the bug, which I would really like to identify and fix.

It was not really *that* difficult for me. I used devfs and symlinks.
/etc, /var, /tmp are different directories per client,
/home, /usr are shared. The rest stays on root fs readonly.
ssh to NFS server if you want to modify some files on root fs.

Separate etc/var/tmp files for each client = no concurrent rw access.

> > I suspect that your problem is related to the client trying to do
> > locking, but not having statd running on the client.
>
> I am 100% positive statd runs on every single client. This problem
> here only happens spuriously.  It goes away when I restart nfsd and
> mountd (in that order). It really does look like a bug <wink>

File locking over the network is hard to do reliably.
I have no experience with that in NFS, but presume there
can be problems in some situations (statd or portmap
crashed on a client, client hung/disconnected from the net,
etc etc etc...)

Anyway, such corner cases are painful, thank you for
your efforts to nail it down.
--
vda


* Re: NFS client locking hangs for period
  2003-01-25  3:54 ` Neil Brown
@ 2003-01-26 16:02   ` Christian Reis
  2003-01-28  8:00     ` Denis Vlasenko
  0 siblings, 1 reply; 7+ messages in thread
From: Christian Reis @ 2003-01-26 16:02 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-kernel, NFS

On Sat, Jan 25, 2003 at 02:54:09PM +1100, Neil Brown wrote:
> Hmmm.  So you have several clients all mounting the same root
> filesystem, and mounting it writable?  That doesn't sound like a plan
> for success.  How do you make sure the clients don't tread over each
> other when using /etc files?

The truth is few (broken wrt the FHS) programs actually write to /etc. I
have set up everything so nothing is written to in /etc, and it actually
works very well (have to use a special init(8) that doesn't write to
/etc/ioctl.save). This setup has been running for almost a year now,
with the locking problem being the only one left to fix.

> I suspect that what you really want is to mount root read-only, or
> mount separate roots for each client, and then in either case to mount
> with the "nolock" flag.

Well, mounting root read-only is a good idea but it sacrifices being
able to administer the system from any station, and it also puts a lot
of burden on me to fix *all* programs to not write to anywhere on it.
This shouldn't be too hard, but we're still just working around the bug,
which I would really like to identify and fix.

> I suspect that your problem is related to the client trying to do
> locking, but not having statd running on the client.

I am 100% positive statd runs on every single client. This problem here
only happens spuriously.  It goes away when I restart nfsd and mountd
(in that order). It really does look like a bug <wink>

> You cannot meaningfully do locking on an NFS mounted root filesystem.
> In fact, I think it would be good if the default mount options for nfs
> root included nolock... and if I read fs/nfs/nfsroot.c:root_nfs_name
> correctly, nolock is the default.  Are you overriding that default
> by explicitly setting "lock"??

Nope. I've just tested and the default (specifying no lock option upon
bootup) really is nolock:

/dev/root on / type nfs (rw,v3,rsize=8192,wsize=8192,hard,udp,nolock,addr=192.168.99.4)

I wonder why you can't do locking on NFS root (if it's a current
limitation or if it doesn't make sense).

But I also think this problem shouldn't be happening if no locking was
going on. And when I checked using nlm_debug it sure did seem locking
was being used. What do you make of it?
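
A quick way to see whether lock requests on a given mount go through
at all is to try one with a time limit. The Python sketch below is a
rough probe under stated assumptions, not a replacement for nlm_debug:
it takes and releases an exclusive lock in a child process and reports
whether that finished in time. The names are made up, and on a hard
mount a stuck child may ignore the kill, so it is left behind rather
than waited on.

```python
import subprocess
import sys

# Code run in the child: take an exclusive lock on the file, then release it.
_CHILD = (
    "import fcntl, sys\n"
    "with open(sys.argv[1], 'a') as f:\n"
    "    fcntl.lockf(f, fcntl.LOCK_EX)\n"
    "    fcntl.lockf(f, fcntl.LOCK_UN)\n"
)


def lock_completes(path, timeout=10.0):
    """Return True if a lock/unlock cycle on path finishes within timeout seconds."""
    child = subprocess.Popen([sys.executable, "-c", _CHILD, path])
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a task blocked on a hard NFS mount may not die,
        # so do not wait for it again here.
        child.kill()
        return False


if __name__ == "__main__":
    target = sys.argv[1]
    print(target, "ok" if lock_completes(target) else "lock request hung or failed")
```

A False result on a mount without nolock, while the same test on a
local file returns True, would point at the lock manager path rather
than at the application.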

Take care,
--
Christian Reis, Senior Engineer, Async Open Source, Brazil.
http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL


* Re: NFS client locking hangs for period
  2003-01-24 20:49 Christian Reis
@ 2003-01-25  3:54 ` Neil Brown
  2003-01-26 16:02   ` Christian Reis
  0 siblings, 1 reply; 7+ messages in thread
From: Neil Brown @ 2003-01-25  3:54 UTC (permalink / raw)
  To: Christian Reis; +Cc: linux-kernel, NFS

On Friday January 24, kiko@async.com.br wrote:
> 
> Hello Neil,

Hi.

> 
> I've been trying to get at this problem for a while now, ....

> 
> It seems to be reproducible by having the client hang or reboot without
> shutting down properly. Another tip is that the server gets files left
> over in /var/lib/nfs/sm/ for the hanging client(s). 

> 
> Mount options follow for the client filesystems:
> 
> anthem:/export/root/    /   nfs defaults,rw,rsize=8192,wsize=8192,nfsvers=2 0 0
> anthem:/home    /home   nfs defaults,rw,rsize=8192,wsize=8192,nfsvers=3 0 0
> 

Hmmm.  So you have several clients all mounting the same root
filesystem, and mounting it writable?  That doesn't sound like a plan
for success.  How do you make sure the clients don't tread over each
other when using /etc files?

I suspect that what you really want is to mount root read-only, or
mount separate roots for each client, and then in either case to mount
with the "nolock" flag.

I suspect that your problem is related to the client trying to do
locking, but not having statd running on the client.
You cannot meaningfully do locking on an NFS mounted root filesystem.
In fact, I think it would be good if the default mount options for nfs
root included nolock... and if I read fs/nfs/nfsroot.c:root_nfs_name
correctly, nolock is the default.  Are you overriding that default
by explicitly setting "lock"??

NeilBrown


* NFS client locking hangs for period
@ 2003-01-24 20:49 Christian Reis
  2003-01-25  3:54 ` Neil Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Christian Reis @ 2003-01-24 20:49 UTC (permalink / raw)
  To: neilb; +Cc: linux-kernel, NFS


Hello Neil,

I've been trying to get at this problem for a while now, and had been
concentrating on the client-side of the problem (and consequently
bothering Trond about it) [1,2]. I am now pretty much convinced this is a
server-side problem, and as I've patched 2.4.20 with all the NFS patches
pending (that didn't have to do with the kernel lock breaking) and still
see the issue, I decided to report this bug.

The scenario is: a set of NFS clients with root mounted over nfs from a
single server. Clients run vanilla 2.4.20, server runs 2.4.20 patched
with your server-side patches I mentioned above. The clients run okay
for a period, and then one of them will start to hang for long periods
of time for certain operations (it happens on startup and shutdown, for
instance). Once the client hangs start, the server needs to be rebooted
for it to clear up.

It seems to be reproducible by having the client hang or reboot without
shutting down properly. Another tip is that the server gets files left
over in /var/lib/nfs/sm/ for the hanging client(s). 
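
To spot such leftovers, the directory listing can be compared against
the clients known to be up. The Python sketch below assumes that statd
keeps one entry per monitored peer in the sm directory, named after
that peer; that naming is an assumption about the setup, and the
function name and host names are hypothetical.

```python
import os


def leftover_monitor_entries(sm_dir, live_hosts):
    """Return entries in sm_dir that match no host in live_hosts (case-insensitive)."""
    live = {host.lower() for host in live_hosts}
    return sorted(name for name in os.listdir(sm_dir)
                  if name.lower() not in live)


if __name__ == "__main__":
    # Hypothetical client names; replace with the hosts currently running.
    running = ["client1", "client2"]
    for name in leftover_monitor_entries("/var/lib/nfs/sm", running):
        print("possibly stale monitor entry:", name)
```

An entry for a client that hung or was rebooted without a clean
shutdown is the kind of leftover described above.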

I've been trying to track this down for a while, but since I'm not very
proficient with debugging at this level, I haven't had much luck. It's
really a problem because I need to reboot and make 20 people stop
working when the problem gets serious. Trond has lent a hand trying
to help me, but we still haven't uncovered anything. I wonder if you
have any clue what could be happening?

The other details are standard: the clients are debian woodys with
nfs-utils 1.0.1 installed, and the server has the same version. The
server runs reiserfs over RAID-1 partitions (using the kernel md
driver). Could it be triggered because of this perhaps unusual
combination?

Some of the messages I point out below have some info about the issue -
including tcpdumps and traces of nlm_debug on the server and client.

Mount options follow for the client filesystems:

anthem:/export/root/    /   nfs defaults,rw,rsize=8192,wsize=8192,nfsvers=2 0 0
anthem:/home    /home   nfs defaults,rw,rsize=8192,wsize=8192,nfsvers=3 0 0

I have checked and, yes, root is mounted using version 2 and the rest as
version 3. Perhaps I should try getting the kernel to mount root using
version 3?

[1] http://groups.google.com/groups?q=trond+christian+nfs&hl=pt&lr=&ie=UTF-8&client=googlet&scoring=d&selm=20030108151424.N2628%40blackjesus.async.com.br.lucky.linux.kernel&rnum=1
[2] http://groups.google.com/groups?hl=pt&lr=&ie=UTF-8&client=googlet&th=3575b3c5f3360eb0&seekm=20030108151424.N2628%40blackjesus.async.com.br.lucky.linux.kernel&frame=off

Thanks for any help you can give.

Take care,
--
Christian Reis, Senior Engineer, Async Open Source, Brazil.
http://async.com.br/~kiko/ | [+55 16] 261 2331 | NMFL

