linux-kernel.vger.kernel.org archive mirror
* Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
@ 2005-05-14 13:18 Justin Piszcz
  2005-05-17 14:47 ` Alan Cox
  2005-06-16  9:19 ` Michael Heyse
  0 siblings, 2 replies; 11+ messages in thread
From: Justin Piszcz @ 2005-05-14 13:18 UTC
  To: linux-kernel

When I run the following command over NFS:
dd if=/dev/hde of=/remote/disk5/file.img bs=1M

After about 30-60 seconds, it kills the remote machine.

I cannot ping the machine, nor can I wake up the monitor to see what 
happened.

The mount options I am using are:
rw,hard,intr,rsize=65536,wsize=65536,nfsvers=3 0 0

1] /dev/hde is on a Promise controller on an Abit IC7-G
2] /remote/disk5 is on a Promise controller on another Abit IC7-G

Both filesystems are XFS.
The network interface is 1000 Mbps full duplex.

Here is a log of what happens; the packet loss begins when dd starts moving
the bits over to the other box.

# The following is right before I ran the dd command.
PING routerbox (192.168.0.1) 56(84) bytes of data.
64 bytes from routerbox (192.168.0.1): icmp_seq=1 ttl=64 time=0.214 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=2 ttl=64 time=0.154 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=3 ttl=64 time=0.157 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=4 ttl=64 time=0.171 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=5 ttl=64 time=0.179 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=6 ttl=64 time=0.191 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=7 ttl=64 time=0.179 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=8 ttl=64 time=0.212 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=9 ttl=64 time=0.133 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=10 ttl=64 time=0.228 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=11 ttl=64 time=0.119 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=12 ttl=64 time=0.129 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=13 ttl=64 time=0.260 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=14 ttl=64 time=0.264 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=15 ttl=64 time=0.280 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=16 ttl=64 time=0.276 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=17 ttl=64 time=0.181 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=18 ttl=64 time=0.188 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=19 ttl=64 time=3.17 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=20 ttl=64 time=2.44 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=36 ttl=64 time=0.250 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=37 ttl=64 time=0.256 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=38 ttl=64 time=0.204 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=39 ttl=64 time=0.281 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=40 ttl=64 time=0.170 ms
From mybox (192.168.0.12) icmp_seq=74 Destination Host Unreachable
From mybox (192.168.0.12) icmp_seq=75 Destination Host Unreachable
From mybox (192.168.0.12) icmp_seq=76 Destination Host Unreachable

--- routerbox ping statistics ---
79 packets transmitted, 25 received, +3 errors, 68% packet loss, time 77984ms
rtt min/avg/max/mdev = 0.119/0.411/3.170/0.715 ms, pipe 3
mybox@mybox:~$
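
As a sanity check on the summary above, the 68% figure follows directly from
the transmitted and received counts. A minimal Python sketch; the function
name is made up for illustration, and truncating to a whole percent is an
assumption about how this version of ping rounds:

```python
def packet_loss_percent(transmitted: int, received: int) -> int:
    """Share of probes that got no reply, truncated to a whole percent."""
    if transmitted <= 0:
        raise ValueError("transmitted must be positive")
    return (transmitted - received) * 100 // transmitted


# Counts copied from the two ping summaries in this message.
during_dd = packet_loss_percent(79, 25)   # while dd was running
normal = packet_loss_percent(77, 77)      # normal operation, no dd
print(during_dd, normal)
```

The 25 replies match the log: icmp_seq 1-20 and 36-40 were answered, and
everything else in the 79-probe run was lost.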


Oh, and in case anyone suspects a network issue, there is none: during
normal operation, when I am not running dd, there are no network
problems, as shown below.

77 packets transmitted, 77 received, 0% packet loss, time 76003ms
rtt min/avg/max/mdev = 0.106/0.226/0.478/0.068 ms








* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
  2005-05-14 13:18 Reproducible 2.6.11.9 NFS Kernel Crashing Bug! Justin Piszcz
@ 2005-05-17 14:47 ` Alan Cox
  2005-05-18  0:35   ` Justin Piszcz
  2005-05-18  0:36   ` Justin Piszcz
  2005-06-16  9:19 ` Michael Heyse
  1 sibling, 2 replies; 11+ messages in thread
From: Alan Cox @ 2005-05-17 14:47 UTC
  To: Justin Piszcz; +Cc: Linux Kernel Mailing List

On Sad, 2005-05-14 at 14:18, Justin Piszcz wrote:
> The mount options I am using are:
> rw,hard,intr,rsize=65536,wsize=65536,nfsvers=3 0 0

These are rather extreme r/wsizes especially if you are using UDP - I'm
assuming this is TCP ?

> Oh, and in case anyone suspects a network issue, there is none: during
> normal operation, when I am not running dd, there are no network
> problems, as shown below.

I would certainly expect it to be a memory issue. Does it occur with
8192 as the size ?



* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
  2005-05-17 14:47 ` Alan Cox
@ 2005-05-18  0:35   ` Justin Piszcz
  2005-05-18  0:36   ` Justin Piszcz
  1 sibling, 0 replies; 11+ messages in thread
From: Justin Piszcz @ 2005-05-18  0:35 UTC
  To: Alan Cox; +Cc: Linux Kernel Mailing List

Alan,

It also fails with 8192; it totally crashes the box!
Within seconds (3-8 seconds), the remote host is dead, even with 8192.

I get an "IRQ #18, nobody cared!" message.
The crash dump has something to do with e1000x and ata/x in it,
and the kernel ends up disabling interrupt 18.

box:~# df -h | grep /x5
box:~# mount -a
box:~# df -h | grep /x5
p500:/d5/x5           234G   45G  189G  20% /p500/x5
box:~# grep /x5 /etc/fstab
#p500:/d5/x5      /p500/x5         nfs rw,hard,intr,rsize=65536,wsize=65536,nfsvers=3 0 0
p500:/d5/x5      /p500/x5         nfs rw,hard,intr,rsize=8192,wsize=8192,nfsvers=3 0 0
box:~#
box:~# mount | grep x5
p500:/d5/x5 on /p500/x5 type nfs (rw,hard,intr,rsize=8192,wsize=8192,nfsvers=3,addr=192.168.0.253)
box:~# dd if=/dev/hde of=/p500/x5/file.img bs=1M

And Al,

This is going to sound absolutely crazy, but my box would *NOT* come back
up after the dd.  It gets to the point where it initializes the network, but
the machine that did the dd must be sending some kind of NASTY packet that
KILLS the kernel: as soon as it initializes eth0, BAM, it freezes at the
console during boot.  The fix is to shut off the machine that did the dd or
disconnect the network cable, and then voila, it comes back up.  Also, the
machine that did the dd was SERIOUSLY lagged, almost unusable.

Any ideas?

I'd prefer not to repeat this problem again, thanks!

Justin.



On Tue, 17 May 2005, Alan Cox wrote:

> On Sad, 2005-05-14 at 14:18, Justin Piszcz wrote:
>> The mount options I am using are:
>> rw,hard,intr,rsize=65536,wsize=65536,nfsvers=3 0 0
>
> These are rather extreme r/wsizes especially if you are using UDP - I'm
> assuming this is TCP ?
>
>> Oh, and in case anyone suspects a network issue, there is none: during
>> normal operation, when I am not running dd, there are no network
>> problems, as shown below.
>
> I would certainly expect it to be a memory issue. Does it occur with
> 8192 as the size ?
>


* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
  2005-05-17 14:47 ` Alan Cox
  2005-05-18  0:35   ` Justin Piszcz
@ 2005-05-18  0:36   ` Justin Piszcz
  2005-05-18 12:34     ` Peter Staubach
  1 sibling, 1 reply; 11+ messages in thread
From: Justin Piszcz @ 2005-05-18  0:36 UTC
  To: Alan Cox; +Cc: Linux Kernel Mailing List

And I am using UDP, not TCP.

NFS Version 3.

On Tue, 17 May 2005, Alan Cox wrote:

> On Sad, 2005-05-14 at 14:18, Justin Piszcz wrote:
>> The mount options I am using are:
>> rw,hard,intr,rsize=65536,wsize=65536,nfsvers=3 0 0
>
> These are rather extreme r/wsizes especially if you are using UDP - I'm
> assuming this is TCP ?
>
>> Oh, and in case anyone suspects a network issue, there is none: during
>> normal operation, when I am not running dd, there are no network
>> problems, as shown below.
>
> I would certainly expect it to be a memory issue. Does it occur with
> 8192 as the size ?
>


* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
  2005-05-18  0:36   ` Justin Piszcz
@ 2005-05-18 12:34     ` Peter Staubach
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Staubach @ 2005-05-18 12:34 UTC
  To: Justin Piszcz; +Cc: Alan Cox, Linux Kernel Mailing List

Justin Piszcz wrote:

> And I am using UDP, not TCP.
>
> NFS Version 3.


You may be able to specify an rsize and wsize of 65536 with NFS Version 3
running over UDP, but it is guaranteed not to work if either the client or
the server attempts a 64K transfer.

The problem is that UDP is limited to a 64K datagram.  This datagram must
hold the data, some NFS protocol data structures, and some RPC data
structures, which together exceed the 64K limit.  RPC over UDP will not
allow the use of multiple UDP datagrams, so RPC over UDP is limited to
payloads of less than 64K.  RPC over TCP will allow larger operations
because there is no such single datagram limit.
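
The arithmetic behind that limit can be sketched in a few lines of Python.
The IPv4 and UDP header sizes are the standard minimums; the RPC/NFS overhead
figure is a hypothetical placeholder (the real size varies per call), and the
function name is invented for this illustration:

```python
IPV4_MAX_PACKET = 65535  # the IPv4 total-length field is 16 bits
IPV4_HEADER = 20         # minimum IPv4 header, no options
UDP_HEADER = 8

# Largest payload a single UDP datagram can carry over IPv4.
MAX_UDP_PAYLOAD = IPV4_MAX_PACKET - IPV4_HEADER - UDP_HEADER

# ASSUMPTION: placeholder allowance for the RPC and NFS headers.
ASSUMED_RPC_NFS_OVERHEAD = 256


def fits_in_one_datagram(transfer_size: int,
                         overhead: int = ASSUMED_RPC_NFS_OVERHEAD) -> bool:
    """True if the data plus protocol headers fit in one UDP datagram."""
    return transfer_size + overhead <= MAX_UDP_PAYLOAD


for size in (8192, 32768, 56 * 1024, 60 * 1024, 65536):
    print(size, fits_in_one_datagram(size))
```

Note that a 65536-byte transfer does not fit even with the overhead set to
zero, whereas 32K, 56K and 60K all leave room for the headers.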

You could specify 56K or 60K transfer sizes if you wanted to stay at a
multiple of 8K or 4K, but there doesn't seem to be much point.  The 32K
number was chosen because it was the largest power of 2 below 64K and seems
to work pretty well in most circumstances.

In general, I wouldn't recommend mucking with the read/write transfer sizes
unless you really know what you are doing and understand the target
environment very well.

       ps


* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
  2005-05-14 13:18 Reproducible 2.6.11.9 NFS Kernel Crashing Bug! Justin Piszcz
  2005-05-17 14:47 ` Alan Cox
@ 2005-06-16  9:19 ` Michael Heyse
  2005-06-16  9:24   ` Justin Piszcz
  1 sibling, 1 reply; 11+ messages in thread
From: Michael Heyse @ 2005-06-16  9:19 UTC
  To: Justin Piszcz; +Cc: linux-kernel

Hi Justin and others,

did you manage to resolve this problem? I'm also experiencing apparently NFS-related crashes (kernel
hangs after a couple of seconds up to minutes, no syslog entries, nothing at all works any more)
using 2.6.11.10 and NFS V3 over TCP, standard r/wsizes, ext3 on a RAID5 array. Is this possibly
arch- or otherwise hardware-dependent? The NFS server works fine on my P4 on ASUS P4P800 board,
while it crashes my EPIA Board (VIA C3) using the same software configuration. Other network
applications run fine (as a workaround I'm using samba right now instead of nfs), so I don't think
my hardware is broken.

Thanks,
Michael


* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
  2005-06-16  9:19 ` Michael Heyse
@ 2005-06-16  9:24   ` Justin Piszcz
  2005-06-16 14:59     ` Lee Revell
  2005-06-17  8:11     ` Michael Heyse
  0 siblings, 2 replies; 11+ messages in thread
From: Justin Piszcz @ 2005-06-16  9:24 UTC
  To: Michael Heyse; +Cc: linux-kernel

Alan followed up with me, but we did not reach any conclusion as to what
was causing the crash.  The main way I got it to crash was dd
if=/dev/hde (root drive) of=/nfs/file.img bs=1M; I have not had any issues
with copying files and such.  For you, is it on a particular box or
boxes?  Have you tried copying in the other direction?  I use NFS over UDP,
btw (v3).

# mount
mount:/disk/1 on /remote/1 type nfs (rw,hard,intr,nfsvers=3,addr=192.168.168.253)


On Thu, 16 Jun 2005, Michael Heyse wrote:

> Hi Justin and others,
>
> did you manage to resolve this problem? I'm also experiencing apparently NFS-related crashes (kernel
> hangs after a couple of seconds up to minutes, no syslog entries, nothing at all works any more)
> using 2.6.11.10 and NFS V3 over TCP, standard r/wsizes, ext3 on a RAID5 array. Is this possibly
> arch- or otherwise hardware-dependent? The NFS server works fine on my P4 on ASUS P4P800 board,
> while it crashes my EPIA Board (VIA C3) using the same software configuration. Other network
> applications run fine (as a workaround I'm using samba right now instead of nfs), so I don't think
> my hardware is broken.
>
> Thanks,
> Michael
>


* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
  2005-06-16  9:24   ` Justin Piszcz
@ 2005-06-16 14:59     ` Lee Revell
  2005-06-16 15:10       ` Justin Piszcz
  2005-06-17  8:12       ` Michael Heyse
  2005-06-17  8:11     ` Michael Heyse
  1 sibling, 2 replies; 11+ messages in thread
From: Lee Revell @ 2005-06-16 14:59 UTC
  To: Justin Piszcz; +Cc: Michael Heyse, linux-kernel

On Thu, 2005-06-16 at 05:24 -0400, Justin Piszcz wrote:
> Alan followed up with me, but we did not reach any conclusion as to what
> was causing the crash.  The main way I got it to crash was dd
> if=/dev/hde (root drive) of=/nfs/file.img bs=1M; I have not had any issues
> with copying files and such.  For you, is it on a particular box or
> boxes?  Have you tried copying in the other direction?  I use NFS over UDP,
> btw (v3).
>
> # mount
> mount:/disk/1 on /remote/1 type nfs (rw,hard,intr,nfsvers=3,addr=192.168.168.253)

Are you both using NFS + software RAID?  Is 4KSTACKS enabled?

IIRC people were getting stack overflows with the NFS + RAID + 4K stacks
combination.
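
One way to answer the 4KSTACKS question is to look for the option in the
kernel's .config. The Python sketch below works on the config text itself;
where that text comes from on a given box (for example /proc/config.gz or a
config file under /boot) is an assumption and depends on how the kernel was
built. The function name and the sample snippets are made up for illustration:

```python
def option_enabled(config_text: str, option: str = "CONFIG_4KSTACKS") -> bool:
    """Return True if `option` is set to 'y' in kernel .config text."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name == option:
            return value == "y"
    return False


# Hypothetical config fragments, not taken from either machine in this thread.
sample_4k = "CONFIG_X86=y\nCONFIG_4KSTACKS=y\n"
sample_8k = "CONFIG_X86=y\n# CONFIG_4KSTACKS is not set\n"
print(option_enabled(sample_4k), option_enabled(sample_8k))
```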

Lee



* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
  2005-06-16 14:59     ` Lee Revell
@ 2005-06-16 15:10       ` Justin Piszcz
  2005-06-17  8:12       ` Michael Heyse
  1 sibling, 0 replies; 11+ messages in thread
From: Justin Piszcz @ 2005-06-16 15:10 UTC
  To: Lee Revell; +Cc: Michael Heyse, linux-kernel



On Thu, 16 Jun 2005, Lee Revell wrote:

> On Thu, 2005-06-16 at 05:24 -0400, Justin Piszcz wrote:
>> Alan followed up with me, but we did not reach any conclusion as to what
>> was causing the crash.  The main way I got it to crash was dd
>> if=/dev/hde (root drive) of=/nfs/file.img bs=1M; I have not had any issues
>> with copying files and such.  For you, is it on a particular box or
>> boxes?  Have you tried copying in the other direction?  I use NFS over UDP,
>> btw (v3).
>>
>> # mount
>> mount:/disk/1 on /remote/1 type nfs (rw,hard,intr,nfsvers=3,addr=192.168.168.253)
>
> Are you both using NFS + software RAID?  Is 4KSTACKS enabled?
>
> IIRC people were getting stack overflows with the NFS + RAID + 4K stacks
> combination.
>
> Lee
>

I was not using any type of RAID, SW or HW.
4K stacks were not enabled on either machine.

I am using XFS though.



* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
  2005-06-16  9:24   ` Justin Piszcz
  2005-06-16 14:59     ` Lee Revell
@ 2005-06-17  8:11     ` Michael Heyse
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Heyse @ 2005-06-17  8:11 UTC
  To: Justin Piszcz; +Cc: linux-kernel

Justin Piszcz wrote:
> Alan followed up with me, but we did not reach any conclusion as to what
> was causing the crash.  The main way I got it to crash was dd
> if=/dev/hde (root drive) of=/nfs/file.img bs=1M; I have not had any
> issues with copying files and such.  For you, is it on a particular
> box or boxes?  Have you tried copying in the other direction?  I use NFS
> over UDP, btw (v3).

Sadly, I have discovered that those crashes are not really NFS-related, although they are
triggered much more often when I'm using NFS. The machine ran stably for almost 2 days without
NFS, but then it still hung.

Thank you for your time!
Michael


* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
  2005-06-16 14:59     ` Lee Revell
  2005-06-16 15:10       ` Justin Piszcz
@ 2005-06-17  8:12       ` Michael Heyse
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Heyse @ 2005-06-17  8:12 UTC
  To: Lee Revell; +Cc: Justin Piszcz, linux-kernel

Lee Revell wrote:

> Are you both using NFS + software RAID?  Is 4KSTACKS enabled?
> 
> IIRC people were getting stack overflows with the NFS + RAID + 4K stacks
> combination.

In my case, 4k stacks are enabled. Thanks for the hint; I'll try again with 8k stacks.

Michael

