* Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
@ 2005-05-14 13:18 Justin Piszcz
2005-05-17 14:47 ` Alan Cox
2005-06-16 9:19 ` Michael Heyse
0 siblings, 2 replies; 11+ messages in thread
From: Justin Piszcz @ 2005-05-14 13:18 UTC
To: linux-kernel
When I run the following command over NFS:
dd if=/dev/hde of=/remote/disk5/file.img bs=1M
After roughly 30-60 seconds, it kills the remote machine.
I cannot ping the machine, nor can I wake up the monitor to see what
happened.
The mount options I am using are:
rw,hard,intr,rsize=65536,wsize=65536,nfsvers=3 0 0
1] /dev/hde is on a promise controller on an abit-ic7-g
2] /remote/disk5 is on a promise controller on another abit-ic7-g
Both filesystems are XFS.
The network interface is 1000 Mbps full duplex.
Log of what happens; the packet loss begins when dd starts moving
data over to the other box.
# The following is right before I ran the dd command.
PING routerbox (192.168.0.1) 56(84) bytes of data.
64 bytes from routerbox (192.168.0.1): icmp_seq=1 ttl=64 time=0.214 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=2 ttl=64 time=0.154 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=3 ttl=64 time=0.157 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=4 ttl=64 time=0.171 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=5 ttl=64 time=0.179 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=6 ttl=64 time=0.191 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=7 ttl=64 time=0.179 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=8 ttl=64 time=0.212 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=9 ttl=64 time=0.133 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=10 ttl=64 time=0.228 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=11 ttl=64 time=0.119 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=12 ttl=64 time=0.129 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=13 ttl=64 time=0.260 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=14 ttl=64 time=0.264 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=15 ttl=64 time=0.280 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=16 ttl=64 time=0.276 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=17 ttl=64 time=0.181 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=18 ttl=64 time=0.188 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=19 ttl=64 time=3.17 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=20 ttl=64 time=2.44 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=36 ttl=64 time=0.250 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=37 ttl=64 time=0.256 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=38 ttl=64 time=0.204 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=39 ttl=64 time=0.281 ms
64 bytes from routerbox (192.168.0.1): icmp_seq=40 ttl=64 time=0.170 ms
From mybox (192.168.0.12) icmp_seq=74 Destination Host Unreachable
From mybox (192.168.0.12) icmp_seq=75 Destination Host Unreachable
From mybox (192.168.0.12) icmp_seq=76 Destination Host Unreachable
--- routerbox ping statistics ---
79 packets transmitted, 25 received, +3 errors, 68% packet loss, time 77984ms
rtt min/avg/max/mdev = 0.119/0.411/3.170/0.715 ms, pipe 3
mybox@mybox:~$
Oh, and in case anyone thinks there is a network issue, there is not:
during normal operation, when I am not running dd, there are no network
problems, as shown below.
77 packets transmitted, 77 received, 0% packet loss, time 76003ms
rtt min/avg/max/mdev = 0.106/0.226/0.478/0.068 ms
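The loss percentages in the two ping summaries above follow directly from the transmitted and received counts. A minimal sketch of that arithmetic, assuming ping truncates to a whole percent; the function name packet_loss_pct is made up for illustration and is not part of ping:

```sh
# packet_loss_pct TRANSMITTED RECEIVED -> prints whole-percent loss (truncated).
# Hypothetical helper for illustration only.
packet_loss_pct() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

# Counts taken from the two ping summaries in this message.
echo "during dd: $(packet_loss_pct 79 25)% loss"
echo "idle:      $(packet_loss_pct 77 77)% loss"
```

With the counts from this message, the first call gives 68 and the second gives 0, matching what ping printed.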
* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
2005-05-14 13:18 Reproducible 2.6.11.9 NFS Kernel Crashing Bug! Justin Piszcz
@ 2005-05-17 14:47 ` Alan Cox
2005-05-18 0:35 ` Justin Piszcz
2005-05-18 0:36 ` Justin Piszcz
2005-06-16 9:19 ` Michael Heyse
1 sibling, 2 replies; 11+ messages in thread
From: Alan Cox @ 2005-05-17 14:47 UTC
To: Justin Piszcz; +Cc: Linux Kernel Mailing List
On Sat, 2005-05-14 at 14:18, Justin Piszcz wrote:
> The mount options I am using are:
> rw,hard,intr,rsize=65536,wsize=65536,nfsvers=3 0 0
These are rather extreme r/wsizes especially if you are using UDP - I'm
assuming this is TCP ?
> Oh, and in case anyone thinks there is a network issue, there is not:
> during normal operation, when I am not running dd, there are no network
> problems, as shown below.
I would certainly expect it to be a memory issue. Does it occur with
8192 as the size ?
* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
2005-05-17 14:47 ` Alan Cox
@ 2005-05-18 0:35 ` Justin Piszcz
2005-05-18 0:36 ` Justin Piszcz
1 sibling, 0 replies; 11+ messages in thread
From: Justin Piszcz @ 2005-05-18 0:35 UTC
To: Alan Cox; +Cc: Linux Kernel Mailing List
Alan,
It also fails with 8192, totally crashes the box!
Seconds after (3-8 seconds), the remote host is dead, even with 8192.
I get an IRQ #18, Nobody Cared!
The crash dump mentions something like e1000x and ata/x.
Disabling interrupt 18.
box:~# df -h | grep /x5
box:~# mount -a
box:~# df -h | grep /x5
p500:/d5/x5 234G 45G 189G 20% /p500/x5
box:~# grep /x5 /etc/fstab
#p500:/d5/x5 /p500/x5 nfs rw,hard,intr,rsize=65536,wsize=65536,nfsvers=3 0 0
p500:/d5/x5 /p500/x5 nfs rw,hard,intr,rsize=8192,wsize=8192,nfsvers=3 0 0
box:~#
box:~# mount | grep x5
p500:/d5/x5 on /p500/x5 type nfs (rw,hard,intr,rsize=8192,wsize=8192,nfsvers=3,addr=192.168.0.253)
box:~# dd if=/dev/hde of=/p500/x5/file.img bs=1M
And Al,
This is going to sound absolutely crazy, but my box would *NOT* come back
up after the dd. It gets to the point where it initializes the network, but the
machine that did the dd must be sending some kind of NASTY packet that KILLS
the kernel: as soon as it initializes eth0, BAM, it freezes at the console while
trying to boot up. The fix is to shut off the machine that did the dd or
disconnect the network cable, and then voila, it comes back up. Also, the
machine that did the dd was SERIOUSLY lagged, almost unusable.
Any ideas?
I'd prefer not to repeat this problem again, thanks!
Justin.
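For the "IRQ #18, Nobody Cared" report above, it may help to see which drivers share that interrupt line. A rough sketch, assuming the usual /proc/interrupts layout in which each line starts with the IRQ number followed by a colon; irq_line is a made-up helper name:

```sh
# irq_line IRQ FILE -> prints the line(s) for IRQ from a /proc/interrupts-style file.
# Hypothetical helper for illustration only.
irq_line() {
    # The first whitespace-separated field looks like "18:", so compare against that.
    awk -v want="$1:" '$1 == want' "$2"
}

# Show what is attached to IRQ 18 on this machine, if the file is available.
if [ -r /proc/interrupts ]; then
    irq_line 18 /proc/interrupts
fi
```

If both the network driver and the disk controller driver appear on the same line, they are sharing the interrupt.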
On Tue, 17 May 2005, Alan Cox wrote:
> On Sat, 2005-05-14 at 14:18, Justin Piszcz wrote:
>> The mount options I am using are:
>> rw,hard,intr,rsize=65536,wsize=65536,nfsvers=3 0 0
>
> These are rather extreme r/wsizes especially if you are using UDP - I'm
> assuming this is TCP ?
>
>> Oh, and in case anyone thinks there is a network issue, there is not:
>> during normal operation, when I am not running dd, there are no network
>> problems, as shown below.
>
> I would certainly expect it to be a memory issue. Does it occur with
> 8192 as the size ?
>
* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
2005-05-17 14:47 ` Alan Cox
2005-05-18 0:35 ` Justin Piszcz
@ 2005-05-18 0:36 ` Justin Piszcz
2005-05-18 12:34 ` Peter Staubach
1 sibling, 1 reply; 11+ messages in thread
From: Justin Piszcz @ 2005-05-18 0:36 UTC
To: Alan Cox; +Cc: Linux Kernel Mailing List
And I am using UDP, not TCP.
NFS Version 3.
On Tue, 17 May 2005, Alan Cox wrote:
> On Sat, 2005-05-14 at 14:18, Justin Piszcz wrote:
>> The mount options I am using are:
>> rw,hard,intr,rsize=65536,wsize=65536,nfsvers=3 0 0
>
> These are rather extreme r/wsizes especially if you are using UDP - I'm
> assuming this is TCP ?
>
>> Oh, and in case anyone thinks there is a network issue, there is not:
>> during normal operation, when I am not running dd, there are no network
>> problems, as shown below.
>
> I would certainly expect it to be a memory issue. Does it occur with
> 8192 as the size ?
>
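The option strings quoted in this thread name no transport, so one way to double-check is to look for an explicit tcp/udp flag or a proto= value in the mount options. A small sketch; get_opt and has_flag are illustrative helpers, and what an absent transport option defaults to depends on the kernel and nfs-utils version, so the sketch makes no claim about that:

```sh
# get_opt OPTIONS NAME -> prints the value of NAME=... from a comma-separated list.
# Hypothetical helper for illustration only.
get_opt() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/^$2=//p" | head -n 1
}

# has_flag OPTIONS FLAG -> succeeds if the bare flag (e.g. tcp) is present.
has_flag() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -qx "$2"
}

# Option string copied from the mount output earlier in the thread.
opts='rw,hard,intr,rsize=8192,wsize=8192,nfsvers=3,addr=192.168.0.253'

echo "wsize=$(get_opt "$opts" wsize)"
if has_flag "$opts" tcp || [ "$(get_opt "$opts" proto)" = "tcp" ]; then
    echo "transport: tcp"
else
    echo "transport: tcp not requested in these options"
fi
```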
* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
2005-05-18 0:36 ` Justin Piszcz
@ 2005-05-18 12:34 ` Peter Staubach
0 siblings, 0 replies; 11+ messages in thread
From: Peter Staubach @ 2005-05-18 12:34 UTC
To: Justin Piszcz; +Cc: Alan Cox, Linux Kernel Mailing List
Justin Piszcz wrote:
> And I am using UDP, not TCP.
>
> NFS Version 3.
You may be able to specify rsize and wsize of 65536 with NFS Version 3 running
over UDP, but it is guaranteed not to work if either the client or the
server attempts a 64K transfer.
The problem is that UDP is limited to a 64K datagram. This datagram must hold
the data, some NFS protocol data structures, and some RPC data structures.
Together, these exceed the 64K limit. RPC over UDP will not allow the use of
multiple UDP datagrams, so RPC over UDP is limited to less than 64K payloads.
RPC over TCP will allow larger operations because there is no such single
datagram limit.
You could specify 56K or 60K transfer sizes if you wanted to stay at a multiple
of 8K or 4K, but there doesn't seem to be much point. The 32K number was
chosen because it was the largest power of 2 below 64K and seems to work
pretty well in most circumstances.
In general, I wouldn't recommend mucking with the read/write transfer sizes
unless you really know what you are doing and understand the target environment
very well.
ps
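The size limit described above can be checked with a few lines of arithmetic. A sketch, assuming IPv4 with a 20-byte header and no options; the 512-byte allowance for RPC and NFS headers is a placeholder chosen for illustration, not a measured value, and fits_in_udp is a made-up helper name:

```sh
# Largest UDP payload in one IPv4 datagram:
# 65535 (16-bit IP total length) - 20 (IPv4 header, no options) - 8 (UDP header).
MAX_UDP_PAYLOAD=$((65535 - 20 - 8))

# ASSUMPTION: illustrative allowance for RPC + NFS headers; the real size varies.
RPC_NFS_OVERHEAD=512

# fits_in_udp SIZE -> prints "fits" or "too big" for a transfer of SIZE data bytes.
fits_in_udp() {
    if [ $(( $1 + RPC_NFS_OVERHEAD )) -le "$MAX_UDP_PAYLOAD" ]; then
        echo "fits"
    else
        echo "too big"
    fi
}

# 8K, 32K, 56K, 60K and 64K transfer sizes.
for size in 8192 32768 57344 61440 65536; do
    printf '%6d bytes: %s\n' "$size" "$(fits_in_udp "$size")"
done
```

Note that 65536 bytes of data alone already exceeds the 65507-byte payload limit, before any headers are counted, so the result for 64K does not depend on the assumed overhead.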
* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
2005-05-14 13:18 Reproducible 2.6.11.9 NFS Kernel Crashing Bug! Justin Piszcz
2005-05-17 14:47 ` Alan Cox
@ 2005-06-16 9:19 ` Michael Heyse
2005-06-16 9:24 ` Justin Piszcz
1 sibling, 1 reply; 11+ messages in thread
From: Michael Heyse @ 2005-06-16 9:19 UTC
To: Justin Piszcz; +Cc: linux-kernel
Hi Justin and others,
did you manage to resolve this problem? I'm also experiencing apparently NFS-related crashes (kernel
hangs after a couple of seconds up to minutes, no syslog entries, nothing at all works any more)
using 2.6.11.10 and NFS V3 over TCP, standard r/wsizes, ext3 on a RAID5 array. Is this possibly
arch- or otherwise hardware-dependent? The NFS server works fine on my P4 on ASUS P4P800 board,
while it crashes my EPIA Board (VIA C3) using the same software configuration. Other network
applications run fine (as a workaround I'm using samba right now instead of nfs), so I don't think
my hardware is broken.
Thanks,
Michael
* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
2005-06-16 9:19 ` Michael Heyse
@ 2005-06-16 9:24 ` Justin Piszcz
2005-06-16 14:59 ` Lee Revell
2005-06-17 8:11 ` Michael Heyse
0 siblings, 2 replies; 11+ messages in thread
From: Justin Piszcz @ 2005-06-16 9:24 UTC
To: Michael Heyse; +Cc: linux-kernel
Alan followed up with me but we did not reach any conclusion as to what
was causing it to crash. The main way I got it to crash was dd
if=/dev/hde (root drive) of=/nfs/file.img bs=1M, I have not had any issues
as far as copying files and such. For you, is it on a particular box or
boxes, have you tried copying the other direction? I use NFS over UDP btw
(v3).
# mount
mount:/disk/1 on /remote/1 type nfs (rw,hard,intr,nfsvers=3,addr=192.168.168.253)
On Thu, 16 Jun 2005, Michael Heyse wrote:
> Hi Justin and others,
>
> did you manage to resolve this problem? I'm also experiencing apparently NFS-related crashes (kernel
> hangs after a couple of seconds up to minutes, no syslog entries, nothing at all works any more)
> using 2.6.11.10 and NFS V3 over TCP, standard r/wsizes, ext3 on a RAID5 array. Is this possibly
> arch- or otherwise hardware-dependent? The NFS server works fine on my P4 on ASUS P4P800 board,
> while it crashes my EPIA Board (VIA C3) using the same software configuration. Other network
> applications run fine (as a workaround I'm using samba right now instead of nfs), so I don't think
> my hardware is broken.
>
> Thanks,
> Michael
>
* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
2005-06-16 9:24 ` Justin Piszcz
@ 2005-06-16 14:59 ` Lee Revell
2005-06-16 15:10 ` Justin Piszcz
2005-06-17 8:12 ` Michael Heyse
2005-06-17 8:11 ` Michael Heyse
1 sibling, 2 replies; 11+ messages in thread
From: Lee Revell @ 2005-06-16 14:59 UTC
To: Justin Piszcz; +Cc: Michael Heyse, linux-kernel
On Thu, 2005-06-16 at 05:24 -0400, Justin Piszcz wrote:
> Alan followed up with me but we did not reach any conclusion as to what
> was causing it to crash. The main way I got it to crash was dd
> if=/dev/hde (root drive) of=/nfs/file.img bs=1M, I have not had any issues
> as far as copying files and such. For you, is it on a particular box or
> boxes, have you tried copying the other direction? I use NFS over UDP btw
> (v3).
>
> # mount
> mount:/disk/1 on /remote/1 type nfs
> (rw,hard,intr,nfsvers=3,addr=192.168.168.253)
Are you both using NFS + software RAID? Is 4KSTACKS enabled?
IIRC people were getting stack overflows with the NFS + RAID + 4K stacks
combination.
Lee
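To answer the 4KSTACKS question on a given machine, the kernel's build configuration can be inspected. A sketch, assuming the distribution installs the config as /boot/config-<release> (some kernels expose /proc/config.gz instead); check_4kstacks is a made-up helper name:

```sh
# check_4kstacks CONFIG_FILE -> prints "enabled", "disabled" or "unknown".
# Hypothetical helper for illustration only.
check_4kstacks() {
    if [ ! -r "$1" ]; then
        echo "unknown"      # config file missing or unreadable
        return
    fi
    if grep -q '^CONFIG_4KSTACKS=y' "$1"; then
        echo "enabled"
    else
        echo "disabled"     # option commented out ("is not set") or absent
    fi
}

# ASSUMPTION: distro-style config path; adjust for the machine at hand.
check_4kstacks "/boot/config-$(uname -r)"
```

Running it on both the NFS client and the server covers the case where only one side was built with 4K stacks.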
* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
2005-06-16 14:59 ` Lee Revell
@ 2005-06-16 15:10 ` Justin Piszcz
2005-06-17 8:12 ` Michael Heyse
1 sibling, 0 replies; 11+ messages in thread
From: Justin Piszcz @ 2005-06-16 15:10 UTC
To: Lee Revell; +Cc: Michael Heyse, linux-kernel
On Thu, 16 Jun 2005, Lee Revell wrote:
> On Thu, 2005-06-16 at 05:24 -0400, Justin Piszcz wrote:
>> Alan followed up with me but we did not reach any conclusion as to what
>> was causing it to crash. The main way I got it to crash was dd
>> if=/dev/hde (root drive) of=/nfs/file.img bs=1M, I have not had any issues
>> as far as copying files and such. For you, is it on a particular box or
>> boxes, have you tried copying the other direction? I use NFS over UDP btw
>> (v3).
>>
>> # mount
>> mount:/disk/1 on /remote/1 type nfs
>> (rw,hard,intr,nfsvers=3,addr=192.168.168.253)
>
> Are you both using NFS + software RAID? Is 4KSTACKS enabled?
>
> IIRC people were getting stack overflows with the NFS + RAID + 4K stacks
> combination.
>
> Lee
>
I was not using any type of RAID, SW or HW.
4K stacks was not enabled on either machine.
I am using XFS though.
* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
2005-06-16 9:24 ` Justin Piszcz
2005-06-16 14:59 ` Lee Revell
@ 2005-06-17 8:11 ` Michael Heyse
1 sibling, 0 replies; 11+ messages in thread
From: Michael Heyse @ 2005-06-17 8:11 UTC
To: Justin Piszcz; +Cc: linux-kernel
Justin Piszcz wrote:
> Alan followed up with me but we did not reach any conclusion as to what
> was causing it to crash. The main way I got it to crash was dd
> if=/dev/hde (root drive) of=/nfs/file.img bs=1M, I have not had any
> issues as far as copying files and such. For you, is it on a particular
> box or boxes, have you tried copying the other direction? I use NFS
> over UDP btw (v3).
Sadly, I had to discover that those crashes are not really NFS-related, but they are triggered much
more often when I'm using NFS. The machine ran stably for almost 2 days without NFS, but then it
still hung.
Thank you for your time!
Michael
* Re: Reproducible 2.6.11.9 NFS Kernel Crashing Bug!
2005-06-16 14:59 ` Lee Revell
2005-06-16 15:10 ` Justin Piszcz
@ 2005-06-17 8:12 ` Michael Heyse
1 sibling, 0 replies; 11+ messages in thread
From: Michael Heyse @ 2005-06-17 8:12 UTC
To: Lee Revell; +Cc: Justin Piszcz, linux-kernel
Lee Revell wrote:
> Are you both using NFS + software RAID? Is 4KSTACKS enabled?
>
> IIRC people were getting stack overflows with the NFS + RAID + 4K stacks
> combination.
In my case 4k stacks are enabled. Thanks for the hint, I'll try again with 8k stacks.
Michael