* 3.0+ NFS issues
@ 2012-05-25  6:53 Michael Tokarev
  2012-05-29 15:24 ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread

From: Michael Tokarev @ 2012-05-25 6:53 UTC (permalink / raw)
To: linux-nfs, Linux-kernel

Hello.

I updated my NFS server machine to kernel 3.0 and noticed that its main
usage became, well, problematic.  While trying to dig deeper, I also
found a few other interesting issues, which are mentioned below.  But
first things first: NFS.

i686 PAE kernel, lots of RAM, Atom-based (Cedar Trail) machine with the
usual rtl8169 NIC.  3.0 or 3.2 kernel (I will try the current 3.4, but
I don't have much hope there).  NFSv4.

When a client machine (also running a 3.0 kernel) does some reading,
the process often stalls somewhere in the read syscall or, more rarely,
during close, for up to two MINUTES.  During this time, the client
(kernel) reports "NFS server <foo> does not respond" several times, and
finally "NFS server <foo> ok"; the client process "unsticks" from the
read(2) and is able to perform a few more reads until the whole thing
repeats.

While this happens, i.e., while the client process is stuck, any access
to the mounted filesystem results in an uninterruptible sleep - all
processes, including the initial client, are unkillable.  This is very
annoying, but I was never able to fix it, including in the real case
when the server becomes unreachable - the "intr" mount option, which is
supposed to help here, makes no difference.  This is the first and a
longstanding issue, but it is not really related to the issue at hand.

On the server side things are more interesting.  During the client
stall, all NFSD threads are busy running, using all available CPU.  The
machine has 2 cores, and if I reduce the number of NFSD threads to 1,
one core will be used by the single NFSD thread at 100%.  With the
default 8 NFSDs, both cores are used, by 4 NFSD threads each.  This is
according to top(1), which shows proportional and equal CPU activity
for each NFSD thread.

But.  And this is where things become more interesting.
perf top shows almost no activity with any number of NFSD threads but
one.  For example, this is a typical perf top display with 2 NFSD
threads:

  samples  pcnt  function             DSO
  1558.00  6.4%  svc_recv             [sunrpc]
  1084.00  5.2%  __schedule           [kernel]
  1065.00  5.1%  select_task_rq_fair  [kernel]
   692.00  3.3%  try_to_wake_up       [kernel]

Yes, there's some 6.4% of time spent in svc_recv, but it is FAR from
the 200% CPU usage reported by top(1).  Now, when I reduce the number
of NFSD threads to 1, the perf top output becomes dramatically
different:

  samples  pcnt  function             DSO
  5996.00 59.2%  svc_recv             [sunrpc]
   750.00  7.4%  cache_check          [sunrpc]
   519.00  5.1%  svc_tcp_has_wspace   [sunrpc]
   465.00  4.6%  local_bh_enable_ip   [kernel.kallsyms]
   301.00  3.0%  svc_defer            [sunrpc]
   294.00  2.9%  page_address         [kernel.kallsyms]

I.e., svc_recv time usage jumps from 6% to 60% - 10 times the
difference.  So, apparently, the `perf top' tool does not show the
truth.

Trying to catch this high CPU usage with Sysrq+T does not show anything
interesting; the NFSD thread is running (and is the only thread shown
in the "Runnable tasks" output):

May 25 10:48:37 gnome vmunix: [ 1709.483771] nfsd R running 0 5853 2 0x00000000
May 25 10:48:37 gnome vmunix: [ 1709.483796] f4477030 00000046 00000000 5c8e8929 0000018e f4c700f0 f4043a70 f545b140
May 25 10:48:37 gnome vmunix: [ 1709.483841] 050a8544 0000018e c1469140 c1469140 f4043a70 00000086 00000000 c1048575
May 25 10:48:37 gnome vmunix: [ 1709.483887] c101ba11 c12fac18 f31bf800 f8f7bd30 0000003a c12fb156 0000003a 00000000
May 25 10:48:37 gnome vmunix: [ 1709.483931] Call Trace:
May 25 10:48:37 gnome vmunix: [ 1709.483950]  [<c1048575>] ? irq_exit+0x35/0x90
May 25 10:48:37 gnome vmunix: [ 1709.483973]  [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
May 25 10:48:37 gnome vmunix: [ 1709.483997]  [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484029]  [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484053]  [<c12fb156>] ? apic_timer_interrupt+0x2a/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484084]  [<f8f7007b>] ? svc_destroy+0xdb/0x110 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484107]  [<c103d3e3>] ? __cond_resched+0x13/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484129]  [<c12f92b5>] ? _cond_resched+0x25/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484159]  [<f8f7cb5d>] ? svc_recv+0x1cd/0x7e0 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484182]  [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
May 25 10:48:37 gnome vmunix: [ 1709.484206]  [<c103a3d0>] ? try_to_wake_up+0x200/0x200
May 25 10:48:37 gnome vmunix: [ 1709.484233]  [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
May 25 10:48:37 gnome vmunix: [ 1709.484255]  [<c102d240>] ? complete+0x40/0x60
May 25 10:48:37 gnome vmunix: [ 1709.484279]  [<f90ea000>] ? 0xf90e9fff
May 25 10:48:37 gnome vmunix: [ 1709.484299]  [<c105ea99>] ? kthread+0x69/0x70
May 25 10:48:37 gnome vmunix: [ 1709.484321]  [<c105ea30>] ? kthread_worker_fn+0x150/0x150
May 25 10:48:37 gnome vmunix: [ 1709.484345]  [<c1301876>] ? kernel_thread_helper+0x6/0xd

May 25 10:49:37 gnome vmunix: [ 1769.982791] nfsd R running 0 5853 2 0x00000000
May 25 10:49:37 gnome vmunix: [ 1769.982816] f5449f80 00000000 00000000 7636b581 0000019c 00000003 7603dcb8 0000019c
May 25 10:49:37 gnome vmunix: [ 1769.982859] 7603dcb8 0000019c 00000046 00000000 f4043a70 00000086 00000000 c1048575
May 25 10:49:37 gnome vmunix: [ 1769.982903] f31bf800 c12fac18 f31bf800 f8f7bd30 00000047 f31bf800 f418c2c0 f8f7c03c
May 25 10:49:37 gnome vmunix: [ 1769.982948] Call Trace:
May 25 10:49:37 gnome vmunix: [ 1769.982966]  [<c1048575>] ? irq_exit+0x35/0x90
May 25 10:49:37 gnome vmunix: [ 1769.982987]  [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
May 25 10:49:37 gnome vmunix: [ 1769.983019]  [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983050]  [<f8f7c03c>] ? svc_xprt_release+0x1c/0xd0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983081]  [<f8f7c0ad>] ? svc_xprt_release+0x8d/0xd0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983112]  [<f8f7cc03>] ? svc_recv+0x273/0x7e0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983143]  [<f8f7c300>] ? svc_age_temp_xprts+0x110/0x190 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983166]  [<c103a3d0>] ? try_to_wake_up+0x200/0x200
May 25 10:49:37 gnome vmunix: [ 1769.983193]  [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
May 25 10:49:37 gnome vmunix: [ 1769.983215]  [<c102d240>] ? complete+0x40/0x60
May 25 10:49:37 gnome vmunix: [ 1769.983238]  [<f90ea000>] ? 0xf90e9fff
May 25 10:49:37 gnome vmunix: [ 1769.983258]  [<c105ea99>] ? kthread+0x69/0x70
May 25 10:49:37 gnome vmunix: [ 1769.983278]  [<c105ea30>] ? kthread_worker_fn+0x150/0x150
May 25 10:49:37 gnome vmunix: [ 1769.983301]  [<c1301876>] ? kernel_thread_helper+0x6/0xd

This is on 3.2; on 3.0 it was in a different place.

This problem is persistent and always reproducible; it happens on reads
from the server (I haven't seen it with writes), and it completely goes
away when reverting to the kernel I previously used on this machine,
2.6.32.  But unfortunately 2.6.32 does not quite work on the new
hardware (I replaced the motherboard), due to a lack of drivers for
several pieces.  (This is a different story; I think NFSD should work
in 3.0 regardless of 2.6.32 issues :)

Anything I can do to make NFS work again?

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 27+ messages in thread
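[Editorial sketch, not part of the thread: the top(1)-versus-perf
discrepancy described above can be cross-checked with a third,
independent source of per-thread CPU accounting - the cumulative utime
and stime counters in /proc/<pid>/task/<tid>/stat (fields 14 and 15 of
proc(5)).  The helper below is Linux-specific and hypothetical; sampling
it twice a second apart for the nfsd PIDs would show whether the kernel's
own accounting agrees with top or with perf.]

```python
import os

CLK_TCK = os.sysconf("SC_CLK_TCK")  # clock ticks per second

def thread_cpu_seconds(pid):
    """Return {tid: (utime_s, stime_s)} for every thread of `pid`,
    read from /proc/<pid>/task/<tid>/stat."""
    out = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        with open(f"/proc/{pid}/task/{tid}/stat") as f:
            data = f.read()
        # The comm field may contain spaces or parens, so the
        # remaining fields are parsed from after the last ')'.
        rest = data.rsplit(")", 1)[1].split()
        # After stripping pid and comm, utime is field 14 -> index 11,
        # stime is field 15 -> index 12 (see proc(5)).
        utime, stime = int(rest[11]), int(rest[12])
        out[int(tid)] = (utime / CLK_TCK, stime / CLK_TCK)
    return out

if __name__ == "__main__":
    # Demo on the current process; for the bug report one would pass
    # the PIDs of the nfsd threads instead.
    print(thread_cpu_seconds(os.getpid()))
```

Sampling this before and after a stall and diffing the values gives CPU
seconds actually charged to each thread, independent of both top's
display and perf's sampling.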
* Re: 3.0+ NFS issues
  2012-05-25  6:53 3.0+ NFS issues Michael Tokarev
@ 2012-05-29 15:24 ` J. Bruce Fields
  2012-05-30  7:11 ` Michael Tokarev
  0 siblings, 1 reply; 27+ messages in thread

From: J. Bruce Fields @ 2012-05-29 15:24 UTC (permalink / raw)
To: Michael Tokarev; +Cc: linux-nfs, Linux-kernel

On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
> I updated my nfs server machine to kernel 3.0, and
> noticed that its main usage become, well, problematic.
> 
> While trying to dig deeper, I also found a few other
> interesting issues, which are mentioned below.
> 
> But first thing first: nfs.
> 
> i686pae kernel, lots of RAM, Atom-based (cedar trail)
> machine with usual rtl8169 NIC.  3.0 or 3.2 kernel
> (I will try current 3.4 but I don't have much hopes
> there).  NFSv4.
> 
> When a client machine (also 3.0 kernel) does some reading,
> the process often stalls somewhere in the read syscall,
> or, rarer, during close, for up to two MINUTES.  During
> this time, the client (kernel) reports "NFS server <foo>
> does not respond" several times, and finally "NFS server
> <foo> ok", client process "unstucks" from the read(2),
> and is able to perform a few more reads till the whole
> thing repeats.

You say 2.6.32 was OK; have you tried anything else between?

And you're holding the client constant while varying only the server
version, right?

Is your network otherwise working?  (E.g. does transferring a bunch of
data from server to client using some other protocol work reliably?)

Is there anything happening on the network during these stalls?  (You
can watch the network with wireshark, for example.)

Does NFSv3 behave the same way?

--b.

> 
> While this happens, ie, while the client process is stuck,
> any access to the mounted filesystem results in an
> uninterruptible sleep - all processes including the
> initial client are unkillable.  This is very annoying,
> but I was never able to fix this, including the real
> case when the server becomes unreachable - "intr" mount
> option which is supposed to help here makes no difference.
> This is first and longstanding issue, but it is not really
> related to the issue at hand.
> 
> On the server side things are more interesting.  During
> the client stall, all NFSD threads are busy running,
> using all available CPU.  The machine has 2 cores, and
> if I reduce number of NFSD threads to 1, one core will
> be used by the single NFSD to 100%.  With default 8
> NFSDs, both cores are used, by 4 NFSD threads each.
> This is according to top(1), which show proportional
> and equal CPU activity for each of NFSD thread.
> 
> But.  And this is where things becomes more interesting.
> perf top shows almost no activity with any number of NFSD
> threads but one.  For example, this is a typical perf top
> display with 2 NFSD threads:
> 
> samples  pcnt function            DSO
> 1558.00  6.4% svc_recv            [sunrpc]
> 1084.00  5.2% __schedule          [kernel]
> 1065.00  5.1% select_task_rq_fair [kernel]
>  692.00  3.3% try_to_wake_up      [kernel]
> 
> 
> Yes, there's some 6.4% time spent in svc_recv, but it is
> FAR from the 200% CPU usage reported by top(1).  Now, when
> I reduce number of NFSD threds to 1, perf top output becomes
> dramatically different:
> 
> samples  pcnt function            DSO
> 5996.00 59.2% svc_recv            [sunrpc]
>  750.00  7.4% cache_check         [sunrpc]
>  519.00  5.1% svc_tcp_has_wspace  [sunrpc]
>  465.00  4.6% local_bh_enable_ip  [kernel.kallsyms]
>  301.00  3.0% svc_defer           [sunrpc]
>  294.00  2.9% page_address        [kernel.kallsyms]
> 
> Ie, scv_recv time usage jumps from 6 to 60%, ie, 10 times
> the difference.
> 
> So, apparently, `perf top' tool does not show the truth.
> 
> Trying to catch this high CPU usage with Sysrq+T does not
> show anything interesting, the NFSD thread is running (and
> is the only thread(s) shown in the "Runnable tasks" output):
> 
> 
> May 25 10:48:37 gnome vmunix: [ 1709.483771] nfsd R running 0 5853 2 0x00000000
> May 25 10:48:37 gnome vmunix: [ 1709.483796] f4477030 00000046 00000000 5c8e8929 0000018e f4c700f0 f4043a70 f545b140
> May 25 10:48:37 gnome vmunix: [ 1709.483841] 050a8544 0000018e c1469140 c1469140 f4043a70 00000086 00000000 c1048575
> May 25 10:48:37 gnome vmunix: [ 1709.483887] c101ba11 c12fac18 f31bf800 f8f7bd30 0000003a c12fb156 0000003a 00000000
> May 25 10:48:37 gnome vmunix: [ 1709.483931] Call Trace:
> May 25 10:48:37 gnome vmunix: [ 1709.483950]  [<c1048575>] ? irq_exit+0x35/0x90
> May 25 10:48:37 gnome vmunix: [ 1709.483973]  [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
> May 25 10:48:37 gnome vmunix: [ 1709.483997]  [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484029]  [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
> May 25 10:48:37 gnome vmunix: [ 1709.484053]  [<c12fb156>] ? apic_timer_interrupt+0x2a/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484084]  [<f8f7007b>] ? svc_destroy+0xdb/0x110 [sunrpc]
> May 25 10:48:37 gnome vmunix: [ 1709.484107]  [<c103d3e3>] ? __cond_resched+0x13/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484129]  [<c12f92b5>] ? _cond_resched+0x25/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484159]  [<f8f7cb5d>] ? svc_recv+0x1cd/0x7e0 [sunrpc]
> May 25 10:48:37 gnome vmunix: [ 1709.484182]  [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
> May 25 10:48:37 gnome vmunix: [ 1709.484206]  [<c103a3d0>] ? try_to_wake_up+0x200/0x200
> May 25 10:48:37 gnome vmunix: [ 1709.484233]  [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
> May 25 10:48:37 gnome vmunix: [ 1709.484255]  [<c102d240>] ? complete+0x40/0x60
> May 25 10:48:37 gnome vmunix: [ 1709.484279]  [<f90ea000>] ? 0xf90e9fff
> May 25 10:48:37 gnome vmunix: [ 1709.484299]  [<c105ea99>] ? kthread+0x69/0x70
> May 25 10:48:37 gnome vmunix: [ 1709.484321]  [<c105ea30>] ? kthread_worker_fn+0x150/0x150
> May 25 10:48:37 gnome vmunix: [ 1709.484345]  [<c1301876>] ? kernel_thread_helper+0x6/0xd
> 
> May 25 10:49:37 gnome vmunix: [ 1769.982791] nfsd R running 0 5853 2 0x00000000
> May 25 10:49:37 gnome vmunix: [ 1769.982816] f5449f80 00000000 00000000 7636b581 0000019c 00000003 7603dcb8 0000019c
> May 25 10:49:37 gnome vmunix: [ 1769.982859] 7603dcb8 0000019c 00000046 00000000 f4043a70 00000086 00000000 c1048575
> May 25 10:49:37 gnome vmunix: [ 1769.982903] f31bf800 c12fac18 f31bf800 f8f7bd30 00000047 f31bf800 f418c2c0 f8f7c03c
> May 25 10:49:37 gnome vmunix: [ 1769.982948] Call Trace:
> May 25 10:49:37 gnome vmunix: [ 1769.982966]  [<c1048575>] ? irq_exit+0x35/0x90
> May 25 10:49:37 gnome vmunix: [ 1769.982987]  [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
> May 25 10:49:37 gnome vmunix: [ 1769.983019]  [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983050]  [<f8f7c03c>] ? svc_xprt_release+0x1c/0xd0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983081]  [<f8f7c0ad>] ? svc_xprt_release+0x8d/0xd0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983112]  [<f8f7cc03>] ? svc_recv+0x273/0x7e0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983143]  [<f8f7c300>] ? svc_age_temp_xprts+0x110/0x190 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983166]  [<c103a3d0>] ? try_to_wake_up+0x200/0x200
> May 25 10:49:37 gnome vmunix: [ 1769.983193]  [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
> May 25 10:49:37 gnome vmunix: [ 1769.983215]  [<c102d240>] ? complete+0x40/0x60
> May 25 10:49:37 gnome vmunix: [ 1769.983238]  [<f90ea000>] ? 0xf90e9fff
> May 25 10:49:37 gnome vmunix: [ 1769.983258]  [<c105ea99>] ? kthread+0x69/0x70
> May 25 10:49:37 gnome vmunix: [ 1769.983278]  [<c105ea30>] ? kthread_worker_fn+0x150/0x150
> May 25 10:49:37 gnome vmunix: [ 1769.983301]  [<c1301876>] ? kernel_thread_helper+0x6/0xd
> 
> This is 3.2, on 3.0 it was in different place.
> 
> This problem is persistent, always reproducible, happens
> on reads from the server (I haven't seen it with writes),
> and completely goes away when reverting to the kernel I
> previously used on this machine - 2.6.32.  But unfortunately
> 2.6.32 does not quite work on the new hardware (I replaced
> the motherboard), due to lack of drivers for several pieces.
> (This is a different story, I think NFSD should work in 3.0
> regardless of 2.6.32 issues :)
> 
> Anything I can do to make NFS work again?
> 
> Thanks,
> 
> /mjt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues
  2012-05-29 15:24 ` J. Bruce Fields
@ 2012-05-30  7:11 ` Michael Tokarev
  2012-05-30 13:25 ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread

From: Michael Tokarev @ 2012-05-30 7:11 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-nfs, Linux-kernel

On 29.05.2012 19:24, J. Bruce Fields wrote:
> On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
>> I updated my nfs server machine to kernel 3.0, and
>> noticed that its main usage become, well, problematic.
>> 
>> While trying to dig deeper, I also found a few other
>> interesting issues, which are mentioned below.
>> 
>> But first thing first: nfs.
>> 
>> i686pae kernel, lots of RAM, Atom-based (cedar trail)
>> machine with usual rtl8169 NIC.  3.0 or 3.2 kernel
>> (I will try current 3.4 but I don't have much hopes
>> there).  NFSv4.
>> 
>> When a client machine (also 3.0 kernel) does some reading,
>> the process often stalls somewhere in the read syscall,
>> or, rarer, during close, for up to two MINUTES.  During
>> this time, the client (kernel) reports "NFS server <foo>
>> does not respond" several times, and finally "NFS server
>> <foo> ok", client process "unstucks" from the read(2),
>> and is able to perform a few more reads till the whole
>> thing repeats.
> 
> You say 2.6.32 was OK; have you tried anything else between?

Well, I thought bisecting between 2.6.32 and 3.0 would be quite
painful...  But I'll try if nothing else helps.  And no, I haven't
tried anything in between.

> And you're holding the client constant while varying only the server
> version, right?

Yes.

> Is your network otherwise working?  (E.g. does transferring a bunch of
> data from server to client using some other protocol work reliably?)

Yes, it works flawlessly; all other protocols work so far.

To date, I have resorted to using a small webserver plus wget as an
ugly workaround for the problem - http works for reads from the server,
while nfs works for writes.

> Is there anything happening on the network during these stalls?  (You
> can watch the network with wireshark, for example.)

The network load is irrelevant - it behaves the same way with a 100%
idle network or with the network busy doing other stuff.

> Does NFSv3 behave the same way?

Yes, it does.  With all NFSDs on the server eating all available CPU
for quite some time, and being "ghosts" to perf top.

And with the client being unkillable again.

Can at least the client be made interruptible?  Mounting with
-o intr,soft makes no visible difference...

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues
  2012-05-30  7:11 ` Michael Tokarev
@ 2012-05-30 13:25 ` J. Bruce Fields
  2012-05-31  6:47 ` Michael Tokarev
  0 siblings, 1 reply; 27+ messages in thread

From: J. Bruce Fields @ 2012-05-30 13:25 UTC (permalink / raw)
To: Michael Tokarev; +Cc: linux-nfs, Linux-kernel

On Wed, May 30, 2012 at 11:11:42AM +0400, Michael Tokarev wrote:
> On 29.05.2012 19:24, J. Bruce Fields wrote:
> > On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
> >> I updated my nfs server machine to kernel 3.0, and
> >> noticed that its main usage become, well, problematic.
> >> 
> >> While trying to dig deeper, I also found a few other
> >> interesting issues, which are mentioned below.
> >> 
> >> But first thing first: nfs.
> >> 
> >> i686pae kernel, lots of RAM, Atom-based (cedar trail)
> >> machine with usual rtl8169 NIC.  3.0 or 3.2 kernel
> >> (I will try current 3.4 but I don't have much hopes
> >> there).  NFSv4.
> >> 
> >> When a client machine (also 3.0 kernel) does some reading,
> >> the process often stalls somewhere in the read syscall,
> >> or, rarer, during close, for up to two MINUTES.  During
> >> this time, the client (kernel) reports "NFS server <foo>
> >> does not respond" several times, and finally "NFS server
> >> <foo> ok", client process "unstucks" from the read(2),
> >> and is able to perform a few more reads till the whole
> >> thing repeats.
> > 
> > You say 2.6.32 was OK; have you tried anything else between?
> 
> Well, I thought bisecting between 2.6.32 and 3.0 will be quite
> painful...  But I'll try if nothing else helps.  And no, I haven't
> tried anything in-between.
> 
> > And you're holding the client constant while varying only the server
> > version, right?
> 
> Yes.
> 
> > Is your network otherwise working?  (E.g. does transferring a bunch of
> > data from server to client using some other protocol work reliably?)
> 
> Yes, it works flawlessly, all other protocols works so far.
> 
> To the date, I resorted to using a small webserver plus wget as an ugly
> workaround for the problem - http works for reads from the server, while
> nfs works for writes.
> 
> > Is there anything happening on the network during these stalls?  (You
> > can watch the network with wireshark, for example.)
> 
> The network load is irrelevant - it behaves the same way with
> 100% idle network or with network busy doing other stuff.

That's not what I meant.  During one of these read stalls, if you watch
the network with wireshark, do you see any NFS traffic between the
client and server?

Also: do you have a reliable way of reproducing this quickly?

--b.

> > Does NFSv3 behave the same way?
> 
> Yes it does.  With all NFSDs on server eating all available CPUs for
> quite some time, and with being "ghosts" for perf top.
> 
> And with the client being unkillable again.
> 
> Can at least the client be made interruptible?  Mounting with
> -o intr,soft makes no visible difference...
> 
> Thanks,
> 
> /mjt

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues
  2012-05-30 13:25 ` J. Bruce Fields
@ 2012-05-31  6:47 ` Michael Tokarev
  2012-05-31 12:59 ` Myklebust, Trond
  0 siblings, 1 reply; 27+ messages in thread

From: Michael Tokarev @ 2012-05-31 6:47 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-nfs, Linux-kernel

On 30.05.2012 17:25, J. Bruce Fields wrote:
> On Wed, May 30, 2012 at 11:11:42AM +0400, Michael Tokarev wrote:
[]
> That's not what I meant.  During one of these read stalls, if you watch
> the network with wireshark, do you see any NFS traffic between the
> client and server?

Oh.  Indeed, I misunderstood.  And no, during these stalls, there's no
network activity at all.  Here's the typical scenario:

...
10:38:53.781990 IP (tos 0x0, ttl 64, id 35131, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0xb43d), ack 89530281, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782000 IP (tos 0x0, ttl 64, id 6329, offset 0, flags [DF], proto TCP (6), length 1500)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xe827 (correct), seq 89530281:89531729, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.782027 IP (tos 0x0, ttl 64, id 6330, offset 0, flags [DF], proto TCP (6), length 1708)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x37f6 (incorrect -> 0x6790), seq 89531729:89533385, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1656
10:38:53.782029 IP (tos 0x0, ttl 64, id 35132, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0xa81d), ack 89533385, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782040 IP (tos 0x0, ttl 64, id 6333, offset 0, flags [DF], proto TCP (6), length 1500)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x0d5b (correct), seq 89534833:89536281, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.782082 IP (tos 0x0, ttl 64, id 6334, offset 0, flags [DF], proto TCP (6), length 4396)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x4276 (incorrect -> 0x778a), seq 89536281:89540625, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 4344
10:38:53.782088 IP (tos 0x0, ttl 64, id 35134, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0x8bd5), ack 89540625, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782096 IP (tos 0x0, ttl 64, id 6337, offset 0, flags [DF], proto TCP (6), length 1500)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x835d (correct), seq 89540625:89542073, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.827355 IP (tos 0x0, ttl 64, id 35160, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1396548098 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827379 IP (tos 0x0, ttl 64, id 35161, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1413325314 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827385 IP (tos 0x0, ttl 64, id 35162, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1430102530 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827400 IP (tos 0x0, ttl 64, id 35163, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1446879746 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827406 IP (tos 0x0, ttl 64, id 35164, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1463656962 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827409 IP (tos 0x0, ttl 64, id 35165, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1480434178 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827413 IP (tos 0x0, ttl 64, id 35166, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1497211394 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827417 IP (tos 0x0, ttl 64, id 35167, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1513988610 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827420 IP (tos 0x0, ttl 64, id 35168, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1530765826 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827424 IP (tos 0x0, ttl 64, id 35169, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1547543042 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827427 IP (tos 0x0, ttl 64, id 35170, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1564320258 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827434 IP (tos 0x0, ttl 64, id 35171, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1581097474 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827438 IP (tos 0x0, ttl 64, id 35172, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1597874690 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827443 IP (tos 0x0, ttl 64, id 35173, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1614651906 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827447 IP (tos 0x0, ttl 64, id 35174, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1631429122 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827673 IP (tos 0x0, ttl 64, id 6428, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xe4e5 (correct), ack 41617, win 6289, options [nop,nop,TS val 122195221 ecr 3298142], length 0
10:38:53.827699 IP (tos 0x0, ttl 64, id 6429, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xdfd4 (correct), ack 42913, win 6289, options [nop,nop,TS val 122195221 ecr 3298143], length 0
10:38:53.865036 IP (tos 0x0, ttl 64, id 6430, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xdd40 (correct), ack 43561, win 6289, options [nop,nop,TS val 122195233 ecr 3298143], length 0
[pause]
^C

192.168.88.2 is the client, .4 is the server.  I'm not sure whether the
series of getattr requests from the client is right before or right
after the beginning of the stall, but after the 3 last replies from the
server there's no other activity for a long time, and the server is
eating 100% of the available CPU, as I described previously.

> Also: do you have a reliable way of reproducing this quickly?

Yes, it is enough to start copying any large file, and within a few
seconds the first stall happens.

Can you suggest something for the other part of the question:

>> Can at least the client be made interruptible?  Mounting with
>> -o intr,soft makes no visible difference...

please? :)

Thank you!

/mjt

^ permalink raw reply	[flat|nested] 27+ messages in thread
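[Editorial sketch, not part of the thread: a silent gap like the one
described above is easy to find mechanically in tcpdump text output by
diffing consecutive packet timestamps.  The `find_gaps` helper below is
hypothetical; it assumes same-day `HH:MM:SS.ffffff` timestamps at the
start of each packet line and skips indented continuation lines.]

```python
from datetime import datetime

def find_gaps(lines, threshold=1.0):
    """Yield (gap_seconds, line) for each tcpdump packet line whose
    timestamp is more than `threshold` seconds after the previous
    packet's.  Continuation lines (no leading timestamp) are skipped.
    Captures crossing midnight are not handled."""
    prev = None
    for line in lines:
        try:
            ts = datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        except (ValueError, IndexError):
            continue  # not a packet header line
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold:
                yield gap, line
        prev = ts

# Usage: feed it a saved capture, flag anything quieter than 5 seconds.
# with open("capture.txt") as f:
#     for gap, line in find_gaps(f, threshold=5.0):
#         print(f"{gap:8.3f}s of silence before: {line.rstrip()}")
```

Running this over the full capture would show exactly which request went
unanswered when the stall began.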
* Re: 3.0+ NFS issues
  2012-05-31  6:47 ` Michael Tokarev
@ 2012-05-31 12:59 ` Myklebust, Trond
  0 siblings, 0 replies; 27+ messages in thread

From: Myklebust, Trond @ 2012-05-31 12:59 UTC (permalink / raw)
To: Michael Tokarev; +Cc: J. Bruce Fields, linux-nfs, Linux-kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 3347 bytes --]

On Thu, 2012-05-31 at 10:47 +0400, Michael Tokarev wrote:

> 10:38:53.827355 IP (tos 0x0, ttl 64, id 35160, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1396548098 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827379 IP (tos 0x0, ttl 64, id 35161, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1413325314 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827385 IP (tos 0x0, ttl 64, id 35162, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1430102530 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827400 IP (tos 0x0, ttl 64, id 35163, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1446879746 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827406 IP (tos 0x0, ttl 64, id 35164, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1463656962 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827409 IP (tos 0x0, ttl 64, id 35165, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1480434178 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827413 IP (tos 0x0, ttl 64, id 35166, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1497211394 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827417 IP (tos 0x0, ttl 64, id 35167, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1513988610 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827420 IP (tos 0x0, ttl 64, id 35168, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1530765826 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827424 IP (tos 0x0, ttl 64, id 35169, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1547543042 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827427 IP (tos 0x0, ttl 64, id 35170, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1564320258 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827434 IP (tos 0x0, ttl 64, id 35171, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1581097474 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827438 IP (tos 0x0, ttl 64, id 35172, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1597874690 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827443 IP (tos 0x0, ttl 64, id 35173, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1614651906 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827447 IP (tos 0x0, ttl 64, id 35174, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1631429122 > 192.168.88.4.2049: 212 getattr fh 0,0/22

That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3.

Can you either please use wireshark to provide a full text dump (using
something like 'tshark -V -O nfs,rpc'), or just send us the binary
tcpdump output using 'tcpdump -w /tmp/foo -s 90000'?

> >> Can at least the client be made interruptible?  Mounting with
> >> -o intr,soft makes no visible difference...
> 
> please? :)

It already is interruptible: try 'kill -9' or any other fatal signal.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 27+ messages in thread
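[Editorial aside, not part of the thread: one detail consistent with the
misinterpretation Trond points out is that the implausibly large client
"port" numbers in the trace differ by exactly 2^24 from one request to
the next - which is what a sequential 32-bit counter, such as an RPC
XID, looks like when printed in the wrong byte order.  A quick sketch:]

```python
import struct

def byteswap32(n):
    """Reinterpret an unsigned 32-bit value with the opposite byte order."""
    return struct.unpack("<I", struct.pack(">I", n))[0]

# The "port" numbers tcpdump printed for the first few getattr calls:
ports = [1396548098, 1413325314, 1430102530, 1446879746]
swapped = [byteswap32(p) for p in ports]
# After the byte swap the values come out consecutive, as a
# per-request counter would.
print(swapped)
```

This is only a plausibility check, of course; the tshark dump Trond
asks for would show the actual RPC fields.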
* Re: 3.0+ NFS issues 2012-05-31 12:59 ` Myklebust, Trond (?) @ 2012-05-31 13:24 ` Michael Tokarev 2012-05-31 13:46 ` Myklebust, Trond -1 siblings, 1 reply; 27+ messages in thread From: Michael Tokarev @ 2012-05-31 13:24 UTC (permalink / raw) To: Myklebust, Trond; +Cc: J. Bruce Fields, linux-nfs, Linux-kernel On 31.05.2012 16:59, Myklebust, Trond wrote: [] > That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3. Oh. > Can you either please use wireshark to provide a full text dump (using > something like 'tshark -V -O nfs,rpc'), or just send us the binary > tcpdump output using 'tcpdump -w /tmp/foo -s 90000'? I started tcpdump: tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump on the client (192.168.88.2). Next I mounted a directory on the client, and started reading (tar'ing) a directory into /dev/null. It captured a few stalls. Tcpdump shows number of packets it got, the stalls are at packet counts 58090, 97069 and 97071. I cancelled the capture after that. The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz , it is 220Mb uncompressed and 1.3Mb compressed. The source files are 10 files of 1Gb each, all made by using `truncate' utility, so does not take place on disk at all. This also makes it obvious that the issue does not depend on the speed of disk on the server (since in this case, the server disk isn't even in use). What I noticed is: with default 8 NFSD threads, it takes quite a bit longer to hit this issue, but it still happens. The capture was done using 2 threads. >>>> Can at least the client be made interruptible? Mounting with >>>> -o intr,soft makes no visible difference... >> >> please? :) > > It already is interruptible: try 'kill -9' or any other fatal signal. Oh. Indeed, I can kill it from another terminal using -9. That is great already! Thank you! /mjt ^ permalink raw reply [flat|nested] 27+ messages in thread
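For reference, the server-side part of this setup (ten 1GB test files made with the truncate utility) can be sketched as follows. The directory name is illustrative, and the client-side capture and read commands from the message are repeated as comments, since they need the real mount point and interface:

```shell
# Sketch of the test setup described above. The directory is illustrative;
# the point is that truncate-made files are sparse, so the server's disk
# is not involved when they are read back.
d=$(mktemp -d)

for i in $(seq 1 10); do
    truncate -s 1G "$d/file$i"     # 1GB apparent size, ~0 blocks on disk
done

du --apparent-size -sh "$d"        # ~10G apparent
du -sh "$d"                        # only a few KB actually allocated

# Client side (192.168.88.2), as in the message above:
#   tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump &
#   tar cf /dev/null /path/to/mounted/dir
```

Reading these files over NFS exercises the READ path without touching the server's storage, which is why the stalls cannot be blamed on disk latency.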
* Re: 3.0+ NFS issues 2012-05-31 13:24 ` Michael Tokarev @ 2012-05-31 13:46 ` Myklebust, Trond 0 siblings, 0 replies; 27+ messages in thread From: Myklebust, Trond @ 2012-05-31 13:46 UTC (permalink / raw) To: Michael Tokarev; +Cc: J. Bruce Fields, linux-nfs, Linux-kernel On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote: > On 31.05.2012 16:59, Myklebust, Trond wrote: > [] > > That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3. > > Oh. > > > Can you either please use wireshark to provide a full text dump (using > > something like 'tshark -V -O nfs,rpc'), or just send us the binary > > tcpdump output using 'tcpdump -w /tmp/foo -s 90000'? > > I started tcpdump: > > tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump > > on the client (192.168.88.2). Next I mounted a directory on the client, > and started reading (tar'ing) a directory into /dev/null. It captured a > few stalls. Tcpdump shows number of packets it got, the stalls are at > packet counts 58090, 97069 and 97071. I cancelled the capture after that. > > The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz , > it is 220Mb uncompressed and 1.3Mb compressed. The source files are > 10 files of 1Gb each, all made by using `truncate' utility, so does not > take place on disk at all. This also makes it obvious that the issue > does not depend on the speed of disk on the server (since in this case, > the server disk isn't even in use). OK. So from the above file it looks as if the traffic is mainly READ requests. In 2 places the server stops responding. In both cases, the client seems to be sending a single TCP frame containing several COMPOUNDS containing READ requests (which should be legal) just prior to the hang.
When the server doesn't respond, the client pings it with a RENEW, before it ends up severing the TCP connection and then retransmitting. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues 2012-05-31 13:46 ` Myklebust, Trond (?) @ 2012-05-31 13:51 ` Michael Tokarev 2012-06-20 12:52 ` Christoph Bartoschek 2012-07-10 12:52 ` Michael Tokarev -1 siblings, 2 replies; 27+ messages in thread From: Michael Tokarev @ 2012-05-31 13:51 UTC (permalink / raw) To: Myklebust, Trond; +Cc: J. Bruce Fields, linux-nfs, Linux-kernel On 31.05.2012 17:46, Myklebust, Trond wrote: > On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote: [] >> I started tcpdump: >> >> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump >> >> on the client (192.168.88.2). Next I mounted a directory on the client, >> and started reading (tar'ing) a directory into /dev/null. It captured a >> few stalls. Tcpdump shows number of packets it got, the stalls are at >> packet counts 58090, 97069 and 97071. I cancelled the capture after that. >> >> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz , >> it is 220Mb uncompressed and 1.3Mb compressed. The source files are >> 10 files of 1Gb each, all made by using `truncate' utility, so does not >> take place on disk at all. This also makes it obvious that the issue >> does not depend on the speed of disk on the server (since in this case, >> the server disk isn't even in use). > > OK. So from the above file it looks as if the traffic is mainly READ > requests. The issue here happens only with reads. > In 2 places the server stops responding. In both cases, the client seems > to be sending a single TCP frame containing several COMPOUNDS containing > READ requests (which should be legal) just prior to the hang. When the > server doesn't respond, the client pings it with a RENEW, before it ends > up severing the TCP connection and then retransmitting. 
And sometimes -- speaking only from the behaviour I've seen, not from the actual frames sent -- the server does not respond to the RENEW either, in which case the client reports "nfs server not responding", and on the next renew it may actually respond. This happens too, but much more rarely. During these stalls, i.e. when there's no network activity at all, the server NFSD threads are busy eating all available CPU. What does it all tell us? :) Thank you! /mjt ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues 2012-05-31 13:51 ` Michael Tokarev @ 2012-06-20 12:52 ` Christoph Bartoschek 2012-07-10 12:52 ` Michael Tokarev 1 sibling, 0 replies; 27+ messages in thread From: Christoph Bartoschek @ 2012-06-20 12:52 UTC (permalink / raw) To: linux-nfs >> In 2 places the server stops responding. In both cases, the client seems >> to be sending a single TCP frame containing several COMPOUNDS containing >> READ requests (which should be legal) just prior to the hang. When the >> server doesn't respond, the client pings it with a RENEW, before it ends >> up severing the TCP connection and then retransmitting. > > And sometimes -- speaking only from the behavour I've seen, not from the > actual frames sent -- server does not respond to the RENEW too, in which > case the client reports "nfs server no responding", and on the next > renew it may actually respond. This happens too, but much more rare. > > During these stalls, ie, when there's no network activity at all, > the server NFSD threads are busy eating all available CPU. > > What does it all tell us? :) Hi, is there any news on this issue? We see bad NFS performance on systems using Ubuntu 12.04 compared to similar-looking systems using Ubuntu 10.04. There are also two bug reports for Ubuntu with some information: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1006446 https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/879334 Christoph ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues 2012-05-31 13:51 ` Michael Tokarev 2012-06-20 12:52 ` Christoph Bartoschek @ 2012-07-10 12:52 ` Michael Tokarev 2012-07-12 12:53 ` J. Bruce Fields 1 sibling, 1 reply; 27+ messages in thread From: Michael Tokarev @ 2012-07-10 12:52 UTC (permalink / raw) To: Myklebust, Trond; +Cc: J. Bruce Fields, linux-nfs, Linux-kernel I tried to debug this again, maybe to reproduce in a virtual machine, and found out that it is only the 32bit server code that shows this issue: after updating the kernel on the server to 64bit (the same version) I can't reproduce this issue anymore. Rebooting back to 32bit, and voila, it is here again. Something apparently isn't right on 32bits... ;) (And yes, the prob is still present and is very annoying :) Thanks, /mjt On 31.05.2012 17:51, Michael Tokarev wrote: > On 31.05.2012 17:46, Myklebust, Trond wrote: >> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote: > [] >>> I started tcpdump: >>> >>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump >>> >>> on the client (192.168.88.2). Next I mounted a directory on the client, >>> and started reading (tar'ing) a directory into /dev/null. It captured a >>> few stalls. Tcpdump shows number of packets it got, the stalls are at >>> packet counts 58090, 97069 and 97071. I cancelled the capture after that. >>> >>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz , >>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are >>> 10 files of 1Gb each, all made by using `truncate' utility, so does not >>> take place on disk at all. This also makes it obvious that the issue >>> does not depend on the speed of disk on the server (since in this case, >>> the server disk isn't even in use). >> >> OK. So from the above file it looks as if the traffic is mainly READ >> requests. > > The issue here happens only with reads. > >> In 2 places the server stops responding.
In both cases, the client seems >> to be sending a single TCP frame containing several COMPOUNDS containing >> READ requests (which should be legal) just prior to the hang. When the >> server doesn't respond, the client pings it with a RENEW, before it ends >> up severing the TCP connection and then retransmitting. > > And sometimes -- speaking only from the behavour I've seen, not from the > actual frames sent -- server does not respond to the RENEW too, in which > case the client reports "nfs server no responding", and on the next > renew it may actually respond. This happens too, but much more rare. > > During these stalls, ie, when there's no network activity at all, > the server NFSD threads are busy eating all available CPU. > > What does it all tell us? :) > > Thank you! > > /mjt > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues 2012-07-10 12:52 ` Michael Tokarev @ 2012-07-12 12:53 ` J. Bruce Fields 2012-08-17 1:56 ` 3.0+ NFS issues (bisected) Michael Tokarev 0 siblings, 1 reply; 27+ messages in thread From: J. Bruce Fields @ 2012-07-12 12:53 UTC (permalink / raw) To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote: > I tried to debug this again, maybe to reproduce in a virtual machine, > and found out that it is only 32bit server code shows this issue: > after updating the kernel on the server to 64bit (the same version) > I can't reproduce this issue anymore. Rebooting back to 32bit, > and voila, it is here again. > > Something apparenlty isn't right on 32bits... ;) > > (And yes, the prob is still present and is very annoying :) OK, that's very useful, thanks. So probably a bug got introduced in the 32-bit case between 2.6.32 and 3.0. My personal upstream testing is normally all x86_64 only. I'll kick off a 32-bit install and see if I can reproduce this quickly. Let me know if you're able to narrow this down any more. --b. > > Thanks, > > /mjt > > > On 31.05.2012 17:51, Michael Tokarev wrote: > > On 31.05.2012 17:46, Myklebust, Trond wrote: > >> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote: > > [] > >>> I started tcpdump: > >>> > >>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump > >>> > >>> on the client (192.168.88.2). Next I mounted a directory on the client, > >>> and started reading (tar'ing) a directory into /dev/null. It captured a > >>> few stalls. Tcpdump shows number of packets it got, the stalls are at > >>> packet counts 58090, 97069 and 97071. I cancelled the capture after that. > >>> > >>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz , > >>> it is 220Mb uncompressed and 1.3Mb compressed. 
The source files are > >>> 10 files of 1Gb each, all made by using `truncate' utility, so does not > >>> take place on disk at all. This also makes it obvious that the issue > >>> does not depend on the speed of disk on the server (since in this case, > >>> the server disk isn't even in use). > >> > >> OK. So from the above file it looks as if the traffic is mainly READ > >> requests. > > > > The issue here happens only with reads. > > > >> In 2 places the server stops responding. In both cases, the client seems > >> to be sending a single TCP frame containing several COMPOUNDS containing > >> READ requests (which should be legal) just prior to the hang. When the > >> server doesn't respond, the client pings it with a RENEW, before it ends > >> up severing the TCP connection and then retransmitting. > > > > And sometimes -- speaking only from the behavour I've seen, not from the > > actual frames sent -- server does not respond to the RENEW too, in which > > case the client reports "nfs server no responding", and on the next > > renew it may actually respond. This happens too, but much more rare. > > > > During these stalls, ie, when there's no network activity at all, > > the server NFSD threads are busy eating all available CPU. > > > > What does it all tell us? :) > > > > Thank you! > > > > /mjt > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected) 2012-07-12 12:53 ` J. Bruce Fields @ 2012-08-17 1:56 ` Michael Tokarev 2012-08-17 14:56 ` J. Bruce Fields 0 siblings, 1 reply; 27+ messages in thread From: Michael Tokarev @ 2012-08-17 1:56 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet On 12.07.2012 16:53, J. Bruce Fields wrote: > On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote: >> I tried to debug this again, maybe to reproduce in a virtual machine, >> and found out that it is only 32bit server code shows this issue: >> after updating the kernel on the server to 64bit (the same version) >> I can't reproduce this issue anymore. Rebooting back to 32bit, >> and voila, it is here again. >> >> Something apparenlty isn't right on 32bits... ;) >> >> (And yes, the prob is still present and is very annoying :) > > OK, that's very useful, thanks. So probably a bug got introduced in the > 32-bit case between 2.6.32 and 3.0. > > My personal upstream testing is normally all x86_64 only. I'll kick off > a 32-bit install and see if I can reproduce this quickly. Actually it has nothing to do with 32 vs 64 bits as I initially thought. It happens on 64bits too, but takes more time (or data to transfer) to trigger. > Let me know if you're able to narrow this down any more. I bisected this issue to the following commit: commit f03d78db65085609938fdb686238867e65003181 Author: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu Jul 7 00:27:05 2011 -0700 net: refine {udp|tcp|sctp}_mem limits Current tcp/udp/sctp global memory limits are not taking into account hugepages allocations, and allow 50% of ram to be used by buffers of a single protocol [ not counting space used by sockets / inodes ...] Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram per protocol, and a minimum of 128 pages. Heavy duty machines sysadmins probably need to tweak limits anyway. 
Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes the behaviour here. This machine has 4Gb of memory. On 3.0, with this patch applied (as it is part of 3.0), tcp_mem is like this: 21228 28306 42456 with this patch reverted, tcp_mem shows: 81216 108288 162432 and with these values, it works fine. So it looks like something else goes wrong there, which leads to all nfsds fighting with each other for something and eating 100% of available CPU instead of servicing clients. For added fun, when setting tcp_mem to the "good" value from the "bad" value (after booting into kernel with that patch applied), the problem is _not_ fixed. Any further hints? Thanks, /mjt >> On 31.05.2012 17:51, Michael Tokarev wrote: >>> On 31.05.2012 17:46, Myklebust, Trond wrote: >>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote: >>> [] >>>>> I started tcpdump: >>>>> >>>>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump >>>>> >>>>> on the client (192.168.88.2). Next I mounted a directory on the client, >>>>> and started reading (tar'ing) a directory into /dev/null. It captured a >>>>> few stalls. Tcpdump shows number of packets it got, the stalls are at >>>>> packet counts 58090, 97069 and 97071. I cancelled the capture after that. >>>>> >>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz , >>>>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are >>>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not >>>>> take place on disk at all. This also makes it obvious that the issue >>>>> does not depend on the speed of disk on the server (since in this case, >>>>> the server disk isn't even in use). >>>> >>>> OK. So from the above file it looks as if the traffic is mainly READ >>>> requests. >>> >>> The issue here happens only with reads. >>> >>>> In 2 places the server stops responding.
In both cases, the client seems >>>> to be sending a single TCP frame containing several COMPOUNDS containing >>>> READ requests (which should be legal) just prior to the hang. When the >>>> server doesn't respond, the client pings it with a RENEW, before it ends >>>> up severing the TCP connection and then retransmitting. >>> >>> And sometimes -- speaking only from the behavour I've seen, not from the >>> actual frames sent -- server does not respond to the RENEW too, in which >>> case the client reports "nfs server no responding", and on the next >>> renew it may actually respond. This happens too, but much more rare. >>> >>> During these stalls, ie, when there's no network activity at all, >>> the server NFSD threads are busy eating all available CPU. >>> >>> What does it all tell us? :) >>> >>> Thank you! >>> >>> /mjt >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> Please read the FAQ at http://www.tux.org/lkml/ >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 27+ messages in thread
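For anyone comparing their own machine, the limits in question live in a single sysctl; a minimal sketch (the numbers are the ones quoted above; writing needs root, and note the report's caveat that raising the values on an already-affected kernel did not fix it):

```shell
# The three values are min / pressure / max, counted in pages
# (typically 4KB each).
cat /proc/sys/net/ipv4/tcp_mem

# Setting the pre-f03d78db values at runtime (root required). Per the
# report this did NOT un-wedge an affected server; only booting a kernel
# that computed the larger defaults did.
#   echo "81216 108288 162432" > /proc/sys/net/ipv4/tcp_mem
```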
* Re: 3.0+ NFS issues (bisected) 2012-08-17 1:56 ` 3.0+ NFS issues (bisected) Michael Tokarev @ 2012-08-17 14:56 ` J. Bruce Fields 2012-08-17 16:00 ` J. Bruce Fields 0 siblings, 1 reply; 27+ messages in thread From: J. Bruce Fields @ 2012-08-17 14:56 UTC (permalink / raw) To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet On Fri, Aug 17, 2012 at 05:56:56AM +0400, Michael Tokarev wrote: > On 12.07.2012 16:53, J. Bruce Fields wrote: > > On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote: > >> I tried to debug this again, maybe to reproduce in a virtual machine, > >> and found out that it is only 32bit server code shows this issue: > >> after updating the kernel on the server to 64bit (the same version) > >> I can't reproduce this issue anymore. Rebooting back to 32bit, > >> and voila, it is here again. > >> > >> Something apparenlty isn't right on 32bits... ;) > >> > >> (And yes, the prob is still present and is very annoying :) > > > > OK, that's very useful, thanks. So probably a bug got introduced in the > > 32-bit case between 2.6.32 and 3.0. > > > > My personal upstream testing is normally all x86_64 only. I'll kick off > > a 32-bit install and see if I can reproduce this quickly. > > Actually it has nothing to do with 32 vs 64 bits as I > initially thought. It happens on 64bits too, but takes > more time (or data to transfer) to trigger. That makes it sound like some kind of leak: you're hitting this case eventually either way, but it takes longer in the case where you have more (low) memory. I wish I was more familiar with the tcp code.... What number exactly is being compared against those limits, and how could we watch it from userspace? --b. > > > > Let me know if you're able to narrow this down any more. 
> > I bisected this issue to the following commit: > > commit f03d78db65085609938fdb686238867e65003181 > Author: Eric Dumazet <eric.dumazet@gmail.com> > Date: Thu Jul 7 00:27:05 2011 -0700 > > net: refine {udp|tcp|sctp}_mem limits > > Current tcp/udp/sctp global memory limits are not taking into account > hugepages allocations, and allow 50% of ram to be used by buffers of a > single protocol [ not counting space used by sockets / inodes ...] > > Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram > per protocol, and a minimum of 128 pages. > Heavy duty machines sysadmins probably need to tweak limits anyway. > > > Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes > the behavour here. > > This machine has 4Gb of memory. On 3.0, with this patch applied > (as it is part of 3.0), tcp_mem is like this: > > 21228 28306 42456 > > with this patch reverted, tcp_mem shows: > > 81216 108288 162432 > > and with these values, it works fine. > > So it looks like something else goes wrong there, > which lead to all nfsds fighting with each other > for something and eating 100% of available CPU > instead of servicing clients. > > For added fun, when setting tcp_mem to the "good" value > from "bad" value (after booting into kernel with that > patch applied), the problem is _not_ fixed. > > Any further hints? > > Thanks, > > /mjt > > >> On 31.05.2012 17:51, Michael Tokarev wrote: > >>> On 31.05.2012 17:46, Myklebust, Trond wrote: > >>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote: > >>> [] > >>>>> I started tcpdump: > >>>>> > >>>>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump > >>>>> > >>>>> on the client (192.168.88.2). Next I mounted a directory on the client, > >>>>> and started reading (tar'ing) a directory into /dev/null. It captured a > >>>>> few stalls. Tcpdump shows number of packets it got, the stalls are at > >>>>> packet counts 58090, 97069 and 97071. 
I cancelled the capture after that. > >>>>> > >>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz , > >>>>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are > >>>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not > >>>>> take place on disk at all. This also makes it obvious that the issue > >>>>> does not depend on the speed of disk on the server (since in this case, > >>>>> the server disk isn't even in use). > >>>> > >>>> OK. So from the above file it looks as if the traffic is mainly READ > >>>> requests. > >>> > >>> The issue here happens only with reads. > >>> > >>>> In 2 places the server stops responding. In both cases, the client seems > >>>> to be sending a single TCP frame containing several COMPOUNDS containing > >>>> READ requests (which should be legal) just prior to the hang. When the > >>>> server doesn't respond, the client pings it with a RENEW, before it ends > >>>> up severing the TCP connection and then retransmitting. > >>> > >>> And sometimes -- speaking only from the behavour I've seen, not from the > >>> actual frames sent -- server does not respond to the RENEW too, in which > >>> case the client reports "nfs server no responding", and on the next > >>> renew it may actually respond. This happens too, but much more rare. > >>> > >>> During these stalls, ie, when there's no network activity at all, > >>> the server NFSD threads are busy eating all available CPU. > >>> > >>> What does it all tell us? :) > >>> > >>> Thank you! 
> >>> > >>> /mjt > >>> -- > >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >>> the body of a message to majordomo@vger.kernel.org > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>> Please read the FAQ at http://www.tux.org/lkml/ > >> > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected) 2012-08-17 14:56 ` J. Bruce Fields @ 2012-08-17 16:00 ` J. Bruce Fields 2012-08-17 17:12 ` Michael Tokarev 0 siblings, 1 reply; 27+ messages in thread From: J. Bruce Fields @ 2012-08-17 16:00 UTC (permalink / raw) To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet On Fri, Aug 17, 2012 at 10:56:16AM -0400, J. Bruce Fields wrote: > On Fri, Aug 17, 2012 at 05:56:56AM +0400, Michael Tokarev wrote: > > On 12.07.2012 16:53, J. Bruce Fields wrote: > > > On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote: > > >> I tried to debug this again, maybe to reproduce in a virtual machine, > > >> and found out that it is only 32bit server code shows this issue: > > >> after updating the kernel on the server to 64bit (the same version) > > >> I can't reproduce this issue anymore. Rebooting back to 32bit, > > >> and voila, it is here again. > > >> > > >> Something apparenlty isn't right on 32bits... ;) > > >> > > >> (And yes, the prob is still present and is very annoying :) > > > > > > OK, that's very useful, thanks. So probably a bug got introduced in the > > > 32-bit case between 2.6.32 and 3.0. > > > > > > My personal upstream testing is normally all x86_64 only. I'll kick off > > > a 32-bit install and see if I can reproduce this quickly. > > > > Actually it has nothing to do with 32 vs 64 bits as I > > initially thought. It happens on 64bits too, but takes > > more time (or data to transfer) to trigger. > > That makes it sound like some kind of leak: you're hitting this case > eventually either way, but it takes longer in the case where you have > more (low) memory. > > I wish I was more familiar with the tcp code.... What number exactly is > being compared against those limits, and how could we watch it from > userspace? Uh, if I grepped my way through this right: it looks like it's the "memory" column of the "TCP" row of /proc/net/protocols; might be interesting to see how that's changing over time. 
> > --b. > > > > > > > > Let me know if you're able to narrow this down any more. > > > > I bisected this issue to the following commit: > > > > commit f03d78db65085609938fdb686238867e65003181 > > Author: Eric Dumazet <eric.dumazet@gmail.com> > > Date: Thu Jul 7 00:27:05 2011 -0700 > > > > net: refine {udp|tcp|sctp}_mem limits > > > > Current tcp/udp/sctp global memory limits are not taking into account > > hugepages allocations, and allow 50% of ram to be used by buffers of a > > single protocol [ not counting space used by sockets / inodes ...] > > > > Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram > > per protocol, and a minimum of 128 pages. > > Heavy duty machines sysadmins probably need to tweak limits anyway. > > > > > > Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes > > the behavour here. > > > > This machine has 4Gb of memory. On 3.0, with this patch applied > > (as it is part of 3.0), tcp_mem is like this: > > > > 21228 28306 42456 > > > > with this patch reverted, tcp_mem shows: > > > > 81216 108288 162432 > > > > and with these values, it works fine. > > > > So it looks like something else goes wrong there, > > which lead to all nfsds fighting with each other > > for something and eating 100% of available CPU > > instead of servicing clients. > > > > For added fun, when setting tcp_mem to the "good" value > > from "bad" value (after booting into kernel with that > > patch applied), the problem is _not_ fixed. > > > > Any further hints? > > > > Thanks, > > > > /mjt > > > > >> On 31.05.2012 17:51, Michael Tokarev wrote: > > >>> On 31.05.2012 17:46, Myklebust, Trond wrote: > > >>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote: > > >>> [] > > >>>>> I started tcpdump: > > >>>>> > > >>>>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump > > >>>>> > > >>>>> on the client (192.168.88.2). 
Next I mounted a directory on the client, > > >>>>> and started reading (tar'ing) a directory into /dev/null. It captured a > > >>>>> few stalls. Tcpdump shows number of packets it got, the stalls are at > > >>>>> packet counts 58090, 97069 and 97071. I cancelled the capture after that. > > >>>>> > > >>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz , > > >>>>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are > > >>>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not > > >>>>> take place on disk at all. This also makes it obvious that the issue > > >>>>> does not depend on the speed of disk on the server (since in this case, > > >>>>> the server disk isn't even in use). > > >>>> > > >>>> OK. So from the above file it looks as if the traffic is mainly READ > > >>>> requests. > > >>> > > >>> The issue here happens only with reads. > > >>> > > >>>> In 2 places the server stops responding. In both cases, the client seems > > >>>> to be sending a single TCP frame containing several COMPOUNDS containing > > >>>> READ requests (which should be legal) just prior to the hang. When the > > >>>> server doesn't respond, the client pings it with a RENEW, before it ends > > >>>> up severing the TCP connection and then retransmitting. > > >>> > > >>> And sometimes -- speaking only from the behavour I've seen, not from the > > >>> actual frames sent -- server does not respond to the RENEW too, in which > > >>> case the client reports "nfs server no responding", and on the next > > >>> renew it may actually respond. This happens too, but much more rare. > > >>> > > >>> During these stalls, ie, when there's no network activity at all, > > >>> the server NFSD threads are busy eating all available CPU. > > >>> > > >>> What does it all tell us? :) > > >>> > > >>> Thank you! 
> > >>> > > >>> /mjt > > >>> -- > > >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > >>> the body of a message to majordomo@vger.kernel.org > > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > > >>> Please read the FAQ at http://www.tux.org/lkml/ > > >> > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > Please read the FAQ at http://www.tux.org/lkml/ > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected)
  2012-08-17 16:00 ` J. Bruce Fields
@ 2012-08-17 17:12 ` Michael Tokarev
  2012-08-17 17:18 ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread
From: Michael Tokarev @ 2012-08-17 17:12 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On 17.08.2012 20:00, J. Bruce Fields wrote:
[]
> Uh, if I grepped my way through this right: it looks like it's the
> "memory" column of the "TCP" row of /proc/net/protocols; might be
> interesting to see how that's changing over time.

This file does not look interesting. Memory usage does not jump,
there's no high increase either.

But there's something else which is interesting here.

I noticed that in perf top, the top consumer of CPU is svc_recv()
(I mentioned this in the start of this thread). So I looked how
this routine is called from nfsd. And here we go.

fs/nfsd/nfssvc.c:

/*
 * This is the NFS server kernel thread
 */
static int
nfsd(void *vrqstp)
{
...
	/*
	 * The main request loop
	 */
	for (;;) {
		/*
		 * Find a socket with data available and call its
		 * recvfrom routine.
		 */
		int i = 0;
		while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
			++i;
		printk(KERN_ERR "calling svc_recv: %d times (err=%d)\n", i, err);
		if (err == -EINTR)
			break;
...

(I added the "i" counter and the printk).
And here's the output:

[19626.401136] calling svc_recv: 0 times (err=212)
[19626.405059] calling svc_recv: 1478 times (err=212)
[19626.409512] calling svc_recv: 1106 times (err=212)
[19626.543020] calling svc_recv: 0 times (err=212)
[19626.543059] calling svc_recv: 0 times (err=212)
[19626.548074] calling svc_recv: 0 times (err=212)
[19626.549515] calling svc_recv: 0 times (err=212)
[19626.552320] calling svc_recv: 0 times (err=212)
[19626.553503] calling svc_recv: 0 times (err=212)
[19626.556007] calling svc_recv: 0 times (err=212)
[19626.557152] calling svc_recv: 0 times (err=212)
[19626.560109] calling svc_recv: 0 times (err=212)
[19626.560943] calling svc_recv: 0 times (err=212)
[19626.565315] calling svc_recv: 1067 times (err=212)
[19626.569735] calling svc_recv: 2571 times (err=212)
[19626.574150] calling svc_recv: 3842 times (err=212)
[19626.581914] calling svc_recv: 2891 times (err=212)
[19626.583072] calling svc_recv: 1247 times (err=212)
[19626.616885] calling svc_recv: 0 times (err=212)
[19626.616952] calling svc_recv: 0 times (err=212)
[19626.622889] calling svc_recv: 0 times (err=212)
[19626.624518] calling svc_recv: 0 times (err=212)
[19626.627118] calling svc_recv: 0 times (err=212)
[19626.629735] calling svc_recv: 0 times (err=212)
[19626.631777] calling svc_recv: 0 times (err=212)
[19626.633986] calling svc_recv: 0 times (err=212)
[19626.636746] calling svc_recv: 0 times (err=212)
[19626.637692] calling svc_recv: 0 times (err=212)
[19626.640769] calling svc_recv: 0 times (err=212)
[19626.657852] calling svc_recv: 0 times (err=212)
[19626.661602] calling svc_recv: 0 times (err=212)
[19626.670160] calling svc_recv: 0 times (err=212)
[19626.671917] calling svc_recv: 0 times (err=212)
[19626.684643] calling svc_recv: 0 times (err=212)
[19626.684680] calling svc_recv: 0 times (err=212)
[19626.812820] calling svc_recv: 0 times (err=212)
[19626.814697] calling svc_recv: 0 times (err=212)
[19626.817195] calling svc_recv: 0 times (err=212)
[19626.820324] calling svc_recv: 0 times (err=212)
[19626.822855] calling svc_recv: 0 times (err=212)
[19626.824823] calling svc_recv: 0 times (err=212)
[19626.828016] calling svc_recv: 0 times (err=212)
[19626.829021] calling svc_recv: 0 times (err=212)
[19626.831970] calling svc_recv: 0 times (err=212)

> the stall begins:
[19686.823135] calling svc_recv: 3670352 times (err=212)
[19686.823524] calling svc_recv: 3659205 times (err=212)

> transfer continues
[19686.854734] calling svc_recv: 0 times (err=212)
[19686.860023] calling svc_recv: 0 times (err=212)
[19686.887124] calling svc_recv: 0 times (err=212)
[19686.895532] calling svc_recv: 0 times (err=212)
[19686.903667] calling svc_recv: 0 times (err=212)
[19686.922780] calling svc_recv: 0 times (err=212)

So we're calling svc_recv in a tight loop, eating all
available CPU. (The above is with just 2 nfsd threads).

Something is definitely wrong here. And it happens much more
often after the mentioned commit (f03d78db65085).

Thanks,

/mjt

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected) 2012-08-17 17:12 ` Michael Tokarev @ 2012-08-17 17:18 ` J. Bruce Fields 2012-08-17 17:26 ` Michael Tokarev 0 siblings, 1 reply; 27+ messages in thread From: J. Bruce Fields @ 2012-08-17 17:18 UTC (permalink / raw) To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote: > On 17.08.2012 20:00, J. Bruce Fields wrote: > []> Uh, if I grepped my way through this right: it looks like it's the > > "memory" column of the "TCP" row of /proc/net/protocols; might be > > interesting to see how that's changing over time. > > This file does not look interesting. Memory usage does not jump, > there's no high increase either. > > But there's something else which is interesting here. > > I noticed that in perf top, the top consumer of CPU is svc_recv() > (I mentioned this in the start of this thread). So I looked how > this routine is called from nfsd. And here we go. > > fs/nfsd/nfssvc.c: > > /* > * This is the NFS server kernel thread > */ > static int > nfsd(void *vrqstp) > { > ... > /* > * The main request loop > */ > for (;;) { > /* > * Find a socket with data available and call its > * recvfrom routine. > */ > int i = 0; > while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN) > ++i; > printk(KERN_ERR "calling svc_recv: %d times (err=%d)\n", i, err); > if (err == -EINTR) > break; > ... > > (I added the "i" counter and the printk). 
And here's the output: > > [19626.401136] calling svc_recv: 0 times (err=212) > [19626.405059] calling svc_recv: 1478 times (err=212) > [19626.409512] calling svc_recv: 1106 times (err=212) > [19626.543020] calling svc_recv: 0 times (err=212) > [19626.543059] calling svc_recv: 0 times (err=212) > [19626.548074] calling svc_recv: 0 times (err=212) > [19626.549515] calling svc_recv: 0 times (err=212) > [19626.552320] calling svc_recv: 0 times (err=212) > [19626.553503] calling svc_recv: 0 times (err=212) > [19626.556007] calling svc_recv: 0 times (err=212) > [19626.557152] calling svc_recv: 0 times (err=212) > [19626.560109] calling svc_recv: 0 times (err=212) > [19626.560943] calling svc_recv: 0 times (err=212) > [19626.565315] calling svc_recv: 1067 times (err=212) > [19626.569735] calling svc_recv: 2571 times (err=212) > [19626.574150] calling svc_recv: 3842 times (err=212) > [19626.581914] calling svc_recv: 2891 times (err=212) > [19626.583072] calling svc_recv: 1247 times (err=212) > [19626.616885] calling svc_recv: 0 times (err=212) > [19626.616952] calling svc_recv: 0 times (err=212) > [19626.622889] calling svc_recv: 0 times (err=212) > [19626.624518] calling svc_recv: 0 times (err=212) > [19626.627118] calling svc_recv: 0 times (err=212) > [19626.629735] calling svc_recv: 0 times (err=212) > [19626.631777] calling svc_recv: 0 times (err=212) > [19626.633986] calling svc_recv: 0 times (err=212) > [19626.636746] calling svc_recv: 0 times (err=212) > [19626.637692] calling svc_recv: 0 times (err=212) > [19626.640769] calling svc_recv: 0 times (err=212) > [19626.657852] calling svc_recv: 0 times (err=212) > [19626.661602] calling svc_recv: 0 times (err=212) > [19626.670160] calling svc_recv: 0 times (err=212) > [19626.671917] calling svc_recv: 0 times (err=212) > [19626.684643] calling svc_recv: 0 times (err=212) > [19626.684680] calling svc_recv: 0 times (err=212) > [19626.812820] calling svc_recv: 0 times (err=212) > [19626.814697] calling svc_recv: 0 times 
(err=212) > [19626.817195] calling svc_recv: 0 times (err=212) > [19626.820324] calling svc_recv: 0 times (err=212) > [19626.822855] calling svc_recv: 0 times (err=212) > [19626.824823] calling svc_recv: 0 times (err=212) > [19626.828016] calling svc_recv: 0 times (err=212) > [19626.829021] calling svc_recv: 0 times (err=212) > [19626.831970] calling svc_recv: 0 times (err=212) > > > the stall begin: > [19686.823135] calling svc_recv: 3670352 times (err=212) > [19686.823524] calling svc_recv: 3659205 times (err=212) > > > transfer continues > [19686.854734] calling svc_recv: 0 times (err=212) > [19686.860023] calling svc_recv: 0 times (err=212) > [19686.887124] calling svc_recv: 0 times (err=212) > [19686.895532] calling svc_recv: 0 times (err=212) > [19686.903667] calling svc_recv: 0 times (err=212) > [19686.922780] calling svc_recv: 0 times (err=212) > > So we're calling svc_recv in a tight loop, eating > all available CPU. (The above is with just 2 nfsd > threads). > > Something is definitely wrong here. And it happens mure more > often after the mentioned commit (f03d78db65085). Oh, neat. Hm. That commit doesn't really sound like the cause, then. Is that busy-looping reproduceable on kernels before that commit? --b. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected)
  2012-08-17 17:18 ` J. Bruce Fields
@ 2012-08-17 17:26 ` Michael Tokarev
  2012-08-17 17:29 ` Michael Tokarev
  0 siblings, 1 reply; 27+ messages in thread
From: Michael Tokarev @ 2012-08-17 17:26 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On 17.08.2012 21:18, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
[]
>> So we're calling svc_recv in a tight loop, eating
>> all available CPU. (The above is with just 2 nfsd
>> threads).
>>
>> Something is definitely wrong here. And it happens much more
>> often after the mentioned commit (f03d78db65085).
>
> Oh, neat. Hm. That commit doesn't really sound like the cause, then.
> Is that busy-looping reproducible on kernels before that commit?

Note I bisected this issue to this commit. I haven't seen it
happening before this commit, and reverting it from a 3.0 or 3.2
kernel makes the problem go away.

I guess it is looping here:

net/sunrpc/svc_xprt.c:svc_recv()
...
	len = 0;
...
	if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
		...
	} else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {  <=== here -- has no wspace due to memory...
		... len = <something>
	}

	/* No data, incomplete (TCP) read, or accept() */
	if (len == 0 || len == -EAGAIN)
		goto out;
...
out:
	rqstp->rq_res.len = 0;
	svc_xprt_release(rqstp);
	return -EAGAIN;
}

I'm trying to verify this theory...

/mjt

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected) 2012-08-17 17:26 ` Michael Tokarev @ 2012-08-17 17:29 ` Michael Tokarev 2012-08-17 19:18 ` J. Bruce Fields 0 siblings, 1 reply; 27+ messages in thread From: Michael Tokarev @ 2012-08-17 17:29 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet On 17.08.2012 21:26, Michael Tokarev wrote: > On 17.08.2012 21:18, J. Bruce Fields wrote: >> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote: > [] >>> So we're calling svc_recv in a tight loop, eating >>> all available CPU. (The above is with just 2 nfsd >>> threads). >>> >>> Something is definitely wrong here. And it happens mure more >>> often after the mentioned commit (f03d78db65085). >> >> Oh, neat. Hm. That commit doesn't really sound like the cause, then. >> Is that busy-looping reproduceable on kernels before that commit? > > Note I bisected this issue to this commit. I haven't seen it > happening before this commit, and reverting it from 3.0 or 3.2 > kernel makes the problem to go away. > > I guess it is looping there: > > > net/sunrpc/svc_xprt.c:svc_recv() > ... > len = 0; > ... > if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) { > ... > } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) { <=== here -- has no wspace due to memory... > ... len = <something> > } > > /* No data, incomplete (TCP) read, or accept() */ > if (len == 0 || len == -EAGAIN) > goto out; > ... > out: > rqstp->rq_res.len = 0; > svc_xprt_release(rqstp); > return -EAGAIN; > } > > I'm trying to verify this theory... Yes. I inserted a printk there, and all these million times while we're waiting in this EAGAIN loop, this printk is triggering: .... 
[21052.533053] svc_recv: !has_wspace
[21052.533070] svc_recv: !has_wspace
[21052.533087] svc_recv: !has_wspace
[21052.533105] svc_recv: !has_wspace
[21052.533122] svc_recv: !has_wspace
[21052.533139] svc_recv: !has_wspace
[21052.533156] svc_recv: !has_wspace
[21052.533174] svc_recv: !has_wspace
[21052.533191] svc_recv: !has_wspace
[21052.533208] svc_recv: !has_wspace
[21052.533226] svc_recv: !has_wspace
[21052.533244] svc_recv: !has_wspace
[21052.533265] calling svc_recv: 1228163 times (err=-4)
[21052.533403] calling svc_recv: 1226616 times (err=-4)
[21052.534520] nfsd: last server has exited, flushing export cache

(I stopped nfsd since it was flooding the log).

I can only guess that before that commit, we always had space,
now we don't anymore, and are looping like crazy.

Thanks,

/mjt

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected) 2012-08-17 17:29 ` Michael Tokarev @ 2012-08-17 19:18 ` J. Bruce Fields 2012-08-17 20:08 ` J. Bruce Fields 0 siblings, 1 reply; 27+ messages in thread From: J. Bruce Fields @ 2012-08-17 19:18 UTC (permalink / raw) To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet On Fri, Aug 17, 2012 at 09:29:40PM +0400, Michael Tokarev wrote: > On 17.08.2012 21:26, Michael Tokarev wrote: > > On 17.08.2012 21:18, J. Bruce Fields wrote: > >> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote: > > [] > >>> So we're calling svc_recv in a tight loop, eating > >>> all available CPU. (The above is with just 2 nfsd > >>> threads). > >>> > >>> Something is definitely wrong here. And it happens mure more > >>> often after the mentioned commit (f03d78db65085). > >> > >> Oh, neat. Hm. That commit doesn't really sound like the cause, then. > >> Is that busy-looping reproduceable on kernels before that commit? > > > > Note I bisected this issue to this commit. I haven't seen it > > happening before this commit, and reverting it from 3.0 or 3.2 > > kernel makes the problem to go away. > > > > I guess it is looping there: > > > > > > net/sunrpc/svc_xprt.c:svc_recv() > > ... > > len = 0; > > ... > > if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) { > > ... > > } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) { <=== here -- has no wspace due to memory... > > ... len = <something> > > } > > > > /* No data, incomplete (TCP) read, or accept() */ > > if (len == 0 || len == -EAGAIN) > > goto out; > > ... > > out: > > rqstp->rq_res.len = 0; > > svc_xprt_release(rqstp); > > return -EAGAIN; > > } > > > > I'm trying to verify this theory... > > Yes. I inserted a printk there, and all these million times while > we're waiting in this EAGAIN loop, this printk is triggering: > > .... 
> [21052.533053] svc_recv: !has_wspace > [21052.533070] svc_recv: !has_wspace > [21052.533087] svc_recv: !has_wspace > [21052.533105] svc_recv: !has_wspace > [21052.533122] svc_recv: !has_wspace > [21052.533139] svc_recv: !has_wspace > [21052.533156] svc_recv: !has_wspace > [21052.533174] svc_recv: !has_wspace > [21052.533191] svc_recv: !has_wspace > [21052.533208] svc_recv: !has_wspace > [21052.533226] svc_recv: !has_wspace > [21052.533244] svc_recv: !has_wspace > [21052.533265] calling svc_recv: 1228163 times (err=-4) > [21052.533403] calling svc_recv: 1226616 times (err=-4) > [21052.534520] nfsd: last server has exited, flushing export cache > > (I stopped nfsd since it was flooding the log). > > I can only guess that before that commit, we always had space, > now we don't anymore, and are looping like crazy. Thanks! But, arrgh--that should be enough to go on at this point, but I'm not seeing it. If has_wspace is returning false then it's likely also returning false to the call at the start of svc_xprt_enqueue() (see svc_xprt_has_something_to_do), which means the xprt shouldn't be getting requeued and the next svc_recv call should find no socket ready (so svc_xprt_dequeue() returns NULL), and goes to sleep. But clearly it's not working that way.... --b. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected) 2012-08-17 19:18 ` J. Bruce Fields @ 2012-08-17 20:08 ` J. Bruce Fields 2012-08-17 22:32 ` J. Bruce Fields 0 siblings, 1 reply; 27+ messages in thread From: J. Bruce Fields @ 2012-08-17 20:08 UTC (permalink / raw) To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet On Fri, Aug 17, 2012 at 03:18:00PM -0400, J. Bruce Fields wrote: > On Fri, Aug 17, 2012 at 09:29:40PM +0400, Michael Tokarev wrote: > > On 17.08.2012 21:26, Michael Tokarev wrote: > > > On 17.08.2012 21:18, J. Bruce Fields wrote: > > >> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote: > > > [] > > >>> So we're calling svc_recv in a tight loop, eating > > >>> all available CPU. (The above is with just 2 nfsd > > >>> threads). > > >>> > > >>> Something is definitely wrong here. And it happens mure more > > >>> often after the mentioned commit (f03d78db65085). > > >> > > >> Oh, neat. Hm. That commit doesn't really sound like the cause, then. > > >> Is that busy-looping reproduceable on kernels before that commit? > > > > > > Note I bisected this issue to this commit. I haven't seen it > > > happening before this commit, and reverting it from 3.0 or 3.2 > > > kernel makes the problem to go away. > > > > > > I guess it is looping there: > > > > > > > > > net/sunrpc/svc_xprt.c:svc_recv() > > > ... > > > len = 0; > > > ... > > > if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) { > > > ... > > > } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) { <=== here -- has no wspace due to memory... > > > ... len = <something> > > > } > > > > > > /* No data, incomplete (TCP) read, or accept() */ > > > if (len == 0 || len == -EAGAIN) > > > goto out; > > > ... > > > out: > > > rqstp->rq_res.len = 0; > > > svc_xprt_release(rqstp); > > > return -EAGAIN; > > > } > > > > > > I'm trying to verify this theory... > > > > Yes. 
I inserted a printk there, and all these million times while > > we're waiting in this EAGAIN loop, this printk is triggering: > > > > .... > > [21052.533053] svc_recv: !has_wspace > > [21052.533070] svc_recv: !has_wspace > > [21052.533087] svc_recv: !has_wspace > > [21052.533105] svc_recv: !has_wspace > > [21052.533122] svc_recv: !has_wspace > > [21052.533139] svc_recv: !has_wspace > > [21052.533156] svc_recv: !has_wspace > > [21052.533174] svc_recv: !has_wspace > > [21052.533191] svc_recv: !has_wspace > > [21052.533208] svc_recv: !has_wspace > > [21052.533226] svc_recv: !has_wspace > > [21052.533244] svc_recv: !has_wspace > > [21052.533265] calling svc_recv: 1228163 times (err=-4) > > [21052.533403] calling svc_recv: 1226616 times (err=-4) > > [21052.534520] nfsd: last server has exited, flushing export cache > > > > (I stopped nfsd since it was flooding the log). > > > > I can only guess that before that commit, we always had space, > > now we don't anymore, and are looping like crazy. > > Thanks! But, arrgh--that should be enough to go on at this point, but > I'm not seeing it. If has_wspace is returning false then it's likely > also returning false to the call at the start of svc_xprt_enqueue() Wait a minute, that assumption's a problem because that calculation depends in part on xpt_reserved, which is changed here.... In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a lower xpt_reserved value. That could well explain this. --b. > (see > svc_xprt_has_something_to_do), which means the xprt shouldn't be getting > requeued and the next svc_recv call should find no socket ready (so > svc_xprt_dequeue() returns NULL), and goes to sleep. > > But clearly it's not working that way.... > > --b. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected) 2012-08-17 20:08 ` J. Bruce Fields @ 2012-08-17 22:32 ` J. Bruce Fields 2012-08-18 6:49 ` Michael Tokarev 0 siblings, 1 reply; 27+ messages in thread From: J. Bruce Fields @ 2012-08-17 22:32 UTC (permalink / raw) To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote: > Wait a minute, that assumption's a problem because that calculation > depends in part on xpt_reserved, which is changed here.... > > In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which > subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a > lower xpt_reserved value. That could well explain this. So, maybe something like this? --b. commit c8136c319ad85d0db870021fc3f9074d37f26d4a Author: J. Bruce Fields <bfields@redhat.com> Date: Fri Aug 17 17:31:53 2012 -0400 svcrpc: don't add to xpt_reserved till we receive The rpc server tries to ensure that there will be room to send a reply before it receives a request. It does this by tracking, in xpt_reserved, an upper bound on the total size of the replies that is has already committed to for the socket. Currently it is adding in the estimate for a new reply *before* it checks whether there is space available. If it finds that there is not space, it then subtracts the estimate back out. This may lead the subsequent svc_xprt_enqueue to decide that there is space after all. The results is a svc_recv() that will repeatedly return -EAGAIN, causing server threads to loop without doing any actual work. Reported-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: J. 
Bruce Fields <bfields@redhat.com>

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index ec99849a..59ff3a3 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -366,8 +366,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 				rqstp, rqstp->rq_xprt);
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 		pool->sp_stats.threads_woken++;
 		wake_up(&rqstp->rq_wait);
 	} else {
@@ -644,8 +642,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 	if (xprt) {
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 
 		/* As there is a shortage of threads and this request
 		 * had to be queued, don't allow the thread to wait so
@@ -743,6 +739,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 		len = xprt->xpt_ops->xpo_recvfrom(rqstp);
 		dprintk("svc: got len=%d\n", len);
 	}
+	if (len > 0) {
+		rqstp->rq_reserved = serv->sv_max_mesg;
+		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
+	}
 	svc_xprt_received(xprt);
 
 	/* No data, incomplete (TCP) read, or accept() */

^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected) 2012-08-17 22:32 ` J. Bruce Fields @ 2012-08-18 6:49 ` Michael Tokarev 2012-08-18 11:13 ` J. Bruce Fields 0 siblings, 1 reply; 27+ messages in thread From: Michael Tokarev @ 2012-08-18 6:49 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet On 18.08.2012 02:32, J. Bruce Fields wrote: > On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote: >> Wait a minute, that assumption's a problem because that calculation >> depends in part on xpt_reserved, which is changed here.... >> >> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which >> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a >> lower xpt_reserved value. That could well explain this. > > So, maybe something like this? Well. What can I say? With the change below applied (to 3.2 kernel at least), I don't see any stalls or high CPU usage on the server anymore. It survived several multi-gigabyte transfers, for several hours, without any problem. So it is a good step forward ;) But the whole thing seems to be quite a bit fragile. I tried to follow the logic in there, and the thing is quite a bit, well, "twisted", and somewhat difficult to follow. So I don't know if this is the right fix or not. At least it works! :) And I really wonder why no one else reported this problem before. Is me the only one in this world who uses linux nfsd? :) Thank you for all your patience and the proposed fix! /mjt > commit c8136c319ad85d0db870021fc3f9074d37f26d4a > Author: J. Bruce Fields <bfields@redhat.com> > Date: Fri Aug 17 17:31:53 2012 -0400 > > svcrpc: don't add to xpt_reserved till we receive > > The rpc server tries to ensure that there will be room to send a reply > before it receives a request. > > It does this by tracking, in xpt_reserved, an upper bound on the total > size of the replies that is has already committed to for the socket. 
> > Currently it is adding in the estimate for a new reply *before* it > checks whether there is space available. If it finds that there is not > space, it then subtracts the estimate back out. > > This may lead the subsequent svc_xprt_enqueue to decide that there is > space after all. > > The results is a svc_recv() that will repeatedly return -EAGAIN, causing > server threads to loop without doing any actual work. > > Reported-by: Michael Tokarev <mjt@tls.msk.ru> > Signed-off-by: J. Bruce Fields <bfields@redhat.com> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c > index ec99849a..59ff3a3 100644 > --- a/net/sunrpc/svc_xprt.c > +++ b/net/sunrpc/svc_xprt.c > @@ -366,8 +366,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt) > rqstp, rqstp->rq_xprt); > rqstp->rq_xprt = xprt; > svc_xprt_get(xprt); > - rqstp->rq_reserved = serv->sv_max_mesg; > - atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved); > pool->sp_stats.threads_woken++; > wake_up(&rqstp->rq_wait); > } else { > @@ -644,8 +642,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout) > if (xprt) { > rqstp->rq_xprt = xprt; > svc_xprt_get(xprt); > - rqstp->rq_reserved = serv->sv_max_mesg; > - atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved); > > /* As there is a shortage of threads and this request > * had to be queued, don't allow the thread to wait so > @@ -743,6 +739,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout) > len = xprt->xpt_ops->xpo_recvfrom(rqstp); > dprintk("svc: got len=%d\n", len); > } > + if (len > 0) { > + rqstp->rq_reserved = serv->sv_max_mesg; > + atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved); > + } > svc_xprt_received(xprt); > > /* No data, incomplete (TCP) read, or accept() */ > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 
27+ messages in thread
* Re: 3.0+ NFS issues (bisected) 2012-08-18 6:49 ` Michael Tokarev @ 2012-08-18 11:13 ` J. Bruce Fields 2012-08-18 12:58 ` Michael Tokarev 0 siblings, 1 reply; 27+ messages in thread From: J. Bruce Fields @ 2012-08-18 11:13 UTC (permalink / raw) To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet On Sat, Aug 18, 2012 at 10:49:31AM +0400, Michael Tokarev wrote: > On 18.08.2012 02:32, J. Bruce Fields wrote: > > On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote: > >> Wait a minute, that assumption's a problem because that calculation > >> depends in part on xpt_reserved, which is changed here.... > >> > >> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which > >> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a > >> lower xpt_reserved value. That could well explain this. > > > > So, maybe something like this? > > Well. What can I say? With the change below applied (to 3.2 kernel > at least), I don't see any stalls or high CPU usage on the server > anymore. It survived several multi-gigabyte transfers, for several > hours, without any problem. So it is a good step forward ;) > > But the whole thing seems to be quite a bit fragile. I tried to follow > the logic in there, and the thing is quite a bit, well, "twisted", and > somewhat difficult to follow. So I don't know if this is the right > fix or not. At least it works! :) Suggestions welcomed. > And I really wonder why no one else reported this problem before. > Is me the only one in this world who uses linux nfsd? :) This, for example: http://marc.info/?l=linux-nfs&m=134131915612287&w=2 may well describe the same problem.... It just needed some debugging persistence, thanks! --b. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 3.0+ NFS issues (bisected)
  2012-08-18 11:13 ` J. Bruce Fields
@ 2012-08-18 12:58 ` Michael Tokarev
  0 siblings, 0 replies; 27+ messages in thread
From: Michael Tokarev @ 2012-08-18 12:58 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On 18.08.2012 15:13, J. Bruce Fields wrote:
> On Sat, Aug 18, 2012 at 10:49:31AM +0400, Michael Tokarev wrote:
[]
>> Well.  What can I say?  With the change below applied (to a 3.2 kernel
>> at least), I don't see any stalls or high CPU usage on the server
>> anymore.  It survived several multi-gigabyte transfers, for several
>> hours, without any problem.  So it is a good step forward ;)
>>
>> But the whole thing seems quite fragile.  I tried to follow the logic
>> in there, and it is, well, "twisted", and somewhat difficult to follow.
>> So I don't know if this is the right fix or not.  At least it works! :)
>
> Suggestions welcomed.

Ok...  Meanwhile, you can add my

  Tested-by: Michael Tokarev <mjt@tls.msk.ru>

to the patch.

>> And I really wonder why no one else reported this problem before.
>> Am I the only one in this world who uses the Linux nfsd? :)
>
> This, for example:
>
>   http://marc.info/?l=linux-nfs&m=134131915612287&w=2
>
> may well describe the same problem....  It just needed some debugging
> persistence, thanks!

Ah.  I tried to find something like that when I initially sent this
report, but wasn't able to.  Apparently I'm not alone with this problem
after all!

Thank you for all the work!

/mjt
end of thread, other threads: [~2012-08-18 12:58 UTC | newest]

Thread overview: 27+ messages

2012-05-25  6:53 3.0+ NFS issues Michael Tokarev
2012-05-29 15:24 ` J. Bruce Fields
2012-05-30  7:11 ` Michael Tokarev
2012-05-30 13:25 ` J. Bruce Fields
2012-05-31  6:47 ` Michael Tokarev
2012-05-31 12:59 ` Myklebust, Trond
2012-05-31 13:24 ` Michael Tokarev
2012-05-31 13:46 ` Myklebust, Trond
2012-05-31 13:51 ` Michael Tokarev
2012-06-20 12:52 ` Christoph Bartoschek
2012-07-10 12:52 ` Michael Tokarev
2012-07-12 12:53 ` J. Bruce Fields
2012-08-17  1:56 ` 3.0+ NFS issues (bisected) Michael Tokarev
2012-08-17 14:56 ` J. Bruce Fields
2012-08-17 16:00 ` J. Bruce Fields
2012-08-17 17:12 ` Michael Tokarev
2012-08-17 17:18 ` J. Bruce Fields
2012-08-17 17:26 ` Michael Tokarev
2012-08-17 17:29 ` Michael Tokarev
2012-08-17 19:18 ` J. Bruce Fields
2012-08-17 20:08 ` J. Bruce Fields
2012-08-17 22:32 ` J. Bruce Fields
2012-08-18  6:49 ` Michael Tokarev
2012-08-18 11:13 ` J. Bruce Fields
2012-08-18 12:58 ` Michael Tokarev