linux-kernel.vger.kernel.org archive mirror
* 3.0+ NFS issues
@ 2012-05-25  6:53 Michael Tokarev
  2012-05-29 15:24 ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-05-25  6:53 UTC (permalink / raw)
  To: linux-nfs, Linux-kernel

Hello.

I updated my NFS server machine to kernel 3.0, and
noticed that its main use became, well, problematic.

While trying to dig deeper, I also found a few other
interesting issues, which are mentioned below.

But first things first: NFS.

i686pae kernel, lots of RAM, Atom-based (Cedar Trail)
machine with the usual rtl8169 NIC.  3.0 or 3.2 kernel
(I will try the current 3.4 but I don't have much hope
there).  NFSv4.

When a client machine (also running a 3.0 kernel) does some
reading, the process often stalls somewhere in the read
syscall, or, more rarely, during close, for up to two
MINUTES.  During this time, the client (kernel) reports
"NFS server <foo> does not respond" several times, and
finally "NFS server <foo> ok"; the client process comes
unstuck from the read(2) and is able to perform a few more
reads until the whole thing repeats.

While this happens, i.e. while the client process is stuck,
any access to the mounted filesystem results in an
uninterruptible sleep - all processes, including the
initial client, are unkillable.  This is very annoying,
but I was never able to fix it, including in the real
case when the server becomes unreachable - the "intr"
mount option, which is supposed to help here, makes no
difference.  This is the first and a longstanding issue,
but it is not really related to the issue at hand.
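
(For reference, the client mount is done along these lines; the
server name, export path and mount point below are placeholders,
not the real ones:)

 mount -t nfs4 -o intr server:/export /mnt/nfs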

On the server side things are more interesting.  During
the client stall, all NFSD threads are busy running,
using all available CPU.  The machine has 2 cores, and
if I reduce the number of NFSD threads to 1, one core is
used by the single NFSD at 100%.  With the default 8
NFSDs, both cores are used, 4 NFSD threads per core.
This is according to top(1), which shows proportional
and equal CPU activity for each NFSD thread.
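
(For reference, the thread count was varied with something like
the following - either interface should do on these kernels;
the count of 1 is just the case discussed above:)

 rpc.nfsd 1          # or: echo 1 > /proc/fs/nfsd/threads
 top                 # then watch the nfsd threads' CPU usage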

But - and this is where things become more interesting -
perf top shows almost no activity with any number of NFSD
threads other than one.  For example, this is a typical
perf top display with 2 NFSD threads:

     samples  pcnt function                  DSO
     1558.00  6.4% svc_recv                  [sunrpc]
     1084.00  5.2% __schedule                [kernel]
     1065.00  5.1% select_task_rq_fair       [kernel]
      692.00  3.3% try_to_wake_up            [kernel]


Yes, there's some 6.4% of time spent in svc_recv, but that
is FAR from the 200% CPU usage reported by top(1).  Now, when
I reduce the number of NFSD threads to 1, the perf top output
becomes dramatically different:

     samples  pcnt function                    DSO
     5996.00 59.2% svc_recv                    [sunrpc]
      750.00  7.4% cache_check                 [sunrpc]
      519.00  5.1% svc_tcp_has_wspace          [sunrpc]
      465.00  4.6% local_bh_enable_ip          [kernel.kallsyms]
      301.00  3.0% svc_defer                   [sunrpc]
      294.00  2.9% page_address                [kernel.kallsyms]

I.e., svc_recv time usage jumps from 6% to 60% - a tenfold
difference.

So, apparently, the `perf top' tool does not tell the whole truth.

Trying to catch this high CPU usage with Sysrq+T does not
show anything interesting; the NFSD thread is running (and
is the only thread shown in the "Runnable tasks" output):


May 25 10:48:37 gnome vmunix: [ 1709.483771] nfsd            R running      0  5853      2 0x00000000
May 25 10:48:37 gnome vmunix: [ 1709.483796]  f4477030 00000046 00000000 5c8e8929 0000018e f4c700f0 f4043a70 f545b140
May 25 10:48:37 gnome vmunix: [ 1709.483841]  050a8544 0000018e c1469140 c1469140 f4043a70 00000086 00000000 c1048575
May 25 10:48:37 gnome vmunix: [ 1709.483887]  c101ba11 c12fac18 f31bf800 f8f7bd30 0000003a c12fb156 0000003a 00000000
May 25 10:48:37 gnome vmunix: [ 1709.483931] Call Trace:
May 25 10:48:37 gnome vmunix: [ 1709.483950]  [<c1048575>] ? irq_exit+0x35/0x90
May 25 10:48:37 gnome vmunix: [ 1709.483973]  [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
May 25 10:48:37 gnome vmunix: [ 1709.483997]  [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484029]  [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484053]  [<c12fb156>] ? apic_timer_interrupt+0x2a/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484084]  [<f8f7007b>] ? svc_destroy+0xdb/0x110 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484107]  [<c103d3e3>] ? __cond_resched+0x13/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484129]  [<c12f92b5>] ? _cond_resched+0x25/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484159]  [<f8f7cb5d>] ? svc_recv+0x1cd/0x7e0 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484182]  [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
May 25 10:48:37 gnome vmunix: [ 1709.484206]  [<c103a3d0>] ? try_to_wake_up+0x200/0x200
May 25 10:48:37 gnome vmunix: [ 1709.484233]  [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
May 25 10:48:37 gnome vmunix: [ 1709.484255]  [<c102d240>] ? complete+0x40/0x60
May 25 10:48:37 gnome vmunix: [ 1709.484279]  [<f90ea000>] ? 0xf90e9fff
May 25 10:48:37 gnome vmunix: [ 1709.484299]  [<c105ea99>] ? kthread+0x69/0x70
May 25 10:48:37 gnome vmunix: [ 1709.484321]  [<c105ea30>] ? kthread_worker_fn+0x150/0x150
May 25 10:48:37 gnome vmunix: [ 1709.484345]  [<c1301876>] ? kernel_thread_helper+0x6/0xd

May 25 10:49:37 gnome vmunix: [ 1769.982791] nfsd            R running      0  5853      2 0x00000000
May 25 10:49:37 gnome vmunix: [ 1769.982816]  f5449f80 00000000 00000000 7636b581 0000019c 00000003 7603dcb8 0000019c
May 25 10:49:37 gnome vmunix: [ 1769.982859]  7603dcb8 0000019c 00000046 00000000 f4043a70 00000086 00000000 c1048575
May 25 10:49:37 gnome vmunix: [ 1769.982903]  f31bf800 c12fac18 f31bf800 f8f7bd30 00000047 f31bf800 f418c2c0 f8f7c03c
May 25 10:49:37 gnome vmunix: [ 1769.982948] Call Trace:
May 25 10:49:37 gnome vmunix: [ 1769.982966]  [<c1048575>] ? irq_exit+0x35/0x90
May 25 10:49:37 gnome vmunix: [ 1769.982987]  [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
May 25 10:49:37 gnome vmunix: [ 1769.983019]  [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983050]  [<f8f7c03c>] ? svc_xprt_release+0x1c/0xd0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983081]  [<f8f7c0ad>] ? svc_xprt_release+0x8d/0xd0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983112]  [<f8f7cc03>] ? svc_recv+0x273/0x7e0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983143]  [<f8f7c300>] ? svc_age_temp_xprts+0x110/0x190 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983166]  [<c103a3d0>] ? try_to_wake_up+0x200/0x200
May 25 10:49:37 gnome vmunix: [ 1769.983193]  [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
May 25 10:49:37 gnome vmunix: [ 1769.983215]  [<c102d240>] ? complete+0x40/0x60
May 25 10:49:37 gnome vmunix: [ 1769.983238]  [<f90ea000>] ? 0xf90e9fff
May 25 10:49:37 gnome vmunix: [ 1769.983258]  [<c105ea99>] ? kthread+0x69/0x70
May 25 10:49:37 gnome vmunix: [ 1769.983278]  [<c105ea30>] ? kthread_worker_fn+0x150/0x150
May 25 10:49:37 gnome vmunix: [ 1769.983301]  [<c1301876>] ? kernel_thread_helper+0x6/0xd

This is 3.2; on 3.0 it was in a different place.

This problem is persistent and always reproducible, happens
on reads from the server (I haven't seen it with writes),
and completely goes away when reverting to the kernel I
previously used on this machine - 2.6.32.  But unfortunately
2.6.32 does not quite work on the new hardware (I replaced
the motherboard), due to a lack of drivers for several
components.  (That is a different story; I think NFSD should
work in 3.0 regardless of the 2.6.32 issues :)

Anything I can do to make NFS work again?

Thanks,

/mjt


* Re: 3.0+ NFS issues
  2012-05-25  6:53 3.0+ NFS issues Michael Tokarev
@ 2012-05-29 15:24 ` J. Bruce Fields
  2012-05-30  7:11   ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2012-05-29 15:24 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-nfs, Linux-kernel

On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
> I updated my nfs server machine to kernel 3.0, and
> noticed that its main usage become, well, problematic.
> 
> While trying to dig deeper, I also found a few other
> interesting issues, which are mentioned below.
> 
> But first thing first: nfs.
> 
> i686pae kernel, lots of RAM, Atom-based (cedar trail)
> machine with usual rtl8169 NIC.  3.0 or 3.2 kernel
> (I will try current 3.4 but I don't have much hopes
> there).  NFSv4.
> 
> When a client machine (also 3.0 kernel) does some reading,
> the process often stalls somewhere in the read syscall,
> or, rarer, during close, for up to two MINUTES.  During
> this time, the client (kernel) reports "NFS server <foo>
> does not respond" several times, and finally "NFS server
> <foo> ok", client process "unstucks" from the read(2),
> and is able to perform a few more reads till the whole
> thing repeats.

You say 2.6.32 was OK; have you tried anything else between?

And you're holding the client constant while varying only the server
version, right?

Is your network otherwise working?  (E.g. does transferring a bunch of
data from server to client using some other protocol work reliably?)

Is there anything happening on the network during these stalls?  (You
can watch the network with wireshark, for example.)
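
(For example, something along these lines on either end - the
interface name and the server placeholder are just examples:)

 tcpdump -n -i eth0 host <server> and port 2049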

Does NFSv3 behave the same way?

--b.

> 
> While this happens, ie, while the client process is stuck,
> any access to the mounted filesystem results in an
> uninterruptible sleep - all processes including the
> initial client are unkillable.  This is very annoying,
> but I was never able to fix this, including the real
> case when the server becomes unreachable - "intr" mount
> option which is supposed to help here makes no difference.
> This is first and longstanding issue, but it is not really
> related to the issue at hand.
> 
> On the server side things are more interesting.  During
> the client stall, all NFSD threads are busy running,
> using all available CPU.  The machine has 2 cores, and
> if I reduce number of NFSD threads to 1, one core will
> be used by the single NFSD to 100%.  With default 8
> NFSDs, both cores are used, by 4 NFSD threads each.
> This is according to top(1), which show proportional
> and equal CPU activity for each of NFSD thread.
> 
> But.  And this is where things becomes more interesting.
> perf top shows almost no activity with any number of NFSD
> threads but one.  For example, this is a typical perf top
> display with 2 NFSD threads:
> 
>      samples  pcnt function                  DSO
>      1558.00  6.4% svc_recv                  [sunrpc]
>      1084.00  5.2% __schedule                [kernel]
>      1065.00  5.1% select_task_rq_fair       [kernel]
>       692.00  3.3% try_to_wake_up            [kernel]
> 
> 
> Yes, there's some 6.4% time spent in svc_recv, but it is
> FAR from the 200% CPU usage reported by top(1).  Now, when
> I reduce number of NFSD threds to 1, perf top output becomes
> dramatically different:
> 
>      samples  pcnt function                    DSO
>      5996.00 59.2% svc_recv                    [sunrpc]
>       750.00  7.4% cache_check                 [sunrpc]
>       519.00  5.1% svc_tcp_has_wspace          [sunrpc]
>       465.00  4.6% local_bh_enable_ip          [kernel.kallsyms]
>       301.00  3.0% svc_defer                   [sunrpc]
>       294.00  2.9% page_address                [kernel.kallsyms]
> 
> Ie, scv_recv time usage jumps from 6 to 60%, ie, 10 times
> the difference.
> 
> So, apparently, `perf top' tool does not show the truth.
> 
> Trying to catch this high CPU usage with Sysrq+T does not
> show anything interesting, the NFSD thread is running (and
> is the only thread(s) shown in the "Runnable tasks" output):
> 
> 
> May 25 10:48:37 gnome vmunix: [ 1709.483771] nfsd            R running      0  5853      2 0x00000000
> May 25 10:48:37 gnome vmunix: [ 1709.483796]  f4477030 00000046 00000000 5c8e8929 0000018e f4c700f0 f4043a70 f545b140
> May 25 10:48:37 gnome vmunix: [ 1709.483841]  050a8544 0000018e c1469140 c1469140 f4043a70 00000086 00000000 c1048575
> May 25 10:48:37 gnome vmunix: [ 1709.483887]  c101ba11 c12fac18 f31bf800 f8f7bd30 0000003a c12fb156 0000003a 00000000
> May 25 10:48:37 gnome vmunix: [ 1709.483931] Call Trace:
> May 25 10:48:37 gnome vmunix: [ 1709.483950]  [<c1048575>] ? irq_exit+0x35/0x90
> May 25 10:48:37 gnome vmunix: [ 1709.483973]  [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
> May 25 10:48:37 gnome vmunix: [ 1709.483997]  [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484029]  [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
> May 25 10:48:37 gnome vmunix: [ 1709.484053]  [<c12fb156>] ? apic_timer_interrupt+0x2a/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484084]  [<f8f7007b>] ? svc_destroy+0xdb/0x110 [sunrpc]
> May 25 10:48:37 gnome vmunix: [ 1709.484107]  [<c103d3e3>] ? __cond_resched+0x13/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484129]  [<c12f92b5>] ? _cond_resched+0x25/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484159]  [<f8f7cb5d>] ? svc_recv+0x1cd/0x7e0 [sunrpc]
> May 25 10:48:37 gnome vmunix: [ 1709.484182]  [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
> May 25 10:48:37 gnome vmunix: [ 1709.484206]  [<c103a3d0>] ? try_to_wake_up+0x200/0x200
> May 25 10:48:37 gnome vmunix: [ 1709.484233]  [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
> May 25 10:48:37 gnome vmunix: [ 1709.484255]  [<c102d240>] ? complete+0x40/0x60
> May 25 10:48:37 gnome vmunix: [ 1709.484279]  [<f90ea000>] ? 0xf90e9fff
> May 25 10:48:37 gnome vmunix: [ 1709.484299]  [<c105ea99>] ? kthread+0x69/0x70
> May 25 10:48:37 gnome vmunix: [ 1709.484321]  [<c105ea30>] ? kthread_worker_fn+0x150/0x150
> May 25 10:48:37 gnome vmunix: [ 1709.484345]  [<c1301876>] ? kernel_thread_helper+0x6/0xd
> 
> May 25 10:49:37 gnome vmunix: [ 1769.982791] nfsd            R running      0  5853      2 0x00000000
> May 25 10:49:37 gnome vmunix: [ 1769.982816]  f5449f80 00000000 00000000 7636b581 0000019c 00000003 7603dcb8 0000019c
> May 25 10:49:37 gnome vmunix: [ 1769.982859]  7603dcb8 0000019c 00000046 00000000 f4043a70 00000086 00000000 c1048575
> May 25 10:49:37 gnome vmunix: [ 1769.982903]  f31bf800 c12fac18 f31bf800 f8f7bd30 00000047 f31bf800 f418c2c0 f8f7c03c
> May 25 10:49:37 gnome vmunix: [ 1769.982948] Call Trace:
> May 25 10:49:37 gnome vmunix: [ 1769.982966]  [<c1048575>] ? irq_exit+0x35/0x90
> May 25 10:49:37 gnome vmunix: [ 1769.982987]  [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
> May 25 10:49:37 gnome vmunix: [ 1769.983019]  [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983050]  [<f8f7c03c>] ? svc_xprt_release+0x1c/0xd0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983081]  [<f8f7c0ad>] ? svc_xprt_release+0x8d/0xd0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983112]  [<f8f7cc03>] ? svc_recv+0x273/0x7e0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983143]  [<f8f7c300>] ? svc_age_temp_xprts+0x110/0x190 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983166]  [<c103a3d0>] ? try_to_wake_up+0x200/0x200
> May 25 10:49:37 gnome vmunix: [ 1769.983193]  [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
> May 25 10:49:37 gnome vmunix: [ 1769.983215]  [<c102d240>] ? complete+0x40/0x60
> May 25 10:49:37 gnome vmunix: [ 1769.983238]  [<f90ea000>] ? 0xf90e9fff
> May 25 10:49:37 gnome vmunix: [ 1769.983258]  [<c105ea99>] ? kthread+0x69/0x70
> May 25 10:49:37 gnome vmunix: [ 1769.983278]  [<c105ea30>] ? kthread_worker_fn+0x150/0x150
> May 25 10:49:37 gnome vmunix: [ 1769.983301]  [<c1301876>] ? kernel_thread_helper+0x6/0xd
> 
> This is 3.2, on 3.0 it was in different place.
> 
> This problem is persistent, always reproducible, happens
> on reads from the server (I haven't seen it with writes),
> and completely goes away when reverting to the kernel I
> previously used on this machine - 2.6.32.  But unfortunately
> 2.6.32 does not quite work on the new hardware (I replaced
> the motherboard), due to lack of drivers for several pieces.
> (This is a different story, I think NFSD should work in 3.0
> regardless of 2.6.32 issues :)
> 
> Anything I can do to make NFS work again?
> 
> Thanks,
> 
> /mjt


* Re: 3.0+ NFS issues
  2012-05-29 15:24 ` J. Bruce Fields
@ 2012-05-30  7:11   ` Michael Tokarev
  2012-05-30 13:25     ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-05-30  7:11 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs, Linux-kernel

On 29.05.2012 19:24, J. Bruce Fields wrote:
> On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
>> I updated my nfs server machine to kernel 3.0, and
>> noticed that its main usage become, well, problematic.
>>
>> While trying to dig deeper, I also found a few other
>> interesting issues, which are mentioned below.
>>
>> But first thing first: nfs.
>>
>> i686pae kernel, lots of RAM, Atom-based (cedar trail)
>> machine with usual rtl8169 NIC.  3.0 or 3.2 kernel
>> (I will try current 3.4 but I don't have much hopes
>> there).  NFSv4.
>>
>> When a client machine (also 3.0 kernel) does some reading,
>> the process often stalls somewhere in the read syscall,
>> or, rarer, during close, for up to two MINUTES.  During
>> this time, the client (kernel) reports "NFS server <foo>
>> does not respond" several times, and finally "NFS server
>> <foo> ok", client process "unstucks" from the read(2),
>> and is able to perform a few more reads till the whole
>> thing repeats.
> 
> You say 2.6.32 was OK; have you tried anything else between?

Well, I thought bisecting between 2.6.32 and 3.0 would be quite
painful...  But I'll try it if nothing else helps.  And no, I haven't
tried anything in between.

> And you're holding the client constant while varying only the server
> version, right?

Yes.

> Is your network otherwise working?  (E.g. does transferring a bunch of
> data from server to client using some other protocol work reliably?)

Yes, it works flawlessly; all other protocols work so far.

To date, I have resorted to using a small webserver plus wget as an ugly
workaround for the problem - HTTP works for reads from the server, while
NFS works for writes.

> Is there anything happening on the network during these stalls?  (You
> can watch the network with wireshark, for example.)

The network load is irrelevant - it behaves the same way with
a 100% idle network or with the network busy doing other stuff.

> Does NFSv3 behave the same way?

Yes it does - with all NFSDs on the server eating all available CPU for
quite some time, while remaining "ghosts" to perf top.

And with the client being unkillable again.

Can at least the client be made interruptible?  Mounting with
-o intr,soft makes no visible difference...

Thanks,

/mjt


* Re: 3.0+ NFS issues
  2012-05-30  7:11   ` Michael Tokarev
@ 2012-05-30 13:25     ` J. Bruce Fields
  2012-05-31  6:47       ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2012-05-30 13:25 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-nfs, Linux-kernel

On Wed, May 30, 2012 at 11:11:42AM +0400, Michael Tokarev wrote:
> On 29.05.2012 19:24, J. Bruce Fields wrote:
> > On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
> >> I updated my nfs server machine to kernel 3.0, and
> >> noticed that its main usage become, well, problematic.
> >>
> >> While trying to dig deeper, I also found a few other
> >> interesting issues, which are mentioned below.
> >>
> >> But first thing first: nfs.
> >>
> >> i686pae kernel, lots of RAM, Atom-based (cedar trail)
> >> machine with usual rtl8169 NIC.  3.0 or 3.2 kernel
> >> (I will try current 3.4 but I don't have much hopes
> >> there).  NFSv4.
> >>
> >> When a client machine (also 3.0 kernel) does some reading,
> >> the process often stalls somewhere in the read syscall,
> >> or, rarer, during close, for up to two MINUTES.  During
> >> this time, the client (kernel) reports "NFS server <foo>
> >> does not respond" several times, and finally "NFS server
> >> <foo> ok", client process "unstucks" from the read(2),
> >> and is able to perform a few more reads till the whole
> >> thing repeats.
> > 
> > You say 2.6.32 was OK; have you tried anything else between?
> 
> Well, I thought bisecting between 2.6.32 and 3.0 will be quite
> painful...  But I'll try if nothing else helps.  And no, I haven't
> tried anything in-between.
> 
> > And you're holding the client constant while varying only the server
> > version, right?
> 
> Yes.
> 
> > Is your network otherwise working?  (E.g. does transferring a bunch of
> > data from server to client using some other protocol work reliably?)
> 
> Yes, it works flawlessly, all other protocols works so far.
> 
> To the date, I resorted to using a small webserver plus wget as an ugly
> workaround for the problem - http works for reads from the server, while
> nfs works for writes.
> 
> > Is there anything happening on the network during these stalls?  (You
> > can watch the network with wireshark, for example.)
> 
> The network load is irrelevant - it behaves the same way with
> 100% idle network or with network busy doing other stuff.

That's not what I meant.  During one of these read stalls, if you watch
the network with wireshark, do you see any NFS traffic between the
client and server?

Also: do you have a reliable way of reproducing this quickly?

--b.

> > Does NFSv3 behave the same way?
> 
> Yes it does.  With all NFSDs on server eating all available CPUs for
> quite some time, and with being "ghosts" for perf top.
> 
> And with the client being unkillable again.
> 
> Can at least the client be made interruptible?  Mounting with
> -o intr,soft makes no visible difference...
> 
> Thanks,
> 
> /mjt


* Re: 3.0+ NFS issues
  2012-05-30 13:25     ` J. Bruce Fields
@ 2012-05-31  6:47       ` Michael Tokarev
  2012-05-31 12:59         ` Myklebust, Trond
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-05-31  6:47 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs, Linux-kernel

On 30.05.2012 17:25, J. Bruce Fields wrote:
> On Wed, May 30, 2012 at 11:11:42AM +0400, Michael Tokarev wrote:
[]
> That's not what I meant.  During one of these read stalls, if you watch
> the network with wireshark, do you see any NFS traffic between the
> client and server?

Oh.  Indeed, I misunderstood.

And no, during these stalls, there's no network activity at all.
Here's the typical scenario:

...
10:38:53.781990 IP (tos 0x0, ttl 64, id 35131, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0xb43d), ack 89530281, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782000 IP (tos 0x0, ttl 64, id 6329, offset 0, flags [DF], proto TCP (6), length 1500)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xe827 (correct), seq 89530281:89531729, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.782027 IP (tos 0x0, ttl 64, id 6330, offset 0, flags [DF], proto TCP (6), length 1708)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x37f6 (incorrect -> 0x6790), seq 89531729:89533385, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1656
10:38:53.782029 IP (tos 0x0, ttl 64, id 35132, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0xa81d), ack 89533385, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782040 IP (tos 0x0, ttl 64, id 6333, offset 0, flags [DF], proto TCP (6), length 1500)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x0d5b (correct), seq 89534833:89536281, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.782082 IP (tos 0x0, ttl 64, id 6334, offset 0, flags [DF], proto TCP (6), length 4396)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x4276 (incorrect -> 0x778a), seq 89536281:89540625, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 4344
10:38:53.782088 IP (tos 0x0, ttl 64, id 35134, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0x8bd5), ack 89540625, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782096 IP (tos 0x0, ttl 64, id 6337, offset 0, flags [DF], proto TCP (6), length 1500)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x835d (correct), seq 89540625:89542073, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.827355 IP (tos 0x0, ttl 64, id 35160, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1396548098 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827379 IP (tos 0x0, ttl 64, id 35161, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1413325314 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827385 IP (tos 0x0, ttl 64, id 35162, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1430102530 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827400 IP (tos 0x0, ttl 64, id 35163, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1446879746 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827406 IP (tos 0x0, ttl 64, id 35164, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1463656962 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827409 IP (tos 0x0, ttl 64, id 35165, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1480434178 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827413 IP (tos 0x0, ttl 64, id 35166, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1497211394 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827417 IP (tos 0x0, ttl 64, id 35167, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1513988610 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827420 IP (tos 0x0, ttl 64, id 35168, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1530765826 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827424 IP (tos 0x0, ttl 64, id 35169, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1547543042 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827427 IP (tos 0x0, ttl 64, id 35170, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1564320258 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827434 IP (tos 0x0, ttl 64, id 35171, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1581097474 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827438 IP (tos 0x0, ttl 64, id 35172, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1597874690 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827443 IP (tos 0x0, ttl 64, id 35173, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1614651906 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827447 IP (tos 0x0, ttl 64, id 35174, offset 0, flags [DF], proto TCP (6), length 268)
    192.168.88.2.1631429122 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827673 IP (tos 0x0, ttl 64, id 6428, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xe4e5 (correct), ack 41617, win 6289, options [nop,nop,TS val 122195221 ecr 3298142], length 0
10:38:53.827699 IP (tos 0x0, ttl 64, id 6429, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xdfd4 (correct), ack 42913, win 6289, options [nop,nop,TS val 122195221 ecr 3298143], length 0
10:38:53.865036 IP (tos 0x0, ttl 64, id 6430, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xdd40 (correct), ack 43561, win 6289, options [nop,nop,TS val 122195233 ecr 3298143], length 0
[pause]
^C

192.168.88.2 is the client, .4 is the server.

I'm not sure if the series of getattr requests from the client is
right before or right after the beginning of the stall, but after
the last 3 replies from the server there's no other activity for a
long time, and the server is eating 100% of the available CPU as
described previously.

> Also: do you have a reliable way of reproducing this quickly?

Yes, it is enough to start copying any large file, and within a few
seconds the first stall happens.

Can you suggest something for the other part of the question:

>> Can at least the client be made interruptible?  Mounting with
>> -o intr,soft makes no visible difference...

please? :)

Thank you!

/mjt


* Re: 3.0+ NFS issues
  2012-05-31  6:47       ` Michael Tokarev
@ 2012-05-31 12:59         ` Myklebust, Trond
  2012-05-31 13:24           ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: Myklebust, Trond @ 2012-05-31 12:59 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: J. Bruce Fields, linux-nfs, Linux-kernel

On Thu, 2012-05-31 at 10:47 +0400, Michael Tokarev wrote:

> 10:38:53.827355 IP (tos 0x0, ttl 64, id 35160, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1396548098 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827379 IP (tos 0x0, ttl 64, id 35161, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1413325314 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827385 IP (tos 0x0, ttl 64, id 35162, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1430102530 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827400 IP (tos 0x0, ttl 64, id 35163, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1446879746 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827406 IP (tos 0x0, ttl 64, id 35164, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1463656962 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827409 IP (tos 0x0, ttl 64, id 35165, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1480434178 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827413 IP (tos 0x0, ttl 64, id 35166, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1497211394 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827417 IP (tos 0x0, ttl 64, id 35167, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1513988610 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827420 IP (tos 0x0, ttl 64, id 35168, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1530765826 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827424 IP (tos 0x0, ttl 64, id 35169, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1547543042 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827427 IP (tos 0x0, ttl 64, id 35170, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1564320258 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827434 IP (tos 0x0, ttl 64, id 35171, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1581097474 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827438 IP (tos 0x0, ttl 64, id 35172, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1597874690 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827443 IP (tos 0x0, ttl 64, id 35173, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1614651906 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827447 IP (tos 0x0, ttl 64, id 35174, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1631429122 > 192.168.88.4.2049: 212 getattr fh 0,0/22

That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3.

Can you either please use wireshark to provide a full text dump (using
something like 'tshark -V -O nfs,rpc'), or just send us the binary
tcpdump output using 'tcpdump -w /tmp/foo -s 90000'?
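
(A sketch of how the two fit together - the interface name and the
port filter are just examples, adjust as needed:)

 tcpdump -i eth0 -s 0 -w /tmp/foo port 2049       # capture raw packets
 tshark -r /tmp/foo -V -O nfs,rpc > /tmp/foo.txt  # dissect them offline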

> >> Can at least the client be made interruptible?  Mounting with
> >> -o intr,soft makes no visible difference...
> 
> please? :)

It already is interruptible: try 'kill -9' or any other fatal signal.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com



* Re: 3.0+ NFS issues
  2012-05-31 12:59         ` Myklebust, Trond
@ 2012-05-31 13:24           ` Michael Tokarev
  2012-05-31 13:46             ` Myklebust, Trond
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-05-31 13:24 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: J. Bruce Fields, linux-nfs, Linux-kernel

On 31.05.2012 16:59, Myklebust, Trond wrote:
[]
> That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3.

Oh.

> Can you either please use wireshark to provide a full text dump (using
> something like 'tshark -V -O nfs,rpc'), or just send us the binary
> tcpdump output using 'tcpdump -w /tmp/foo -s 90000'?

I started tcpdump:

 tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump

on the client (192.168.88.2).  Next I mounted a directory on the client
and started reading (tar'ing) a directory into /dev/null.  This captured a
few stalls.  Tcpdump shows the number of packets it got; the stalls are at
packet counts 58090, 97069 and 97071.  I cancelled the capture after that.
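
(If useful, the area around those packet counts can be pulled back
out of the capture afterwards with something like the following -
the frame-number window is approximate and the read-filter option
spelling differs between tshark versions:)

 tshark -r nfsdump -R 'frame.number > 58080 && frame.number < 58100' -V -O nfs,rpc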

The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ;
it is 220MB uncompressed and 1.3MB compressed.  The source files are
10 files of 1GB each, all made using the `truncate' utility, so they
take up no space on disk at all.  This also makes it obvious that the
issue does not depend on the speed of the disk on the server (since in
this case the server disk isn't even in use).
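
(For completeness, the source files were created with something like
this - the directory and file names here are made up:)

 for i in $(seq 1 10); do truncate -s 1G /srv/export/bigfile$i; done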

What I noticed is: with the default 8 NFSD threads, it takes quite a bit
longer to hit this issue, but it still happens.  The capture was
done using 2 threads.

>>>> Can at least the client be made interruptible?  Mounting with
>>>> -o intr,soft makes no visible difference...
>>
>> please? :)
> 
> It already is interruptible: try 'kill -9' or any other fatal signal.

Oh.  Indeed, I can kill it from another terminal using -9.  That is
great already!

Thank you!

/mjt


* Re: 3.0+ NFS issues
  2012-05-31 13:24           ` Michael Tokarev
@ 2012-05-31 13:46             ` Myklebust, Trond
  2012-05-31 13:51               ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: Myklebust, Trond @ 2012-05-31 13:46 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: J. Bruce Fields, linux-nfs, Linux-kernel

On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> On 31.05.2012 16:59, Myklebust, Trond wrote:
> []
> > That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3.
> 
> Oh.
> 
> > Can you either please use wireshark to provide a full text dump (using
> > something like 'tshark -V -O nfs,rpc'), or just send us the binary
> > tcpdump output using 'tcpdump -w /tmp/foo -s 90000'?
> 
> I started tcpdump:
> 
>  tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
> 
> on the client (192.168.88.2).  Next I mounted a directory on the client,
> and started reading (tar'ing) a directory into /dev/null.  It captured a
> few stalls.  Tcpdump shows number of packets it got, the stalls are at
> packet counts 58090, 97069 and 97071.  I cancelled the capture after that.
> 
> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> it is 220Mb uncompressed and 1.3Mb compressed.  The source files are
> 10 files of 1Gb each, all made by using `truncate' utility, so does not
> take place on disk at all.  This also makes it obvious that the issue
> does not depend on the speed of disk on the server (since in this case,
> the server disk isn't even in use).

OK. So from the above file it looks as if the traffic is mainly READ
requests.

In 2 places the server stops responding. In both cases, the client seems
to be sending a single TCP frame containing several COMPOUNDS containing
READ requests (which should be legal) just prior to the hang. When the
server doesn't respond, the client pings it with a RENEW, before it ends
up severing the TCP connection and then retransmitting.
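
(Those two places are easy to spot in the capture by filtering on TCP
retransmissions, e.g. with something like the following - the
read-filter option spelling depends on the tshark version:)

 tshark -r nfsdump -R 'tcp.analysis.retransmission'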

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com



* Re: 3.0+ NFS issues
  2012-05-31 13:46             ` Myklebust, Trond
@ 2012-05-31 13:51               ` Michael Tokarev
  2012-07-10 12:52                 ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-05-31 13:51 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: J. Bruce Fields, linux-nfs, Linux-kernel

On 31.05.2012 17:46, Myklebust, Trond wrote:
> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
[]
>> I started tcpdump:
>>
>>  tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
>>
>> on the client (192.168.88.2).  Next I mounted a directory on the client,
>> and started reading (tar'ing) a directory into /dev/null.  It captured a
>> few stalls.  Tcpdump shows number of packets it got, the stalls are at
>> packet counts 58090, 97069 and 97071.  I cancelled the capture after that.
>>
>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
>> it is 220Mb uncompressed and 1.3Mb compressed.  The source files are
>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
>> take place on disk at all.  This also makes it obvious that the issue
>> does not depend on the speed of disk on the server (since in this case,
>> the server disk isn't even in use).
> 
> OK. So from the above file it looks as if the traffic is mainly READ
> requests.

The issue here happens only with reads.

> In 2 places the server stops responding. In both cases, the client seems
> to be sending a single TCP frame containing several COMPOUNDS containing
> READ requests (which should be legal) just prior to the hang. When the
> server doesn't respond, the client pings it with a RENEW, before it ends
> up severing the TCP connection and then retransmitting.

And sometimes -- speaking only from the behaviour I've seen, not from the
actual frames sent -- the server does not respond to the RENEW either, in
which case the client reports "nfs server not responding", and on the next
RENEW it may actually respond.  This happens too, but much more rarely.

During these stalls, i.e. when there's no network activity at all,
the server's NFSD threads are busy eating all available CPU.

What does it all tell us? :)

Thank you!

/mjt


* Re: 3.0+ NFS issues
  2012-05-31 13:51               ` Michael Tokarev
@ 2012-07-10 12:52                 ` Michael Tokarev
  2012-07-12 12:53                   ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-07-10 12:52 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: J. Bruce Fields, linux-nfs, Linux-kernel

I tried to debug this again, maybe to reproduce it in a virtual machine,
and found out that only the 32-bit server kernel shows this issue:
after updating the kernel on the server to 64-bit (the same version)
I can't reproduce this issue anymore.  Rebooting back to 32-bit,
and voila, it is here again.

Something apparently isn't right on 32 bits... ;)

(And yes, the problem is still present and is very annoying :)

Thanks,

/mjt


On 31.05.2012 17:51, Michael Tokarev wrote:
> On 31.05.2012 17:46, Myklebust, Trond wrote:
>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> []
>>> I started tcpdump:
>>>
>>>  tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
>>>
>>> on the client (192.168.88.2).  Next I mounted a directory on the client,
>>> and started reading (tar'ing) a directory into /dev/null.  It captured a
>>> few stalls.  Tcpdump shows number of packets it got, the stalls are at
>>> packet counts 58090, 97069 and 97071.  I cancelled the capture after that.
>>>
>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
>>> it is 220Mb uncompressed and 1.3Mb compressed.  The source files are
>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
>>> take place on disk at all.  This also makes it obvious that the issue
>>> does not depend on the speed of disk on the server (since in this case,
>>> the server disk isn't even in use).
>>
>> OK. So from the above file it looks as if the traffic is mainly READ
>> requests.
> 
> The issue here happens only with reads.
> 
>> In 2 places the server stops responding. In both cases, the client seems
>> to be sending a single TCP frame containing several COMPOUNDS containing
>> READ requests (which should be legal) just prior to the hang. When the
>> server doesn't respond, the client pings it with a RENEW, before it ends
>> up severing the TCP connection and then retransmitting.
> 
> And sometimes -- speaking only from the behavour I've seen, not from the
> actual frames sent -- server does not respond to the RENEW too, in which
> case the client reports "nfs server no responding", and on the next
> renew it may actually respond.  This happens too, but much more rare.
> 
> During these stalls, ie, when there's no network activity at all,
> the server NFSD threads are busy eating all available CPU.
> 
> What does it all tell us? :)
> 
> Thank you!
> 
> /mjt



* Re: 3.0+ NFS issues
  2012-07-10 12:52                 ` Michael Tokarev
@ 2012-07-12 12:53                   ` J. Bruce Fields
  2012-08-17  1:56                     ` 3.0+ NFS issues (bisected) Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2012-07-12 12:53 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel

On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
> I tried to debug this again, maybe to reproduce in a virtual machine,
> and found out that it is only 32bit server code shows this issue:
> after updating the kernel on the server to 64bit (the same version)
> I can't reproduce this issue anymore.  Rebooting back to 32bit,
> and voila, it is here again.
> 
> Something apparenlty isn't right on 32bits... ;)
> 
> (And yes, the prob is still present and is very annoying :)

OK, that's very useful, thanks.  So probably a bug got introduced in the
32-bit case between 2.6.32 and 3.0.

My personal upstream testing is normally all x86_64 only.  I'll kick off
a 32-bit install and see if I can reproduce this quickly.

Let me know if you're able to narrow this down any more.

--b.

> 
> Thanks,
> 
> /mjt
> 
> 
> On 31.05.2012 17:51, Michael Tokarev wrote:
> > On 31.05.2012 17:46, Myklebust, Trond wrote:
> >> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> > []
> >>> I started tcpdump:
> >>>
> >>>  tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
> >>>
> >>> on the client (192.168.88.2).  Next I mounted a directory on the client,
> >>> and started reading (tar'ing) a directory into /dev/null.  It captured a
> >>> few stalls.  Tcpdump shows number of packets it got, the stalls are at
> >>> packet counts 58090, 97069 and 97071.  I cancelled the capture after that.
> >>>
> >>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> >>> it is 220Mb uncompressed and 1.3Mb compressed.  The source files are
> >>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
> >>> take place on disk at all.  This also makes it obvious that the issue
> >>> does not depend on the speed of disk on the server (since in this case,
> >>> the server disk isn't even in use).
> >>
> >> OK. So from the above file it looks as if the traffic is mainly READ
> >> requests.
> > 
> > The issue here happens only with reads.
> > 
> >> In 2 places the server stops responding. In both cases, the client seems
> >> to be sending a single TCP frame containing several COMPOUNDS containing
> >> READ requests (which should be legal) just prior to the hang. When the
> >> server doesn't respond, the client pings it with a RENEW, before it ends
> >> up severing the TCP connection and then retransmitting.
> > 
> > And sometimes -- speaking only from the behavour I've seen, not from the
> > actual frames sent -- server does not respond to the RENEW too, in which
> > case the client reports "nfs server no responding", and on the next
> > renew it may actually respond.  This happens too, but much more rare.
> > 
> > During these stalls, ie, when there's no network activity at all,
> > the server NFSD threads are busy eating all available CPU.
> > 
> > What does it all tell us? :)
> > 
> > Thank you!
> > 
> > /mjt
> 


* Re: 3.0+ NFS issues (bisected)
  2012-07-12 12:53                   ` J. Bruce Fields
@ 2012-08-17  1:56                     ` Michael Tokarev
  2012-08-17 14:56                       ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-08-17  1:56 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On 12.07.2012 16:53, J. Bruce Fields wrote:
> On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
>> I tried to debug this again, maybe to reproduce in a virtual machine,
>> and found out that it is only 32bit server code shows this issue:
>> after updating the kernel on the server to 64bit (the same version)
>> I can't reproduce this issue anymore.  Rebooting back to 32bit,
>> and voila, it is here again.
>>
>> Something apparenlty isn't right on 32bits... ;)
>>
>> (And yes, the prob is still present and is very annoying :)
> 
> OK, that's very useful, thanks.  So probably a bug got introduced in the
> 32-bit case between 2.6.32 and 3.0.
> 
> My personal upstream testing is normally all x86_64 only.  I'll kick off
> a 32-bit install and see if I can reproduce this quickly.

Actually it has nothing to do with 32 vs 64 bits as I
initially thought.  It happens on 64 bits too, but it takes
more time (or more data transferred) to trigger.


> Let me know if you're able to narrow this down any more.

I bisected this issue to the following commit:

commit f03d78db65085609938fdb686238867e65003181
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Thu Jul 7 00:27:05 2011 -0700

    net: refine {udp|tcp|sctp}_mem limits

    Current tcp/udp/sctp global memory limits are not taking into account
    hugepages allocations, and allow 50% of ram to be used by buffers of a
    single protocol [ not counting space used by sockets / inodes ...]

    Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
    per protocol, and a minimum of 128 pages.
    Heavy duty machines sysadmins probably need to tweak limits anyway.


Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes
the behaviour here.

This machine has 4GB of memory.  On 3.0, with this patch applied
(as it is part of 3.0), tcp_mem looks like this:

  21228     28306   42456

with this patch reverted, tcp_mem shows:

  81216     108288  162432

and with these values, it works fine.
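
(Back-of-the-envelope check, based on my reading of that commit -
treat the exact formula as an assumption: the new limit is
nr_free_buffer_pages()/8, and tcp_mem becomes { limit/4*3, limit,
limit/4*3*2 }.  On a 32-bit/PAE box only lowmem counts, roughly
226k free pages here, which lands right on the values above:)

 echo $(( 226448 / 8 ))      # ~28306  -> tcp_mem[1]
 echo $(( 28306 / 4 * 3 ))   #  21228  -> tcp_mem[0]
 echo $(( 21228 * 2 ))       #  42456  -> tcp_mem[2]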

So it looks like something else goes wrong there,
which leads to all nfsds fighting with each other
over something and eating 100% of the available CPU
instead of servicing clients.

For added fun, when setting tcp_mem to the "good" values
from the "bad" ones (after booting into a kernel with that
patch applied), the problem is _not_ fixed.
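
(That was done at runtime roughly like this, with the "good" values
from above:)

 cat /proc/sys/net/ipv4/tcp_mem
 sysctl -w net.ipv4.tcp_mem="81216 108288 162432"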

Any further hints?

Thanks,

/mjt

>> On 31.05.2012 17:51, Michael Tokarev wrote:
>>> On 31.05.2012 17:46, Myklebust, Trond wrote:
>>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
>>> []
>>>>> I started tcpdump:
>>>>>
>>>>>  tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
>>>>>
>>>>> on the client (192.168.88.2).  Next I mounted a directory on the client,
>>>>> and started reading (tar'ing) a directory into /dev/null.  It captured a
>>>>> few stalls.  Tcpdump shows number of packets it got, the stalls are at
>>>>> packet counts 58090, 97069 and 97071.  I cancelled the capture after that.
>>>>>
>>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
>>>>> it is 220Mb uncompressed and 1.3Mb compressed.  The source files are
>>>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
>>>>> take place on disk at all.  This also makes it obvious that the issue
>>>>> does not depend on the speed of disk on the server (since in this case,
>>>>> the server disk isn't even in use).
>>>>
>>>> OK. So from the above file it looks as if the traffic is mainly READ
>>>> requests.
>>>
>>> The issue here happens only with reads.
>>>
>>>> In 2 places the server stops responding. In both cases, the client seems
>>>> to be sending a single TCP frame containing several COMPOUNDS containing
>>>> READ requests (which should be legal) just prior to the hang. When the
>>>> server doesn't respond, the client pings it with a RENEW, before it ends
>>>> up severing the TCP connection and then retransmitting.
>>>
>>> And sometimes -- speaking only from the behavour I've seen, not from the
>>> actual frames sent -- server does not respond to the RENEW too, in which
>>> case the client reports "nfs server no responding", and on the next
>>> renew it may actually respond.  This happens too, but much more rare.
>>>
>>> During these stalls, ie, when there's no network activity at all,
>>> the server NFSD threads are busy eating all available CPU.
>>>
>>> What does it all tell us? :)
>>>
>>> Thank you!
>>>
>>> /mjt
>>



* Re: 3.0+ NFS issues (bisected)
  2012-08-17  1:56                     ` 3.0+ NFS issues (bisected) Michael Tokarev
@ 2012-08-17 14:56                       ` J. Bruce Fields
  2012-08-17 16:00                         ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2012-08-17 14:56 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On Fri, Aug 17, 2012 at 05:56:56AM +0400, Michael Tokarev wrote:
> On 12.07.2012 16:53, J. Bruce Fields wrote:
> > On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
> >> I tried to debug this again, maybe to reproduce in a virtual machine,
> >> and found out that it is only 32bit server code shows this issue:
> >> after updating the kernel on the server to 64bit (the same version)
> >> I can't reproduce this issue anymore.  Rebooting back to 32bit,
> >> and voila, it is here again.
> >>
> >> Something apparenlty isn't right on 32bits... ;)
> >>
> >> (And yes, the prob is still present and is very annoying :)
> > 
> > OK, that's very useful, thanks.  So probably a bug got introduced in the
> > 32-bit case between 2.6.32 and 3.0.
> > 
> > My personal upstream testing is normally all x86_64 only.  I'll kick off
> > a 32-bit install and see if I can reproduce this quickly.
> 
> Actually it has nothing to do with 32 vs 64 bits as I
> initially thought.  It happens on 64bits too, but takes
> more time (or data to transfer) to trigger.

That makes it sound like some kind of leak: you're hitting this case
eventually either way, but it takes longer in the case where you have
more (low) memory.

I wish I was more familiar with the tcp code.... What number exactly is
being compared against those limits, and how could we watch it from
userspace?

--b.

> 
> 
> > Let me know if you're able to narrow this down any more.
> 
> I bisected this issue to the following commit:
> 
> commit f03d78db65085609938fdb686238867e65003181
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Thu Jul 7 00:27:05 2011 -0700
> 
>     net: refine {udp|tcp|sctp}_mem limits
> 
>     Current tcp/udp/sctp global memory limits are not taking into account
>     hugepages allocations, and allow 50% of ram to be used by buffers of a
>     single protocol [ not counting space used by sockets / inodes ...]
> 
>     Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
>     per protocol, and a minimum of 128 pages.
>     Heavy duty machines sysadmins probably need to tweak limits anyway.
> 
> 
> Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes
> the behavour here.
> 
> This machine has 4Gb of memory.  On 3.0, with this patch applied
> (as it is part of 3.0), tcp_mem is like this:
> 
>   21228     28306   42456
> 
> with this patch reverted, tcp_mem shows:
> 
>   81216     108288  162432
> 
> and with these values, it works fine.
> 
> So it looks like something else goes wrong there,
> which lead to all nfsds fighting with each other
> for something and eating 100% of available CPU
> instead of servicing clients.
> 
> For added fun, when setting tcp_mem to the "good" value
> from "bad" value (after booting into kernel with that
> patch applied), the problem is _not_ fixed.
> 
> Any further hints?
> 
> Thanks,
> 
> /mjt
> 
> >> On 31.05.2012 17:51, Michael Tokarev wrote:
> >>> On 31.05.2012 17:46, Myklebust, Trond wrote:
> >>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> >>> []
> >>>>> I started tcpdump:
> >>>>>
> >>>>>  tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
> >>>>>
> >>>>> on the client (192.168.88.2).  Next I mounted a directory on the client,
> >>>>> and started reading (tar'ing) a directory into /dev/null.  It captured a
> >>>>> few stalls.  Tcpdump shows number of packets it got, the stalls are at
> >>>>> packet counts 58090, 97069 and 97071.  I cancelled the capture after that.
> >>>>>
> >>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> >>>>> it is 220Mb uncompressed and 1.3Mb compressed.  The source files are
> >>>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
> >>>>> take place on disk at all.  This also makes it obvious that the issue
> >>>>> does not depend on the speed of disk on the server (since in this case,
> >>>>> the server disk isn't even in use).
> >>>>
> >>>> OK. So from the above file it looks as if the traffic is mainly READ
> >>>> requests.
> >>>
> >>> The issue here happens only with reads.
> >>>
> >>>> In 2 places the server stops responding. In both cases, the client seems
> >>>> to be sending a single TCP frame containing several COMPOUNDS containing
> >>>> READ requests (which should be legal) just prior to the hang. When the
> >>>> server doesn't respond, the client pings it with a RENEW, before it ends
> >>>> up severing the TCP connection and then retransmitting.
> >>>
> >>> And sometimes -- speaking only from the behaviour I've seen, not from the
> >>> actual frames sent -- the server does not respond to the RENEW either, in which
> >>> case the client reports "nfs server not responding", and on the next
> >>> renew it may actually respond.  This happens too, but much more rarely.
> >>>
> >>> During these stalls, ie, when there's no network activity at all,
> >>> the server NFSD threads are busy eating all available CPU.
> >>>
> >>> What does it all tell us? :)
> >>>
> >>> Thank you!
> >>>
> >>> /mjt
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-17 14:56                       ` J. Bruce Fields
@ 2012-08-17 16:00                         ` J. Bruce Fields
  2012-08-17 17:12                           ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2012-08-17 16:00 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On Fri, Aug 17, 2012 at 10:56:16AM -0400, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 05:56:56AM +0400, Michael Tokarev wrote:
> > On 12.07.2012 16:53, J. Bruce Fields wrote:
> > > On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
> > >> I tried to debug this again, maybe to reproduce in a virtual machine,
> > >> and found out that it is only 32bit server code shows this issue:
> > >> after updating the kernel on the server to 64bit (the same version)
> > >> I can't reproduce this issue anymore.  Rebooting back to 32bit,
> > >> and voila, it is here again.
> > >>
> > >> Something apparently isn't right on 32 bits... ;)
> > >>
> > >> (And yes, the prob is still present and is very annoying :)
> > > 
> > > OK, that's very useful, thanks.  So probably a bug got introduced in the
> > > 32-bit case between 2.6.32 and 3.0.
> > > 
> > > My personal upstream testing is normally all x86_64 only.  I'll kick off
> > > a 32-bit install and see if I can reproduce this quickly.
> > 
> > Actually it has nothing to do with 32 vs 64 bits as I
> > initially thought.  It happens on 64bits too, but takes
> > more time (or data to transfer) to trigger.
> 
> That makes it sound like some kind of leak: you're hitting this case
> eventually either way, but it takes longer in the case where you have
> more (low) memory.
> 
> I wish I was more familiar with the tcp code.... What number exactly is
> being compared against those limits, and how could we watch it from
> userspace?

Uh, if I grepped my way through this right: it looks like it's the
"memory" column of the "TCP" row of /proc/net/protocols; might be
interesting to see how that's changing over time.
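
Something like this (a minimal sketch; it assumes the usual
/proc/net/protocols layout, with the per-protocol "memory" count in the
fourth column) could be run in a loop to sample that value while
reproducing the stall:

#include <stdio.h>
#include <string.h>

/* print the "memory" column of the TCP row of /proc/net/protocols */
int main(void)
{
    FILE *f = fopen("/proc/net/protocols", "r");
    char line[512];

    if (!f) {
        perror("/proc/net/protocols");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        char name[64];
        long size, sockets, memory;

        /* columns: protocol  size  sockets  memory  ... */
        if (sscanf(line, "%63s %ld %ld %ld",
                   name, &size, &sockets, &memory) == 4 &&
            strcmp(name, "TCP") == 0)
            printf("TCP memory: %ld\n", memory);
    }
    fclose(f);
    return 0;
}

If that number climbs towards the tcp_mem limits during a transfer and
never comes back down, that would point at the leak theory.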

> 
> --b.
> 
> > 
> > 
> > > Let me know if you're able to narrow this down any more.
> > 
> > I bisected this issue to the following commit:
> > 
> > commit f03d78db65085609938fdb686238867e65003181
> > Author: Eric Dumazet <eric.dumazet@gmail.com>
> > Date:   Thu Jul 7 00:27:05 2011 -0700
> > 
> >     net: refine {udp|tcp|sctp}_mem limits
> > 
> >     Current tcp/udp/sctp global memory limits are not taking into account
> >     hugepages allocations, and allow 50% of ram to be used by buffers of a
> >     single protocol [ not counting space used by sockets / inodes ...]
> > 
> >     Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
> >     per protocol, and a minimum of 128 pages.
> >     Heavy duty machines sysadmins probably need to tweak limits anyway.
> > 
> > 
> > Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes
> > the behaviour here.
> > 
> > This machine has 4Gb of memory.  On 3.0, with this patch applied
> > (as it is part of 3.0), tcp_mem is like this:
> > 
> >   21228     28306   42456
> > 
> > with this patch reverted, tcp_mem shows:
> > 
> >   81216     108288  162432
> > 
> > and with these values, it works fine.
> > 
> > So it looks like something else goes wrong there,
> > which leads to all nfsds fighting with each other
> > for something and eating 100% of available CPU
> > instead of servicing clients.
> > 
> > For added fun, when setting tcp_mem to the "good" value
> > from "bad" value (after booting into kernel with that
> > patch applied), the problem is _not_ fixed.
> > 
> > Any further hints?
> > 
> > Thanks,
> > 
> > /mjt
> > 
> > >> On 31.05.2012 17:51, Michael Tokarev wrote:
> > >>> On 31.05.2012 17:46, Myklebust, Trond wrote:
> > >>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> > >>> []
> > >>>>> I started tcpdump:
> > >>>>>
> > >>>>>  tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
> > >>>>>
> > >>>>> on the client (192.168.88.2).  Next I mounted a directory on the client,
> > >>>>> and started reading (tar'ing) a directory into /dev/null.  It captured a
> > >>>>> few stalls.  Tcpdump shows number of packets it got, the stalls are at
> > >>>>> packet counts 58090, 97069 and 97071.  I cancelled the capture after that.
> > >>>>>
> > >>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> > >>>>> it is 220Mb uncompressed and 1.3Mb compressed.  The source files are
> > >>>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
> > >>>>> take place on disk at all.  This also makes it obvious that the issue
> > >>>>> does not depend on the speed of disk on the server (since in this case,
> > >>>>> the server disk isn't even in use).
> > >>>>
> > >>>> OK. So from the above file it looks as if the traffic is mainly READ
> > >>>> requests.
> > >>>
> > >>> The issue here happens only with reads.
> > >>>
> > >>>> In 2 places the server stops responding. In both cases, the client seems
> > >>>> to be sending a single TCP frame containing several COMPOUNDS containing
> > >>>> READ requests (which should be legal) just prior to the hang. When the
> > >>>> server doesn't respond, the client pings it with a RENEW, before it ends
> > >>>> up severing the TCP connection and then retransmitting.
> > >>>
> > >>> And sometimes -- speaking only from the behaviour I've seen, not from the
> > >>> actual frames sent -- the server does not respond to the RENEW either, in which
> > >>> case the client reports "nfs server not responding", and on the next
> > >>> renew it may actually respond.  This happens too, but much more rarely.
> > >>>
> > >>> During these stalls, ie, when there's no network activity at all,
> > >>> the server NFSD threads are busy eating all available CPU.
> > >>>
> > >>> What does it all tell us? :)
> > >>>
> > >>> Thank you!
> > >>>
> > >>> /mjt
> > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-17 16:00                         ` J. Bruce Fields
@ 2012-08-17 17:12                           ` Michael Tokarev
  2012-08-17 17:18                             ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-08-17 17:12 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On 17.08.2012 20:00, J. Bruce Fields wrote:
[]> Uh, if I grepped my way through this right: it looks like it's the
> "memory" column of the "TCP" row of /proc/net/protocols; might be
> interesting to see how that's changing over time.

This file does not look interesting.  Memory usage does not jump,
and there is no big increase either.

But there's something else which is interesting here.

I noticed that in perf top, the top consumer of CPU is svc_recv()
(I mentioned this at the start of this thread).  So I looked at how
this routine is called from nfsd.  And here we go.

fs/nfsd/nfssvc.c:

/*
 * This is the NFS server kernel thread
 */
static int
nfsd(void *vrqstp)
{
...
        /*
         * The main request loop
         */
        for (;;) {
                /*
                 * Find a socket with data available and call its
                 * recvfrom routine.
                 */
int i = 0;
                while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
                        ++i;
printk(KERN_ERR "calling svc_recv: %d times (err=%d)\n", i, err);
                if (err == -EINTR)
                        break;
...

(I added the "i" counter and the printk).  And here's the output:

[19626.401136] calling svc_recv: 0 times (err=212)
[19626.405059] calling svc_recv: 1478 times (err=212)
[19626.409512] calling svc_recv: 1106 times (err=212)
[19626.543020] calling svc_recv: 0 times (err=212)
[19626.543059] calling svc_recv: 0 times (err=212)
[19626.548074] calling svc_recv: 0 times (err=212)
[19626.549515] calling svc_recv: 0 times (err=212)
[19626.552320] calling svc_recv: 0 times (err=212)
[19626.553503] calling svc_recv: 0 times (err=212)
[19626.556007] calling svc_recv: 0 times (err=212)
[19626.557152] calling svc_recv: 0 times (err=212)
[19626.560109] calling svc_recv: 0 times (err=212)
[19626.560943] calling svc_recv: 0 times (err=212)
[19626.565315] calling svc_recv: 1067 times (err=212)
[19626.569735] calling svc_recv: 2571 times (err=212)
[19626.574150] calling svc_recv: 3842 times (err=212)
[19626.581914] calling svc_recv: 2891 times (err=212)
[19626.583072] calling svc_recv: 1247 times (err=212)
[19626.616885] calling svc_recv: 0 times (err=212)
[19626.616952] calling svc_recv: 0 times (err=212)
[19626.622889] calling svc_recv: 0 times (err=212)
[19626.624518] calling svc_recv: 0 times (err=212)
[19626.627118] calling svc_recv: 0 times (err=212)
[19626.629735] calling svc_recv: 0 times (err=212)
[19626.631777] calling svc_recv: 0 times (err=212)
[19626.633986] calling svc_recv: 0 times (err=212)
[19626.636746] calling svc_recv: 0 times (err=212)
[19626.637692] calling svc_recv: 0 times (err=212)
[19626.640769] calling svc_recv: 0 times (err=212)
[19626.657852] calling svc_recv: 0 times (err=212)
[19626.661602] calling svc_recv: 0 times (err=212)
[19626.670160] calling svc_recv: 0 times (err=212)
[19626.671917] calling svc_recv: 0 times (err=212)
[19626.684643] calling svc_recv: 0 times (err=212)
[19626.684680] calling svc_recv: 0 times (err=212)
[19626.812820] calling svc_recv: 0 times (err=212)
[19626.814697] calling svc_recv: 0 times (err=212)
[19626.817195] calling svc_recv: 0 times (err=212)
[19626.820324] calling svc_recv: 0 times (err=212)
[19626.822855] calling svc_recv: 0 times (err=212)
[19626.824823] calling svc_recv: 0 times (err=212)
[19626.828016] calling svc_recv: 0 times (err=212)
[19626.829021] calling svc_recv: 0 times (err=212)
[19626.831970] calling svc_recv: 0 times (err=212)

> the stall begins:
[19686.823135] calling svc_recv: 3670352 times (err=212)
[19686.823524] calling svc_recv: 3659205 times (err=212)

> transfer continues
[19686.854734] calling svc_recv: 0 times (err=212)
[19686.860023] calling svc_recv: 0 times (err=212)
[19686.887124] calling svc_recv: 0 times (err=212)
[19686.895532] calling svc_recv: 0 times (err=212)
[19686.903667] calling svc_recv: 0 times (err=212)
[19686.922780] calling svc_recv: 0 times (err=212)

So we're calling svc_recv in a tight loop, eating
all available CPU.  (The above is with just 2 nfsd
threads).

Something is definitely wrong here.  And it happens much more
often after the mentioned commit (f03d78db65085).

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-17 17:12                           ` Michael Tokarev
@ 2012-08-17 17:18                             ` J. Bruce Fields
  2012-08-17 17:26                               ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2012-08-17 17:18 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> On 17.08.2012 20:00, J. Bruce Fields wrote:
> []> Uh, if I grepped my way through this right: it looks like it's the
> > "memory" column of the "TCP" row of /proc/net/protocols; might be
> > interesting to see how that's changing over time.
> 
> This file does not look interesting.  Memory usage does not jump,
> and there is no big increase either.
> 
> But there's something else which is interesting here.
> 
> I noticed that in perf top, the top consumer of CPU is svc_recv()
> (I mentioned this at the start of this thread).  So I looked at how
> this routine is called from nfsd.  And here we go.
> 
> fs/nfsd/nfssvc.c:
> 
> /*
>  * This is the NFS server kernel thread
>  */
> static int
> nfsd(void *vrqstp)
> {
> ...
>         /*
>          * The main request loop
>          */
>         for (;;) {
>                 /*
>                  * Find a socket with data available and call its
>                  * recvfrom routine.
>                  */
> int i = 0;
>                 while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
>                         ++i;
> printk(KERN_ERR "calling svc_recv: %d times (err=%d)\n", i, err);
>                 if (err == -EINTR)
>                         break;
> ...
> 
> (I added the "i" counter and the printk).  And here's the output:
> 
> [19626.401136] calling svc_recv: 0 times (err=212)
> [19626.405059] calling svc_recv: 1478 times (err=212)
> [19626.409512] calling svc_recv: 1106 times (err=212)
> [19626.543020] calling svc_recv: 0 times (err=212)
> [19626.543059] calling svc_recv: 0 times (err=212)
> [19626.548074] calling svc_recv: 0 times (err=212)
> [19626.549515] calling svc_recv: 0 times (err=212)
> [19626.552320] calling svc_recv: 0 times (err=212)
> [19626.553503] calling svc_recv: 0 times (err=212)
> [19626.556007] calling svc_recv: 0 times (err=212)
> [19626.557152] calling svc_recv: 0 times (err=212)
> [19626.560109] calling svc_recv: 0 times (err=212)
> [19626.560943] calling svc_recv: 0 times (err=212)
> [19626.565315] calling svc_recv: 1067 times (err=212)
> [19626.569735] calling svc_recv: 2571 times (err=212)
> [19626.574150] calling svc_recv: 3842 times (err=212)
> [19626.581914] calling svc_recv: 2891 times (err=212)
> [19626.583072] calling svc_recv: 1247 times (err=212)
> [19626.616885] calling svc_recv: 0 times (err=212)
> [19626.616952] calling svc_recv: 0 times (err=212)
> [19626.622889] calling svc_recv: 0 times (err=212)
> [19626.624518] calling svc_recv: 0 times (err=212)
> [19626.627118] calling svc_recv: 0 times (err=212)
> [19626.629735] calling svc_recv: 0 times (err=212)
> [19626.631777] calling svc_recv: 0 times (err=212)
> [19626.633986] calling svc_recv: 0 times (err=212)
> [19626.636746] calling svc_recv: 0 times (err=212)
> [19626.637692] calling svc_recv: 0 times (err=212)
> [19626.640769] calling svc_recv: 0 times (err=212)
> [19626.657852] calling svc_recv: 0 times (err=212)
> [19626.661602] calling svc_recv: 0 times (err=212)
> [19626.670160] calling svc_recv: 0 times (err=212)
> [19626.671917] calling svc_recv: 0 times (err=212)
> [19626.684643] calling svc_recv: 0 times (err=212)
> [19626.684680] calling svc_recv: 0 times (err=212)
> [19626.812820] calling svc_recv: 0 times (err=212)
> [19626.814697] calling svc_recv: 0 times (err=212)
> [19626.817195] calling svc_recv: 0 times (err=212)
> [19626.820324] calling svc_recv: 0 times (err=212)
> [19626.822855] calling svc_recv: 0 times (err=212)
> [19626.824823] calling svc_recv: 0 times (err=212)
> [19626.828016] calling svc_recv: 0 times (err=212)
> [19626.829021] calling svc_recv: 0 times (err=212)
> [19626.831970] calling svc_recv: 0 times (err=212)
> 
> > the stall begins:
> [19686.823135] calling svc_recv: 3670352 times (err=212)
> [19686.823524] calling svc_recv: 3659205 times (err=212)
> 
> > transfer continues
> [19686.854734] calling svc_recv: 0 times (err=212)
> [19686.860023] calling svc_recv: 0 times (err=212)
> [19686.887124] calling svc_recv: 0 times (err=212)
> [19686.895532] calling svc_recv: 0 times (err=212)
> [19686.903667] calling svc_recv: 0 times (err=212)
> [19686.922780] calling svc_recv: 0 times (err=212)
> 
> So we're calling svc_recv in a tight loop, eating
> all available CPU.  (The above is with just 2 nfsd
> threads).
> 
> Something is definitely wrong here.  And it happens much more
> often after the mentioned commit (f03d78db65085).

Oh, neat.  Hm.  That commit doesn't really sound like the cause, then.
Is that busy-looping reproduceable on kernels before that commit?

--b.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-17 17:18                             ` J. Bruce Fields
@ 2012-08-17 17:26                               ` Michael Tokarev
  2012-08-17 17:29                                 ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-08-17 17:26 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On 17.08.2012 21:18, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
[]
>> So we're calling svc_recv in a tight loop, eating
>> all available CPU.  (The above is with just 2 nfsd
>> threads).
>>
>> Something is definitely wrong here.  And it happens much more
>> often after the mentioned commit (f03d78db65085).
> 
> Oh, neat.  Hm.  That commit doesn't really sound like the cause, then.
> Is that busy-looping reproduceable on kernels before that commit?

Note I bisected this issue to this commit.  I haven't seen it
happening before this commit, and reverting it from 3.0 or 3.2
kernel makes the problem go away.

I guess it is looping there:


net/sunrpc/svc_xprt.c:svc_recv()
...
        len = 0;
...
        if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
...
        } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {  <=== here -- has no wspace due to memory...
...  len = <something>
        }

        /* No data, incomplete (TCP) read, or accept() */
        if (len == 0 || len == -EAGAIN)
                goto out;
...
out:
        rqstp->rq_res.len = 0;
        svc_xprt_release(rqstp);
        return -EAGAIN;
}
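
In toy form, that has-wspace test seems to boil down to something like
the following (a simplified userspace model with made-up sizes, not the
actual kernel code):

#include <stdio.h>

/* the transport only counts as writable if the socket send space still
 * covers the replies already promised (xpt_reserved) plus one more
 * maximum-sized reply (sv_max_mesg) */
static int has_wspace(long send_space, long xpt_reserved, long sv_max_mesg)
{
    return send_space >= xpt_reserved + sv_max_mesg;
}

int main(void)
{
    long sv_max_mesg = 1024 * 1024;          /* assumed max reply size */
    long send_space  = 3 * sv_max_mesg / 2;  /* assumed socket send room */

    printf("nothing reserved   : %d\n", has_wspace(send_space, 0, sv_max_mesg));
    printf("one reply reserved : %d\n", has_wspace(send_space, sv_max_mesg, sv_max_mesg));
    return 0;
}

With the smaller tcp_mem limits the socket send space presumably stays
small, so this check keeps failing and svc_recv() keeps returning -EAGAIN.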

I'm trying to verify this theory...

/mjt

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-17 17:26                               ` Michael Tokarev
@ 2012-08-17 17:29                                 ` Michael Tokarev
  2012-08-17 19:18                                   ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-08-17 17:29 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On 17.08.2012 21:26, Michael Tokarev wrote:
> On 17.08.2012 21:18, J. Bruce Fields wrote:
>> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> []
>>> So we're calling svc_recv in a tight loop, eating
>>> all available CPU.  (The above is with just 2 nfsd
>>> threads).
>>>
>>> Something is definitely wrong here.  And it happens much more
>>> often after the mentioned commit (f03d78db65085).
>>
>> Oh, neat.  Hm.  That commit doesn't really sound like the cause, then.
>> Is that busy-looping reproduceable on kernels before that commit?
> 
> Note I bisected this issue to this commit.  I haven't seen it
> happening before this commit, and reverting it from 3.0 or 3.2
> kernel makes the problem go away.
> 
> I guess it is looping there:
> 
> 
> net/sunrpc/svc_xprt.c:svc_recv()
> ...
>         len = 0;
> ...
>         if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
> ...
>         } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {  <=== here -- has no wspace due to memory...
> ...  len = <something>
>         }
> 
>         /* No data, incomplete (TCP) read, or accept() */
>         if (len == 0 || len == -EAGAIN)
>                 goto out;
> ...
> out:
>         rqstp->rq_res.len = 0;
>         svc_xprt_release(rqstp);
>         return -EAGAIN;
> }
> 
> I'm trying to verify this theory...

Yes.  I inserted a printk there, and on all those millions of passes
through the EAGAIN loop, this printk triggers:

....
[21052.533053]  svc_recv: !has_wspace
[21052.533070]  svc_recv: !has_wspace
[21052.533087]  svc_recv: !has_wspace
[21052.533105]  svc_recv: !has_wspace
[21052.533122]  svc_recv: !has_wspace
[21052.533139]  svc_recv: !has_wspace
[21052.533156]  svc_recv: !has_wspace
[21052.533174]  svc_recv: !has_wspace
[21052.533191]  svc_recv: !has_wspace
[21052.533208]  svc_recv: !has_wspace
[21052.533226]  svc_recv: !has_wspace
[21052.533244]  svc_recv: !has_wspace
[21052.533265] calling svc_recv: 1228163 times (err=-4)
[21052.533403] calling svc_recv: 1226616 times (err=-4)
[21052.534520] nfsd: last server has exited, flushing export cache

(I stopped nfsd since it was flooding the log).

I can only guess that before that commit, we always had space,
now we don't anymore, and are looping like crazy.

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-17 17:29                                 ` Michael Tokarev
@ 2012-08-17 19:18                                   ` J. Bruce Fields
  2012-08-17 20:08                                     ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2012-08-17 19:18 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On Fri, Aug 17, 2012 at 09:29:40PM +0400, Michael Tokarev wrote:
> On 17.08.2012 21:26, Michael Tokarev wrote:
> > On 17.08.2012 21:18, J. Bruce Fields wrote:
> >> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> > []
> >>> So we're calling svc_recv in a tight loop, eating
> >>> all available CPU.  (The above is with just 2 nfsd
> >>> threads).
> >>>
> >>> Something is definitely wrong here.  And it happens much more
> >>> often after the mentioned commit (f03d78db65085).
> >>
> >> Oh, neat.  Hm.  That commit doesn't really sound like the cause, then.
> >> Is that busy-looping reproduceable on kernels before that commit?
> > 
> > Note I bisected this issue to this commit.  I haven't seen it
> > happening before this commit, and reverting it from 3.0 or 3.2
> > kernel makes the problem go away.
> > 
> > I guess it is looping there:
> > 
> > 
> > net/sunrpc/svc_xprt.c:svc_recv()
> > ...
> >         len = 0;
> > ...
> >         if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
> > ...
> >         } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {  <=== here -- has no wspace due to memory...
> > ...  len = <something>
> >         }
> > 
> >         /* No data, incomplete (TCP) read, or accept() */
> >         if (len == 0 || len == -EAGAIN)
> >                 goto out;
> > ...
> > out:
> >         rqstp->rq_res.len = 0;
> >         svc_xprt_release(rqstp);
> >         return -EAGAIN;
> > }
> > 
> > I'm trying to verify this theory...
> 
> Yes.  I inserted a printk there, and on all those millions of passes
> through the EAGAIN loop, this printk triggers:
> 
> ....
> [21052.533053]  svc_recv: !has_wspace
> [21052.533070]  svc_recv: !has_wspace
> [21052.533087]  svc_recv: !has_wspace
> [21052.533105]  svc_recv: !has_wspace
> [21052.533122]  svc_recv: !has_wspace
> [21052.533139]  svc_recv: !has_wspace
> [21052.533156]  svc_recv: !has_wspace
> [21052.533174]  svc_recv: !has_wspace
> [21052.533191]  svc_recv: !has_wspace
> [21052.533208]  svc_recv: !has_wspace
> [21052.533226]  svc_recv: !has_wspace
> [21052.533244]  svc_recv: !has_wspace
> [21052.533265] calling svc_recv: 1228163 times (err=-4)
> [21052.533403] calling svc_recv: 1226616 times (err=-4)
> [21052.534520] nfsd: last server has exited, flushing export cache
> 
> (I stopped nfsd since it was flooding the log).
> 
> I can only guess that before that commit, we always had space,
> now we don't anymore, and are looping like crazy.

Thanks!  But, arrgh--that should be enough to go on at this point, but
I'm not seeing it.  If has_wspace is returning false then it's likely
also returning false to the call at the start of svc_xprt_enqueue() (see
svc_xprt_has_something_to_do), which means the xprt shouldn't be getting
requeued and the next svc_recv call should find no socket ready (so
svc_xprt_dequeue() returns NULL), and go to sleep.

But clearly it's not working that way....

--b.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-17 19:18                                   ` J. Bruce Fields
@ 2012-08-17 20:08                                     ` J. Bruce Fields
  2012-08-17 22:32                                       ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2012-08-17 20:08 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On Fri, Aug 17, 2012 at 03:18:00PM -0400, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 09:29:40PM +0400, Michael Tokarev wrote:
> > On 17.08.2012 21:26, Michael Tokarev wrote:
> > > On 17.08.2012 21:18, J. Bruce Fields wrote:
> > >> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> > > []
> > >>> So we're calling svc_recv in a tight loop, eating
> > >>> all available CPU.  (The above is with just 2 nfsd
> > >>> threads).
> > >>>
> > >>> Something is definitely wrong here.  And it happens much more
> > >>> often after the mentioned commit (f03d78db65085).
> > >>
> > >> Oh, neat.  Hm.  That commit doesn't really sound like the cause, then.
> > >> Is that busy-looping reproduceable on kernels before that commit?
> > > 
> > > Note I bisected this issue to this commit.  I haven't seen it
> > > happening before this commit, and reverting it from 3.0 or 3.2
> > > kernel makes the problem go away.
> > > 
> > > I guess it is looping there:
> > > 
> > > 
> > > net/sunrpc/svc_xprt.c:svc_recv()
> > > ...
> > >         len = 0;
> > > ...
> > >         if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
> > > ...
> > >         } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {  <=== here -- has no wspace due to memory...
> > > ...  len = <something>
> > >         }
> > > 
> > >         /* No data, incomplete (TCP) read, or accept() */
> > >         if (len == 0 || len == -EAGAIN)
> > >                 goto out;
> > > ...
> > > out:
> > >         rqstp->rq_res.len = 0;
> > >         svc_xprt_release(rqstp);
> > >         return -EAGAIN;
> > > }
> > > 
> > > I'm trying to verify this theory...
> > 
> > Yes.  I inserted a printk there, and on all those millions of passes
> > through the EAGAIN loop, this printk triggers:
> > 
> > ....
> > [21052.533053]  svc_recv: !has_wspace
> > [21052.533070]  svc_recv: !has_wspace
> > [21052.533087]  svc_recv: !has_wspace
> > [21052.533105]  svc_recv: !has_wspace
> > [21052.533122]  svc_recv: !has_wspace
> > [21052.533139]  svc_recv: !has_wspace
> > [21052.533156]  svc_recv: !has_wspace
> > [21052.533174]  svc_recv: !has_wspace
> > [21052.533191]  svc_recv: !has_wspace
> > [21052.533208]  svc_recv: !has_wspace
> > [21052.533226]  svc_recv: !has_wspace
> > [21052.533244]  svc_recv: !has_wspace
> > [21052.533265] calling svc_recv: 1228163 times (err=-4)
> > [21052.533403] calling svc_recv: 1226616 times (err=-4)
> > [21052.534520] nfsd: last server has exited, flushing export cache
> > 
> > (I stopped nfsd since it was flooding the log).
> > 
> > I can only guess that before that commit, we always had space,
> > now we don't anymore, and are looping like crazy.
> 
> Thanks!  But, arrgh--that should be enough to go on at this point, but
> I'm not seeing it.  If has_wspace is returning false then it's likely
> also returning false to the call at the start of svc_xprt_enqueue()

Wait a minute, that assumption's a problem because that calculation
depends in part on xpt_reserved, which is changed here....

In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
lower xpt_reserved value.  That could well explain this.
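
As a toy model (made-up sizes; plain userspace code standing in for the
sunrpc internals), that accounting would go something like:

#include <stdio.h>

#define SV_MAX_MESG   (1024 * 1024)           /* assumed max reply size */
#define SEND_SPACE    (SV_MAX_MESG * 3 / 2)   /* assumed socket send space */

static long xpt_reserved;   /* reply space already promised on this xprt */

static int has_wspace(void)
{
    return SEND_SPACE >= xpt_reserved + SV_MAX_MESG;
}

int main(void)
{
    /* svc_recv() reserves reply space for the request up front ... */
    xpt_reserved += SV_MAX_MESG;
    printf("during svc_recv       : has_wspace = %d\n", has_wspace());

    /* ... finds no wspace and bails out with -EAGAIN; on the way out
     * svc_xprt_release() -> svc_reserve(rqstp, 0) drops the reservation
     * and calls svc_xprt_enqueue() with the lower xpt_reserved ... */
    xpt_reserved -= SV_MAX_MESG;
    printf("after svc_xprt_release: has_wspace = %d\n", has_wspace());

    /* ... so the xprt gets requeued, the next svc_recv() reserves again,
     * fails again, releases again -- the busy loop. */
    return 0;
}

The xprt looks full while svc_recv() holds the reservation, but looks
writable again by the time svc_xprt_enqueue() re-checks it.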

--b.

> (see
> svc_xprt_has_something_to_do), which means the xprt shouldn't be getting
> requeued and the next svc_recv call should find no socket ready (so
> svc_xprt_dequeue() returns NULL), and go to sleep.
> 
> But clearly it's not working that way....
> 
> --b.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-17 20:08                                     ` J. Bruce Fields
@ 2012-08-17 22:32                                       ` J. Bruce Fields
  2012-08-18  6:49                                         ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2012-08-17 22:32 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote:
> Wait a minute, that assumption's a problem because that calculation
> depends in part on xpt_reserved, which is changed here....
> 
> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
> lower xpt_reserved value.  That could well explain this.

So, maybe something like this?

--b.

commit c8136c319ad85d0db870021fc3f9074d37f26d4a
Author: J. Bruce Fields <bfields@redhat.com>
Date:   Fri Aug 17 17:31:53 2012 -0400

    svcrpc: don't add to xpt_reserved till we receive
    
    The rpc server tries to ensure that there will be room to send a reply
    before it receives a request.
    
    It does this by tracking, in xpt_reserved, an upper bound on the total
    size of the replies that it has already committed to for the socket.
    
    Currently it is adding in the estimate for a new reply *before* it
    checks whether there is space available.  If it finds that there is not
    space, it then subtracts the estimate back out.
    
    This may lead the subsequent svc_xprt_enqueue to decide that there is
    space after all.
    
    The result is a svc_recv() that will repeatedly return -EAGAIN, causing
    server threads to loop without doing any actual work.
    
    Reported-by: Michael Tokarev <mjt@tls.msk.ru>
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index ec99849a..59ff3a3 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -366,8 +366,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 				rqstp, rqstp->rq_xprt);
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 		pool->sp_stats.threads_woken++;
 		wake_up(&rqstp->rq_wait);
 	} else {
@@ -644,8 +642,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 	if (xprt) {
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 
 		/* As there is a shortage of threads and this request
 		 * had to be queued, don't allow the thread to wait so
@@ -743,6 +739,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 			len = xprt->xpt_ops->xpo_recvfrom(rqstp);
 		dprintk("svc: got len=%d\n", len);
 	}
+	if (len > 0) {
+		rqstp->rq_reserved = serv->sv_max_mesg;
+		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
+	}
 	svc_xprt_received(xprt);
 
 	/* No data, incomplete (TCP) read, or accept() */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-17 22:32                                       ` J. Bruce Fields
@ 2012-08-18  6:49                                         ` Michael Tokarev
  2012-08-18 11:13                                           ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Tokarev @ 2012-08-18  6:49 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On 18.08.2012 02:32, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote:
>> Wait a minute, that assumption's a problem because that calculation
>> depends in part on xpt_reserved, which is changed here....
>>
>> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
>> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
>> lower xpt_reserved value.  That could well explain this.
> 
> So, maybe something like this?

Well.  What can I say?  With the change below applied (to 3.2 kernel
at least), I don't see any stalls or high CPU usage on the server
anymore.  It survived several multi-gigabyte transfers, for several
hours, without any problem.  So it is a good step forward ;)

But the whole thing seems to be quite a bit fragile.  I tried to follow
the logic in there, and the thing is quite a bit, well, "twisted", and
somewhat difficult to follow.  So I don't know if this is the right
fix or not.  At least it works! :)

And I really wonder why no one else reported this problem before.
Am I the only one in this world who uses Linux nfsd? :)

Thank you for all your patience and the proposed fix!

/mjt

> commit c8136c319ad85d0db870021fc3f9074d37f26d4a
> Author: J. Bruce Fields <bfields@redhat.com>
> Date:   Fri Aug 17 17:31:53 2012 -0400
> 
>     svcrpc: don't add to xpt_reserved till we receive
>     
>     The rpc server tries to ensure that there will be room to send a reply
>     before it receives a request.
>     
>     It does this by tracking, in xpt_reserved, an upper bound on the total
>     size of the replies that it has already committed to for the socket.
>     
>     Currently it is adding in the estimate for a new reply *before* it
>     checks whether there is space available.  If it finds that there is not
>     space, it then subtracts the estimate back out.
>     
>     This may lead the subsequent svc_xprt_enqueue to decide that there is
>     space after all.
>     
>     The result is a svc_recv() that will repeatedly return -EAGAIN, causing
>     server threads to loop without doing any actual work.
>     
>     Reported-by: Michael Tokarev <mjt@tls.msk.ru>
>     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> 
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index ec99849a..59ff3a3 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -366,8 +366,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
>  				rqstp, rqstp->rq_xprt);
>  		rqstp->rq_xprt = xprt;
>  		svc_xprt_get(xprt);
> -		rqstp->rq_reserved = serv->sv_max_mesg;
> -		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
>  		pool->sp_stats.threads_woken++;
>  		wake_up(&rqstp->rq_wait);
>  	} else {
> @@ -644,8 +642,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
>  	if (xprt) {
>  		rqstp->rq_xprt = xprt;
>  		svc_xprt_get(xprt);
> -		rqstp->rq_reserved = serv->sv_max_mesg;
> -		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
>  
>  		/* As there is a shortage of threads and this request
>  		 * had to be queued, don't allow the thread to wait so
> @@ -743,6 +739,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
>  			len = xprt->xpt_ops->xpo_recvfrom(rqstp);
>  		dprintk("svc: got len=%d\n", len);
>  	}
> +	if (len > 0) {
> +		rqstp->rq_reserved = serv->sv_max_mesg;
> +		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> +	}
>  	svc_xprt_received(xprt);
>  
>  	/* No data, incomplete (TCP) read, or accept() */


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-18  6:49                                         ` Michael Tokarev
@ 2012-08-18 11:13                                           ` J. Bruce Fields
  2012-08-18 12:58                                             ` Michael Tokarev
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2012-08-18 11:13 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On Sat, Aug 18, 2012 at 10:49:31AM +0400, Michael Tokarev wrote:
> On 18.08.2012 02:32, J. Bruce Fields wrote:
> > On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote:
> >> Wait a minute, that assumption's a problem because that calculation
> >> depends in part on xpt_reserved, which is changed here....
> >>
> >> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
> >> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
> >> lower xpt_reserved value.  That could well explain this.
> > 
> > So, maybe something like this?
> 
> Well.  What can I say?  With the change below applied (to 3.2 kernel
> at least), I don't see any stalls or high CPU usage on the server
> anymore.  It survived several multi-gigabyte transfers, for several
> hours, without any problem.  So it is a good step forward ;)
> 
> But the whole thing seems to be quite a bit fragile.  I tried to follow
> the logic in there, and the thing is quite a bit, well, "twisted", and
> somewhat difficult to follow.  So I don't know if this is the right
> fix or not.  At least it works! :)

Suggestions welcomed.

> And I really wonder why no one else reported this problem before.
>> Am I the only one in this world who uses Linux nfsd? :)

This, for example:

	http://marc.info/?l=linux-nfs&m=134131915612287&w=2

may well describe the same problem....  It just needed some debugging
persistence, thanks!

--b.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 3.0+ NFS issues (bisected)
  2012-08-18 11:13                                           ` J. Bruce Fields
@ 2012-08-18 12:58                                             ` Michael Tokarev
  0 siblings, 0 replies; 24+ messages in thread
From: Michael Tokarev @ 2012-08-18 12:58 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Myklebust, Trond, linux-nfs, Linux-kernel, Eric Dumazet

On 18.08.2012 15:13, J. Bruce Fields wrote:
> On Sat, Aug 18, 2012 at 10:49:31AM +0400, Michael Tokarev wrote:
[]
>> Well.  What can I say?  With the change below applied (to 3.2 kernel
>> at least), I don't see any stalls or high CPU usage on the server
>> anymore.  It survived several multi-gigabyte transfers, for several
>> hours, without any problem.  So it is a good step forward ;)
>>
>> But the whole thing seems to be quite a bit fragile.  I tried to follow
>> the logic in there, and the thing is quite a bit, well, "twisted", and
>> somewhat difficult to follow.  So I don't know if this is the right
>> fix or not.  At least it works! :)
> 
> Suggestions welcomed.

Ok...

Meanwhile, you can add my
Tested-By: Michael Tokarev <mjt@tls.msk.ru>

to the patch.

>> And I really wonder why no one else reported this problem before.
>> Am I the only one in this world who uses Linux nfsd? :)
> 
> This, for example:
> 
> 	http://marc.info/?l=linux-nfs&m=134131915612287&w=2
> 
> may well describe the same problem....  It just needed some debugging
> persistence, thanks!

Ah.  I tried to find something when I initially
sent this report, but wasn't able to.  Apparently
I'm not alone with this problem indeed!

Thank you for all the work!

/mjt

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread

Thread overview: 24+ messages
2012-05-25  6:53 3.0+ NFS issues Michael Tokarev
2012-05-29 15:24 ` J. Bruce Fields
2012-05-30  7:11   ` Michael Tokarev
2012-05-30 13:25     ` J. Bruce Fields
2012-05-31  6:47       ` Michael Tokarev
2012-05-31 12:59         ` Myklebust, Trond
2012-05-31 13:24           ` Michael Tokarev
2012-05-31 13:46             ` Myklebust, Trond
2012-05-31 13:51               ` Michael Tokarev
2012-07-10 12:52                 ` Michael Tokarev
2012-07-12 12:53                   ` J. Bruce Fields
2012-08-17  1:56                     ` 3.0+ NFS issues (bisected) Michael Tokarev
2012-08-17 14:56                       ` J. Bruce Fields
2012-08-17 16:00                         ` J. Bruce Fields
2012-08-17 17:12                           ` Michael Tokarev
2012-08-17 17:18                             ` J. Bruce Fields
2012-08-17 17:26                               ` Michael Tokarev
2012-08-17 17:29                                 ` Michael Tokarev
2012-08-17 19:18                                   ` J. Bruce Fields
2012-08-17 20:08                                     ` J. Bruce Fields
2012-08-17 22:32                                       ` J. Bruce Fields
2012-08-18  6:49                                         ` Michael Tokarev
2012-08-18 11:13                                           ` J. Bruce Fields
2012-08-18 12:58                                             ` Michael Tokarev
