* high latency NFS
@ 2008-07-24 17:11 Michael Shuey
  2008-07-30 19:21 ` J. Bruce Fields
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Michael Shuey @ 2008-07-24 17:11 UTC (permalink / raw)
  To: linux-kernel

I'm currently toying with Linux's NFS, to see just how fast it can go in a 
high-latency environment.  Right now, I'm simulating a 100ms delay between 
client and server with netem (just 100ms on the outbound packets from the 
client, rather than 50ms each way).  Oddly enough, I'm running into 
performance problems. :-)

According to iozone, my server can sustain about 90/85 MB/s (reads/writes) 
without any latency added.  After a pile of tweaks, and injecting 100ms of 
netem latency, I'm getting 6/40 MB/s (reads/writes).  I'd really like to 
know why writes are now so much faster than reads, and what sort of things 
might boost the read throughput.  Any suggestions?

The read throughput seems to be proportional to the latency - adding only 
10ms of delay gives 61 MB/s reads, in limited testing (need to look at it 
further).  While that's to be expected, to some extent, I'm hoping there's 
some form of readahead that can help me out here (assume big sequential 
reads).

iozone is reading/writing a file twice the size of memory on the client with 
a 32k block size.  I've tried raising this as high as 16 MB, but I still 
see around 6 MB/sec reads.

I'm using a 2.6.9 derivative (yes, I'm a RHEL4 fan).  Testing with a stock 
2.6, client and server, is the next order of business.

NFS mount is tcp, version 3.  rsize/wsize are 32k.  Both client and server 
have had tcp_rmem, tcp_wmem, wmem_max, rmem_max, wmem_default, and 
rmem_default tuned - tuning values are 12500000 for defaults (and minimum 
window sizes), 25000000 for the maximums.  Inefficient, yes, but I'm not 
concerned with memory efficiency at the moment.
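
Concretely, the tuning amounts to something like this (a rough sketch 
only - the sysctl paths are the standard net.core/net.ipv4 ones for the 
tunables named above, the values are the ones quoted here, and error 
handling is minimal):

/* Sketch: apply the TCP buffer tuning described above.
 * The 12500000/25000000 figures are the ones quoted in this mail,
 * not recommendations. */
#include <stdio.h>

static void set_sysctl(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	set_sysctl("/proc/sys/net/core/rmem_default", "12500000");
	set_sysctl("/proc/sys/net/core/wmem_default", "12500000");
	set_sysctl("/proc/sys/net/core/rmem_max",     "25000000");
	set_sysctl("/proc/sys/net/core/wmem_max",     "25000000");
	/* min, default, max */
	set_sysctl("/proc/sys/net/ipv4/tcp_rmem", "12500000 12500000 25000000");
	set_sysctl("/proc/sys/net/ipv4/tcp_wmem", "12500000 12500000 25000000");
	return 0;
}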

Both client and server kernels have been modified to provide 
larger-than-normal RPC slot tables.  I allow a max of 1024, but I've found 
that actually enabling more than 490 entries in /proc causes mount to 
complain it can't allocate memory and die.  That was somewhat surprising, 
given I had 122 GB of free memory at the time...

I've also applied a couple patches to allow the NFS readahead to be a 
tunable number of RPC slots.  Currently, I set this to 489 on client and 
server (so it's one less than the max number of RPC slots).  Bandwidth 
delay product math says 380ish slots should be enough to keep a gigabit 
line full, so I suspect something else is preventing me from seeing the 
readahead I expect.
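
For reference, the bandwidth-delay product math above, assuming a 
1 Gbit/s link, the added 100ms of delay, and 32k per READ RPC (a rough 
estimate, not a measurement):

/* Back-of-the-envelope BDP estimate for the setup described above. */
#include <stdio.h>

int main(void)
{
	double link_bps  = 1e9;         /* gigabit ethernet */
	double rtt_s     = 0.1;         /* 100ms of netem delay */
	double rpc_bytes = 32 * 1024;   /* rsize = 32k */

	double bdp_bytes = link_bps / 8.0 * rtt_s;  /* bytes in flight */
	double slots     = bdp_bytes / rpc_bytes;   /* concurrent READs */

	printf("BDP %.0f bytes, ~%.0f outstanding 32k RPCs\n",
	       bdp_bytes, slots);
	return 0;
}

That works out to about 12.5 MB in flight, or roughly 381 outstanding 
32k READs - hence the 380ish figure.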

FYI, client and server are connected via gigabit ethernet.  There's a couple 
routers in the way, but they talk at 10gigE and can route wire speed.  
Traffic is IPv4, path MTU size is 9000 bytes.

Is there anything I'm missing?

-- 
Mike Shuey
Purdue University/ITaP


* Re: high latency NFS
  2008-07-24 17:11 high latency NFS Michael Shuey
@ 2008-07-30 19:21 ` J. Bruce Fields
  2008-07-30 21:40   ` Shehjar Tikoo
  2008-08-04  8:04   ` Greg Banks
  2008-07-31  0:07 ` Lee Revell
  2008-07-31 18:06 ` Enrico Weigelt
  2 siblings, 2 replies; 22+ messages in thread
From: J. Bruce Fields @ 2008-07-30 19:21 UTC (permalink / raw)
  To: Michael Shuey; +Cc: linux-kernel, linux-nfs, rees, aglo

You might get more responses from the linux-nfs list (cc'd).

--b.

On Thu, Jul 24, 2008 at 01:11:31PM -0400, Michael Shuey wrote:
> I'm currently toying with Linux's NFS, to see just how fast it can go in a 
> high-latency environment.  Right now, I'm simulating a 100ms delay between 
> client and server with netem (just 100ms on the outbound packets from the 
> client, rather than 50ms each way).  Oddly enough, I'm running into 
> performance problems. :-)
> 
> According to iozone, my server can sustain about 90/85 MB/s (reads/writes) 
> without any latency added.  After a pile of tweaks, and injecting 100ms of 
> netem latency, I'm getting 6/40 MB/s (reads/writes).  I'd really like to 
> know why writes are now so much faster than reads, and what sort of things 
> might boost the read throughput.  Any suggestions?
> 
> The read throughput seems to be proportional to the latency - adding only 
> 10ms of delay gives 61 MB/s reads, in limited testing (need to look at it 
> further).  While that's to be expected, to some extent, I'm hoping there's 
> some form of readahead that can help me out here (assume big sequential 
> reads).
> 
> iozone is reading/writing a file twice the size of memory on the client with 
> a 32k block size.  I've tried raising this as high as 16 MB, but I still 
> see around 6 MB/sec reads.
> 
> I'm using a 2.6.9 derivative (yes, I'm a RHEL4 fan).  Testing with a stock 
> 2.6, client and server, is the next order of business.
> 
> NFS mount is tcp, version 3.  rsize/wsize are 32k.  Both client and server 
> have had tcp_rmem, tcp_wmem, wmem_max, rmem_max, wmem_default, and 
> rmem_default tuned - tuning values are 12500000 for defaults (and minimum 
> window sizes), 25000000 for the maximums.  Inefficient, yes, but I'm not 
> concerned with memory efficiency at the moment.
> 
> Both client and server kernels have been modified to provide 
> larger-than-normal RPC slot tables.  I allow a max of 1024, but I've found 
> that actually enabling more than 490 entries in /proc causes mount to 
> complain it can't allocate memory and die.  That was somewhat surprising, 
> given I had 122 GB of free memory at the time...
> 
> I've also applied a couple patches to allow the NFS readahead to be a 
> tunable number of RPC slots.  Currently, I set this to 489 on client and 
> server (so it's one less than the max number of RPC slots).  Bandwidth 
> delay product math says 380ish slots should be enough to keep a gigabit 
> line full, so I suspect something else is preventing me from seeing the 
> readahead I expect.
> 
> FYI, client and server are connected via gigabit ethernet.  There's a couple 
> routers in the way, but they talk at 10gigE and can route wire speed.  
> Traffic is IPv4, path MTU size is 9000 bytes.
> 
> Is there anything I'm missing?
> 
> -- 
> Mike Shuey
> Purdue University/ITaP


* Re: high latency NFS
  2008-07-30 19:21 ` J. Bruce Fields
@ 2008-07-30 21:40   ` Shehjar Tikoo
  2008-07-31  2:35     ` Michael Shuey
  2008-08-04  8:04   ` Greg Banks
  1 sibling, 1 reply; 22+ messages in thread
From: Shehjar Tikoo @ 2008-07-30 21:40 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Michael Shuey, linux-kernel, linux-nfs, rees, aglo

J. Bruce Fields wrote:
> You might get more responses from the linux-nfs list (cc'd).
> 
> --b.
> 
> On Thu, Jul 24, 2008 at 01:11:31PM -0400, Michael Shuey wrote:
>> I'm currently toying with Linux's NFS, to see just how fast it 
>> can go in a high-latency environment.  Right now, I'm simulating
>>  a 100ms delay between client and server with netem (just 100ms
>> on the outbound packets from the client, rather than 50ms each
>> way). Oddly enough, I'm running into performance problems. :-)
>> 
>> According to iozone, my server can sustain about 90/85 MB/s 
>> (reads/writes) without any latency added.  After a pile of 
>> tweaks, and injecting 100ms of netem latency, I'm getting 6/40 
>> MB/s (reads/writes).  I'd really like to know why writes are now
>>  so much faster than reads, and what sort of things might boost 
>> the read throughput.  Any suggestions?

Is the server sync or async mounted? I've seen such performance
inversion between read and write when the mount mode is async.

What is the number of nfsd threads at the server?

Which file system are you using at the server?

>> The read throughput seems to be proportional to the latency - 
>> adding only 10ms of delay gives 61 MB/s reads, in limited testing
>>  (need to look at it further).  While that's to be expected, to 
>> some extent, I'm hoping there's some form of readahead that can 
>> help me out here (assume big sequential reads).
>> 
>> iozone is reading/writing a file twice the size of memory on the
>>  client with a 32k block size.  I've tried raising this as high
>> as 16 MB, but I still see around 6 MB/sec reads.

In iozone, are you running the read and write tests during the same run
of iozone?  Iozone runs the read tests after the writes, so that the file
for the read test exists on the server.  You should try running the write
and read tests in separate runs, to prevent client-side caching issues
from influencing raw server read (and read-ahead) performance.  You can
use the -w option in iozone to prevent iozone from unlinking the file
after the write test has finished, so you can use the same file in a
separate read-test run.


>> 
>> I'm using a 2.6.9 derivative (yes, I'm a RHEL4 fan).  Testing 
>> with a stock 2.6, client and server, is the next order of 
>> business.

You can try building the kernel with oprofile support and using it to
measure where the client CPU is spending its time.  It is possible that
client-side locking or other algorithmic issues are resulting in such
low read throughput.  Note: when you start oprofile, use a CPU_CYCLES
sample count of 5000; I've observed more accurate results with this
sample size for NFS performance.

>> 
>> NFS mount is tcp, version 3.  rsize/wsize are 32k.  Both client 
>> and server have had tcp_rmem, tcp_wmem, wmem_max, rmem_max, 
>> wmem_default, and rmem_default tuned - tuning values are 12500000
>>  for defaults (and minimum window sizes), 25000000 for the 
>> maximums.  Inefficient, yes, but I'm not concerned with memory 
>> efficiency at the moment.
>> 
>> Both client and server kernels have been modified to provide 
>> larger-than-normal RPC slot tables.  I allow a max of 1024, but 
>> I've found that actually enabling more than 490 entries in /proc
>>  causes mount to complain it can't allocate memory and die.  That
>>  was somewhat surprising, given I had 122 GB of free memory at the
>>  time...
>> 
>> I've also applied a couple patches to allow the NFS readahead to
>>  be a tunable number of RPC slots.  Currently, I set this to 489
>>  on client and server (so it's one less than the max number of
>> RPC slots).  Bandwidth delay product math says 380ish slots
>> should be enough to keep a gigabit line full, so I suspect
>> something else is preventing me from seeing the readahead I
>> expect.
>> 
>> FYI, client and server are connected via gigabit ethernet. 
>> There's a couple routers in the way, but they talk at 10gigE and
>>  can route wire speed. Traffic is IPv4, path MTU size is 9000 
>> bytes.
>> 

The following is not completely relevant here, but just to get some
more info:
What is the raw TCP throughput that you get between the server and
client machines on this network?

You could also run the tests with the bare minimum number of network
elements between the server and the client, to see what's the best
network performance for NFS you can extract from this server and
client machine.


>> Is there anything I'm missing?
>> 
>> -- Mike Shuey Purdue University/ITaP


* Re: high latency NFS
  2008-07-24 17:11 high latency NFS Michael Shuey
  2008-07-30 19:21 ` J. Bruce Fields
@ 2008-07-31  0:07 ` Lee Revell
  2008-07-31 18:06 ` Enrico Weigelt
  2 siblings, 0 replies; 22+ messages in thread
From: Lee Revell @ 2008-07-31  0:07 UTC (permalink / raw)
  To: shuey; +Cc: linux-kernel

On Thu, Jul 24, 2008 at 1:11 PM, Michael Shuey <shuey@purdue.edu> wrote:
> NFS mount is tcp, version 3.  rsize/wsize are 32k.  Both client and server
> have had tcp_rmem, tcp_wmem, wmem_max, rmem_max, wmem_default, and
> rmem_default tuned - tuning values are 12500000 for defaults (and minimum
> window sizes), 25000000 for the maximums.  Inefficient, yes, but I'm not
> concerned with memory efficiency at the moment.

Try using UDP.  I had to move a 10TB+ data warehouse over a 100Mbit
WAN link not long ago and this was the only method that gave me close
to wire speed.

Lee


* Re: high latency NFS
  2008-07-30 21:40   ` Shehjar Tikoo
@ 2008-07-31  2:35     ` Michael Shuey
  2008-07-31  3:15       ` J. Bruce Fields
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Shuey @ 2008-07-31  2:35 UTC (permalink / raw)
  To: Shehjar Tikoo; +Cc: J. Bruce Fields, linux-kernel, linux-nfs, rees, aglo

Thanks for all the tips I've received this evening.  However, I figured out 
the problem late last night. :-)

I was only using the default 8 nfsd threads on the server.  When I raised 
this to 256, the read bandwidth went from about 6 MB/sec to about 95 
MB/sec, at 100ms of netem-induced latency.  Not too shabby.  I can get 
about 993 Mbps on the gigE link between client and server, or 124 MB/sec 
max, so this is about 76% of wire speed.  Network connections pass through 
three switches, at least one of which acts as a router, so I'm feeling 
pretty good about things so far.

FYI, the server is using an ext3 file system, on top of a 10 GB /dev/ram0 
ramdisk (exported async, mounted async).  Oddly enough, /dev/ram0 seems a 
bit slower than tmpfs and a loopback-mounted file - go figure.

To avoid confusing this with cache effects, I'm using iozone on an 8GB file 
from a client with only 4GB of memory.  Like I said, I'm mainly interested 
in large file performance. :-)

-- 
Mike Shuey
Purdue University/ITaP


On Wednesday 30 July 2008, Shehjar Tikoo wrote:
> J. Bruce Fields wrote:
> > You might get more responses from the linux-nfs list (cc'd).
> >
> > --b.
> >
> > On Thu, Jul 24, 2008 at 01:11:31PM -0400, Michael Shuey wrote:
> >> I'm currently toying with Linux's NFS, to see just how fast it
> >> can go in a high-latency environment.  Right now, I'm simulating
> >>  a 100ms delay between client and server with netem (just 100ms
> >> on the outbound packets from the client, rather than 50ms each
> >> way). Oddly enough, I'm running into performance problems. :-)
> >>
> >> According to iozone, my server can sustain about 90/85 MB/s
> >> (reads/writes) without any latency added.  After a pile of
> >> tweaks, and injecting 100ms of netem latency, I'm getting 6/40
> >> MB/s (reads/writes).  I'd really like to know why writes are now
> >>  so much faster than reads, and what sort of things might boost
> >> the read throughput.  Any suggestions?
>
> Is the server sync or async mounted? I've seen such performance
> inversion between read and write when the mount mode is async.
>
> What is the number of nfsd threads at the server?
>
> Which file system are you using at the server?
>
> >> The read throughput seems to be proportional to the latency -
> >> adding only 10ms of delay gives 61 MB/s reads, in limited testing
> >>  (need to look at it further).  While that's to be expected, to
> >> some extent, I'm hoping there's some form of readahead that can
> >> help me out here (assume big sequential reads).
> >>
> >> iozone is reading/writing a file twice the size of memory on the
> >>  client with a 32k block size.  I've tried raising this as high
> >> as 16 MB, but I still see around 6 MB/sec reads.
>
> In iozone, are you running the read and write test during the same run
> of iozone? Iozone runs read tests, after writes so that the file for
> the read test exists on the server. You should try running write and
> read tests in separate runs to prevent client side caching issues from
> influencing raw server read(and read-ahead) performance. You can use
> the -w option in iozone to prevent iozone from calling unlink on the
> file after the write test has finished, so you can use the same file
> in a separate read test run.
>
> >> I'm using a 2.6.9 derivative (yes, I'm a RHEL4 fan).  Testing
> >> with a stock 2.6, client and server, is the next order of
> >> business.
>
> You can try building the kernel with oprofile support and use it to
> measure where the client CPU is spending its time. It is possible that
> client-side locking or other algorithm issues are resulting in such
> low read throughput. Note, when you start oprofile profiling, use a
> CPU_CYCLES count of 5000. I've observed more accurate results with
> this sample size for NFS performance.
>
> >> NFS mount is tcp, version 3.  rsize/wsize are 32k.  Both client
> >> and server have had tcp_rmem, tcp_wmem, wmem_max, rmem_max,
> >> wmem_default, and rmem_default tuned - tuning values are 12500000
> >>  for defaults (and minimum window sizes), 25000000 for the
> >> maximums.  Inefficient, yes, but I'm not concerned with memory
> >> efficiency at the moment.
> >>
> >> Both client and server kernels have been modified to provide
> >> larger-than-normal RPC slot tables.  I allow a max of 1024, but
> >> I've found that actually enabling more than 490 entries in /proc
> >>  causes mount to complain it can't allocate memory and die.  That
> >>  was somewhat surprising, given I had 122 GB of free memory at the
> >>  time...
> >>
> >> I've also applied a couple patches to allow the NFS readahead to
> >>  be a tunable number of RPC slots.  Currently, I set this to 489
> >>  on client and server (so it's one less than the max number of
> >> RPC slots).  Bandwidth delay product math says 380ish slots
> >> should be enough to keep a gigabit line full, so I suspect
> >> something else is preventing me from seeing the readahead I
> >> expect.
> >>
> >> FYI, client and server are connected via gigabit ethernet.
> >> There's a couple routers in the way, but they talk at 10gigE and
> >>  can route wire speed. Traffic is IPv4, path MTU size is 9000
> >> bytes.
>
> The following are not completely relevant here but just to get some
> more info:
> What is the raw TCP throughput that you get between the server and
> client machine on this network?
>
> You could run the tests with bare minimum number of network
> elements between the server and the client to see whats the best
> network performance for NFS you can extract from this server and
> client machine.
>
> >> Is there anything I'm missing?
> >>
> >> -- Mike Shuey Purdue University/ITaP


* Re: high latency NFS
  2008-07-31  2:35     ` Michael Shuey
@ 2008-07-31  3:15       ` J. Bruce Fields
  2008-07-31  7:03         ` Neil Brown
  0 siblings, 1 reply; 22+ messages in thread
From: J. Bruce Fields @ 2008-07-31  3:15 UTC (permalink / raw)
  To: Michael Shuey; +Cc: Shehjar Tikoo, linux-kernel, linux-nfs, rees, aglo

On Wed, Jul 30, 2008 at 10:35:49PM -0400, Michael Shuey wrote:
> Thanks for all the tips I've received this evening.  However, I figured out 
> the problem late last night. :-)
> 
> I was only using the default 8 nfsd threads on the server.  When I raised 
> this to 256, the read bandwidth went from about 6 MB/sec to about 95 
> MB/sec, at 100ms of netem-induced latency.

So this is yet another reminder that someone needs to implement some
kind of automatic tuning of the number of threads.

I guess the first question is what exactly the policy for that should
be?  How do we decide when to add another thread?  How do we decide when
there are too many?

--b.


* Re: high latency NFS
  2008-07-31  3:15       ` J. Bruce Fields
@ 2008-07-31  7:03         ` Neil Brown
  2008-08-01  7:23           ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Neil Brown @ 2008-07-31  7:03 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Michael Shuey, Shehjar Tikoo, linux-kernel, linux-nfs, rees, aglo

On Wednesday July 30, bfields@fieldses.org wrote:
> > 
> > I was only using the default 8 nfsd threads on the server.  When I raised 
> > this to 256, the read bandwidth went from about 6 MB/sec to about 95 
> > MB/sec, at 100ms of netem-induced latency.
> 
> So this is yet another reminder that someone needs to implement some
> kind of automatic tuning of the number of threads.
> 
> I guess the first question is what exactly the policy for that should
> be?  How do we decide when to add another thread?  How do we decide when
> there are too many?

Or should the first question be "what are we trying to achieve?"?

Do we want to:
    Automatically choose a number of threads that would match what a
      well-informed sysadmin might choose
or
    regularly adjust the number of threads to find an optimal balance
      between prompt request processing (minimal queue length),
      minimal resource usage (idle threads waste memory),
    and not overloading the filesystem (how much concurrency does the
        filesystem/storage subsystem realistically support?).

And then we need to think about how this relates to NUMA situations
where we have different numbers of threads on each node.


I think we really want to aim for the first of the above options, but
that the result will end up looking a bit like a very simplistic
attempt at the second.  "simplistic" is key - we don't want
"complex".

I think that in the NUMA case we probably want to balance each node
independently.

The difficulties - I think - are:
  - make sure we can handle a sudden surge of requests, certainly a
    surge up to levels that we have previously seen.
    I think that means we either don't kill excess threads, or
    only kill them up to a limit: e.g. never fewer than 50% of
    the maximum number of threads
  - make sure we don't create too many threads if something clags up
    and nothing is getting through.  This means we need to monitor the
    number of requests dequeued and not make new threads when that is
    zero.
 

So how about:
  For each node we watch the length of the queue of
  requests-awaiting-threads and the queue of threads
  awaiting requests and maintain these values:
    - max number of threads ever concurrently running
    - number of requests dequeued
    - min length request queue
    - min length of thread queue

  Then every few (5?) seconds we sample these numbers and reset them
     (except the first).
     If 
        the min request queue length is non-zero and 
        the number of requests dequeued is non-zero
     Then
        start a new thread
     If
        the number of threads exceeds half the maximum and
        the min length of the thread queue exceeds 0
     Then
        stop one (idle) thread

You might want to track the max length of the request queue too and
start more threads if the queue is long, to allow a quick ramp-up.

We could try this out by allowing you to write "auto" to the
'threads' file, so people can experiment.
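
To make that concrete, a minimal sketch of the sampling logic in plain C
(all names are invented and the per-node bookkeeping that feeds the
counters is assumed; this is an illustration, not a proposed
implementation):

#include <limits.h>

struct pool_stats {
	int max_threads_seen;	/* historical maximum, never reset */
	int nr_threads;		/* threads currently running */
	int requests_dequeued;	/* reset after each sample */
	int min_request_queue;	/* reset after each sample */
	int min_thread_queue;	/* reset after each sample */
};

static void start_new_thread(struct pool_stats *s)
{
	s->nr_threads++;
	if (s->nr_threads > s->max_threads_seen)
		s->max_threads_seen = s->nr_threads;
}

static void stop_idle_thread(struct pool_stats *s)
{
	s->nr_threads--;
}

/* called for each node's pool every ~5 seconds */
static void sample_pool(struct pool_stats *s)
{
	/* requests were always waiting, and work was still getting done */
	if (s->min_request_queue > 0 && s->requests_dequeued > 0)
		start_new_thread(s);

	/* more than half the maximum, and at least one thread always idle */
	if (s->nr_threads > s->max_threads_seen / 2 &&
	    s->min_thread_queue > 0)
		stop_idle_thread(s);

	/* reset everything except the historical maximum */
	s->requests_dequeued = 0;
	s->min_request_queue = INT_MAX;
	s->min_thread_queue  = INT_MAX;
}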


NeilBrown


* Re: high latency NFS
  2008-07-24 17:11 high latency NFS Michael Shuey
  2008-07-30 19:21 ` J. Bruce Fields
  2008-07-31  0:07 ` Lee Revell
@ 2008-07-31 18:06 ` Enrico Weigelt
  2 siblings, 0 replies; 22+ messages in thread
From: Enrico Weigelt @ 2008-07-31 18:06 UTC (permalink / raw)
  To: linux kernel list

* Michael Shuey <shuey@purdue.edu> wrote:
> I'm currently toying with Linux's NFS, to see just how fast it can go in a 
> high-latency environment.  Right now, I'm simulating a 100ms delay between 
> client and server with netem (just 100ms on the outbound packets from the 
> client, rather than 50ms each way).  Oddly enough, I'm running into 
> performance problems. :-)

NFS is *very bad* over high-latency links.
Try 9p with a little metadata-caching proxy in the middle ;-P

cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
 	http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for a lot dozens of packages in dozens of versions:
	http://patches.metux.de/
---------------------------------------------------------------------


* Re: high latency NFS
  2008-07-31  7:03         ` Neil Brown
@ 2008-08-01  7:23           ` Dave Chinner
  2008-08-01 19:15             ` J. Bruce Fields
  2008-08-01 19:23             ` J. Bruce Fields
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2008-08-01  7:23 UTC (permalink / raw)
  To: Neil Brown
  Cc: J. Bruce Fields, Michael Shuey, Shehjar Tikoo, linux-kernel,
	linux-nfs, rees, aglo

On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> On Wednesday July 30, bfields@fieldses.org wrote:
> > > 
> > > I was only using the default 8 nfsd threads on the server.  When I raised 
> > > this to 256, the read bandwidth went from about 6 MB/sec to about 95 
> > > MB/sec, at 100ms of netem-induced latency.
> > 
> > So this is yet another reminder that someone needs to implement some
> > kind of automatic tuning of the number of threads.
> > 
> > I guess the first question is what exactly the policy for that should
> > be?  How do we decide when to add another thread?  How do we decide when
> > there are too many?
> 
> Or should the first question be "what are we trying to achieve?"?
> 
> Do we want to:
>     Automatically choose a number of threads that would match what a
>       well informed sysadmin might choose
> or
>     regularly adjust the number of threads to find an optimal balance
>       between prompt request processing (minimal queue length),
>       minimal resource usage (idle threads waste memory)
>     and not overloading the filesystem (how much concurrency does the
>         filesystem/storage subsystem realistically support.
> 
> And then we need to think about how this relates to NUMA situations
> where we have different numbers of threads on each node.
> 
> 
> I think we really want to aim for the first of the above options, but
> that the result will end up looking a bit like a very simplistic
> attempt at the second.  "simplicitic" is key - we don't want
> "complex".

Having implemented the second option on a different NUMA aware
OS and NFS server, I can say that it isn't that complex, nor that
hard to screw up.

	1. spawn a new thread only if all NFSDs are busy and there
	   are still requests queued to be serviced.
	2. rate limit the speed at which you spawn new NFSD threads.
	   About 5/s per node was about right.
	3. define an idle time for each thread before it
	   terminates. That is, if a thread has not been asked to
	   do any work for 30s, it exits.
	4. use the NFSD thread pools to allow per-pool independence.
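
Roughly, per pool (names and types below are invented for illustration;
the 5/s and 30s figures are the ones above):

#define SPAWN_RATE_LIMIT	5	/* new threads per second, per pool (rule 2) */
#define IDLE_TIMEOUT_SECS	30	/* idle thread exits after this (rule 3) */

struct nfsd_pool {
	int threads_total;
	int threads_busy;
	int requests_queued;
	int spawned_this_second;	/* reset once a second */
};

/* rules 1 and 2: checked when a request is queued */
static int should_spawn(const struct nfsd_pool *p)
{
	return p->threads_busy == p->threads_total &&
	       p->requests_queued > 0 &&
	       p->spawned_this_second < SPAWN_RATE_LIMIT;
}

/* rule 3: checked by a thread that has been idle */
static int should_exit(int idle_seconds)
{
	return idle_seconds >= IDLE_TIMEOUT_SECS;
}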

> I think that in the NUMA case we probably want to balance each node
> independently.
> 
> The difficulties - I think - are:
>   - make sure we can handle a sudden surge of requests, certainly a
>     surge up to levels that we have previously seen.
>     I think the means we either don't kill excess threads, or
>     only kill them up to a limit: e.g. never fewer than 50% of
>     the maximum number of threads

You only want to increase the number of threads for sustained
loads or regular peaks of load.  You don't want simple transients
to cause massive numbers of threads to spawn, so rate-limiting the
spawning is needed.

>   - make sure we don't create too many threads if something clags up
>     and nothing is getting through.  This means we need to monitor the
>     number of requests dequeued and not make new threads when that is
>     zero.

That second case is easy - only allow a new thread to be spawned when a
request is dequeued. Hence if all the NFSDs are clagged, then we
won't waste resources clagging more of them.

> So how about:
>   For each node we watch the length of the queue of
>   requests-awaiting-threads and the queue of threads
>   awaiting requests and maintain these values:
>     - max number of threads ever concurrently running
>     - number of requests dequeued
>     - min length request queue
>     - min length of thread queue
> 
>   Then every few (5?) seconds we sample these numbers and reset them
>      (except the first).
>      If 
>         the min request queue length is non-zero and 
>         the number of requests dequeued is non-zero
>      Then
>         start a new thread
>      If
>         the number of threads exceeds half the maximum and
>         the min length of the thread queue exceeds 0
>      Then
>         stop one (idle) thread

That rate of adjustment is really too slow to be useful - a single
extra thread is meaningless if you go from 8 to 9 when you really need
30 or 40 nfsds.  Taking minutes to get to the required number is
really too slow.  You want to go from 8 to 40 within a few seconds of
that load starting....

> You might want to track the max length of the request queue too and
> start more threads if the queue is long, to allow a quick ramp-up.

Right, but even request queue depth is not a good indicator. You
need to keep track of how many NFSDs are actually doing useful
work. That is, if you've got an NFSD on the CPU that is hitting
the cache and not blocking, you don't need more NFSDs to handle
that load, because they can't do any more work than the NFSD
that is currently running.

i.e. take the solution that Greg Banks used for the CPU scheduler
overload issue (limiting the number of nfsds woken but not yet on
the CPU), and apply that criterion to spawning new threads.  i.e.
we've tried to wake an NFSD, but there are none available so that
means more NFSDs are needed for the given load. If we've already
tried to wake one and it hasn't run yet, then we've got enough
NFSDs....

Also, NFSD scheduling needs to be LIFO so that unused NFSDs
accumulate idle time and so can be culled easily. If you RR the
nfsds, they'll all appear to be doing useful work so it's hard to
tell if you've got any idle at all.

HTH.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: high latency NFS
  2008-08-01  7:23           ` Dave Chinner
@ 2008-08-01 19:15             ` J. Bruce Fields
  2008-08-04  0:32               ` Dave Chinner
  2008-08-01 19:23             ` J. Bruce Fields
  1 sibling, 1 reply; 22+ messages in thread
From: J. Bruce Fields @ 2008-08-01 19:15 UTC (permalink / raw)
  To: Neil Brown, Michael Shuey, Shehjar Tikoo, linux-kernel,
	linux-nfs, rees, aglo

On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> > You might want to track the max length of the request queue too and
> > start more threads if the queue is long, to allow a quick ramp-up.
> 
> Right, but even request queue depth is not a good indicator. You
> need to keep track of how many NFSDs are actually doing useful
> work. That is, if you've got an NFSD on the CPU that is hitting
> the cache and not blocking, you don't need more NFSDs to handle
> that load because they can't do any more work than the NFSD
> that is currently running is. 
> 
> i.e. take the solution that Greg banks used for the CPU scheduler
> overload issue (limiting the number of nfsds woken but not yet on
> the CPU),

I don't remember that, or wasn't watching when it happened.... Do you
have a pointer?

> and apply that criteria to spawning new threads.  i.e.
> we've tried to wake an NFSD, but there are none available so that
> means more NFSDs are needed for the given load. If we've already
> tried to wake one and it hasn't run yet, then we've got enough
> NFSDs....

OK, so you do that instead of trying to directly measure 

> Also, NFSD scheduling needs to be LIFO so that unused NFSDs
> accumulate idle time and so can be culled easily. If you RR the
> nfsds, they'll all appear to be doing useful work so it's hard to
> tell if you've got any idle at all.

Those all sound like good ideas, thanks.

(Still waiting for a volunteer for now, alas.)

--b.


* Re: high latency NFS
  2008-08-01  7:23           ` Dave Chinner
  2008-08-01 19:15             ` J. Bruce Fields
@ 2008-08-01 19:23             ` J. Bruce Fields
  2008-08-04  0:38               ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: J. Bruce Fields @ 2008-08-01 19:23 UTC (permalink / raw)
  To: Neil Brown, Michael Shuey, Shehjar Tikoo, linux-kernel,
	linux-nfs, rees, aglo

On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> Having implemented the second option on a different NUMA aware
> OS and NFS server, I can say that it isn't that complex, nor that
> hard to screw up.
> 
> 	1. spawn a new thread only if all NFSDs are busy and there
> 	   are still requests queued to be serviced.
> 	2. rate limit the speed at which you spawn new NFSD threads.
> 	   About 5/s per node was about right.
> 	3. define an idle time for each thread before they
> 	   terminate. That is, is a thread has not been asked to
> 	   do any work for 30s, exit.
> 	4. use the NFSD thread pools to allow per-pool independence.

Actually, I lost you on #4.  You mean that you apply 1-3 independently
on each thread pool?  Or something else?

--b.


* Re: high latency NFS
  2008-08-01 19:15             ` J. Bruce Fields
@ 2008-08-04  0:32               ` Dave Chinner
  2008-08-04  1:11                 ` J. Bruce Fields
  2008-08-04  1:29                 ` NeilBrown
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2008-08-04  0:32 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Neil Brown, Michael Shuey, Shehjar Tikoo, linux-kernel,
	linux-nfs, rees, aglo

On Fri, Aug 01, 2008 at 03:15:59PM -0400, J. Bruce Fields wrote:
> On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> > On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> > > You might want to track the max length of the request queue too and
> > > start more threads if the queue is long, to allow a quick ramp-up.
> > 
> > Right, but even request queue depth is not a good indicator. You
> > need to keep track of how many NFSDs are actually doing useful
> > work. That is, if you've got an NFSD on the CPU that is hitting
> > the cache and not blocking, you don't need more NFSDs to handle
> > that load because they can't do any more work than the NFSD
> > that is currently running is. 
> > 
> > i.e. take the solution that Greg banks used for the CPU scheduler
> > overload issue (limiting the number of nfsds woken but not yet on
> > the CPU),
> 
> I don't remember that, or wasn't watching when it happened.... Do you
> have a pointer?

Ah, I thought that had been sent to mainline because it was
mentioned in his LCA talk at the start of the year. Slides
65-67 here:

http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/41.pdf

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: high latency NFS
  2008-08-01 19:23             ` J. Bruce Fields
@ 2008-08-04  0:38               ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2008-08-04  0:38 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Neil Brown, Michael Shuey, Shehjar Tikoo, linux-kernel,
	linux-nfs, rees, aglo

On Fri, Aug 01, 2008 at 03:23:43PM -0400, J. Bruce Fields wrote:
> On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> > Having implemented the second option on a different NUMA aware
> > OS and NFS server, I can say that it isn't that complex, nor that
> > hard to screw up.
> > 
> > 	1. spawn a new thread only if all NFSDs are busy and there
> > 	   are still requests queued to be serviced.
> > 	2. rate limit the speed at which you spawn new NFSD threads.
> > 	   About 5/s per node was about right.
> > 	3. define an idle time for each thread before they
> > 	   terminate. That is, is a thread has not been asked to
> > 	   do any work for 30s, exit.
> > 	4. use the NFSD thread pools to allow per-pool independence.
> 
> Actually, I lost you on #4.  You mean that you apply 1-3 independently
> on each thread pool?  Or something else?

The former, i.e. when you have a NUMA machine with a pool-per-node or
an SMP machine with a pool-per-cpu configuration, you can configure
the pools differently according to the hardware config and
interrupt vectoring.  This is especially useful if you want to prevent
NFSDs from dominating the CPUs taking disk interrupts or running user
code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: high latency NFS
  2008-08-04  0:32               ` Dave Chinner
@ 2008-08-04  1:11                 ` J. Bruce Fields
  2008-08-04  2:14                   ` Dave Chinner
  2008-08-04  9:18                   ` Bernd Schubert
  2008-08-04  1:29                 ` NeilBrown
  1 sibling, 2 replies; 22+ messages in thread
From: J. Bruce Fields @ 2008-08-04  1:11 UTC (permalink / raw)
  To: Neil Brown, Michael Shuey, Shehjar Tikoo, linux-kernel,
	linux-nfs, rees, aglo

On Mon, Aug 04, 2008 at 10:32:06AM +1000, Dave Chinner wrote:
> On Fri, Aug 01, 2008 at 03:15:59PM -0400, J. Bruce Fields wrote:
> > On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> > > On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> > > > You might want to track the max length of the request queue too and
> > > > start more threads if the queue is long, to allow a quick ramp-up.
> > > 
> > > Right, but even request queue depth is not a good indicator. You
> > > need to keep track of how many NFSDs are actually doing useful
> > > work. That is, if you've got an NFSD on the CPU that is hitting
> > > the cache and not blocking, you don't need more NFSDs to handle
> > > that load because they can't do any more work than the NFSD
> > > that is currently running is. 
> > > 
> > > i.e. take the solution that Greg banks used for the CPU scheduler
> > > overload issue (limiting the number of nfsds woken but not yet on
> > > the CPU),
> > 
> > I don't remember that, or wasn't watching when it happened.... Do you
> > have a pointer?
> 
> Ah, I thought that had been sent to mainline because it was
> mentioned in his LCA talk at the start of the year. Slides
> 65-67 here:
> 
> http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/41.pdf

OK, so to summarize: when the rate of incoming rpc's is very high (and,
I guess, when we're serving everything out of cache and don't have IO
wait), all the nfsd threads will stay runnable all the time.  That keeps
userspace processes from running (possibly for "minutes").  And that's a
problem even on a server dedicated only to nfs, since it affects portmap
and rpc.mountd.

The solution is given just as "limit the # of nfsd's woken but not yet
on CPU."  It'd be interesting to see more details.

Off hand, this seems like it should be at least partly the scheduler's
job.  E.g. could we tell it to schedule all the nfsd threads as a group?
I suppose the disadvantage to that is that we'd lose information about
how many threads are actually needed, hence lose the chance to reap
unneeded threads?

--b.


* Re: high latency NFS
  2008-08-04  0:32               ` Dave Chinner
  2008-08-04  1:11                 ` J. Bruce Fields
@ 2008-08-04  1:29                 ` NeilBrown
  2008-08-04  6:42                   ` Greg Banks
  1 sibling, 1 reply; 22+ messages in thread
From: NeilBrown @ 2008-08-04  1:29 UTC (permalink / raw)
  To: J. Bruce Fields, Neil Brown, Michael Shuey, Shehjar Tikoo,
	linux-kernel, linux-nfs, rees, aglo
  Cc: Greg Banks

On Mon, August 4, 2008 10:32 am, Dave Chinner wrote:
> On Fri, Aug 01, 2008 at 03:15:59PM -0400, J. Bruce Fields wrote:
>> On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:

>> > i.e. take the solution that Greg banks used for the CPU scheduler
>> > overload issue (limiting the number of nfsds woken but not yet on
>> > the CPU),
>>
>> I don't remember that, or wasn't watching when it happened.... Do you
>> have a pointer?
>
> Ah, I thought that had been sent to mainline because it was
> mentioned in his LCA talk at the start of the year. Slides
> 65-67 here:
>
> http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/41.pdf

Ahh... I remembered Greg talking about that, went looking, and
couldn't find it.  I couldn't even find any mail about it, yet I'm
sure I saw a patch..

Greg: Do you remember what happened to this?  Did I reject it for some
reason, or did it never get sent?  or ...

NeilBrown



* Re: high latency NFS
  2008-08-04  1:11                 ` J. Bruce Fields
@ 2008-08-04  2:14                   ` Dave Chinner
  2008-08-04  9:18                   ` Bernd Schubert
  1 sibling, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2008-08-04  2:14 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Neil Brown, Michael Shuey, Shehjar Tikoo, linux-kernel,
	linux-nfs, rees, aglo

On Sun, Aug 03, 2008 at 09:11:58PM -0400, J. Bruce Fields wrote:
> On Mon, Aug 04, 2008 at 10:32:06AM +1000, Dave Chinner wrote:
> > On Fri, Aug 01, 2008 at 03:15:59PM -0400, J. Bruce Fields wrote:
> > > On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> > > > On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> > > > > You might want to track the max length of the request queue too and
> > > > > start more threads if the queue is long, to allow a quick ramp-up.
> > > > 
> > > > Right, but even request queue depth is not a good indicator. You
> > > > need to keep track of how many NFSDs are actually doing useful
> > > > work. That is, if you've got an NFSD on the CPU that is hitting
> > > > the cache and not blocking, you don't need more NFSDs to handle
> > > > that load because they can't do any more work than the NFSD
> > > > that is currently running is. 
> > > > 
> > > > i.e. take the solution that Greg banks used for the CPU scheduler
> > > > overload issue (limiting the number of nfsds woken but not yet on
> > > > the CPU),
> > > 
> > > I don't remember that, or wasn't watching when it happened.... Do you
> > > have a pointer?
> > 
> > Ah, I thought that had been sent to mainline because it was
> > mentioned in his LCA talk at the start of the year. Slides
> > 65-67 here:
> > 
> > http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/41.pdf
> 
> OK, so to summarize: when the rate of incoming rpc's is very high (and,
> I guess, when we're serving everything out of cache and don't have IO
> wait), all the nfsd threads will stay runable all the time.  That keeps
> userspace processes from running (possibly for "minutes").  And that's a
> problem even on a server dedicated only to nfs, since it affects portmap
> and rpc.mountd.

In a nutshell.

> The solution is given just as "limit the # of nfsd's woken but not yet
> on CPU."  It'd be interesting to see more details.

Simple counters, IIRC (memory hazy so it might be a bit different).
Basically, when we queue a request we check a wakeup counter. If
the wakeup counter is less than a certain threshold (e.g. 5) we
issue a wakeup to get another NFSD running. When the NFSD first
runs and dequeues a request, it then decrements the wakeup counter,
effectively marking that NFSD as busy doing work. IIRC a small
threshold was necessary to ensure we always had enough NFSDs ready
to run if there was some I/O going on (i.e. a mixture of blocking
and non-blocking RPCs).

i.e. we need to track the wakeup-to-run latency to prevent waking too
many NFSDs and loading the run queue unnecessarily.
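
In rough pseudo-C (names invented, locking omitted, and the real thing
may well have differed in detail):

#define WAKEUP_THRESHOLD 5	/* "a certain threshold (e.g. 5)" */

static int nfsd_wakeups_pending;	/* woken but not yet dequeuing */

/* on enqueueing a request */
static void maybe_wake_nfsd(void)
{
	if (nfsd_wakeups_pending < WAKEUP_THRESHOLD) {
		nfsd_wakeups_pending++;
		/* wake_up_one_nfsd(); */
	}
	/* else: enough NFSDs are already on their way to the CPU */
}

/* in the NFSD, on dequeueing its first request after waking */
static void nfsd_dequeued_request(void)
{
	if (nfsd_wakeups_pending > 0)
		nfsd_wakeups_pending--;
}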

> Off hand, this seems like it should be at least partly the scheduler's
> job.

Partly, yes, in that the scheduler overhead shouldn't increase when we
do this. However, from an efficiency point of view, if we are blindly
waking NFSDs when it is not necessary then (IMO) we've got an NFSD
problem....

> E.g. could we tell it to schedule all the nfsd threads as a group?
> I suppose the disadvantage to that is that we'd lose information about
> how many threads are actually needed, hence lose the chance to reap
> unneeded threads?

I don't know enough about how the group scheduling works to be able
to comment in detail. In theory it sounds like it would prevent
the starvation problems, but if it prevents implementation of
dynamic NFSD pools then I don't think it's a good idea....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: high latency NFS
  2008-08-04  1:29                 ` NeilBrown
@ 2008-08-04  6:42                   ` Greg Banks
  2008-08-04 19:07                     ` J. Bruce Fields
  0 siblings, 1 reply; 22+ messages in thread
From: Greg Banks @ 2008-08-04  6:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: J. Bruce Fields, Michael Shuey, Shehjar Tikoo, linux-kernel,
	linux-nfs, rees, aglo

NeilBrown wrote:
> On Mon, August 4, 2008 10:32 am, Dave Chinner wrote:
>   
>>>> i.e. take the solution that Greg banks used for the CPU scheduler
>>>> overload issue (limiting the number of nfsds woken but not yet on
>>>> the CPU),
>>>>         
>
> Ahh... I remembered Greg talking about that, went looking, and
> couldn't find it.  I couldn't even find any mail about it, yet I'm
> sure I saw a patch..
>   
http://marc.info/?l=linux-nfs&m=115501004819230&w=2
> Greg: Do you remember what happened to this?  Did I reject it for some
> reason, or did it never get sent?  or ...
>   
I think we got all caught up arguing about the other patches in the
batch (the last round of the everlasting  "dynamic nfsd management for
Linux" argument) and between us we managed to drop the patch on the ground.

http://thread.gmane.org/gmane.linux.nfs/10372

I think the only part of that patchset that you explicitly rejected was
the one where I tried to kill off the useless "th" line in
/proc/net/rpc/nfsd.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
The cake is *not* a lie.
I don't speak for SGI.



* Re: high latency NFS
  2008-07-30 19:21 ` J. Bruce Fields
  2008-07-30 21:40   ` Shehjar Tikoo
@ 2008-08-04  8:04   ` Greg Banks
  1 sibling, 0 replies; 22+ messages in thread
From: Greg Banks @ 2008-08-04  8:04 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Michael Shuey, linux-kernel, linux-nfs, rees, aglo

[-- Attachment #1: Type: text/plain, Size: 2722 bytes --]

J. Bruce Fields wrote:
> You might get more responses from the linux-nfs list (cc'd).
>
> --b.
>
> On Thu, Jul 24, 2008 at 01:11:31PM -0400, Michael Shuey wrote:
>   
>>
>> iozone is reading/writing a file twice the size of memory on the client with 
>> a 32k block size.  I've tried raising this as high as 16 MB, but I still 
>> see around 6 MB/sec reads.
>>     
That won't make a skerrick of difference with wsize=32K.
>> I'm using a 2.6.9 derivative (yes, I'm a RHEL4 fan).  Testing with a stock 
>> 2.6, client and server, is the next order of business.
>>
>> NFS mount is tcp, version 3.  rsize/wsize are 32k.
Try wsize=rsize=1M.
>>   Both client and server 
>> have had tcp_rmem, tcp_wmem, wmem_max, rmem_max, wmem_default, and 
>> rmem_default tuned - tuning values are 12500000 for defaults (and minimum 
>> window sizes), 25000000 for the maximums.  Inefficient, yes, but I'm not 
>> concerned with memory efficiency at the moment.
>>     
You're aware that the server screws these up again, at least for
writes?  There was a long sequence of threads on linux-nfs about this
recently, starting with

http://marc.info/?l=linux-nfs&m=121312415114958&w=2

which is Dean Hildebrand posting a patch to make the knfsd behaviour
tunable.  ToT still looks broken.  I've been using the attached patch (I
believe a similar one was posted later in the thread by Olga
Kornievskaia)  for low-latency high-bandwidth 10ge performance work,
where it doesn't help but doesn't hurt either.  It should help for your
high-latency high-bandwidth case.  Keep your tunings though, one of 
them will be affecting the TCP window scale negotiated at connect time.
>> Both client and server kernels have been modified to provide 
>> larger-than-normal RPC slot tables.  I allow a max of 1024, but I've found 
>> that actually enabling more than 490 entries in /proc causes mount to 
>> complain it can't allocate memory and die.  That was somewhat surprising, 
>> given I had 122 GB of free memory at the time...
>>     
That number is used to size a physically contiguous kmalloc()ed array of
slots.  With a large wsize you don't need such large slot table sizes or
large numbers of nfsds to fill the pipe.

And yes, the default number of nfsds is utterly inadequate.
>> I've also applied a couple patches to allow the NFS readahead to be a 
>> tunable number of RPC slots. 
There's a patch in SLES to do that, which I'd very much like to see
in kernel.org (Neil?).  The default NFS readahead multiplier value is
pessimal and guarantees worst-case alignment of READ rpcs during
streaming reads, so we tune it from 15 to 16.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
The cake is *not* a lie.
I don't speak for SGI.


[-- Attachment #2: knfsd-tcp-receive-buffer-scaling --]
[-- Type: text/plain, Size: 970 bytes --]

Index: linux-2.6.16/net/sunrpc/svcsock.c
===================================================================
--- linux-2.6.16.orig/net/sunrpc/svcsock.c	2008-06-16 15:39:01.774672997 +1000
+++ linux-2.6.16/net/sunrpc/svcsock.c	2008-06-16 15:45:06.203421620 +1000
@@ -1157,13 +1159,13 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		 * particular pool, which provides an upper bound
 		 * on the number of threads which will access the socket.
 		 *
-		 * rcvbuf just needs to be able to hold a few requests.
-		 * Normally they will be removed from the queue 
-		 * as soon a a complete request arrives.
+		 * rcvbuf needs the same room as sndbuf, to allow
+		 * workloads comprising mostly WRITE calls to flow
+		 * at a reasonable fraction of line speed.
 		 */
 		svc_sock_setbufsize(svsk->sk_sock,
 				    (serv->sv_nrthreads+3) * serv->sv_bufsz,
-				    3 * serv->sv_bufsz);
+				    (serv->sv_nrthreads+3) * serv->sv_bufsz);
 
 	svc_sock_clear_data_ready(svsk);
 


* Re: high latency NFS
  2008-08-04  1:11                 ` J. Bruce Fields
  2008-08-04  2:14                   ` Dave Chinner
@ 2008-08-04  9:18                   ` Bernd Schubert
  2008-08-04  9:25                     ` Greg Banks
  1 sibling, 1 reply; 22+ messages in thread
From: Bernd Schubert @ 2008-08-04  9:18 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Neil Brown, Michael Shuey, Shehjar Tikoo, linux-kernel,
	linux-nfs, rees, aglo

On Monday 04 August 2008 03:11:58 J. Bruce Fields wrote:
> On Mon, Aug 04, 2008 at 10:32:06AM +1000, Dave Chinner wrote:
> > On Fri, Aug 01, 2008 at 03:15:59PM -0400, J. Bruce Fields wrote:
> > > On Fri, Aug 01, 2008 at 05:23:20PM +1000, Dave Chinner wrote:
> > > > On Thu, Jul 31, 2008 at 05:03:05PM +1000, Neil Brown wrote:
> > > > > You might want to track the max length of the request queue too and
> > > > > start more threads if the queue is long, to allow a quick ramp-up.
> > > >
> > > > Right, but even request queue depth is not a good indicator. You
> > > > need to keep track of how many NFSDs are actually doing useful
> > > > work. That is, if you've got an NFSD on the CPU that is hitting
> > > > the cache and not blocking, you don't need more NFSDs to handle
> > > > that load because they can't do any more work than the NFSD
> > > > that is currently running is.
> > > >
> > > > i.e. take the solution that Greg banks used for the CPU scheduler
> > > > overload issue (limiting the number of nfsds woken but not yet on
> > > > the CPU),
> > >
> > > I don't remember that, or wasn't watching when it happened.... Do you
> > > have a pointer?
> >
> > Ah, I thought that had been sent to mainline because it was
> > mentioned in his LCA talk at the start of the year. Slides
> > 65-67 here:
> >
> > http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/41.pdf
>
> OK, so to summarize: when the rate of incoming rpc's is very high (and,
> I guess, when we're serving everything out of cache and don't have IO
> wait), all the nfsd threads will stay runable all the time.  That keeps
> userspace processes from running (possibly for "minutes").  And that's a
> problem even on a server dedicated only to nfs, since it affects portmap
> and rpc.mountd.

Even worse, it affects user space HA software such as heartbeat and everyone 
with reasonable timeouts will see spurious 'failures'. 


-- 
Bernd Schubert
Q-Leap Networks GmbH


* Re: high latency NFS
  2008-08-04  9:18                   ` Bernd Schubert
@ 2008-08-04  9:25                     ` Greg Banks
  0 siblings, 0 replies; 22+ messages in thread
From: Greg Banks @ 2008-08-04  9:25 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: J. Bruce Fields, Neil Brown, Michael Shuey, Shehjar Tikoo,
	linux-kernel, linux-nfs, rees, aglo

Bernd Schubert wrote:
> On Monday 04 August 2008 03:11:58 J. Bruce Fields wrote:
>   
>> OK, so to summarize: when the rate of incoming rpc's is very high (and,
>> I guess, when we're serving everything out of cache and don't have IO
>> wait), all the nfsd threads will stay runable all the time.  That keeps
>> userspace processes from running (possibly for "minutes").  And that's a
>> problem even on a server dedicated only to nfs, since it affects portmap
>> and rpc.mountd.
>>     
>
> Even worse, it affects user space HA software such as heartbeat and everyone 
> with reasonable timeouts will see spurious 'failures'. 
>   
We're seeing that problem right now, even with the patch.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
The cake is *not* a lie.
I don't speak for SGI.



* Re: high latency NFS
  2008-08-04  6:42                   ` Greg Banks
@ 2008-08-04 19:07                     ` J. Bruce Fields
  2008-08-05 10:51                       ` Greg Banks
  0 siblings, 1 reply; 22+ messages in thread
From: J. Bruce Fields @ 2008-08-04 19:07 UTC (permalink / raw)
  To: Greg Banks
  Cc: NeilBrown, Michael Shuey, Shehjar Tikoo, linux-kernel, linux-nfs,
	rees, aglo

On Mon, Aug 04, 2008 at 04:42:54PM +1000, Greg Banks wrote:
> NeilBrown wrote:
> > On Mon, August 4, 2008 10:32 am, Dave Chinner wrote:
> >   
> >>>> i.e. take the solution that Greg banks used for the CPU scheduler
> >>>> overload issue (limiting the number of nfsds woken but not yet on
> >>>> the CPU),
> >>>>         
> >
> > Ahh... I remembered Greg talking about that, went looking, and
> > couldn't find it.  I couldn't even find any mail about it, yet I'm
> > sure I saw a patch..
> >   
> http://marc.info/?l=linux-nfs&m=115501004819230&w=2
> > Greg: Do you remember what happened to this?  Did I reject it for some
> > reason, or did it never get sent?  or ...
> >   
> I think we got all caught up arguing about the other patches in the
> batch (the last round of the everlasting  "dynamic nfsd management for
> Linux" argument) and between us we managed to drop the patch on the ground.
> 
> http://thread.gmane.org/gmane.linux.nfs/10372
> 
> I think the only part of that patchset that you explicitly rejected was
> the one where I tried to kill off the useless "th" line in
> /proc/net/rpc/nfsd.

Looks like that was me, apologies.  Breaking a documented interface to
userspace just set off an alarm.  But if we really convince ourselves
that it's useless, then OK.

(Though maybe your idea of leaving the line in place with just constant
zeros is good.  Just because the data's useless doesn't mean there
isn't someone out there with a script that does otherwise useful things
but that happens to fail if it can't parse /proc/net/rpc/nfsd.)

Looks like it's been two years now--any chance of rebasing those patches
and resending?

--b.


* Re: high latency NFS
  2008-08-04 19:07                     ` J. Bruce Fields
@ 2008-08-05 10:51                       ` Greg Banks
  0 siblings, 0 replies; 22+ messages in thread
From: Greg Banks @ 2008-08-05 10:51 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: NeilBrown, Michael Shuey, Shehjar Tikoo, linux-kernel, linux-nfs,
	rees, aglo

J. Bruce Fields wrote:
> On Mon, Aug 04, 2008 at 04:42:54PM +1000, Greg Banks wrote:
>   
>> http://thread.gmane.org/gmane.linux.nfs/10372
>>
>>     
>
> Looks like that was me, apologies.  Breaking a documented interface to
> userspace just set off an alarm.  But if we really convince ourselves
> that it's useless, then OK.
>   
I think I explained last time how useless it is.
> (Though maybe your idea of leaving the line in place with just constant
> zeros is good.  Just because the data's useless doesn't mean someone out
> there may have a script that does otherwise useful things but that
> happens to fail if it can't parse /proc/net/rpc/nfsd.)
>   
Ok, I'm happy to do it that way.
> Looks like it's been two years now--any chance of rebasing those patches
> and resending?
>   
Yep.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
The cake is *not* a lie.
I don't speak for SGI.


