* 2.6.30-rc deadline scheduler performance regression for iozone over NFS
@ 2009-04-23 14:01 Jeff Moyer
  2009-05-08 19:01 ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2009-04-23 14:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Jens Axboe

Hi,

I've been working on CFQ improvements for interleaved I/Os between
processes, and noticed a regression in performance when using the
deadline I/O scheduler.  The test uses a server configured with a cciss
array and 1Gb/s ethernet.

The iozone command line was:
  iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w

The numbers in the nfsd's row represent the number of nfsd "threads".
The throughput numbers (in KB/s) are the average of 5 runs.

               v2.6.29

nfsd's  |   1   |   2   |   4   |   8
--------+-------+-------+-------+--------
deadline| 43207 | 67436 | 96289 | 107590

              2.6.30-rc1

nfsd's  |   1   |   2   |   4   |   8
--------+-------+-------+-------+--------
deadline| 43732 | 68059 | 76659 | 83231

    2.6.30-rc3.block-for-linus

nfsd's  |   1   |   2   |   4   |   8
--------+-------+-------+-------+--------
deadline| 46102 | 71151 | 83120 | 82330


Notice the drop for 4 and 8 threads.  It may be worth noting that the
default number of NFSD threads is 8.

Cheers,
Jeff


* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-04-23 14:01 2.6.30-rc deadline scheduler performance regression for iozone over NFS Jeff Moyer
@ 2009-05-08 19:01 ` Andrew Morton
  2009-05-11  8:14   ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2009-05-08 19:01 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, jens.axboe, Rafael J. Wysocki

On Thu, 23 Apr 2009 10:01:58 -0400
Jeff Moyer <jmoyer@redhat.com> wrote:

> Hi,
> 
> I've been working on CFQ improvements for interleaved I/Os between
> processes, and noticed a regression in performance when using the
> deadline I/O scheduler.  The test uses a server configured with a cciss
> array and 1Gb/s ethernet.
> 
> The iozone command line was:
>   iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
> 
> The numbers in the nfsd's row represent the number of nfsd "threads".
> These numbers (in MB/s) represent the average of 5 runs.
> 
>                v2.6.29
> 
> nfsd's  |   1    |  2   |   4   |   8
> --------+---------------+-------+------
> deadline| 43207 | 67436 | 96289 | 107590
> 
>               2.6.30-rc1
> 
> nfsd's  |   1   |   2   |   4   |   8
> --------+---------------+-------+------
> deadline| 43732 | 68059 | 76659 | 83231
> 
>     2.6.30-rc3.block-for-linus
> 
> nfsd's  |   1   |   2   |   4   |   8
> --------+---------------+-------+------
> deadline| 46102 | 71151 | 83120 | 82330
> 
> 
> Notice the drop for 4 and 8 threads.  It may be worth noting that the
> default number of NFSD threads is 8.
> 

I guess we should ask Rafael to add this to the post-2.6.29 regression
list.



* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-08 19:01 ` Andrew Morton
@ 2009-05-11  8:14   ` Jens Axboe
  2009-05-11 13:53     ` Jeff Moyer
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2009-05-11  8:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jeff Moyer, linux-kernel, Rafael J. Wysocki

On Fri, May 08 2009, Andrew Morton wrote:
> On Thu, 23 Apr 2009 10:01:58 -0400
> Jeff Moyer <jmoyer@redhat.com> wrote:
> 
> > Hi,
> > 
> > I've been working on CFQ improvements for interleaved I/Os between
> > processes, and noticed a regression in performance when using the
> > deadline I/O scheduler.  The test uses a server configured with a cciss
> > array and 1Gb/s ethernet.
> > 
> > The iozone command line was:
> >   iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
> > 
> > The numbers in the nfsd's row represent the number of nfsd "threads".
> > These numbers (in MB/s) represent the average of 5 runs.
> > 
> >                v2.6.29
> > 
> > nfsd's  |   1    |  2   |   4   |   8
> > --------+---------------+-------+------
> > deadline| 43207 | 67436 | 96289 | 107590
> > 
> >               2.6.30-rc1
> > 
> > nfsd's  |   1   |   2   |   4   |   8
> > --------+---------------+-------+------
> > deadline| 43732 | 68059 | 76659 | 83231
> > 
> >     2.6.30-rc3.block-for-linus
> > 
> > nfsd's  |   1   |   2   |   4   |   8
> > --------+---------------+-------+------
> > deadline| 46102 | 71151 | 83120 | 82330
> > 
> > 
> > Notice the drop for 4 and 8 threads.  It may be worth noting that the
> > default number of NFSD threads is 8.
> > 
> 
> I guess we should ask Rafael to add this to the post-2.6.29 regression
> list.

I agree. It'd be nice to bisect this one down, I'm guessing some mm
change has caused this writeout regression.

-- 
Jens Axboe



* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-11  8:14   ` Jens Axboe
@ 2009-05-11 13:53     ` Jeff Moyer
  2009-05-11 16:58       ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2009-05-11 13:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-kernel, Rafael J. Wysocki

Jens Axboe <jens.axboe@oracle.com> writes:

> On Fri, May 08 2009, Andrew Morton wrote:
>> On Thu, 23 Apr 2009 10:01:58 -0400
>> Jeff Moyer <jmoyer@redhat.com> wrote:
>> 
>> > Hi,
>> > 
>> > I've been working on CFQ improvements for interleaved I/Os between
>> > processes, and noticed a regression in performance when using the
>> > deadline I/O scheduler.  The test uses a server configured with a cciss
>> > array and 1Gb/s ethernet.
>> > 
>> > The iozone command line was:
>> >   iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
>> > 
>> > The numbers in the nfsd's row represent the number of nfsd "threads".
>> > These numbers (in MB/s) represent the average of 5 runs.
>> > 
>> >                v2.6.29
>> > 
>> > nfsd's  |   1    |  2   |   4   |   8
>> > --------+---------------+-------+------
>> > deadline| 43207 | 67436 | 96289 | 107590
>> > 
>> >               2.6.30-rc1
>> > 
>> > nfsd's  |   1   |   2   |   4   |   8
>> > --------+---------------+-------+------
>> > deadline| 43732 | 68059 | 76659 | 83231
>> > 
>> >     2.6.30-rc3.block-for-linus
>> > 
>> > nfsd's  |   1   |   2   |   4   |   8
>> > --------+---------------+-------+------
>> > deadline| 46102 | 71151 | 83120 | 82330
>> > 
>> > 
>> > Notice the drop for 4 and 8 threads.  It may be worth noting that the
>> > default number of NFSD threads is 8.
>> > 
>> 
>> I guess we should ask Rafael to add this to the post-2.6.29 regression
>> list.
>
> I agree. It'd be nice to bisect this one down, I'm guessing some mm
> change has caused this writeout regression.

It's not writeout, it's a read test.

Cheers,
Jeff


* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-11 13:53     ` Jeff Moyer
@ 2009-05-11 16:58       ` Jens Axboe
  2009-05-13  3:29         ` Jeff Moyer
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2009-05-11 16:58 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Andrew Morton, linux-kernel, Rafael J. Wysocki

On Mon, May 11 2009, Jeff Moyer wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > On Fri, May 08 2009, Andrew Morton wrote:
> >> On Thu, 23 Apr 2009 10:01:58 -0400
> >> Jeff Moyer <jmoyer@redhat.com> wrote:
> >> 
> >> > Hi,
> >> > 
> >> > I've been working on CFQ improvements for interleaved I/Os between
> >> > processes, and noticed a regression in performance when using the
> >> > deadline I/O scheduler.  The test uses a server configured with a cciss
> >> > array and 1Gb/s ethernet.
> >> > 
> >> > The iozone command line was:
> >> >   iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
> >> > 
> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
> >> > These numbers (in MB/s) represent the average of 5 runs.
> >> > 
> >> >                v2.6.29
> >> > 
> >> > nfsd's  |   1    |  2   |   4   |   8
> >> > --------+---------------+-------+------
> >> > deadline| 43207 | 67436 | 96289 | 107590
> >> > 
> >> >               2.6.30-rc1
> >> > 
> >> > nfsd's  |   1   |   2   |   4   |   8
> >> > --------+---------------+-------+------
> >> > deadline| 43732 | 68059 | 76659 | 83231
> >> > 
> >> >     2.6.30-rc3.block-for-linus
> >> > 
> >> > nfsd's  |   1   |   2   |   4   |   8
> >> > --------+---------------+-------+------
> >> > deadline| 46102 | 71151 | 83120 | 82330
> >> > 
> >> > 
> >> > Notice the drop for 4 and 8 threads.  It may be worth noting that the
> >> > default number of NFSD threads is 8.
> >> > 
> >> 
> >> I guess we should ask Rafael to add this to the post-2.6.29 regression
> >> list.
> >
> > I agree. It'd be nice to bisect this one down, I'm guessing some mm
> > change has caused this writeout regression.
> 
> It's not writeout, it's a read test.

Doh sorry, I even ran these tests as well a few weeks back. So perhaps
some read-ahead change, I didn't look into it. FWIW, on a single SATA
drive here, it didn't show any difference.

-- 
Jens Axboe



* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-11 16:58       ` Jens Axboe
@ 2009-05-13  3:29         ` Jeff Moyer
  2009-05-13  3:44           ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2009-05-13  3:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrew Morton, linux-kernel, Rafael J. Wysocki, Olga Kornievskaia

Jens Axboe <jens.axboe@oracle.com> writes:

> On Mon, May 11 2009, Jeff Moyer wrote:
>> Jens Axboe <jens.axboe@oracle.com> writes:
>> 
>> > On Fri, May 08 2009, Andrew Morton wrote:
>> >> On Thu, 23 Apr 2009 10:01:58 -0400
>> >> Jeff Moyer <jmoyer@redhat.com> wrote:
>> >> 
>> >> > Hi,
>> >> > 
>> >> > I've been working on CFQ improvements for interleaved I/Os between
>> >> > processes, and noticed a regression in performance when using the
>> >> > deadline I/O scheduler.  The test uses a server configured with a cciss
>> >> > array and 1Gb/s ethernet.
>> >> > 
>> >> > The iozone command line was:
>> >> >   iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
>> >> > 
>> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
>> >> > These numbers (in MB/s) represent the average of 5 runs.
>> >> > 
>> >> >                v2.6.29
>> >> > 
>> >> > nfsd's  |   1    |  2   |   4   |   8
>> >> > --------+---------------+-------+------
>> >> > deadline| 43207 | 67436 | 96289 | 107590
>> >> > 
>> >> >               2.6.30-rc1
>> >> > 
>> >> > nfsd's  |   1   |   2   |   4   |   8
>> >> > --------+---------------+-------+------
>> >> > deadline| 43732 | 68059 | 76659 | 83231
>> >> > 
>> >> >     2.6.30-rc3.block-for-linus
>> >> > 
>> >> > nfsd's  |   1   |   2   |   4   |   8
>> >> > --------+---------------+-------+------
>> >> > deadline| 46102 | 71151 | 83120 | 82330
>> >> > 
>> >> > 
>> >> > Notice the drop for 4 and 8 threads.  It may be worth noting that the
>> >> > default number of NFSD threads is 8.
>> >> > 
>> >> 
>> >> I guess we should ask Rafael to add this to the post-2.6.29 regression
>> >> list.
>> >
>> > I agree. It'd be nice to bisect this one down, I'm guessing some mm
>> > change has caused this writeout regression.
>> 
>> It's not writeout, it's a read test.
>
> Doh sorry, I even ran these tests as well a few weeks back. So perhaps
> some read-ahead change, I didn't look into it. FWIW, on a single SATA
> drive here, it didn't show any difference.

OK, I bisected this to the following commit.  The mount is done using
NFSv3, by the way.

commit 47a14ef1af48c696b214ac168f056ddc79793d0e
Author: Olga Kornievskaia <aglo@citi.umich.edu>
Date:   Tue Oct 21 14:13:47 2008 -0400

    svcrpc: take advantage of tcp autotuning
    
    Allow the NFSv4 server to make use of TCP autotuning behaviour, which
    was previously disabled by setting the sk_userlocks variable.
    
    Set the receive buffers to be big enough to receive the whole RPC
    request, and set this for the listening socket, not the accept socket.
    
    Remove the code that readjusts the receive/send buffer sizes for the
    accepted socket. Previously this code was used to influence the TCP
    window management behaviour, which is no longer needed when autotuning
    is enabled.
    
    This can improve IO bandwidth on networks with high bandwidth-delay
    products, where a large tcp window is required.  It also simplifies
    performance tuning, since getting adequate tcp buffers previously
    required increasing the number of nfsd threads.
    
    Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
    Cc: Jim Rees <rees@umich.edu>
    Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 5763e64..7a2a90f 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -345,7 +345,6 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
 	lock_sock(sock->sk);
 	sock->sk->sk_sndbuf = snd * 2;
 	sock->sk->sk_rcvbuf = rcv * 2;
-	sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
 	release_sock(sock->sk);
 #endif
 }
@@ -797,23 +796,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		test_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags),
 		test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags));
 
-	if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
-		/* sndbuf needs to have room for one request
-		 * per thread, otherwise we can stall even when the
-		 * network isn't a bottleneck.
-		 *
-		 * We count all threads rather than threads in a
-		 * particular pool, which provides an upper bound
-		 * on the number of threads which will access the socket.
-		 *
-		 * rcvbuf just needs to be able to hold a few requests.
-		 * Normally they will be removed from the queue
-		 * as soon a a complete request arrives.
-		 */
-		svc_sock_setbufsize(svsk->sk_sock,
-				    (serv->sv_nrthreads+3) * serv->sv_max_mesg,
-				    3 * serv->sv_max_mesg);
-
 	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 
 	/* Receive data. If we haven't got the record length yet, get
@@ -1061,15 +1043,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 
 		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
 
-		/* initialise setting must have enough space to
-		 * receive and respond to one request.
-		 * svc_tcp_recvfrom will re-adjust if necessary
-		 */
-		svc_sock_setbufsize(svsk->sk_sock,
-				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
-				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
-
-		set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
 		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 		if (sk->sk_state != TCP_ESTABLISHED)
 			set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
@@ -1140,8 +1113,14 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
 	/* Initialize the socket */
 	if (sock->type == SOCK_DGRAM)
 		svc_udp_init(svsk, serv);
-	else
+	else {
+		/* initialise setting must have enough space to
+		 * receive and respond to one request.
+		 */
+		svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
+					4 * serv->sv_max_mesg);
 		svc_tcp_init(svsk, serv);
+	}
 
 	/*
 	 * We start one listener per sv_serv.  We want AF_INET


* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-13  3:29         ` Jeff Moyer
@ 2009-05-13  3:44           ` Andrew Morton
  2009-05-13 14:58             ` Jeff Moyer
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2009-05-13  3:44 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jens Axboe, linux-kernel, Rafael J. Wysocki, Olga Kornievskaia,
	J. Bruce Fields, Jim Rees, linux-nfs

(obvious cc's added...)

It's an iozone performance regression.

On Tue, 12 May 2009 23:29:30 -0400 Jeff Moyer <jmoyer@redhat.com> wrote:

> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > On Mon, May 11 2009, Jeff Moyer wrote:
> >> Jens Axboe <jens.axboe@oracle.com> writes:
> >> 
> >> > On Fri, May 08 2009, Andrew Morton wrote:
> >> >> On Thu, 23 Apr 2009 10:01:58 -0400
> >> >> Jeff Moyer <jmoyer@redhat.com> wrote:
> >> >> 
> >> >> > Hi,
> >> >> > 
> >> >> > I've been working on CFQ improvements for interleaved I/Os between
> >> >> > processes, and noticed a regression in performance when using the
> >> >> > deadline I/O scheduler.  The test uses a server configured with a cciss
> >> >> > array and 1Gb/s ethernet.
> >> >> > 
> >> >> > The iozone command line was:
> >> >> >   iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
> >> >> > 
> >> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
> >> >> > These numbers (in MB/s) represent the average of 5 runs.
> >> >> > 
> >> >> >                v2.6.29
> >> >> > 
> >> >> > nfsd's  |   1    |  2   |   4   |   8
> >> >> > --------+---------------+-------+------
> >> >> > deadline| 43207 | 67436 | 96289 | 107590
> >> >> > 
> >> >> >               2.6.30-rc1
> >> >> > 
> >> >> > nfsd's  |   1   |   2   |   4   |   8
> >> >> > --------+---------------+-------+------
> >> >> > deadline| 43732 | 68059 | 76659 | 83231
> >> >> > 
> >> >> >     2.6.30-rc3.block-for-linus
> >> >> > 
> >> >> > nfsd's  |   1   |   2   |   4   |   8
> >> >> > --------+---------------+-------+------
> >> >> > deadline| 46102 | 71151 | 83120 | 82330
> >> >> > 
> >> >> > 
> >> >> > Notice the drop for 4 and 8 threads.  It may be worth noting that the
> >> >> > default number of NFSD threads is 8.
> >> >> > 
> >> >> 
> >> >> I guess we should ask Rafael to add this to the post-2.6.29 regression
> >> >> list.
> >> >
> >> > I agree. It'd be nice to bisect this one down, I'm guessing some mm
> >> > change has caused this writeout regression.
> >> 
> >> It's not writeout, it's a read test.
> >
> > Doh sorry, I even ran these tests as well a few weeks back. So perhaps
> > some read-ahead change, I didn't look into it. FWIW, on a single SATA
> > drive here, it didn't show any difference.
> 
> OK, I bisected this to the following commit.  The mount is done using
> NFSv3, by the way.
> 
> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> Author: Olga Kornievskaia <aglo@citi.umich.edu>
> Date:   Tue Oct 21 14:13:47 2008 -0400
> 
>     svcrpc: take advantage of tcp autotuning
>     
>     Allow the NFSv4 server to make use of TCP autotuning behaviour, which
>     was previously disabled by setting the sk_userlocks variable.
>     
>     Set the receive buffers to be big enough to receive the whole RPC
>     request, and set this for the listening socket, not the accept socket.
>     
>     Remove the code that readjusts the receive/send buffer sizes for the
>     accepted socket. Previously this code was used to influence the TCP
>     window management behaviour, which is no longer needed when autotuning
>     is enabled.
>     
>     This can improve IO bandwidth on networks with high bandwidth-delay
>     products, where a large tcp window is required.  It also simplifies
>     performance tuning, since getting adequate tcp buffers previously
>     required increasing the number of nfsd threads.
>     
>     Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
>     Cc: Jim Rees <rees@umich.edu>
>     Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 5763e64..7a2a90f 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -345,7 +345,6 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
>  	lock_sock(sock->sk);
>  	sock->sk->sk_sndbuf = snd * 2;
>  	sock->sk->sk_rcvbuf = rcv * 2;
> -	sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
>  	release_sock(sock->sk);
>  #endif
>  }
> @@ -797,23 +796,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>  		test_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags),
>  		test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags));
>  
> -	if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
> -		/* sndbuf needs to have room for one request
> -		 * per thread, otherwise we can stall even when the
> -		 * network isn't a bottleneck.
> -		 *
> -		 * We count all threads rather than threads in a
> -		 * particular pool, which provides an upper bound
> -		 * on the number of threads which will access the socket.
> -		 *
> -		 * rcvbuf just needs to be able to hold a few requests.
> -		 * Normally they will be removed from the queue
> -		 * as soon a a complete request arrives.
> -		 */
> -		svc_sock_setbufsize(svsk->sk_sock,
> -				    (serv->sv_nrthreads+3) * serv->sv_max_mesg,
> -				    3 * serv->sv_max_mesg);
> -
>  	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>  
>  	/* Receive data. If we haven't got the record length yet, get
> @@ -1061,15 +1043,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
>  
>  		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>  
> -		/* initialise setting must have enough space to
> -		 * receive and respond to one request.
> -		 * svc_tcp_recvfrom will re-adjust if necessary
> -		 */
> -		svc_sock_setbufsize(svsk->sk_sock,
> -				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
> -				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
> -
> -		set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
>  		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>  		if (sk->sk_state != TCP_ESTABLISHED)
>  			set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
> @@ -1140,8 +1113,14 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>  	/* Initialize the socket */
>  	if (sock->type == SOCK_DGRAM)
>  		svc_udp_init(svsk, serv);
> -	else
> +	else {
> +		/* initialise setting must have enough space to
> +		 * receive and respond to one request.
> +		 */
> +		svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
> +					4 * serv->sv_max_mesg);
>  		svc_tcp_init(svsk, serv);
> +	}
>  
>  	/*
>  	 * We start one listener per sv_serv.  We want AF_INET


* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-13  3:44           ` Andrew Morton
@ 2009-05-13 14:58             ` Jeff Moyer
  2009-05-13 16:20               ` Olga Kornievskaia
  2009-05-13 19:29               ` Jeff Moyer
  0 siblings, 2 replies; 27+ messages in thread
From: Jeff Moyer @ 2009-05-13 14:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, linux-kernel, Rafael J. Wysocki, Olga Kornievskaia,
	J. Bruce Fields, Jim Rees, linux-nfs

Andrew Morton <akpm@linux-foundation.org> writes:

> (obvious cc's added...)
>
> It's an iozone performance regression.
>
> On Tue, 12 May 2009 23:29:30 -0400 Jeff Moyer <jmoyer@redhat.com> wrote:
>
>> Jens Axboe <jens.axboe@oracle.com> writes:
>> 
>> > On Mon, May 11 2009, Jeff Moyer wrote:
>> >> Jens Axboe <jens.axboe@oracle.com> writes:
>> >> 
>> >> > On Fri, May 08 2009, Andrew Morton wrote:
>> >> >> On Thu, 23 Apr 2009 10:01:58 -0400
>> >> >> Jeff Moyer <jmoyer@redhat.com> wrote:
>> >> >> 
>> >> >> > Hi,
>> >> >> > 
>> >> >> > I've been working on CFQ improvements for interleaved I/Os between
>> >> >> > processes, and noticed a regression in performance when using the
>> >> >> > deadline I/O scheduler.  The test uses a server configured with a cciss
>> >> >> > array and 1Gb/s ethernet.
>> >> >> > 
>> >> >> > The iozone command line was:
>> >> >> >   iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
>> >> >> > 
>> >> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
>> >> >> > These numbers (in MB/s) represent the average of 5 runs.
>> >> >> > 
>> >> >> >                v2.6.29
>> >> >> > 
>> >> >> > nfsd's  |   1    |  2   |   4   |   8
>> >> >> > --------+---------------+-------+------
>> >> >> > deadline| 43207 | 67436 | 96289 | 107590
>> >> >> > 
>> >> >> >               2.6.30-rc1
>> >> >> > 
>> >> >> > nfsd's  |   1   |   2   |   4   |   8
>> >> >> > --------+---------------+-------+------
>> >> >> > deadline| 43732 | 68059 | 76659 | 83231
>> >> >> > 
>> >> >> >     2.6.30-rc3.block-for-linus
>> >> >> > 
>> >> >> > nfsd's  |   1   |   2   |   4   |   8
>> >> >> > --------+---------------+-------+------
>> >> >> > deadline| 46102 | 71151 | 83120 | 82330
>> >> >> > 
>> >> >> > 
>> >> >> > Notice the drop for 4 and 8 threads.  It may be worth noting that the
>> >> >> > default number of NFSD threads is 8.

Just following up with numbers:

  2.6.30-rc4

nfsd's  |   8
--------+------
cfq     | 51632   (49791 52436 52308 51488 52141)
deadline| 65558   (41675 42559 74820 87518 81221)

   2.6.30-rc4 reverting the sunrpc "fix"

nfsd's  |   8
--------+------
cfq     |  82513  (81650 82762 83147 82935 82073)
deadline| 107827  (109730 106077 107175 108524 107632)

The numbers in parentheses are the individual runs.  Notice how
2.6.30-rc4 shows some pretty wide run-to-run variation for deadline.

Cheers,
Jeff

>> >> >> I guess we should ask Rafael to add this to the post-2.6.29 regression
>> >> >> list.
>> >> >
>> >> > I agree. It'd be nice to bisect this one down, I'm guessing some mm
>> >> > change has caused this writeout regression.
>> >> 
>> >> It's not writeout, it's a read test.
>> >
>> > Doh sorry, I even ran these tests as well a few weeks back. So perhaps
>> > some read-ahead change, I didn't look into it. FWIW, on a single SATA
>> > drive here, it didn't show any difference.
>> 
>> OK, I bisected this to the following commit.  The mount is done using
>> NFSv3, by the way.
>> 
>> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>> Author: Olga Kornievskaia <aglo@citi.umich.edu>
>> Date:   Tue Oct 21 14:13:47 2008 -0400
>> 
>>     svcrpc: take advantage of tcp autotuning
>>     
>>     Allow the NFSv4 server to make use of TCP autotuning behaviour, which
>>     was previously disabled by setting the sk_userlocks variable.
>>     
>>     Set the receive buffers to be big enough to receive the whole RPC
>>     request, and set this for the listening socket, not the accept socket.
>>     
>>     Remove the code that readjusts the receive/send buffer sizes for the
>>     accepted socket. Previously this code was used to influence the TCP
>>     window management behaviour, which is no longer needed when autotuning
>>     is enabled.
>>     
>>     This can improve IO bandwidth on networks with high bandwidth-delay
>>     products, where a large tcp window is required.  It also simplifies
>>     performance tuning, since getting adequate tcp buffers previously
>>     required increasing the number of nfsd threads.
>>     
>>     Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
>>     Cc: Jim Rees <rees@umich.edu>
>>     Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
>> 
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index 5763e64..7a2a90f 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -345,7 +345,6 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
>>  	lock_sock(sock->sk);
>>  	sock->sk->sk_sndbuf = snd * 2;
>>  	sock->sk->sk_rcvbuf = rcv * 2;
>> -	sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
>>  	release_sock(sock->sk);
>>  #endif
>>  }
>> @@ -797,23 +796,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>>  		test_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags),
>>  		test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags));
>>  
>> -	if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
>> -		/* sndbuf needs to have room for one request
>> -		 * per thread, otherwise we can stall even when the
>> -		 * network isn't a bottleneck.
>> -		 *
>> -		 * We count all threads rather than threads in a
>> -		 * particular pool, which provides an upper bound
>> -		 * on the number of threads which will access the socket.
>> -		 *
>> -		 * rcvbuf just needs to be able to hold a few requests.
>> -		 * Normally they will be removed from the queue
>> -		 * as soon a a complete request arrives.
>> -		 */
>> -		svc_sock_setbufsize(svsk->sk_sock,
>> -				    (serv->sv_nrthreads+3) * serv->sv_max_mesg,
>> -				    3 * serv->sv_max_mesg);
>> -
>>  	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>  
>>  	/* Receive data. If we haven't got the record length yet, get
>> @@ -1061,15 +1043,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
>>  
>>  		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>  
>> -		/* initialise setting must have enough space to
>> -		 * receive and respond to one request.
>> -		 * svc_tcp_recvfrom will re-adjust if necessary
>> -		 */
>> -		svc_sock_setbufsize(svsk->sk_sock,
>> -				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
>> -				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
>> -
>> -		set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
>>  		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>  		if (sk->sk_state != TCP_ESTABLISHED)
>>  			set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
>> @@ -1140,8 +1113,14 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>>  	/* Initialize the socket */
>>  	if (sock->type == SOCK_DGRAM)
>>  		svc_udp_init(svsk, serv);
>> -	else
>> +	else {
>> +		/* initialise setting must have enough space to
>> +		 * receive and respond to one request.
>> +		 */
>> +		svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
>> +					4 * serv->sv_max_mesg);
>>  		svc_tcp_init(svsk, serv);
>> +	}
>>  
>>  	/*
>>  	 * We start one listener per sv_serv.  We want AF_INET


* Re: 2.6.30-rc deadline scheduler performance regression for iozone  over NFS
  2009-05-13 14:58             ` Jeff Moyer
@ 2009-05-13 16:20               ` Olga Kornievskaia
  2009-05-13 16:32                 ` Andrew Morton
  2009-05-13 19:29               ` Jeff Moyer
  1 sibling, 1 reply; 27+ messages in thread
From: Olga Kornievskaia @ 2009-05-13 16:20 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Andrew Morton, Jens Axboe, linux-kernel, Rafael J. Wysocki,
	J. Bruce Fields, Jim Rees, linux-nfs

I believe what you are seeing is how well TCP autotuning performs.
What the old NFS code did was disable autotuning and instead use the
number of nfsd threads to scale the TCP receive window.  You are
providing an example where setting the TCP buffer sizes explicitly
outperforms TCP autotuning.  While this is a valid example, there are
also examples where the old NFS design hurts performance.
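
To make the thread-count scaling concrete, here is a rough, illustrative
calculation (not from the thread) based on the formulas visible in the
removed code; the ~1 MB value for sv_max_mesg is an assumption for
illustration only, and the program is just a sketch:

#include <stdio.h>

/*
 * Rough illustration only: the pre-2.6.30 svcsock code set
 * sk_sndbuf = 2 * (nrthreads + 3) * sv_max_mesg, so adding nfsd
 * threads directly grew the pinned send buffer.  The new code sets
 * a fixed 2 * 4 * sv_max_mesg on the listening socket and lets TCP
 * autotuning take over from there.
 */
int main(void)
{
	const long max_mesg = 1L << 20;	/* assumed sv_max_mesg, ~1 MB */
	int threads;

	for (threads = 1; threads <= 8; threads *= 2)
		printf("%d nfsd threads -> old pinned sndbuf ~%ld MB\n",
		       threads, 2 * (threads + 3) * max_mesg >> 20);
	printf("new fixed listener sndbuf ~%ld MB\n", 2 * 4 * max_mesg >> 20);
	return 0;
}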

On Wed, May 13, 2009 at 10:58 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
>
>> (obvious cc's added...)
>>
>> It's an iozone performance regression.
>>
>> On Tue, 12 May 2009 23:29:30 -0400 Jeff Moyer <jmoyer@redhat.com> wrote:
>>
>>> Jens Axboe <jens.axboe@oracle.com> writes:
>>>
>>> > On Mon, May 11 2009, Jeff Moyer wrote:
>>> >> Jens Axboe <jens.axboe@oracle.com> writes:
>>> >>
>>> >> > On Fri, May 08 2009, Andrew Morton wrote:
>>> >> >> On Thu, 23 Apr 2009 10:01:58 -0400
>>> >> >> Jeff Moyer <jmoyer@redhat.com> wrote:
>>> >> >>
>>> >> >> > Hi,
>>> >> >> >
>>> >> >> > I've been working on CFQ improvements for interleaved I/Os between
>>> >> >> > processes, and noticed a regression in performance when using the
>>> >> >> > deadline I/O scheduler.  The test uses a server configured with a cciss
>>> >> >> > array and 1Gb/s ethernet.
>>> >> >> >
>>> >> >> > The iozone command line was:
>>> >> >> >   iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
>>> >> >> >
>>> >> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
>>> >> >> > These numbers (in MB/s) represent the average of 5 runs.
>>> >> >> >
>>> >> >> >                v2.6.29
>>> >> >> >
>>> >> >> > nfsd's  |   1    |  2   |   4   |   8
>>> >> >> > --------+---------------+-------+------
>>> >> >> > deadline| 43207 | 67436 | 96289 | 107590
>>> >> >> >
>>> >> >> >               2.6.30-rc1
>>> >> >> >
>>> >> >> > nfsd's  |   1   |   2   |   4   |   8
>>> >> >> > --------+---------------+-------+------
>>> >> >> > deadline| 43732 | 68059 | 76659 | 83231
>>> >> >> >
>>> >> >> >     2.6.30-rc3.block-for-linus
>>> >> >> >
>>> >> >> > nfsd's  |   1   |   2   |   4   |   8
>>> >> >> > --------+---------------+-------+------
>>> >> >> > deadline| 46102 | 71151 | 83120 | 82330
>>> >> >> >
>>> >> >> >
>>> >> >> > Notice the drop for 4 and 8 threads.  It may be worth noting that the
>>> >> >> > default number of NFSD threads is 8.
>
> Just following up with numbers:
>
>  2.6.30-rc4
>
> nfsd's  |   8
> --------+------
> cfq     | 51632   (49791 52436 52308 51488 52141)
> deadline| 65558   (41675 42559 74820 87518 81221)
>
>   2.6.30-rc4 reverting the sunrpc "fix"
>
> nfsd's  |   8
> --------+------
> cfq     |  82513  (81650 82762 83147 82935 82073)
> deadline| 107827  (109730 106077 107175 108524 107632)
>
> The numbers in parenthesis are the individual runs.  Notice how
> 2.6.30-rc4 has some pretty wide variations for deadline.
>
> Cheers,
> Jeff
>
>>> >> >> I guess we should ask Rafael to add this to the post-2.6.29 regression
>>> >> >> list.
>>> >> >
>>> >> > I agree. It'd be nice to bisect this one down, I'm guessing some mm
>>> >> > change has caused this writeout regression.
>>> >>
>>> >> It's not writeout, it's a read test.
>>> >
>>> > Doh sorry, I even ran these tests as well a few weeks back. So perhaps
>>> > some read-ahead change, I didn't look into it. FWIW, on a single SATA
>>> > drive here, it didn't show any difference.
>>>
>>> OK, I bisected this to the following commit.  The mount is done using
>>> NFSv3, by the way.
>>>
>>> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>>> Author: Olga Kornievskaia <aglo@citi.umich.edu>
>>> Date:   Tue Oct 21 14:13:47 2008 -0400
>>>
>>>     svcrpc: take advantage of tcp autotuning
>>>
>>>     Allow the NFSv4 server to make use of TCP autotuning behaviour, which
>>>     was previously disabled by setting the sk_userlocks variable.
>>>
>>>     Set the receive buffers to be big enough to receive the whole RPC
>>>     request, and set this for the listening socket, not the accept socket.
>>>
>>>     Remove the code that readjusts the receive/send buffer sizes for the
>>>     accepted socket. Previously this code was used to influence the TCP
>>>     window management behaviour, which is no longer needed when autotuning
>>>     is enabled.
>>>
>>>     This can improve IO bandwidth on networks with high bandwidth-delay
>>>     products, where a large tcp window is required.  It also simplifies
>>>     performance tuning, since getting adequate tcp buffers previously
>>>     required increasing the number of nfsd threads.
>>>
>>>     Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
>>>     Cc: Jim Rees <rees@umich.edu>
>>>     Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
>>>
>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>> index 5763e64..7a2a90f 100644
>>> --- a/net/sunrpc/svcsock.c
>>> +++ b/net/sunrpc/svcsock.c
>>> @@ -345,7 +345,6 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
>>>      lock_sock(sock->sk);
>>>      sock->sk->sk_sndbuf = snd * 2;
>>>      sock->sk->sk_rcvbuf = rcv * 2;
>>> -    sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
>>>      release_sock(sock->sk);
>>>  #endif
>>>  }
>>> @@ -797,23 +796,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>>>              test_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags),
>>>              test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags));
>>>
>>> -    if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
>>> -            /* sndbuf needs to have room for one request
>>> -             * per thread, otherwise we can stall even when the
>>> -             * network isn't a bottleneck.
>>> -             *
>>> -             * We count all threads rather than threads in a
>>> -             * particular pool, which provides an upper bound
>>> -             * on the number of threads which will access the socket.
>>> -             *
>>> -             * rcvbuf just needs to be able to hold a few requests.
>>> -             * Normally they will be removed from the queue
>>> -             * as soon a a complete request arrives.
>>> -             */
>>> -            svc_sock_setbufsize(svsk->sk_sock,
>>> -                                (serv->sv_nrthreads+3) * serv->sv_max_mesg,
>>> -                                3 * serv->sv_max_mesg);
>>> -
>>>      clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>
>>>      /* Receive data. If we haven't got the record length yet, get
>>> @@ -1061,15 +1043,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
>>>
>>>              tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>
>>> -            /* initialise setting must have enough space to
>>> -             * receive and respond to one request.
>>> -             * svc_tcp_recvfrom will re-adjust if necessary
>>> -             */
>>> -            svc_sock_setbufsize(svsk->sk_sock,
>>> -                                3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
>>> -                                3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
>>> -
>>> -            set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
>>>              set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>              if (sk->sk_state != TCP_ESTABLISHED)
>>>                      set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
>>> @@ -1140,8 +1113,14 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>>>      /* Initialize the socket */
>>>      if (sock->type == SOCK_DGRAM)
>>>              svc_udp_init(svsk, serv);
>>> -    else
>>> +    else {
>>> +            /* initialise setting must have enough space to
>>> +             * receive and respond to one request.
>>> +             */
>>> +            svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
>>> +                                    4 * serv->sv_max_mesg);
>>>              svc_tcp_init(svsk, serv);
>>> +    }
>>>
>>>      /*
>>>       * We start one listener per sv_serv.  We want AF_INET
>
>


* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-13 16:20               ` Olga Kornievskaia
@ 2009-05-13 16:32                 ` Andrew Morton
  2009-05-13 18:16                   ` Olga Kornievskaia
  2009-05-13 18:25                   ` Jim Rees
  0 siblings, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2009-05-13 16:32 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Jeff Moyer, Jens Axboe, linux-kernel, Rafael J. Wysocki,
	J. Bruce Fields, Jim Rees, linux-nfs

On Wed, 13 May 2009 12:20:57 -0400 Olga Kornievskaia <aglo@citi.umich.edu> wrote:

> I believe what you are seeing is how well TCP autotuning performs.
> What old NFS code was doing is disabling autotuning and instead using
> #nfsd thread to scale TCP recv window. You are providing an example of
> where setting TCP buffer sizes outperforms TCP autotuning. While this
> is a valid example, there is also an alternative example of where old
> NFS design hurts performance.

<scratches head>

Jeff's computer got slower.  Can we fix that?


* Re: 2.6.30-rc deadline scheduler performance regression for iozone  over NFS
  2009-05-13 16:32                 ` Andrew Morton
@ 2009-05-13 18:16                   ` Olga Kornievskaia
  2009-05-13 19:06                     ` Jeff Moyer
  2009-05-13 18:25                   ` Jim Rees
  1 sibling, 1 reply; 27+ messages in thread
From: Olga Kornievskaia @ 2009-05-13 18:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Moyer, Jens Axboe, linux-kernel, Rafael J. Wysocki,
	J. Bruce Fields, Jim Rees, linux-nfs

On Wed, May 13, 2009 at 12:32 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 13 May 2009 12:20:57 -0400 Olga Kornievskaia <aglo@citi.umich.edu> wrote:
>
>> I believe what you are seeing is how well TCP autotuning performs.
>> What old NFS code was doing is disabling autotuning and instead using
>> #nfsd thread to scale TCP recv window. You are providing an example of
>> where setting TCP buffer sizes outperforms TCP autotuning. While this
>> is a valid example, there is also an alternative example of where old
>> NFS design hurts performance.
>
> <scratches head>
>
> Jeff's computer got slower.  Can we fix that?

We realize that the decrease in performance is a problem and understand
that reverting the patch might be the appropriate course of action!

But we are curious why this is happening.  Jeff, if it's not too much
trouble, could you generate tcpdumps for both cases?  We are curious
what the max window sizes are in each case.  Also, could you give us
the TCP and network sysctl values for the testing environment (both
client and server), which you can get with "sysctl -a | grep tcp" and
"sysctl -a | grep net.core"?


Poor performance with TCP autotuning can be demonstrated outside of NFS
using iperf.  It can be shown that iperf performs better when the "-w"
flag is used.  When this flag is set, iperf makes a setsockopt() call,
which in the kernel turns off autotuning for that socket.
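
For reference, a minimal user-space sketch (not from the thread) of the
mechanism being described: explicitly setting SO_SNDBUF/SO_RCVBUF marks
the buffer sizes as user-locked in the kernel, so autotuning no longer
adjusts them.  The function name is just for illustration:

#include <sys/socket.h>

/*
 * Minimal sketch: pin a socket's buffer sizes the way iperf -w does.
 * Once these options are set, TCP autotuning leaves the buffers alone.
 */
static int pin_socket_buffers(int fd, int sndbytes, int rcvbytes)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
		       &sndbytes, sizeof(sndbytes)) < 0)
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
			  &rcvbytes, sizeof(rcvbytes));
}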

As for fixing this, it would be great if we could get some help from
the TCP kernel folks.

Another thing I should mention is that the proposed NFS patch does
still reach into the TCP buffers, because we need to make sure the
receive buffer is big enough to hold a whole RPC.  To use autotuning,
NFS would otherwise have to rely on the system-wide sysctl values.  One
way to ensure that an RPC would fit is to increase the system-wide
default TCP receive buffer, but then every connection would use that
value.  Rather than impose such a requirement, we set the buffer size
big enough internally.


* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-13 16:32                 ` Andrew Morton
  2009-05-13 18:16                   ` Olga Kornievskaia
@ 2009-05-13 18:25                   ` Jim Rees
  2009-05-13 19:45                     ` Trond Myklebust
  1 sibling, 1 reply; 27+ messages in thread
From: Jim Rees @ 2009-05-13 18:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Olga Kornievskaia, Jeff Moyer, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, J. Bruce Fields, linux-nfs

Andrew Morton wrote:

  Jeff's computer got slower.  Can we fix that?

TCP autotuning can reduce performance by up to about 10% in some cases.
Jeff found one of these cases.  While the autotuning penalty never exceeds
10% as far as I know, I can provide examples of other cases where autotuning
improves nfsd performance by more than a factor of 100.

The right thing is to fix autotuning.  If autotuning is considered too
broken to use, it should be turned off everywhere, not just in nfsd, as it
hurts/benefits all TCP clients, not just nfs.

This topic has been discussed before on netdev:
http://www.spinics.net/lists/netdev/msg68650.html
http://www.spinics.net/lists/netdev/msg68155.html


* Re: 2.6.30-rc deadline scheduler performance regression for iozone  over NFS
  2009-05-13 18:16                   ` Olga Kornievskaia
@ 2009-05-13 19:06                     ` Jeff Moyer
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Moyer @ 2009-05-13 19:06 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Andrew Morton, Jens Axboe, linux-kernel, Rafael J. Wysocki,
	J. Bruce Fields, Jim Rees, linux-nfs

Olga Kornievskaia <aglo@citi.umich.edu> writes:

> On Wed, May 13, 2009 at 12:32 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>> On Wed, 13 May 2009 12:20:57 -0400 Olga Kornievskaia <aglo@citi.umich.edu> wrote:
>>
>>> I believe what you are seeing is how well TCP autotuning performs.
>>> What old NFS code was doing is disabling autotuning and instead using
>>> #nfsd thread to scale TCP recv window. You are providing an example of
>>> where setting TCP buffer sizes outperforms TCP autotuning. While this
>>> is a valid example, there is also an alternative example of where old
>>> NFS design hurts performance.
>>
>> <scratches head>
>>
>> Jeff's computer got slower.  Can we fix that?
>
> We realize that decrease performance is a problem and understand that
> reverting the patch might be the appropriate course of action!

I wasn't suggesting that we just revert the patch.  I was just looking
for some guidance on diagnosing and hopefully fixing the regression.

> But we are curious why this is happening. Jeff if it's not too much trouble
> could you generate tcpdumps for both cases. We are curious what are
> the max window sizes in both cases? Also could you give us your tcp and
> network sysctl values for the testing environment (both client and server
> values) that you can get with "sysctl -a | grep tcp" and also
> " | grep net.core".

http://people.redhat.com/jmoyer/iozone-regression.tar

I'm happy to continue to help track this down.  If you want to reproduce
this in your own environment, though, you can probably do it with a
ramdisk served up via nfs with the nfs client and server on the same
gig-e network.

> Poor performance using TCP autotuning can be demonstrated outside
> of NFS but using Iperf. It can be shown that iperf will work better if "-w"
> flag is used. When this flag is set, Iperf calls setsockopt() call which in
> the kernel turns off autotuning.
>
> As for fixing this it would be great if we could get some help from the
> TCP kernel folks?

Then we'll need to add netdev to the CC, but probably from a message
that has more background on the problem (we've even trimmed the
offending commit and performance numbers from the email at this point).

Cheers,
Jeff


* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-13 14:58             ` Jeff Moyer
  2009-05-13 16:20               ` Olga Kornievskaia
@ 2009-05-13 19:29               ` Jeff Moyer
  2009-05-13 23:45                 ` Trond Myklebust
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2009-05-13 19:29 UTC (permalink / raw)
  To: netdev, Andrew Morton
  Cc: Jens Axboe, linux-kernel, Rafael J. Wysocki, Olga Kornievskaia,
	J. Bruce Fields, Jim Rees, linux-nfs

Hi, netdev folks.  The summary here is:

A patch added in the 2.6.30 development cycle caused a performance
regression in my NFS iozone testing.  The patch in question is the
following:

commit 47a14ef1af48c696b214ac168f056ddc79793d0e
Author: Olga Kornievskaia <aglo@citi.umich.edu>
Date:   Tue Oct 21 14:13:47 2008 -0400

    svcrpc: take advantage of tcp autotuning
 
which is also quoted below.  Using 8 nfsd threads, a single client doing
2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
KB/s under 2.6.30-rc4.  I also see more run to run variation under
2.6.30-rc4 using the deadline I/O scheduler on the server.  That
variation disappears (as does the performance regression) when reverting
the above commit.

Olga had the following to say about the regression:

>> On Wed, 13 May 2009 12:20:57 -0400 Olga Kornievskaia <aglo@citi.umich.edu> wrote:
>>
>>> I believe what you are seeing is how well TCP autotuning performs.
>>> What old NFS code was doing is disabling autotuning and instead using
>>> #nfsd thread to scale TCP recv window. You are providing an example of
>>> where setting TCP buffer sizes outperforms TCP autotuning. While this
>>> is a valid example, there is also an alternative example of where old
>>> NFS design hurts performance.

> We realize that decrease performance is a problem and understand that
> reverting the patch might be the appropriate course of action!

> But we are curious why this is happening. Jeff if it's not too much trouble
> could you generate tcpdumps for both cases. We are curious what are
> the max window sizes in both cases? Also could you give us your tcp and
> network sysctl values for the testing environment (both client and server
> values) that you can get with "sysctl -a | grep tcp" and also
> " | grep net.core".

The requested data can be found here:
http://people.redhat.com/jmoyer/iozone-regression.tar

> Poor performance using TCP autotuning can be demonstrated outside
> of NFS but using Iperf. It can be shown that iperf will work better if "-w"
> flag is used. When this flag is set, Iperf calls setsockopt() call which in
> the kernel turns off autotuning.
>
> As for fixing this it would be great if we could get some help from the
> TCP kernel folks?

And so I've CC'd netdev.

Jim Rees had the following to add:

> TCP autotuning can reduce performance by up to about 10% in some cases.
> Jeff found one of these cases.  While the autotuning penalty never exceeds
> 10% as far as I know, I can provide examples of other cases where autotuning
> improves nfsd performance by more than a factor of 100.
> 
> The right thing is to fix autotuning.  If autotuning is considered too
> broken to use, it should be turned off everywhere, not just in nfsd, as it
> hurts/benefits all TCP clients, not just nfs.

> This topic has been discussed before on netdev:
> http://www.spinics.net/lists/netdev/msg68650.html
> http://www.spinics.net/lists/netdev/msg68155.html

I'd like to point out that the penalty is much greater than 10% in my
case.

Here are the data points:

Jeff Moyer <jmoyer@redhat.com> writes:

>>> >> >> Jeff Moyer <jmoyer@redhat.com> wrote:
>>> >> >> 
>>> >> >> > Hi,
>>> >> >> > 
>>> >> >> > I've been working on CFQ improvements for interleaved I/Os between
>>> >> >> > processes, and noticed a regression in performance when using the
>>> >> >> > deadline I/O scheduler.  The test uses a server configured with a cciss
>>> >> >> > array and 1Gb/s ethernet.
>>> >> >> > 
>>> >> >> > The iozone command line was:
>>> >> >> >   iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
>>> >> >> > 
>>> >> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
>>> >> >> > These numbers (in MB/s) represent the average of 5 runs.
>>> >> >> > 
>>> >> >> >                v2.6.29
>>> >> >> > 
>>> >> >> > nfsd's  |   1    |  2   |   4   |   8
>>> >> >> > --------+---------------+-------+------
>>> >> >> > deadline| 43207 | 67436 | 96289 | 107590
>>> >> >> > 
>>> >> >> >               2.6.30-rc1
>>> >> >> > 
>>> >> >> > nfsd's  |   1   |   2   |   4   |   8
>>> >> >> > --------+---------------+-------+------
>>> >> >> > deadline| 43732 | 68059 | 76659 | 83231
>>> >> >> > 
>>> >> >> >     2.6.30-rc3.block-for-linus
>>> >> >> > 
>>> >> >> > nfsd's  |   1   |   2   |   4   |   8
>>> >> >> > --------+---------------+-------+------
>>> >> >> > deadline| 46102 | 71151 | 83120 | 82330
>>> >> >> > 
>>> >> >> > 
>>> >> >> > Notice the drop for 4 and 8 threads.  It may be worth noting that the
>>> >> >> > default number of NFSD threads is 8.
>
> Just following up with numbers:
>
>   2.6.30-rc4
>
> nfsd's  |   8
> --------+------
> cfq     | 51632   (49791 52436 52308 51488 52141)
> deadline| 65558   (41675 42559 74820 87518 81221)
>
>    2.6.30-rc4 reverting the sunrpc "fix"
>
> nfsd's  |   8
> --------+------
> cfq     |  82513  (81650 82762 83147 82935 82073)
> deadline| 107827  (109730 106077 107175 108524 107632)
>
> The numbers in parenthesis are the individual runs.  Notice how
> 2.6.30-rc4 has some pretty wide variations for deadline.
>
>>> OK, I bisected this to the following commit.  The mount is done using
>>> NFSv3, by the way.
>>> 
>>> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>>> Author: Olga Kornievskaia <aglo@citi.umich.edu>
>>> Date:   Tue Oct 21 14:13:47 2008 -0400
>>> 
>>>     svcrpc: take advantage of tcp autotuning
>>>     
>>>     Allow the NFSv4 server to make use of TCP autotuning behaviour, which
>>>     was previously disabled by setting the sk_userlocks variable.
>>>     
>>>     Set the receive buffers to be big enough to receive the whole RPC
>>>     request, and set this for the listening socket, not the accept socket.
>>>     
>>>     Remove the code that readjusts the receive/send buffer sizes for the
>>>     accepted socket. Previously this code was used to influence the TCP
>>>     window management behaviour, which is no longer needed when autotuning
>>>     is enabled.
>>>     
>>>     This can improve IO bandwidth on networks with high bandwidth-delay
>>>     products, where a large tcp window is required.  It also simplifies
>>>     performance tuning, since getting adequate tcp buffers previously
>>>     required increasing the number of nfsd threads.
>>>     
>>>     Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
>>>     Cc: Jim Rees <rees@umich.edu>
>>>     Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
>>> 
>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>> index 5763e64..7a2a90f 100644
>>> --- a/net/sunrpc/svcsock.c
>>> +++ b/net/sunrpc/svcsock.c
>>> @@ -345,7 +345,6 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
>>>  	lock_sock(sock->sk);
>>>  	sock->sk->sk_sndbuf = snd * 2;
>>>  	sock->sk->sk_rcvbuf = rcv * 2;
>>> -	sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
>>>  	release_sock(sock->sk);
>>>  #endif
>>>  }
>>> @@ -797,23 +796,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>>>  		test_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags),
>>>  		test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags));
>>>  
>>> -	if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
>>> -		/* sndbuf needs to have room for one request
>>> -		 * per thread, otherwise we can stall even when the
>>> -		 * network isn't a bottleneck.
>>> -		 *
>>> -		 * We count all threads rather than threads in a
>>> -		 * particular pool, which provides an upper bound
>>> -		 * on the number of threads which will access the socket.
>>> -		 *
>>> -		 * rcvbuf just needs to be able to hold a few requests.
>>> -		 * Normally they will be removed from the queue
>>> -		 * as soon a a complete request arrives.
>>> -		 */
>>> -		svc_sock_setbufsize(svsk->sk_sock,
>>> -				    (serv->sv_nrthreads+3) * serv->sv_max_mesg,
>>> -				    3 * serv->sv_max_mesg);
>>> -
>>>  	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>  
>>>  	/* Receive data. If we haven't got the record length yet, get
>>> @@ -1061,15 +1043,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
>>>  
>>>  		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>  
>>> -		/* initialise setting must have enough space to
>>> -		 * receive and respond to one request.
>>> -		 * svc_tcp_recvfrom will re-adjust if necessary
>>> -		 */
>>> -		svc_sock_setbufsize(svsk->sk_sock,
>>> -				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
>>> -				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
>>> -
>>> -		set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
>>>  		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>  		if (sk->sk_state != TCP_ESTABLISHED)
>>>  			set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
>>> @@ -1140,8 +1113,14 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>>>  	/* Initialize the socket */
>>>  	if (sock->type == SOCK_DGRAM)
>>>  		svc_udp_init(svsk, serv);
>>> -	else
>>> +	else {
>>> +		/* initialise setting must have enough space to
>>> +		 * receive and respond to one request.
>>> +		 */
>>> +		svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
>>> +					4 * serv->sv_max_mesg);
>>>  		svc_tcp_init(svsk, serv);
>>> +	}
>>>  
>>>  	/*
>>>  	 * We start one listener per sv_serv.  We want AF_INET


* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-13 18:25                   ` Jim Rees
@ 2009-05-13 19:45                     ` Trond Myklebust
  0 siblings, 0 replies; 27+ messages in thread
From: Trond Myklebust @ 2009-05-13 19:45 UTC (permalink / raw)
  To: Jim Rees
  Cc: Andrew Morton, Olga Kornievskaia, Jeff Moyer, Jens Axboe,
	linux-kernel, Rafael J. Wysocki, J. Bruce Fields, linux-nfs

On Wed, 2009-05-13 at 14:25 -0400, Jim Rees wrote:
> Andrew Morton wrote:
> 
>   Jeff's computer got slower.  Can we fix that?
> 
> TCP autotuning can reduce performance by up to about 10% in some cases.
> Jeff found one of these cases.  While the autotuning penalty never exceeds
> 10% as far as I know, I can provide examples of other cases where autotuning
> improves nfsd performance by more than a factor of 100.
> 
> The right thing is to fix autotuning.  If autotuning is considered too
> broken to use, it should be turned off everywhere, not just in nfsd, as it
> hurts/benefits all TCP clients, not just nfs.
> 
> This topic has been discussed before on netdev:
> http://www.spinics.net/lists/netdev/msg68650.html
> http://www.spinics.net/lists/netdev/msg68155.html

Yes, but one consequence of this patch is that the socket send buffer
size sk->sk_sndbuf is now initialised to a smaller value than before.

This again means that the test for xprt->xpt_ops->xpo_has_wspace() in
svc_xprt_enqueue() will fail more often, and so you will be able to
process fewer incoming requests in parallel while you are waiting for
the send window size to build up.

Perhaps the right thing to do here is to allow some limited violation of
the xpo_has_wspace() test while the send window is in the process of
building up?
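
Something along these lines, perhaps (a rough, untested sketch only, reusing
the existing xpt_reserved accounting):

	static int svc_tcp_has_wspace(struct svc_xprt *xprt)
	{
		struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
		struct svc_serv *serv = svsk->sk_xprt.xpt_server;

		/* Sketch: let at least one request through while the send
		 * window is still building up. */
		if (atomic_read(&xprt->xpt_reserved) == 0)
			return 1;
		/* Otherwise insist on room for the worst-case reply. */
		if (sk_stream_wspace(svsk->sk_sk) >=
		    atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg)
			return 1;
		set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
		return 0;
	}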

Cheers
  Trond


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-13 19:29               ` Jeff Moyer
@ 2009-05-13 23:45                 ` Trond Myklebust
  2009-05-14 13:34                   ` Jeff Moyer
  2009-05-14 17:55                   ` J. Bruce Fields
  0 siblings, 2 replies; 27+ messages in thread
From: Trond Myklebust @ 2009-05-13 23:45 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, J. Bruce Fields, Jim Rees,
	linux-nfs

On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> Hi, netdev folks.  The summary here is:
> 
> A patch added in the 2.6.30 development cycle caused a performance
> regression in my NFS iozone testing.  The patch in question is the
> following:
> 
> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> Author: Olga Kornievskaia <aglo@citi.umich.edu>
> Date:   Tue Oct 21 14:13:47 2008 -0400
> 
>     svcrpc: take advantage of tcp autotuning
>  
> which is also quoted below.  Using 8 nfsd threads, a single client doing
> 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> KB/s under 2.6.30-rc4.  I also see more run to run variation under
> 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
> variation disappears (as does the performance regression) when reverting
> the above commit.

It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
function. I can see no reason why we should stop processing new incoming
RPC requests just because the send buffer happens to be 2/3 full. If we
see that we have space for another reply, then we should just go for it.
OTOH, we do want to ensure that the SOCK_NOSPACE flag remains set, so
that the TCP layer knows that we're congested, and that we'd like it to
increase the send window size, please.

Could you therefore please see if the following (untested) patch helps?

Cheers
  Trond
---------------------------------------------------------------------
From 1545cbda1b1cda2500cb9db3c760a05fc4f6ed4d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Wed, 13 May 2009 19:44:58 -0400
Subject: [PATCH] SUNRPC: Fix the TCP server's send buffer accounting

Currently, the sunrpc server is refusing to allow us to process new RPC
calls if the TCP send buffer is 2/3 full, even if we do actually have
enough free space to guarantee that we can send another request.
The following patch fixes svc_tcp_has_wspace() so that we only stop
processing requests if we know that the socket buffer cannot possibly fit
another reply.

It also fixes the tcp write_space() callback so that we only clear the
SOCK_NOSPACE flag when the TCP send buffer is less than 2/3 full.
This should ensure that the send window will grow as per the standard TCP
socket code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 net/sunrpc/svcsock.c |   32 ++++++++++++++++----------------
 1 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index af31988..8962355 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -386,6 +386,15 @@ static void svc_write_space(struct sock *sk)
 	}
 }
 
+static void svc_tcp_write_space(struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+
+	if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock)
+		clear_bit(SOCK_NOSPACE, &sock->flags);
+	svc_write_space(sk);
+}
+
 /*
  * Copy the UDP datagram's destination address to the rqstp structure.
  * The 'destination' address in this case is the address to which the
@@ -964,23 +973,14 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
 	struct svc_sock *svsk =	container_of(xprt, struct svc_sock, sk_xprt);
 	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
 	int required;
-	int wspace;
-
-	/*
-	 * Set the SOCK_NOSPACE flag before checking the available
-	 * sock space.
-	 */
-	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
-	required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
-	wspace = sk_stream_wspace(svsk->sk_sk);
-
-	if (wspace < sk_stream_min_wspace(svsk->sk_sk))
-		return 0;
-	if (required * 2 > wspace)
-		return 0;
 
-	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
+	if (sk_stream_wspace(svsk->sk_sk) < required)
+		goto out_nospace;
 	return 1;
+out_nospace:
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	return 0;
 }
 
 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
@@ -1036,7 +1036,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		dprintk("setting up TCP socket for reading\n");
 		sk->sk_state_change = svc_tcp_state_change;
 		sk->sk_data_ready = svc_tcp_data_ready;
-		sk->sk_write_space = svc_write_space;
+		sk->sk_write_space = svc_tcp_write_space;
 
 		svsk->sk_reclen = 0;
 		svsk->sk_tcplen = 0;
-- 
1.6.0.4




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-13 23:45                 ` Trond Myklebust
@ 2009-05-14 13:34                   ` Jeff Moyer
  2009-05-14 14:33                     ` Trond Myklebust
  2009-05-14 17:55                   ` J. Bruce Fields
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2009-05-14 13:34 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, J. Bruce Fields, Jim Rees,
	linux-nfs

Trond Myklebust <trond.myklebust@fys.uio.no> writes:

> On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
>> Hi, netdev folks.  The summary here is:
>> 
>> A patch added in the 2.6.30 development cycle caused a performance
>> regression in my NFS iozone testing.  The patch in question is the
>> following:
>> 
>> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>> Author: Olga Kornievskaia <aglo@citi.umich.edu>
>> Date:   Tue Oct 21 14:13:47 2008 -0400
>> 
>>     svcrpc: take advantage of tcp autotuning
>>  
>> which is also quoted below.  Using 8 nfsd threads, a single client doing
>> 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
>> KB/s under 2.6.30-rc4.  I also see more run to run variation under
>> 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
>> variation disappears (as does the performance regression) when reverting
>> the above commit.
>
> It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> function. I can see no reason why we should stop processing new incoming
> RPC requests just because the send buffer happens to be 2/3 full. If we
> see that we have space for another reply, then we should just go for it.
> OTOH, we do want to ensure that the SOCK_NOSPACE flag remains set, so
> that the TCP layer knows that we're congested, and that we'd like it to
> increase the send window size, please.
>
> Could you therefore please see if the following (untested) patch helps?

I'm seeing slightly better results with the patch:

71548
75987
71557
87432
83538

But that's still not up to the speeds we saw under 2.6.29.  The packet
capture for one run can be found here:
  http://people.redhat.com/jmoyer/trond.pcap.bz2

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-14 13:34                   ` Jeff Moyer
@ 2009-05-14 14:33                     ` Trond Myklebust
  2009-05-14 14:38                       ` Jeff Moyer
  2009-05-14 15:00                       ` Jeff Moyer
  0 siblings, 2 replies; 27+ messages in thread
From: Trond Myklebust @ 2009-05-14 14:33 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, J. Bruce Fields, Jim Rees,
	linux-nfs

On Thu, 2009-05-14 at 09:34 -0400, Jeff Moyer wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> writes:
> 
> > On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> >> Hi, netdev folks.  The summary here is:
> >> 
> >> A patch added in the 2.6.30 development cycle caused a performance
> >> regression in my NFS iozone testing.  The patch in question is the
> >> following:
> >> 
> >> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> >> Author: Olga Kornievskaia <aglo@citi.umich.edu>
> >> Date:   Tue Oct 21 14:13:47 2008 -0400
> >> 
> >>     svcrpc: take advantage of tcp autotuning
> >>  
> >> which is also quoted below.  Using 8 nfsd threads, a single client doing
> >> 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> >> KB/s under 2.6.30-rc4.  I also see more run to run variation under
> >> 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
> >> variation disappears (as does the performance regression) when reverting
> >> the above commit.
> >
> > It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> > function. I can see no reason why we should stop processing new incoming
> > RPC requests just because the send buffer happens to be 2/3 full. If we
> > see that we have space for another reply, then we should just go for it.
> > OTOH, we do want to ensure that the SOCK_NOSPACE flag remains set, so
> > that the TCP layer knows that we're congested, and that we'd like it to
> > increase the send window size, please.
> >
> > Could you therefore please see if the following (untested) patch helps?
> 
> I'm seeing slightly better results with the patch:
> 
> 71548
> 75987
> 71557
> 87432
> 83538
> 
> But that's still not up to the speeds we saw under 2.6.29.  The packet
> capture for one run can be found here:
>   http://people.redhat.com/jmoyer/trond.pcap.bz2
> 
> Cheers,
> Jeff

Yes. Something is very wrong there...

See for instance frame 1195, where the client finishes sending a whole
series of READ requests, and we go into a flurry of ACKs passing
backwards and forwards, but no data. It looks as if the NFS server isn't
processing anything, probably because the window size falls afoul of the
svc_tcp_has_wspace()...

Does something like this help?

Cheers
  Trond
---------------------------------------------------------------------
From 85e3f5860a9063d193bdb45516b3d3d347b87301 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Thu, 14 May 2009 10:33:07 -0400
Subject: [PATCH] SUNRPC: Always allow the NFS server to process at least one request

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 net/sunrpc/svcsock.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 8962355..4837442 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -972,9 +972,16 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
 {
 	struct svc_sock *svsk =	container_of(xprt, struct svc_sock, sk_xprt);
 	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
+	int reserved;
 	int required;
 
-	required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
+	reserved = atomic_read(&xprt->xpt_reserved);
+	/* Always allow the server to process at least one request, whether
+	 * or not the TCP window is large enough
+	 */
+	if (reserved == 0)
+		return 1;
+	required = (reserved + serv->sv_max_mesg) << 1;
 	if (sk_stream_wspace(svsk->sk_sk) < required)
 		goto out_nospace;
 	return 1;
-- 
1.6.0.4




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-14 14:33                     ` Trond Myklebust
@ 2009-05-14 14:38                       ` Jeff Moyer
  2009-05-14 15:00                       ` Jeff Moyer
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff Moyer @ 2009-05-14 14:38 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, J. Bruce Fields, Jim Rees,
	linux-nfs

Trond Myklebust <trond.myklebust@fys.uio.no> writes:

> On Thu, 2009-05-14 at 09:34 -0400, Jeff Moyer wrote:
>> Trond Myklebust <trond.myklebust@fys.uio.no> writes:
>> 
>> > On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
>> >> Hi, netdev folks.  The summary here is:
>> >> 
>> >> A patch added in the 2.6.30 development cycle caused a performance
>> >> regression in my NFS iozone testing.  The patch in question is the
>> >> following:
>> >> 
>> >> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>> >> Author: Olga Kornievskaia <aglo@citi.umich.edu>
>> >> Date:   Tue Oct 21 14:13:47 2008 -0400
>> >> 
>> >>     svcrpc: take advantage of tcp autotuning
>> >>  
>> >> which is also quoted below.  Using 8 nfsd threads, a single client doing
>> >> 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
>> >> KB/s under 2.6.30-rc4.  I also see more run to run variation under
>> >> 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
>> >> variation disappears (as does the performance regression) when reverting
>> >> the above commit.
>> >
>> > It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
>> > function. I can see no reason why we should stop processing new incoming
>> > RPC requests just because the send buffer happens to be 2/3 full. If we
>> > see that we have space for another reply, then we should just go for it.
>> > OTOH, we do want to ensure that the SOCK_NOSPACE flag remains set, so
>> > that the TCP layer knows that we're congested, and that we'd like it to
>> > increase the send window size, please.
>> >
>> > Could you therefore please see if the following (untested) patch helps?
>> 
>> I'm seeing slightly better results with the patch:
>> 
>> 71548
>> 75987
>> 71557
>> 87432
>> 83538
>> 
>> But that's still not up to the speeds we saw under 2.6.29.  The packet
>> capture for one run can be found here:
>>   http://people.redhat.com/jmoyer/trond.pcap.bz2
>> 
>> Cheers,
>> Jeff
>
> Yes. Something is very wrong there...
>
> See for instance frame 1195, where the client finishes sending a whole
> series of READ requests, and we go into a flurry of ACKs passing
> backwards and forwards, but no data. It looks as if the NFS server isn't
> processing anything, probably because the window size falls afoul of the
> svc_tcp_has_wspace()...
>
> Does something like this help?

Is this in addition to the previous patch or instead of it?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-14 14:33                     ` Trond Myklebust
  2009-05-14 14:38                       ` Jeff Moyer
@ 2009-05-14 15:00                       ` Jeff Moyer
  2009-05-17 19:10                         ` Trond Myklebust
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2009-05-14 15:00 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, J. Bruce Fields, Jim Rees,
	linux-nfs

Trond Myklebust <trond.myklebust@fys.uio.no> writes:

> On Thu, 2009-05-14 at 09:34 -0400, Jeff Moyer wrote:
>> Trond Myklebust <trond.myklebust@fys.uio.no> writes:
>> 
>> > On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
>> >> Hi, netdev folks.  The summary here is:
>> >> 
>> >> A patch added in the 2.6.30 development cycle caused a performance
>> >> regression in my NFS iozone testing.  The patch in question is the
>> >> following:
>> >> 
>> >> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>> >> Author: Olga Kornievskaia <aglo@citi.umich.edu>
>> >> Date:   Tue Oct 21 14:13:47 2008 -0400
>> >> 
>> >>     svcrpc: take advantage of tcp autotuning
>> >>  
>> >> which is also quoted below.  Using 8 nfsd threads, a single client doing
>> >> 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
>> >> KB/s under 2.6.30-rc4.  I also see more run to run variation under
>> >> 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
>> >> variation disappears (as does the performance regression) when reverting
>> >> the above commit.
>> >
>> > It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
>> > function. I can see no reason why we should stop processing new incoming
>> > RPC requests just because the send buffer happens to be 2/3 full. If we
>> > see that we have space for another reply, then we should just go for it.
>> > OTOH, we do want to ensure that the SOCK_NOSPACE flag remains set, so
>> > that the TCP layer knows that we're congested, and that we'd like it to
>> > increase the send window size, please.
>> >
>> > Could you therefore please see if the following (untested) patch helps?
>> 
>> I'm seeing slightly better results with the patch:
>> 
>> 71548
>> 75987
>> 71557
>> 87432
>> 83538
>> 
>> But that's still not up to the speeds we saw under 2.6.29.  The packet
>> capture for one run can be found here:
>>   http://people.redhat.com/jmoyer/trond.pcap.bz2
>> 
>> Cheers,
>> Jeff
>
> Yes. Something is very wrong there...
>
> See for instance frame 1195, where the client finishes sending a whole
> series of READ requests, and we go into a flurry of ACKs passing
> backwards and forwards, but no data. It looks as if the NFS server isn't
> processing anything, probably because the window size falls afoul of the
> svc_tcp_has_wspace()...
>
> Does something like this help?

Sorry for the previous, stupid question.  I applied the patch in
addition to the last one and here are the results:

70327
71561
68760
69199
65324

A packet capture for this run is available here:
  http://people.redhat.com/jmoyer/trond2.pcap.bz2

Any more ideas?  ;)

-Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-13 23:45                 ` Trond Myklebust
  2009-05-14 13:34                   ` Jeff Moyer
@ 2009-05-14 17:55                   ` J. Bruce Fields
  2009-05-14 18:26                     ` Trond Myklebust
  1 sibling, 1 reply; 27+ messages in thread
From: J. Bruce Fields @ 2009-05-14 17:55 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Jeff Moyer, netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, Jim Rees, linux-nfs

On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
> On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> > Hi, netdev folks.  The summary here is:
> > 
> > A patch added in the 2.6.30 development cycle caused a performance
> > regression in my NFS iozone testing.  The patch in question is the
> > following:
> > 
> > commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> > Author: Olga Kornievskaia <aglo@citi.umich.edu>
> > Date:   Tue Oct 21 14:13:47 2008 -0400
> > 
> >     svcrpc: take advantage of tcp autotuning
> >  
> > which is also quoted below.  Using 8 nfsd threads, a single client doing
> > 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> > KB/s under 2.6.30-rc4.  I also see more run to run variation under
> > 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
> > variation disappears (as does the performance regression) when reverting
> > the above commit.
> 
> It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> function. I can see no reason why we should stop processing new incoming
> RPC requests just because the send buffer happens to be 2/3 full. If we

I agree, the calculation doesn't look right.  But where do you get the
2/3 number from?

...
> @@ -964,23 +973,14 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
>  	struct svc_sock *svsk =	container_of(xprt, struct svc_sock, sk_xprt);
>  	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
>  	int required;
> -	int wspace;
> -
> -	/*
> -	 * Set the SOCK_NOSPACE flag before checking the available
> -	 * sock space.
> -	 */
> -	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> -	required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
> -	wspace = sk_stream_wspace(svsk->sk_sk);
> -
> -	if (wspace < sk_stream_min_wspace(svsk->sk_sk))
> -		return 0;
> -	if (required * 2 > wspace)
> -		return 0;
>  
> -	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> +	required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
> +	if (sk_stream_wspace(svsk->sk_sk) < required)

This calculation looks the same before and after--you've just moved the
"*2" into the calcualtion of "required".  Am I missing something?  Maybe
you meant to write:

	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg * 2;

without the parentheses?

That looks closer, assuming the calculation is meant to be:

		atomic_read(..) == amount of buffer space we think we
			already need
		serv->sv_max_mesg * 2 == space for worst-case request
			and reply?

--b.

> +		goto out_nospace;
>  	return 1;
> +out_nospace:
> +	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> +	return 0;
>  }

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-14 17:55                   ` J. Bruce Fields
@ 2009-05-14 18:26                     ` Trond Myklebust
  2009-05-15 21:37                       ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread
From: Trond Myklebust @ 2009-05-14 18:26 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Moyer, netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, Jim Rees, linux-nfs

On Thu, 2009-05-14 at 13:55 -0400, J. Bruce Fields wrote:
> On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
> > On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> > > Hi, netdev folks.  The summary here is:
> > > 
> > > A patch added in the 2.6.30 development cycle caused a performance
> > > regression in my NFS iozone testing.  The patch in question is the
> > > following:
> > > 
> > > commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> > > Author: Olga Kornievskaia <aglo@citi.umich.edu>
> > > Date:   Tue Oct 21 14:13:47 2008 -0400
> > > 
> > >     svcrpc: take advantage of tcp autotuning
> > >  
> > > which is also quoted below.  Using 8 nfsd threads, a single client doing
> > > 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> > > KB/s under 2.6.30-rc4.  I also see more run to run variation under
> > > 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
> > > variation disappears (as does the performance regression) when reverting
> > > the above commit.
> > 
> > It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> > function. I can see no reason why we should stop processing new incoming
> > RPC requests just because the send buffer happens to be 2/3 full. If we
> 
> I agree, the calculation doesn't look right.  But where do you get the
> 2/3 number from?

That's the sk_stream_wspace() vs. sk_stream_min_wspace() comparison.
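
For reference, those helpers boil down to roughly the following (as defined
in include/net/sock.h around this time):

	static inline int sk_stream_min_wspace(struct sock *sk)
	{
		return sk->sk_wmem_queued >> 1;
	}

	static inline int sk_stream_wspace(struct sock *sk)
	{
		return sk->sk_sndbuf - sk->sk_wmem_queued;
	}

so the test fails (no write space) once sk_wmem_queued grows beyond roughly
2/3 of sk_sndbuf.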

> ...
> > @@ -964,23 +973,14 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> >  	struct svc_sock *svsk =	container_of(xprt, struct svc_sock, sk_xprt);
> >  	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
> >  	int required;
> > -	int wspace;
> > -
> > -	/*
> > -	 * Set the SOCK_NOSPACE flag before checking the available
> > -	 * sock space.
> > -	 */
> > -	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> > -	required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
> > -	wspace = sk_stream_wspace(svsk->sk_sk);
> > -
> > -	if (wspace < sk_stream_min_wspace(svsk->sk_sk))
> > -		return 0;
> > -	if (required * 2 > wspace)
> > -		return 0;
> >  
> > -	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> > +	required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
> > +	if (sk_stream_wspace(svsk->sk_sk) < required)
> 
> This calculation looks the same before and after--you've just moved the
> "*2" into the calcualtion of "required".  Am I missing something?  Maybe
> you meant to write:
> 
> 	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg * 2;
> 
> without the parentheses?

I wasn't trying to change that part of the calculation. I'm just
splitting out the stuff which has to do with TCP congestion (i.e. the
window size), and stuff which has to do with remaining socket buffer
space. I do, however, agree that we should probably drop that *2.

However there is (as usual) 'interesting behaviour' when it comes to
deferred requests. Their buffer space is already accounted for in the
'xpt_reserved' calculation, but they cannot get re-scheduled unless
svc_tcp_has_wspace() thinks it has enough free socket space for yet
another reply. Can you spell 'deadlock', children?
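
One possible way out (sketch only, assuming we add a 'reserved_space' field
to struct svc_deferred_req) is to drop the reservation when the request is
deferred and restore it when the deferred request is revisited, so it isn't
counted against xpt_reserved twice:

	/* in svc_defer(): remember and release the write space reservation */
	dr->reserved_space = rqstp->rq_reserved;
	svc_reserve(rqstp, 0);

	/* in svc_deferred_recv(): re-establish it before processing */
	svc_reserve(rqstp, dr->reserved_space);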

Trond


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-14 18:26                     ` Trond Myklebust
@ 2009-05-15 21:37                       ` J. Bruce Fields
  0 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2009-05-15 21:37 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Jeff Moyer, netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, Jim Rees, linux-nfs

On Thu, May 14, 2009 at 02:26:09PM -0400, Trond Myklebust wrote:
> On Thu, 2009-05-14 at 13:55 -0400, J. Bruce Fields wrote:
> > On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
> > > On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> > > > Hi, netdev folks.  The summary here is:
> > > > 
> > > > A patch added in the 2.6.30 development cycle caused a performance
> > > > regression in my NFS iozone testing.  The patch in question is the
> > > > following:
> > > > 
> > > > commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> > > > Author: Olga Kornievskaia <aglo@citi.umich.edu>
> > > > Date:   Tue Oct 21 14:13:47 2008 -0400
> > > > 
> > > >     svcrpc: take advantage of tcp autotuning
> > > >  
> > > > which is also quoted below.  Using 8 nfsd threads, a single client doing
> > > > 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> > > > KB/s under 2.6.30-rc4.  I also see more run to run variation under
> > > > 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
> > > > variation disappears (as does the performance regression) when reverting
> > > > the above commit.
> > > 
> > > It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> > > function. I can see no reason why we should stop processing new incoming
> > > RPC requests just because the send buffer happens to be 2/3 full. If we
> > 
> > I agree, the calculation doesn't look right.  But where do you get the
> > 2/3 number from?
> 
> That's the sk_stream_wspace() vs. sk_stream_min_wspace() comparison.

Oh, I see, so looking at their implementations,

	sk_stream_wspace(sk) < sk_stream_min_wspace(sk)

is equivalent to sk->sk_sndbuf - sk_wmem_queued < sk_wmem_queued/2, or
sk_wmem_queued > 2/3 sndbuf, got it.  I didn't understand that the point
of this patch was just to move that calculation around--now I see.--b.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-14 15:00                       ` Jeff Moyer
@ 2009-05-17 19:10                         ` Trond Myklebust
  2009-05-17 19:12                           ` Trond Myklebust
  0 siblings, 1 reply; 27+ messages in thread
From: Trond Myklebust @ 2009-05-17 19:10 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, J. Bruce Fields, Jim Rees,
	linux-nfs

On Thu, 2009-05-14 at 11:00 -0400, Jeff Moyer wrote:
> Sorry for the previous, stupid question.  I applied the patch in
> addition to the last one and here are the results:
> 
> 70327
> 71561
> 68760
> 69199
> 65324
> 
> A packet capture for this run is available here:
>   http://people.redhat.com/jmoyer/trond2.pcap.bz2
> 
> Any more ideas?  ;)

Yep. I've got 2 more patches for you. With both of them applied, I'm
seeing decent performance on my own test rig. The first patch is
appended. I'll send the second in another email (to avoid attachments).

Cheers
 Trond
-----------------------------------------------------------------------
From fcfdaf81eb21a996d83a2b68da2d62bb3697c1db Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Sun, 17 May 2009 12:40:05 -0400
Subject: [PATCH] SUNRPC: Fix svc_tcp_recvfrom()

Ensure that if the TCP receive window is smaller than the message length,
then we just buffer the existing data, in order to allow the client to send
more...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 include/linux/sunrpc/svcsock.h |    1 +
 net/sunrpc/svcsock.c           |  167 +++++++++++++++++++++++++++++-----------
 2 files changed, 124 insertions(+), 44 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 483e103..b0b4546 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -28,6 +28,7 @@ struct svc_sock {
 	/* private TCP part */
 	u32			sk_reclen;	/* length of record */
 	u32			sk_tcplen;	/* current read length */
+	struct page *		sk_pages[RPCSVC_MAXPAGES];	/* received data */
 };
 
 /*
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index c4e7be5..6069489 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -323,6 +323,33 @@ static int svc_recvfrom(struct svc_rqst *rqstp, struct kvec *iov, int nr,
 	return len;
 }
 
+static int svc_partial_recvfrom(struct svc_rqst *rqstp,
+				struct kvec *iov, int nr,
+				int buflen, unsigned int base)
+{
+	size_t save_iovlen;
+	void __user *save_iovbase;
+	unsigned int i;
+	int ret;
+
+	if (base == 0)
+		return svc_recvfrom(rqstp, iov, nr, buflen);
+
+	for (i = 0; i < nr; i++) {
+		if (iov[i].iov_len > base)
+			break;
+		base -= iov[i].iov_len;
+	}
+	save_iovlen = iov[i].iov_len;
+	save_iovbase = iov[i].iov_base;
+	iov[i].iov_len -= base;
+	iov[i].iov_base += base;
+	ret = svc_recvfrom(rqstp, &iov[i], nr - i, buflen);
+	iov[i].iov_len = save_iovlen;
+	iov[i].iov_base = save_iovbase;
+	return ret;
+}
+
 /*
  * Set socket snd and rcv buffer lengths
  */
@@ -790,6 +817,56 @@ failed:
 	return NULL;
 }
 
+static unsigned int svc_tcp_restore_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
+{
+	unsigned int i, len, npages;
+
+	if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
+		return 0;
+	len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		if (rqstp->rq_pages[i] != NULL)
+			put_page(rqstp->rq_pages[i]);
+		BUG_ON(svsk->sk_pages[i] == NULL);
+		rqstp->rq_pages[i] = svsk->sk_pages[i];
+		svsk->sk_pages[i] = NULL;
+	}
+	rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
+	return len;
+}
+
+static void svc_tcp_save_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
+{
+	unsigned int i, len, npages;
+
+	if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
+		return;
+	len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		svsk->sk_pages[i] = rqstp->rq_pages[i];
+		rqstp->rq_pages[i] = NULL;
+	}
+}
+
+static void svc_tcp_clear_pages(struct svc_sock *svsk)
+{
+	unsigned int i, len, npages;
+
+	if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
+		goto out;
+	len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		BUG_ON(svsk->sk_pages[i] == NULL);
+		put_page(svsk->sk_pages[i]);
+		svsk->sk_pages[i] = NULL;
+	}
+out:
+	svsk->sk_tcplen = 0;
+}
+
 /*
  * Receive data from a TCP socket.
  */
@@ -800,7 +877,8 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
 	int		len;
 	struct kvec *vec;
-	int pnum, vlen;
+	unsigned int want, base, vlen;
+	int pnum;
 
 	dprintk("svc: tcp_recv %p data %d conn %d close %d\n",
 		svsk, test_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags),
@@ -814,9 +892,9 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	 * possible up to the complete record length.
 	 */
 	if (svsk->sk_tcplen < sizeof(rpc_fraghdr)) {
-		int		want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
 		struct kvec	iov;
 
+		want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
 		iov.iov_base = ((char *) &svsk->sk_reclen) + svsk->sk_tcplen;
 		iov.iov_len  = want;
 		if ((len = svc_recvfrom(rqstp, &iov, 1, want)) < 0)
@@ -826,8 +904,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		if (len < want) {
 			dprintk("svc: short recvfrom while reading record "
 				"length (%d of %d)\n", len, want);
-			svc_xprt_received(&svsk->sk_xprt);
-			return -EAGAIN; /* record header not complete */
+			goto err_noclose;
 		}
 
 		svsk->sk_reclen = ntohl(svsk->sk_reclen);
@@ -853,25 +930,14 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		}
 	}
 
-	/* Check whether enough data is available */
-	len = svc_recv_available(svsk);
-	if (len < 0)
-		goto error;
-
-	if (len < svsk->sk_reclen) {
-		dprintk("svc: incomplete TCP record (%d of %d)\n",
-			len, svsk->sk_reclen);
-		svc_xprt_received(&svsk->sk_xprt);
-		return -EAGAIN;	/* record not complete */
-	}
-	len = svsk->sk_reclen;
-	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+	base = svc_tcp_restore_pages(svsk, rqstp);
+	want = svsk->sk_reclen - base;
 
 	vec = rqstp->rq_vec;
 	vec[0] = rqstp->rq_arg.head[0];
 	vlen = PAGE_SIZE;
 	pnum = 1;
-	while (vlen < len) {
+	while (vlen < svsk->sk_reclen) {
 		vec[pnum].iov_base = page_address(rqstp->rq_pages[pnum]);
 		vec[pnum].iov_len = PAGE_SIZE;
 		pnum++;
@@ -880,19 +946,26 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	rqstp->rq_respages = &rqstp->rq_pages[pnum];
 
 	/* Now receive data */
-	len = svc_recvfrom(rqstp, vec, pnum, len);
-	if (len < 0)
-		goto error;
+	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+	len = svc_partial_recvfrom(rqstp, vec, pnum, want, base);
+	if (len != want) {
+		if (len >= 0)
+			svsk->sk_tcplen += len;
+		else if (len != -EAGAIN)
+			goto err_other;
+		svc_tcp_save_pages(svsk, rqstp);
+		dprintk("svc: incomplete TCP record (%d of %d)\n",
+			svsk->sk_tcplen, svsk->sk_reclen);
+		goto err_noclose;
+	}
 
-	dprintk("svc: TCP complete record (%d bytes)\n", len);
-	rqstp->rq_arg.len = len;
+	rqstp->rq_arg.len = svsk->sk_reclen;
 	rqstp->rq_arg.page_base = 0;
-	if (len <= rqstp->rq_arg.head[0].iov_len) {
-		rqstp->rq_arg.head[0].iov_len = len;
+	if (rqstp->rq_arg.len <= rqstp->rq_arg.head[0].iov_len) {
+		rqstp->rq_arg.head[0].iov_len = rqstp->rq_arg.len;
 		rqstp->rq_arg.page_len = 0;
-	} else {
-		rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
-	}
+	} else
+		rqstp->rq_arg.page_len = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len;
 
 	rqstp->rq_xprt_ctxt   = NULL;
 	rqstp->rq_prot	      = IPPROTO_TCP;
@@ -900,29 +973,32 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	/* Reset TCP read info */
 	svsk->sk_reclen = 0;
 	svsk->sk_tcplen = 0;
+	/* If we have more data, signal svc_xprt_enqueue() to try again */
+	if (svc_recv_available(svsk) > sizeof(rpc_fraghdr))
+		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+
 
 	svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
 	svc_xprt_received(&svsk->sk_xprt);
 	if (serv->sv_stats)
 		serv->sv_stats->nettcpcnt++;
 
-	return len;
-
- err_delete:
+	dprintk("svc: TCP complete record (%d bytes)\n", rqstp->rq_arg.len);
+	return rqstp->rq_arg.len;
+error:
+	if (len == -EAGAIN)
+		goto err_got_eagain;
+err_other:
+	printk(KERN_NOTICE "%s: recvfrom returned errno %d\n",
+	       svsk->sk_xprt.xpt_server->sv_name, -len);
+err_delete:
 	set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
 	return -EAGAIN;
-
- error:
-	if (len == -EAGAIN) {
-		dprintk("RPC: TCP recvfrom got EAGAIN\n");
-		svc_xprt_received(&svsk->sk_xprt);
-	} else {
-		printk(KERN_NOTICE "%s: recvfrom returned errno %d\n",
-		       svsk->sk_xprt.xpt_server->sv_name, -len);
-		goto err_delete;
-	}
-
-	return len;
+err_got_eagain:
+	dprintk("RPC: TCP recvfrom got EAGAIN\n");
+err_noclose:
+	svc_xprt_received(&svsk->sk_xprt);
+	return -EAGAIN;	/* record not complete */
 }
 
 /*
@@ -1043,6 +1119,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 
 		svsk->sk_reclen = 0;
 		svsk->sk_tcplen = 0;
+		memset(&svsk->sk_pages[0], 0, sizeof(svsk->sk_pages));
 
 		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
 
@@ -1291,8 +1368,10 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt)
 
 	svc_sock_detach(xprt);
 
-	if (!test_bit(XPT_LISTENER, &xprt->xpt_flags))
+	if (!test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
+		svc_tcp_clear_pages(svsk);
 		kernel_sock_shutdown(svsk->sk_sock, SHUT_RDWR);
+	}
 }
 
 /*
-- 
1.6.0.4




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-17 19:10                         ` Trond Myklebust
@ 2009-05-17 19:12                           ` Trond Myklebust
  2009-05-18 14:15                             ` Jeff Moyer
  0 siblings, 1 reply; 27+ messages in thread
From: Trond Myklebust @ 2009-05-17 19:12 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, J. Bruce Fields, Jim Rees,
	linux-nfs

On Sun, 2009-05-17 at 15:11 -0400, Trond Myklebust wrote:
> On Thu, 2009-05-14 at 11:00 -0400, Jeff Moyer wrote:
> > Sorry for the previous, stupid question.  I applied the patch in
> > addition to the last one and here are the results:
> > 
> > 70327
> > 71561
> > 68760
> > 69199
> > 65324
> > 
> > A packet capture for this run is available here:
> >   http://people.redhat.com/jmoyer/trond2.pcap.bz2
> > 
> > Any more ideas?  ;)
> 
> Yep. I've got 2 more patches for you. With both of them applied, I'm
> seeing decent performance on my own test rig. The first patch is
> appended. I'll send the second in another email (to avoid attachments).

Here is number 2. It is incremental to all the others...


-----------------------------------------------------------------------
From 1d11caba8bcfc8fe718bfa9a957715bf3819af09 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Sun, 17 May 2009 13:01:00 -0400
Subject: [PATCH] SUNRPC: Further congestion control tweaks...

Ensure that deferred requests are accounted for correctly by the write
space reservation mechanism. In order to avoid double counting, remove the
reservation when we defer the request, and save any calculated value, so
that we can restore it when the request is requeued.

Also fix svc_tcp_has_wspace() so that it doesn't reserve twice the memory
that we expect to require in order to write out the data.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 include/linux/sunrpc/svc.h |    1 +
 net/sunrpc/svc_xprt.c      |   10 +++++-----
 net/sunrpc/svcsock.c       |   19 +++++++------------
 3 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 2a30775..2c373d8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -341,6 +341,7 @@ struct svc_deferred_req {
 	union svc_addr_u	daddr;	/* where reply must come from */
 	struct cache_deferred_req handle;
 	size_t			xprt_hlen;
+	int			reserved_space;
 	int			argslen;
 	__be32			args[0];
 };
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index c200d92..daa1f27 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -299,7 +299,6 @@ static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
  */
 void svc_xprt_enqueue(struct svc_xprt *xprt)
 {
-	struct svc_serv	*serv = xprt->xpt_server;
 	struct svc_pool *pool;
 	struct svc_rqst	*rqstp;
 	int cpu;
@@ -376,8 +375,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 				rqstp, rqstp->rq_xprt);
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 		rqstp->rq_waking = 1;
 		pool->sp_nwaking++;
 		pool->sp_stats.threads_woken++;
@@ -657,8 +654,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 	if (xprt) {
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 	} else {
 		/* No data pending. Go to sleep */
 		svc_thread_enqueue(pool, rqstp);
@@ -741,6 +736,8 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 		dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
 			rqstp, pool->sp_id, xprt,
 			atomic_read(&xprt->xpt_ref.refcount));
+		rqstp->rq_reserved = serv->sv_max_mesg;
+		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 		rqstp->rq_deferred = svc_deferred_dequeue(xprt);
 		if (rqstp->rq_deferred) {
 			svc_xprt_received(xprt);
@@ -1006,6 +1003,8 @@ static struct cache_deferred_req *svc_defer(struct cache_req *req)
 	}
 	svc_xprt_get(rqstp->rq_xprt);
 	dr->xprt = rqstp->rq_xprt;
+	dr->reserved_space = rqstp->rq_reserved;
+	svc_reserve(rqstp, 0);
 
 	dr->handle.revisit = svc_revisit;
 	return &dr->handle;
@@ -1018,6 +1017,7 @@ static int svc_deferred_recv(struct svc_rqst *rqstp)
 {
 	struct svc_deferred_req *dr = rqstp->rq_deferred;
 
+	svc_reserve(rqstp, dr->reserved_space);
 	/* setup iov_base past transport header */
 	rqstp->rq_arg.head[0].iov_base = dr->args + (dr->xprt_hlen>>2);
 	/* The iov_len does not include the transport header bytes */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 4837442..eed978e 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -345,6 +345,7 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
 	lock_sock(sock->sk);
 	sock->sk->sk_sndbuf = snd * 2;
 	sock->sk->sk_rcvbuf = rcv * 2;
+	sock->sk->sk_write_space(sock->sk);
 	release_sock(sock->sk);
 #endif
 }
@@ -626,6 +627,7 @@ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
 	 * receive and respond to one request.
 	 * svc_udp_recvfrom will re-adjust if necessary
 	 */
+	svsk->sk_sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
 	svc_sock_setbufsize(svsk->sk_sock,
 			    3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
 			    3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
@@ -971,21 +973,14 @@ static void svc_tcp_prep_reply_hdr(struct svc_rqst *rqstp)
 static int svc_tcp_has_wspace(struct svc_xprt *xprt)
 {
 	struct svc_sock *svsk =	container_of(xprt, struct svc_sock, sk_xprt);
-	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
-	int reserved;
+	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
 	int required;
 
-	reserved = atomic_read(&xprt->xpt_reserved);
-	/* Always allow the server to process at least one request, whether
-	 * or not the TCP window is large enough
-	 */
-	if (reserved == 0)
+	if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
+		return 1;
+	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
+	if (sk_stream_wspace(svsk->sk_sk) >= required)
 		return 1;
-	required = (reserved + serv->sv_max_mesg) << 1;
-	if (sk_stream_wspace(svsk->sk_sk) < required)
-		goto out_nospace;
-	return 1;
-out_nospace:
 	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
 	return 0;
 }
-- 
1.6.0.4




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-17 19:12                           ` Trond Myklebust
@ 2009-05-18 14:15                             ` Jeff Moyer
  2009-05-22 23:45                               ` J. Bruce Fields
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2009-05-18 14:15 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, J. Bruce Fields, Jim Rees,
	linux-nfs

Trond Myklebust <trond.myklebust@fys.uio.no> writes:

> On Sun, 2009-05-17 at 15:11 -0400, Trond Myklebust wrote:
>> On Thu, 2009-05-14 at 11:00 -0400, Jeff Moyer wrote:
>> > Sorry for the previous, stupid question.  I applied the patch in
>> > addition to the last one and here are the results:
>> > 
>> > 70327
>> > 71561
>> > 68760
>> > 69199
>> > 65324
>> > 
>> > A packet capture for this run is available here:
>> >   http://people.redhat.com/jmoyer/trond2.pcap.bz2
>> > 
>> > Any more ideas?  ;)
>> 
>> Yep. I've got 2 more patches for you. With both of them applied, I'm
>> seeing decent performance on my own test rig. The first patch is
>> appended. I'll send the second in another email (to avoid attachments).
>
> Here is number 2. It is incremental to all the others...

With all 4 patches applied, these are the numbers for 5 runs:

103168
101212
103346
100842
103172

It's looking much better, but we're still off by a few percent.  Thanks
for the quick turnaround on this, Trond!  If you submit these patches,
feel free to add:

Tested-by: Jeff Moyer <jmoyer@redhat.com>

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  2009-05-18 14:15                             ` Jeff Moyer
@ 2009-05-22 23:45                               ` J. Bruce Fields
  0 siblings, 0 replies; 27+ messages in thread
From: J. Bruce Fields @ 2009-05-22 23:45 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Trond Myklebust, netdev, Andrew Morton, Jens Axboe, linux-kernel,
	Rafael J. Wysocki, Olga Kornievskaia, Jim Rees, linux-nfs

On Mon, May 18, 2009 at 10:15:22AM -0400, Jeff Moyer wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> writes:
> 
> > On Sun, 2009-05-17 at 15:11 -0400, Trond Myklebust wrote:
> >> On Thu, 2009-05-14 at 11:00 -0400, Jeff Moyer wrote:
> >> > Sorry for the previous, stupid question.  I applied the patch in
> >> > addition the last one and here are the results:
> >> > 
> >> > 70327
> >> > 71561
> >> > 68760
> >> > 69199
> >> > 65324
> >> > 
> >> > A packet capture for this run is available here:
> >> >   http://people.redhat.com/jmoyer/trond2.pcap.bz2
> >> > 
> >> > Any more ideas?  ;)
> >> 
> >> Yep. I've got 2 more patches for you. With both of them applied, I'm
> >> seeing decent performance on my own test rig. The first patch is
> >> appended. I'll send the second in another email (to avoid attachments).
> >
> > Here is number 2. It is incremental to all the others...
> 
> With all 4 patches applied, these are the numbers for 5 runs:
> 
> 103168
> 101212
> 103346
> 100842
> 103172
> 
> It's looking much better, but we're still off by a few percent.  Thanks
> for the quick turnaround on this, Trond!  If you submit these patches,
> feel free to add:

I'd like to take a look and run some tests of my own when I get back
from vacation next week.

Then assuming no problems I'm inclined to queue them up for 2.6.31, and,
in the meantime, revert the autotuning patch temporarily for
2.6.30--under the assumption that autotuning is still the right thing to
do, but that this is too significant a regression to ignore, and Trond's
work is too involved to submit for 2.6.30 this late in the process.

--b.

> 
> Tested-by: Jeff Moyer <jmoyer@redhat.com>
> 
> Cheers,
> Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread

Thread overview: 27+ messages
2009-04-23 14:01 2.6.30-rc deadline scheduler performance regression for iozone over NFS Jeff Moyer
2009-05-08 19:01 ` Andrew Morton
2009-05-11  8:14   ` Jens Axboe
2009-05-11 13:53     ` Jeff Moyer
2009-05-11 16:58       ` Jens Axboe
2009-05-13  3:29         ` Jeff Moyer
2009-05-13  3:44           ` Andrew Morton
2009-05-13 14:58             ` Jeff Moyer
2009-05-13 16:20               ` Olga Kornievskaia
2009-05-13 16:32                 ` Andrew Morton
2009-05-13 18:16                   ` Olga Kornievskaia
2009-05-13 19:06                     ` Jeff Moyer
2009-05-13 18:25                   ` Jim Rees
2009-05-13 19:45                     ` Trond Myklebust
2009-05-13 19:29               ` Jeff Moyer
2009-05-13 23:45                 ` Trond Myklebust
2009-05-14 13:34                   ` Jeff Moyer
2009-05-14 14:33                     ` Trond Myklebust
2009-05-14 14:38                       ` Jeff Moyer
2009-05-14 15:00                       ` Jeff Moyer
2009-05-17 19:10                         ` Trond Myklebust
2009-05-17 19:12                           ` Trond Myklebust
2009-05-18 14:15                             ` Jeff Moyer
2009-05-22 23:45                               ` J. Bruce Fields
2009-05-14 17:55                   ` J. Bruce Fields
2009-05-14 18:26                     ` Trond Myklebust
2009-05-15 21:37                       ` J. Bruce Fields
