linux-kernel.vger.kernel.org archive mirror
* [PATCH] tcp: splice as many packets as possible at once
@ 2009-01-08 17:30 Willy Tarreau
  2009-01-08 19:44 ` Jens Axboe
  2009-01-08 21:50 ` Ben Mansell
  0 siblings, 2 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-08 17:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Miller, Jarek Poplawski, Ben Mansell, Ingo Molnar,
	linux-kernel, netdev

Jens,

here's the other patch I was talking about, for better behaviour of
non-blocking splice(). Ben Mansell also confirms similar improvements
in his tests, where non-blocking splice() initially showed half of
read()/write() performance. Ben, would you mind adding a Tested-By
line?

Also, please note that this is unrelated to the corruption bug I reported
and does not fix it.

Regards,
Willy

From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 8 Jan 2009 17:10:13 +0100
Subject: [PATCH] tcp: splice as many packets as possible at once

Currently, in non-blocking mode, tcp_splice_read() returns after
splicing one segment regardless of the len argument. This results
in low performance and very high overhead due to syscall rate when
splicing from interfaces which do not support LRO.

The fix simply consists in not breaking out of the loop after the
first read. That way, we can read up to the size requested by the
caller and still return when there is no data left.

Performance has significantly improved with this fix, with the
number of calls to splice() divided by about 20, and CPU usage
dropped from 100% to 75%.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 net/ipv4/tcp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 35bcddf..80261b4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 		lock_sock(sk);
 
 		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
-		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
+		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
 		    signal_pending(current))
 			break;
 	}
-- 
1.6.0.3



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-08 17:30 [PATCH] tcp: splice as many packets as possible at once Willy Tarreau
@ 2009-01-08 19:44 ` Jens Axboe
  2009-01-08 22:03   ` Willy Tarreau
  2009-01-08 21:50 ` Ben Mansell
  1 sibling, 1 reply; 190+ messages in thread
From: Jens Axboe @ 2009-01-08 19:44 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, Jarek Poplawski, Ben Mansell, Ingo Molnar,
	linux-kernel, netdev

On Thu, Jan 08 2009, Willy Tarreau wrote:
> Jens,
> 
> here's the other patch I was talking about, for better behaviour of
> non-blocking splice(). Ben Mansell also confirms similar improvements
> in his tests, where non-blocking splice() initially showed half of
> read()/write() performance. Ben, would you mind adding a Tested-By
> line ?

Looks very good to me; I don't see any valid reason to have that !timeo
break. So feel free to add my Acked-by to that patch and send it to
Dave.

> 
> Also, please note that this is unrelated to the corruption bug I reported
> and does not fix it.
> 
> Regards,
> Willy
> 
> From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 8 Jan 2009 17:10:13 +0100
> Subject: [PATCH] tcp: splice as many packets as possible at once
> 
> Currently, in non-blocking mode, tcp_splice_read() returns after
> splicing one segment regardless of the len argument. This results
> in low performance and very high overhead due to syscall rate when
> splicing from interfaces which do not support LRO.
> 
> The fix simply consists in not breaking out of the loop after the
> first read. That way, we can read up to the size requested by the
> caller and still return when there is no data left.
> 
> Performance has significantly improved with this fix, with the
> number of calls to splice() divided by about 20, and CPU usage
> dropped from 100% to 75%.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  net/ipv4/tcp.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 35bcddf..80261b4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
>  		lock_sock(sk);
>  
>  		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
> -		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
> +		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
>  		    signal_pending(current))
>  			break;
>  	}
> -- 
> 1.6.0.3
> 

-- 
Jens Axboe



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-08 17:30 [PATCH] tcp: splice as many packets as possible at once Willy Tarreau
  2009-01-08 19:44 ` Jens Axboe
@ 2009-01-08 21:50 ` Ben Mansell
  2009-01-08 21:55   ` David Miller
  1 sibling, 1 reply; 190+ messages in thread
From: Ben Mansell @ 2009-01-08 21:50 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, Jarek Poplawski, Ingo Molnar, linux-kernel, netdev,
	Jens Axboe

> From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 8 Jan 2009 17:10:13 +0100
> Subject: [PATCH] tcp: splice as many packets as possible at once
> 
> Currently, in non-blocking mode, tcp_splice_read() returns after
> splicing one segment regardless of the len argument. This results
> in low performance and very high overhead due to syscall rate when
> splicing from interfaces which do not support LRO.
> 
> The fix simply consists in not breaking out of the loop after the
> first read. That way, we can read up to the size requested by the
> caller and still return when there is no data left.
> 
> Performance has significantly improved with this fix, with the
> number of calls to splice() divided by about 20, and CPU usage
> dropped from 100% to 75%.
> 

I get similar results with my testing here. Benchmarking an application
with this patch shows that more than one packet is being splice()d in at
once; as a result, I see a doubling in throughput.

Tested-by: Ben Mansell <ben@zeus.com>

> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  net/ipv4/tcp.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 35bcddf..80261b4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
>  		lock_sock(sk);
>  
>  		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
> -		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
> +		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
>  		    signal_pending(current))
>  			break;
>  	}



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-08 21:50 ` Ben Mansell
@ 2009-01-08 21:55   ` David Miller
  2009-01-08 22:20     ` Willy Tarreau
  2009-01-09  6:47     ` Eric Dumazet
  0 siblings, 2 replies; 190+ messages in thread
From: David Miller @ 2009-01-08 21:55 UTC (permalink / raw)
  To: ben; +Cc: w, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Ben Mansell <ben@zeus.com>
Date: Thu, 08 Jan 2009 21:50:44 +0000

> > From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Thu, 8 Jan 2009 17:10:13 +0100
> > Subject: [PATCH] tcp: splice as many packets as possible at once
> > Currently, in non-blocking mode, tcp_splice_read() returns after
> > splicing one segment regardless of the len argument. This results
> > in low performance and very high overhead due to syscall rate when
> > splicing from interfaces which do not support LRO.
> > The fix simply consists in not breaking out of the loop after the
> > first read. That way, we can read up to the size requested by the
> > caller and still return when there is no data left.
> > Performance has significantly improved with this fix, with the
> > number of calls to splice() divided by about 20, and CPU usage
> > dropped from 100% to 75%.
> > 
> 
> I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput.
> 
> Tested-by: Ben Mansell <ben@zeus.com>

I'm not applying this until someone explains to me why
we should remove this test from the splice receive but
keep it in the tcp_recvmsg() code where it has been
essentially forever.


* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-08 19:44 ` Jens Axboe
@ 2009-01-08 22:03   ` Willy Tarreau
  0 siblings, 0 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-08 22:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Miller, Jarek Poplawski, Ben Mansell, Ingo Molnar,
	linux-kernel, netdev

On Thu, Jan 08, 2009 at 08:44:24PM +0100, Jens Axboe wrote:
> On Thu, Jan 08 2009, Willy Tarreau wrote:
> > Jens,
> > 
> > here's the other patch I was talking about, for better behaviour of
> > non-blocking splice(). Ben Mansell also confirms similar improvements
> > in his tests, where non-blocking splice() initially showed half of
> > read()/write() performance. Ben, would you mind adding a Tested-By
> > line ?
> 
> Looks very good to me, I don't see any valid reason to have that !timeo
> break. So feel free to add my acked-by to that patch and send it to
> Dave.

Done, thanks.

Also, I tried to follow the code path from the userspace splice() call
into the kernel code. While splicing from a socket to a pipe seems obvious
up to tcp_splice_read(), I have a hard time finding the correct path from
the pipe to the socket.

It *seems* to me that we proceed like this; I'd like someone to confirm:

sys_splice()
  do_splice()
    do_splice_from()
      generic_splice_sendpage()      [ I assumed this from socket_file_ops ]
        splice_from_pipe(,,pipe_to_sendpage)
          __splice_from_pipe(,,pipe_to_sendpage)
            pipe_to_sendpage()
              tcp_sendpage()         [ I assumed this from inet_stream_ops ]
                do_tcp_sendpages()   (if NETIF_F_SG) or sock_no_sendpage()

I hope I guessed right, because it does not appear obvious to me. It
will help me better understand the corruption problem.
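
For illustration, here is a minimal userspace sketch of the forwarding pattern
whose kernel path is traced above (socket -> pipe -> socket). It is only a
sketch: the descriptors are assumed to be already-connected non-blocking
sockets plus a pipe from pipe(), the 64 kB length is arbitrary, and the
EAGAIN-on-full-pipe case flagged in the comment is exactly the ambiguity
discussed just below.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>

/*
 * Forward one chunk from src_fd to dst_fd through pipefd[].
 * Returns the number of bytes moved, 0 on EOF, -1 on error (errno set).
 */
static ssize_t forward_chunk(int src_fd, int dst_fd, int pipefd[2])
{
	ssize_t in, out, left;

	/* socket -> pipe: may fail with EAGAIN if the pipe is already
	 * full, even though poll() reported POLLIN on the socket */
	in = splice(src_fd, NULL, pipefd[1], NULL, 65536,
		    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
	if (in <= 0)
		return in;

	/* pipe -> socket: push out what was just buffered */
	for (left = in; left > 0; left -= out) {
		out = splice(pipefd[0], NULL, dst_fd, NULL, left,
			     SPLICE_F_MOVE | SPLICE_F_MORE);
		if (out <= 0)
			return -1;
	}
	return in;
}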

Now for the read part: am I wrong in thinking that tcp_splice_read() will
return -EAGAIN if the pipe is full? It seems to me that this would be
reported by splice_to_pipe(), via __tcp_splice_read, tcp_read_sock,
tcp_splice_data_recv and skb_splice_bits.

If so, this can cause serious trouble: if there is data available on a
socket, poll() will return POLLIN. Then we call splice(sock, pipe), but the
pipe is full, so we get -EAGAIN, which in turn makes the application think
that it needs to poll again, and since no data has moved, we'll immediately
call splice() again. In my early experiments, I'm pretty sure I already came
across such a situation. Shouldn't we return -EBUSY or -ENOSPC in this case?
I can work on patches if there is a consensus on those issues, and this
will also help me understand the splice internals better. I just don't want
to work in the wrong direction for nothing.

Thanks,
Willy



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-08 21:55   ` David Miller
@ 2009-01-08 22:20     ` Willy Tarreau
  2009-01-13 23:08       ` David Miller
  2009-01-09  6:47     ` Eric Dumazet
  1 sibling, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-08 22:20 UTC (permalink / raw)
  To: David Miller; +Cc: ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Thu, Jan 08, 2009 at 01:55:15PM -0800, David Miller wrote:
> From: Ben Mansell <ben@zeus.com>
> Date: Thu, 08 Jan 2009 21:50:44 +0000
> 
> > > From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> > > From: Willy Tarreau <w@1wt.eu>
> > > Date: Thu, 8 Jan 2009 17:10:13 +0100
> > > Subject: [PATCH] tcp: splice as many packets as possible at once
> > > Currently, in non-blocking mode, tcp_splice_read() returns after
> > > splicing one segment regardless of the len argument. This results
> > > in low performance and very high overhead due to syscall rate when
> > > splicing from interfaces which do not support LRO.
> > > The fix simply consists in not breaking out of the loop after the
> > > first read. That way, we can read up to the size requested by the
> > > caller and still return when there is no data left.
> > > Performance has significantly improved with this fix, with the
> > > number of calls to splice() divided by about 20, and CPU usage
> > > dropped from 100% to 75%.
> > > 
> > 
> > I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput.
> > 
> > Tested-by: Ben Mansell <ben@zeus.com>
> 
> I'm not applying this until someone explains to me why
> we should remove this test from the splice receive but
> keep it in the tcp_recvmsg() code where it has been
> essentially forever.

In my opinion, the code structure differs between the two functions. In
tcp_recvmsg(), we only test for it if (copied > 0), where copied is the sum
of all data that has been processed since entering the function. If we
removed the test there, we could not break out of the loop once we had
copied something. In tcp_splice_read(), the test is still present in the
(!ret) code path, where ret is the number of bytes processed by the last
read, so the test is still performed regardless of what has previously
been transferred.

So in summary, in tcp_splice_read() without this test, we get back to the
top of the loop, and if __tcp_splice_read() returns 0, then we break out
of the loop.

I don't know if my explanation is clear or not; it's easier to follow the
loops in front of the code :-/
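
(For readers without the source at hand, a simplified sketch of the two break
tests, reconstructed from the description above and the patch earlier in the
thread -- structure only, not the exact kernel code:)

/* tcp_recvmsg(), simplified: the break test (including !timeo) only
 * runs once something has already been copied in this call */
do {
	/* ... look for data to copy ... */
	if (copied) {
		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
		    !timeo || signal_pending(current))
			break;
	} else {
		/* nothing copied yet: a different set of checks applies */
	}
	/* ... copy one chunk, update copied and len ... */
} while (len > 0);

/* tcp_splice_read() with the patch, simplified: the same check runs on
 * every iteration, and a zero return from __tcp_splice_read() (no data
 * left) ends the loop */
while (tss.len) {
	ret = __tcp_splice_read(sk, &tss);
	if (ret < 0)
		break;
	else if (!ret) {
		if (spliced)
			break;
		/* ... non-blocking / timeout / error handling ... */
	}
	tss.len -= ret;
	spliced += ret;

	release_sock(sk);
	lock_sock(sk);

	if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
	    (sk->sk_shutdown & RCV_SHUTDOWN) ||
	    signal_pending(current))
		break;
}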

Willy



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-08 21:55   ` David Miller
  2009-01-08 22:20     ` Willy Tarreau
@ 2009-01-09  6:47     ` Eric Dumazet
  2009-01-09  7:04       ` Willy Tarreau
                         ` (2 more replies)
  1 sibling, 3 replies; 190+ messages in thread
From: Eric Dumazet @ 2009-01-09  6:47 UTC (permalink / raw)
  To: David Miller; +Cc: ben, w, jarkao2, mingo, linux-kernel, netdev, jens.axboe

David Miller wrote:
> From: Ben Mansell <ben@zeus.com>
> Date: Thu, 08 Jan 2009 21:50:44 +0000
> 
>>> From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
>>> From: Willy Tarreau <w@1wt.eu>
>>> Date: Thu, 8 Jan 2009 17:10:13 +0100
>>> Subject: [PATCH] tcp: splice as many packets as possible at once
>>> Currently, in non-blocking mode, tcp_splice_read() returns after
>>> splicing one segment regardless of the len argument. This results
>>> in low performance and very high overhead due to syscall rate when
>>> splicing from interfaces which do not support LRO.
>>> The fix simply consists in not breaking out of the loop after the
>>> first read. That way, we can read up to the size requested by the
>>> caller and still return when there is no data left.
>>> Performance has significantly improved with this fix, with the
>>> number of calls to splice() divided by about 20, and CPU usage
>>> dropped from 100% to 75%.
>>>
>> I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput.
>>
>> Tested-by: Ben Mansell <ben@zeus.com>
> 
> I'm not applying this until someone explains to me why
> we should remove this test from the splice receive but
> keep it in the tcp_recvmsg() code where it has been
> essentially forever.

I found this patch useful in my testing, but had a feeling something
was not complete. If the goal is to reduce the number of splice() calls,
we should also reduce the number of wakeups. If splice() is used in
non-blocking mode, there is of course nothing we can do here, since the
application will use a poll()/select()/epoll() event before calling
splice(). A good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might
improve things.

I tested this on the current tree and it is not working: we still get
one wakeup for each frame (the ethernet link is a 100 Mb/s one)

bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(6, 5)                            = 0
accept(6, 0, NULL)                      = 7
setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [32768], 4) = 0
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1024
splice(0x3, 0, 0x5, 0, 0x400, 0x5)      = 1024
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460


About tcp_recvmsg(): we might remove the "!timeo" test there as well, but
more testing is needed. Keep in mind, though, that if an application
provides a large buffer to the tcp_recvmsg() call, removing the test will
reduce the number of syscalls but might use more of the CPU data cache.
It could reduce performance on old CPUs. With the splice() call, we expect
not to copy memory and trash the data cache, and since pipe buffers are
limited to 16, we cope with a limited working set.





* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09  6:47     ` Eric Dumazet
@ 2009-01-09  7:04       ` Willy Tarreau
  2009-01-09  7:28         ` Eric Dumazet
  2009-01-09 15:42       ` Eric Dumazet
  2009-01-13 23:26       ` David Miller
  2 siblings, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-09  7:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Fri, Jan 09, 2009 at 07:47:16AM +0100, Eric Dumazet wrote:
> > I'm not applying this until someone explains to me why
> > we should remove this test from the splice receive but
> > keep it in the tcp_recvmsg() code where it has been
> > essentially forever.
> 
> I found this patch usefull in my testings, but had a feeling something
> was not complete. If the goal is to reduce number of splice() calls,
> we also should reduce number of wakeups. If splice() is used in non
> blocking mode, nothing we can do here of course, since the application
> will use a poll()/select()/epoll() event before calling splice(). A
> good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things.
> 
> I tested this on current tree and it is not working : we still have
> one wakeup for each frame (ethernet link is a 100 Mb/s one)

Well, it simply means that data is not coming in fast enough compared to
the tiny amount of work you have to perform to forward it; there's nothing
wrong with that. It is important, in my opinion, not to wait for *enough*
data to come in, otherwise it might become impossible to forward small
chunks. I mean, if there are only 300 bytes left to forward, we must not
wait indefinitely for more data to come, we must forward those 300 bytes.

In your case below, it simply means that the performance improvement brought
by splice will be really minor, because you'll just avoid 2 memory copies,
which are ridiculously cheap at 100 Mbps. If you changed your program
to use recv/send, you would observe the exact same pattern, because as soon
as poll() wakes you up, you still only have one frame in the system buffers.
On a small machine I have here (a 500 MHz Geode), I easily have multiple
frames queued at 100 Mbps, because when epoll() wakes me up, I have traffic
on something like 10-100 sockets, and by the time I process the first ones,
the latter have had time to queue up more data.

> bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> listen(6, 5)                            = 0
> accept(6, 0, NULL)                      = 7
> setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [32768], 4) = 0
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1024
> splice(0x3, 0, 0x5, 0, 0x400, 0x5)      = 1024
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3)    = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460

How much CPU is used for that? Probably something like 1-2%? If you could
retry with 100-1000 concurrent sessions, you would observe more traffic on
each socket, which is where the gain from splice becomes really important.

> About tcp_recvmsg(), we might also remove the "!timeo" test as well,
> more testings are needed.

No, right now we can't (or we'd at least have to move it somewhere else),
because once at least one byte has been received (copied != 0), no other
check will break out of the loop (or at least I have not found one).

> But remind that if an application provides
> a large buffer to tcp_recvmsg() call, removing the test will reduce
> the number of syscalls but might use more DCACHE. It could reduce
> performance on old cpus. With splice() call, we expect to not
> copy memory and trash DCACHE, and pipe buffers being limited to 16,
> we cope with a limited working set. 

That's an interesting point indeed!

Regards,
Willy



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09  7:04       ` Willy Tarreau
@ 2009-01-09  7:28         ` Eric Dumazet
  2009-01-09  7:42           ` Willy Tarreau
  2009-01-13 23:27           ` David Miller
  0 siblings, 2 replies; 190+ messages in thread
From: Eric Dumazet @ 2009-01-09  7:28 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 07:47:16AM +0100, Eric Dumazet wrote:
>>> I'm not applying this until someone explains to me why
>>> we should remove this test from the splice receive but
>>> keep it in the tcp_recvmsg() code where it has been
>>> essentially forever.
>> I found this patch usefull in my testings, but had a feeling something
>> was not complete. If the goal is to reduce number of splice() calls,
>> we also should reduce number of wakeups. If splice() is used in non
>> blocking mode, nothing we can do here of course, since the application
>> will use a poll()/select()/epoll() event before calling splice(). A
>> good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things.
>>
>> I tested this on current tree and it is not working : we still have
>> one wakeup for each frame (ethernet link is a 100 Mb/s one)
> 
> Well, it simply means that data are not coming in fast enough compared to
> the tiny amount of work you have to perform to forward them, there's nothing
> wrong with that. It is important in my opinion not to wait for *enough* data
> to come in, otherwise it might become impossible to forward small chunks.
> I mean, if there are only 300 bytes left to forward, we must not wait
> indefinitely for more data to come, we must forward those 300 bytes.
> 
> In your case below, it simply means that the performance improvement brought
> by splice will be really minor because you'll just avoid 2 memory copies,
> which are ridiculously cheap at 100 Mbps. If you would change your program
> to use recv/send, you would observe the exact same pattern, because as soon
> as poll() wakes you up, you still only have one frame in the system buffers.
> On a small machine I have here (geode 500 MHz), I easily have multiple
> frames queued at 100 Mbps because when epoll() wakes me up, I have traffic
> on something like 10-100 sockets, and by the time I process the first ones,
> the later have time to queue up more data.

My point is to use Gigabit or 10 Gb links and hundreds or thousands of flows :)

But if it doesn't work on a single flow, it won't work on many :)

I tried my test program with a Gb link and one flow, and got splice() calls returning
23000 bytes on average, using a little too much CPU: if poll() could wait a little
longer, the CPU could be available for other tasks.

If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, [32768], 4), it
would be good if the kernel were smart enough to reduce the number of wakeups.

(The next blocking point is the fixed limit of 16 pages per pipe, but that's another story)

> 
>> About tcp_recvmsg(), we might also remove the "!timeo" test as well,
>> more testings are needed.
> 
> No right now we can't (we must move it somewhere else at least). Because
> once at least one byte has been received (copied != 0), no other check
> will break out of the loop (or at least I have not found it).
> 

Of course we can't remove the test completely, but we can change the logic so that
several skbs might be used/consumed per tcp_recvmsg() call, like your patch did for
splice().

Let's focus on functional changes, not on implementation details :)




* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09  7:28         ` Eric Dumazet
@ 2009-01-09  7:42           ` Willy Tarreau
  2009-01-13 23:27           ` David Miller
  1 sibling, 0 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-09  7:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Fri, Jan 09, 2009 at 08:28:09AM +0100, Eric Dumazet wrote:
> My point is to use Gigabit links or 10Gb links and hundred or thousand of flows :)
> 
> But if it doesnt work on a single flow, it wont work on many :)

Yes it will, precisely because while you spend time processing flow #1,
you're still receiving data for flow #2. I really invite you to try it. That's
what I've observed over years of writing userland proxies.

> I tried my test program with a Gb link, one flow, and got splice() calls returns 23000 bytes
> in average, using a litle too much of CPU : If poll() could wait a litle bit more, CPU
> could be available for other tasks.

I also observe 23000 bytes on average on gigabit, which is very good
(only about 5000 calls per second). And the CPU usage is lower than
with recv/send. I'd like to be able to run some profiling, because
I observed very different performance patterns depending on the network
cards used. Generally, almost all the time is spent in softirqs.

It's easy to make poll() wait a little bit more: call it later and do
your work before calling it. Also, epoll_wait() lets you ask it to
return only a limited number of FDs, which really improves data gathering.
I generally observe the best performance between 30-200 FDs per call, even
with 10000 concurrent connections. While I process the first 200 FDs,
data is accumulating in the others' buffers.
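
(For reference, limiting the batch is just the maxevents argument of
epoll_wait(); a minimal sketch, where epfd is an already-populated epoll
instance and handle_fd() stands for the application's per-socket work --
both are assumptions, not code from this thread:)

#include <sys/epoll.h>

#define BATCH 200			/* at most 200 ready fds per call */

extern void handle_fd(int fd);		/* application-specific work */

void event_loop(int epfd)
{
	struct epoll_event ev[BATCH];
	int n, i;

	for (;;) {
		n = epoll_wait(epfd, ev, BATCH, -1);
		for (i = 0; i < n; i++) {
			/* while these are serviced, data keeps piling up
			 * on the sockets that were not returned yet */
			handle_fd(ev[i].data.fd);
		}
	}
}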

> If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, [32768], 4), it
> would be good if kernel was smart enough and could reduce number of wakeups.

Yes, I agree about that. But my comment was about not making this
behaviour mandatory for splice(). Letting the application choose is
the way to go, of course.

> (Next blocking point is the fixed limit of 16 pages per pipe, but
> thats another story)

Yes, but it's not always easy to guess how much data you can feed
into the pipe. It seems that, depending on how the segments are
gathered, you can store anywhere between 16 segments and 64 kB. I have
observed some cases in blocking mode where I could not push more
than a few kbytes with a small MSS, indicating to me that each of
those segments was sitting on a distinct page. I don't know precisely
how that's handled internally.

> >> About tcp_recvmsg(), we might also remove the "!timeo" test as well,
> >> more testings are needed.
> > 
> > No right now we can't (we must move it somewhere else at least). Because
> > once at least one byte has been received (copied != 0), no other check
> > will break out of the loop (or at least I have not found it).
> > 
> 
> Of course we cant remove the test totally, but change the logic so that several skb
> might be used/consumed per tcp_recvmsg() call, like your patch did for splice()

OK, I initially understood that you suggested we could simply remove
it like I did for splice.

> Lets focus on functional changes, not on implementation details :)

Agreed :-)

Willy



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09  6:47     ` Eric Dumazet
  2009-01-09  7:04       ` Willy Tarreau
@ 2009-01-09 15:42       ` Eric Dumazet
  2009-01-09 17:57         ` Eric Dumazet
                           ` (2 more replies)
  2009-01-13 23:26       ` David Miller
  2 siblings, 3 replies; 190+ messages in thread
From: Eric Dumazet @ 2009-01-09 15:42 UTC (permalink / raw)
  To: David Miller; +Cc: ben, w, jarkao2, mingo, linux-kernel, netdev, jens.axboe

Eric Dumazet wrote:
> David Miller wrote:
>> I'm not applying this until someone explains to me why
>> we should remove this test from the splice receive but
>> keep it in the tcp_recvmsg() code where it has been
>> essentially forever.

Reading tcp_recvmsg() again, I found it is already able to eat several skbs
even in non-blocking mode.

setsockopt(5, SOL_SOCKET, SO_RCVLOWAT, [61440], 4) = 0
ioctl(5, FIONBIO, [1])                  = 0
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536


David, if you were referring to the code at line 1374 of net/ipv4/tcp.c, I believe
there is no issue with it. We really want to break from this loop if !timeo.

Willy's patch makes splice() behave like tcp_recvmsg(), but we might call
tcp_cleanup_rbuf() several times, with copied=1460 (once for each frame processed).

I wonder if the right fix should be done in tcp_read_sock(): this is the
one that should eat several skbs, IMHO, if we want optimal ACK generation.

We break out of its loop at line 1246

if (!desc->count) /* this test is always true */
	break;

(__tcp_splice_read() sets count to 0, right before calling tcp_read_sock())

So the code at line 1246 (tcp_read_sock()) seems wrong, or at least pessimistic.
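
(For readers following along, here is a simplified sketch of the shapes
involved, reconstructed from the snippets quoted elsewhere in this thread --
not the exact kernel source:)

/* __tcp_splice_read(): the designated initializer leaves rd_desc.count
 * at 0, so the very first "!desc->count" test below fires after a
 * single skb has been handed to the splice actor */
static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
{
	read_descriptor_t rd_desc = {
		.arg.data = tss,	/* .count is implicitly 0 */
	};

	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
}

/* tcp_read_sock() loop, roughly: */
while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
	/* ... */
	used = recv_actor(desc, skb, offset, len);
	/* ... advance seq by 'used', free fully consumed skbs ... */
	if (!desc->count)	/* the test discussed above */
		break;
}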





* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 15:42       ` Eric Dumazet
@ 2009-01-09 17:57         ` Eric Dumazet
  2009-01-09 18:54         ` Willy Tarreau
  2009-01-13 23:31         ` David Miller
  2 siblings, 0 replies; 190+ messages in thread
From: Eric Dumazet @ 2009-01-09 17:57 UTC (permalink / raw)
  To: David Miller; +Cc: ben, w, jarkao2, mingo, linux-kernel, netdev, jens.axboe

Eric Dumazet wrote:
> Eric Dumazet wrote:
>> David Miller wrote:
>>> I'm not applying this until someone explains to me why
>>> we should remove this test from the splice receive but
>>> keep it in the tcp_recvmsg() code where it has been
>>> essentially forever.
> 
> Reading again tcp_recvmsg(), I found it already is able to eat several skb
> even in nonblocking mode.
> 
> setsockopt(5, SOL_SOCKET, SO_RCVLOWAT, [61440], 4) = 0
> ioctl(5, FIONBIO, [1])                  = 0
> poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
> recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
> write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
> poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
> recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
> write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
> poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
> recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
> write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
> poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
> recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
> write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
> 
> 
> David, if you referred to code at line 1374 of net/ipv4/tcp.c, I believe there is
> no issue with it. We really want to break from this loop if !timeo .
> 
> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call
> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)
> 
> I wonder if the right fix should be done in tcp_read_sock() : this is the
> one who should eat several skbs IMHO, if we want optimal ACK generation.
> 
> We break out of its loop at line 1246
> 
> if (!desc->count) /* this test is always true */
> 	break;
> 
> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
> 
> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.

I tested the following patch and got the expected results:

- fewer ACKs, and correct rcvbuf adjustments.
- I can fill the Gb link with only one flow, using less than
  10% of the CPU, instead of 40% without the patch.

Setting desc->count to 1 seems to be the current practice.
(Example in drivers/scsi/iscsi_tcp.c: iscsi_sw_tcp_data_ready())

******************************************************************
* Note : this patch is wrong, because splice() can now           *
* return  more bytes than asked for (if SPLICE_F_NONBLOCK asked) *
******************************************************************

[PATCH] tcp: splice as many packets as possible at once

As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
optimal. It processes at most one segment per call.
This results in low performance and very high overhead due to syscall rate
when splicing from interfaces which do not support LRO.

Willy provided a patch inside tcp_splice_read(), but a better fix
is to let tcp_read_sock() process as many segments as possible, so
that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called once
per syscall.

With this change, splice() behaves like tcp_recvmsg(), being able
to consume many skbs in one system call.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd6ff90..15bd67b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -533,6 +533,9 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
 		.arg.data = tss,
 	};
 
+	if (tss->flags & SPLICE_F_NONBLOCK)
+		rd_desc.count = 1; /* we want as many segments as possible */
+
 	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
 }
 



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 15:42       ` Eric Dumazet
  2009-01-09 17:57         ` Eric Dumazet
@ 2009-01-09 18:54         ` Willy Tarreau
  2009-01-09 20:51           ` Eric Dumazet
  2009-01-13 23:31         ` David Miller
  2 siblings, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-09 18:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

Hi Eric,

On Fri, Jan 09, 2009 at 04:42:44PM +0100, Eric Dumazet wrote:
(...)
> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call
> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)
> 
> I wonder if the right fix should be done in tcp_read_sock() : this is the
> one who should eat several skbs IMHO, if we want optimal ACK generation.
> 
> We break out of its loop at line 1246
> 
> if (!desc->count) /* this test is always true */
> 	break;
> 
> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
> 
> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.

That's a very interesting discovery you made there. I made
measurements with this line commented out, just to get an idea. The
hardest part was finding a CPU-bound machine. Finally I slowed my
laptop down to 300 MHz (in fact, 600 MHz with 50% throttling, but let's
call that 300). That way, I cannot saturate the PCI-based tg3 and
I can observe the effects of various changes on the data rate.

- original tcp_splice_read(), with "!timeo"    : 24.1 MB/s
- modified tcp_splice_read(), without "!timeo" : 32.5 MB/s (+34%)
- original with line #1246 commented out       : 34.5 MB/s (+43%)

So you're right, avoiding calling tcp_read_sock() all the time
gives a nice performance boost.

Also, I found that tcp_splice_read() behaves like this when breaking
out of the loop:

    lock_sock();
    while () {
         ...
         __tcp_splice_read();
         ...
         release_sock();
         lock_sock();
         if (break condition)
              break;
    }
    release_sock();

Which means that when breaking out of the loop on (!timeo)
with ret > 0, we do release_sock/lock_sock/release_sock.

So I tried a minor modification, consisting of moving the test before
release_sock() and leaving !timeo there, with line #1246 commented out.
That's a noticeable winner, as the data rate went up to 35.7 MB/s (+48%).
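
(In loop form, the modification described above looks roughly like this --
a sketch of the structure only, not the actual patch:)

    lock_sock(sk);
    while (tss.len) {
         /* ... __tcp_splice_read(), error handling, accounting ... */

         if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
             (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
             signal_pending(current))
              break;          /* tested before dropping the lock */

         release_sock(sk);
         lock_sock(sk);
    }
    release_sock(sk);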

Also, in your second mail, you say that your change might return more
data than requested by the user. I can't see why; could you please
explain, as I'm still quite ignorant in this area?

Thanks,
Willy



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 18:54         ` Willy Tarreau
@ 2009-01-09 20:51           ` Eric Dumazet
  2009-01-09 21:24             ` Willy Tarreau
  0 siblings, 1 reply; 190+ messages in thread
From: Eric Dumazet @ 2009-01-09 20:51 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

Willy Tarreau wrote:
> Hi Eric,
> 
> On Fri, Jan 09, 2009 at 04:42:44PM +0100, Eric Dumazet wrote:
> (...)
>> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call
>> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)
>>
>> I wonder if the right fix should be done in tcp_read_sock() : this is the
>> one who should eat several skbs IMHO, if we want optimal ACK generation.
>>
>> We break out of its loop at line 1246
>>
>> if (!desc->count) /* this test is always true */
>> 	break;
>>
>> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
>>
>> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.
> 
> That's a very interesting discovery that you made here. I have made
> mesurements with this line commented out just to get an idea. The
> hardest part was to find a CPU-bound machine. Finally I slowed my
> laptop down to 300 MHz (in fact, 600 with throttle 50%, but let's
> call that 300). That way, I cannot saturate the PCI-based tg3 and
> I can observe the effects of various changes on the data rate.
> 
> - original tcp_splice_read(), with "!timeo"    : 24.1 MB/s
> - modified tcp_splice_read(), without "!timeo" : 32.5 MB/s (+34%)
> - original with line #1246 commented out       : 34.5 MB/s (+43%)
> 
> So you're right, avoiding calling tcp_read_sock() all the time
> gives a nice performance boost.
> 
> Also, I found that tcp_splice_read() behaves like this when breaking
> out of the loop :
> 
>     lock_sock();
>     while () {
>          ...
>          __tcp_splice_read();
>          ...
>          release_sock();
>          lock_sock();
>          if (break condition)
>               break;
>     }
>     release_sock();
> 
> Which means that when breaking out of the loop on (!timeo)
> with ret > 0, we do release_sock/lock_sock/release_sock.
> 
> So I tried a minor modification, consisting in moving the
> test before release_sock(), and leaving !timeo there with
> line #1246 commented out. That's a noticeable winner, as
> the data rate went up to 35.7 MB/s (+48%).
> 
> Also, in your second mail, you're saying that your change
> might return more data than requested by the user. I can't
> find why, could you please explain to me, as I'm still quite
> ignorant in this area ?

Well, I just tested various user programs and indeed got this
strange result:

Here I call splice() with len=1000 (0x3e8), and you can see
it returns 1460 on the second call.

I suspect a bug in the splice code, which my patch just exposed.

pipe([3, 4])                            = 0
open("/dev/null", O_RDWR|O_CREAT|O_TRUNC, 0644) = 5
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(6, 5)                            = 0
accept(6, 0, NULL)                      = 7
setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [1000], 4) = 0
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x3e8, 0x3)      = 1000                OK
splice(0x3, 0, 0x5, 0, 0x3e8, 0x5)      = 1000
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x3e8, 0x3)      = 1460                Oh well... 1460 > 1000 !!!
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x3e8, 0x3)      = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5)      = 1460





* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 20:51           ` Eric Dumazet
@ 2009-01-09 21:24             ` Willy Tarreau
  2009-01-09 22:02               ` Eric Dumazet
  2009-01-09 22:07               ` Willy Tarreau
  0 siblings, 2 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-09 21:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
(...)
> > Also, in your second mail, you're saying that your change
> > might return more data than requested by the user. I can't
> > find why, could you please explain to me, as I'm still quite
> > ignorant in this area ?
> 
> Well, I just tested various user programs and indeed got this
> strange result :
> 
> Here I call splice() with len=1000 (0x3e8), and you can see
> it gives a result of 1460 at the second call.

huh, not nice indeed!

While looking at the code to see how this could be possible, I
came across this minor thing (unrelated, IMHO):

	if (__skb_splice_bits(skb, &offset, &tlen, &spd))
		goto done;
>>>>>>	else if (!tlen)    <<<<<<
		goto done;

	/*
	 * now see if we have a frag_list to map
	 */
	if (skb_shinfo(skb)->frag_list) {
		struct sk_buff *list = skb_shinfo(skb)->frag_list;

		for (; list && tlen; list = list->next) {
			if (__skb_splice_bits(list, &offset, &tlen, &spd))
				break;
		}
	}

    done:

On the highlighted line above, we'd better remove the else and leave a plain
"if (!tlen)". Otherwise, when the first call to __skb_splice_bits() zeroes
tlen, we still enter the if and evaluate the for condition for nothing. But
let's leave that for later.

> I suspect a bug in splice code, that my patch just exposed.

I've checked skb_splice_bits() and below and can't see how we could move
more than the requested len.

However, with your change, I don't clearly see how we break out of
the loop in tcp_read_sock(). Maybe we first read 1000, then loop again
and read the remaining data? I suspect that we should at least exit when
((struct tcp_splice_state *)desc->arg.data)->len == 0.

At least that's something easy to add, just before or after the
!desc->count test.

Regards,
Willy



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 21:24             ` Willy Tarreau
@ 2009-01-09 22:02               ` Eric Dumazet
  2009-01-09 22:09                 ` Willy Tarreau
  2009-01-09 22:07               ` Willy Tarreau
  1 sibling, 1 reply; 190+ messages in thread
From: Eric Dumazet @ 2009-01-09 22:02 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> (...)
>>> Also, in your second mail, you're saying that your change
>>> might return more data than requested by the user. I can't
>>> find why, could you please explain to me, as I'm still quite
>>> ignorant in this area ?
>> Well, I just tested various user programs and indeed got this
>> strange result :
>>
>> Here I call splice() with len=1000 (0x3e8), and you can see
>> it gives a result of 1460 at the second call.
> 
> huh, not nice indeed!
> 
> While looking at the code to see how this could be possible, I
> came across this minor thing (unrelated IMHO) :
> 
> 	if (__skb_splice_bits(skb, &offset, &tlen, &spd))
> 		goto done;
>>>>>>> 	else if (!tlen)    <<<<<<
> 		goto done;
> 
> 	/*
> 	 * now see if we have a frag_list to map
> 	 */
> 	if (skb_shinfo(skb)->frag_list) {
> 		struct sk_buff *list = skb_shinfo(skb)->frag_list;
> 
> 		for (; list && tlen; list = list->next) {
> 			if (__skb_splice_bits(list, &offset, &tlen, &spd))
> 				break;
> 		}
> 	}
> 
>     done:
> 
> Above on the enlighted line, we'd better remove the else and leave a plain
> "if (!tlen)". Otherwise, when the first call to __skb_splice_bits() zeroes
> tlen, we still enter the if and evaluate the for condition for nothing. But
> let's leave that for later.
> 
>> I suspect a bug in splice code, that my patch just exposed.
> 
> I've checked in skb_splice_bits() and below and can't see how we can move
> more than the requested len.
> 
> However, with your change, I don't clearly see how we break out of
> the loop in tcp_read_sock(). Maybe we first read 1000 then loop again
> and read remaining data ? I suspect that we should at least exit when
> ((struct tcp_splice_state *)desc->arg.data)->len = 0.
> 
> At least that's something easy to add just before or after !desc->count
> for a test.
> 

I believe the bug is in tcp_splice_data_recv().

I am going to test a new patch, but here is what I found:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd6ff90..fbbddf4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -523,7 +523,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 {
        struct tcp_splice_state *tss = rd_desc->arg.data;

-       return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+       return skb_splice_bits(skb, offset, tss->pipe, len, tss->flags);
 }

 static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)




* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 21:24             ` Willy Tarreau
  2009-01-09 22:02               ` Eric Dumazet
@ 2009-01-09 22:07               ` Willy Tarreau
  2009-01-09 22:12                 ` Eric Dumazet
  1 sibling, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-09 22:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> (...)
> > > Also, in your second mail, you're saying that your change
> > > might return more data than requested by the user. I can't
> > > find why, could you please explain to me, as I'm still quite
> > > ignorant in this area ?
> > 
> > Well, I just tested various user programs and indeed got this
> > strange result :
> > 
> > Here I call splice() with len=1000 (0x3e8), and you can see
> > it gives a result of 1460 at the second call.

OK, I finally managed to reproduce it and found why we get this. It's
expected, in fact.

The problem when we loop in tcp_read_sock() is that tss->len is
not decremented by the number of bytes read; that is only done
in tcp_splice_read(), which is the outer loop.

The solution I found was to do just like the other callers, which means
using desc->count to keep the remaining number of bytes we want to
read. In fact, tcp_read_sock() is designed to use that as a stop
condition, which explains why you first had to hide it.

Now with the attached patch as a replacement for my previous one,
both issues are solved:
  - I splice 1000 bytes if I ask to do so
  - I splice as much as possible if available (typically 23 kB).

My observed performance is still at the top of the earlier results,
and IMHO that way of counting bytes makes sense for an actor called
from tcp_read_sock().

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 35bcddf..51ff3aa 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 				unsigned int offset, size_t len)
 {
 	struct tcp_splice_state *tss = rd_desc->arg.data;
+	int ret;
 
-	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+	ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
+	if (ret > 0)
+		rd_desc->count -= ret;
+	return ret;
 }
 
 static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
@@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
 	/* Store TCP splice context information in read_descriptor_t. */
 	read_descriptor_t rd_desc = {
 		.arg.data = tss,
+		.count = tss->len,
 	};
 
 	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);

Regards,
Willy



* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 22:02               ` Eric Dumazet
@ 2009-01-09 22:09                 ` Willy Tarreau
  0 siblings, 0 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-09 22:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Fri, Jan 09, 2009 at 11:02:06PM +0100, Eric Dumazet wrote:
> Willy Tarreau wrote:
> > On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> > (...)
> >>> Also, in your second mail, you're saying that your change
> >>> might return more data than requested by the user. I can't
> >>> find why, could you please explain to me, as I'm still quite
> >>> ignorant in this area ?
> >> Well, I just tested various user programs and indeed got this
> >> strange result :
> >>
> >> Here I call splice() with len=1000 (0x3e8), and you can see
> >> it gives a result of 1460 at the second call.
> > 
> > huh, not nice indeed!
> > 
> > While looking at the code to see how this could be possible, I
> > came across this minor thing (unrelated IMHO) :
> > 
> > 	if (__skb_splice_bits(skb, &offset, &tlen, &spd))
> > 		goto done;
> >>>>>>> 	else if (!tlen)    <<<<<<
> > 		goto done;
> > 
> > 	/*
> > 	 * now see if we have a frag_list to map
> > 	 */
> > 	if (skb_shinfo(skb)->frag_list) {
> > 		struct sk_buff *list = skb_shinfo(skb)->frag_list;
> > 
> > 		for (; list && tlen; list = list->next) {
> > 			if (__skb_splice_bits(list, &offset, &tlen, &spd))
> > 				break;
> > 		}
> > 	}
> > 
> >     done:
> > 
> > Above on the enlighted line, we'd better remove the else and leave a plain
> > "if (!tlen)". Otherwise, when the first call to __skb_splice_bits() zeroes
> > tlen, we still enter the if and evaluate the for condition for nothing. But
> > let's leave that for later.
> > 
> >> I suspect a bug in splice code, that my patch just exposed.
> > 
> > I've checked in skb_splice_bits() and below and can't see how we can move
> > more than the requested len.
> > 
> > However, with your change, I don't clearly see how we break out of
> > the loop in tcp_read_sock(). Maybe we first read 1000 then loop again
> > and read remaining data ? I suspect that we should at least exit when
> > ((struct tcp_splice_state *)desc->arg.data)->len = 0.
> > 
> > At least that's something easy to add just before or after !desc->count
> > for a test.
> > 
> 
> I believe the bug is in tcp_splice_data_recv()

yes, see my other mail.

> I am going to test a new patch, but here is the thing I found:
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index bd6ff90..fbbddf4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -523,7 +523,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
>  {
>         struct tcp_splice_state *tss = rd_desc->arg.data;
> 
> -       return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
> +       return skb_splice_bits(skb, offset, tss->pipe, len, tss->flags);
>  }

it will not work, len is always 1000 in your case, for every call,
as I had in my logs:

kernel :

tcp_splice_data_recv: skb=ed3d3480 offset=0 len=1460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages
tcp_splice_data_recv: skb=ed3d3480 offset=1000 len=460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_splice_data_recv: skb=ed3d3540 offset=0 len=1460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages
tcp_sendpage: going through do_tcp_sendpages
tcp_splice_data_recv: skb=ed3d3540 offset=1000 len=460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_splice_data_recv: skb=ed3d3600 offset=0 len=1176 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages
tcp_sendpage: going through do_tcp_sendpages
tcp_splice_data_recv: skb=ed3d3600 offset=1000 len=176 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages

program :

accept(3, 0, NULL)                      = 4
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("10.0.3.1")}, 16) = 0
fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl64(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
pipe([6, 7])                            = 0
select(6, [5], [], NULL, NULL)          = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = 1000
select(6, [5], [4], NULL, NULL)         = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3)      = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = 1460
select(6, [5], [4], NULL, NULL)         = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x5b4, 0x3)      = 1460
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = 1460
select(6, [5], [4], NULL, NULL)         = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x5b4, 0x3)      = 1460
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = 176
select(6, [5], [4], NULL, NULL)         = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0xb0, 0x3)       = 176
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = -1 EAGAIN (Resource temporarily unavailable)
close(5)                                = 0
close(4)                                = 0
exit_group(0)                           = ?


Now with the fix :

accept(3, 0, NULL)                      = 4
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("10.0.3.1")}, 16) = 0
fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl64(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
pipe([6, 7])                            = 0
select(6, [5], [], NULL, NULL)          = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = 1000
select(6, [5], [4], NULL, NULL)         = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3)      = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = 1000
select(6, [5], [4], NULL, NULL)         = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3)      = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = 1000
select(6, [5], [4], NULL, NULL)         = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3)      = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = 1000
select(6, [5], [4], NULL, NULL)         = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3)      = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = 96
select(6, [5], [4], NULL, NULL)         = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x60, 0x3)       = 96
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3)      = -1 EAGAIN (Resource temporarily unavailable)
close(5)                                = 0
close(4)                                = 0
exit_group(0)                           = ?


Regards,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 22:07               ` Willy Tarreau
@ 2009-01-09 22:12                 ` Eric Dumazet
  2009-01-09 22:17                   ` Willy Tarreau
  0 siblings, 1 reply; 190+ messages in thread
From: Eric Dumazet @ 2009-01-09 22:12 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

Willy Tarreau a écrit :
> On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote:
>> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
>> (...)
>>>> Also, in your second mail, you're saying that your change
>>>> might return more data than requested by the user. I can't
>>>> find why, could you please explain to me, as I'm still quite
>>>> ignorant in this area ?
>>> Well, I just tested various user programs and indeed got this
>>> strange result :
>>>
>>> Here I call splice() with len=1000 (0x3e8), and you can see
>>> it gives a result of 1460 at the second call.
> 
> OK finally I could reproduce it and found why we have this. It's
> expected in fact.
> 
> The problem when we loop in tcp_read_sock() is that tss->len is
> not decremented by the amount of bytes read, this one is done
> only in tcp_splice_read() which is outer.
> 
> The solution I found was to do just like other callers, which means
> use desc->count to keep the remaining number of bytes we want to
> read. In fact, tcp_read_sock() is designed to use that one as a stop
> condition, which explains why you first had to hide it.
> 
> Now with the attached patch as a replacement for my previous one,
> both issues are solved :
>   - I splice 1000 bytes if I ask to do so
>   - I splice as much as possible if available (typically 23 kB).
> 
> My observed performances are still at the top of earlier results
> and IMHO that way of counting bytes makes sense for an actor called
> from tcp_read_sock().
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 35bcddf..51ff3aa 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
>  				unsigned int offset, size_t len)
>  {
>  	struct tcp_splice_state *tss = rd_desc->arg.data;
> +	int ret;
>  
> -	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
> +	ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
> +	if (ret > 0)
> +		rd_desc->count -= ret;
> +	return ret;
>  }
>  
>  static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
>  	/* Store TCP splice context information in read_descriptor_t. */
>  	read_descriptor_t rd_desc = {
>  		.arg.data = tss,
> +		.count = tss->len,
>  	};
>  
>  	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
> 

OK, I came to a different patch. Please check other tcp_read_sock() callers in tree :)

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 23808df..96b49e1 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -100,13 +100,11 @@ static void iscsi_sw_tcp_data_ready(struct sock *sk, int flag)
 
 	/*
 	 * Use rd_desc to pass 'conn' to iscsi_tcp_recv.
-	 * We set count to 1 because we want the network layer to
-	 * hand us all the skbs that are available. iscsi_tcp_recv
-	 * handled pdus that cross buffers or pdus that still need data.
+	 * iscsi_tcp_recv handled pdus that cross buffers or pdus that
+	 * still need data.
 	 */
 	rd_desc.arg.data = conn;
-	rd_desc.count = 1;
-	tcp_read_sock(sk, &rd_desc, iscsi_sw_tcp_recv);
+	tcp_read_sock(sk, &rd_desc, iscsi_sw_tcp_recv, 65536);
 
 	read_unlock(&sk->sk_callback_lock);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 218235d..b1facd1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -490,7 +490,7 @@ extern void tcp_get_info(struct sock *, struct tcp_info *);
 typedef int (*sk_read_actor_t)(read_descriptor_t *, struct sk_buff *,
 				unsigned int, size_t);
 extern int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
-			 sk_read_actor_t recv_actor);
+			 sk_read_actor_t recv_actor, size_t tlen);
 
 extern void tcp_initialize_rcv_mss(struct sock *sk);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd6ff90..fbbddf4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -523,7 +523,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 {
 	struct tcp_splice_state *tss = rd_desc->arg.data;
 
-	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+	return skb_splice_bits(skb, offset, tss->pipe, len, tss->flags);
 }
 
 static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
@@ -533,7 +533,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
 		.arg.data = tss,
 	};
 
-	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
+	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv, tss->len);
 }
 
 /**
@@ -611,11 +611,13 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 		tss.len -= ret;
 		spliced += ret;
 
+		if (!timeo)
+			break;
 		release_sock(sk);
 		lock_sock(sk);
 
 		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
-		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
+		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
 		    signal_pending(current))
 			break;
 	}
@@ -1193,7 +1195,7 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
  *	  (although both would be easy to implement).
  */
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
-		  sk_read_actor_t recv_actor)
+		  sk_read_actor_t recv_actor, size_t tlen)
 {
 	struct sk_buff *skb;
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -1209,6 +1211,8 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 			size_t len;
 
 			len = skb->len - offset;
+			if (len > tlen)
+				len = tlen;
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
 				u32 urg_offset = tp->urg_seq - seq;
@@ -1226,6 +1230,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				seq += used;
 				copied += used;
 				offset += used;
+				tlen -= used;
 			}
 			/*
 			 * If recv_actor drops the lock (e.g. TCP splice
@@ -1243,7 +1248,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 			break;
 		}
 		sk_eat_skb(sk, skb, 0);
-		if (!desc->count)
+		if (!tlen)
 			break;
 	}
 	tp->copied_seq = seq;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 5cbb404..75f8e83 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1109,8 +1109,7 @@ static void xs_tcp_data_ready(struct sock *sk, int bytes)
 	/* We use rd_desc to pass struct xprt to xs_tcp_data_recv */
 	rd_desc.arg.data = xprt;
 	do {
-		rd_desc.count = 65536;
-		read = tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv);
+		read = tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv, 65536);
 	} while (read > 0);
 out:
 	read_unlock(&sk->sk_callback_lock);



^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 22:12                 ` Eric Dumazet
@ 2009-01-09 22:17                   ` Willy Tarreau
  2009-01-09 22:42                     ` Evgeniy Polyakov
  2009-01-09 22:45                     ` Eric Dumazet
  0 siblings, 2 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-09 22:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Fri, Jan 09, 2009 at 11:12:09PM +0100, Eric Dumazet wrote:
> Willy Tarreau a écrit :
> > On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote:
> >> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> >> (...)
> >>>> Also, in your second mail, you're saying that your change
> >>>> might return more data than requested by the user. I can't
> >>>> find why, could you please explain to me, as I'm still quite
> >>>> ignorant in this area ?
> >>> Well, I just tested various user programs and indeed got this
> >>> strange result :
> >>>
> >>> Here I call splice() with len=1000 (0x3e8), and you can see
> >>> it gives a result of 1460 at the second call.
> > 
> > OK finally I could reproduce it and found why we have this. It's
> > expected in fact.
> > 
> > The problem when we loop in tcp_read_sock() is that tss->len is
> > not decremented by the amount of bytes read, this one is done
> > only in tcp_splice_read() which is outer.
> > 
> > The solution I found was to do just like other callers, which means
> > use desc->count to keep the remaining number of bytes we want to
> > read. In fact, tcp_read_sock() is designed to use that one as a stop
> > condition, which explains why you first had to hide it.
> > 
> > Now with the attached patch as a replacement for my previous one,
> > both issues are solved :
> >   - I splice 1000 bytes if I ask to do so
> >   - I splice as much as possible if available (typically 23 kB).
> > 
> > My observed performances are still at the top of earlier results
> > and IMHO that way of counting bytes makes sense for an actor called
> > from tcp_read_sock().
> > 
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 35bcddf..51ff3aa 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
> >  				unsigned int offset, size_t len)
> >  {
> >  	struct tcp_splice_state *tss = rd_desc->arg.data;
> > +	int ret;
> >  
> > -	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
> > +	ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
> > +	if (ret > 0)
> > +		rd_desc->count -= ret;
> > +	return ret;
> >  }
> >  
> >  static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> > @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> >  	/* Store TCP splice context information in read_descriptor_t. */
> >  	read_descriptor_t rd_desc = {
> >  		.arg.data = tss,
> > +		.count = tss->len,
> >  	};
> >  
> >  	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
> > 
> 
> OK, I came to a different patch. Please check other tcp_read_sock() callers in tree :)

I've seen the other callers, but they all use desc->count for their own
purpose. That's how I understood what it was used for :-)

I think it's better not to change the API here and use tcp_read_sock()
how it's supposed to be used. Also, the less parameters to the function,
the better.
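
For illustration only, the way a tcp_read_sock() actor can use desc->count
as its byte budget (as the patch above does) looks roughly like this -- a
sketch, the actor name is made up:

	/* rd_desc->count holds the remaining byte budget; the actor
	 * consumes at most that much and decrements it by what it used,
	 * so tcp_read_sock() can stop once it reaches zero.
	 */
	static int sketch_recv_actor(read_descriptor_t *desc, struct sk_buff *skb,
				     unsigned int offset, size_t len)
	{
		size_t used = min(len, desc->count);

		/* ... consume 'used' bytes from skb starting at 'offset' ... */
		desc->count -= used;
		return used;
	}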

However I'm OK for the !timeo before release_sock/lock_sock. I just
don't know if we can put the rest of the if above or not. I don't
know what changes we're supposed to collect by doing release_sock/
lock_sock before the if().

Regards,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 22:17                   ` Willy Tarreau
@ 2009-01-09 22:42                     ` Evgeniy Polyakov
  2009-01-09 22:50                       ` Willy Tarreau
  2009-01-10  7:40                       ` Eric Dumazet
  2009-01-09 22:45                     ` Eric Dumazet
  1 sibling, 2 replies; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-09 22:42 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric Dumazet, David Miller, ben, jarkao2, mingo, linux-kernel,
	netdev, jens.axboe

On Fri, Jan 09, 2009 at 11:17:44PM +0100, Willy Tarreau (w@1wt.eu) wrote:
> However I'm OK for the !timeo before release_sock/lock_sock. I just
> don't know if we can put the rest of the if above or not. I don't
> know what changes we're supposed to collect by doing release_sock/
> lock_sock before the if().

Not to interrupt the discussion, but for the clarification, that
release_sock/lock_sock is used to process the backlog accumulated while
socket was locked. And while dropping additional pair before the final
release is ok, but moving this itself should be thought of twice.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 22:17                   ` Willy Tarreau
  2009-01-09 22:42                     ` Evgeniy Polyakov
@ 2009-01-09 22:45                     ` Eric Dumazet
  2009-01-09 22:53                       ` Willy Tarreau
  1 sibling, 1 reply; 190+ messages in thread
From: Eric Dumazet @ 2009-01-09 22:45 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

Willy Tarreau a écrit :
> On Fri, Jan 09, 2009 at 11:12:09PM +0100, Eric Dumazet wrote:
>> Willy Tarreau a écrit :
>>> On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote:
>>>> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
>>>> (...)
>>>>>> Also, in your second mail, you're saying that your change
>>>>>> might return more data than requested by the user. I can't
>>>>>> find why, could you please explain to me, as I'm still quite
>>>>>> ignorant in this area ?
>>>>> Well, I just tested various user programs and indeed got this
>>>>> strange result :
>>>>>
>>>>> Here I call splice() with len=1000 (0x3e8), and you can see
>>>>> it gives a result of 1460 at the second call.
>>> OK finally I could reproduce it and found why we have this. It's
>>> expected in fact.
>>>
>>> The problem when we loop in tcp_read_sock() is that tss->len is
>>> not decremented by the amount of bytes read, this one is done
>>> only in tcp_splice_read() which is outer.
>>>
>>> The solution I found was to do just like other callers, which means
>>> use desc->count to keep the remaining number of bytes we want to
>>> read. In fact, tcp_read_sock() is designed to use that one as a stop
>>> condition, which explains why you first had to hide it.
>>>
>>> Now with the attached patch as a replacement for my previous one,
>>> both issues are solved :
>>>   - I splice 1000 bytes if I ask to do so
>>>   - I splice as much as possible if available (typically 23 kB).
>>>
>>> My observed performances are still at the top of earlier results
>>> and IMHO that way of counting bytes makes sense for an actor called
>>> from tcp_read_sock().
>>>
>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>> index 35bcddf..51ff3aa 100644
>>> --- a/net/ipv4/tcp.c
>>> +++ b/net/ipv4/tcp.c
>>> @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
>>>  				unsigned int offset, size_t len)
>>>  {
>>>  	struct tcp_splice_state *tss = rd_desc->arg.data;
>>> +	int ret;
>>>  
>>> -	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
>>> +	ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
>>> +	if (ret > 0)
>>> +		rd_desc->count -= ret;
>>> +	return ret;
>>>  }
>>>  
>>>  static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
>>> @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
>>>  	/* Store TCP splice context information in read_descriptor_t. */
>>>  	read_descriptor_t rd_desc = {
>>>  		.arg.data = tss,
>>> +		.count = tss->len,
>>>  	};
>>>  
>>>  	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
>>>
>> OK, I came to a different patch. Please check other tcp_read_sock() callers in tree :)
> 
> I've seen the other callers, but they all use desc->count for their own
> purpose. That's how I understood what it was used for :-)

Ah yes, I reread your patch and you are right.

> 
> I think it's better not to change the API here and use tcp_read_sock()
> how it's supposed to be used. Also, the less parameters to the function,
> the better.
> 
> However I'm OK for the !timeo before release_sock/lock_sock. I just
> don't know if we can put the rest of the if above or not. I don't
> know what changes we're supposed to collect by doing release_sock/
> lock_sock before the if().

Only the (!timeo) can be above. Other conditions must be checked after
the release/lock.


Thank you



^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 22:42                     ` Evgeniy Polyakov
@ 2009-01-09 22:50                       ` Willy Tarreau
  2009-01-09 23:01                         ` Evgeniy Polyakov
  2009-01-10  7:40                       ` Eric Dumazet
  1 sibling, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-09 22:50 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Eric Dumazet, David Miller, ben, jarkao2, mingo, linux-kernel,
	netdev, jens.axboe

On Sat, Jan 10, 2009 at 01:42:58AM +0300, Evgeniy Polyakov wrote:
> On Fri, Jan 09, 2009 at 11:17:44PM +0100, Willy Tarreau (w@1wt.eu) wrote:
> > However I'm OK for the !timeo before release_sock/lock_sock. I just
> > don't know if we can put the rest of the if above or not. I don't
> > know what changes we're supposed to collect by doing release_sock/
> > lock_sock before the if().
> 
> Not to interrupt the discussion, but for the clarification, that
> release_sock/lock_sock is used to process the backlog accumulated while
> socket was locked. And while dropping additional pair before the final
> release is ok, but moving this itself should be thought of twice.

Nice, thanks Evgeniy. So it makes sense to move only the !timeo test
above since it's not dependent on the socket, and leave the rest of
the test where it currently is. That's what Eric has proposed in his
latest patch.

Well, I'm now trying to educate myself on the send part. It's still
not very clear to me and I'd like to understand a little bit better
why we have this corruption problem and why there is a difference
between sending segments from memory and sending them from another
socket where they were already waiting.

I think I'll put printks everywhere and see what I can observe.
Knowing about the GSO/SG workaround already helps me enable/disable
the bug.

Regards,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 22:45                     ` Eric Dumazet
@ 2009-01-09 22:53                       ` Willy Tarreau
  2009-01-09 23:34                         ` Eric Dumazet
  0 siblings, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-09 22:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Fri, Jan 09, 2009 at 11:45:02PM +0100, Eric Dumazet wrote:
> Willy Tarreau a écrit :
> > On Fri, Jan 09, 2009 at 11:12:09PM +0100, Eric Dumazet wrote:
> >> Willy Tarreau a écrit :
> >>> On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote:
> >>>> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> >>>> (...)
> >>>>>> Also, in your second mail, you're saying that your change
> >>>>>> might return more data than requested by the user. I can't
> >>>>>> find why, could you please explain to me, as I'm still quite
> >>>>>> ignorant in this area ?
> >>>>> Well, I just tested various user programs and indeed got this
> >>>>> strange result :
> >>>>>
> >>>>> Here I call splice() with len=1000 (0x3e8), and you can see
> >>>>> it gives a result of 1460 at the second call.
> >>> OK finally I could reproduce it and found why we have this. It's
> >>> expected in fact.
> >>>
> >>> The problem when we loop in tcp_read_sock() is that tss->len is
> >>> not decremented by the amount of bytes read, this one is done
> >>> only in tcp_splice_read() which is outer.
> >>>
> >>> The solution I found was to do just like other callers, which means
> >>> use desc->count to keep the remaining number of bytes we want to
> >>> read. In fact, tcp_read_sock() is designed to use that one as a stop
> >>> condition, which explains why you first had to hide it.
> >>>
> >>> Now with the attached patch as a replacement for my previous one,
> >>> both issues are solved :
> >>>   - I splice 1000 bytes if I ask to do so
> >>>   - I splice as much as possible if available (typically 23 kB).
> >>>
> >>> My observed performances are still at the top of earlier results
> >>> and IMHO that way of counting bytes makes sense for an actor called
> >>> from tcp_read_sock().
> >>>
> >>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >>> index 35bcddf..51ff3aa 100644
> >>> --- a/net/ipv4/tcp.c
> >>> +++ b/net/ipv4/tcp.c
> >>> @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
> >>>  				unsigned int offset, size_t len)
> >>>  {
> >>>  	struct tcp_splice_state *tss = rd_desc->arg.data;
> >>> +	int ret;
> >>>  
> >>> -	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
> >>> +	ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
> >>> +	if (ret > 0)
> >>> +		rd_desc->count -= ret;
> >>> +	return ret;
> >>>  }
> >>>  
> >>>  static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> >>> @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> >>>  	/* Store TCP splice context information in read_descriptor_t. */
> >>>  	read_descriptor_t rd_desc = {
> >>>  		.arg.data = tss,
> >>> +		.count = tss->len,
> >>>  	};
> >>>  
> >>>  	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
> >>>
> >> OK, I came to a different patch. Please check other tcp_read_sock() callers in tree :)
> > 
> > I've seen the other callers, but they all use desc->count for their own
> > purpose. That's how I understood what it was used for :-)
> 
> Ah yes, I reread your patch and you are right.
> 
> > 
> > I think it's better not to change the API here and use tcp_read_sock()
> > how it's supposed to be used. Also, the less parameters to the function,
> > the better.
> > 
> > However I'm OK for the !timeo before release_sock/lock_sock. I just
> > don't know if we can put the rest of the if above or not. I don't
> > know what changes we're supposed to collect by doing release_sock/
> > lock_sock before the if().
> 
> Only the (!timeo) can be above. Other conditions must be checked after
> the release/lock.

Yes that's what Evgeniy explained too. I smelled something like this
but did not know.

Care to redo the whole patch, since you already have all code parts
at hand as well as some fragments of commit messages ? You can even
add my Tested-by if you want. Finally it was nice that Dave asked
for this explanation because it drove our nose to the fishy parts ;-)

Thanks,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 22:50                       ` Willy Tarreau
@ 2009-01-09 23:01                         ` Evgeniy Polyakov
  2009-01-09 23:06                           ` Willy Tarreau
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-09 23:01 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric Dumazet, David Miller, ben, jarkao2, mingo, linux-kernel,
	netdev, jens.axboe

On Fri, Jan 09, 2009 at 11:50:10PM +0100, Willy Tarreau (w@1wt.eu) wrote:
> Well, I'm now trying to educate myself on the send part. It's still
> not very clear to me and I'd like to understand a little bit better
> why we have this corruption problem and why there is a difference
> between sending segments from memory and sending them from another
> socket where they were already waiting.

printks are the best choice, since you will get exactly what you are
looking for instead of deciphering what developer or code told you.

> I think I'll put printks everywhere and see what I can observe.
> Knowing about the GSO/SG workaround already helps me enable/disable
> the bug.

I wish I could also be capable to disable the bugs... :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 23:01                         ` Evgeniy Polyakov
@ 2009-01-09 23:06                           ` Willy Tarreau
  0 siblings, 0 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-09 23:06 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Eric Dumazet, David Miller, ben, jarkao2, mingo, linux-kernel,
	netdev, jens.axboe

On Sat, Jan 10, 2009 at 02:01:24AM +0300, Evgeniy Polyakov wrote:
> On Fri, Jan 09, 2009 at 11:50:10PM +0100, Willy Tarreau (w@1wt.eu) wrote:
> > Well, I'm now trying to educate myself on the send part. It's still
> > not very clear to me and I'd like to understand a little bit better
> > why we have this corruption problem and why there is a difference
> > between sending segments from memory and sending them from another
> > socket where they were already waiting.
> 
> printks are the best choice, since you will get exactly what you are
> looking for instead of deciphering what developer or code told you.

I know, but as I told in an earlier mail, it was not even clear to
me what functions were called. I have made guesses but that's quite
hard when you don't know the entry point. I think I'm not too far
from having discovered the call chain. Understanding it will be
another story, of course ;-)

> > I think I'll put printks everywhere and see what I can observe.
> > Knowing about the GSO/SG workaround already helps me enable/disable
> > the bug.
> 
> I wish I could also be capable to disable the bugs... :)

Me too :-)
Here we have an opportunity to turn it on/off, let's take it !

Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 22:53                       ` Willy Tarreau
@ 2009-01-09 23:34                         ` Eric Dumazet
  2009-01-13  5:45                           ` David Miller
  2009-01-14  0:05                           ` David Miller
  0 siblings, 2 replies; 190+ messages in thread
From: Eric Dumazet @ 2009-01-09 23:34 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

Willy Tarreau a écrit :
> On Fri, Jan 09, 2009 at 11:45:02PM +0100, Eric Dumazet wrote:
>> Only the (!timeo) can be above. Other conditions must be checked after
>> the release/lock.
> 
> Yes that's what Evgeniy explained too. I smelled something like this
> but did not know.
> 
> Care to redo the whole patch, since you already have all code parts
> at hand as well as some fragments of commit messages ? You can even
> add my Tested-by if you want. Finally it was nice that Dave asked
> for this explanation because it drove our nose to the fishy parts ;-)

Sure, here it is :


David, do you think we still must call __tcp_splice_read() only once
in tcp_splice_read() if SPLICE_F_NONBLOCK is set ?

With the following patch, a splice() call is limited to 16 frames, typically
16*1460 = 23360 bytes. Removing the test as Willy did in his patch
could return the exact length requested by the user (limited to 16 pages),
giving nice blocks if feeding a file on disk...

Thank you

From: Willy Tarreau <w@1wt.eu>

[PATCH] tcp: splice as many packets as possible at once

As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
optimal. It processes at most one segment per call.
This results in low performance and very high overhead due to syscall rate
when splicing from interfaces which do not support LRO.

Willy provided a patch inside tcp_splice_read(), but a better fix
is to let tcp_read_sock() process as many segments as possible, so
that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less
often.

With this change, splice() behaves like tcp_recvmsg(), being able
to consume many skbs in one system call. With typical 1460 bytes
of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return
16*1460 = 23360 bytes.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd6ff90..1233835 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 				unsigned int offset, size_t len)
 {
 	struct tcp_splice_state *tss = rd_desc->arg.data;
+	int ret;
 
-	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+	ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
+	if (ret > 0)
+		rd_desc->count -= ret;
+	return ret;
 }
 
 static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
@@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
 	/* Store TCP splice context information in read_descriptor_t. */
 	read_descriptor_t rd_desc = {
 		.arg.data = tss,
+		.count	  = tss->len,
 	};
 
 	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
@@ -611,11 +616,13 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 		tss.len -= ret;
 		spliced += ret;
 
+		if (!timeo)
+			break;
 		release_sock(sk);
 		lock_sock(sk);
 
 		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
-		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
+		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
 		    signal_pending(current))
 			break;
 	}


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 22:42                     ` Evgeniy Polyakov
  2009-01-09 22:50                       ` Willy Tarreau
@ 2009-01-10  7:40                       ` Eric Dumazet
  2009-01-11 12:58                         ` Evgeniy Polyakov
  1 sibling, 1 reply; 190+ messages in thread
From: Eric Dumazet @ 2009-01-10  7:40 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel,
	netdev, jens.axboe

Evgeniy Polyakov a écrit :
> On Fri, Jan 09, 2009 at 11:17:44PM +0100, Willy Tarreau (w@1wt.eu) wrote:
>> However I'm OK for the !timeo before release_sock/lock_sock. I just
>> don't know if we can put the rest of the if above or not. I don't
>> know what changes we're supposed to collect by doing release_sock/
>> lock_sock before the if().
> 
> Not to interrupt the discussion, but for the clarification, that
> release_sock/lock_sock is used to process the backlog accumulated while
> socket was locked. And while dropping additional pair before the final
> release is ok, but moving this itself should be thought of twice.
> 

Hum... I just caught the release_sock(sk)/lock_sock(sk) done in skb_splice_bits()

So :

1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
to process backlog. Its already done in skb_splice_bits()

2) If we loop in tcp_read_sock() calling skb_splice_bits() several times
then we should perform the following tests inside this loop ?

if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) ||
   signal_pending(current)) break;

And remove them from tcp_splice_read() ?
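
Concretely, that would put the checks at the bottom of tcp_read_sock()'s
main loop, something like this (sketch only, not a tested patch):

	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
		/* ... hand the skb to recv_actor() as today ... */
		sk_eat_skb(sk, skb, 0);
		if (!desc->count)
			break;
		/* checks moved down from tcp_splice_read() */
		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
		    signal_pending(current))
			break;
	}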




^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-10  7:40                       ` Eric Dumazet
@ 2009-01-11 12:58                         ` Evgeniy Polyakov
  2009-01-11 13:14                           ` Eric Dumazet
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-11 12:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel,
	netdev, jens.axboe

Hi Eric.

On Sat, Jan 10, 2009 at 08:40:05AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > Not to interrupt the discussion, but for the clarification, that
> > release_sock/lock_sock is used to process the backlog accumulated while
> > socket was locked. And while dropping additional pair before the final
> > release is ok, but moving this itself should be thought of twice.
> > 
> 
> Hum... I just caught the release_sock(sk)/lock_sock(sk) done in skb_splice_bits()
> 
> So :
> 
> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
> to process backlog. Its already done in skb_splice_bits()

Yes, in the tcp_splice_read() they are added to remove a deadlock.

> 2) If we loop in tcp_read_sock() calling skb_splice_bits() several times
> then we should perform the following tests inside this loop ?
> 
> if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) ||
>    signal_pending(current)) break;
> 
> And remove them from tcp_splice_read() ?

It could be done, but for what reason? To detect a disconnected socket early?
Is it worth the changes?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-11 12:58                         ` Evgeniy Polyakov
@ 2009-01-11 13:14                           ` Eric Dumazet
  2009-01-11 13:35                             ` Evgeniy Polyakov
  0 siblings, 1 reply; 190+ messages in thread
From: Eric Dumazet @ 2009-01-11 13:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel,
	netdev, jens.axboe

Evgeniy Polyakov a écrit :
> Hi Eric.
> 
> On Sat, Jan 10, 2009 at 08:40:05AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
>>> Not to interrupt the discussion, but for the clarification, that
>>> release_sock/lock_sock is used to process the backlog accumulated while
>>> socket was locked. And while dropping additional pair before the final
>>> release is ok, but moving this itself should be thought of twice.
>>>
>> Hum... I just caught the release_sock(sk)/lock_sock(sk) done in skb_splice_bits()
>>
>> So :
>>
>> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
>> to process backlog. Its already done in skb_splice_bits()
> 
> Yes, in the tcp_splice_read() they are added to remove a deadlock.

Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ?

> 
>> 2) If we loop in tcp_read_sock() calling skb_splice_bits() several times
>> then we should perform the following tests inside this loop ?
>>
>> if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) ||
>>    signal_pending(current)) break;
>>
>> And remove them from tcp_splice_read() ?
> 
> It could be done, but for what reason? To detect a disconnected socket early?
> Is it worth the changes?
> 

I was thinking about the case your thread is doing a splice() from tcp socket to
 a pipe, while another thread is doing the splice from this pipe to something else.

Once patched, tcp_read_sock() could loop a long time...



^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-11 13:14                           ` Eric Dumazet
@ 2009-01-11 13:35                             ` Evgeniy Polyakov
  2009-01-11 16:00                               ` Eric Dumazet
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-11 13:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel,
	netdev, jens.axboe

On Sun, Jan 11, 2009 at 02:14:57PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> >> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
> >> to process backlog. Its already done in skb_splice_bits()
> > 
> > Yes, in the tcp_splice_read() they are added to remove a deadlock.
> 
> Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ?

Sorry, I meant that we drop lock in skb_splice_bits() to prevent the deadlock,
and tcp_splice_read() needs it to process the backlog.

I think that even with non-blocking splice that release_sock/lock_sock
is needed, since we are able to do a parallel job: to receive new data
(scheduled by early release_sock backlog processing) in bh and to
process already received data via splice codepath.
Maybe in non-blocking splice mode this is not an issue though, but for
the blocking mode this allows to grab more skbs at once in skb_splice_bits.

> > 
> >> 2) If we loop in tcp_read_sock() calling skb_splice_bits() several times
> >> then we should perform the following tests inside this loop ?
> >>
> >> if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) ||
> >>    signal_pending(current)) break;
> >>
> >> And remove them from tcp_splice_read() ?
> > 
> > It could be done, but for what reason? To detect a disconnected socket early?
> > Is it worth the changes?
> 
> I was thinking about the case your thread is doing a splice() from tcp socket to
>  a pipe, while another thread is doing the splice from this pipe to something else.
> 
> Once patched, tcp_read_sock() could loop a long time...
 
Well, it maybe a good idea... Can not say anything against it :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-11 13:35                             ` Evgeniy Polyakov
@ 2009-01-11 16:00                               ` Eric Dumazet
  2009-01-11 16:05                                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 190+ messages in thread
From: Eric Dumazet @ 2009-01-11 16:00 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel,
	netdev, jens.axboe

Evgeniy Polyakov a écrit :
> On Sun, Jan 11, 2009 at 02:14:57PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
>>>> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
>>>> to process backlog. Its already done in skb_splice_bits()
>>> Yes, in the tcp_splice_read() they are added to remove a deadlock.
>> Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ?
> 
> Sorry, I meant that we drop lock in skb_splice_bits() to prevent the deadlock,
> and tcp_splice_read() needs it to process the backlog.

While we drop lock in skb_splice_bits() to prevent the deadlock, we
also process backlog at this stage. No need to process backlog
again in the higher level function.

> 
> I think that even with non-blocking splice that release_sock/lock_sock
> is needed, since we are able to do a parallel job: to receive new data
> (scheduled by early release_sock backlog processing) in bh and to
> process already received data via splice codepath.
> Maybe in non-blocking splice mode this is not an issue though, but for
> the blocking mode this allows to grab more skbs at once in skb_splice_bits.

skb_splice_bits() operates on one skb, you lost me :)




^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-11 16:00                               ` Eric Dumazet
@ 2009-01-11 16:05                                 ` Evgeniy Polyakov
  2009-01-14  0:07                                   ` David Miller
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-11 16:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel,
	netdev, jens.axboe

On Sun, Jan 11, 2009 at 05:00:37PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> >>>> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
> >>>> to process backlog. Its already done in skb_splice_bits()
> >>> Yes, in the tcp_splice_read() they are added to remove a deadlock.
> >> Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ?
> > 
> > Sorry, I meant that we drop lock in skb_splice_bits() to prevent the deadlock,
> > and tcp_splice_read() needs it to process the backlog.
> 
> While we drop lock in skb_splice_bits() to prevent the deadlock, we
> also process backlog at this stage. No need to process backlog
> again in the higher level function.

Yes, but having it earlier allows to receive new skb while processing
already received.

> > I think that even with non-blocking splice that release_sock/lock_sock
> > is needed, since we are able to do a parallel job: to receive new data
> > (scheduled by early release_sock backlog processing) in bh and to
> > process already received data via splice codepath.
> > Maybe in non-blocking splice mode this is not an issue though, but for
> > the blocking mode this allows to grab more skbs at once in skb_splice_bits.
> 
> skb_splice_bits() operates on one skb, you lost me :)

Exactly, and to get that we release the socket earlier so that it can be
acked, and while we copy it or do anything else, the next one can be
received.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 23:34                         ` Eric Dumazet
@ 2009-01-13  5:45                           ` David Miller
  2009-01-14  0:05                           ` David Miller
  1 sibling, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-13  5:45 UTC (permalink / raw)
  To: dada1; +Cc: w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 10 Jan 2009 00:34:40 +0100

> David, do you think we still must call __tcp_splice_read() only once
> in tcp_splice_read() if SPLICE_F_NONBLOCK is set ?

Eric, I'll get to this thread as soon as I can, perhaps tomorrow.  I
want to get all of the build fallout and bug fixes for 2.6.29-rcX
sorted before everyone heads off to LCA in the next week or so :-)

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-08 22:20     ` Willy Tarreau
@ 2009-01-13 23:08       ` David Miller
  0 siblings, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-13 23:08 UTC (permalink / raw)
  To: w; +Cc: ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 8 Jan 2009 23:20:39 +0100

> On Thu, Jan 08, 2009 at 01:55:15PM -0800, David Miller wrote:
> > I'm not applying this until someone explains to me why
> > we should remove this test from the splice receive but
> > keep it in the tcp_recvmsg() code where it has been
> > essentially forever.
> 
> In my opinion, the code structure is different between both functions. In
> tcp_recvmsg(), we test for it if (copied > 0), where copied is the sum of
> all data which have been processed since the entry in the function. If we
> removed the test here, we could not break out of the loop once we have
> copied something. In tcp_splice_read(), the test is still present in the
> (!ret) code path, where ret is the last number of bytes processed, so the
> test is still performed regardless of what has been previously transferred.
> 
> So in summary, in tcp_splice_read without this test, we get back to the
> top of the loop, and if __tcp_splice_read() returns 0, then we break out
> of the loop.

Ok I see what you're saying, the !timeo check is only
necessary when the receive queue has been exhausted.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09  6:47     ` Eric Dumazet
  2009-01-09  7:04       ` Willy Tarreau
  2009-01-09 15:42       ` Eric Dumazet
@ 2009-01-13 23:26       ` David Miller
  2 siblings, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-13 23:26 UTC (permalink / raw)
  To: dada1; +Cc: ben, w, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 09 Jan 2009 07:47:16 +0100

> I found this patch useful in my testing, but had a feeling something
> was not complete. If the goal is to reduce the number of splice() calls,
> we should also reduce the number of wakeups. If splice() is used in
> non-blocking mode, there is nothing we can do here of course, since the
> application will use a poll()/select()/epoll() event before calling
> splice(). A good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might
> improve things.

Splice read does not handle SO_RCVLOWAT like tcp_recvmsg() does.

We should probably add a:

	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);

and check 'target' against 'spliced' in the main loop of
tcp_splice_read().
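
A rough sketch of how that could look in the tcp_splice_read() main loop
(untested; splice() has no MSG_WAITALL flag, so the waitall argument is
simply 0 here):

	target = sock_rcvlowat(sk, 0, len);
	...
	while (tss.len) {
		ret = __tcp_splice_read(sk, &tss);
		...
		tss.len -= ret;
		spliced += ret;
		/* stop once the low-water mark has been reached */
		if (spliced >= target)
			break;
		...
	}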

> About tcp_recvmsg(), we might remove the "!timeo" test as well, but
> more testing is needed. Keep in mind that if an application provides
> a large buffer to the tcp_recvmsg() call, removing the test will reduce
> the number of syscalls but might use more DCACHE. It could reduce
> performance on old cpus. With the splice() call, we expect not to
> copy memory and trash the DCACHE, and with pipe buffers being limited
> to 16, we cope with a limited working set. 

I sometimes have a suspicion we can remove this test too, but it's
not really that clear.

If an application is doing non-blocking reads and they care about
latency, they shouldn't be providing huge buffers.  This much I
agree with, but...

If you look at where this check is placed in the recvmsg() case, it is
done after we have verified that there is no socket backlog.

		if (copied >= target && !sk->sk_backlog.tail)
			break;

		if (copied) {
			if (sk->sk_err ||
			    sk->sk_state == TCP_CLOSE ||
			    (sk->sk_shutdown & RCV_SHUTDOWN) ||
			    !timeo ||
			    signal_pending(current))
				break;
		} else {

So either:

1) We haven't met the target.  And note that target is one unless
   the user makes an explicit receive low-water setting.

2) Or there is no backlog.

When we get to the 'if (copied)' check.

You can view this "!timeo" check as meaning "non-blocking".  When
we get to it, we are guarenteed that we haven't met the target
and we have no backlog.  So it is absolutely appropriate to break
out of recvmsg() processing here if non-blocking.

There is a lot of logic and feature handling in tcp_splice_read() and
that's why the semantics of "!timeo" cases are not being handled
properly here.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09  7:28         ` Eric Dumazet
  2009-01-09  7:42           ` Willy Tarreau
@ 2009-01-13 23:27           ` David Miller
  2009-01-13 23:35             ` Eric Dumazet
  1 sibling, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-13 23:27 UTC (permalink / raw)
  To: dada1; +Cc: w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 09 Jan 2009 08:28:09 +0100

> If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT,
> [32768], 4), it would be good if kernel was smart enough and could
> reduce number of wakeups.

Right, and as I pointed out in previous replies the problem is
that splice() receive in TCP doesn't check the low water mark
at all.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 15:42       ` Eric Dumazet
  2009-01-09 17:57         ` Eric Dumazet
  2009-01-09 18:54         ` Willy Tarreau
@ 2009-01-13 23:31         ` David Miller
  2 siblings, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-13 23:31 UTC (permalink / raw)
  To: dada1; +Cc: ben, w, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 09 Jan 2009 16:42:44 +0100

> David, if you referred to code at line 1374 of net/ipv4/tcp.c, I
> believe there is no issue with it. We really want to break from this
> loop if !timeo .

Correct, I agree, and I gave some detailed analysis of this in
another response :-)

> Willy's patch makes splice() behave like tcp_recvmsg(), but we might call
> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)

"Like", sure, but not the same since splice() lacks the low-water
and backlog checks.

> I wonder if the right fix should be done in tcp_read_sock() : this is the
> one who should eat several skbs IMHO, if we want optimal ACK generation.
> 
> We break out of its loop at line 1246
> 
> if (!desc->count) /* this test is always true */
> 	break;
> 
> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
> 
> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.

Yes, that's very odd.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-13 23:27           ` David Miller
@ 2009-01-13 23:35             ` Eric Dumazet
  0 siblings, 0 replies; 190+ messages in thread
From: Eric Dumazet @ 2009-01-13 23:35 UTC (permalink / raw)
  To: David Miller; +Cc: w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 09 Jan 2009 08:28:09 +0100
> 
>> If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT,
>> [32768], 4), it would be good if kernel was smart enough and could
>> reduce number of wakeups.
> 
> Right, and as I pointed out in previous replies the problem is
> that splice() receive in TCP doesn't check the low water mark
> at all.
> 
> 

Yes I understand, but if splice() is running, a wakeup occurred, and
there is no need to check whether the wakeup was good or not... just
proceed and consume some skbs, since we were already woken.

Then, an application might set a high SO_RCVLOWAT but want to splice()
only a few bytes, so RCVLOWAT is only a hint for the thing that performs
the wakeup, not for the consumer.



^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-09 23:34                         ` Eric Dumazet
  2009-01-13  5:45                           ` David Miller
@ 2009-01-14  0:05                           ` David Miller
  1 sibling, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-14  0:05 UTC (permalink / raw)
  To: dada1; +Cc: w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 10 Jan 2009 00:34:40 +0100

> David, do you think we still must call __tcp_splice_read() only once
> in tcp_splice_read() if SPLICE_F_NONBLOCK is set ?

You seem to be working that out in another thread :-)

> [PATCH] tcp: splice as many packets as possible at once
> 
> As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
> optimal. It processes at most one segment per call.
> This results in low performance and very high overhead due to syscall rate
> when splicing from interfaces which do not support LRO.
> 
> Willy provided a patch inside tcp_splice_read(), but a better fix
> is to let tcp_read_sock() process as many segments as possible, so
> that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less
> often.
> 
> With this change, splice() behaves like tcp_recvmsg(), being able
> to consume many skbs in one system call. With typical 1460 bytes
> of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return
> 16*1460 = 23360 bytes.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

I've applied this, thanks!

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-11 16:05                                 ` Evgeniy Polyakov
@ 2009-01-14  0:07                                   ` David Miller
  2009-01-14  0:13                                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-14  0:07 UTC (permalink / raw)
  To: zbr; +Cc: dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Sun, 11 Jan 2009 19:05:48 +0300

> On Sun, Jan 11, 2009 at 05:00:37PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > While we drop lock in skb_splice_bits() to prevent the deadlock, we
> > also process backlog at this stage. No need to process backlog
> > again in the higher level function.
> 
> Yes, but having it earlier allows to receive new skb while processing
> already received.
> 
> > > I think that even with non-blocking splice that release_sock/lock_sock
> > > is needed, since we are able to do a parallel job: to receive new data
> > > (scheduled by early release_sock backlog processing) in bh and to
> > > process already received data via splice codepath.
> > > Maybe in non-blocking splice mode this is not an issue though, but for
> > > the blocking mode this allows to grab more skbs at once in skb_splice_bits.
> > 
> > skb_splice_bits() operates on one skb, you lost me :)
> 
> Exactly, and to get that we release the socket earlier so that it can be
> acked, and while we copy it or do anything else, the next one can be
> received.

I think the socket release in skb_splice_bits() (although necessary)
just muddies the waters, and whether the extra one done in
tcp_splice_read() helps at all is open to debate.

That skb_clone() done by skb_splice_bits() pisses me off too,
we really ought to fix that.  And we also have that data corruption
bug to cure too.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  0:07                                   ` David Miller
@ 2009-01-14  0:13                                     ` Evgeniy Polyakov
  2009-01-14  0:16                                       ` David Miller
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-14  0:13 UTC (permalink / raw)
  To: David Miller
  Cc: dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Tue, Jan 13, 2009 at 04:07:21PM -0800, David Miller (davem@davemloft.net) wrote:
> > Exactly, and to get that we release the socket earlier so that it can be
> > acked, and while we copy it or do anything else, the next one can be
> > received.
> 
> I think the socket release in skb_splice_bits() (although necessary)
> just muddies the waters, and whether the extra one done in
> tcp_splice_read() helps at all is open to debate.

Well, yes, probably a simple performance test with and without will
clarify things.

> That skb_clone() done by skb_splice_bits() pisses me off too,
> we really ought to fix that.  And we also have that data corruption
> bug to cure too.

Clone is needed since tcp expects to own the skb and frees it
unconditionally via __kfree_skb().

What is the best solution for the data corruption bug? To copy the data
all the time or implement own allocator to be used in alloc_skb and
friends to allocate the head? I think it can be done transparently for
the drivers. I can volunteer for this :)
The first one is more appropriate for the current bugfix-only stage,
but this will result in the whole release being too slow.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  0:13                                     ` Evgeniy Polyakov
@ 2009-01-14  0:16                                       ` David Miller
  2009-01-14  0:22                                         ` Evgeniy Polyakov
  2009-01-14  0:51                                         ` Herbert Xu
  0 siblings, 2 replies; 190+ messages in thread
From: David Miller @ 2009-01-14  0:16 UTC (permalink / raw)
  To: zbr; +Cc: dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Wed, 14 Jan 2009 03:13:45 +0300

> What is the best solution for the data corruption bug? To copy the data
> all the time, or to implement our own allocator to be used in alloc_skb
> and friends to allocate the head? I think it can be done transparently for
> the drivers. I can volunteer for this :)
> The first one is more appropriate for the current bugfix-only stage,
> but this will result in the whole release being too slow.

I am looking into this right now.

I wish there were some way we could make this code grab and release a
reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to
accomplish its goals.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  0:16                                       ` David Miller
@ 2009-01-14  0:22                                         ` Evgeniy Polyakov
  2009-01-14  0:37                                           ` David Miller
  2009-01-14  0:51                                         ` Herbert Xu
  1 sibling, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-14  0:22 UTC (permalink / raw)
  To: David Miller
  Cc: dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Tue, Jan 13, 2009 at 04:16:25PM -0800, David Miller (davem@davemloft.net) wrote:
> I wish there were some way we could make this code grab and release a
> reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to
> accomplish its goals.

Ugh... A clone without cloning, just by increasing the dataref. Given
that splice only needs the skb to track the head of the original, this
may really work.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  0:22                                         ` Evgeniy Polyakov
@ 2009-01-14  0:37                                           ` David Miller
  2009-01-14  3:51                                             ` Herbert Xu
  0 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-14  0:37 UTC (permalink / raw)
  To: zbr; +Cc: dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Wed, 14 Jan 2009 03:22:52 +0300

> On Tue, Jan 13, 2009 at 04:16:25PM -0800, David Miller (davem@davemloft.net) wrote:
> > I wish there were some way we could make this code grab and release a
> > reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to
> > accomplish its goals.
> 
> Ugh... A clone without cloning, just by increasing the dataref. Given
> that splice only needs the skb to track the head of the original, this
> may really work.

Here is something I scrambled together, it is largely based upon
Jarek's patch:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..05126da 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -70,12 +70,17 @@
 static struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
+static void skb_release_data(struct sk_buff *skb);
+
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
 	struct sk_buff *skb = (struct sk_buff *) buf->private;
 
-	kfree_skb(skb);
+	if (skb)
+		skb_release_data(skb);
+	else
+		put_page(buf->page);
 }
 
 static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
@@ -83,7 +88,10 @@ static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
 {
 	struct sk_buff *skb = (struct sk_buff *) buf->private;
 
-	skb_get(skb);
+	if (skb)
+		atomic_inc(&skb_shinfo(skb)->dataref);
+	else
+		get_page(buf->page);
 }
 
 static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1336,7 +1344,10 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 {
 	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
 
-	kfree_skb(skb);
+	if (skb)
+		skb_release_data(skb);
+	else
+		put_page(spd->pages[i]);
 }
 
 /*
@@ -1344,7 +1355,7 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 				unsigned int len, unsigned int offset,
-				struct sk_buff *skb)
+				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
@@ -1352,8 +1363,15 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 	spd->pages[spd->nr_pages] = page;
 	spd->partial[spd->nr_pages].len = len;
 	spd->partial[spd->nr_pages].offset = offset;
-	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
+	spd->partial[spd->nr_pages].private =
+		(unsigned long) (linear ? skb : NULL);
 	spd->nr_pages++;
+
+	if (linear)
+		atomic_inc(&skb_shinfo(skb)->dataref);
+	else
+		get_page(page);
+
 	return 0;
 }
 
@@ -1369,7 +1387,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
 static inline int __splice_segment(struct page *page, unsigned int poff,
 				   unsigned int plen, unsigned int *off,
 				   unsigned int *len, struct sk_buff *skb,
-				   struct splice_pipe_desc *spd)
+				   struct splice_pipe_desc *spd, int linear)
 {
 	if (!*len)
 		return 1;
@@ -1392,7 +1410,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
 		/* the linear region may spread across several pages  */
 		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
 
-		if (spd_fill_page(spd, page, flen, poff, skb))
+		if (spd_fill_page(spd, page, flen, poff, skb, linear))
 			return 1;
 
 		__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1437,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 	if (__splice_segment(virt_to_page(skb->data),
 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
 			     skb_headlen(skb),
-			     offset, len, skb, spd))
+			     offset, len, skb, spd, 1))
 		return 1;
 
 	/*
@@ -1429,7 +1447,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
 		if (__splice_segment(f->page, f->page_offset, f->size,
-				     offset, len, skb, spd))
+				     offset, len, skb, spd, 0))
 			return 1;
 	}
 

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  0:16                                       ` David Miller
  2009-01-14  0:22                                         ` Evgeniy Polyakov
@ 2009-01-14  0:51                                         ` Herbert Xu
  2009-01-14  1:24                                           ` David Miller
  1 sibling, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-01-14  0:51 UTC (permalink / raw)
  To: David Miller
  Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

David Miller <davem@davemloft.net> wrote:
>
> I wish there were some way we could make this code grab and release a
> reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to
> accomplish its goals.

We can probably do that for spliced data that end up going to
the networking stack again.  However, as splice is generic the
data may be headed to a destination other than the network stack.

In that case to make dataref work we'd need some mechanism
that allows non-network entities to get/drop this ref count.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  0:51                                         ` Herbert Xu
@ 2009-01-14  1:24                                           ` David Miller
  0 siblings, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-14  1:24 UTC (permalink / raw)
  To: herbert
  Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 14 Jan 2009 11:51:10 +1100

> We can probably do that for spliced data that end up going to
> the networking stack again.  However, as splice is generic the
> data may be headed to a destination other than the network stack.
> 
> In that case to make dataref work we'd need some mechanism
> that allows non-network entities to get/drop this ref count.

I see.

I'll think about this some more.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  0:37                                           ` David Miller
@ 2009-01-14  3:51                                             ` Herbert Xu
  2009-01-14  4:25                                               ` David Miller
  2009-01-14  7:27                                               ` David Miller
  0 siblings, 2 replies; 190+ messages in thread
From: Herbert Xu @ 2009-01-14  3:51 UTC (permalink / raw)
  To: David Miller
  Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

David Miller <davem@davemloft.net> wrote:
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5110b35..05126da 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -70,12 +70,17 @@
> static struct kmem_cache *skbuff_head_cache __read_mostly;
> static struct kmem_cache *skbuff_fclone_cache __read_mostly;
> 
> +static void skb_release_data(struct sk_buff *skb);
> +
> static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
>                                  struct pipe_buffer *buf)
> {
>        struct sk_buff *skb = (struct sk_buff *) buf->private;
> 
> -       kfree_skb(skb);
> +       if (skb)
> +               skb_release_data(skb);
> +       else
> +               put_page(buf->page);
> }

Unfortunately this won't work, not even for network destinations.

The reason is that this gets called as soon as the destination's
splice hook returns, for networking that means when sendpage returns.

So by that time we'll still be left with just a page reference
on a page where the slab memory may already have been freed.

To make this work we need to get the destination's splice hooks
to acquire this reference.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
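
(A rough timeline of the socket-to-socket case described above, pieced
together from this thread; illustration only:)

  1. splice(sock_in -> pipe): skb_splice_bits() stores
     virt_to_page(skb->data) in the pipe, holding page and skb references.
  2. splice(pipe -> sock_out): pipe_to_sendpage() -> ->sendpage() takes a
     page reference only, queues the data and returns; the pipe buffer is
     then released and the skb reference dropped.
  3. tcp_read_sock() frees the skb, so the kmalloc()ed skb->head goes back
     to the slab.
  4. The slab reuses that memory before sock_out has actually transmitted
     the page, and the bytes on the wire are corrupted.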

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  3:51                                             ` Herbert Xu
@ 2009-01-14  4:25                                               ` David Miller
  2009-01-14  7:27                                               ` David Miller
  1 sibling, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-14  4:25 UTC (permalink / raw)
  To: herbert
  Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 14 Jan 2009 14:51:24 +1100

> Unfortunately this won't work, not even for network destinations.
> 
> The reason is that this gets called as soon as the destination's
> splice hook returns, for networking that means when sendpage returns.
> 
> So by that time we'll still be left with just a page reference
> on a page where the slab memory may already have been freed.
> 
> To make this work we need to get the destination's splice hooks
> to acquire this reference.

Yes I realized this after your earlier posting today.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  3:51                                             ` Herbert Xu
  2009-01-14  4:25                                               ` David Miller
@ 2009-01-14  7:27                                               ` David Miller
  2009-01-14  8:26                                                 ` Herbert Xu
  1 sibling, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-14  7:27 UTC (permalink / raw)
  To: herbert
  Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 14 Jan 2009 14:51:24 +1100

> Unfortunately this won't work, not even for network destinations.
> 
> The reason is that this gets called as soon as the destination's
> splice hook returns, for networking that means when sendpage returns.
> 
> So by that time we'll still be left with just a page reference
> on a page where the slab memory may already have been freed.
> 
> To make this work we need to get the destination's splice hooks
> to acquire this reference.

So while trying to figure out a sane way to fix this, I found
another bug:

	/*
	 * map the linear part
	 */
	if (__splice_segment(virt_to_page(skb->data),
			     (unsigned long) skb->data & (PAGE_SIZE - 1),
			     skb_headlen(skb),
			     offset, len, skb, spd))
		return 1;

This will explode if the SLAB cache for skb->head is using compound
(ie. order > 0) pages.

For example, if this is an order-1 page being used for the skb->head
data (which would be true on most systems for jumbo MTU frames being
received into a linear SKB), the offset will be wrong and depending
upon skb_headlen() we could reference past the end of that
non-compound page we will end up grabbing a reference to.

And then we'll end up with a compound page in an skb_shinfo() frag
array, which is illegal.

Well, at least, I can list several drivers that will barf when
trying to TX that (Acenic, atlx1, cassini, jme, sungem), since
they use pci_map_page(... virt_to_page(skb->data)) or similar.

The core KMAP'ing support for SKBs will also not be able to grok
such a beastly SKB.


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  7:27                                               ` David Miller
@ 2009-01-14  8:26                                                 ` Herbert Xu
  2009-01-14  8:53                                                   ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-01-14  8:26 UTC (permalink / raw)
  To: David Miller
  Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

On Tue, Jan 13, 2009 at 11:27:10PM -0800, David Miller wrote:
> 
> So while trying to figure out a sane way to fix this, I found
> another bug:
> 
> 	/*
> 	 * map the linear part
> 	 */
> 	if (__splice_segment(virt_to_page(skb->data),
> 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
> 			     skb_headlen(skb),
> 			     offset, len, skb, spd))
> 		return 1;
> 
> This will explode if the SLAB cache for skb->head is using compound
> (ie. order > 0) pages.
> 
> For example, if this is an order-1 page being used for the skb->head
> data (which would be true on most systems for jumbo MTU frames being
> received into a linear SKB), the offset will be wrong and depending
> upon skb_headlen() we could reference past the end of that
> non-compound page we will end up grabbing a reference to.

I'm actually not worried so much about these packets since these
drivers should be converted to skb frags as otherwise they'll
probably stop working after a while due to memory fragmentation.

But yeah for correctness we definitely should address this in
skb_splice_bits.

I still think Jarek's approach (the copying one) is probably the
easiest for now until we can find a better way.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  8:26                                                 ` Herbert Xu
@ 2009-01-14  8:53                                                   ` Jarek Poplawski
  2009-01-14  9:29                                                     ` David Miller
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-14  8:53 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, dada1, w, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Wed, Jan 14, 2009 at 07:26:30PM +1100, Herbert Xu wrote:
> On Tue, Jan 13, 2009 at 11:27:10PM -0800, David Miller wrote:
> > 
> > So while trying to figure out a sane way to fix this, I found
> > another bug:
> > 
> > 	/*
> > 	 * map the linear part
> > 	 */
> > 	if (__splice_segment(virt_to_page(skb->data),
> > 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
> > 			     skb_headlen(skb),
> > 			     offset, len, skb, spd))
> > 		return 1;
> > 
> > This will explode if the SLAB cache for skb->head is using compound
> > (ie. order > 0) pages.
> > 
> > For example, if this is an order-1 page being used for the skb->head
> > data (which would be true on most systems for jumbo MTU frames being
> > received into a linear SKB), the offset will be wrong and depending
> > upon skb_headlen() we could reference past the end of that
> > non-compound page we will end up grabbing a reference to.
> 
> I'm actually not worried so much about these packets since these
> drivers should be converted to skb frags as otherwise they'll
> probably stop working after a while due to memory fragmentation.
> 
> But yeah for correctness we definitely should address this in
> skb_splice_bits.
> 
> I still think Jarek's approach (the copying one) is probably the
> easiest for now until we can find a better way.
> 

Actually, I still think my second approach (the PageSlab) is probably
(if tested) the easiest for now, because it should fix the reported
(Willy's) problem, without any change or copy overhead for splice to
file (which could be still wrong, but not obviously wrong). Then we
could search for the right long-term fix (which is most probably around
Herbert's new skb page allocator). IMHO "my" "copying approach" is too
risky, e.g. for stable, because of unknown memory requirements,
especially for some larger page-size configs/systems.

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  8:53                                                   ` Jarek Poplawski
@ 2009-01-14  9:29                                                     ` David Miller
  2009-01-14  9:42                                                       ` Jarek Poplawski
                                                                         ` (3 more replies)
  0 siblings, 4 replies; 190+ messages in thread
From: David Miller @ 2009-01-14  9:29 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 14 Jan 2009 08:53:08 +0000

> Actually, I still think my second approach (the PageSlab) is probably
> (if tested) the easiest for now, because it should fix the reported
> (Willy's) problem, without any change or copy overhead for splice to
> file (which could be still wrong, but not obviously wrong).

It's a simple fix, but as Herbert stated it leaves other ->sendpage()
implementations exposed to data corruption when the from side of the
pipe buffer is a socket.

That, to me, is almost worse than a bad fix.

It's definitely worse than a slower but full fix, which the copy
patch is.

Therefore what I'll likely do is push Jarek's copy based cure,
and meanwhile we can brainstorm some more on how to fix this
properly in the long term.

So, I've put together a full commit message and Jarek's patch
below.  One thing I notice is that the silly skb_clone() done
by SKB splicing is no longer necessary.

We could get rid of that to offset (some) of the cost we are
adding with this bug fix.

Comments?

net: Fix data corruption when splicing from sockets.

From: Jarek Poplawski <jarkao2@gmail.com>

The trick in socket splicing where we try to convert the skb->data
into a page based reference using virt_to_page() does not work so
well.

The idea is to pass the virt_to_page() reference via the pipe
buffer, and refcount the buffer using a SKB reference.

But if we are splicing from a socket to a socket (via sendpage)
this doesn't work.

The from side processing will grab the page (and SKB) references.
The sendpage() calls will grab page references only, return, and
then the from side processing completes and drops the SKB ref.

The page based reference to skb->data is not enough to keep the
kmalloc() buffer backing it from being reused.  Yet, that is
all that the socket send side has at this point.

This leads to data corruption if the skb->data buffer is reused
by SLAB before the send side socket actually gets the TX packet
out to the device.

The fix employed here is to simply allocate a page and copy the
skb->data bytes into that page.

This will hurt performance, but there is no clear way to fix this
properly without a copy at the present time, and it is important
to get rid of the data corruption.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..6e43d52 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	kfree_skb(skb);
+	put_page(buf->page);
 }
 
 static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	skb_get(skb);
+	get_page(buf->page);
 }
 
 static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
  */
 static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 {
-	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+	put_page(spd->pages[i]);
+}
 
-	kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+					  unsigned int offset)
+{
+	struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+	if (!p)
+		return NULL;
+	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+	return p;
 }
 
 /*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 				unsigned int len, unsigned int offset,
-				struct sk_buff *skb)
+				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
 
+	if (linear) {
+		page = linear_to_page(page, len, offset);
+		if (!page)
+			return 1;
+	}
+
 	spd->pages[spd->nr_pages] = page;
 	spd->partial[spd->nr_pages].len = len;
 	spd->partial[spd->nr_pages].offset = offset;
-	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
 	spd->nr_pages++;
+	get_page(page);
+
 	return 0;
 }
 
@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
 static inline int __splice_segment(struct page *page, unsigned int poff,
 				   unsigned int plen, unsigned int *off,
 				   unsigned int *len, struct sk_buff *skb,
-				   struct splice_pipe_desc *spd)
+				   struct splice_pipe_desc *spd, int linear)
 {
 	if (!*len)
 		return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
 		/* the linear region may spread across several pages  */
 		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
 
-		if (spd_fill_page(spd, page, flen, poff, skb))
+		if (spd_fill_page(spd, page, flen, poff, skb, linear))
 			return 1;
 
 		__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 	if (__splice_segment(virt_to_page(skb->data),
 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
 			     skb_headlen(skb),
-			     offset, len, skb, spd))
+			     offset, len, skb, spd, 1))
 		return 1;
 
 	/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
 		if (__splice_segment(f->page, f->page_offset, f->size,
-				     offset, len, skb, spd))
+				     offset, len, skb, spd, 0))
 			return 1;
 	}
 

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  9:29                                                     ` David Miller
@ 2009-01-14  9:42                                                       ` Jarek Poplawski
  2009-01-14 10:06                                                         ` David Miller
  2009-01-14  9:54                                                       ` Jarek Poplawski
                                                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-14  9:42 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe

On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Wed, 14 Jan 2009 08:53:08 +0000
> 
> > Actually, I still think my second approach (the PageSlab) is probably
> > (if tested) the easiest for now, because it should fix the reported
> > (Willy's) problem, without any change or copy overhead for splice to
> > file (which could be still wrong, but not obviously wrong).
> 
> It's a simple fix, but as Herbert stated it leaves other ->sendpage()
> implementations exposed to data corruption when the from side of the
> pipe buffer is a socket.

I don't think Herbert meant other ->sendpage() implementations, but I
could miss something.

> That, to me, is almost worse than a bad fix.
> 
> It's definitely worse than a slower but full fix, which the copy
> patch is.

Sorry, I can't see how this patch could make sendpage worse.

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  9:29                                                     ` David Miller
  2009-01-14  9:42                                                       ` Jarek Poplawski
@ 2009-01-14  9:54                                                       ` Jarek Poplawski
  2009-01-14 10:01                                                         ` Willy Tarreau
                                                                           ` (2 more replies)
  2009-01-14 11:28                                                       ` Herbert Xu
  2009-01-15 23:03                                                       ` Willy Tarreau
  3 siblings, 3 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-14  9:54 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe

On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
...
> Therefore what I'll likely do is push Jarek's copy based cure,
> and meanwhile we can brainstorm some more on how to fix this
> properly in the long term.
> 
> So, I've put together a full commit message and Jarek's patch
> below.  One thing I notice is that the silly skb_clone() done
> by SKB splicing is no longer necessary.
> 
> We could get rid of that to offset (some) of the cost we are
> adding with this bug fix.
> 
> Comments?

Yes, this should lessen the overhead a bit.

> 
> net: Fix data corruption when splicing from sockets.
> 
> From: Jarek Poplawski <jarkao2@gmail.com>
> 
> The trick in socket splicing where we try to convert the skb->data
> into a page based reference using virt_to_page() does not work so
> well.
> 
> The idea is to pass the virt_to_page() reference via the pipe
> buffer, and refcount the buffer using a SKB reference.
> 
> But if we are splicing from a socket to a socket (via sendpage)
> this doesn't work.
> 
> The from side processing will grab the page (and SKB) references.
> The sendpage() calls will grab page references only, return, and
> then the from side processing completes and drops the SKB ref.
> 
> The page based reference to skb->data is not enough to keep the
> kmalloc() buffer backing it from being reused.  Yet, that is
> all that the socket send side has at this point.
> 
> This leads to data corruption if the skb->data buffer is reused
> by SLAB before the send side socket actually gets the TX packet
> out to the device.
> 
> The fix employed here is to simply allocate a page and copy the
> skb->data bytes into that page.
> 
> This will hurt performance, but there is no clear way to fix this
> properly without a copy at the present time, and it is important
> to get rid of the data corruption.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

You are very brave! I'd prefer to wait for at least minimal testing
by Willy...

Thanks,
Jarek P.

BTW, an skb parameter could be removed from spd_fill_page() to make it
even faster...

...
>  static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
>  				unsigned int len, unsigned int offset,
> -				struct sk_buff *skb)
> +				struct sk_buff *skb, int linear)
>  {
>  	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
>  		return 1;
>  
> +	if (linear) {
> +		page = linear_to_page(page, len, offset);
> +		if (!page)
> +			return 1;
> +	}
> +
>  	spd->pages[spd->nr_pages] = page;
>  	spd->partial[spd->nr_pages].len = len;
>  	spd->partial[spd->nr_pages].offset = offset;
> -	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
>  	spd->nr_pages++;
> +	get_page(page);
> +
>  	return 0;
>  }
>  
> @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
>  static inline int __splice_segment(struct page *page, unsigned int poff,
>  				   unsigned int plen, unsigned int *off,
>  				   unsigned int *len, struct sk_buff *skb,
> -				   struct splice_pipe_desc *spd)
> +				   struct splice_pipe_desc *spd, int linear)
>  {
>  	if (!*len)
>  		return 1;
> @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
>  		/* the linear region may spread across several pages  */
>  		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
>  
> -		if (spd_fill_page(spd, page, flen, poff, skb))
> +		if (spd_fill_page(spd, page, flen, poff, skb, linear))
>  			return 1;
>  
>  		__segment_seek(&page, &poff, &plen, flen);
> @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
>  	if (__splice_segment(virt_to_page(skb->data),
>  			     (unsigned long) skb->data & (PAGE_SIZE - 1),
>  			     skb_headlen(skb),
> -			     offset, len, skb, spd))
> +			     offset, len, skb, spd, 1))
>  		return 1;
>  
>  	/*
> @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
>  		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
>  
>  		if (__splice_segment(f->page, f->page_offset, f->size,
> -				     offset, len, skb, spd))
> +				     offset, len, skb, spd, 0))
>  			return 1;
>  	}
>  

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  9:54                                                       ` Jarek Poplawski
@ 2009-01-14 10:01                                                         ` Willy Tarreau
  2009-01-14 12:06                                                         ` Jarek Poplawski
  2009-01-14 12:15                                                         ` Jarek Poplawski
  2 siblings, 0 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-14 10:01 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote:
> You are very brave! I'd prefer to wait for at least minimal testing
> by Willy...

Will test, probably this evening.

Thanks guys,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  9:42                                                       ` Jarek Poplawski
@ 2009-01-14 10:06                                                         ` David Miller
  2009-01-14 10:47                                                           ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-14 10:06 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 14 Jan 2009 09:42:16 +0000

> On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
> > From: Jarek Poplawski <jarkao2@gmail.com>
> > Date: Wed, 14 Jan 2009 08:53:08 +0000
> > 
> > > Actually, I still think my second approach (the PageSlab) is probably
> > > (if tested) the easiest for now, because it should fix the reported
> > > (Willy's) problem, without any change or copy overhead for splice to
> > > file (which could be still wrong, but not obviously wrong).
> > 
> > It's a simple fix, but as Herbert stated it leaves other ->sendpage()
> > implementations exposed to data corruption when the from side of the
> > pipe buffer is a socket.
> 
> I don't think Herbert meant other ->sendpage() implementations, but I
> could miss something.

I think he did :-)

Or, more generally, he could have been referring to splice pipe
outputs.  All of these things grab references to pages and
expect that to keep the underlying data from being reallocated.

That doesn't work for this skb->data case.

> > That, to me, is almost worse than a bad fix.
> > 
> > It's definitely worse than a slower but full fix, which the copy
> > patch is.
> 
> Sorry, I can't see how this patch could make sendpage worse.

Because that patch only fixes TCP's ->sendpage() implementation.

There are others out there which could end up experiencing similar
data corruption.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14 10:06                                                         ` David Miller
@ 2009-01-14 10:47                                                           ` Jarek Poplawski
  2009-01-14 11:29                                                             ` Herbert Xu
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-14 10:47 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe

On Wed, Jan 14, 2009 at 02:06:37AM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Wed, 14 Jan 2009 09:42:16 +0000
> 
> > On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
> > > From: Jarek Poplawski <jarkao2@gmail.com>
> > > Date: Wed, 14 Jan 2009 08:53:08 +0000
> > > 
> > > > Actually, I still think my second approach (the PageSlab) is probably
> > > > (if tested) the easiest for now, because it should fix the reported
> > > > (Willy's) problem, without any change or copy overhead for splice to
> > > > file (which could be still wrong, but not obviously wrong).
> > > 
> > > It's a simple fix, but as Herbert stated it leaves other ->sendpage()
> > > implementations exposed to data corruption when the from side of the
> > > pipe buffer is a socket.
> > 
> > I don't think Herbert meant other ->sendpage() implementations, but I
> > could miss something.
> 
> I think he did :-)

I hope Herbert will make it clear.

> 
> Or, more generally, he could have been referring to splice pipe
> outputs.  All of these things grab references to pages and
> expect that to keep the underlying data from being reallocated.
> 
> That doesn't work for this skb->data case.
> 
> > > That, to me, is almost worse than a bad fix.
> > > 
> > > It's definitely worse than a slower but full fix, which the copy
> > > patch is.
> > 
> > Sorry, I can't see how this patch could make sendpage worse.
> 
> Because that patch only fixes TCP's ->sendpage() implementation.

Yes, it fixes (I hope) the only reported implementation.

> 
> There are others out there which could end up experiencing similar
> data corruption.

Since this fix is very simple (and IMHO safe) it could probably be
used elsewhere too, but I doubt we should care at the moment.

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  9:29                                                     ` David Miller
  2009-01-14  9:42                                                       ` Jarek Poplawski
  2009-01-14  9:54                                                       ` Jarek Poplawski
@ 2009-01-14 11:28                                                       ` Herbert Xu
  2009-01-15 23:03                                                       ` Willy Tarreau
  3 siblings, 0 replies; 190+ messages in thread
From: Herbert Xu @ 2009-01-14 11:28 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe

On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
> 
> It's a simple fix, but as Herbert stated it leaves other ->sendpage()
> implementations exposed to data corruption when the from side of the
> pipe buffer is a socket.
> 
> That, to me, is almost worse than a bad fix.

Yep, so far nobody has verified the disk path at all.  So for all
we know, if there is a delay on the way to disk, the exact same
thing can occur.

Besides, the PageSlab thing is going to copy for network to network
anyway.

> So, I've put together a full commit message and Jarek's patch
> below.  One thing I notice is that the silly skb_clone() done
> by SKB splicing is no longer necessary.
> 
> We could get rid of that to offset (some) of the cost we are
> adding with this bug fix.
> 
> Comments?

Yes that's probably a good idea.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14 10:47                                                           ` Jarek Poplawski
@ 2009-01-14 11:29                                                             ` Herbert Xu
  2009-01-14 11:40                                                               ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-01-14 11:29 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, zbr, dada1, w, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote:
>
> > > I don't think Herbert meant other ->sendpage() implementations, but I
> > > could miss something.
> > 
> > I think he did :-)
> 
> I hope Herbert will make it clear.

Yes I did mean the other splice paths, and in particular, the path
to disk.  I hope that's clear enough :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14 11:29                                                             ` Herbert Xu
@ 2009-01-14 11:40                                                               ` Jarek Poplawski
  2009-01-14 11:45                                                                 ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-14 11:40 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, dada1, w, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Wed, Jan 14, 2009 at 10:29:03PM +1100, Herbert Xu wrote:
> On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote:
> >
> > > > I don't think Herbert meant other ->sendpage() implementations, but I
> > > > could miss something.
> > > 
> > > I think he did :-)
> > 
> > I hope Herbert will make it clear.
> 
> Yes I did mean the other splice paths, and in particular, the path
> to disk.  I hope that's clear enough :)

So, I think I got it right: otherwise it would be enough to make this
copy to a new page only before ->sendpage() calls e.g. in
generic_splice_sendpage().

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14 11:40                                                               ` Jarek Poplawski
@ 2009-01-14 11:45                                                                 ` Jarek Poplawski
  0 siblings, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-14 11:45 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, dada1, w, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Wed, Jan 14, 2009 at 11:40:59AM +0000, Jarek Poplawski wrote:
> On Wed, Jan 14, 2009 at 10:29:03PM +1100, Herbert Xu wrote:
> > On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote:
> > >
> > > > > I don't think Herbert meant other ->sendpage() implementations, but I
> > > > > could miss something.
> > > > 
> > > > I think he did :-)
> > > 
> > > I hope Herbert will make it clear.
> > 
> > Yes I did mean the other splice paths, and in particular, the path
> > to disk.  I hope that's clear enough :)
> 
> So, I think I got it right: otherwise it would be enough to make this
> copy to a new page only before ->sendpage() calls e.g. in
> generic_splice_sendpage().

Hmm... Actually in pipe_to_sendpage().

Jarek P.
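
(A rough sketch of the alternative being described here: copy only on the
way to a socket, right before ->sendpage(). The helper below is hypothetical
and not from any posted patch; it only assumes the page/offset/len fields of
struct pipe_buffer:)

	/* Hypothetical helper: give the socket a page it can safely hold
	 * a reference to, instead of the slab-backed one in the pipe. */
	static struct page *pipe_buf_to_page(const struct pipe_buffer *buf,
					     gfp_t gfp)
	{
		struct page *p = alloc_page(gfp);

		if (!p)
			return NULL;
		memcpy(page_address(p) + buf->offset,
		       page_address(buf->page) + buf->offset, buf->len);
		return p;	/* caller owns the single reference */
	}

Doing the copy in pipe_to_sendpage() rather than in skb_splice_bits() would
keep splice-to-file zero-copy and make only socket destinations pay for it.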

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  9:54                                                       ` Jarek Poplawski
  2009-01-14 10:01                                                         ` Willy Tarreau
@ 2009-01-14 12:06                                                         ` Jarek Poplawski
  2009-01-14 12:15                                                         ` Jarek Poplawski
  2 siblings, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-14 12:06 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev,
	jens.axboe, Changli Gao

On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote:
> On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
...
> > net: Fix data corruption when splicing from sockets.
> > 
> > From: Jarek Poplawski <jarkao2@gmail.com>
...
> > 
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> You are very brave! I'd prefer to wait for at least minimal testing
> by Willy...

On the other hand, I can't waste your trust like this (plus I'm not
suicidal), so a little update:

Based on a review by Changli Gao <xiaosuo@gmail.com>:
http://lkml.org/lkml/2008/2/26/210

Foreseen-by: Changli Gao <xiaosuo@gmail.com>
Diagnosed-by: Willy Tarreau <w@1wt.eu>
Reported-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

> 
> Thanks,
> Jarek P.
> 
> BTW, an skb parameter could be removed from spd_fill_page() to make it
> even faster...
> 
> ...
> >  static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> >  				unsigned int len, unsigned int offset,
> > -				struct sk_buff *skb)
> > +				struct sk_buff *skb, int linear)
> >  {
> >  	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> >  		return 1;
> >  
> > +	if (linear) {
> > +		page = linear_to_page(page, len, offset);
> > +		if (!page)
> > +			return 1;
> > +	}
> > +
> >  	spd->pages[spd->nr_pages] = page;
> >  	spd->partial[spd->nr_pages].len = len;
> >  	spd->partial[spd->nr_pages].offset = offset;
> > -	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> >  	spd->nr_pages++;
> > +	get_page(page);
> > +
> >  	return 0;
> >  }
> >  
> > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
> >  static inline int __splice_segment(struct page *page, unsigned int poff,
> >  				   unsigned int plen, unsigned int *off,
> >  				   unsigned int *len, struct sk_buff *skb,
> > -				   struct splice_pipe_desc *spd)
> > +				   struct splice_pipe_desc *spd, int linear)
> >  {
> >  	if (!*len)
> >  		return 1;
> > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
> >  		/* the linear region may spread across several pages  */
> >  		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
> >  
> > -		if (spd_fill_page(spd, page, flen, poff, skb))
> > +		if (spd_fill_page(spd, page, flen, poff, skb, linear))
> >  			return 1;
> >  
> >  		__segment_seek(&page, &poff, &plen, flen);
> > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> >  	if (__splice_segment(virt_to_page(skb->data),
> >  			     (unsigned long) skb->data & (PAGE_SIZE - 1),
> >  			     skb_headlen(skb),
> > -			     offset, len, skb, spd))
> > +			     offset, len, skb, spd, 1))
> >  		return 1;
> >  
> >  	/*
> > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> >  		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
> >  
> >  		if (__splice_segment(f->page, f->page_offset, f->size,
> > -				     offset, len, skb, spd))
> > +				     offset, len, skb, spd, 0))
> >  			return 1;
> >  	}
> >  

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  9:54                                                       ` Jarek Poplawski
  2009-01-14 10:01                                                         ` Willy Tarreau
  2009-01-14 12:06                                                         ` Jarek Poplawski
@ 2009-01-14 12:15                                                         ` Jarek Poplawski
  2 siblings, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-14 12:15 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev,
	jens.axboe, Changli Gao

Sorry: take 2:

On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote:
> On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
...
> > net: Fix data corruption when splicing from sockets.
> > 
> > From: Jarek Poplawski <jarkao2@gmail.com>
...
> > 
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> You are very brave! I'd prefer to wait for at least minimal testing
> by Willy...

On the other hand, I can't waste your trust like this (plus I'm not
suicidal), so a little update:

Based on a review by Changli Gao <xiaosuo@gmail.com>:
http://lkml.org/lkml/2008/2/26/210

Foreseen-by: Changli Gao <xiaosuo@gmail.com>
Diagnosed-by: Willy Tarreau <w@1wt.eu>
Reported-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Fixed-by: Jens Axboe <jens.axboe@oracle.com>

> 
> Thanks,
> Jarek P.
> 
> BTW, an skb parameter could be removed from spd_fill_page() to make it
> even faster...
> 
> ...
> >  static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> >  				unsigned int len, unsigned int offset,
> > -				struct sk_buff *skb)
> > +				struct sk_buff *skb, int linear)
> >  {
> >  	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> >  		return 1;
> >  
> > +	if (linear) {
> > +		page = linear_to_page(page, len, offset);
> > +		if (!page)
> > +			return 1;
> > +	}
> > +
> >  	spd->pages[spd->nr_pages] = page;
> >  	spd->partial[spd->nr_pages].len = len;
> >  	spd->partial[spd->nr_pages].offset = offset;
> > -	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> >  	spd->nr_pages++;
> > +	get_page(page);
> > +
> >  	return 0;
> >  }
> >  
> > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
> >  static inline int __splice_segment(struct page *page, unsigned int poff,
> >  				   unsigned int plen, unsigned int *off,
> >  				   unsigned int *len, struct sk_buff *skb,
> > -				   struct splice_pipe_desc *spd)
> > +				   struct splice_pipe_desc *spd, int linear)
> >  {
> >  	if (!*len)
> >  		return 1;
> > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
> >  		/* the linear region may spread across several pages  */
> >  		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
> >  
> > -		if (spd_fill_page(spd, page, flen, poff, skb))
> > +		if (spd_fill_page(spd, page, flen, poff, skb, linear))
> >  			return 1;
> >  
> >  		__segment_seek(&page, &poff, &plen, flen);
> > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> >  	if (__splice_segment(virt_to_page(skb->data),
> >  			     (unsigned long) skb->data & (PAGE_SIZE - 1),
> >  			     skb_headlen(skb),
> > -			     offset, len, skb, spd))
> > +			     offset, len, skb, spd, 1))
> >  		return 1;
> >  
> >  	/*
> > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> >  		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
> >  
> >  		if (__splice_segment(f->page, f->page_offset, f->size,
> > -				     offset, len, skb, spd))
> > +				     offset, len, skb, spd, 0))
> >  			return 1;
> >  	}
> >  

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-14  9:29                                                     ` David Miller
                                                                         ` (2 preceding siblings ...)
  2009-01-14 11:28                                                       ` Herbert Xu
@ 2009-01-15 23:03                                                       ` Willy Tarreau
  2009-01-15 23:19                                                         ` David Miller
  2009-01-15 23:19                                                         ` Herbert Xu
  3 siblings, 2 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-15 23:03 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
> Therefore what I'll likely do is push Jarek's copy based cure,
> and meanwhile we can brainstorm some more on how to fix this
> properly in the long term.
> 
> So, I've put together a full commit message and Jarek's patch
> below.  One thing I notice is that the silly skb_clone() done
> by SKB splicing is no longer necessary.
> 
> We could get rid of that to offset (some) of the cost we are
> adding with this bug fix.
> 
> Comments?

David,

please don't merge it as-is. I've just tried it and got an OOM
in __alloc_pages_internal after a few seconds of data transfer.

I'm leaving the patch below for comments, maybe someone will spot
something? Don't we need at least one kfree() somewhere to match
alloc_pages()?

Regards,
Willy

--

> 
> net: Fix data corruption when splicing from sockets.
> 
> From: Jarek Poplawski <jarkao2@gmail.com>
> 
> The trick in socket splicing where we try to convert the skb->data
> into a page based reference using virt_to_page() does not work so
> well.
> 
> The idea is to pass the virt_to_page() reference via the pipe
> buffer, and refcount the buffer using a SKB reference.
> 
> But if we are splicing from a socket to a socket (via sendpage)
> this doesn't work.
> 
> The from side processing will grab the page (and SKB) references.
> The sendpage() calls will grab page references only, return, and
> then the from side processing completes and drops the SKB ref.
> 
> The page based reference to skb->data is not enough to keep the
> kmalloc() buffer backing it from being reused.  Yet, that is
> all that the socket send side has at this point.
> 
> This leads to data corruption if the skb->data buffer is reused
> by SLAB before the send side socket actually gets the TX packet
> out to the device.
> 
> The fix employed here is to simply allocate a page and copy the
> skb->data bytes into that page.
> 
> This will hurt performance, but there is no clear way to fix this
> properly without a copy at the present time, and it is important
> to get rid of the data corruption.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5110b35..6e43d52 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
>  static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
>  				  struct pipe_buffer *buf)
>  {
> -	struct sk_buff *skb = (struct sk_buff *) buf->private;
> -
> -	kfree_skb(skb);
> +	put_page(buf->page);
>  }
>  
>  static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
>  				struct pipe_buffer *buf)
>  {
> -	struct sk_buff *skb = (struct sk_buff *) buf->private;
> -
> -	skb_get(skb);
> +	get_page(buf->page);
>  }
>  
>  static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
> @@ -1334,9 +1330,19 @@ fault:
>   */
>  static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
>  {
> -	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
> +	put_page(spd->pages[i]);
> +}
>  
> -	kfree_skb(skb);
> +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> +					  unsigned int offset)
> +{
> +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> +
> +	if (!p)
> +		return NULL;
> +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> +
> +	return p;
>  }
>  
>  /*
> @@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
>   */
>  static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
>  				unsigned int len, unsigned int offset,
> -				struct sk_buff *skb)
> +				struct sk_buff *skb, int linear)
>  {
>  	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
>  		return 1;
>  
> +	if (linear) {
> +		page = linear_to_page(page, len, offset);
> +		if (!page)
> +			return 1;
> +	}
> +
>  	spd->pages[spd->nr_pages] = page;
>  	spd->partial[spd->nr_pages].len = len;
>  	spd->partial[spd->nr_pages].offset = offset;
> -	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
>  	spd->nr_pages++;
> +	get_page(page);
> +
>  	return 0;
>  }
>  
> @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
>  static inline int __splice_segment(struct page *page, unsigned int poff,
>  				   unsigned int plen, unsigned int *off,
>  				   unsigned int *len, struct sk_buff *skb,
> -				   struct splice_pipe_desc *spd)
> +				   struct splice_pipe_desc *spd, int linear)
>  {
>  	if (!*len)
>  		return 1;
> @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
>  		/* the linear region may spread across several pages  */
>  		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
>  
> -		if (spd_fill_page(spd, page, flen, poff, skb))
> +		if (spd_fill_page(spd, page, flen, poff, skb, linear))
>  			return 1;
>  
>  		__segment_seek(&page, &poff, &plen, flen);
> @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
>  	if (__splice_segment(virt_to_page(skb->data),
>  			     (unsigned long) skb->data & (PAGE_SIZE - 1),
>  			     skb_headlen(skb),
> -			     offset, len, skb, spd))
> +			     offset, len, skb, spd, 1))
>  		return 1;
>  
>  	/*
> @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
>  		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
>  
>  		if (__splice_segment(f->page, f->page_offset, f->size,
> -				     offset, len, skb, spd))
> +				     offset, len, skb, spd, 0))
>  			return 1;
>  	}
>  
> --

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:03                                                       ` Willy Tarreau
@ 2009-01-15 23:19                                                         ` David Miller
  2009-01-15 23:19                                                         ` Herbert Xu
  1 sibling, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-15 23:19 UTC (permalink / raw)
  To: w
  Cc: jarkao2, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 16 Jan 2009 00:03:31 +0100

> please don't merge it as-is. I've just tried it and got an OOM
> in __alloc_pages_internal after a few seconds of data transfer.
> 
> I'm leaving the patch below for comments, maybe someone will spot
> something ? Don't we need at least one kfree() somewhere to match
> alloc_pages() ?

What is needed is a put_page().

But that should be happening as a result of the splice callback.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:03                                                       ` Willy Tarreau
  2009-01-15 23:19                                                         ` David Miller
@ 2009-01-15 23:19                                                         ` Herbert Xu
  2009-01-15 23:26                                                           ` David Miller
  2009-01-15 23:32                                                           ` [PATCH] " Willy Tarreau
  1 sibling, 2 replies; 190+ messages in thread
From: Herbert Xu @ 2009-01-15 23:19 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, jarkao2, zbr, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote:
> 
> I'm leaving the patch below for comments, maybe someone will spot
> something ? Don't we need at least one kfree() somewhere to match
> alloc_pages() ?

Indeed.

> >  static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> >  				unsigned int len, unsigned int offset,
> > -				struct sk_buff *skb)
> > +				struct sk_buff *skb, int linear)
> >  {
> >  	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> >  		return 1;
> >  
> > +	if (linear) {
> > +		page = linear_to_page(page, len, offset);
> > +		if (!page)
> > +			return 1;
> > +	}
> > +
> >  	spd->pages[spd->nr_pages] = page;
> >  	spd->partial[spd->nr_pages].len = len;
> >  	spd->partial[spd->nr_pages].offset = offset;
> > -	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> >  	spd->nr_pages++;
> > +	get_page(page);

This get_page needs to be moved into an else clause of the previous
if block.
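
In other words, linear_to_page() already returns a freshly allocated
page holding one reference, so only the page-frag case needs the extra
get_page(). A minimal sketch of the corrected hunk (the same shape it
takes in Dave's follow-up patch below):

	if (linear) {
		page = linear_to_page(page, len, offset);
		if (!page)
			return 1;
	} else
		get_page(page);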

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:19                                                         ` Herbert Xu
@ 2009-01-15 23:26                                                           ` David Miller
  2009-01-15 23:32                                                             ` Herbert Xu
  2009-01-20  8:37                                                             ` Jarek Poplawski
  2009-01-15 23:32                                                           ` [PATCH] " Willy Tarreau
  1 sibling, 2 replies; 190+ messages in thread
From: David Miller @ 2009-01-15 23:26 UTC (permalink / raw)
  To: herbert
  Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 16 Jan 2009 10:19:35 +1100

> On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote:
> > > +	if (linear) {
> > > +		page = linear_to_page(page, len, offset);
> > > +		if (!page)
> > > +			return 1;
> > > +	}
> > > +
> > >  	spd->pages[spd->nr_pages] = page;
> > >  	spd->partial[spd->nr_pages].len = len;
> > >  	spd->partial[spd->nr_pages].offset = offset;
> > > -	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> > >  	spd->nr_pages++;
> > > +	get_page(page);
> 
> This get_page needs to be moved into an else clause of the previous
> if block.

Yep, good catch Herbert.

New patch, this has the SKB clone removal as well:

net: Fix data corruption when splicing from sockets.

From: Jarek Poplawski <jarkao2@gmail.com>

The trick in socket splicing where we try to convert the skb->data
into a page based reference using virt_to_page() does not work so
well.

The idea is to pass the virt_to_page() reference via the pipe
buffer, and refcount the buffer using a SKB reference.

But if we are splicing from a socket to a socket (via sendpage)
this doesn't work.

The from side processing will grab the page (and SKB) references.
The sendpage() calls will grab page references only, return, and
then the from side processing completes and drops the SKB ref.

The page based reference to skb->data is not enough to keep the
kmalloc() buffer backing it from being reused.  Yet, that is
all that the socket send side has at this point.

This leads to data corruption if the skb->data buffer is reused
by SLAB before the send side socket actually gets the TX packet
out to the device.

The fix employed here is to simply allocate a page and copy the
skb->data bytes into that page.

This will hurt performance, but there is no clear way to fix this
properly without a copy at the present time, and it is important
to get rid of the data corruption.

With fixes from Herbert Xu.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 65eac77..56272ac 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	kfree_skb(skb);
+	put_page(buf->page);
 }
 
 static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	skb_get(skb);
+	get_page(buf->page);
 }
 
 static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
  */
 static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 {
-	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+	put_page(spd->pages[i]);
+}
 
-	kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+					  unsigned int offset)
+{
+	struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+	if (!p)
+		return NULL;
+	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+	return p;
 }
 
 /*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 				unsigned int len, unsigned int offset,
-				struct sk_buff *skb)
+				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
 
+	if (linear) {
+		page = linear_to_page(page, len, offset);
+		if (!page)
+			return 1;
+	} else
+		get_page(page);
+
 	spd->pages[spd->nr_pages] = page;
 	spd->partial[spd->nr_pages].len = len;
 	spd->partial[spd->nr_pages].offset = offset;
-	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
 	spd->nr_pages++;
+
 	return 0;
 }
 
@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
 static inline int __splice_segment(struct page *page, unsigned int poff,
 				   unsigned int plen, unsigned int *off,
 				   unsigned int *len, struct sk_buff *skb,
-				   struct splice_pipe_desc *spd)
+				   struct splice_pipe_desc *spd, int linear)
 {
 	if (!*len)
 		return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
 		/* the linear region may spread across several pages  */
 		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
 
-		if (spd_fill_page(spd, page, flen, poff, skb))
+		if (spd_fill_page(spd, page, flen, poff, skb, linear))
 			return 1;
 
 		__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 	if (__splice_segment(virt_to_page(skb->data),
 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
 			     skb_headlen(skb),
-			     offset, len, skb, spd))
+			     offset, len, skb, spd, 1))
 		return 1;
 
 	/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
 		if (__splice_segment(f->page, f->page_offset, f->size,
-				     offset, len, skb, spd))
+				     offset, len, skb, spd, 0))
 			return 1;
 	}
 
@@ -1442,7 +1455,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {
@@ -1455,16 +1468,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 		.ops = &sock_pipe_buf_ops,
 		.spd_release = sock_spd_release,
 	};
-	struct sk_buff *skb;
-
-	/*
-	 * I'd love to avoid the clone here, but tcp_read_sock()
-	 * ignores reference counts and unconditonally kills the sk_buff
-	 * on return from the actor.
-	 */
-	skb = skb_clone(__skb, GFP_KERNEL);
-	if (unlikely(!skb))
-		return -ENOMEM;
 
 	/*
 	 * __skb_splice_bits() only fails if the output has no room left,
@@ -1488,15 +1491,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 	}
 
 done:
-	/*
-	 * drop our reference to the clone, the pipe consumption will
-	 * drop the rest.
-	 */
-	kfree_skb(skb);
-
 	if (spd.nr_pages) {
+		struct sock *sk = skb->sk;
 		int ret;
-		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:26                                                           ` David Miller
@ 2009-01-15 23:32                                                             ` Herbert Xu
  2009-01-15 23:34                                                               ` David Miller
  2009-01-20  8:37                                                             ` Jarek Poplawski
  1 sibling, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-01-15 23:32 UTC (permalink / raw)
  To: David Miller
  Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
>
> New patch, this has the SKB clone removal as well:

Thanks Dave!

Something else just came to mind though.

> +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> +					  unsigned int offset)
> +{
> +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> +
> +	if (!p)
> +		return NULL;
> +	memcpy(page_address(p) + offset, page_address(page) + offset, len);

This won't work very well if skb->head is longer than a page.

We'll need to divide it up into individual pages.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:19                                                         ` Herbert Xu
  2009-01-15 23:26                                                           ` David Miller
@ 2009-01-15 23:32                                                           ` Willy Tarreau
  2009-01-15 23:35                                                             ` David Miller
  1 sibling, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-15 23:32 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, jarkao2, zbr, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Fri, Jan 16, 2009 at 10:19:35AM +1100, Herbert Xu wrote:
> On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote:
> > 
> > I'm leaving the patch below for comments, maybe someone will spot
> > something ? Don't we need at least one kfree() somewhere to match
> > alloc_pages() ?
> 
> Indeed.
> 
> > >  static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> > >  				unsigned int len, unsigned int offset,
> > > -				struct sk_buff *skb)
> > > +				struct sk_buff *skb, int linear)
> > >  {
> > >  	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> > >  		return 1;
> > >  
> > > +	if (linear) {
> > > +		page = linear_to_page(page, len, offset);
> > > +		if (!page)
> > > +			return 1;
> > > +	}
> > > +
> > >  	spd->pages[spd->nr_pages] = page;
> > >  	spd->partial[spd->nr_pages].len = len;
> > >  	spd->partial[spd->nr_pages].offset = offset;
> > > -	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> > >  	spd->nr_pages++;
> > > +	get_page(page);
> 
> This get_page needs to be moved into an else clause of the previous
> if block.

Good catch Herbert, it's working fine now. The performance I get
(30.6 MB/s) is between the original splice version (24.1) and the
recently optimized one (35.7). That's not that bad considering
we're copying the data.

Cheers,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:32                                                             ` Herbert Xu
@ 2009-01-15 23:34                                                               ` David Miller
  2009-01-15 23:42                                                                 ` Willy Tarreau
  2009-01-19  6:16                                                                 ` David Miller
  0 siblings, 2 replies; 190+ messages in thread
From: David Miller @ 2009-01-15 23:34 UTC (permalink / raw)
  To: herbert
  Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 16 Jan 2009 10:32:05 +1100

> On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > +					  unsigned int offset)
> > +{
> > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > +
> > +	if (!p)
> > +		return NULL;
> > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> 
> This won't work very well if skb->head is longer than a page.
> 
> We'll need to divide it up into individual pages.

Oh yes the same bug I pointed out the other day.

But Willy can test this patch as-is, since he is not using
jumbo frames in linear SKBs.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:32                                                           ` [PATCH] " Willy Tarreau
@ 2009-01-15 23:35                                                             ` David Miller
  0 siblings, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-15 23:35 UTC (permalink / raw)
  To: w
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 16 Jan 2009 00:32:20 +0100

> Good catch Herbert, it's working fine now. The performance I get
> (30.6 MB/s) is between the original splice version (24.1) and the
> recently optimized one (35.7). That's not that bad considering
> we're copying the data.

Willy, see how much my skb clone removal in the patch I just
posted helps, if at all.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:34                                                               ` David Miller
@ 2009-01-15 23:42                                                                 ` Willy Tarreau
  2009-01-15 23:44                                                                   ` Willy Tarreau
  2009-01-19  6:16                                                                 ` David Miller
  1 sibling, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-15 23:42 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Fri, 16 Jan 2009 10:32:05 +1100
> 
> > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > +					  unsigned int offset)
> > > +{
> > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > +
> > > +	if (!p)
> > > +		return NULL;
> > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > 
> > This won't work very well if skb->head is longer than a page.
> > 
> > We'll need to divide it up into individual pages.
> 
> Oh yes the same bug I pointed out the other day.
> 
> But Willy can test this patch as-is,

Hey, nice work Dave. +3% performance from your previous patch
(31.6 MB/s). It's going fine and stable here.

> since he is not using jumbo frames in linear SKBs.

If you're interested, this weekend I can do some tests on my
myri10ge NICs, which support LRO. I frequently observe 23 kB
packets there, and they also support jumbo frames. Those should
cover the case above.

I'm afraid that's all for me this evening; I have to get some
sleep before going to work. If you want to cook up more patches,
I'll be able to do a bit of testing in about 5 hours.

Cheers!
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:42                                                                 ` Willy Tarreau
@ 2009-01-15 23:44                                                                   ` Willy Tarreau
  2009-01-15 23:54                                                                     ` David Miller
  2009-01-16  6:51                                                                     ` Jarek Poplawski
  0 siblings, 2 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-15 23:44 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Fri, Jan 16, 2009 at 12:42:55AM +0100, Willy Tarreau wrote:
> On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Fri, 16 Jan 2009 10:32:05 +1100
> > 
> > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > > +					  unsigned int offset)
> > > > +{
> > > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > > +
> > > > +	if (!p)
> > > > +		return NULL;
> > > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > > 
> > > This won't work very well if skb->head is longer than a page.
> > > 
> > > We'll need to divide it up into individual pages.
> > 
> > Oh yes the same bug I pointed out the other day.
> > 
> > But Willy can test this patch as-is,
> 
> Hey, nice work Dave. +3% performance from your previous patch
> (31.6 MB/s). It's going fine and stable here.

And BTW feel free to add my Tested-by if you want in case you merge
this fix.

Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:44                                                                   ` Willy Tarreau
@ 2009-01-15 23:54                                                                     ` David Miller
  2009-01-19  0:42                                                                       ` Willy Tarreau
  2009-01-16  6:51                                                                     ` Jarek Poplawski
  1 sibling, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-15 23:54 UTC (permalink / raw)
  To: w
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 16 Jan 2009 00:44:08 +0100

> And BTW feel free to add my Tested-by if you want in case you merge
> this fix.

Done, thanks Willy.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:44                                                                   ` Willy Tarreau
  2009-01-15 23:54                                                                     ` David Miller
@ 2009-01-16  6:51                                                                     ` Jarek Poplawski
  2009-01-19  6:08                                                                       ` David Miller
  1 sibling, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-16  6:51 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Fri, Jan 16, 2009 at 12:44:08AM +0100, Willy Tarreau wrote:
> On Fri, Jan 16, 2009 at 12:42:55AM +0100, Willy Tarreau wrote:
> > On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > Date: Fri, 16 Jan 2009 10:32:05 +1100
> > > 
> > > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > > > +					  unsigned int offset)
> > > > > +{
> > > > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > > > +
> > > > > +	if (!p)
> > > > > +		return NULL;
> > > > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > > > 
> > > > This won't work very well if skb->head is longer than a page.
> > > > 
> > > > We'll need to divide it up into individual pages.
> > > 
> > > Oh yes the same bug I pointed out the other day.
> > > 
> > > But Willy can test this patch as-is,
> > 
> > Hey, nice work Dave. +3% performance from your previous patch
> > (31.6 MB/s). It's going fine and stable here.
> 
> And BTW feel free to add my Tested-by if you want in case you merge
> this fix.
> 
> Willy
> 

Herbert, good catch!

David, if it's not too late I think more credits are needed,
especially for Willy. He did "a bit" more than testing.

Alas, I can't see this problem with skb->head longer than a page. There
is even a comment on this in __splice_segment(), but I may be missing
something.

I'm more concerned about memory usage if these skbs are not acked for
some reason. Isn't a DoS possible here?

Thanks everybody,
Jarek P.
--------->
Based on a review by Changli Gao <xiaosuo@gmail.com>:
http://lkml.org/lkml/2008/2/26/210

Foreseen-by: Changli Gao <xiaosuo@gmail.com>
Diagnosed-by: Willy Tarreau <w@1wt.eu>
Reported-by: Willy Tarreau <w@1wt.eu>
Fixed-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:54                                                                     ` David Miller
@ 2009-01-19  0:42                                                                       ` Willy Tarreau
  2009-01-19  3:08                                                                         ` Herbert Xu
                                                                                           ` (2 more replies)
  0 siblings, 3 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-19  0:42 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

Hi guys,

On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Fri, 16 Jan 2009 00:44:08 +0100
> 
> > And BTW feel free to add my Tested-by if you want in case you merge
> > this fix.
> 
> Done, thanks Willy.

Just for the record, I've now re-integrated those changes in a test kernel
that I booted on my 10gig machines. I have updated my user-space code in
haproxy to run a new series of tests. Even though there is a memcpy(), the
results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs):

  - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
    (3.2 Gbps at 100% CPU without splice)

  - 9.2 Gbps at 50% CPU using MTU=1500 with LRO

  - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
    splice)

  - 10 Gbps at 15% CPU using MTU=9000 with LRO

These last ones are really impressive. While I had already observed such
performance on the Myri10GE with Tux, it's the first time I can reach that
level with so little CPU usage in haproxy!
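
To put these numbers in context, the forwarding pattern exercised here
is roughly the loop below. This is only a schematic sketch, not
haproxy's actual code; the forward_once() helper and the 64 kB chunk
size are made up for illustration:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/* Non-blocking socket -> pipe -> socket forwarding: 'in' and 'out'
	 * are connected TCP sockets set to O_NONBLOCK, 'pfd' is a pipe
	 * created once per proxied connection.
	 */
	static ssize_t forward_once(int in, int out, int pfd[2])
	{
		ssize_t rd, wr, total = 0;

		rd = splice(in, NULL, pfd[1], NULL, 65536,
			    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
		if (rd <= 0)
			return rd;	/* 0 = EOF, -1 = error or EAGAIN */

		while (rd > 0) {
			wr = splice(pfd[0], NULL, out, NULL, rd,
				    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
			if (wr <= 0)
				return total ? total : wr;
			rd -= wr;
			total += wr;
		}
		return total;
	}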

So I think that the memcpy() workaround might be a non-issue for some time.
I agree it's not beautiful but it works pretty well for now.

The 3 patches I used on top of 2.6.27.10 were the fix to return 0 instead of
-EAGAIN on end of read, the one to process multiple skbs at once, and Dave's
last patch based on Jarek's workaround for the corruption issue.

Cheers,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  0:42                                                                       ` Willy Tarreau
@ 2009-01-19  3:08                                                                         ` Herbert Xu
  2009-01-19  3:27                                                                           ` David Miller
  2009-01-19  3:28                                                                         ` David Miller
  2009-01-20 12:01                                                                         ` Ben Mansell
  2 siblings, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-01-19  3:08 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, jarkao2, zbr, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
>
> Just for the record, I've now re-integrated those changes in a test kernel
> that I booted on my 10gig machines. I have updated my user-space code in
> haproxy to run a new series of tests. Even though there is a memcpy(), the
> results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs):
> 
>   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
>     (3.2 Gbps at 100% CPU without splice)

One thing to note is that Myricom's driver probably uses page
frags which means that you're not actually triggering the copy.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  3:08                                                                         ` Herbert Xu
@ 2009-01-19  3:27                                                                           ` David Miller
  2009-01-19  6:14                                                                             ` Willy Tarreau
  0 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-19  3:27 UTC (permalink / raw)
  To: herbert
  Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 19 Jan 2009 14:08:44 +1100

> On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> >
> > Just for the record, I've now re-integrated those changes in a test kernel
> > that I booted on my 10gig machines. I have updated my user-space code in
> > haproxy to run a new series of tests. Even though there is a memcpy(), the
> > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs):
> > 
> >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> >     (3.2 Gbps at 100% CPU without splice)
> 
> One thing to note is that Myricom's driver probably uses page
> frags which means that you're not actually triggering the copy.

Right.

And this is also the only reason why jumbo MTU worked :-)

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  0:42                                                                       ` Willy Tarreau
  2009-01-19  3:08                                                                         ` Herbert Xu
@ 2009-01-19  3:28                                                                         ` David Miller
  2009-01-19  6:11                                                                           ` Willy Tarreau
  2009-01-24 21:23                                                                           ` Willy Tarreau
  2009-01-20 12:01                                                                         ` Ben Mansell
  2 siblings, 2 replies; 190+ messages in thread
From: David Miller @ 2009-01-19  3:28 UTC (permalink / raw)
  To: w
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Mon, 19 Jan 2009 01:42:06 +0100

> Hi guys,
> 
> On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Fri, 16 Jan 2009 00:44:08 +0100
> > 
> > > And BTW feel free to add my Tested-by if you want in case you merge
> > > this fix.
> > 
> > Done, thanks Willy.
> 
> Just for the record, I've now re-integrated those changes in a test kernel
> that I booted on my 10gig machines. I have updated my user-space code in
> haproxy to run a new series of tests. Even though there is a memcpy(), the
> results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs):
> 
>   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
>     (3.2 Gbps at 100% CPU without splice)
> 
>   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> 
>   - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
>     splice)
> 
>   - 10 Gbps at 15% CPU using MTU=9000 with LRO

Thanks for the numbers.

We can almost certainly do a lot better, so if you have the time and
can get some oprofile dumps for these various cases that would be
useful to us.

Thanks again.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-16  6:51                                                                     ` Jarek Poplawski
@ 2009-01-19  6:08                                                                       ` David Miller
  0 siblings, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-19  6:08 UTC (permalink / raw)
  To: jarkao2
  Cc: w, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 16 Jan 2009 06:51:08 +0000

> David, if it's not too late I think more credits are needed,
> especially for Willy. He did "a bit" more than testing.
 ...
> Foreseen-by: Changli Gao <xiaosuo@gmail.com>
> Diagnosed-by: Willy Tarreau <w@1wt.eu>
> Reported-by: Willy Tarreau <w@1wt.eu>
> Fixed-by: Jens Axboe <jens.axboe@oracle.com>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

I will be sure to add these before committing, thanks Jarek.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  3:28                                                                         ` David Miller
@ 2009-01-19  6:11                                                                           ` Willy Tarreau
  2009-01-24 21:23                                                                           ` Willy Tarreau
  1 sibling, 0 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-19  6:11 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Sun, Jan 18, 2009 at 07:28:15PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 19 Jan 2009 01:42:06 +0100
> 
> > Hi guys,
> > 
> > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > > From: Willy Tarreau <w@1wt.eu>
> > > Date: Fri, 16 Jan 2009 00:44:08 +0100
> > > 
> > > > And BTW feel free to add my Tested-by if you want in case you merge
> > > > this fix.
> > > 
> > > Done, thanks Willy.
> > 
> > Just for the record, I've now re-integrated those changes in a test kernel
> > that I booted on my 10gig machines. I have updated my user-space code in
> > haproxy to run a new series of tests. Even though there is a memcpy(), the
> > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs):
> > 
> >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> >     (3.2 Gbps at 100% CPU without splice)
> > 
> >   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> > 
> >   - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
> >     splice)
> > 
> >   - 10 Gbps at 15% CPU using MTU=9000 with LRO
> 
> Thanks for the numbers.
> 
> We can almost certainly do a lot better, so if you have the time and
> can get some oprofile dumps for these various cases that would be
> useful to us.

No problem, of course. It's just a matter of time, but if we can push
the numbers further, let's try.

Regards,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  3:27                                                                           ` David Miller
@ 2009-01-19  6:14                                                                             ` Willy Tarreau
  2009-01-19  6:19                                                                               ` David Miller
  2009-01-19  8:40                                                                               ` Jarek Poplawski
  0 siblings, 2 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-19  6:14 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 19 Jan 2009 14:08:44 +1100
> 
> > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> > >
> > > Just for the record, I've now re-integrated those changes in a test kernel
> > > that I booted on my 10gig machines. I have updated my user-space code in
> > > haproxy to run a new series of tests. Even though there is a memcpy(), the
> > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs):
> > > 
> > >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > >     (3.2 Gbps at 100% CPU without splice)
> > 
> > One thing to note is that Myricom's driver probably uses page
> > frags which means that you're not actually triggering the copy.

So does this mean that the corruption problem should still be there for
such a driver? I'm asking before testing, because at these speeds,
validity tests are not that easy ;-)

> Right.
> 
> And this is also the only reason why jumbo MTU worked :-)

What should we expect from other drivers with jumbo frames? Hangs,
corruption, errors, packet loss?

Thanks,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:34                                                               ` David Miller
  2009-01-15 23:42                                                                 ` Willy Tarreau
@ 2009-01-19  6:16                                                                 ` David Miller
  2009-01-19 10:20                                                                   ` Herbert Xu
  1 sibling, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-19  6:16 UTC (permalink / raw)
  To: herbert
  Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: David Miller <davem@davemloft.net>
Date: Thu, 15 Jan 2009 15:34:49 -0800 (PST)

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Fri, 16 Jan 2009 10:32:05 +1100
> 
> > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > +					  unsigned int offset)
> > > +{
> > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > +
> > > +	if (!p)
> > > +		return NULL;
> > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > 
> > This won't work very well if skb->head is longer than a page.
> > 
> > We'll need to divide it up into individual pages.
> 
> Oh yes the same bug I pointed out the other day.
> 
> But Willy can test this patch as-is, since he is not using
> jumbo frames in linear SKBs.

Actually, Herbert, it turns out this case should be OK.

Look a level or two higher, at __splice_segment(), it even has a
comment :-)

--------------------
	do {
		unsigned int flen = min(*len, plen);

		/* the linear region may spread across several pages  */
		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

		if (spd_fill_page(spd, page, flen, poff, skb, linear))
			return 1;

		__segment_seek(&page, &poff, &plen, flen);
		*len -= flen;

	} while (*len && plen);
--------------------

That code should work and do what we need.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  6:14                                                                             ` Willy Tarreau
@ 2009-01-19  6:19                                                                               ` David Miller
  2009-01-19  6:45                                                                                 ` Willy Tarreau
  2009-01-19 10:19                                                                                 ` Herbert Xu
  2009-01-19  8:40                                                                               ` Jarek Poplawski
  1 sibling, 2 replies; 190+ messages in thread
From: David Miller @ 2009-01-19  6:19 UTC (permalink / raw)
  To: w
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Mon, 19 Jan 2009 07:14:20 +0100

> On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Mon, 19 Jan 2009 14:08:44 +1100
> > 
> > > One thing to note is that Myricom's driver probably uses page
> > > frags which means that you're not actually triggering the copy.
> 
> So does this mean that the corruption problem should still be there for
> such a driver? I'm asking before testing, because at these speeds,
> validity tests are not that easy ;-)

It ought not to, but it seems that is the case where you
saw the original corruptions, so hmmm...

Actually, I see, the myri10ge driver does put up to
64 bytes of the initial packet into the linear area.
If the IPV4 + TCP headers are less than this, you will
hit the corruption case even with the myri10ge driver.

So everything checks out.

> > And this is also the only reason why jumbo MTU worked :-)
> 
> What should we expect from other drivers with jumbo frames? Hangs,
> corruption, errors, packet loss?

Upon recent review I think jumbo frames in such drivers should
actually be fine.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  6:19                                                                               ` David Miller
@ 2009-01-19  6:45                                                                                 ` Willy Tarreau
  2009-01-19 10:19                                                                                 ` Herbert Xu
  1 sibling, 0 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-19  6:45 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 19 Jan 2009 07:14:20 +0100
> 
> > On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > Date: Mon, 19 Jan 2009 14:08:44 +1100
> > > 
> > > > One thing to note is that Myricom's driver probably uses page
> > > > frags which means that you're not actually triggering the copy.
> > 
> > So does this mean that the corruption problem should still be there for
> > such a driver? I'm asking before testing, because at these speeds,
> > validity tests are not that easy ;-)
> 
> It ought not to, but it seems that is the case where you
> saw the original corruptions, so hmmm...
> 
> Actually, I see, the myri10ge driver does put up to
> 64 bytes of the initial packet into the linear area.
> If the IPV4 + TCP headers are less than this, you will
> hit the corruption case even with the myri10ge driver.
> 
> So everything checks out.

OK, so I will modify my tools in order to perform a few checks.

Thanks!
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  6:14                                                                             ` Willy Tarreau
  2009-01-19  6:19                                                                               ` David Miller
@ 2009-01-19  8:40                                                                               ` Jarek Poplawski
  1 sibling, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-19  8:40 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Mon, Jan 19, 2009 at 07:14:20AM +0100, Willy Tarreau wrote:
> On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Mon, 19 Jan 2009 14:08:44 +1100
> > 
> > > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> > > >
> > > > Just for the record, I've now re-integrated those changes in a test kernel
> > > > that I booted on my 10gig machines. I have updated my user-space code in
> > > > haproxy to run a new series of tests. Even though there is a memcpy(), the
> > > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs):
> > > > 
> > > >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > > >     (3.2 Gbps at 100% CPU without splice)
> > > 
> > > One thing to note is that Myricom's driver probably uses page
> > > frags which means that you're not actually triggering the copy.
> 
> So does this mean that the corruption problem should still be there for
> such a driver? I'm asking before testing, because at these speeds,
> validity tests are not that easy ;-)

I guess David meant the performance: there is not much change because
only a small part gets copied. Jumbo frames in linear-only skbs should
be hurt the most, but no corruption is expected.

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  6:19                                                                               ` David Miller
  2009-01-19  6:45                                                                                 ` Willy Tarreau
@ 2009-01-19 10:19                                                                                 ` Herbert Xu
  2009-01-19 20:59                                                                                   ` David Miller
  1 sibling, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-01-19 10:19 UTC (permalink / raw)
  To: David Miller
  Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> 
> Actually, I see, the myri10ge driver does put up to
> 64 bytes of the initial packet into the linear area.
> If the IPV4 + TCP headers are less than this, you will
> hit the corruption case even with the myri10ge driver.

I thought splice only mapped the payload areas, no?

So we should probably test with a non-paged driver to be totally
sure.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  6:16                                                                 ` David Miller
@ 2009-01-19 10:20                                                                   ` Herbert Xu
  0 siblings, 0 replies; 190+ messages in thread
From: Herbert Xu @ 2009-01-19 10:20 UTC (permalink / raw)
  To: David Miller
  Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Sun, Jan 18, 2009 at 10:16:03PM -0800, David Miller wrote:
> 
> Actually, Herbert, it turns out this case should be OK.
> 
> Look a level or two higher, at __splice_segment(), it even has a
> comment :-)

Aha, that should take care of it.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19 10:19                                                                                 ` Herbert Xu
@ 2009-01-19 20:59                                                                                   ` David Miller
  2009-01-19 21:24                                                                                     ` Herbert Xu
  2009-01-25 21:03                                                                                     ` Willy Tarreau
  0 siblings, 2 replies; 190+ messages in thread
From: David Miller @ 2009-01-19 20:59 UTC (permalink / raw)
  To: herbert
  Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 19 Jan 2009 21:19:24 +1100

> On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> > 
> > Actually, I see, the myri10ge driver does put up to
> > 64 bytes of the initial packet into the linear area.
> > If the IPV4 + TCP headers are less than this, you will
> > hit the corruption case even with the myri10ge driver.
> 
> I thought splice only mapped the payload areas, no?

And the difference between 64 and IPV4+TCP header len becomes the
payload, don't you see? :-)

myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
the linear area, unconditionally.  So a small number of payload bytes
can in fact end up there.

Otherwise Willy could never have triggered this bug.
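
As a purely schematic illustration (this is not the real myri10ge
receive path, and rx_fixed_pull() is a made-up helper), a fixed-size
pull like the one below is enough to leave payload bytes in the
kmalloc()'d linear area whenever the actual headers are shorter than
the pull length:

	#include <linux/skbuff.h>

	/* Hypothetical sketch of a driver pulling a fixed 64 bytes of the
	 * frame into the linear area. With a 20-byte IPv4 header and a
	 * 20-byte TCP header (no options), 24 bytes of payload end up in
	 * skb->data, i.e. in the kmalloc()'d buffer that the old
	 * virt_to_page() trick handed to the pipe.
	 */
	static void rx_fixed_pull(struct sk_buff *skb)
	{
		unsigned int pull = min_t(unsigned int, 64, skb->len);

		/* pskb_may_pull() copies from the page frags into the
		 * linear area as needed so that at least 'pull' bytes
		 * are linear.
		 */
		if (!pskb_may_pull(skb, pull))
			return;		/* allocation failure */
	}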

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19 20:59                                                                                   ` David Miller
@ 2009-01-19 21:24                                                                                     ` Herbert Xu
  2009-01-25 21:03                                                                                     ` Willy Tarreau
  1 sibling, 0 replies; 190+ messages in thread
From: Herbert Xu @ 2009-01-19 21:24 UTC (permalink / raw)
  To: David Miller
  Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Mon, Jan 19, 2009 at 12:59:41PM -0800, David Miller wrote:
>
> And the difference between 64 and IPV4+TCP header len becomes the
> payload, don't you see? :-)
> 
> myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
> the linear area, unconditionally.  So a small number of payload bytes
> can in fact end up there.

Ah, I thought they were doing a precise division. Yes, pulling
a constant length will trigger it if it's big enough.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-15 23:26                                                           ` David Miller
  2009-01-15 23:32                                                             ` Herbert Xu
@ 2009-01-20  8:37                                                             ` Jarek Poplawski
  2009-01-20  9:33                                                               ` [PATCH v2] " Jarek Poplawski
  1 sibling, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-20  8:37 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, w, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
...
> net: Fix data corruption when splicing from sockets.
> 
> From: Jarek Poplawski <jarkao2@gmail.com>
> 
> The trick in socket splicing where we try to convert the skb->data
> into a page based reference using virt_to_page() does not work so
> well.
...
> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 65eac77..56272ac 100644
...

Here is a tiny upgrade to save some memory by reusing a page for more
chunks when possible, which I think could be considered after the
testing of the main patch is finished. (An additional freeing of this
cached page before socket destruction could also be added, maybe in
tcp_splice_read(), if somebody finds a good place.)

Thanks,
Jarek P.

---

 include/net/sock.h  |    4 ++++
 net/core/skbuff.c   |   32 ++++++++++++++++++++++++++------
 net/core/sock.c     |    2 ++
 net/ipv4/tcp_ipv4.c |    8 ++++++++
 4 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5a3a151..4ded741 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -190,6 +190,8 @@ struct sock_common {
   *	@sk_user_data: RPC layer private data
   *	@sk_sndmsg_page: cached page for sendmsg
   *	@sk_sndmsg_off: cached offset for sendmsg
+  *	@sk_splice_page: cached page for splice
+  *	@sk_splice_off: cached offset for splice
   *	@sk_send_head: front of stuff to transmit
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
@@ -279,6 +281,8 @@ struct sock {
 	struct page		*sk_sndmsg_page;
 	struct sk_buff		*sk_send_head;
 	__u32			sk_sndmsg_off;
+	struct page		*sk_splice_page;
+	__u32			sk_splice_off;
 	int			sk_write_pending;
 #ifdef CONFIG_SECURITY
 	void			*sk_security;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 56272ac..74adc79 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1334,13 +1334,33 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 }
 
 static inline struct page *linear_to_page(struct page *page, unsigned int len,
-					  unsigned int offset)
+					  unsigned int *offset,
+					  struct sk_buff *skb)
 {
-	struct page *p = alloc_pages(GFP_KERNEL, 0);
+	struct sock *sk = skb->sk;
+	struct page *p = sk->sk_splice_page;
+	unsigned int off;
 
-	if (!p)
-		return NULL;
-	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+	if (!p) {
+new_page:
+		p = sk->sk_splice_page = alloc_pages(sk->sk_allocation, 0);
+		if (!p)
+			return NULL;
+
+		off = sk->sk_splice_off = 0;
+		/* hold this page until it's full or unneeded */
+		get_page(p);
+	} else {
+		off = sk->sk_splice_off;
+		if (off + len > PAGE_SIZE) {
+			put_page(p);
+			goto new_page;
+		}
+	}
+
+	memcpy(page_address(p) + off, page_address(page) + *offset, len);
+	sk->sk_splice_off += len;
+	*offset = off;
 
 	return p;
 }
@@ -1356,7 +1376,7 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 		return 1;
 
 	if (linear) {
-		page = linear_to_page(page, len, offset);
+		page = linear_to_page(page, len, &offset, skb);
 		if (!page)
 			return 1;
 	} else
diff --git a/net/core/sock.c b/net/core/sock.c
index f3a0d08..6b258a9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1732,6 +1732,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 	sk->sk_sndmsg_page	=	NULL;
 	sk->sk_sndmsg_off	=	0;
+	sk->sk_splice_page	=	NULL;
+	sk->sk_splice_off	=	0;
 
 	sk->sk_peercred.pid 	=	0;
 	sk->sk_peercred.uid	=	-1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 19d7b42..cf3d367 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1848,6 +1848,14 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		sk->sk_sndmsg_page = NULL;
 	}
 
+	/*
+	 * If splice cached page exists, toss it.
+	 */
+	if (sk->sk_splice_page) {
+		__free_page(sk->sk_splice_page);
+		sk->sk_splice_page = NULL;
+	}
+
 	percpu_counter_dec(&tcp_sockets_allocated);
 }
 

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-20  8:37                                                             ` Jarek Poplawski
@ 2009-01-20  9:33                                                               ` Jarek Poplawski
  2009-01-20 10:00                                                                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-20  9:33 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, w, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Tue, Jan 20, 2009 at 08:37:26AM +0000, Jarek Poplawski wrote:
...
> Here is a tiny upgrade to save some memory by reusing a page for more
> chunks when possible, which I think could be considered after the
> testing of the main patch is finished. (An additional freeing of this
> cached page before socket destruction could also be added, maybe in
> tcp_splice_read(), if somebody finds a good place.)

OOPS! I did it again... Here is better refcounting.

Jarek P.

--- (take 2)

 include/net/sock.h  |    4 ++++
 net/core/skbuff.c   |   32 ++++++++++++++++++++++++++------
 net/core/sock.c     |    2 ++
 net/ipv4/tcp_ipv4.c |    8 ++++++++
 4 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5a3a151..4ded741 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -190,6 +190,8 @@ struct sock_common {
   *	@sk_user_data: RPC layer private data
   *	@sk_sndmsg_page: cached page for sendmsg
   *	@sk_sndmsg_off: cached offset for sendmsg
+  *	@sk_splice_page: cached page for splice
+  *	@sk_splice_off: cached offset for splice
   *	@sk_send_head: front of stuff to transmit
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
@@ -279,6 +281,8 @@ struct sock {
 	struct page		*sk_sndmsg_page;
 	struct sk_buff		*sk_send_head;
 	__u32			sk_sndmsg_off;
+	struct page		*sk_splice_page;
+	__u32			sk_splice_off;
 	int			sk_write_pending;
 #ifdef CONFIG_SECURITY
 	void			*sk_security;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 56272ac..02a1a6c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1334,13 +1334,33 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 }
 
 static inline struct page *linear_to_page(struct page *page, unsigned int len,
-					  unsigned int offset)
+					  unsigned int *offset,
+					  struct sk_buff *skb)
 {
-	struct page *p = alloc_pages(GFP_KERNEL, 0);
+	struct sock *sk = skb->sk;
+	struct page *p = sk->sk_splice_page;
+	unsigned int off;
 
-	if (!p)
-		return NULL;
-	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+	if (!p) {
+new_page:
+		p = sk->sk_splice_page = alloc_pages(sk->sk_allocation, 0);
+		if (!p)
+			return NULL;
+
+		off = sk->sk_splice_off = 0;
+		/* we hold one ref to this page until it's full or unneeded */
+	} else {
+		off = sk->sk_splice_off;
+		if (off + len > PAGE_SIZE) {
+			put_page(p);
+			goto new_page;
+		}
+	}
+
+	memcpy(page_address(p) + off, page_address(page) + *offset, len);
+	sk->sk_splice_off += len;
+	*offset = off;
+	get_page(p);
 
 	return p;
 }
@@ -1356,7 +1376,7 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 		return 1;
 
 	if (linear) {
-		page = linear_to_page(page, len, offset);
+		page = linear_to_page(page, len, &offset, skb);
 		if (!page)
 			return 1;
 	} else
diff --git a/net/core/sock.c b/net/core/sock.c
index f3a0d08..6b258a9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1732,6 +1732,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 	sk->sk_sndmsg_page	=	NULL;
 	sk->sk_sndmsg_off	=	0;
+	sk->sk_splice_page	=	NULL;
+	sk->sk_splice_off	=	0;
 
 	sk->sk_peercred.pid 	=	0;
 	sk->sk_peercred.uid	=	-1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 19d7b42..cf3d367 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1848,6 +1848,14 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		sk->sk_sndmsg_page = NULL;
 	}
 
+	/*
+	 * If splice cached page exists, toss it.
+	 */
+	if (sk->sk_splice_page) {
+		__free_page(sk->sk_splice_page);
+		sk->sk_splice_page = NULL;
+	}
+
 	percpu_counter_dec(&tcp_sockets_allocated);
 }
 

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-20  9:33                                                               ` [PATCH v2] " Jarek Poplawski
@ 2009-01-20 10:00                                                                 ` Evgeniy Polyakov
  2009-01-20 10:20                                                                   ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-20 10:00 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

Hi Jarek.

On Tue, Jan 20, 2009 at 09:33:52AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > Here is a tiny upgrade to save some memory by reusing a page for more
> > chunks if possible, which I think could be considered, after the
> > testing of the main patch is finished. (There could be also added an
> > additional freeing of this cached page before socket destruction,
> > maybe in tcp_splice_read(), if somebody finds good place.)
> 
> OOPS! I did it again... Here is better refcounting.
> 
> Jarek P.
> 
> --- (take 2)
> 
>  include/net/sock.h  |    4 ++++
>  net/core/skbuff.c   |   32 ++++++++++++++++++++++++++------
>  net/core/sock.c     |    2 ++
>  net/ipv4/tcp_ipv4.c |    8 ++++++++
>  4 files changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 5a3a151..4ded741 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -190,6 +190,8 @@ struct sock_common {
>    *	@sk_user_data: RPC layer private data
>    *	@sk_sndmsg_page: cached page for sendmsg
>    *	@sk_sndmsg_off: cached offset for sendmsg
> +  *	@sk_splice_page: cached page for splice
> +  *	@sk_splice_off: cached offset for splice

Ugh, increase every socket by 16 bytes... Does TCP one still fit the
page?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-20 10:00                                                                 ` Evgeniy Polyakov
@ 2009-01-20 10:20                                                                   ` Jarek Poplawski
  2009-01-20 10:31                                                                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-20 10:20 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Tue, Jan 20, 2009 at 01:00:43PM +0300, Evgeniy Polyakov wrote:
> Hi Jarek.

Hi Evgeniy.
...
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 5a3a151..4ded741 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -190,6 +190,8 @@ struct sock_common {
> >    *	@sk_user_data: RPC layer private data
> >    *	@sk_sndmsg_page: cached page for sendmsg
> >    *	@sk_sndmsg_off: cached offset for sendmsg
> > +  *	@sk_splice_page: cached page for splice
> > +  *	@sk_splice_off: cached offset for splice
> 
> Ugh, increase every socket by 16 bytes... Does TCP one still fit the
> page?

Good question! Alas I can't check this soon, but if it's really like
this, of course this needs some better idea and rework. (BTW, I'd like
to prevent here as much as possible some strange activities like 1
byte (payload) packets getting full pages without any accounting.)

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-20 10:20                                                                   ` Jarek Poplawski
@ 2009-01-20 10:31                                                                     ` Evgeniy Polyakov
  2009-01-20 11:01                                                                       ` Jarek Poplawski
  2009-01-26  8:20                                                                       ` [PATCH v2] " Jarek Poplawski
  0 siblings, 2 replies; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-20 10:31 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> Good question! Alas I can't check this soon, but if it's really like
> this, of course this needs some better idea and rework. (BTW, I'd like
> to prevent here as much as possible some strange activities like 1
> byte (payload) packets getting full pages without any accounting.)

I believe approach to meet all our goals is to have own network memory
allocator, so that each skb could have its payload in the fragments, we
would not suffer from the heavy fragmentation and power-of-two overhead
for the larger MTUs, have a reserve for the OOM condition and generally
do not depend on the main system behaviour.

I will resurrect to some point my network allocator to check how things
go in the modern environment, if no one will beat this idea first :)

1. Network (tree) allocator
http://www.ioremap.net/projects/nta

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-20 10:31                                                                     ` Evgeniy Polyakov
@ 2009-01-20 11:01                                                                       ` Jarek Poplawski
  2009-01-20 17:16                                                                         ` David Miller
  2009-01-26  8:20                                                                       ` [PATCH v2] " Jarek Poplawski
  1 sibling, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-20 11:01 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote:
> On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > Good question! Alas I can't check this soon, but if it's really like
> > this, of course this needs some better idea and rework. (BTW, I'd like
> > to prevent here as much as possible some strange activities like 1
> > byte (payload) packets getting full pages without any accounting.)
> 
> I believe approach to meet all our goals is to have own network memory
> allocator, so that each skb could have its payload in the fragments, we
> would not suffer from the heavy fragmentation and power-of-two overhead
> for the larger MTUs, have a reserve for the OOM condition and generally
> do not depend on the main system behaviour.

100% right! But I guess we need this current fix for -stable, and I'm
a bit worried about safety.

> 
> I will resurrect to some point my network allocator to check how things
> go in the modern environment, if no one will beat this idea first :)

I can't see too much beating of ideas around this problem now... I wish
you luck!

> 
> 1. Network (tree) allocator
> http://www.ioremap.net/projects/nta
> 

Great, I'll try to learn a bit btw.,
Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  0:42                                                                       ` Willy Tarreau
  2009-01-19  3:08                                                                         ` Herbert Xu
  2009-01-19  3:28                                                                         ` David Miller
@ 2009-01-20 12:01                                                                         ` Ben Mansell
  2009-01-20 12:11                                                                           ` Evgeniy Polyakov
  2 siblings, 1 reply; 190+ messages in thread
From: Ben Mansell @ 2009-01-20 12:01 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, herbert, jarkao2, zbr, dada1, mingo, linux-kernel,
	netdev, jens.axboe

Willy Tarreau wrote:
> Hi guys,
> 
> On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
>> From: Willy Tarreau <w@1wt.eu>
>> Date: Fri, 16 Jan 2009 00:44:08 +0100
>>
>>> And BTW feel free to add my Tested-by if you want in case you merge
>>> this fix.
>> Done, thanks Willy.
> 
> Just for the record, I've now re-integrated those changes in a test kernel
> that I booted on my 10gig machines. I have updated my user-space code in
> haproxy to run a new series of tests. Even though there is a memcpy(), the
> results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> 
>   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
>     (3.2 Gbps at 100% CPU without splice)
> 
>   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> 
>   - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
>     splice)
> 
>   - 10 Gbps at 15% CPU using MTU=9000 with LRO
> 
> These last ones are really impressive. While I had already observed such
> performance on the Myri10GE with Tux, it's the first time I can reach that
> level with so little CPU usage in haproxy !
> 
> So I think that the memcpy() workaround might be a non-issue for some time.
> I agree it's not beautiful but it works pretty well for now.
> 
> The 3 patches I used on top of 2.6.27.10 were the fix to return 0 instead of
> -EAGAIN on end of read, the one to process multiple skbs at once, and Dave's
> last patch based on Jarek's workaround for the corruption issue.

I've also tested on the same three patches (against 2.6.27.2 here), and 
the patches appear to work just fine. I'm running a similar proxy 
benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3, 
2xforcedeth). splice is working OK now, although I get identical results 
when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user, 
98% system).

I may be hitting a h/w limitation which prevents any higher throughput, 
but I'm a little surprised that splice() didn't use less CPU time. 
Anyway, the splice code is working which is the important part!


Ben


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-20 12:01                                                                         ` Ben Mansell
@ 2009-01-20 12:11                                                                           ` Evgeniy Polyakov
  2009-01-20 13:43                                                                             ` Ben Mansell
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-20 12:11 UTC (permalink / raw)
  To: Ben Mansell
  Cc: Willy Tarreau, David Miller, herbert, jarkao2, dada1, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Jan 20, 2009 at 12:01:13PM +0000, Ben Mansell (ben@zeus.com) wrote:
> I've also tested on the same three patches (against 2.6.27.2 here), and 
> the patches appear to work just fine. I'm running a similar proxy 
> benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3, 
> 2xforcedeth). splice is working OK now, although I get identical results 
> when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user, 
> 98% system).

With small MTU or when driver does not support fragmented allocation
(iirc at least forcedeth does not) skb will contain all the data in the
linear part and thus will be copied in the kernel. read()/write() does
effectively the same, but in userspace.
This should only affect splice usage which involves socket->pipe data
transfer.
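
For what it's worth, a minimal userspace sketch of the socket->pipe->socket
loop this concerns (function name, buffer size and error/EAGAIN handling are
simplified, so treat it as an illustration only):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* forward data from one socket to another through a pipe, the pattern
 * the proxy benchmarks in this thread use per connection */
static int forward(int from, int to)
{
	int p[2];
	ssize_t in, out;

	if (pipe(p) < 0)
		return -1;

	for (;;) {
		/* socket -> pipe: this is the step that copies when the
		 * payload sits in the skb linear area */
		in = splice(from, NULL, p[1], NULL, 65536,
			    SPLICE_F_MOVE | SPLICE_F_NONBLOCK | SPLICE_F_MORE);
		if (in <= 0)
			break;
		/* pipe -> socket: page references are handed over, no copy */
		while (in > 0) {
			out = splice(p[0], NULL, to, NULL, in,
				     SPLICE_F_MOVE | SPLICE_F_MORE);
			if (out <= 0)
				goto done;
			in -= out;
		}
	}
done:
	close(p[0]);
	close(p[1]);
	return 0;
}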

> I may be hitting a h/w limitation which prevents any higher throughput, 
> but I'm a little surprised that splice() didn't use less CPU time. 
> Anyway, the splice code is working which is the important part!

Does splice without the patches (but with the performance improvement for
non-blocking splice) have the same performance? It does not copy data,
but you may hit the data corruption. If performance is the same, this
may indeed be a HW limitation.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-20 12:11                                                                           ` Evgeniy Polyakov
@ 2009-01-20 13:43                                                                             ` Ben Mansell
  2009-01-20 14:06                                                                               ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Ben Mansell @ 2009-01-20 13:43 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Willy Tarreau, David Miller, herbert, jarkao2, dada1, mingo,
	linux-kernel, netdev, jens.axboe

Evgeniy Polyakov wrote:
> On Tue, Jan 20, 2009 at 12:01:13PM +0000, Ben Mansell (ben@zeus.com) wrote:
>> I've also tested on the same three patches (against 2.6.27.2 here), and 
>> the patches appear to work just fine. I'm running a similar proxy 
>> benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3, 
>> 2xforcedeth). splice is working OK now, although I get identical results 
>> when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user, 
>> 98% system).
> 
> With small MTU or when driver does not support fragmented allocation
> (iirc at least forcedeth does not) skb will contain all the data in the
> linear part and thus will be copied in the kernel. read()/write() does
> effectively the same, but in userspace.
> This should only affect splice usage which involves socket->pipe data
> transfer.

I'll try with some larger MTUs and see if that helps - it should also 
give an improvement if I'm hitting a limit on the number of 
packets/second that the cards can process, regardless of splice...

>> I may be hitting a h/w limitation which prevents any higher throughput, 
>> but I'm a little surprised that splice() didn't use less CPU time. 
>> Anyway, the splice code is working which is the important part!
> 
> Does splice without the patches (but with the performance improvement for
> non-blocking splice) have the same performance? It does not copy data,
> but you may hit the data corruption. If performance is the same, this
> may indeed be a HW limitation.

With an unpatched kernel, the splice performance was worse (due to the 
one packet per-splice issues). With the small patch to fix that, I was 
getting around 2 Gbps performance, although oddly enough, I could only 
get 2 Gbps with read()/write() then as well...

I'll try and do some tests on a machine that hopefully doesn't have the 
bottlenecks (and one that uses different NICs)


Ben

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-20 13:43                                                                             ` Ben Mansell
@ 2009-01-20 14:06                                                                               ` Jarek Poplawski
  0 siblings, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-20 14:06 UTC (permalink / raw)
  To: Ben Mansell
  Cc: Evgeniy Polyakov, Willy Tarreau, David Miller, herbert, dada1,
	mingo, linux-kernel, netdev, jens.axboe

On Tue, Jan 20, 2009 at 01:43:52PM +0000, Ben Mansell wrote:
...
> With an unpatched kernel, the splice performance was worse (due to the  
> one packet per-splice issues). With the small patch to fix that, I was  
> getting around 2 Gbps performance, although oddly enough, I could only  
> get 2 Gbps with read()/write() then as well...
>
> I'll try and do some tests on a machine that hopefully doesn't have the  
> bottlenecks (and one that uses different NICs)

I guess you should especially check if SG and checksums are on, and it
could depend on a chip within those NICs as well.
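
(Besides "ethtool -k ethX", those bits can also be read programmatically;
a small sketch using the ethtool ioctl, with the interface name just a
placeholder:)

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* read one boolean ethtool setting (ETHTOOL_GSG, ETHTOOL_GTXCSUM, ...) */
static int ethtool_get(int fd, const char *dev, __u32 cmd)
{
	struct ethtool_value eval = { .cmd = cmd };
	struct ifreq ifr;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&eval;

	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		return -1;
	return eval.data;
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	printf("sg: %d tx-csum: %d rx-csum: %d\n",
	       ethtool_get(fd, "eth0", ETHTOOL_GSG),
	       ethtool_get(fd, "eth0", ETHTOOL_GTXCSUM),
	       ethtool_get(fd, "eth0", ETHTOOL_GRXCSUM));
	return 0;
}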

Jarek P. 

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-20 11:01                                                                       ` Jarek Poplawski
@ 2009-01-20 17:16                                                                         ` David Miller
  2009-01-21  9:54                                                                           ` Jarek Poplawski
  2009-01-22  9:04                                                                           ` [PATCH v3] " Jarek Poplawski
  0 siblings, 2 replies; 190+ messages in thread
From: David Miller @ 2009-01-20 17:16 UTC (permalink / raw)
  To: jarkao2
  Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 20 Jan 2009 11:01:44 +0000

> On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote:
> > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > > Good question! Alas I can't check this soon, but if it's really like
> > > this, of course this needs some better idea and rework. (BTW, I'd like
> > > to prevent here as much as possible some strange activities like 1
> > > byte (payload) packets getting full pages without any accounting.)
> > 
> > I believe approach to meet all our goals is to have own network memory
> > allocator, so that each skb could have its payload in the fragments, we
> > would not suffer from the heavy fragmentation and power-of-two overhead
> > for the larger MTUs, have a reserve for the OOM condition and generally
> > do not depend on the main system behaviour.
> 
> 100% right! But I guess we need this current fix for -stable, and I'm
> a bit worried about safety.

Jarek, we already have a page and offset you can use.

It's called sk_sndmsg_page but that is just the (current) name.
Nothing prevents you from reusing it for your purposes here.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-20 17:16                                                                         ` David Miller
@ 2009-01-21  9:54                                                                           ` Jarek Poplawski
  2009-01-22  9:04                                                                           ` [PATCH v3] " Jarek Poplawski
  1 sibling, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-21  9:54 UTC (permalink / raw)
  To: David Miller
  Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Tue, Jan 20, 2009 at 09:16:16AM -0800, David Miller wrote:
...
> Jarek, we already have a page and offset you can use.
> 
> It's called sk_sndmsg_page but that is just the (current) name.
> Nothing prevents you from reusing it for your purposes here.

I'm trying to get some know-how about this field.

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-20 17:16                                                                         ` David Miller
  2009-01-21  9:54                                                                           ` Jarek Poplawski
@ 2009-01-22  9:04                                                                           ` Jarek Poplawski
  2009-01-26  5:22                                                                             ` David Miller
  1 sibling, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-22  9:04 UTC (permalink / raw)
  To: David Miller
  Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Tue, Jan 20, 2009 at 09:16:16AM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 20 Jan 2009 11:01:44 +0000
> 
> > On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote:
> > > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > > > Good question! Alas I can't check this soon, but if it's really like
> > > > this, of course this needs some better idea and rework. (BTW, I'd like
> > > > to prevent here as much as possible some strange activities like 1
> > > > byte (payload) packets getting full pages without any accounting.)
> > > 
> > > I believe approach to meet all our goals is to have own network memory
> > > allocator, so that each skb could have its payload in the fragments, we
> > > would not suffer from the heavy fragmentation and power-of-two overhead
> > > for the larger MTUs, have a reserve for the OOM condition and generally
> > > do not depend on the main system behaviour.
> > 
> > 100% right! But I guess we need this current fix for -stable, and I'm
> > a bit worried about safety.
> 
> Jarek, we already have a page and offset you can use.
> 
> It's called sk_sndmsg_page but that is just the (current) name.
> Nothing prevents you from reusing it for your purposes here.

It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
I used here tcp_sndmsg() way, but I think I'll go back to this question
soon.

Thanks,
Jarek P.

------------> take 3

net: Optimize memory usage when splicing from sockets.

The recent fix of data corruption when splicing from sockets uses
memory very inefficiently, allocating a new page to copy each chunk of
the linear part of the skb. This patch uses the same page until it's
(almost) full by caching the page in the sk_sndmsg_page field.

With changes from David S. Miller <davem@davemloft.net>

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

Tested-by: needed...
---

 net/core/skbuff.c |   45 +++++++++++++++++++++++++++++++++++----------
 1 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2e5f2ca..2e64c1b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1333,14 +1333,39 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 	put_page(spd->pages[i]);
 }
 
-static inline struct page *linear_to_page(struct page *page, unsigned int len,
-					  unsigned int offset)
-{
-	struct page *p = alloc_pages(GFP_KERNEL, 0);
+static inline struct page *linear_to_page(struct page *page, unsigned int *len,
+					  unsigned int *offset,
+					  struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+	struct page *p = sk->sk_sndmsg_page;
+	unsigned int off;
+
+	if (!p) {
+new_page:
+		p = sk->sk_sndmsg_page = alloc_pages(sk->sk_allocation, 0);
+		if (!p)
+			return NULL;
 
-	if (!p)
-		return NULL;
-	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+		off = sk->sk_sndmsg_off = 0;
+		/* hold one ref to this page until it's full */
+	} else {
+		unsigned int mlen;
+
+		off = sk->sk_sndmsg_off;
+		mlen = PAGE_SIZE - off;
+		if (mlen < 64 && mlen < *len) {
+			put_page(p);
+			goto new_page;
+		}
+
+		*len = min_t(unsigned int, *len, mlen);
+	}
+
+	memcpy(page_address(p) + off, page_address(page) + *offset, *len);
+	sk->sk_sndmsg_off += *len;
+	*offset = off;
+	get_page(p);
 
 	return p;
 }
@@ -1349,21 +1374,21 @@ static inline struct page *linear_to_page(struct page *page, unsigned int len,
  * Fill page/offset/length into spd, if it can hold more pages.
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
-				unsigned int len, unsigned int offset,
+				unsigned int *len, unsigned int offset,
 				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
 
 	if (linear) {
-		page = linear_to_page(page, len, offset);
+		page = linear_to_page(page, len, &offset, skb);
 		if (!page)
 			return 1;
 	} else
 		get_page(page);
 
 	spd->pages[spd->nr_pages] = page;
-	spd->partial[spd->nr_pages].len = len;
+	spd->partial[spd->nr_pages].len = *len;
 	spd->partial[spd->nr_pages].offset = offset;
 	spd->nr_pages++;
 
@@ -1405,7 +1430,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
 		/* the linear region may spread across several pages  */
 		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
 
-		if (spd_fill_page(spd, page, flen, poff, skb, linear))
+		if (spd_fill_page(spd, page, &flen, poff, skb, linear))
 			return 1;
 
 		__segment_seek(&page, &poff, &plen, flen);

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19  3:28                                                                         ` David Miller
  2009-01-19  6:11                                                                           ` Willy Tarreau
@ 2009-01-24 21:23                                                                           ` Willy Tarreau
  1 sibling, 0 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-24 21:23 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

Hi David,

On Sun, Jan 18, 2009 at 07:28:15PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 19 Jan 2009 01:42:06 +0100
> 
> > Hi guys,
> > 
> > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > > From: Willy Tarreau <w@1wt.eu>
> > > Date: Fri, 16 Jan 2009 00:44:08 +0100
> > > 
> > > > And BTW feel free to add my Tested-by if you want in case you merge
> > > > this fix.
> > > 
> > > Done, thanks Willy.
> > 
> > Just for the record, I've now re-integrated those changes in a test kernel
> > that I booted on my 10gig machines. I have updated my user-space code in
> > haproxy to run a new series of tests. Even though there is a memcpy(), the
> > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> > 
> >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> >     (3.2 Gbps at 100% CPU without splice)
> > 
> >   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> > 
> >   - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
> >     splice)
> > 
> >   - 10 Gbps at 15% CPU using MTU=9000 with LRO
> 
> Thanks for the numbers.
> 
> We can almost certainly do a lot better, so if you have the time and
> can get some oprofile dumps for these various cases that would be
> useful to us.

Well, I tried to get oprofile to work on those machines, but I'm surely
missing something, as "opreport" always complains :
  opreport error: No sample file found: try running opcontrol --dump
  or specify a session containing sample files

I've straced opcontrol --dump, opcontrol --stop, and I see nothing
creating any file in the samples directory. I thought it would be
opjitconv which would do it, but it's hard to debug, and I haven't
used oprofile for 2-3 years now. I followed exactly the instructions
in the kernel doc, as well as a howto found on the net, but I remain
out of luck. I just see a "complete_dump" file created with only two
bytes. It's not easy to debug on those machines as they're diskless
and PXE-booted from squashfs root images.

However, upon Ingo's suggestion I tried his perfcounters while
running a test at 5 Gbps with haproxy running alone on one core,
and IRQs on the other one. No LRO was used and MTU was 1500.

For instance, kerneltop tells how many CPU cycles are spent in each
function :

# kerneltop -e 0 -d 1 -c 1000000 -C 1

             events         RIP          kernel function
  ______     ______   ________________   _______________

            1864.00 - 00000000f87be000 : init_module    [myri10ge]
            1078.00 - 00000000784a6580 : tcp_read_sock
             901.00 - 00000000784a7840 : tcp_sendpage
             857.00 - 0000000078470be0 : __skb_splice_bits
             617.00 - 00000000784b2260 : tcp_transmit_skb
             604.00 - 000000007849fdf0 : __ip_local_out
             581.00 - 0000000078470460 : __copy_skb_header
             569.00 - 000000007850cac0 : _spin_lock     [myri10ge]
             472.00 - 0000000078185150 : __slab_free
             443.00 - 000000007850cc10 : _spin_lock_bh  [sky2]
             434.00 - 00000000781852e0 : __slab_alloc
             408.00 - 0000000078488620 : __qdisc_run
             355.00 - 0000000078185b20 : kmem_cache_free
             352.00 - 0000000078472950 : __alloc_skb
             348.00 - 00000000784705f0 : __skb_clone
             337.00 - 0000000078185870 : kmem_cache_alloc       [myri10ge]
             302.00 - 0000000078472150 : skb_release_data
             297.00 - 000000007847bcf0 : dev_queue_xmit
             285.00 - 00000000784a08f0 : ip_queue_xmit

You should ignore the lines init_module, _spin_lock, etc, in fact all
lines indicating a module, as there's something wrong there, they
always report the name of the last module loaded, and the name changes
when the module is unloaded.

I also tried dumping the number of cache misses per function :

------------------------------------------------------------------------------
 KernelTop:    1146 irqs/sec  [NMI, 1000 cache-misses],  (all, cpu: 1)
------------------------------------------------------------------------------

             events         RIP          kernel function
  ______     ______   ________________   _______________

            7512.00 - 00000000784a6580 : tcp_read_sock
            2716.00 - 0000000078470be0 : __skb_splice_bits
            2516.00 - 00000000784a08f0 : ip_queue_xmit
             986.00 - 00000000784a7840 : tcp_sendpage
             587.00 - 00000000781a40c0 : sys_splice
             462.00 - 00000000781856b0 : kfree  [myri10ge]
             451.00 - 000000007849fdf0 : __ip_local_out
             242.00 - 0000000078185b20 : kmem_cache_free
             205.00 - 00000000784b1b90 : __tcp_select_window
             153.00 - 000000007850cac0 : _spin_lock     [myri10ge]
             142.00 - 000000007849f6a0 : ip_fragment
             119.00 - 0000000078188950 : rw_verify_area
             117.00 - 00000000784a99e0 : tcp_rcv_space_adjust
             107.00 - 000000007850cc10 : _spin_lock_bh  [sky2]
             104.00 - 00000000781852e0 : __slab_alloc

There are other options to combine events but I don't understand
the output when I enable it.

I think that when properly used, these tools can report useful
information.

Regards,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-19 20:59                                                                                   ` David Miller
  2009-01-19 21:24                                                                                     ` Herbert Xu
@ 2009-01-25 21:03                                                                                     ` Willy Tarreau
  2009-01-26  7:59                                                                                       ` Jarek Poplawski
  1 sibling, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-25 21:03 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

Hi David,

On Mon, Jan 19, 2009 at 12:59:41PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 19 Jan 2009 21:19:24 +1100
> 
> > On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> > > 
> > > Actually, I see, the myri10ge driver does put up to
> > > 64 bytes of the initial packet into the linear area.
> > > If the IPV4 + TCP headers are less than this, you will
> > > hit the corruption case even with the myri10ge driver.
> > 
> > I thought splice only mapped the payload areas, no?
> 
> And the difference between 64 and IPV4+TCP header len becomes the
> payload, don't you see? :-)
> 
> myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
> the linear area, unconditionally.  So a small number of payload bytes
> can in fact end up there.
> 
> Otherwise Willy could never have triggered this bug.

Just FWIW, I've updated my tools in order to perform content checks more
easily. I cannot reproduce the issue at all with the myri10ge NICs, neither
with large frames nor with tiny ones (8 bytes).

However, I have noticed that the load is now sensitive to the number of
concurrent sessions. I'm using 2.6.29-rc2 with the perfcounters patches,
and I'm not sure whether the difference in behaviour came with the data
corruption fixes or with the new kernel (which has some profiling options
turned on). Basically, below 800-1000 concurrent sessions, I have no
problem reaching 10 Gbps with LRO and MTU=1500, with about 60% CPU. Above
this number of sessions, the CPU suddenly jumps to 100% and the data rate
drops to about 6.7 Gbps.

I spent a long time trying to figure out what it was, but I think I
have found it. Kerneltop reports different figures above and below the
limit.

1) below the limit :

            1429.00 - 00000000784a7840 : tcp_sendpage
             561.00 - 00000000784a6580 : tcp_read_sock
             485.00 - 00000000f87e13c0 : myri10ge_xmit  [myri10ge]
             433.00 - 00000000781a40c0 : sys_splice
             411.00 - 00000000784a6eb0 : tcp_poll
             344.00 - 000000007847bcf0 : dev_queue_xmit
             342.00 - 0000000078470be0 : __skb_splice_bits
             319.00 - 0000000078472950 : __alloc_skb
             310.00 - 0000000078185870 : kmem_cache_alloc
             285.00 - 00000000784b2260 : tcp_transmit_skb
             285.00 - 000000007850cac0 : _spin_lock
             250.00 - 00000000781afda0 : sys_epoll_ctl
             238.00 - 000000007810334c : system_call
             232.00 - 000000007850ac20 : schedule
             230.00 - 000000007850cc10 : _spin_lock_bh
             222.00 - 00000000784705f0 : __skb_clone
             220.00 - 000000007850cbc0 : _spin_lock_irqsave
             213.00 - 00000000784a08f0 : ip_queue_xmit
             211.00 - 0000000078185ea0 : __kmalloc_track_caller

2) above the limit :

            1778.00 - 00000000784a7840 : tcp_sendpage
            1281.00 - 0000000078472950 : __alloc_skb
             639.00 - 00000000784a6780 : sk_stream_alloc_skb
             507.00 - 0000000078185ea0 : __kmalloc_track_caller
             484.00 - 0000000078185870 : kmem_cache_alloc
             476.00 - 00000000784a6580 : tcp_read_sock
             451.00 - 00000000784a08f0 : ip_queue_xmit
             421.00 - 00000000f87e13c0 : myri10ge_xmit  [myri10ge]
             374.00 - 00000000781852e0 : __slab_alloc
             361.00 - 00000000781a40c0 : sys_splice
             273.00 - 0000000078470be0 : __skb_splice_bits
             231.00 - 000000007850cac0 : _spin_lock
             206.00 - 0000000078168b30 : get_pageblock_flags_group
             165.00 - 00000000784a0260 : ip_finish_output
             165.00 - 00000000784b2260 : tcp_transmit_skb
             161.00 - 0000000078470460 : __copy_skb_header
             153.00 - 000000007816d6d0 : put_page
             144.00 - 000000007850cbc0 : _spin_lock_irqsave
             137.00 - 0000000078189be0 : fget_light

The memory allocation clearly is the culprit here. I'll try Jarek's
patch which reduces memory allocation to see if that changes something,
as I'm sure we can do fairly better, given how it behaves with limited
sessions.

Regards,
Willy

PS: this thread is long, if some of the people in CC want to get off
    the thread, please complain.


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-22  9:04                                                                           ` [PATCH v3] " Jarek Poplawski
@ 2009-01-26  5:22                                                                             ` David Miller
  2009-01-27  7:11                                                                               ` Herbert Xu
  0 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-26  5:22 UTC (permalink / raw)
  To: jarkao2
  Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu, 22 Jan 2009 09:04:42 +0000

> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
> I used here tcp_sndmsg() way, but I think I'll go back to this question
> soon.

Indeed, it is something to look into, as well as locking.

I'll try to find some time for this, thanks Jarek.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-25 21:03                                                                                     ` Willy Tarreau
@ 2009-01-26  7:59                                                                                       ` Jarek Poplawski
  2009-01-26  8:12                                                                                         ` Willy Tarreau
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-26  7:59 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Sun, Jan 25, 2009 at 10:03:25PM +0100, Willy Tarreau wrote:
...
> The memory allocation clearly is the culprit here. I'll try Jarek's
> patch which reduces memory allocation to see if that changes something,
> as I'm sure we can do fairly better, given how it behaves with limited
> sessions.

I think you are right, but I wonder if it's not better to wait with
more profiling until this splicing is really redone.

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] tcp: splice as many packets as possible at once
  2009-01-26  7:59                                                                                       ` Jarek Poplawski
@ 2009-01-26  8:12                                                                                         ` Willy Tarreau
  0 siblings, 0 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-01-26  8:12 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Mon, Jan 26, 2009 at 07:59:33AM +0000, Jarek Poplawski wrote:
> On Sun, Jan 25, 2009 at 10:03:25PM +0100, Willy Tarreau wrote:
> ...
> > The memory allocation clearly is the culprit here. I'll try Jarek's
> > patch which reduces memory allocation to see if that changes something,
> > as I'm sure we can do fairly better, given how it behaves with limited
> > sessions.
> 
> I think you are right, but I wonder if it's not better to wait with
> more profiling until this splicing is really redone.

Agreed. In fact I have run a few tests even with your patch and I could
see no obvious figure starting to appear. I'll go back to 2.6.27-stable
+ the fixes because I don't really know what I'm testing under 2.6.29-rc2.
Once I'm able to get reproducible reference numbers, I'll test again
with the latest kernel.

Regards,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-20 10:31                                                                     ` Evgeniy Polyakov
  2009-01-20 11:01                                                                       ` Jarek Poplawski
@ 2009-01-26  8:20                                                                       ` Jarek Poplawski
  2009-01-26 21:21                                                                         ` Evgeniy Polyakov
  1 sibling, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-26  8:20 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On 20-01-2009 11:31, Evgeniy Polyakov wrote:
...
> I believe approach to meet all our goals is to have own network memory
> allocator, so that each skb could have its payload in the fragments, we
> would not suffer from the heavy fragmentation and power-of-two overhead
> for the larger MTUs, have a reserve for the OOM condition and generally
> do not depend on the main system behaviour.
> 
> I will resurrect to some point my network allocator to check how things
> go in the modern environment, if no one will beat this idea first :)
> 
> 1. Network (tree) allocator
> http://www.ioremap.net/projects/nta

I looked at this a bit, but alas I didn't find much for this Herbert's
idea of payload in fragments/pages. Maybe some kind of API RFC is
needed before this resurrection?

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-26  8:20                                                                       ` [PATCH v2] " Jarek Poplawski
@ 2009-01-26 21:21                                                                         ` Evgeniy Polyakov
  2009-01-27  6:10                                                                           ` David Miller
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-01-26 21:21 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

Hi Jarek.

On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > 1. Network (tree) allocator
> > http://www.ioremap.net/projects/nta
> 
> I looked at this a bit, but alas I didn't find much for this Herbert's
> idea of payload in fragments/pages. Maybe some kind of API RFC is
> needed before this resurrection?

Basic idea is to steal some (probably a lot) pages from the slab
allocator and put network buffers there without strict need for
power-of-two alignment and possible wraps when we add skb_shared_info at
the end, so that old e1000 driver required order-4 allocations for the
jumbo frames. We can do that in alloc_skb() and friends and put returned
buffers into skb's fraglist and updated reference counters for those
pages; and with additional copy of the network headers into skb->head.
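
To make the intent concrete, a toy sketch of that kind of carving (all
names made up, nothing to do with NTA's real interfaces): buffers are cut
out of whole pages at arbitrary offsets, and plain page refcounting frees
the page once the last buffer using it is gone (basically the
sk_sndmsg_page trick from this thread, just not tied to a single socket):

struct page_pool {
	struct page	*page;	/* page we are currently carving from */
	unsigned int	off;	/* first free byte in that page */
};

/* hand out a len-byte chunk inside a page; each chunk owns one page
 * reference and releases it with put_page() when done */
static void *pool_alloc(struct page_pool *pool, unsigned int len,
			struct page **pagep, unsigned int *offp, gfp_t gfp)
{
	if (len > PAGE_SIZE)
		return NULL;

	if (!pool->page || pool->off + len > PAGE_SIZE) {
		if (pool->page)
			put_page(pool->page);	/* drop the pool's own reference */
		pool->page = alloc_page(gfp);
		if (!pool->page)
			return NULL;
		pool->off = 0;
	}

	*pagep = pool->page;
	*offp = pool->off;
	get_page(pool->page);		/* this reference belongs to the chunk */
	pool->off = ALIGN(pool->off + len, SMP_CACHE_BYTES);

	return page_address(*pagep) + *offp;
}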

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-26 21:21                                                                         ` Evgeniy Polyakov
@ 2009-01-27  6:10                                                                           ` David Miller
  2009-01-27  7:40                                                                             ` Jarek Poplawski
  2009-01-27 18:42                                                                             ` David Miller
  0 siblings, 2 replies; 190+ messages in thread
From: David Miller @ 2009-01-27  6:10 UTC (permalink / raw)
  To: zbr
  Cc: jarkao2, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Tue, 27 Jan 2009 00:21:30 +0300

> Hi Jarek.
> 
> On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > > 1. Network (tree) allocator
> > > http://www.ioremap.net/projects/nta
> > 
> > I looked at this a bit, but alas I didn't find much for this Herbert's
> > idea of payload in fragments/pages. Maybe some kind of API RFC is
> > needed before this resurrection?
> 
> Basic idea is to steal some (probably a lot) pages from the slab
> allocator and put network buffers there without strict need for
> power-of-two alignment and possible wraps when we add skb_shared_info at
> the end, so that old e1000 driver required order-4 allocations for the
> jumbo frames. We can do that in alloc_skb() and friends and put returned
> buffers into skb's fraglist and updated reference counters for those
> pages; and with additional copy of the network headers into skb->head.

We are going back and forth saying the same thing, I think :-)
(BTW, I think NTA is cool and we might do something like that
eventually)

The basic thing we have to do is make the drivers receive into
pages, and then slide the network headers (only) into the linear
SKB data area.

Even for drivers like NIU and myri10ge that do this, they only
use heuristics or some fixed minimum to decide how much to
move to the linear area.

Result?  Some data payload bits end up there because it overshoots.

Since we have pskb_may_pull() calls everywhere necessary, which
means not in eth_type_trans(), we could just make these drivers
(and future drivers converted to operate in this way) only
put the ethernet headers there initially.

Then the rest of the stack will take care of moving the network
and transport payloads there, as necessary.
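
To illustrate the receive side (not taken from any particular driver,
names and error handling simplified): the NIC has DMAed the frame into
"page" at "offset", and only the link-layer header is copied into the
linear area before the skb goes up the stack; pskb_may_pull() then moves
network/transport headers there only when someone actually needs them.

static struct sk_buff *rx_build_skb(struct net_device *dev,
				    struct page *page, unsigned int offset,
				    unsigned int len)
{
	struct sk_buff *skb;

	skb = netdev_alloc_skb(dev, ETH_HLEN + NET_IP_ALIGN);
	if (!skb)
		return NULL;
	skb_reserve(skb, NET_IP_ALIGN);

	/* only the ethernet header goes into skb->data */
	memcpy(skb_put(skb, ETH_HLEN), page_address(page) + offset, ETH_HLEN);

	/* everything else stays in the page; the frag takes over the
	 * page reference the driver already holds */
	skb_fill_page_desc(skb, 0, page, offset + ETH_HLEN, len - ETH_HLEN);
	skb->data_len  = len - ETH_HLEN;
	skb->len      += len - ETH_HLEN;
	skb->truesize += PAGE_SIZE;

	skb->protocol = eth_type_trans(skb, dev);
	return skb;
}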

I bet it won't even hurt latency or routing/firewall performance.

I did test this with the NIU driver at one point, and it did not
change TCP latency nor throughput at all even at 10g speeds.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-26  5:22                                                                             ` David Miller
@ 2009-01-27  7:11                                                                               ` Herbert Xu
  2009-01-27  7:54                                                                                 ` Jarek Poplawski
  2009-02-01  8:41                                                                                 ` David Miller
  0 siblings, 2 replies; 190+ messages in thread
From: Herbert Xu @ 2009-01-27  7:11 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

David Miller <davem@davemloft.net> wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Thu, 22 Jan 2009 09:04:42 +0000
> 
>> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
>> I used here tcp_sndmsg() way, but I think I'll go back to this question
>> soon.
> 
> Indeed, it is something to look into, as well as locking.
> 
> I'll try to find some time for this, thanks Jarek.

After a quick look it seems to be OK to me.  The code in the patch
is called from tcp_splice_read, which holds the socket lock.  So as
long as the patch uses the usual TCP convention it should work.
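
For reference, the path is roughly (so linear_to_page() runs with the
socket lock held, just like the tcp_sendmsg() user of sk_sndmsg_page):

	tcp_splice_read()
		lock_sock(sk);
		__tcp_splice_read()
			tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv)
				skb_splice_bits() -> spd_fill_page() -> linear_to_page()
		release_sock(sk);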

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-27  6:10                                                                           ` David Miller
@ 2009-01-27  7:40                                                                             ` Jarek Poplawski
  2009-01-30 21:42                                                                               ` David Miller
  2009-01-27 18:42                                                                             ` David Miller
  1 sibling, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-27  7:40 UTC (permalink / raw)
  To: David Miller
  Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Mon, Jan 26, 2009 at 10:10:56PM -0800, David Miller wrote:
> From: Evgeniy Polyakov <zbr@ioremap.net>
> Date: Tue, 27 Jan 2009 00:21:30 +0300
> 
> > Hi Jarek.
> > 
> > On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > > > 1. Network (tree) allocator
> > > > http://www.ioremap.net/projects/nta
> > > 
> > > I looked at this a bit, but alas I didn't find much for this Herbert's
> > > idea of payload in fragments/pages. Maybe some kind of API RFC is
> > > needed before this resurrection?
> > 
> > Basic idea is to steal some (probably a lot) pages from the slab
> > allocator and put network buffers there without strict need for
> > power-of-two alignment and possible wraps when we add skb_shared_info at
> > the end, so that old e1000 driver required order-4 allocations for the
> > jumbo frames. We can do that in alloc_skb() and friends and put returned
> > buffers into skb's fraglist and updated reference counters for those
> > pages; and with additional copy of the network headers into skb->head.

I think the main problem is to respect put_page() more, and maybe you
mean to add this to your allocator too, but using slab pages for this
looks a bit complex to me; I may be missing something.

> We are going back and forth saying the same thing, I think :-)
> (BTW, I think NTA is cool and we might do something like that
> eventually)
> 
> The basic thing we have to do is make the drivers receive into
> pages, and then slide the network headers (only) into the linear
> SKB data area.

As a matter of fact, I wonder if these headers should be always
separated. Their "chunk" could be refcounted as well, I guess.

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-27  7:11                                                                               ` Herbert Xu
@ 2009-01-27  7:54                                                                                 ` Jarek Poplawski
  2009-01-27 10:09                                                                                   ` Herbert Xu
  2009-02-01  8:41                                                                                 ` David Miller
  1 sibling, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-27  7:54 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Tue, Jan 27, 2009 at 06:11:30PM +1100, Herbert Xu wrote:
> David Miller <davem@davemloft.net> wrote:
> > From: Jarek Poplawski <jarkao2@gmail.com>
> > Date: Thu, 22 Jan 2009 09:04:42 +0000
> > 
> >> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
> >> I used here tcp_sndmsg() way, but I think I'll go back to this question
> >> soon.
> > 
> > Indeed, it is something to look into, as well as locking.
> > 
> > I'll try to find some time for this, thanks Jarek.
> 
> After a quick look it seems to be OK to me.  The code in the patch
> is called from tcp_splice_read, which holds the socket lock.  So as
> long as the patch uses the usual TCP convention it should work.

Yes, but ip_append_data() (and skb_append_datato_frags() for
NETIF_F_UFO only, so currently not a problem), uses this differently,
and these pages in sk->sk_sndmsg_page could leak or be used after
kfree. (I didn't track locking in these other places).

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-27  7:54                                                                                 ` Jarek Poplawski
@ 2009-01-27 10:09                                                                                   ` Herbert Xu
  2009-01-27 10:35                                                                                     ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-01-27 10:09 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote:
> 
> Yes, but ip_append_data() (and skb_append_datato_frags() for
> NETIF_F_UFO only, so currently not a problem), uses this differently,
> and these pages in sk->sk_sndmsg_page could leak or be used after
> kfree. (I didn't track locking in these other places).

It'll be freed when the socket is freed so that should be fine.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-27 10:09                                                                                   ` Herbert Xu
@ 2009-01-27 10:35                                                                                     ` Jarek Poplawski
  2009-01-27 10:57                                                                                       ` Jarek Poplawski
  2009-01-27 11:48                                                                                       ` Herbert Xu
  0 siblings, 2 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-27 10:35 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Tue, Jan 27, 2009 at 09:09:58PM +1100, Herbert Xu wrote:
> On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote:
> > 
> > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > and these pages in sk->sk_sndmsg_page could leak or be used after
> > kfree. (I didn't track locking in these other places).
> 
> It'll be freed when the socket is freed so that should be fine.
> 

I don't think so: these places can overwrite sk->sk_sndmsg_page left
after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
pointer without put_page() (they only reference copied chunks and
expect auto freeing). On the other hand, if tcp_sendmsg() reads after
them it could use a pointer after the page is freed, I guess.

Cheers,
Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-27 10:35                                                                                     ` Jarek Poplawski
@ 2009-01-27 10:57                                                                                       ` Jarek Poplawski
  2009-01-27 11:48                                                                                       ` Herbert Xu
  1 sibling, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-27 10:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> On Tue, Jan 27, 2009 at 09:09:58PM +1100, Herbert Xu wrote:
> > On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote:
> > > 
> > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > kfree. (I didn't track locking in these other places).
> > 
> > It'll be freed when the socket is freed so that should be fine.
> > 
> 
> I don't think so: these places can overwrite sk->sk_sndmsg_page left
> after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> pointer without put_page() (they only reference copied chunks and
> expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> them it could use a pointer after the page is freed, I guess.

tcp_v4_destroy_sock() looks vulnerable too.

BTW, skb_append_datato_frags() currently doesn't need to use this
sk->sk_sndmsg_page at all - it doesn't use caching between calls.

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-27 10:35                                                                                     ` Jarek Poplawski
  2009-01-27 10:57                                                                                       ` Jarek Poplawski
@ 2009-01-27 11:48                                                                                       ` Herbert Xu
  2009-01-27 12:16                                                                                         ` Jarek Poplawski
  1 sibling, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-01-27 11:48 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
>
> > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > kfree. (I didn't track locking in these other places).
> > 
> > It'll be freed when the socket is freed so that should be fine.
> 
> I don't think so: these places can overwrite sk->sk_sndmsg_page left
> after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> pointer without put_page() (they only reference copied chunks and
> expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> them it could use a pointer after the page is freed, I guess.

I wasn't referring to the first part of your sentence.  That can't
happen because they're only used for UDP sockets, this is a TCP
socket.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-27 11:48                                                                                       ` Herbert Xu
@ 2009-01-27 12:16                                                                                         ` Jarek Poplawski
  2009-01-27 12:31                                                                                           ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-27 12:16 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> >
> > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > kfree. (I didn't track locking in these other places).
> > > 
> > > It'll be freed when the socket is freed so that should be fine.
> > 
> > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > pointer without put_page() (they only reference copied chunks and
> > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > them it could use a pointer after the page is freed, I guess.
> 
> I wasn't referring to the first part of your sentence.  That can't
> happen because they're only used for UDP sockets, this is a TCP
> socket.

Do you mean this part from ip_append_data() isn't used for TCP?:

1007
1008                         if (page && (left = PAGE_SIZE - off) > 0) {
1009                                 if (copy >= left)
1010                                         copy = left;
1011                                 if (page != frag->page) {
1012                                         if (i == MAX_SKB_FRAGS) {
1013                                                 err = -EMSGSIZE;
1014                                                 goto error;
1015                                         }
1016                                         get_page(page);
1017                                         skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
1018                                         frag = &skb_shinfo(skb)->frags[i];
1019                                 }
1020                         } else if (i < MAX_SKB_FRAGS) {
1021                                 if (copy > PAGE_SIZE)
1022                                         copy = PAGE_SIZE;
1023                                 page = alloc_pages(sk->sk_allocation, 0);
1024                                 if (page == NULL)  {
1025                                         err = -ENOMEM;
1026                                         goto error;
1027                                 }
1028                                 sk->sk_sndmsg_page = page;
1029                                 sk->sk_sndmsg_off = 0;
1030
1031                                 skb_fill_page_desc(skb, i, page, 0, 0);
1032                                 frag = &skb_shinfo(skb)->frags[i];
1033                         } else {

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-27 12:16                                                                                         ` Jarek Poplawski
@ 2009-01-27 12:31                                                                                           ` Jarek Poplawski
  2009-01-27 17:06                                                                                             ` David Miller
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-27 12:31 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote:
> On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> > >
> > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > > kfree. (I didn't track locking in these other places).
> > > > 
> > > > It'll be freed when the socket is freed so that should be fine.
> > > 
> > > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > > pointer without put_page() (they only reference copied chunks and
> > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > > them it could use a pointer after the page is freed, I guess.
> > 
> > I wasn't referring to the first part of your sentence.  That can't
> > happen because they're only used for UDP sockets, this is a TCP
> > socket.
> 
> Do you mean this part from ip_append_data() isn't used for TCP?:

Actually, the beginning part of ip_append_data() should be enough too.
So I guess I missed your point...

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-27 12:31                                                                                           ` Jarek Poplawski
@ 2009-01-27 17:06                                                                                             ` David Miller
  2009-01-28  8:10                                                                                               ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-27 17:06 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 27 Jan 2009 12:31:11 +0000

> On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote:
> > On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> > > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> > > >
> > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > > > kfree. (I didn't track locking in these other places).
> > > > > 
> > > > > It'll be freed when the socket is freed so that should be fine.
> > > > 
> > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > > > pointer without put_page() (they only reference copied chunks and
> > > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > > > them it could use a pointer after the page is freed, I guess.
> > > 
> > > I wasn't referring to the first part of your sentence.  That can't
> > > happen because they're only used for UDP sockets, this is a TCP
> > > socket.
> > 
> > Do you mean this part from ip_append_data() isn't used for TCP?:
> 
> Actually, the beginning part of ip_append_data() should be enough too.
> So I guess I missed your point...

TCP doesn't use ip_append_data(), period.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-27  6:10                                                                           ` David Miller
  2009-01-27  7:40                                                                             ` Jarek Poplawski
@ 2009-01-27 18:42                                                                             ` David Miller
  1 sibling, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-27 18:42 UTC (permalink / raw)
  To: zbr
  Cc: jarkao2, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: David Miller <davem@davemloft.net>
Date: Mon, 26 Jan 2009 22:10:56 -0800 (PST)

> Even for drivers like NIU and myri10ge that do this, they only
> use heuristics or some fixed minimum to decide how much to
> move to the linear area.
> 
> Result?  Some data payload bits end up there because it overshoots.
 ...
> I did test this with the NIU driver at one point, and it did not
> change TCP latency nor throughput at all even at 10g speeds.

As a followup, it turns out that NIU right now does this properly.

It only pulls a maximum of ETH_HLEN into the linear area before giving
the SKB to netif_receive_skb().

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-27 17:06                                                                                             ` David Miller
@ 2009-01-28  8:10                                                                                               ` Jarek Poplawski
  0 siblings, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-01-28  8:10 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Tue, Jan 27, 2009 at 09:06:51AM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 27 Jan 2009 12:31:11 +0000
> 
> > On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote:
> > > On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> > > > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> > > > >
> > > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > > > > kfree. (I didn't track locking in these other places).
> > > > > > 
> > > > > > It'll be freed when the socket is freed so that should be fine.
> > > > > 
> > > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > > > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > > > > pointer without put_page() (they only reference copied chunks and
> > > > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > > > > them it could use a pointer after the page is freed, I guess.
> > > > 
> > > > I wasn't referring to the first part of your sentence.  That can't
> > > > happen because they're only used for UDP sockets, this is a TCP
> > > > socket.
> > > 
> > > Do you mean this part from ip_append_data() isn't used for TCP?:
> > 
> > Actually, the beginning part of ip_append_data() should be enough too.
> > So I guess I missed your point...
> 
> TCP doesn't use ip_append_data(), period.

Hmm... I see: TCP does use ip_send_reply(), so ip_append_data() too,
but with a special socket.

Thanks for the explanations,
Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-27  7:40                                                                             ` Jarek Poplawski
@ 2009-01-30 21:42                                                                               ` David Miller
  2009-01-30 21:59                                                                                 ` Willy Tarreau
                                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 190+ messages in thread
From: David Miller @ 2009-01-30 21:42 UTC (permalink / raw)
  To: jarkao2
  Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 27 Jan 2009 07:40:48 +0000

> I think the main problem is to respect put_page() more, and maybe you
> mean to add this to your allocator too, but using slab pages for this
> looks a bit complex to me, but I can miss something.

Hmmm, Jarek's comments here made me realize that we might be
able to do some hack with cooperation with SLAB.

Basically the idea is that if the page count of a SLAB page
is greater than one, SLAB will not use that page for new
allocations.

It's cheesy and the SLAB developers will likely barf at the
idea, but it would certainly work.

Back to real life, I think long term the thing to do is to just do the
cached page allocator thing we'll be doing after Jarek's socket page
patch is integrated, and for best performance the driver has to
receive its data into pages, only explicitly pulling the ethernet
header into the linear area, like NIU does.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-30 21:42                                                                               ` David Miller
@ 2009-01-30 21:59                                                                                 ` Willy Tarreau
  2009-01-30 22:03                                                                                   ` David Miller
  2009-01-30 22:16                                                                                 ` Herbert Xu
  2009-02-03 11:38                                                                                 ` Nick Piggin
  2 siblings, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-30 21:59 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, zbr, herbert, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 27 Jan 2009 07:40:48 +0000
> 
> > I think the main problem is to respect put_page() more, and maybe you
> > mean to add this to your allocator too, but using slab pages for this
> > looks a bit complex to me, but I can miss something.
> 
> Hmmm, Jarek's comments here made me realize that we might be
> able to do some hack with cooperation with SLAB.
> 
> Basically the idea is that if the page count of a SLAB page
> is greater than one, SLAB will not use that page for new
> allocations.

I thought it was the standard behaviour. That may explain why I
did not understand much of previous discussion then :-/

> It's cheesy and the SLAB developers will likely barf at the
> idea, but it would certainly work.

Maybe that would be enough as a definitive fix for a stable
release, so that we can go on with deeper changes in newer
versions ?

> Back to real life, I think long term the thing to do is to just do the
> cached page allocator thing we'll be doing after Jarek's socket page
> patch is integrated, and for best performance the driver has to
> receive it's data into pages, only explicitly pulling the ethernet
> header into the linear area, like NIU does.

Are there NICs out there able to do that themselves or does the
driver need to rely on complex hacks in order to achieve this ?

Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-30 21:59                                                                                 ` Willy Tarreau
@ 2009-01-30 22:03                                                                                   ` David Miller
  2009-01-30 22:13                                                                                     ` Willy Tarreau
  0 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-01-30 22:03 UTC (permalink / raw)
  To: w
  Cc: jarkao2, zbr, herbert, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 30 Jan 2009 22:59:20 +0100

> On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> > It's cheesy and the SLAB developers will likely barf at the
> > idea, but it would certainly work.
> 
> Maybe that would be enough as a definitive fix for a stable
> release, so that we can go on with deeper changes in newer
> versions ?

Such a check could have performance ramifications. I wouldn't
risk it, and I already intend to push Jarek's page allocator
splice fix back to -stable eventually.

> > Back to real life, I think long term the thing to do is to just do the
> > cached page allocator thing we'll be doing after Jarek's socket page
> > patch is integrated, and for best performance the driver has to
> > receive it's data into pages, only explicitly pulling the ethernet
> > header into the linear area, like NIU does.
> 
> Are there NICs out there able to do that themselves or does the
> driver need to rely on complex hacks in order to achieve this ?

Any NIC, even the dumbest ones, can be made to receive into pages.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-30 22:03                                                                                   ` David Miller
@ 2009-01-30 22:13                                                                                     ` Willy Tarreau
  2009-01-30 22:15                                                                                       ` David Miller
  0 siblings, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-01-30 22:13 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, zbr, herbert, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Fri, Jan 30, 2009 at 02:03:46PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Fri, 30 Jan 2009 22:59:20 +0100
> 
> > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> > > It's cheesy and the SLAB developers will likely barf at the
> > > idea, but it would certainly work.
> > 
> > Maybe that would be enough as a definitive fix for a stable
> > release, so that we can go on with deeper changes in newer
> > versions ?
> 
> Such a check could have performance ramifications, I wouldn't
> risk it and already I intend to push Jarek's page allocator
> splice fix back to -stable eventually.

OK.

> > > Back to real life, I think long term the thing to do is to just do the
> > > cached page allocator thing we'll be doing after Jarek's socket page
> > > patch is integrated, and for best performance the driver has to
> > > receive it's data into pages, only explicitly pulling the ethernet
> > > header into the linear area, like NIU does.
> > 
> > Are there NICs out there able to do that themselves or does the
> > driver need to rely on complex hacks in order to achieve this ?
> 
> Any NIC, even the dumbest ones, can be made to receive into pages.

OK I thought that it was not always easy to split between headers
and payload. I know that myri10ge can be configured to receive into
either skbs or pages, but I was not sure about the real implications
behind that.

Thanks
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-30 22:13                                                                                     ` Willy Tarreau
@ 2009-01-30 22:15                                                                                       ` David Miller
  0 siblings, 0 replies; 190+ messages in thread
From: David Miller @ 2009-01-30 22:15 UTC (permalink / raw)
  To: w
  Cc: jarkao2, zbr, herbert, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 30 Jan 2009 23:13:46 +0100

> On Fri, Jan 30, 2009 at 02:03:46PM -0800, David Miller wrote:
> > Any NIC, even the dumbest ones, can be made to receive into pages.
> 
> OK I thought that it was not always easy to split between headers
> and payload. I know that myri10ge can be configured to receive into
> either skbs or pages, but I was not sure about the real implications
> behind that.

For a dumb NIC you wouldn't split, you'd receive directly, the
entire packet, into part of a page.

Then right before you give it to the stack, you pull the ethernet
header from the page into the linear area.
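
For illustration, a rough sketch of that receive path (hypothetical
driver code under the assumptions above, not taken from any existing
driver): the NIC has already DMAed the whole frame into part of a page,
and only the Ethernet header is copied into the linear area before the
skb is handed to the stack.

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/mm.h>

/* 'page', 'offset' and 'len' describe a frame the hardware has already
 * received into part of a page; this function consumes our reference
 * on the page. */
static int rx_page_to_stack(struct net_device *dev, struct page *page,
                            unsigned int offset, unsigned int len)
{
        struct sk_buff *skb = netdev_alloc_skb(dev, ETH_HLEN + NET_IP_ALIGN);

        if (!skb) {
                put_page(page);
                return -ENOMEM;
        }
        skb_reserve(skb, NET_IP_ALIGN);

        /* Pull only the Ethernet header into the linear area. */
        memcpy(skb_put(skb, ETH_HLEN), page_address(page) + offset, ETH_HLEN);

        /* The payload stays in the page; the frag takes over our page
         * reference and is released by the normal skb destruction path. */
        skb_fill_page_desc(skb, 0, page, offset + ETH_HLEN, len - ETH_HLEN);
        skb->len      += len - ETH_HLEN;
        skb->data_len += len - ETH_HLEN;
        skb->truesize += len - ETH_HLEN;

        skb->protocol = eth_type_trans(skb, dev);
        return netif_receive_skb(skb);
}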

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-30 21:42                                                                               ` David Miller
  2009-01-30 21:59                                                                                 ` Willy Tarreau
@ 2009-01-30 22:16                                                                                 ` Herbert Xu
  2009-02-02  8:08                                                                                   ` Jarek Poplawski
  2009-02-03 11:38                                                                                 ` Nick Piggin
  2 siblings, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-01-30 22:16 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> 
> Hmmm, Jarek's comments here made me realize that we might be
> able to do some hack with cooperation with SLAB.
> 
> Basically the idea is that if the page count of a SLAB page
> is greater than one, SLAB will not use that page for new
> allocations.
> 
> It's cheesy and the SLAB developers will likely barf at the
> idea, but it would certainly work.

I'm not going anywhere near that discussion :)

> Back to real life, I think long term the thing to do is to just do the
> cached page allocator thing we'll be doing after Jarek's socket page
> patch is integrated, and for best performance the driver has to
> receive it's data into pages, only explicitly pulling the ethernet
> header into the linear area, like NIU does.

Yes that sounds like the way to go.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v3] tcp: splice as many packets as possible at once
  2009-01-27  7:11                                                                               ` Herbert Xu
  2009-01-27  7:54                                                                                 ` Jarek Poplawski
@ 2009-02-01  8:41                                                                                 ` David Miller
  1 sibling, 0 replies; 190+ messages in thread
From: David Miller @ 2009-02-01  8:41 UTC (permalink / raw)
  To: herbert
  Cc: jarkao2, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 27 Jan 2009 18:11:30 +1100

> David Miller <davem@davemloft.net> wrote:
> > From: Jarek Poplawski <jarkao2@gmail.com>
> > Date: Thu, 22 Jan 2009 09:04:42 +0000
> > 
> >> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
> >> I used here tcp_sndmsg() way, but I think I'll go back to this question
> >> soon.
> > 
> > Indeed, it is something to look into, as well as locking.
> > 
> > I'll try to find some time for this, thanks Jarek.
> 
> After a quick look it seems to be OK to me.  The code in the patch
> is called from tcp_splice_read, which holds the socket lock.  So as
> long as the patch uses the usual TCP convention it should work.

I've tossed Jarek's patch into net-next-2.6, thanks everyone.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-30 22:16                                                                                 ` Herbert Xu
@ 2009-02-02  8:08                                                                                   ` Jarek Poplawski
  2009-02-02  8:18                                                                                     ` David Miller
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-02  8:08 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote:
> On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
...
> > Back to real life, I think long term the thing to do is to just do the
> > cached page allocator thing we'll be doing after Jarek's socket page
> > patch is integrated, and for best performance the driver has to
> > receive it's data into pages, only explicitly pulling the ethernet
> > header into the linear area, like NIU does.
> 
> Yes that sounds like the way to go.

Looks like a lot of changes in drivers, plus: would it work with jumbo
frames? I wonder why the linear area can't be allocated as paged, and
freed with put_page() instead of kfree(skb->head) in skb_release_data().

Actually, at least for some time, both of these methods could be used
(falling back on paged alloc failure), with some ifs in
skb_release_data() and spd_fill_page() (to check whether
linear_to_page() is needed).
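
A rough sketch of what that conditional free might look like (purely
hypothetical; skb_head_is_paged() is an invented marker used only for
illustration, not an existing field or helper in this kernel):

#include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/mm.h>

/* Invented predicate: pretend some flag records whether skb->head was
 * carved out of a page rather than kmalloc()ed. */
static bool skb_head_is_paged(const struct sk_buff *skb);

/* Hypothetical tail of skb_release_data(): release a page-backed linear
 * area with put_page(), fall back to kfree() otherwise. */
static void skb_free_head_sketch(struct sk_buff *skb)
{
        if (skb_head_is_paged(skb))
                put_page(virt_to_page(skb->head));
        else
                kfree(skb->head);
}

(For what it's worth, much later kernels did grow a per-skb marker
along these lines.)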

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-02  8:08                                                                                   ` Jarek Poplawski
@ 2009-02-02  8:18                                                                                     ` David Miller
  2009-02-02  8:43                                                                                       ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-02-02  8:18 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 2 Feb 2009 08:08:55 +0000

> On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote:
> > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> ...
> > > Back to real life, I think long term the thing to do is to just do the
> > > cached page allocator thing we'll be doing after Jarek's socket page
> > > patch is integrated, and for best performance the driver has to
> > > receive it's data into pages, only explicitly pulling the ethernet
> > > header into the linear area, like NIU does.
> > 
> > Yes that sounds like the way to go.
> 
> Looks like a lot of changes in drivers, plus: would it work with jumbo
> frames? I wonder why the linear area can't be allocated as paged, and
> freed with put_page() instead of kfree(skb->head) in skb_release_data().

Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-02  8:18                                                                                     ` David Miller
@ 2009-02-02  8:43                                                                                       ` Jarek Poplawski
  2009-02-03  7:50                                                                                         ` David Miller
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-02  8:43 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 2 Feb 2009 08:08:55 +0000
> 
> > On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote:
> > > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> > ...
> > > > Back to real life, I think long term the thing to do is to just do the
> > > > cached page allocator thing we'll be doing after Jarek's socket page
> > > > patch is integrated, and for best performance the driver has to
> > > > receive it's data into pages, only explicitly pulling the ethernet
> > > > header into the linear area, like NIU does.
> > > 
> > > Yes that sounds like the way to go.
> > 
> > Looks like a lot of changes in drivers, plus: would it work with jumbo
> > frames? I wonder why the linear area can't be allocated as paged, and
> > freed with put_page() instead of kfree(skb->head) in skb_release_data().
> 
> Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful.

I mean allocating chunks of cached pages, similarly to the
sk_sndmsg_page way. I guess a similar problem has to be worked out in
any case. But it seems doing it on the linear area requires fewer
changes in other places.
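
As an illustration of that direction, a minimal sketch of such a
per-socket chunk allocator (a hypothetical helper modeled loosely on
the sk_sndmsg_page/sk_sndmsg_off convention, not existing kernel code;
it assumes size <= PAGE_SIZE):

#include <net/sock.h>
#include <linux/mm.h>

/* Hand out 'size'-byte chunks from a cached per-socket page.  The cache
 * holds one reference on the page; every chunk handed out takes its own
 * reference, so each user can release its chunk with put_page()
 * independently of the cache. */
static void *sk_alloc_chunk(struct sock *sk, unsigned int size,
                            struct page **pagep, unsigned int *offp)
{
        struct page *page = sk->sk_sndmsg_page;

        if (!page || sk->sk_sndmsg_off + size > PAGE_SIZE) {
                page = alloc_page(sk->sk_allocation);
                if (!page)
                        return NULL;
                if (sk->sk_sndmsg_page)
                        put_page(sk->sk_sndmsg_page);
                sk->sk_sndmsg_page = page;      /* cache's own reference */
                sk->sk_sndmsg_off = 0;
        }

        get_page(page);                         /* reference for this chunk */
        *pagep = page;
        *offp = sk->sk_sndmsg_off;
        sk->sk_sndmsg_off += size;

        return page_address(page) + *offp;
}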

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-02  8:43                                                                                       ` Jarek Poplawski
@ 2009-02-03  7:50                                                                                         ` David Miller
  2009-02-03  9:41                                                                                           ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-02-03  7:50 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 2 Feb 2009 08:43:58 +0000

> On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote:
> > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful.
> 
> I mean allocating chunks of cached pages similarly to sk_sndmsg_page
> way. I guess the similar problem is to be worked out in any case. But
> it seems doing it on the linear area requires less changes in other
> places.

This is a very interesting idea, but it has some drawbacks:

1) Just like any other allocator we'll need to find a way to
   handle > PAGE_SIZE allocations, and thus add handling for
   compound pages etc.
 
   And exactly the drivers that want such huge SKB data areas
   on receive should be converted to use scatter gather page
   vectors in order to avoid multi-order pages and thus strains
   on the page allocator.

2) Space wastage and poor packing can be an issue.

   Even with SLAB/SLUB we get poor packing, look at Evegeniy's
   graphs that he made when writing his NTA patches.

Now, when choosing a way to move forward, I'm willing to accept a
little bit of the issues in #2 for the sake of avoiding the
issues in #1 above.

Jarek, note that we can just keep your current splice() copy hacks in
there.  And as a result we can have an easier to handle migration
path.  We just do the page RX allocation conversions in the drivers
where performance really matters, for hardware a lot of people have.

That's a lot smoother and has fewer issues than turning the
system-wide SKB allocator upside down.


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03  7:50                                                                                         ` David Miller
@ 2009-02-03  9:41                                                                                           ` Jarek Poplawski
  2009-02-03 11:10                                                                                             ` Evgeniy Polyakov
                                                                                                               ` (2 more replies)
  0 siblings, 3 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-03  9:41 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Mon, Feb 02, 2009 at 11:50:17PM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 2 Feb 2009 08:43:58 +0000
> 
> > On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote:
> > > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful.
> > 
> > I mean allocating chunks of cached pages similarly to sk_sndmsg_page
> > way. I guess the similar problem is to be worked out in any case. But
> > it seems doing it on the linear area requires less changes in other
> > places.
> 
> This is a very interesting idea, but it has some drawbacks:
> 
> 1) Just like any other allocator we'll need to find a way to
>    handle > PAGE_SIZE allocations, and thus add handling for
>    compound pages etc.
>  
>    And exactly the drivers that want such huge SKB data areas
>    on receive should be converted to use scatter gather page
>    vectors in order to avoid multi-order pages and thus strains
>    on the page allocator.

I guess compound pages are handled by put_page() enough, but I don't
think they should be main argument here, and I agree: scatter gather
should be used where possible.

> 
> 2) Space wastage and poor packing can be an issue.
> 
>    Even with SLAB/SLUB we get poor packing, look at Evegeniy's
>    graphs that he made when writing his NTA patches.

I'm a bit lost here: could you "remind" the way page space would be
used/saved in your paged variant e.g. for ~1500B skbs?

> 
> Now, when choosing a way to move forward, I'm willing to accept a
> little bit of the issues in #2 for the sake of avoiding the
> issues in #1 above.
> 
> Jarek, note that we can just keep your current splice() copy hacks in
> there.  And as a result we can have an easier to handle migration
> path.  We just do the page RX allocation conversions in the drivers
> where performance really matters, for hardware a lot of people have.
> 
> That's a lot smoother and has less issues that converting the system
> wide SKB allocator upside down.
> 

Yes, this looks reasonable. On the other hand, I think it would be
nice to get some opinions of slab folks (incl. Evgeniy) on the expected
efficiency of such a solution. (It seems releasing with put_page() will
always have some cost with delayed reusing and/or waste of space.)

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03  9:41                                                                                           ` Jarek Poplawski
@ 2009-02-03 11:10                                                                                             ` Evgeniy Polyakov
  2009-02-03 11:24                                                                                               ` Herbert Xu
                                                                                                                 ` (2 more replies)
  2009-02-04  7:56                                                                                             ` Jarek Poplawski
  2009-02-06  7:52                                                                                             ` David Miller
  2 siblings, 3 replies; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-02-03 11:10 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Tue, Feb 03, 2009 at 09:41:08AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > 1) Just like any other allocator we'll need to find a way to
> >    handle > PAGE_SIZE allocations, and thus add handling for
> >    compound pages etc.
> >  
> >    And exactly the drivers that want such huge SKB data areas
> >    on receive should be converted to use scatter gather page
> >    vectors in order to avoid multi-order pages and thus strains
> >    on the page allocator.
> 
> I guess compound pages are handled by put_page() enough, but I don't
> think they should be main argument here, and I agree: scatter gather
> should be used where possible.

The problem is allocating them: over time memory becomes quite
fragmented, which makes it impossible to find a big enough page.

NTA tried to solve this by not allowing data to be freed on a
different CPU than the one it was allocated on, contrary to what SLAB
does. Modulo cache coherency improvements, this allows freed chunks to
be combined back into pages, and those in turn into bigger contiguous
areas suitable for drivers which were not converted to use the scatter
gather approach.
I even believe that for some hardware it is the only way to deal
with the jumbo frames.

> > 2) Space wastage and poor packing can be an issue.
> > 
> >    Even with SLAB/SLUB we get poor packing, look at Evegeniy's
> >    graphs that he made when writing his NTA patches.
> 
> I'm a bit lost here: could you "remind" the way page space would be
> used/saved in your paged variant e.g. for ~1500B skbs?

At least in NTA I used cache line alignment for smaller chunks, while
SLAB uses power of two. Thus for 1500 MTU SLAB wastes about 500 bytes
per packet (modulo size of the shared info structure).
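
To put rough, illustrative numbers on that (assuming the usual
power-of-two kmalloc size classes and a 64-byte cache line): a
1500-byte frame falls into the 2048-byte class, leaving about
2048 - 1500 = 548 bytes of slack per packet even before the shared
info structure is accounted for, whereas rounding up to a cache line
gives 1536 bytes and wastes only 36.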

> Yes, this looks reasonable. On the other hand, I think it would be
> nice to get some opinions of slab folks (incl. Evgeniy) on the expected
> efficiency of such a solution. (It seems releasing with put_page() will
> always have some cost with delayed reusing and/or waste of space.)

Well, my opinion is rather biased here :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 11:10                                                                                             ` Evgeniy Polyakov
@ 2009-02-03 11:24                                                                                               ` Herbert Xu
  2009-02-03 11:49                                                                                                 ` Evgeniy Polyakov
  2009-02-04  0:46                                                                                                 ` David Miller
  2009-02-03 12:36                                                                                               ` Jarek Poplawski
  2009-02-04  0:46                                                                                               ` David Miller
  2 siblings, 2 replies; 190+ messages in thread
From: Herbert Xu @ 2009-02-03 11:24 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 02:10:12PM +0300, Evgeniy Polyakov wrote:
>
> I even believe that for some hardware it is the only way to deal
> with the jumbo frames.

Not necessarily.  Even if the hardware can only DMA into contiguous
memory, we can always allocate a sufficient number of contiguous
buffers initially, and then always copy them into fragmented skbs
at receive time.  This way the contiguous buffers are never
depleted.

Granted copying sucks, but this is really because the underlying
hardware is badly designed.  Also copying is way better than
not receiving at all due to memory fragmentation.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-01-30 21:42                                                                               ` David Miller
  2009-01-30 21:59                                                                                 ` Willy Tarreau
  2009-01-30 22:16                                                                                 ` Herbert Xu
@ 2009-02-03 11:38                                                                                 ` Nick Piggin
  2 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-03 11:38 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, zbr, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Saturday 31 January 2009 08:42:27 David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 27 Jan 2009 07:40:48 +0000
>
> > I think the main problem is to respect put_page() more, and maybe you
> > mean to add this to your allocator too, but using slab pages for this
> > looks a bit complex to me, but I can miss something.
>
> Hmmm, Jarek's comments here made me realize that we might be
> able to do some hack with cooperation with SLAB.
>
> Basically the idea is that if the page count of a SLAB page
> is greater than one, SLAB will not use that page for new
> allocations.

Wouldn't your caller need to know what objects are already
allocated in that page too?


> It's cheesy and the SLAB developers will likely barf at the
> idea, but it would certainly work.

It is nasty, yes. Using the page allocator directly seems
like a better approach.

And btw. be careful of using page->_count for anything, due
to speculative page references... basically it is useful only
to test zero or non-zero refcount. If designing a new scheme
for the network layer, it would be nicer to begin by using
say _mapcount or private or some other field in there for a
refcount (and I have a patch to avoid the atomic
put_page_testzero in page freeing for a caller that does their
own refcounting, so don't fear that extra overhead too much :)).



^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 11:24                                                                                               ` Herbert Xu
@ 2009-02-03 11:49                                                                                                 ` Evgeniy Polyakov
  2009-02-03 11:53                                                                                                   ` Herbert Xu
  2009-02-03 13:05                                                                                                   ` david
  2009-02-04  0:46                                                                                                 ` David Miller
  1 sibling, 2 replies; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-02-03 11:49 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 10:24:31PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > I even believe that for some hardware it is the only way to deal
> > with the jumbo frames.
> 
> Not necessarily.  Even if the hardware can only DMA into contiguous
> memory, we can always allocate a sufficient number of contiguous
> buffers initially, and then always copy them into fragmented skbs
> at receive time.  This way the contiguous buffers are never
> depleted.

How many such preallocated frames are enough? Is it enough to have
all sockets' recv buffer sizes divided by the MTU size? Or just some
of them, or... That will work, but there are way too many corner
cases.

> Granted copying sucks, but this is really because the underlying
> hardware is badly designed.  Also copying is way better than
> not receiving at all due to memory fragmentation.

Maybe just do not allow jumbo frames when memory is fragmented enough,
and fall back to a smaller MTU in this case? With LRO/GRO there should
not be that much overhead compared to multiple-page copies.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 11:49                                                                                                 ` Evgeniy Polyakov
@ 2009-02-03 11:53                                                                                                   ` Herbert Xu
  2009-02-03 12:07                                                                                                     ` Evgeniy Polyakov
  2009-02-03 13:05                                                                                                   ` david
  1 sibling, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-02-03 11:53 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 02:49:44PM +0300, Evgeniy Polyakov wrote:
> How many such preallocated frames is enough? Does it enough to have all
> sockets recv buffer sizes divided by the MTU size? Or just some of them,
> or... That will work but there are way too many corner cases.

Easy, the driver is already allocating them right now so we don't
have to change a thing :)

All we have to do is change the refill mechanism to always allocate
a replacement skb in the rx path, and if that fails, allocate a
fragmented skb instead and copy the received data into it so that
the contiguous skb can be reused.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 11:53                                                                                                   ` Herbert Xu
@ 2009-02-03 12:07                                                                                                     ` Evgeniy Polyakov
  2009-02-03 12:12                                                                                                       ` Herbert Xu
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-02-03 12:07 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 10:53:13PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > How many such preallocated frames is enough? Does it enough to have all
> > sockets recv buffer sizes divided by the MTU size? Or just some of them,
> > or... That will work but there are way too many corner cases.
> 
> Easy, the driver is already allocating them right now so we don't
> have to change a thing :)

How many? A hundred or so descriptors (or even several thousand) -
this really does not scale for somewhat loaded IO servers; that's
why we frequently get questions about why dmesg is filled with order-3
and higher allocation failure dumps.

> All we have to do is change the refill mechanism to always allocate
> a replacement skb in the rx path, and if that fails, allocate a
> fragmented skb instead and copy the received data into it so that
> the contiguous skb can be reused.

Having a 'reserve' skb per socket is a good idea, but what if the
number of sockets is way too big?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:07                                                                                                     ` Evgeniy Polyakov
@ 2009-02-03 12:12                                                                                                       ` Herbert Xu
  2009-02-03 12:18                                                                                                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-02-03 12:12 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 03:07:15PM +0300, Evgeniy Polyakov wrote:
>
> How many? A hundred or so descriptors (or even several thousands) -
> this really does not scale for the somewhat loaded IO servers, that's
> why we frequently get questions why dmesg is filler with order-3 and
> higher allocation failure dumps.

I think you've misunderstood my suggested scheme.

I'm suggesting that we keep the driver initialisation path as is,
so however many skb's the driver is allocating at open() time
remains unchanged.  Usually this would be the number of entries
on the ring buffer.  We can't do any better than that since if
the hardware can't do SG then you'll just have to find this many
contiguous buffers.

The only change we need to make is at receive time.  Instead of
always pushing the received skb into the stack, we should try to
allocate a linear replacement skb, and if that fails, allocate
a fragmented skb and copy the data into it.  That way we can
always push a linear skb back into the ring buffer.
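
A rough sketch of that receive-time decision (hypothetical driver
code; rx_refill_or_copy() and build_paged_copy() are invented names,
and error handling is reduced to the minimum):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/kernel.h>
#include <linux/mm.h>

/* Copy 'len' bytes of a linear skb into a new skb whose payload lives
 * in individually allocated page fragments. */
static struct sk_buff *build_paged_copy(struct net_device *dev,
                                        const struct sk_buff *src,
                                        unsigned int len)
{
        struct sk_buff *skb = netdev_alloc_skb(dev, ETH_HLEN);
        unsigned int off = ETH_HLEN;
        int i = 0;

        if (!skb)
                return NULL;

        /* Keep the Ethernet header linear ... */
        skb_copy_bits(src, 0, skb_put(skb, ETH_HLEN), ETH_HLEN);

        /* ... and put the payload into page fragments. */
        while (off < len) {
                unsigned int chunk = min_t(unsigned int, len - off, PAGE_SIZE);
                struct page *page = alloc_page(GFP_ATOMIC);

                if (!page) {
                        kfree_skb(skb);
                        return NULL;
                }
                skb_copy_bits(src, off, page_address(page), chunk);
                skb_fill_page_desc(skb, i++, page, 0, chunk);
                skb->len      += chunk;
                skb->data_len += chunk;
                skb->truesize += chunk;
                off += chunk;
        }
        return skb;
}

/* 'rx_skb' is the linear skb the NIC just filled.  Try to refill the
 * ring slot with a fresh linear skb; if that fails, copy the data into
 * a fragmented skb and recycle 'rx_skb', so the pool of contiguous
 * buffers is never depleted.  Returns the skb to hand to the stack. */
static struct sk_buff *rx_refill_or_copy(struct net_device *dev,
                                         struct sk_buff *rx_skb,
                                         unsigned int len,
                                         struct sk_buff **ring_slot)
{
        struct sk_buff *replacement = netdev_alloc_skb(dev, dev->mtu + ETH_HLEN);

        if (replacement) {
                *ring_slot = replacement;       /* common, copy-free case */
                return rx_skb;
        }

        replacement = build_paged_copy(dev, rx_skb, len);
        *ring_slot = rx_skb;                    /* reuse the linear buffer */
        return replacement;                     /* may be NULL on OOM */
}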

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 13:05                                                                                                   ` david
@ 2009-02-03 12:12                                                                                                     ` Evgeniy Polyakov
  2009-02-03 12:18                                                                                                       ` Herbert Xu
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-02-03 12:12 UTC (permalink / raw)
  To: david
  Cc: Herbert Xu, Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 05:05:14AM -0800, david@lang.hm (david@lang.hm) wrote:
> >Maybe just do not allow jumbo frames when memory is fragmented enough
> >and fallback to the smaller MTU in this case? With LRO/GRO stuff there
> >should be not that much of the overhead compared to multiple-page
> >copies.
> 
> 
> 1. define 'fragmented enough'

When the allocator cannot provide the requested amount of data.

> 2. the packet size was already negotiated on your existing connections, 
> how are you going to change all those on the fly?

I.e. MTU cannot be changed on the fly? Magic world.

> 3. what do you do when a remote system sends you a large packet? drop it 
> on the floor?

We already do just that when a jumbo frame cannot be allocated :)

> having some pool of large buffers to receive into (and copy out of those 
> buffers as quickly as possible) would cause a performance hit when things 
> get bad, but isn't that better than dropping packets?

It is a solution, but I think it will behave noticeably worse than
with a decreased MTU.

> as for the number of buffers to use. make a reasonable guess. if you only 
> have a small number of packets around, use the buffers directly, as you 
> use more of them start copying, as useage climbs attempt to allocate more. 
> if you can't allocate more (and you have all of your existing ones in use) 
> you will have to drop the packet, but at that point are you really in any 
> worse shape than if you didn't have some mechanism to copy out of the 
> large buffers?

That's the main point: how to deal with broken hardware? I think
(though I have no strong numbers) that having 6 packets with 1500 MTU
combined into a GRO/LRO frame will be processed way faster than
copying a 9k MTU frame into 3 pages and processing a single skb.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:12                                                                                                     ` Evgeniy Polyakov
@ 2009-02-03 12:18                                                                                                       ` Herbert Xu
  2009-02-03 12:30                                                                                                         ` Evgeniy Polyakov
  2009-02-03 12:33                                                                                                         ` Nick Piggin
  0 siblings, 2 replies; 190+ messages in thread
From: Herbert Xu @ 2009-02-03 12:18 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: david, Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 03:12:19PM +0300, Evgeniy Polyakov wrote:
> 
> It is a solution, but I think it will behave noticebly worse than
> with decresed MTU.

Not necessarily.  Remember GSO/GRO in essence are just hacks to
get around the fact that we can't increase the MTU to where we
want it to be.  MTU reduces the cost over the entire path while
GRO/GSO only do so for the sender and the receiver.

In other words when given the choice between a larger MTU with
copying or GRO, the larger MTU will probably win anyway as it's
optimising the entire path rather than just the receiver.
 
> That's the main point: how to deal with broken hardware? I think (but
> have no strong numbers though) that having 6 packets with 1500 MTU
> combined into GRO/LRO frame will be processed way faster than copying 9k
> MTU into 3 pages and process single skb.

Please note that with my scheme, you'd only start copying if you
can't allocate a linear skb.  So if memory fragmentation doesn't
happen then there is no copying at all.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:12                                                                                                       ` Herbert Xu
@ 2009-02-03 12:18                                                                                                         ` Evgeniy Polyakov
  2009-02-03 12:25                                                                                                           ` Willy Tarreau
  2009-02-03 12:27                                                                                                           ` Herbert Xu
  0 siblings, 2 replies; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-02-03 12:18 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 11:12:09PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> The only change we need to make is at receive time.  Instead of
> always pushing the received skb into the stack, we should try to
> allocate a linear replacement skb, and if that fails, allocate
> a fragmented skb and copy the data into it.  That way we can
> always push a linear skb back into the ring buffer.

Yes, that was the part about the 'reserve' buffer for the sockets you cut
:)

I agree that this will work and will be better than nothing, but copying
9kb into 3 pages is a rather CPU-hungry operation, and I think (but have
no numbers though) that the system will behave faster if the MTU is
reduced to the standard one.
Another solution is to have a proper allocator which will be able to
defragment the data, if we are talking about alternatives to dropping.

So:
1. copy the whole jumbo skb into fragmented one
2. reduce the MTU
3. rely on the allocator

For 'good' hardware and drivers none of the above is really needed.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:18                                                                                                         ` Evgeniy Polyakov
@ 2009-02-03 12:25                                                                                                           ` Willy Tarreau
  2009-02-03 12:28                                                                                                             ` Herbert Xu
  2009-02-04  0:47                                                                                                             ` David Miller
  2009-02-03 12:27                                                                                                           ` Herbert Xu
  1 sibling, 2 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-02-03 12:25 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Herbert Xu, Jarek Poplawski, David Miller, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 03:18:36PM +0300, Evgeniy Polyakov wrote:
> On Tue, Feb 03, 2009 at 11:12:09PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > The only change we need to make is at receive time.  Instead of
> > always pushing the received skb into the stack, we should try to
> > allocate a linear replacement skb, and if that fails, allocate
> > a fragmented skb and copy the data into it.  That way we can
> > always push a linear skb back into the ring buffer.
> 
> Yes, that was the part about the 'reserve' buffer for the sockets you cut
> :)
> 
> I agree that this will work and will be better than nothing, but copying
> 9kb into 3 pages is a rather CPU-hungry operation, and I think (but have
> no numbers though) that the system will behave faster if the MTU is
> reduced to the standard one.

Well, FWIW, I've always observed better performance with 4k MTU (4080 to
be precise) than with 9K, and I think that the overhead of allocating 3
contiguous pages is a major reason for this.

> Another solution is to have a proper allocator which will be able to
> defragment the data, if we are talking about alternatives to dropping.
>
> So:
> 1. copy the whole jumbo skb into fragmented one
> 2. reduce the MTU

you'll not reduce MTU of established connections though. And trying to
advertise MSS changes in the middle of a TCP connection is an awful
hack which I think will not work everywhere.

> 3. rely on the allocator
> 
> For 'good' hardware and drivers none of the above is really needed.
> 
> -- 
> 	Evgeniy Polyakov

Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:18                                                                                                         ` Evgeniy Polyakov
  2009-02-03 12:25                                                                                                           ` Willy Tarreau
@ 2009-02-03 12:27                                                                                                           ` Herbert Xu
  1 sibling, 0 replies; 190+ messages in thread
From: Herbert Xu @ 2009-02-03 12:27 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 03:18:36PM +0300, Evgeniy Polyakov wrote:
>
> I agree that this will work and will be better than nothing, but copying
> 9kb into 3 pages is a rather CPU-hungry operation, and I think (but have
> no numbers though) that the system will behave faster if the MTU is
> reduced to the standard one.

Reducing the MTU can create all sorts of problems so it should be
avoided if at all possible.  These days, path MTU discovery is
haphazard at best.  In fact MTU problems are the main reason why
jumbo frames simply don't get deployed.

> Another solution is to have a proper allocator which will be able to
> defragment the data, if we are talking about alternatives to dropping.

Sure, if we can create an allocator that can guarantee contiguous
allocations all the time then by all means go for it.  But until
we get there, doing what I suggested is way better than stopping
the receiving process altogether.

> So:
> 1. copy the whole jumbo skb into fragmented one
> 2. reduce the MTU
> 3. rely on the allocator

Yes, improving the allocator would obviously increase the performance;
however, there is nothing against employing both methods.  I'd
always avoid reducing the MTU at run-time though.

> For 'good' hardware and drivers none of the above is really needed.

Right, that's why there is a point beyond which improving the
allocator is no longer worthwhile.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:25                                                                                                           ` Willy Tarreau
@ 2009-02-03 12:28                                                                                                             ` Herbert Xu
  2009-02-04  0:47                                                                                                             ` David Miller
  1 sibling, 0 replies; 190+ messages in thread
From: Herbert Xu @ 2009-02-03 12:28 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Evgeniy Polyakov, Jarek Poplawski, David Miller, dada1, ben,
	mingo, linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 01:25:35PM +0100, Willy Tarreau wrote:
>
> > So:
> > 1. copy the whole jumbo skb into fragmented one
> > 2. reduce the MTU
> 
> you'll not reduce MTU of established connections though. And trying to
> advertise MSS changes in the middle of a TCP connection is an awful
> hack which I think will not work everywhere.

Not to mention that Ethernet isn't just IP, so the protocol that
is being used might not have a concept of path MTU discovery at
all.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:18                                                                                                       ` Herbert Xu
@ 2009-02-03 12:30                                                                                                         ` Evgeniy Polyakov
  2009-02-03 12:33                                                                                                           ` Herbert Xu
  2009-02-03 12:33                                                                                                         ` Nick Piggin
  1 sibling, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-02-03 12:30 UTC (permalink / raw)
  To: Herbert Xu
  Cc: david, Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 11:18:08PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > It is a solution, but I think it will behave noticeably worse than
> > with a decreased MTU.
> 
> Not necessarily.  Remember GSO/GRO in essence are just hacks to
> get around the fact that we can't increase the MTU to where we
> want it to be.  MTU reduces the cost over the entire path while
> GRO/GSO only do so for the sender and the receiver.
> 
> In other words when given the choice between a larger MTU with
> copying or GRO, the larger MTU will probably win anyway as it's
> optimising the entire path rather than just the receiver.

Well, neither of us has the data, and we are unlikely to change each
other's opinions :)
But we can continue the discussion in case something interesting turns
up. For example I can hack up the e1000e driver to do a dumb copy of 9k
each time it receives a jumbo frame and compare it with the usual 1.5k
MTU performance. But given that modern CPUs are loafing through noticeably
big IO chunks, this may only show that CPU usage increased with the copy.
Still, it may work.

> > That's the main point: how do we deal with broken hardware? I think (but
> > have no strong numbers though) that having 6 packets with 1500 MTU
> > combined into a GRO/LRO frame will be processed way faster than copying a
> > 9k MTU frame into 3 pages and processing a single skb.
> 
> Please note that with my scheme, you'd only start copying if you
> can't allocate a linear skb.  So if memory fragmentation doesn't
> happen then there is no copying at all.

Yes, absolutely.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:18                                                                                                       ` Herbert Xu
  2009-02-03 12:30                                                                                                         ` Evgeniy Polyakov
@ 2009-02-03 12:33                                                                                                         ` Nick Piggin
  1 sibling, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-03 12:33 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Evgeniy Polyakov, david, Jarek Poplawski, David Miller, w, dada1,
	ben, mingo, linux-kernel, netdev, jens.axboe

On Tuesday 03 February 2009 23:18:08 Herbert Xu wrote:
> On Tue, Feb 03, 2009 at 03:12:19PM +0300, Evgeniy Polyakov wrote:
> > It is a solution, but I think it will behave noticeably worse than
> > with a decreased MTU.
>
> Not necessarily.  Remember GSO/GRO in essence are just hacks to
> get around the fact that we can't increase the MTU to where we
> want it to be.  MTU reduces the cost over the entire path while
> GRO/GSO only do so for the sender and the receiver.
>
> In other words when given the choice between a larger MTU with
> copying or GRO, the larger MTU will probably win anyway as it's
> optimising the entire path rather than just the receiver.
>
> > That's the main point: how do we deal with broken hardware? I think (but
> > have no strong numbers though) that having 6 packets with 1500 MTU
> > combined into a GRO/LRO frame will be processed way faster than copying a
> > 9k MTU frame into 3 pages and processing a single skb.
>
> Please note that with my scheme, you'd only start copying if you
> can't allocate a linear skb.  So if memory fragmentation doesn't
> happen then there is no copying at all.

This sounds like a really nice idea (to the layman)!


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:30                                                                                                         ` Evgeniy Polyakov
@ 2009-02-03 12:33                                                                                                           ` Herbert Xu
  0 siblings, 0 replies; 190+ messages in thread
From: Herbert Xu @ 2009-02-03 12:33 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: david, Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 03:30:15PM +0300, Evgeniy Polyakov wrote:
>
> But we can continue the discussion in case something interesting turns
> up. For example I can hack up the e1000e driver to do a dumb copy of 9k
> each time it receives a jumbo frame and compare it with the usual 1.5k
> MTU performance. But given that modern CPUs are loafing through noticeably
> big IO chunks, this may only show that CPU usage increased with the copy.
> Still, it may work.

Comparing performance is pointless because the only time you need
to do the copy is when the allocator has failed.  So there is *no*
alternative to copying, regardless of how slow it is.

You can always improve the allocator whether we do this copying
fallback or not.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 11:10                                                                                             ` Evgeniy Polyakov
  2009-02-03 11:24                                                                                               ` Herbert Xu
@ 2009-02-03 12:36                                                                                               ` Jarek Poplawski
  2009-02-03 13:06                                                                                                 ` Evgeniy Polyakov
  2009-02-04  0:46                                                                                               ` David Miller
  2 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-03 12:36 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Tue, Feb 03, 2009 at 02:10:12PM +0300, Evgeniy Polyakov wrote:
> On Tue, Feb 03, 2009 at 09:41:08AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > > 1) Just like any other allocator we'll need to find a way to
> > >    handle > PAGE_SIZE allocations, and thus add handling for
> > >    compound pages etc.
> > >  
> > >    And exactly the drivers that want such huge SKB data areas
> > >    on receive should be converted to use scatter gather page
> > >    vectors in order to avoid multi-order pages and thus strains
> > >    on the page allocator.
> > 
> > I guess compound pages are handled by put_page() enough, but I don't
> > think they should be the main argument here, and I agree: scatter gather
> > should be used where possible.
> 
> The problem is to allocate them, since over time memory will become
> quite fragmented, which will not allow finding a big enough page.

Yes, it's a problem, but I don't think it's the main one. Since we're
currently concerned with zero-copy for splice I think we could
concentrate on the most common cases, and treat jumbo frames with best
effort only: if there are free compound pages - fine, otherwise we
fall back to slab and copy in splice.

> 
> NTA tried to solve this by not allowing to free the data allocated on
> the different CPU, contrary to what SLAB does. Modulo cache coherency
> improvements, it allows to combine freed chunks back into the pages and
> combine them in turn to get bigger contiguous areas suitable for the
> drivers which were not converted to use the scatter gather approach.
> I even believe that for some hardware it is the only way to deal
> with the jumbo frames.
> 
> > > 2) Space wastage and poor packing can be an issue.
> > > 
> > >    Even with SLAB/SLUB we get poor packing, look at Evgeniy's
> > >    graphs that he made when writing his NTA patches.
> > 
> > I'm a bit lost here: could you "remind" the way page space would be
> > used/saved in your paged variant e.g. for ~1500B skbs?
> 
> At least in NTA I used cache line alignment for smaller chunks, while
> SLAB uses power of two. Thus for 1500 MTU SLAB wastes about 500 bytes
> per packet (modulo size of the shared info structure).
> 
> > Yes, this looks reasonable. On the other hand, I think it would be
> > nice to get some opinions of slab folks (incl. Evgeniy) on the expected
> > efficiency of such a solution. (It seems releasing with put_page() will
> > always have some cost with delayed reusing and/or waste of space.)
> 
> Well, my opinion is rather biased here :)

I understand NTA could be better than slabs in the above-mentioned cases,
but I'm not sure you explained your point enough on solving this
zero-copy problem vs. NTA?

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 11:49                                                                                                 ` Evgeniy Polyakov
  2009-02-03 11:53                                                                                                   ` Herbert Xu
@ 2009-02-03 13:05                                                                                                   ` david
  2009-02-03 12:12                                                                                                     ` Evgeniy Polyakov
  1 sibling, 1 reply; 190+ messages in thread
From: david @ 2009-02-03 13:05 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Herbert Xu, Jarek Poplawski, David Miller, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Tue, 3 Feb 2009, Evgeniy Polyakov wrote:

> On Tue, Feb 03, 2009 at 10:24:31PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
>>> I even believe that for some hardware it is the only way to deal
>>> with the jumbo frames.
>>
>> Not necessarily.  Even if the hardware can only DMA into contiguous
>> memory, we can always allocate a sufficient number of contiguous
>> buffers initially, and then always copy them into fragmented skbs
>> at receive time.  This way the contiguous buffers are never
>> depleted.
>
> How many such preallocated frames is enough? Is it enough to have all
> sockets' recv buffer sizes divided by the MTU size? Or just some of them,
> or... That will work but there are way too many corner cases.
>
>> Granted copying sucks, but this is really because the underlying
>> hardware is badly designed.  Also copying is way better than
>> not receiving at all due to memory fragmentation.
>
> Maybe just do not allow jumbo frames when memory is fragmented enough
> and fall back to the smaller MTU in this case? With LRO/GRO stuff there
> should not be that much overhead compared to multiple-page
> copies.


1. define 'fragmented enough'

2. the packet size was already negotiated on your existing connections, 
how are you going to change all those on the fly?

3. what do you do when a remote system sends you a large packet? drop it 
on the floor?

having some pool of large buffers to receive into (and copy out of those 
buffers as quickly as possible) would cause a performance hit when things 
get bad, but isn't that better than dropping packets?

as for the number of buffers to use: make a reasonable guess. if you only 
have a small number of packets around, use the buffers directly; as you 
use more of them start copying, and as usage climbs attempt to allocate more. 
if you can't allocate more (and you have all of your existing ones in use) 
you will have to drop the packet, but at that point are you really in any 
worse shape than if you didn't have some mechanism to copy out of the 
large buffers?
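
as a toy illustration of that policy (all names made up, not from any 
driver), the per-packet decision could be as simple as:

enum jumbo_rx_action {
	JUMBO_RX_USE_DIRECT,	/* plenty of spare buffers: keep the big one */
	JUMBO_RX_COPY_OUT,	/* running low: copy out, recycle it at once */
	JUMBO_RX_DROP,		/* caller returns this if even the copy fails */
};

struct jumbo_pool {
	unsigned int	total;		/* contiguous buffers owned by the pool */
	unsigned int	in_use;		/* handed to the ring/stack right now */
	unsigned int	low_water;	/* start copying below this many free */
	unsigned int	max_total;	/* cap when growing the pool */
};

static enum jumbo_rx_action jumbo_rx_policy(const struct jumbo_pool *p)
{
	unsigned int free = p->total - p->in_use;

	/* small usage: hand the large buffer up the stack directly */
	if (free > p->low_water)
		return JUMBO_RX_USE_DIRECT;

	/* usage climbing: copy the payload out so the contiguous buffer
	 * goes straight back to the hardware, and let the caller try to
	 * grow the pool towards max_total in the background.  Only when
	 * even the small copy allocation fails is the packet dropped,
	 * which is no worse than having no fallback at all. */
	return JUMBO_RX_COPY_OUT;
}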

David Lang

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:36                                                                                               ` Jarek Poplawski
@ 2009-02-03 13:06                                                                                                 ` Evgeniy Polyakov
  2009-02-03 13:25                                                                                                   ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-02-03 13:06 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Tue, Feb 03, 2009 at 12:36:28PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> I understand NTA could be better than slabs in the above-mentioned cases,
> but I'm not sure you explained your point enough on solving this
> zero-copy problem vs. NTA?

NTA steals pages from the SLAB so we can maintain any reference counter
logic in them, so the linear part of the skb may not really be freed/reused
until the reference counter hits zero.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 13:06                                                                                                 ` Evgeniy Polyakov
@ 2009-02-03 13:25                                                                                                   ` Jarek Poplawski
  2009-02-03 14:20                                                                                                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-03 13:25 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Tue, Feb 03, 2009 at 04:06:06PM +0300, Evgeniy Polyakov wrote:
> On Tue, Feb 03, 2009 at 12:36:28PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > I understand NTA could be better than slabs in the above-mentioned cases,
> > but I'm not sure you explained your point enough on solving this
> > zero-copy problem vs. NTA?
> 
> NTA steals pages from the SLAB so we can maintain any reference counter
> logic in them, so the linear part of the skb may not really be freed/reused
> until the reference counter hits zero.

Now it's clear. So this looks like one of the options considered by
David. Then I wonder about details... It seems some kind of scheduled
browsing for refcounts is needed, or is there something better?

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 13:25                                                                                                   ` Jarek Poplawski
@ 2009-02-03 14:20                                                                                                     ` Evgeniy Polyakov
  0 siblings, 0 replies; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-02-03 14:20 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Tue, Feb 03, 2009 at 01:25:37PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> Now it's clear. So this looks like one of the options considered by
> David. Then I wonder about details... It seems some kind of scheduled
> browsing for refcounts is needed or is there something better?

It depends on the implementation; for example each kfree() may check the
reference counter and return the page to the allocator when it is really
free. Since a page may contain multiple objects, its reference counter may
hit zero someday in the future, or never reach it if the data is not freed.
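
A toy illustration of that accounting (made-up names, not actual NTA code)
could look like:

struct nta_page {
	struct page	*page;
	atomic_t	objects;	/* live objects carved out of this page */
};

static void nta_obj_free(struct nta_page *np)
{
	/*
	 * The individual chunk is not reused on its own; the whole page
	 * only returns to the page allocator once the last object carved
	 * from it has been freed, which may happen long after this call,
	 * or never if some object stays referenced forever.
	 */
	if (atomic_dec_and_test(&np->objects))
		__free_page(np->page);
}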

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 11:10                                                                                             ` Evgeniy Polyakov
  2009-02-03 11:24                                                                                               ` Herbert Xu
  2009-02-03 12:36                                                                                               ` Jarek Poplawski
@ 2009-02-04  0:46                                                                                               ` David Miller
  2009-02-04  8:08                                                                                                 ` Evgeniy Polyakov
  2 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-02-04  0:46 UTC (permalink / raw)
  To: zbr
  Cc: jarkao2, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Tue, 3 Feb 2009 14:10:12 +0300

> NTA tried to solve this by not allowing to free the data allocated on
> the different CPU, contrary to what SLAB does. Modulo cache coherency
> improvements,

This could kill performance on NUMA systems if we are not careful.

If we ever consider NTA seriously, these issues would need to
be performance tested.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 11:24                                                                                               ` Herbert Xu
  2009-02-03 11:49                                                                                                 ` Evgeniy Polyakov
@ 2009-02-04  0:46                                                                                                 ` David Miller
  2009-02-04  9:41                                                                                                   ` Benny Amorsen
  1 sibling, 1 reply; 190+ messages in thread
From: David Miller @ 2009-02-04  0:46 UTC (permalink / raw)
  To: herbert
  Cc: zbr, jarkao2, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 3 Feb 2009 22:24:31 +1100

> Not necessarily.  Even if the hardware can only DMA into contiguous
> memory, we can always allocate a sufficient number of contiguous
> buffers initially, and then always copy them into fragmented skbs
> at receive time.  This way the contiguous buffers are never
> depleted.
> 
> Granted copying sucks, but this is really because the underlying
> hardware is badly designed.  Also copying is way better than
> not receiving at all due to memory fragmentation.

This scheme sounds very reasonable.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03 12:25                                                                                                           ` Willy Tarreau
  2009-02-03 12:28                                                                                                             ` Herbert Xu
@ 2009-02-04  0:47                                                                                                             ` David Miller
  2009-02-04  6:19                                                                                                               ` Willy Tarreau
  1 sibling, 1 reply; 190+ messages in thread
From: David Miller @ 2009-02-04  0:47 UTC (permalink / raw)
  To: w
  Cc: zbr, herbert, jarkao2, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Tue, 3 Feb 2009 13:25:35 +0100

> Well, FWIW, I've always observed better performance with 4k MTU (4080 to
> be precise) than with 9K, and I think that the overhead of allocating 3
> contiguous pages is a major reason for this.

With what hardware?  If it's with myri10ge, that driver uses page
frags so would not be using 3 contiguous pages even for jumbo frames.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  0:47                                                                                                             ` David Miller
@ 2009-02-04  6:19                                                                                                               ` Willy Tarreau
  2009-02-04  8:12                                                                                                                 ` Evgeniy Polyakov
  2009-02-04  9:12                                                                                                                 ` David Miller
  0 siblings, 2 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-02-04  6:19 UTC (permalink / raw)
  To: David Miller
  Cc: zbr, herbert, jarkao2, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Tue, Feb 03, 2009 at 04:47:34PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Tue, 3 Feb 2009 13:25:35 +0100
> 
> > Well, FWIW, I've always observed better performance with 4k MTU (4080 to
> > be precise) than with 9K, and I think that the overhead of allocating 3
> > contiguous pages is a major reason for this.
> 
> With what hardware?  If it's with myri10ge, that driver uses page
> frags so would not be using 3 contiguous pages even for jumbo frames.

Yes myri10ge for the optimal 4080, but with e1000 too (though I don't
remember the exact optimal value, I think it was slightly lower).

For the myri10ge, could this be caused by the cache footprint then ?
I can also retry with various values between 4 and 9k, including
values close to 8k. Maybe the fact that 4k is better than 9 is
because we get better filling of all pages ?

I also remember having used a 7 kB MTU on e1000 and dl2k in the past.
BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the
allocation failures which were polluting the logs, so it's been running
with that setting for years now.

Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03  9:41                                                                                           ` Jarek Poplawski
  2009-02-03 11:10                                                                                             ` Evgeniy Polyakov
@ 2009-02-04  7:56                                                                                             ` Jarek Poplawski
  2009-02-06  7:52                                                                                             ` David Miller
  2 siblings, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-04  7:56 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 09:41:08AM +0000, Jarek Poplawski wrote:
> On Mon, Feb 02, 2009 at 11:50:17PM -0800, David Miller wrote:
> > From: Jarek Poplawski <jarkao2@gmail.com>
> > Date: Mon, 2 Feb 2009 08:43:58 +0000
> > 
> > > On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote:
> > > > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful.
> > > 
> > > I mean allocating chunks of cached pages similarly to sk_sndmsg_page
> > > way. I guess the similar problem is to be worked out in any case. But
> > > it seems doing it on the linear area requires less changes in other
> > > places.
> > 
> > This is a very interesting idea, but it has some drawbacks:
> > 
> > 1) Just like any other allocator we'll need to find a way to
> >    handle > PAGE_SIZE allocations, and thus add handling for
> >    compound pages etc.
> >  
> >    And exactly the drivers that want such huge SKB data areas
> >    on receive should be converted to use scatter gather page
> >    vectors in order to avoid multi-order pages and thus strains
> >    on the page allocator.
> 
> I guess compound pages are handled by put_page() enough, but I don't
> think they should be the main argument here, and I agree: scatter gather
> should be used where possible.
> 
> > 
> > 2) Space wastage and poor packing can be an issue.
> > 
> >    Even with SLAB/SLUB we get poor packing, look at Evgeniy's
> >    graphs that he made when writing his NTA patches.
> 
> I'm a bit lost here: could you "remind" the way page space would be
> used/saved in your paged variant e.g. for ~1500B skbs?

Here is some proof of concept to make sure I wasn't misunderstood.
alloc_paged() is used only for "normal" size skbs (2x ~1500B per page;
I think Herbert mentioned something like this at the beginning; it
also avoids allocs other than GFP_ATOMIC and GFP_KERNEL for
simplicity.)

I guess it could be replaced with any other mechanism allocating into a
fragment, or with Evgeniy's allocator when it's ready.
 
Alas it's not tested, but if it works, I think it should show how
much gain is expected here for most common traffic.

Jarek P.
---

diff -Nurp a/include/linux/netdevice.h b/include/linux/netdevice.h
--- a/include/linux/netdevice.h	2009-02-02 20:23:46.000000000 +0100
+++ b/include/linux/netdevice.h	2009-02-02 21:52:46.000000000 +0100
@@ -1154,6 +1154,9 @@ struct softnet_data
 	struct sk_buff		*completion_queue;
 
 	struct napi_struct	backlog;
+
+	struct page		*alloc_skb_page[2];
+	unsigned int		alloc_skb_off[2];
 };
 
 DECLARE_PER_CPU(struct softnet_data,softnet_data);
diff -Nurp a/include/linux/skbuff.h b/include/linux/skbuff.h
--- a/include/linux/skbuff.h	2009-02-02 20:23:46.000000000 +0100
+++ b/include/linux/skbuff.h	2009-02-02 22:12:04.000000000 +0100
@@ -144,7 +144,8 @@ struct skb_shared_info {
 	unsigned short	gso_size;
 	/* Warning: this field is not always filled in (UFO)! */
 	unsigned short	gso_segs;
-	unsigned short  gso_type;
+	__u8		gso_type;
+	__u8		alloc_paged;
 	__be32          ip6_frag_id;
 #ifdef CONFIG_HAS_DMA
 	unsigned int	num_dma_maps;
diff -Nurp a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c	2009-02-02 19:37:33.000000000 +0100
+++ b/net/core/dev.c	2009-02-02 23:15:55.000000000 +0100
@@ -5243,6 +5243,9 @@ static int __init net_dev_init(void)
 		queue->backlog.poll = process_backlog;
 		queue->backlog.weight = weight_p;
 		queue->backlog.gro_list = NULL;
+
+		queue->alloc_skb_page[0] = NULL;
+		queue->alloc_skb_page[1] = NULL;
 	}
 
 	dev_boot_phase = 0;
diff -Nurp a/net/core/skbuff.c b/net/core/skbuff.c
--- a/net/core/skbuff.c	2009-02-02 20:23:46.000000000 +0100
+++ b/net/core/skbuff.c	2009-02-02 23:57:10.000000000 +0100
@@ -151,6 +151,55 @@ void skb_truesize_bug(struct sk_buff *sk
 }
 EXPORT_SYMBOL(skb_truesize_bug);
 
+static inline void *alloc_paged(unsigned int size, gfp_t gfp_mask)
+{
+	struct softnet_data *sd;
+	unsigned long flags;
+	unsigned int off;
+	struct page *p;
+	void *ret;
+	int i;
+
+	if (size < 1400 || size > 2000)
+		return NULL;
+
+	if (gfp_mask == GFP_ATOMIC)
+		i = 0;
+	else if (gfp_mask == GFP_KERNEL)
+		i = 1;
+	else
+		return NULL;
+
+	local_irq_save(flags);
+	sd = &__get_cpu_var(softnet_data);
+	p = sd->alloc_skb_page[i];
+
+	if (p) {
+		off = sd->alloc_skb_off[i];
+		if (off + size > PAGE_SIZE) {
+			put_page(p);
+			goto new_page;
+		}
+	} else {
+new_page:
+		p = sd->alloc_skb_page[i] = alloc_pages(gfp_mask, 0);
+		if (!p) {
+			ret = NULL;
+			goto out;
+		}
+
+		off = 0;
+		/* hold one ref to this page until it's full */
+	}
+
+	get_page(p);
+	ret = page_address(p) + off;
+	sd->alloc_skb_off[i] = off + size;
+out:
+	local_irq_restore(flags);
+	return ret;
+}
+
 /* 	Allocate a new skbuff. We do this ourselves so we can fill in a few
  *	'private' fields and also do memory statistics to find all the
  *	[BEEP] leaks.
@@ -178,7 +227,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	struct kmem_cache *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
-	u8 *data;
+	u8 *data, paged = 0;
 
 	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
 
@@ -188,8 +237,13 @@ struct sk_buff *__alloc_skb(unsigned int
 		goto out;
 
 	size = SKB_DATA_ALIGN(size);
-	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
-			gfp_mask, node);
+	data = alloc_paged(size + sizeof(struct skb_shared_info), gfp_mask);
+	if (data)
+		paged = 1;
+	else
+		data = kmalloc_node_track_caller(size +
+						 sizeof(struct skb_shared_info),
+						 gfp_mask, node);
 	if (!data)
 		goto nodata;
 
@@ -214,6 +268,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->gso_type = 0;
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
+	shinfo->alloc_paged = paged;
 
 	if (fclone) {
 		struct sk_buff *child = skb + 1;
@@ -341,7 +396,10 @@ static void skb_release_data(struct sk_b
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
 
-		kfree(skb->head);
+		if (skb_shinfo(skb)->alloc_paged)
+			put_page(virt_to_page(skb->head));
+		else
+			kfree(skb->head);
 	}
 }
 
@@ -1380,7 +1438,7 @@ static inline int spd_fill_page(struct s
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
 
-	if (linear) {
+	if (linear && !skb_shinfo(skb)->alloc_paged) {
 		page = linear_to_page(page, len, &offset, skb);
 		if (!page)
 			return 1;

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  0:46                                                                                               ` David Miller
@ 2009-02-04  8:08                                                                                                 ` Evgeniy Polyakov
  2009-02-04  9:23                                                                                                   ` Nick Piggin
  0 siblings, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-02-04  8:08 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Tue, Feb 03, 2009 at 04:46:09PM -0800, David Miller (davem@davemloft.net) wrote:
> > NTA tried to solve this by not allowing to free the data allocated on
> > the different CPU, contrary to what SLAB does. Modulo cache coherency
> > improvements,
> 
> This could kill performance on NUMA systems if we are not careful.
> 
> If we ever consider NTA seriously, these issues would need to
> be performance tested.

Quite the contrary, I think. Memory is allocated and freed on the same CPU,
which means on the same memory domain, closest to the CPU in question.

I did not test NUMA though, but NTA performance on the usual CPU (it is
2.5 years old already :) was noticeably good.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  6:19                                                                                                               ` Willy Tarreau
@ 2009-02-04  8:12                                                                                                                 ` Evgeniy Polyakov
  2009-02-04  8:54                                                                                                                   ` Willy Tarreau
  2009-02-04  9:12                                                                                                                 ` David Miller
  1 sibling, 1 reply; 190+ messages in thread
From: Evgeniy Polyakov @ 2009-02-04  8:12 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, herbert, jarkao2, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Wed, Feb 04, 2009 at 07:19:47AM +0100, Willy Tarreau (w@1wt.eu) wrote:
> Yes myri10ge for the optimal 4080, but with e1000 too (though I don't
> remember the exact optimal value, I think it was slightly lower).

Very likely it is related to the allocator - the same allocation
overhead to get a page, but 2.5 times bigger frame.

> For the myri10ge, could this be caused by the cache footprint then ?
> I can also retry with various values between 4 and 9k, including
> values close to 8k. Maybe the fact that 4k is better than 9 is
> because we get better filling of all pages ?
> 
> I also remember having used a 7 kB MTU on e1000 and dl2k in the past.
> BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the
> allocation failures which were polluting the logs, so it's been running
> with that setting for years now.

Recent e1000 (e1000e) uses fragments, so it does not suffer from the
high-order allocation failures.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  8:12                                                                                                                 ` Evgeniy Polyakov
@ 2009-02-04  8:54                                                                                                                   ` Willy Tarreau
  2009-02-04  8:59                                                                                                                     ` Herbert Xu
  0 siblings, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-02-04  8:54 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, herbert, jarkao2, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Wed, Feb 04, 2009 at 11:12:01AM +0300, Evgeniy Polyakov wrote:
> On Wed, Feb 04, 2009 at 07:19:47AM +0100, Willy Tarreau (w@1wt.eu) wrote:
> > Yes myri10ge for the optimal 4080, but with e1000 too (though I don't
> > remember the exact optimal value, I think it was slightly lower).
> 
> Very likely it is related to the allocator - the same allocation
> overhead to get a page, but 2.5 times bigger frame.
> 
> > For the myri10ge, could this be caused by the cache footprint then ?
> > I can also retry with various values between 4 and 9k, including
> > values close to 8k. Maybe the fact that 4k is better than 9 is
> > because we get better filling of all pages ?
> > 
> > I also remember having used a 7 kB MTU on e1000 and dl2k in the past.
> > BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the
> > allocation failures which were polluting the logs, so it's been running
> > with that setting for years now.
> 
> Recent e1000 (e1000e) uses fragments, so it does not suffer from the
> high-order allocation failures.

My server is running 2.4 :-), but I observed the same issues with older
2.6 as well. I can certainly imagine that things have changed a lot since,
but the initial point remains : jumbo frames are expensive to deal with,
and with recent NICs and drivers, we might get close performance for
little additional cost. After all, the initial justification for jumbo frames
was the devastating interrupt rate, and all NICs coalesce interrupts now.

So if we can optimize all the infrastructure for extremely fast
processing of standard frames (1500) and still support jumbo frames
in a suboptimal mode, I think it could be a very good trade-off.

Regards,
willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  8:54                                                                                                                   ` Willy Tarreau
@ 2009-02-04  8:59                                                                                                                     ` Herbert Xu
  2009-02-04  9:01                                                                                                                       ` David Miller
  0 siblings, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-02-04  8:59 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Evgeniy Polyakov, David Miller, jarkao2, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote:
>
> My server is running 2.4 :-), but I observed the same issues with older
> 2.6 as well. I can certainly imagine that things have changed a lot since,
> but the initial point remains : jumbo frames are expensive to deal with,
> and with recent NICs and drivers, we might get close performance for
> > little additional cost. After all, the initial justification for jumbo frames
> > was the devastating interrupt rate, and all NICs coalesce interrupts now.

This is total crap! Jumbo frames are way better than any of the
hacks (such as GSO) that people have come up with to get around it.
The only reason we are not using it as much is because of this
nasty thing called the Internet.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  8:59                                                                                                                     ` Herbert Xu
@ 2009-02-04  9:01                                                                                                                       ` David Miller
  2009-02-04  9:12                                                                                                                         ` Willy Tarreau
  0 siblings, 1 reply; 190+ messages in thread
From: David Miller @ 2009-02-04  9:01 UTC (permalink / raw)
  To: herbert
  Cc: w, zbr, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 4 Feb 2009 19:59:07 +1100

> On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote:
> >
> > My server is running 2.4 :-), but I observed the same issues with older
> > 2.6 as well. I can certainly imagine that things have changed a lot since,
> > but the initial point remains : jumbo frames are expensive to deal with,
> > and with recent NICs and drivers, we might get close performance for
> > > little additional cost. After all, the initial justification for jumbo frames
> > > was the devastating interrupt rate, and all NICs coalesce interrupts now.
> 
> This is total crap! Jumbo frames are way better than any of the
> hacks (such as GSO) that people have come up with to get around it.
> The only reason we are not using it as much is because of this
> nasty thing called the Internet.

Completely agreed.

If Jumbo frames are slower, it is NOT some fundamental issue.  It is
rather due to some misdesign of the hardware or its driver.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  9:01                                                                                                                       ` David Miller
@ 2009-02-04  9:12                                                                                                                         ` Willy Tarreau
  2009-02-04  9:15                                                                                                                           ` David Miller
                                                                                                                                             ` (2 more replies)
  0 siblings, 3 replies; 190+ messages in thread
From: Willy Tarreau @ 2009-02-04  9:12 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, jarkao2, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Wed, Feb 04, 2009 at 01:01:46AM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 4 Feb 2009 19:59:07 +1100
> 
> > On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote:
> > >
> > > My server is running 2.4 :-), but I observed the same issues with older
> > > 2.6 as well. I can certainly imagine that things have changed a lot since,
> > > but the initial point remains : jumbo frames are expensive to deal with,
> > > and with recent NICs and drivers, we might get close performance for
> > > little additional cost. After all, initial justification for jumbo frames
> > > was the devastating interrupt rate and all NICs coalesce interrupts now.
> > 
> > This is total crap! Jumbo frames are way better than any of the
> > hacks (such as GSO) that people have come up with to get around it.
> > The only reason we are not using it as much is because of this
> > nasty thing called the Internet.
> 
> Completely agreed.
> 
> If Jumbo frames are slower, it is NOT some fundamental issue.  It is
> rather due to some misdesign of the hardware or its driver.

Agreed we can't use them *because* of the internet, but this
limitation has forced hardware designers to find valid alternatives.
For instance, having the ability to reach 10 Gbps with 1500 bytes
frames on myri10ge with a low CPU usage is a real achievement. This
is "only" 800 kpps after all.

And the arbitrary choice of 9k for jumbo frames was total crap too.
It's clear that no hardware designer was involved in the process.
They have to stuff 16kB of RAM on a NIC to use only 9. And we need
to allocate 3 pages for slightly more than 2. 7.5 kB would have been
better in this regard.

I still find it nice to lower CPU usage with frames larger than 1500,
but given the fact that this is rarely used (even in datacenters), I
think our efforts should concentrate on where the real users are, ie
<1500.

Regards,
Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  6:19                                                                                                               ` Willy Tarreau
  2009-02-04  8:12                                                                                                                 ` Evgeniy Polyakov
@ 2009-02-04  9:12                                                                                                                 ` David Miller
  1 sibling, 0 replies; 190+ messages in thread
From: David Miller @ 2009-02-04  9:12 UTC (permalink / raw)
  To: w
  Cc: zbr, herbert, jarkao2, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Wed, 4 Feb 2009 07:19:47 +0100

> On Tue, Feb 03, 2009 at 04:47:34PM -0800, David Miller wrote:
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Tue, 3 Feb 2009 13:25:35 +0100
> > 
> > > Well, FWIW, I've always observed better performance with 4k MTU (4080 to
> > > be precise) than with 9K, and I think that the overhead of allocating 3
> > > contiguous pages is a major reason for this.
> > 
> > With what hardware?  If it's with myri10ge, that driver uses page
> > frags so would not be using 3 contiguous pages even for jumbo frames.
> 
> Yes myri10ge for the optimal 4080, but with e1000 too (though I don't
> remember the exact optimal value, I think it was slightly lower).
> 
> For the myri10ge, could this be caused by the cache footprint then ?
> I can also retry with various values between 4 and 9k, including
> values close to 8k. Maybe the fact that 4k is better than 9 is
> because we get better filling of all pages ?

Looking quickly, myri10ge's buffer manager is incredibly simplistic so
it wastes a lot of memory and gives terrible cache behavior.

When using JUMBO MTU it just gives whole pages to the chip.

So it looks like, assuming 4096 byte PAGE_SIZE and 9000 byte
jumbo MTU, the chip will allocate for a full size frame:

	FULL PAGE
	FULL PAGE
	FULL PAGE

and only ~1K of that last full page will be utilized.

The headers will therefore always land on the same cache lines,
and PAGE_SIZE-~1K will be wasted.

Whereas for < PAGE_SIZE mtu selections, it will give MTU sized
blocks to the chip for packet data allocation.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  9:12                                                                                                                         ` Willy Tarreau
@ 2009-02-04  9:15                                                                                                                           ` David Miller
  2009-02-04 19:19                                                                                                                           ` Roland Dreier
  2009-02-05  8:32                                                                                                                           ` Bill Fink
  2 siblings, 0 replies; 190+ messages in thread
From: David Miller @ 2009-02-04  9:15 UTC (permalink / raw)
  To: w
  Cc: herbert, zbr, jarkao2, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

From: Willy Tarreau <w@1wt.eu>
Date: Wed, 4 Feb 2009 10:12:17 +0100

> And the arbitrary choice of 9k for jumbo frames was total crap too.
> It's clear that no hardware designer was involved in the process.

Willy, do some research, this is also completely wrong.

Alteon effectively created jumbo MTUs in their Acenic chips since that
was the first chip to ever do it.  Those were hardware engineers only
making those design decisions.

I think this is where I will stop taking part in this part of the
discussion.  Every posting is full of misinformation and I've got
better things to do than to refute them every 5 minutes.  :-/

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  8:08                                                                                                 ` Evgeniy Polyakov
@ 2009-02-04  9:23                                                                                                   ` Nick Piggin
  0 siblings, 0 replies; 190+ messages in thread
From: Nick Piggin @ 2009-02-04  9:23 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, jarkao2, herbert, w, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Wednesday 04 February 2009 19:08:51 Evgeniy Polyakov wrote:
> On Tue, Feb 03, 2009 at 04:46:09PM -0800, David Miller (davem@davemloft.net) 
wrote:
> > > NTA tried to solve this by not allowing to free the data allocated on
> > > the different CPU, contrary to what SLAB does. Modulo cache coherency
> > > improvements,
> >
> > This could kill performance on NUMA systems if we are not careful.
> >
> > If we ever consider NTA seriously, these issues would need to
> > be performance tested.
>
> Quite the contrary, I think. Memory is allocated and freed on the same CPU,
> which means on the same memory domain, closest to the CPU in question.
>
> I did not test NUMA though, but NTA performance on the usual CPU (it is
> 2.5 years old already :) was noticeably good.

I had a quick look at NTA... I didn't understand much of it yet, but
the remote freeing scheme is kind of like what I did for slqb. The
freeing CPU queues objects back to the CPU that allocated them, which
eventually checks the queue and frees them itself.

I don't know how much cache coherency gain you get from this -- in
most slab allocations, I think the object tends to be cache-hot on the
CPU that frees it. I'm doing it mainly to try to avoid locking... I guess
that makes for a cache coherency benefit in itself.

If NTA does significantly better than the slab allocator, I would be quite
interested. It might be something that we can learn from and use in
the general slab allocator (or maybe something more network specific
that NTA does).


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  0:46                                                                                                 ` David Miller
@ 2009-02-04  9:41                                                                                                   ` Benny Amorsen
  2009-02-04 12:01                                                                                                     ` Herbert Xu
  0 siblings, 1 reply; 190+ messages in thread
From: Benny Amorsen @ 2009-02-04  9:41 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, jarkao2, w, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

David Miller <davem@davemloft.net> writes:

> From: Herbert Xu <herbert@gondor.apana.org.au>

>> Granted copying sucks, but this is really because the underlying
>> hardware is badly designed.  Also copying is way better than
>> not receiving at all due to memory fragmentation.
>
> This scheme sounds very reasonable.

Would it be possible to add a counter somewhere for this, or is that
too expensive?


/Benny


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  9:41                                                                                                   ` Benny Amorsen
@ 2009-02-04 12:01                                                                                                     ` Herbert Xu
  0 siblings, 0 replies; 190+ messages in thread
From: Herbert Xu @ 2009-02-04 12:01 UTC (permalink / raw)
  To: Benny Amorsen
  Cc: davem, zbr, jarkao2, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

Benny Amorsen <benny+usenet@amorsen.dk> wrote:
>
> Would it be possible to add a counter somewhere for this, or is that
> too expensive?

Yes, a counter would be useful and reasonable.  But you can
probably deduce it just by looking at slabinfo.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  9:12                                                                                                                         ` Willy Tarreau
  2009-02-04  9:15                                                                                                                           ` David Miller
@ 2009-02-04 19:19                                                                                                                           ` Roland Dreier
  2009-02-04 19:28                                                                                                                             ` Willy Tarreau
  2009-02-05  8:32                                                                                                                           ` Bill Fink
  2 siblings, 1 reply; 190+ messages in thread
From: Roland Dreier @ 2009-02-04 19:19 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, herbert, zbr, jarkao2, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

 > And the arbitrary choice of 9k for jumbo frames was total crap too.
 > It's clear that no hardware designer was involved in the process.
 > They have to stuff 16kB of RAM on a NIC to use only 9. And we need
 > to allocate 3 pages for slightly more than 2. 7.5 kB would have been
 > better in this regard.

9K was not totally arbitrary.  The CRC used for checksumming Ethernet
packets has a probability of undetected errors that goes up sharply at
around 11,000 bytes.  So the real limit is ~11000 bytes, and I believe
~9000 was chosen to be able to carry 8K NFS payloads plus all XDR and
transport headers without fragmentation.

 - R.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04 19:19                                                                                                                           ` Roland Dreier
@ 2009-02-04 19:28                                                                                                                             ` Willy Tarreau
  2009-02-04 19:48                                                                                                                               ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-02-04 19:28 UTC (permalink / raw)
  To: Roland Dreier
  Cc: David Miller, herbert, zbr, jarkao2, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Wed, Feb 04, 2009 at 11:19:06AM -0800, Roland Dreier wrote:
>  > And the arbitrary choice of 9k for jumbo frames was total crap too.
>  > It's clear that no hardware designer was involved in the process.
>  > They have to stuff 16kB of RAM on a NIC to use only 9. And we need
>  > to allocate 3 pages for slightly more than 2. 7.5 kB would have been
>  > better in this regard.
> 
> 9K was not totally arbitrary.  The CRC used for checksumming ethernet
> packets has a probability of undetected errors that goes up about
> 11-thousand something bytes.  So the real limit is ~11000 bytes, and I
> believe ~9000 was chosen to be able to carry 8K NFS payloads + all XDR
> and transport headers without fragmentation.

Yes, I know that initial motivation. But IMHO it was a purely functional
motivation without real consideration of the implications. When you
read Alteon's initial proposal, there is even a biased analysis (they
compare the fill ratio obtained with one 8k frame with that of 6 1.5k
frames). Their own argument does not hold up against their final
proposal! I think there might already have been people pushing for 8k
and not 9 by then, but in order to get wide acceptance in datacenters,
they had to please the NFS admins.

Now it's useless to speculate on history and I agree with Davem that
we're wasting our time with this discussion, let's go back to the
keyboard ;-)

Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04 19:28                                                                                                                             ` Willy Tarreau
@ 2009-02-04 19:48                                                                                                                               ` Jarek Poplawski
  0 siblings, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-04 19:48 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Roland Dreier, David Miller, herbert, zbr, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Wed, Feb 04, 2009 at 08:28:20PM +0100, Willy Tarreau wrote:
...
> Now it's useless to speculate on history and I agree with Davem that
> we're wasting our time with this discussion, let's go back to the
> keyboard ;-)

Davem knows everything, so he is ...wrong. This is a very useful
discussion. Go back to the keyboards, please :-)

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-04  9:12                                                                                                                         ` Willy Tarreau
  2009-02-04  9:15                                                                                                                           ` David Miller
  2009-02-04 19:19                                                                                                                           ` Roland Dreier
@ 2009-02-05  8:32                                                                                                                           ` Bill Fink
  2 siblings, 0 replies; 190+ messages in thread
From: Bill Fink @ 2009-02-05  8:32 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: David Miller, herbert, zbr, jarkao2, dada1, ben, mingo,
	linux-kernel, netdev, jens.axboe

On Wed, 4 Feb 2009, Willy Tarreau wrote:

> On Wed, Feb 04, 2009 at 01:01:46AM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Wed, 4 Feb 2009 19:59:07 +1100
> > 
> > > On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote:
> > > >
> > > > My server is running 2.4 :-), but I observed the same issues with older
> > > > 2.6 as well. I can certainly imagine that things have changed a lot since,
> > > > but the initial point remains : jumbo frames are expensive to deal with,
> > > > and with recent NICs and drivers, we might get close performance for
> > > > little additional cost. After all, initial justification for jumbo frames
> > > > was the devastating interrupt rate and all NICs coalesce interrupts now.
> > > 
> > > This is total crap! Jumbo frames are way better than any of the
> > > hacks (such as GSO) that people have come up with to get around it.
> > > The only reason we are not using it as much is because of this
> > > nasty thing called the Internet.
> > 
> > Completely agreed.
> > 
> > If Jumbo frames are slower, it is NOT some fundamental issue.  It is
> > rather due to some misdesign of the hardware or it's driver.
> 
> Agreed we can't use them *because* of the internet, but this
> limitation has forced hardware designers to find valid alternatives.
> For instance, having the ability to reach 10 Gbps with 1500 bytes
> frames on myri10ge with a low CPU usage is a real achievement. This
> is "only" 800 kpps after all.
> 
> And the arbitrary choice of 9k for jumbo frames was total crap too.
> It's clear that no hardware designer was involved in the process.
> They have to stuff 16kB of RAM on a NIC to use only 9. And we need
> to allocate 3 pages for slightly more than 2. 7.5 kB would have been
> better in this regard.
> 
> I still find it nice to lower CPU usage with frames larger than 1500,
> but given the fact that this is rarely used (even in datacenters), I
> think our efforts should concentrate on where the real users are, ie
> <1500.

Those in the HPC realm use 9000-byte jumbo frames because they make
a major performance difference, especially across large-RTT paths,
and the Internet2 backbone fully supports 9000-byte jumbo frames
(with some wishing we could support much larger frame sizes).

Local environment:

9000 byte jumbo frames:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.1875 MB /  10.01 sec = 9905.9707 Mbps 100 %TX 76 %RX 0 retrans 0.15 msRTT

4080 byte MTU:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
 9171.6875 MB /  10.02 sec = 7680.7663 Mbps 100 %TX 99 %RX 0 retrans 0.19 msRTT

The performance impact is even more pronounced on a large RTT path
such as the following netem emulated 80 ms RTT path:

9000 byte jumbo frames:

[root@lang2 ~]# nuttcp -T30 -w80m 192.168.89.15
25904.2500 MB /  30.16 sec = 7205.8755 Mbps 96 %TX 55 %RX 0 retrans 82.73 msRTT

4080 byte MTU:

[root@lang2 ~]# nuttcp -T30 -w80m 192.168.89.15
 8650.0129 MB /  30.25 sec = 2398.8862 Mbps 33 %TX 19 %RX 2371 retrans 81.98 msRTT

And if there's any loss in the path, the performance difference is also
dramatic, such as here across a real MAN environment with about a 1 ms RTT:

9000 byte jumbo frames:

[root@chance9 ~]# nuttcp -w20m 192.168.88.8
 7711.8750 MB /  10.05 sec = 6436.2406 Mbps 82 %TX 96 %RX 261 retrans 0.92 msRTT

4080 byte MTU:

[root@chance9 ~]# nuttcp -w20m 192.168.88.8
 4551.0625 MB /  10.08 sec = 3786.2108 Mbps 50 %TX 95 %RX 42 retrans 0.95 msRTT

All testing was with myri10ge on the transmitter side (2.6.20.7 kernel).

So my experience has definitely been that 9000 byte jumbo frames are a
major performance win for high throughput applications.

						-Bill

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-03  9:41                                                                                           ` Jarek Poplawski
  2009-02-03 11:10                                                                                             ` Evgeniy Polyakov
  2009-02-04  7:56                                                                                             ` Jarek Poplawski
@ 2009-02-06  7:52                                                                                             ` David Miller
  2009-02-06  8:09                                                                                               ` Herbert Xu
                                                                                                                 ` (2 more replies)
  2 siblings, 3 replies; 190+ messages in thread
From: David Miller @ 2009-02-06  7:52 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 3 Feb 2009 09:41:08 +0000

> Yes, this looks reasonable. On the other hand, I think it would be
> nice to get some opinions of slab folks (incl. Evgeniy) on the expected
> efficiency of such a solution. (It seems releasing with put_page() will
> always have some cost with delayed reusing and/or waste of space.)

I think we can't avoid using carved up pages for skb->data in the end.
The whole kernel wants to speak in pages and be able to grab and
release them in one way and one way only (get_page() and put_page()).

What do you think is more likely?  Us teaching the entire kernel
how to hold onto SKB linear data buffers, or the networking fixing
itself to operate on pages for its header metadata? :-)

What we'll end up with is likely a hybrid scheme.  High speed devices
will receive into pages.  And also the skb->data area will be page
backed and held using get_page()/put_page() references.

It is not even worth optimizing for skb->data holding the entire
packet, that's not the case that matters.

These skb->data areas will thus be 128 bytes plus the skb_shinfo
structure blob.  They also will be recycled often, rather than held
onto for long periods of time.

In fact we can optimize that even further in many ways, for example by
dropping the skb->data backed memory once the skb is queued to the
socket receive buffer.  That will make skb->data buffer lifetimes
minuscule even under heavy receive load.

In that kind of situation, even the stupidest page slicing algorithm,
similar to what we do now with sk->sk_sndmsg_page, is more than
adequate, and things like NTA (purely to solve this problem) are
overengineering.
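
As an illustration only, such a page slicing scheme could look roughly
like the sk_sndmsg_page pattern (hypothetical names, no locking, and
assuming chunk sizes never exceed PAGE_SIZE):

#include <linux/gfp.h>
#include <linux/mm.h>

struct data_page {
	struct page	*page;		/* current page being carved up */
	unsigned int	 offset;	/* next free byte in that page */
};

static void *carve_chunk(struct data_page *dp, unsigned int size)
{
	void *ptr;

	if (!dp->page || dp->offset + size > PAGE_SIZE) {
		if (dp->page)
			put_page(dp->page);	/* drop the carver's ref */
		dp->page = alloc_page(GFP_ATOMIC);
		if (!dp->page)
			return NULL;
		dp->offset = 0;
	}
	ptr = (char *)page_address(dp->page) + dp->offset;
	dp->offset += size;
	get_page(dp->page);		/* this ref now belongs to the chunk */
	return ptr;
}

Each chunk is then released with a plain put_page(virt_to_page(ptr)),
and the page goes back to the page allocator once the last reference
is dropped, which is exactly the property splice() et al. want.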


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06  7:52                                                                                             ` David Miller
@ 2009-02-06  8:09                                                                                               ` Herbert Xu
  2009-02-06  9:10                                                                                               ` Jarek Poplawski
  2009-02-06 18:59                                                                                               ` Jarek Poplawski
  2 siblings, 0 replies; 190+ messages in thread
From: Herbert Xu @ 2009-02-06  8:09 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Thu, Feb 05, 2009 at 11:52:58PM -0800, David Miller wrote:
>
> In fact we can optimize that even further in many ways, for example by
> dropping the skb->data backed memory once the skb is queued to the
> socket receive buffer.  That will make skb->data buffer lifetimes
> miniscule even under heavy receive load.

Indeed, while I was doing the tun accounting stuff and reviewing
the rx accounting users, it was apparent that we don't need to
carry most of this stuff in our receive queues.

This is almost the opposite of the skbless data idea for TSO.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06  7:52                                                                                             ` David Miller
  2009-02-06  8:09                                                                                               ` Herbert Xu
@ 2009-02-06  9:10                                                                                               ` Jarek Poplawski
  2009-02-06  9:17                                                                                                 ` David Miller
  2009-02-06  9:23                                                                                                 ` Herbert Xu
  2009-02-06 18:59                                                                                               ` Jarek Poplawski
  2 siblings, 2 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-06  9:10 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Thu, Feb 05, 2009 at 11:52:58PM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 3 Feb 2009 09:41:08 +0000
> 
> > Yes, this looks reasonable. On the other hand, I think it would be
> > nice to get some opinions of slab folks (incl. Evgeniy) on the expected
> > efficiency of such a solution. (It seems releasing with put_page() will
> > always have some cost with delayed reusing and/or waste of space.)
> 
> I think we can't avoid using carved up pages for skb->data in the end.
> The whole kernel wants to speak in pages and be able to grab and
> release them in one way and one way only (get_page() and put_page()).
> 
> What do you think is more likely?  Us teaching the whole entire kernel
> how to hold onto SKB linear data buffers, or the networking fixing
> itself to operate on pages for it's header metadata? :-)

This idea looks very reasonable, except I wonder why nobody else has
needed this kind of mm interface before. Another question: it seems
many mechanisms like fast searching, defragmentation etc. could be
reused.

> What we'll end up with is likely a hybrid scheme.  High speed devices
> will receive into pages.  And also the skb->data area will be page
> backed and held using get_page()/put_page() references.
> 
> It is not even worth optimizing for skb->data holding the entire
> packet, that's not the case that matters.
> 
> These skb->data areas will thus be 128 bytes plus the skb_shinfo
> structure blob.  They also will be recycled often, rather than held
> onto for long periods of time.

Looks fine, except: you mentioned dumb NICs, which would need this
page space on receive, anyway. BTW, don't they need this on transmit
again?

> In fact we can optimize that even further in many ways, for example by
> dropping the skb->data backed memory once the skb is queued to the
> socket receive buffer.  That will make skb->data buffer lifetimes
> miniscule even under heavy receive load.
> 
> In that kind of situation, doing even the most stupidest page slicing
> algorithm, similar to what we do now with sk->sk_sndmsg_page, is
> more than adequate and things like NTA (purely to solve this problem)
> is overengineering.

Hmm... I don't get it. It seems these slabs do a lot of advanced work,
and still some people like Evgeniy or Nick thought it wasn't enough,
and even found it worth their time to rework this.

There is also a question of memory accounting: do you think admins
don't care if we give away, say, an extra 25%?

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06  9:10                                                                                               ` Jarek Poplawski
@ 2009-02-06  9:17                                                                                                 ` David Miller
  2009-02-06  9:42                                                                                                   ` Jarek Poplawski
  2009-02-06  9:23                                                                                                 ` Herbert Xu
  1 sibling, 1 reply; 190+ messages in thread
From: David Miller @ 2009-02-06  9:17 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 6 Feb 2009 09:10:34 +0000

> Hmm... I don't get it. It seems these slabs do a lot of advanced work,
> and still some people like Evgeniy or Nick thought it's not enough,
> and even found it worth of their time to rework this.

Note that, at least to some extent, the memory allocators are
duplicating some of the locality and NUMA logic that's already present
in the page allocator itself.

Except that they are handling the fact that objects are moving around
instead of pages.

Also keep in mind that we might want to encourage drivers to make
use of the SKB recycling mechanisms we have.  That will decrease
lifetimes, and thus reduce the wastage and locality issues, immensely.

We truly want something different from what the general purpose
allocator provides.  Namely, a reference countable buffer.

And all I'm saying is that since the page allocator provides that
facility, and using pages solves all of the splice() et al.  problems,
building something extremely simple on top of the page allocator seems
to be a good way to go.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06  9:10                                                                                               ` Jarek Poplawski
  2009-02-06  9:17                                                                                                 ` David Miller
@ 2009-02-06  9:23                                                                                                 ` Herbert Xu
  2009-02-06  9:51                                                                                                   ` Jarek Poplawski
  1 sibling, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-02-06  9:23 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Fri, Feb 06, 2009 at 09:10:34AM +0000, Jarek Poplawski wrote:
> 
> Looks fine, except: you mentioned dumb NICs, which would need this
> page space on receive, anyway. BTW, don't they need this on transmit
> again?

A lot more NICs support SG on tx than rx.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06  9:17                                                                                                 ` David Miller
@ 2009-02-06  9:42                                                                                                   ` Jarek Poplawski
  2009-02-06  9:49                                                                                                     ` David Miller
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-06  9:42 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

On Fri, Feb 06, 2009 at 01:17:22AM -0800, David Miller wrote:
...
> And all I'm saying is that since the page allocator provides that
> facility, and using pages solves all of the splice() et al.  problems,
> building something extremely simple on top of the page allocator seems
> to be a good way to go.

This is all absolutely right if we can afford it: the simpler it is,
the more memory is wasted. I hope you're right, but on the other hand,
people don't use slob by default.

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06  9:42                                                                                                   ` Jarek Poplawski
@ 2009-02-06  9:49                                                                                                     ` David Miller
  0 siblings, 0 replies; 190+ messages in thread
From: David Miller @ 2009-02-06  9:49 UTC (permalink / raw)
  To: jarkao2
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 6 Feb 2009 09:42:53 +0000

> I hope you're right, but on the other hand, people don't use slob by
> default.

The use is different, so it's a bad comparison, really.

SLAB has to satisfy all kinds of object lifetimes, all kinds
of sizes and use cases, and perform generally well in all
situations with zero knowledge about usage patterns.

Networking is strictly the opposite of that set of requirements.  We
know all of these parameters (or their ranges) and can on top of that
control them if necessary.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06  9:23                                                                                                 ` Herbert Xu
@ 2009-02-06  9:51                                                                                                   ` Jarek Poplawski
  2009-02-06 10:28                                                                                                     ` Herbert Xu
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-06  9:51 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Fri, Feb 06, 2009 at 08:23:26PM +1100, Herbert Xu wrote:
> On Fri, Feb 06, 2009 at 09:10:34AM +0000, Jarek Poplawski wrote:
> > 
> > Looks fine, except: you mentioned dumb NICs, which would need this
> > page space on receive, anyway. BTW, don't they need this on transmit
> > again?
> 
> A lot more NICs support SG on tx than rx.

OK, but since there is not so much difference, and we need to waste
it in some cases anyway, plus handle it in some special way later, I'm
a bit in doubt.

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06  9:51                                                                                                   ` Jarek Poplawski
@ 2009-02-06 10:28                                                                                                     ` Herbert Xu
  2009-02-06 10:58                                                                                                       ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Herbert Xu @ 2009-02-06 10:28 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
>
> OK, but since there is not so much difference, and we need to waste
> it in some cases anyway, plus handle it later some special way, I'm
> a bit in doubt.

Well the thing is cards that don't support SG on tx probably
don't support jumbo frames either.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06 10:28                                                                                                     ` Herbert Xu
@ 2009-02-06 10:58                                                                                                       ` Jarek Poplawski
  2009-02-06 11:10                                                                                                         ` Willy Tarreau
  0 siblings, 1 reply; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-06 10:58 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev,
	jens.axboe

On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote:
> On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
> >
> > OK, but since there is not so much difference, and we need to waste
> > it in some cases anyway, plus handle it later some special way, I'm
> > a bit in doubt.
> 
> Well the thing is cards that don't support SG on tx probably
> don't support jumbo frames either.

?? I mean this 128 byte chunk would be hard to reuse after copying
to skb->data, and if reused, we could miss this for some NICs on TX,
so the whole packet would need a copy.

BTW, David mentioned something simple like sk_sndmsg_page would be
enough, but I guess not for these non-SG NICs. We have to allocate
bigger chunks for them, so more fragmentation to handle.

Cheers,
Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06 10:58                                                                                                       ` Jarek Poplawski
@ 2009-02-06 11:10                                                                                                         ` Willy Tarreau
  2009-02-06 11:47                                                                                                           ` Jarek Poplawski
  0 siblings, 1 reply; 190+ messages in thread
From: Willy Tarreau @ 2009-02-06 11:10 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Herbert Xu, David Miller, zbr, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Fri, Feb 06, 2009 at 10:58:07AM +0000, Jarek Poplawski wrote:
> On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote:
> > On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
> > >
> > > OK, but since there is not so much difference, and we need to waste
> > > it in some cases anyway, plus handle it later some special way, I'm
> > > a bit in doubt.
> > 
> > Well the thing is cards that don't support SG on tx probably
> > don't support jumbo frames either.
> 
> ?? I mean this 128 byte chunk would be hard to reuse after copying
> to skb->data, and if reused, we could miss this for some NICs on TX,
> so the whole packed would need a copy.

Couldn't we stuff up to 32 128-byte chunks in a page and use a 32-bit
map to indicate which slot is used and which one is free? This would
just be a matter of calling ffz() to find one spare place in a page.
Also, that bitmap might serve as a refcount, because if it drops to
zero, it means all slots are unused. And -1 means all slots are used.

This would reduce wastage if we need to allocate 128 bytes often.
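
Roughly something like this, just as a sketch (made-up struct and
helper names, no locking, with the map kept outside the page itself):

#include <linux/bitops.h>
#include <linux/mm.h>
#include <linux/types.h>

struct chunk_page {
	struct page	*page;
	u32		 map;	/* bit set => 128-byte slot in use */
};

static void *chunk_alloc(struct chunk_page *cp)
{
	unsigned long slot;

	if (cp->map == ~0U)	/* -1: all 32 slots used */
		return NULL;
	slot = ffz(cp->map);	/* first free slot */
	cp->map |= 1U << slot;
	return (char *)page_address(cp->page) + slot * 128;
}

static void chunk_free(struct chunk_page *cp, void *ptr)
{
	unsigned long slot;

	slot = ((char *)ptr - (char *)page_address(cp->page)) / 128;
	cp->map &= ~(1U << slot);
	if (!cp->map) {
		/* 0: all slots unused, the page can be recycled */
	}
}

With map == 0 meaning "page completely free", the bitmap doubles as
the refcount as described above.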

Willy


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06 11:10                                                                                                         ` Willy Tarreau
@ 2009-02-06 11:47                                                                                                           ` Jarek Poplawski
  0 siblings, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-06 11:47 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Herbert Xu, David Miller, zbr, dada1, ben, mingo, linux-kernel,
	netdev, jens.axboe

On Fri, Feb 06, 2009 at 12:10:15PM +0100, Willy Tarreau wrote:
> On Fri, Feb 06, 2009 at 10:58:07AM +0000, Jarek Poplawski wrote:
> > On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote:
> > > On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
> > > >
> > > > OK, but since there is not so much difference, and we need to waste
> > > > it in some cases anyway, plus handle it later some special way, I'm
> > > > a bit in doubt.
> > > 
> > > Well the thing is cards that don't support SG on tx probably
> > > don't support jumbo frames either.
> > 
> > ?? I mean this 128 byte chunk would be hard to reuse after copying
> > to skb->data, and if reused, we could miss this for some NICs on TX,
> > so the whole packed would need a copy.
> 
> couldn't we stuff up to 32 128-byte chunks in a page and use a 32-bit
> map to indicate which slot is used and which one is free ? This would
> just be a matter of calling ffz() to find one spare place in a page.
> Also, that bitmap might serve as a refcount, because if it drops to
> zero, it means all slots are unused. And -1 means all slots are used.
> 
> This would reduce wastage if wee need to allocate 128 bytes often.

Something like this would be useful for SG NICs for the paged
skb->data area. But I'm concerned with non-SG ones: if I got it right,
for 1500 byte packets we need to allocate such a chunk, copy 128+
bytes to skb->data, and we have 128+ unused. If there are later such
short packets, we don't need this space: why copy?

But even if we find a way to use this, or don't reserve such space
while receiving from an SG NIC, then we will need to copy both chunks
to another page for TX on non-SG NICs.

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH v2] tcp: splice as many packets as possible at once
  2009-02-06  7:52                                                                                             ` David Miller
  2009-02-06  8:09                                                                                               ` Herbert Xu
  2009-02-06  9:10                                                                                               ` Jarek Poplawski
@ 2009-02-06 18:59                                                                                               ` Jarek Poplawski
  2 siblings, 0 replies; 190+ messages in thread
From: Jarek Poplawski @ 2009-02-06 18:59 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe

David Miller wrote, On 02/06/2009 08:52 AM:

> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 3 Feb 2009 09:41:08 +0000
> 
>> Yes, this looks reasonable. On the other hand, I think it would be
>> nice to get some opinions of slab folks (incl. Evgeniy) on the expected
>> efficiency of such a solution. (It seems releasing with put_page() will
>> always have some cost with delayed reusing and/or waste of space.)
> 
> I think we can't avoid using carved up pages for skb->data in the end.
> The whole kernel wants to speak in pages and be able to grab and
> release them in one way and one way only (get_page() and put_page()).
> 
> What do you think is more likely?  Us teaching the whole entire kernel
> how to hold onto SKB linear data buffers, or the networking fixing
> itself to operate on pages for it's header metadata? :-)
> 
> What we'll end up with is likely a hybrid scheme.  High speed devices
> will receive into pages.  And also the skb->data area will be page
> backed and held using get_page()/put_page() references.

So, after fully waking up, I think I got your point at last! All along
I thought we were trying to do something more general, while you're
apparently focused on SG-capable NICs, with myri10ge or niu as the
model to follow. I'm OK with this. Very nice idea and much less work!
(All that's left is to CC all the maintainers!)

> It is not even worth optimizing for skb->data holding the entire
> packet, that's not the case that matters.
> 
> These skb->data areas will thus be 128 bytes plus the skb_shinfo
> structure blob.  They also will be recycled often, rather than held
> onto for long periods of time.
> 
> In fact we can optimize that even further in many ways, for example by
> dropping the skb->data backed memory once the skb is queued to the
> socket receive buffer.  That will make skb->data buffer lifetimes
> miniscule even under heavy receive load.
> 
> In that kind of situation, doing even the most stupidest page slicing
> algorithm, similar to what we do now with sk->sk_sndmsg_page, is
> more than adequate and things like NTA (purely to solve this problem)
> is overengineering.

This is 100% right, unless we try to do something with non-SG and/or
jumbos - IMHO that requires some overengineering.

Jarek P.

^ permalink raw reply	[flat|nested] 190+ messages in thread

end of thread, other threads:[~2009-02-06 18:59 UTC | newest]

Thread overview: 190+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-08 17:30 [PATCH] tcp: splice as many packets as possible at once Willy Tarreau
2009-01-08 19:44 ` Jens Axboe
2009-01-08 22:03   ` Willy Tarreau
2009-01-08 21:50 ` Ben Mansell
2009-01-08 21:55   ` David Miller
2009-01-08 22:20     ` Willy Tarreau
2009-01-13 23:08       ` David Miller
2009-01-09  6:47     ` Eric Dumazet
2009-01-09  7:04       ` Willy Tarreau
2009-01-09  7:28         ` Eric Dumazet
2009-01-09  7:42           ` Willy Tarreau
2009-01-13 23:27           ` David Miller
2009-01-13 23:35             ` Eric Dumazet
2009-01-09 15:42       ` Eric Dumazet
2009-01-09 17:57         ` Eric Dumazet
2009-01-09 18:54         ` Willy Tarreau
2009-01-09 20:51           ` Eric Dumazet
2009-01-09 21:24             ` Willy Tarreau
2009-01-09 22:02               ` Eric Dumazet
2009-01-09 22:09                 ` Willy Tarreau
2009-01-09 22:07               ` Willy Tarreau
2009-01-09 22:12                 ` Eric Dumazet
2009-01-09 22:17                   ` Willy Tarreau
2009-01-09 22:42                     ` Evgeniy Polyakov
2009-01-09 22:50                       ` Willy Tarreau
2009-01-09 23:01                         ` Evgeniy Polyakov
2009-01-09 23:06                           ` Willy Tarreau
2009-01-10  7:40                       ` Eric Dumazet
2009-01-11 12:58                         ` Evgeniy Polyakov
2009-01-11 13:14                           ` Eric Dumazet
2009-01-11 13:35                             ` Evgeniy Polyakov
2009-01-11 16:00                               ` Eric Dumazet
2009-01-11 16:05                                 ` Evgeniy Polyakov
2009-01-14  0:07                                   ` David Miller
2009-01-14  0:13                                     ` Evgeniy Polyakov
2009-01-14  0:16                                       ` David Miller
2009-01-14  0:22                                         ` Evgeniy Polyakov
2009-01-14  0:37                                           ` David Miller
2009-01-14  3:51                                             ` Herbert Xu
2009-01-14  4:25                                               ` David Miller
2009-01-14  7:27                                               ` David Miller
2009-01-14  8:26                                                 ` Herbert Xu
2009-01-14  8:53                                                   ` Jarek Poplawski
2009-01-14  9:29                                                     ` David Miller
2009-01-14  9:42                                                       ` Jarek Poplawski
2009-01-14 10:06                                                         ` David Miller
2009-01-14 10:47                                                           ` Jarek Poplawski
2009-01-14 11:29                                                             ` Herbert Xu
2009-01-14 11:40                                                               ` Jarek Poplawski
2009-01-14 11:45                                                                 ` Jarek Poplawski
2009-01-14  9:54                                                       ` Jarek Poplawski
2009-01-14 10:01                                                         ` Willy Tarreau
2009-01-14 12:06                                                         ` Jarek Poplawski
2009-01-14 12:15                                                         ` Jarek Poplawski
2009-01-14 11:28                                                       ` Herbert Xu
2009-01-15 23:03                                                       ` Willy Tarreau
2009-01-15 23:19                                                         ` David Miller
2009-01-15 23:19                                                         ` Herbert Xu
2009-01-15 23:26                                                           ` David Miller
2009-01-15 23:32                                                             ` Herbert Xu
2009-01-15 23:34                                                               ` David Miller
2009-01-15 23:42                                                                 ` Willy Tarreau
2009-01-15 23:44                                                                   ` Willy Tarreau
2009-01-15 23:54                                                                     ` David Miller
2009-01-19  0:42                                                                       ` Willy Tarreau
2009-01-19  3:08                                                                         ` Herbert Xu
2009-01-19  3:27                                                                           ` David Miller
2009-01-19  6:14                                                                             ` Willy Tarreau
2009-01-19  6:19                                                                               ` David Miller
2009-01-19  6:45                                                                                 ` Willy Tarreau
2009-01-19 10:19                                                                                 ` Herbert Xu
2009-01-19 20:59                                                                                   ` David Miller
2009-01-19 21:24                                                                                     ` Herbert Xu
2009-01-25 21:03                                                                                     ` Willy Tarreau
2009-01-26  7:59                                                                                       ` Jarek Poplawski
2009-01-26  8:12                                                                                         ` Willy Tarreau
2009-01-19  8:40                                                                               ` Jarek Poplawski
2009-01-19  3:28                                                                         ` David Miller
2009-01-19  6:11                                                                           ` Willy Tarreau
2009-01-24 21:23                                                                           ` Willy Tarreau
2009-01-20 12:01                                                                         ` Ben Mansell
2009-01-20 12:11                                                                           ` Evgeniy Polyakov
2009-01-20 13:43                                                                             ` Ben Mansell
2009-01-20 14:06                                                                               ` Jarek Poplawski
2009-01-16  6:51                                                                     ` Jarek Poplawski
2009-01-19  6:08                                                                       ` David Miller
2009-01-19  6:16                                                                 ` David Miller
2009-01-19 10:20                                                                   ` Herbert Xu
2009-01-20  8:37                                                             ` Jarek Poplawski
2009-01-20  9:33                                                               ` [PATCH v2] " Jarek Poplawski
2009-01-20 10:00                                                                 ` Evgeniy Polyakov
2009-01-20 10:20                                                                   ` Jarek Poplawski
2009-01-20 10:31                                                                     ` Evgeniy Polyakov
2009-01-20 11:01                                                                       ` Jarek Poplawski
2009-01-20 17:16                                                                         ` David Miller
2009-01-21  9:54                                                                           ` Jarek Poplawski
2009-01-22  9:04                                                                           ` [PATCH v3] " Jarek Poplawski
2009-01-26  5:22                                                                             ` David Miller
2009-01-27  7:11                                                                               ` Herbert Xu
2009-01-27  7:54                                                                                 ` Jarek Poplawski
2009-01-27 10:09                                                                                   ` Herbert Xu
2009-01-27 10:35                                                                                     ` Jarek Poplawski
2009-01-27 10:57                                                                                       ` Jarek Poplawski
2009-01-27 11:48                                                                                       ` Herbert Xu
2009-01-27 12:16                                                                                         ` Jarek Poplawski
2009-01-27 12:31                                                                                           ` Jarek Poplawski
2009-01-27 17:06                                                                                             ` David Miller
2009-01-28  8:10                                                                                               ` Jarek Poplawski
2009-02-01  8:41                                                                                 ` David Miller
2009-01-26  8:20                                                                       ` [PATCH v2] " Jarek Poplawski
2009-01-26 21:21                                                                         ` Evgeniy Polyakov
2009-01-27  6:10                                                                           ` David Miller
2009-01-27  7:40                                                                             ` Jarek Poplawski
2009-01-30 21:42                                                                               ` David Miller
2009-01-30 21:59                                                                                 ` Willy Tarreau
2009-01-30 22:03                                                                                   ` David Miller
2009-01-30 22:13                                                                                     ` Willy Tarreau
2009-01-30 22:15                                                                                       ` David Miller
2009-01-30 22:16                                                                                 ` Herbert Xu
2009-02-02  8:08                                                                                   ` Jarek Poplawski
2009-02-02  8:18                                                                                     ` David Miller
2009-02-02  8:43                                                                                       ` Jarek Poplawski
2009-02-03  7:50                                                                                         ` David Miller
2009-02-03  9:41                                                                                           ` Jarek Poplawski
2009-02-03 11:10                                                                                             ` Evgeniy Polyakov
2009-02-03 11:24                                                                                               ` Herbert Xu
2009-02-03 11:49                                                                                                 ` Evgeniy Polyakov
2009-02-03 11:53                                                                                                   ` Herbert Xu
2009-02-03 12:07                                                                                                     ` Evgeniy Polyakov
2009-02-03 12:12                                                                                                       ` Herbert Xu
2009-02-03 12:18                                                                                                         ` Evgeniy Polyakov
2009-02-03 12:25                                                                                                           ` Willy Tarreau
2009-02-03 12:28                                                                                                             ` Herbert Xu
2009-02-04  0:47                                                                                                             ` David Miller
2009-02-04  6:19                                                                                                               ` Willy Tarreau
2009-02-04  8:12                                                                                                                 ` Evgeniy Polyakov
2009-02-04  8:54                                                                                                                   ` Willy Tarreau
2009-02-04  8:59                                                                                                                     ` Herbert Xu
2009-02-04  9:01                                                                                                                       ` David Miller
2009-02-04  9:12                                                                                                                         ` Willy Tarreau
2009-02-04  9:15                                                                                                                           ` David Miller
2009-02-04 19:19                                                                                                                           ` Roland Dreier
2009-02-04 19:28                                                                                                                             ` Willy Tarreau
2009-02-04 19:48                                                                                                                               ` Jarek Poplawski
2009-02-05  8:32                                                                                                                           ` Bill Fink
2009-02-04  9:12                                                                                                                 ` David Miller
2009-02-03 12:27                                                                                                           ` Herbert Xu
2009-02-03 13:05                                                                                                   ` david
2009-02-03 12:12                                                                                                     ` Evgeniy Polyakov
2009-02-03 12:18                                                                                                       ` Herbert Xu
2009-02-03 12:30                                                                                                         ` Evgeniy Polyakov
2009-02-03 12:33                                                                                                           ` Herbert Xu
2009-02-03 12:33                                                                                                         ` Nick Piggin
2009-02-04  0:46                                                                                                 ` David Miller
2009-02-04  9:41                                                                                                   ` Benny Amorsen
2009-02-04 12:01                                                                                                     ` Herbert Xu
2009-02-03 12:36                                                                                               ` Jarek Poplawski
2009-02-03 13:06                                                                                                 ` Evgeniy Polyakov
2009-02-03 13:25                                                                                                   ` Jarek Poplawski
2009-02-03 14:20                                                                                                     ` Evgeniy Polyakov
2009-02-04  0:46                                                                                               ` David Miller
2009-02-04  8:08                                                                                                 ` Evgeniy Polyakov
2009-02-04  9:23                                                                                                   ` Nick Piggin
2009-02-04  7:56                                                                                             ` Jarek Poplawski
2009-02-06  7:52                                                                                             ` David Miller
2009-02-06  8:09                                                                                               ` Herbert Xu
2009-02-06  9:10                                                                                               ` Jarek Poplawski
2009-02-06  9:17                                                                                                 ` David Miller
2009-02-06  9:42                                                                                                   ` Jarek Poplawski
2009-02-06  9:49                                                                                                     ` David Miller
2009-02-06  9:23                                                                                                 ` Herbert Xu
2009-02-06  9:51                                                                                                   ` Jarek Poplawski
2009-02-06 10:28                                                                                                     ` Herbert Xu
2009-02-06 10:58                                                                                                       ` Jarek Poplawski
2009-02-06 11:10                                                                                                         ` Willy Tarreau
2009-02-06 11:47                                                                                                           ` Jarek Poplawski
2009-02-06 18:59                                                                                               ` Jarek Poplawski
2009-02-03 11:38                                                                                 ` Nick Piggin
2009-01-27 18:42                                                                             ` David Miller
2009-01-15 23:32                                                           ` [PATCH] " Willy Tarreau
2009-01-15 23:35                                                             ` David Miller
2009-01-14  0:51                                         ` Herbert Xu
2009-01-14  1:24                                           ` David Miller
2009-01-09 22:45                     ` Eric Dumazet
2009-01-09 22:53                       ` Willy Tarreau
2009-01-09 23:34                         ` Eric Dumazet
2009-01-13  5:45                           ` David Miller
2009-01-14  0:05                           ` David Miller
2009-01-13 23:31         ` David Miller
2009-01-13 23:26       ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).