* [PATCH] tcp: splice as many packets as possible at once @ 2009-01-08 17:30 Willy Tarreau 2009-01-08 19:44 ` Jens Axboe 2009-01-08 21:50 ` Ben Mansell 0 siblings, 2 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-08 17:30 UTC (permalink / raw) To: Jens Axboe Cc: David Miller, Jarek Poplawski, Ben Mansell, Ingo Molnar, linux-kernel, netdev Jens, here's the other patch I was talking about, for better behaviour of non-blocking splice(). Ben Mansell also confirms similar improvements in his tests, where non-blocking splice() initially showed half of read()/write() performance. Ben, would you mind adding a Tested-By line ? Also, please note that this is unrelated to the corruption bug I reported and does not fix it. Regards, Willy >From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001 From: Willy Tarreau <w@1wt.eu> Date: Thu, 8 Jan 2009 17:10:13 +0100 Subject: [PATCH] tcp: splice as many packets as possible at once Currently, in non-blocking mode, tcp_splice_read() returns after splicing one segment regardless of the len argument. This results in low performance and very high overhead due to syscall rate when splicing from interfaces which do not support LRO. The fix simply consists in not breaking out of the loop after the first read. That way, we can read up to the size requested by the caller and still return when there is no data left. Performance has significantly improved with this fix, with the number of calls to splice() divided by about 20, and CPU usage dropped from 100% to 75%. Signed-off-by: Willy Tarreau <w@1wt.eu> --- net/ipv4/tcp.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 35bcddf..80261b4 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, lock_sock(sk); if (sk->sk_err || sk->sk_state == TCP_CLOSE || - (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo || + (sk->sk_shutdown & RCV_SHUTDOWN) || signal_pending(current)) break; } -- 1.6.0.3 ^ permalink raw reply related [flat|nested] 190+ messages in thread
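To make the impact of the one-segment-per-call behaviour concrete, here is a minimal sketch of the non-blocking socket-to-socket forwarding pattern this patch targets. It is an illustration only, not code from the thread: src and dst are assumed to be connected non-blocking TCP sockets and pipefd an ordinary pipe.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (16 * 4096)	/* request far more than one MSS per call */

/* Forward one chunk from src to dst through the pipe.  Before the fix,
 * the first splice() returned after a single ~1460 byte segment even
 * though CHUNK bytes were requested, so the whole poll()/splice()/splice()
 * cycle ran once per packet. */
static int forward_once(int src, int dst, int pipefd[2])
{
	ssize_t in, out;

	in = splice(src, NULL, pipefd[1], NULL, CHUNK,
		    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
	if (in <= 0)
		return in;			/* 0: EOF, <0: error or EAGAIN */

	while (in > 0) {			/* drain the pipe into dst */
		out = splice(pipefd[0], NULL, dst, NULL, in,
			     SPLICE_F_MOVE | SPLICE_F_MORE);
		if (out <= 0)
			return -1;
		in -= out;
	}
	return 1;
}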
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-08 17:30 [PATCH] tcp: splice as many packets as possible at once Willy Tarreau @ 2009-01-08 19:44 ` Jens Axboe 2009-01-08 22:03 ` Willy Tarreau 2009-01-08 21:50 ` Ben Mansell 1 sibling, 1 reply; 190+ messages in thread From: Jens Axboe @ 2009-01-08 19:44 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, Jarek Poplawski, Ben Mansell, Ingo Molnar, linux-kernel, netdev On Thu, Jan 08 2009, Willy Tarreau wrote: > Jens, > > here's the other patch I was talking about, for better behaviour of > non-blocking splice(). Ben Mansell also confirms similar improvements > in his tests, where non-blocking splice() initially showed half of > read()/write() performance. Ben, would you mind adding a Tested-By > line ? Looks very good to me, I don't see any valid reason to have that !timeo break. So feel free to add my acked-by to that patch and send it to Dave. > > Also, please note that this is unrelated to the corruption bug I reported > and does not fix it. > > Regards, > Willy > > From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001 > From: Willy Tarreau <w@1wt.eu> > Date: Thu, 8 Jan 2009 17:10:13 +0100 > Subject: [PATCH] tcp: splice as many packets as possible at once > > Currently, in non-blocking mode, tcp_splice_read() returns after > splicing one segment regardless of the len argument. This results > in low performance and very high overhead due to syscall rate when > splicing from interfaces which do not support LRO. > > The fix simply consists in not breaking out of the loop after the > first read. That way, we can read up to the size requested by the > caller and still return when there is no data left. > > Performance has significantly improved with this fix, with the > number of calls to splice() divided by about 20, and CPU usage > dropped from 100% to 75%. > > Signed-off-by: Willy Tarreau <w@1wt.eu> > --- > net/ipv4/tcp.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 35bcddf..80261b4 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, > lock_sock(sk); > > if (sk->sk_err || sk->sk_state == TCP_CLOSE || > - (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo || > + (sk->sk_shutdown & RCV_SHUTDOWN) || > signal_pending(current)) > break; > } > -- > 1.6.0.3 > -- Jens Axboe ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-08 19:44 ` Jens Axboe @ 2009-01-08 22:03 ` Willy Tarreau 0 siblings, 0 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-08 22:03 UTC (permalink / raw) To: Jens Axboe Cc: David Miller, Jarek Poplawski, Ben Mansell, Ingo Molnar, linux-kernel, netdev On Thu, Jan 08, 2009 at 08:44:24PM +0100, Jens Axboe wrote: > On Thu, Jan 08 2009, Willy Tarreau wrote: > > Jens, > > > > here's the other patch I was talking about, for better behaviour of > > non-blocking splice(). Ben Mansell also confirms similar improvements > > in his tests, where non-blocking splice() initially showed half of > > read()/write() performance. Ben, would you mind adding a Tested-By > > line ? > > Looks very good to me, I don't see any valid reason to have that !timeo > break. So feel free to add my acked-by to that patch and send it to > Dave. Done, thanks. Also, I tried to follow the code path from the userspace splice() call and the kernel code. While splicing from socket to a pipe seems obvious till tcp_splice_read(), I have a hard time finding the correct way from the pipe to the socket. It *seems* to me that we proceed like this, I'd like someone to confirm : sys_splice() do_splice() do_splice_from() generic_splice_sendpage() [ I assumed this from socket_file_ops ] splice_from_pipe(,,pipe_to_sendpage) __splice_from_pipe(,,pipe_to_sendpage) pipe_to_sendpage() tcp_sendpage() [ I assumed this from inet_stream_ops ] do_tcp_sendpages() (if NETIF_F_SG) or sock_no_sendpage() I hope that I guessed it right because it does not appear obvious to me. It will help me try to understand better the corruption problem. Now for the read part, am I wrong tcp_splice_read() will return -EAGAIN if the pipe is full ? It seems to me that this will be brought up by splice_to_pipe(), via __tcp_splice_read, tcp_read_sock, tcp_splice_data_recv and skb_splice_bits. If so, this can cause serious trouble because if there is data available in a socket, poll() will return POLLIN. Then we call splice(sock, pipe) but the pipe is full, so we get -EAGAIN, which in turn makes the application think that we need to poll again, and since no data has moved, we'll immediately call splice() again. In my early experiments, I'm pretty sure I already came across such a situation. Shouldn't we return -EBUSY or -ENOSPC in this case ? I can work on patches if there is a consensus around those issues, and this will also help me understand better the splice internals. I just don't want to work in the wrong direction for nothing. Thanks, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
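The pipe-full situation described above can be sketched from the application's point of view (hypothetical descriptors, same headers as the earlier sketch; not code from the thread). When splice(socket, pipe) returns EAGAIN the caller cannot tell an empty socket from a full pipe, which is why a distinct error code such as -EBUSY or -ENOSPC would help:

/* sock_fd and pipefd are hypothetical; errno.h is required. */
ssize_t in = splice(sock_fd, NULL, pipefd[1], NULL, 65536,
		    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
if (in < 0 && errno == EAGAIN) {
	/*
	 * Two very different situations land here: the socket has no data,
	 * or the pipe is already full.  poll() only reports the former, so
	 * after a POLLIN wakeup a full pipe produces an immediate EAGAIN,
	 * the application re-polls, POLLIN fires again, and we spin.
	 * A distinct return code for the "pipe full" case would let the
	 * application drain the pipe instead of re-polling the socket.
	 */
}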
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-08 17:30 [PATCH] tcp: splice as many packets as possible at once Willy Tarreau 2009-01-08 19:44 ` Jens Axboe @ 2009-01-08 21:50 ` Ben Mansell 2009-01-08 21:55 ` David Miller 1 sibling, 1 reply; 190+ messages in thread From: Ben Mansell @ 2009-01-08 21:50 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, Jarek Poplawski, Ingo Molnar, linux-kernel, netdev, Jens Axboe > From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001 > From: Willy Tarreau <w@1wt.eu> > Date: Thu, 8 Jan 2009 17:10:13 +0100 > Subject: [PATCH] tcp: splice as many packets as possible at once > > Currently, in non-blocking mode, tcp_splice_read() returns after > splicing one segment regardless of the len argument. This results > in low performance and very high overhead due to syscall rate when > splicing from interfaces which do not support LRO. > > The fix simply consists in not breaking out of the loop after the > first read. That way, we can read up to the size requested by the > caller and still return when there is no data left. > > Performance has significantly improved with this fix, with the > number of calls to splice() divided by about 20, and CPU usage > dropped from 100% to 75%. > I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput. Tested-by: Ben Mansell <ben@zeus.com> > Signed-off-by: Willy Tarreau <w@1wt.eu> > --- > net/ipv4/tcp.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 35bcddf..80261b4 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, > lock_sock(sk); > > if (sk->sk_err || sk->sk_state == TCP_CLOSE || > - (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo || > + (sk->sk_shutdown & RCV_SHUTDOWN) || > signal_pending(current)) > break; > } ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-08 21:50 ` Ben Mansell @ 2009-01-08 21:55 ` David Miller 2009-01-08 22:20 ` Willy Tarreau 2009-01-09 6:47 ` Eric Dumazet 0 siblings, 2 replies; 190+ messages in thread From: David Miller @ 2009-01-08 21:55 UTC (permalink / raw) To: ben; +Cc: w, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Ben Mansell <ben@zeus.com> Date: Thu, 08 Jan 2009 21:50:44 +0000 > > From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001 > > From: Willy Tarreau <w@1wt.eu> > > Date: Thu, 8 Jan 2009 17:10:13 +0100 > > Subject: [PATCH] tcp: splice as many packets as possible at once > > Currently, in non-blocking mode, tcp_splice_read() returns after > > splicing one segment regardless of the len argument. This results > > in low performance and very high overhead due to syscall rate when > > splicing from interfaces which do not support LRO. > > The fix simply consists in not breaking out of the loop after the > > first read. That way, we can read up to the size requested by the > > caller and still return when there is no data left. > > Performance has significantly improved with this fix, with the > > number of calls to splice() divided by about 20, and CPU usage > > dropped from 100% to 75%. > > > > I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput. > > Tested-by: Ben Mansell <ben@zeus.com> I'm not applying this until someone explains to me why we should remove this test from the splice receive but keep it in the tcp_recvmsg() code where it has been essentially forever. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-08 21:55 ` David Miller @ 2009-01-08 22:20 ` Willy Tarreau 2009-01-13 23:08 ` David Miller 2009-01-09 6:47 ` Eric Dumazet 1 sibling, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-08 22:20 UTC (permalink / raw) To: David Miller; +Cc: ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Thu, Jan 08, 2009 at 01:55:15PM -0800, David Miller wrote: > From: Ben Mansell <ben@zeus.com> > Date: Thu, 08 Jan 2009 21:50:44 +0000 > > > > From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001 > > > From: Willy Tarreau <w@1wt.eu> > > > Date: Thu, 8 Jan 2009 17:10:13 +0100 > > > Subject: [PATCH] tcp: splice as many packets as possible at once > > > Currently, in non-blocking mode, tcp_splice_read() returns after > > > splicing one segment regardless of the len argument. This results > > > in low performance and very high overhead due to syscall rate when > > > splicing from interfaces which do not support LRO. > > > The fix simply consists in not breaking out of the loop after the > > > first read. That way, we can read up to the size requested by the > > > caller and still return when there is no data left. > > > Performance has significantly improved with this fix, with the > > > number of calls to splice() divided by about 20, and CPU usage > > > dropped from 100% to 75%. > > > > > > > I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput. > > > > Tested-by: Ben Mansell <ben@zeus.com> > > I'm not applying this until someone explains to me why > we should remove this test from the splice receive but > keep it in the tcp_recvmsg() code where it has been > essentially forever. In my opinion, the code structure is different between both functions. In tcp_recvmsg(), we test for it if (copied > 0), where copied is the sum of all data which have been processed since the entry in the function. If we removed the test here, we could not break out of the loop once we have copied something. In tcp_splice_read(), the test is still present in the (!ret) code path, where ret is the last number of bytes processed, so the test is still performed regardless of what has been previously transferred. So in summary, in tcp_splice_read without this test, we get back to the top of the loop, and if __tcp_splice_read() returns 0, then we break out of the loop. I don't know if my explanation is clear or not, it's easier to follow the loops in front of the code :-/ Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
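The structural difference Willy describes can be sketched side by side (a paraphrase of the 2.6.28-era code, not an exact quote):

/* tcp_recvmsg(): the !timeo test only matters once copied > 0; with
 * copied == 0 and !timeo the function returns -EAGAIN instead, so the
 * test cannot simply be dropped there. */
if (copied) {
	if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
	    (sk->sk_shutdown & RCV_SHUTDOWN) ||
	    !timeo || signal_pending(current))
		break;		/* return what we already have */
}

/* tcp_splice_read() before the patch: the same test sat on the common
 * path, so a non-blocking caller (!timeo) broke out after the first
 * iteration even though __tcp_splice_read() could still make progress;
 * the ret == 0 path already ends the loop once the queue is empty. */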
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-08 22:20 ` Willy Tarreau @ 2009-01-13 23:08 ` David Miller 0 siblings, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-13 23:08 UTC (permalink / raw) To: w; +Cc: ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Thu, 8 Jan 2009 23:20:39 +0100 > On Thu, Jan 08, 2009 at 01:55:15PM -0800, David Miller wrote: > > I'm not applying this until someone explains to me why > > we should remove this test from the splice receive but > > keep it in the tcp_recvmsg() code where it has been > > essentially forever. > > In my opinion, the code structure is different between both functions. In > tcp_recvmsg(), we test for it if (copied > 0), where copied is the sum of > all data which have been processed since the entry in the function. If we > removed the test here, we could not break out of the loop once we have > copied something. In tcp_splice_read(), the test is still present in the > (!ret) code path, where ret is the last number of bytes processed, so the > test is still performed regardless of what has been previously transferred. > > So in summary, in tcp_splice_read without this test, we get back to the > top of the loop, and if __tcp_splice_read() returns 0, then we break out > of the loop. Ok I see what you're saying, the !timeo check is only necessary when the receive queue has been exhausted. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-08 21:55 ` David Miller 2009-01-08 22:20 ` Willy Tarreau @ 2009-01-09 6:47 ` Eric Dumazet 2009-01-09 7:04 ` Willy Tarreau ` (2 more replies) 1 sibling, 3 replies; 190+ messages in thread From: Eric Dumazet @ 2009-01-09 6:47 UTC (permalink / raw) To: David Miller; +Cc: ben, w, jarkao2, mingo, linux-kernel, netdev, jens.axboe David Miller a écrit : > From: Ben Mansell <ben@zeus.com> > Date: Thu, 08 Jan 2009 21:50:44 +0000 > >>> From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001 >>> From: Willy Tarreau <w@1wt.eu> >>> Date: Thu, 8 Jan 2009 17:10:13 +0100 >>> Subject: [PATCH] tcp: splice as many packets as possible at once >>> Currently, in non-blocking mode, tcp_splice_read() returns after >>> splicing one segment regardless of the len argument. This results >>> in low performance and very high overhead due to syscall rate when >>> splicing from interfaces which do not support LRO. >>> The fix simply consists in not breaking out of the loop after the >>> first read. That way, we can read up to the size requested by the >>> caller and still return when there is no data left. >>> Performance has significantly improved with this fix, with the >>> number of calls to splice() divided by about 20, and CPU usage >>> dropped from 100% to 75%. >>> >> I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput. >> >> Tested-by: Ben Mansell <ben@zeus.com> > > I'm not applying this until someone explains to me why > we should remove this test from the splice receive but > keep it in the tcp_recvmsg() code where it has been > essentially forever. I found this patch usefull in my testings, but had a feeling something was not complete. If the goal is to reduce number of splice() calls, we also should reduce number of wakeups. If splice() is used in non blocking mode, nothing we can do here of course, since the application will use a poll()/select()/epoll() event before calling splice(). A good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things. 
I tested this on the current tree and it is not working: we still get one
wakeup for each frame (the ethernet link is a 100 Mb/s one):

bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(6, 5) = 0
accept(6, 0, NULL) = 7
setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [32768], 4) = 0
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1024
splice(0x3, 0, 0x5, 0, 0x400, 0x5) = 1024
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460

About tcp_recvmsg(): we might remove the "!timeo" test there as well, but more
testing is needed. Bear in mind that if an application provides a large buffer
to the tcp_recvmsg() call, removing the test will reduce the number of
syscalls but may use more D-cache, which could reduce performance on old CPUs.
With splice() we expect not to copy memory or trash the D-cache, and since
pipe buffers are limited to 16 pages, we keep a bounded working set.
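For reference, the receive-low-water-mark setup Eric is experimenting with looks roughly like this. It is a sketch: the half-of-16-pages value is his heuristic from the message above, and PAGE_SIZE is assumed to be 4096.

#include <stdio.h>
#include <sys/socket.h>

static void set_rcvlowat(int fd)
{
	int lowat = (16 * 4096) / 2;	/* assumes PAGE_SIZE == 4096 */

	/* poll() should then only wake the process once a reasonable
	 * amount of data is queued on the socket. */
	if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)) < 0)
		perror("SO_RCVLOWAT");
}

Note that, as David Miller points out later in the thread, tcp_splice_read() did not honour the low water mark at the time, so this alone does not reduce wakeups for splice()-based readers.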
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 6:47 ` Eric Dumazet @ 2009-01-09 7:04 ` Willy Tarreau 2009-01-09 7:28 ` Eric Dumazet 2009-01-09 15:42 ` Eric Dumazet 2009-01-13 23:26 ` David Miller 2 siblings, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-09 7:04 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 09, 2009 at 07:47:16AM +0100, Eric Dumazet wrote: > > I'm not applying this until someone explains to me why > > we should remove this test from the splice receive but > > keep it in the tcp_recvmsg() code where it has been > > essentially forever. > > I found this patch usefull in my testings, but had a feeling something > was not complete. If the goal is to reduce number of splice() calls, > we also should reduce number of wakeups. If splice() is used in non > blocking mode, nothing we can do here of course, since the application > will use a poll()/select()/epoll() event before calling splice(). A > good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things. > > I tested this on current tree and it is not working : we still have > one wakeup for each frame (ethernet link is a 100 Mb/s one) Well, it simply means that data are not coming in fast enough compared to the tiny amount of work you have to perform to forward them, there's nothing wrong with that. It is important in my opinion not to wait for *enough* data to come in, otherwise it might become impossible to forward small chunks. I mean, if there are only 300 bytes left to forward, we must not wait indefinitely for more data to come, we must forward those 300 bytes. In your case below, it simply means that the performance improvement brought by splice will be really minor because you'll just avoid 2 memory copies, which are ridiculously cheap at 100 Mbps. If you would change your program to use recv/send, you would observe the exact same pattern, because as soon as poll() wakes you up, you still only have one frame in the system buffers. On a small machine I have here (geode 500 MHz), I easily have multiple frames queued at 100 Mbps because when epoll() wakes me up, I have traffic on something like 10-100 sockets, and by the time I process the first ones, the later have time to queue up more data. > bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 > listen(6, 5) = 0 > accept(6, 0, NULL) = 7 > setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [32768], 4) = 0 > poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1 > splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1024 > splice(0x3, 0, 0x5, 0, 0x400, 0x5) = 1024 > poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1 > splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460 > splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460 > poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1 > splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460 > splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460 > poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1 > splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460 > splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460 > poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1 > splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460 > splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460 > poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1 > splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460 > splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460 How much CPU is used for that ? 
Probably something like 1-2% ? If you could retry with 100-1000 concurrent sessions, you would observe more traffic on each socket, which is where the gain of splice becomes really important. > About tcp_recvmsg(), we might also remove the "!timeo" test as well, > more testings are needed. No right now we can't (we must move it somewhere else at least). Because once at least one byte has been received (copied != 0), no other check will break out of the loop (or at least I have not found it). > But remind that if an application provides > a large buffer to tcp_recvmsg() call, removing the test will reduce > the number of syscalls but might use more DCACHE. It could reduce > performance on old cpus. With splice() call, we expect to not > copy memory and trash DCACHE, and pipe buffers being limited to 16, > we cope with a limited working set. That's an interesting point indeed ! Regards, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
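The event-batching technique Willy refers to (capping how many ready descriptors epoll_wait() returns per call, so that the remaining sockets keep accumulating data while the first ones are serviced) can be sketched as follows; epfd and handle_socket() are hypothetical:

#include <sys/epoll.h>

void handle_socket(int fd);		/* hypothetical per-connection handler */

static void event_loop(int epfd)
{
	struct epoll_event events[200];	/* 30-200 events per call, per Willy */
	int i, n;

	for (;;) {
		n = epoll_wait(epfd, events, 200, -1);
		for (i = 0; i < n; i++)
			handle_socket(events[i].data.fd);
	}
}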
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 7:04 ` Willy Tarreau @ 2009-01-09 7:28 ` Eric Dumazet 2009-01-09 7:42 ` Willy Tarreau 2009-01-13 23:27 ` David Miller 0 siblings, 2 replies; 190+ messages in thread From: Eric Dumazet @ 2009-01-09 7:28 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Willy Tarreau a écrit : > On Fri, Jan 09, 2009 at 07:47:16AM +0100, Eric Dumazet wrote: >>> I'm not applying this until someone explains to me why >>> we should remove this test from the splice receive but >>> keep it in the tcp_recvmsg() code where it has been >>> essentially forever. >> I found this patch usefull in my testings, but had a feeling something >> was not complete. If the goal is to reduce number of splice() calls, >> we also should reduce number of wakeups. If splice() is used in non >> blocking mode, nothing we can do here of course, since the application >> will use a poll()/select()/epoll() event before calling splice(). A >> good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things. >> >> I tested this on current tree and it is not working : we still have >> one wakeup for each frame (ethernet link is a 100 Mb/s one) > > Well, it simply means that data are not coming in fast enough compared to > the tiny amount of work you have to perform to forward them, there's nothing > wrong with that. It is important in my opinion not to wait for *enough* data > to come in, otherwise it might become impossible to forward small chunks. > I mean, if there are only 300 bytes left to forward, we must not wait > indefinitely for more data to come, we must forward those 300 bytes. > > In your case below, it simply means that the performance improvement brought > by splice will be really minor because you'll just avoid 2 memory copies, > which are ridiculously cheap at 100 Mbps. If you would change your program > to use recv/send, you would observe the exact same pattern, because as soon > as poll() wakes you up, you still only have one frame in the system buffers. > On a small machine I have here (geode 500 MHz), I easily have multiple > frames queued at 100 Mbps because when epoll() wakes me up, I have traffic > on something like 10-100 sockets, and by the time I process the first ones, > the later have time to queue up more data. My point is to use Gigabit links or 10Gb links and hundred or thousand of flows :) But if it doesnt work on a single flow, it wont work on many :) I tried my test program with a Gb link, one flow, and got splice() calls returns 23000 bytes in average, using a litle too much of CPU : If poll() could wait a litle bit more, CPU could be available for other tasks. If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, [32768], 4), it would be good if kernel was smart enough and could reduce number of wakeups. (Next blocking point is the fixed limit of 16 pages per pipe, but thats another story) > >> About tcp_recvmsg(), we might also remove the "!timeo" test as well, >> more testings are needed. > > No right now we can't (we must move it somewhere else at least). Because > once at least one byte has been received (copied != 0), no other check > will break out of the loop (or at least I have not found it). 
>

Of course we can't remove the test totally, but we can change the logic so
that several skbs might be used/consumed per tcp_recvmsg() call, like your
patch did for splice().

Let's focus on functional changes, not on implementation details :)
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 7:28 ` Eric Dumazet @ 2009-01-09 7:42 ` Willy Tarreau 2009-01-13 23:27 ` David Miller 1 sibling, 0 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-09 7:42 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 09, 2009 at 08:28:09AM +0100, Eric Dumazet wrote: > My point is to use Gigabit links or 10Gb links and hundred or thousand of flows :) > > But if it doesnt work on a single flow, it wont work on many :) Yes it will, precisely because during the time you spend processing flow #1, you're still receiving data for flow #2. I really invite you to try. That's what I've been observing for years of userland coding of proxies. > I tried my test program with a Gb link, one flow, and got splice() calls returns 23000 bytes > in average, using a litle too much of CPU : If poll() could wait a litle bit more, CPU > could be available for other tasks. I also observe 23000 bytes on average on gigabit, which is very good (only about 5000 calls per second). And the CPU usage is lower than with recv/send, and I'd like to be able to run some profiling because I observed very different performance patterns depending on the network cards used. Generally, almost all the time is spent in softirqs. It's easy to make poll wait a little bit more : call it later and do your work before calling it. Also, epoll_wait() lets you ask it to return just a few amount of FDs. This really improves data gathering. I generally observe best performance between 30-200 FDs per call, even with 10000 concurrent connections. During the time I process the first 200 FDs, data is accumulating in the other's buffers. > If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, [32768], 4), it > would be good if kernel was smart enough and could reduce number of wakeups. Yes, I agree about that. But my comment was about not making this behaviour mandatory for splice(). Letting the application choose is the way to go, of course. > (Next blocking point is the fixed limit of 16 pages per pipe, but > thats another story) Yes but that's not always easy to guess how many data you can feed into the pipe. It seems that depending on how the segments are gathered, you can store between 16 segments and 64 kB. I have observed some cases in blocking mode where I could not push more than a few kbytes with a small MSS, indicating to me that all those segments were each on a distinct page. I don't know precisely how that's handled internally. > >> About tcp_recvmsg(), we might also remove the "!timeo" test as well, > >> more testings are needed. > > > > No right now we can't (we must move it somewhere else at least). Because > > once at least one byte has been received (copied != 0), no other check > > will break out of the loop (or at least I have not found it). > > > > Of course we cant remove the test totally, but change the logic so that several skb > might be used/consumed per tcp_recvmsg() call, like your patch did for splice() OK, I initially understood that you suggested we could simply remove it like I did for splice. > Lets focus on functional changes, not on implementation details :) Agreed :-) Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 7:28 ` Eric Dumazet 2009-01-09 7:42 ` Willy Tarreau @ 2009-01-13 23:27 ` David Miller 2009-01-13 23:35 ` Eric Dumazet 1 sibling, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-13 23:27 UTC (permalink / raw) To: dada1; +Cc: w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Eric Dumazet <dada1@cosmosbay.com> Date: Fri, 09 Jan 2009 08:28:09 +0100 > If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, > [32768], 4), it would be good if kernel was smart enough and could > reduce number of wakeups. Right, and as I pointed out in previous replies the problem is that splice() receive in TCP doesn't check the low water mark at all. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-13 23:27 ` David Miller @ 2009-01-13 23:35 ` Eric Dumazet 0 siblings, 0 replies; 190+ messages in thread From: Eric Dumazet @ 2009-01-13 23:35 UTC (permalink / raw) To: David Miller; +Cc: w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe

David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 09 Jan 2009 08:28:09 +0100
>
>> If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT,
>> [32768], 4), it would be good if kernel was smart enough and could
>> reduce number of wakeups.
>
> Right, and as I pointed out in previous replies the problem is
> that splice() receive in TCP doesn't check the low water mark
> at all.

Yes, I understand, but if splice() is running, a wakeup has already occurred,
so there is no need to check whether that wakeup was justified: just proceed
and consume some skbs, since we are already awake.

Also, an application might set a high SO_RCVLOWAT but still want to splice()
only a few bytes, so RCVLOWAT is only a hint for the code performing the
wakeup, not for the consumer.
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 6:47 ` Eric Dumazet 2009-01-09 7:04 ` Willy Tarreau @ 2009-01-09 15:42 ` Eric Dumazet 2009-01-09 17:57 ` Eric Dumazet ` (2 more replies) 2009-01-13 23:26 ` David Miller 2 siblings, 3 replies; 190+ messages in thread From: Eric Dumazet @ 2009-01-09 15:42 UTC (permalink / raw) To: David Miller; +Cc: ben, w, jarkao2, mingo, linux-kernel, netdev, jens.axboe Eric Dumazet a écrit : > David Miller a écrit : >> I'm not applying this until someone explains to me why >> we should remove this test from the splice receive but >> keep it in the tcp_recvmsg() code where it has been >> essentially forever. Reading again tcp_recvmsg(), I found it already is able to eat several skb even in nonblocking mode. setsockopt(5, SOL_SOCKET, SO_RCVLOWAT, [61440], 4) = 0 ioctl(5, FIONBIO, [1]) = 0 poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1 recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1 recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1 recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1 recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 David, if you referred to code at line 1374 of net/ipv4/tcp.c, I believe there is no issue with it. We really want to break from this loop if !timeo . Willy patch makes splice() behaving like tcp_recvmsg(), but we might call tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed) I wonder if the right fix should be done in tcp_read_sock() : this is the one who should eat several skbs IMHO, if we want optimal ACK generation. We break out of its loop at line 1246 if (!desc->count) /* this test is always true */ break; (__tcp_splice_read() set count to 0, right before calling tcp_read_sock()) So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least. ^ permalink raw reply [flat|nested] 190+ messages in thread
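The interaction Eric points at can be summarized in a short, paraphrased excerpt (not the literal kernel source):

/* __tcp_splice_read() builds the read descriptor with count left at 0: */
read_descriptor_t rd_desc = {
	.arg.data = tss,		/* .count is implicitly 0 */
};
tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);

/* ...and at the bottom of the per-skb loop in tcp_read_sock(): */
sk_eat_skb(sk, skb, 0);
if (!desc->count)	/* always true on the splice path, so at most one
			 * fully-consumed skb per tcp_read_sock() call */
	break;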
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 15:42 ` Eric Dumazet @ 2009-01-09 17:57 ` Eric Dumazet 2009-01-09 18:54 ` Willy Tarreau 2009-01-13 23:31 ` David Miller 2 siblings, 0 replies; 190+ messages in thread From: Eric Dumazet @ 2009-01-09 17:57 UTC (permalink / raw) To: David Miller; +Cc: ben, w, jarkao2, mingo, linux-kernel, netdev, jens.axboe Eric Dumazet a écrit : > Eric Dumazet a écrit : >> David Miller a écrit : >>> I'm not applying this until someone explains to me why >>> we should remove this test from the splice receive but >>> keep it in the tcp_recvmsg() code where it has been >>> essentially forever. > > Reading again tcp_recvmsg(), I found it already is able to eat several skb > even in nonblocking mode. > > setsockopt(5, SOL_SOCKET, SO_RCVLOWAT, [61440], 4) = 0 > ioctl(5, FIONBIO, [1]) = 0 > poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1 > recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536 > write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 > poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1 > recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536 > write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 > poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1 > recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536 > write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 > poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1 > recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536 > write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 > > > David, if you referred to code at line 1374 of net/ipv4/tcp.c, I believe there is > no issue with it. We really want to break from this loop if !timeo . > > Willy patch makes splice() behaving like tcp_recvmsg(), but we might call > tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed) > > I wonder if the right fix should be done in tcp_read_sock() : this is the > one who should eat several skbs IMHO, if we want optimal ACK generation. > > We break out of its loop at line 1246 > > if (!desc->count) /* this test is always true */ > break; > > (__tcp_splice_read() set count to 0, right before calling tcp_read_sock()) > > So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least. I tested following patch and got expected result : - less ACK, and correct rcvbuf adjustments. - I can fill the Gb link with one flow only, using less than 10% of the cpu, instead of 40% without patch. Setting desc->count to 1 seems to be the current practice. (Example in drivers/scsi/iscsi_tcp.c : iscsi_sw_tcp_data_ready()) ****************************************************************** * Note : this patch is wrong, because splice() can now * * return more bytes than asked for (if SPLICE_F_NONBLOCK asked) * ****************************************************************** [PATCH] tcp: splice as many packets as possible at once As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not optimal. It processes at most one segment per call. This results in low performance and very high overhead due to syscall rate when splicing from interfaces which do not support LRO. 
Willy provided a patch inside tcp_splice_read(), but a better fix is to let tcp_read_sock() process as many segments as possible, so that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called once per syscall. With this change, splice() behaves like tcp_recvmsg(), being able to consume many skbs in one system call. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index bd6ff90..15bd67b 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -533,6 +533,9 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) .arg.data = tss, }; + if (tss->flags & SPLICE_F_NONBLOCK) + rd_desc.count = 1; /* we want as many segments as possible */ + return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); } ^ permalink raw reply related [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 15:42 ` Eric Dumazet 2009-01-09 17:57 ` Eric Dumazet @ 2009-01-09 18:54 ` Willy Tarreau 2009-01-09 20:51 ` Eric Dumazet 2009-01-13 23:31 ` David Miller 2 siblings, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-09 18:54 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Hi Eric, On Fri, Jan 09, 2009 at 04:42:44PM +0100, Eric Dumazet wrote: (...) > Willy patch makes splice() behaving like tcp_recvmsg(), but we might call > tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed) > > I wonder if the right fix should be done in tcp_read_sock() : this is the > one who should eat several skbs IMHO, if we want optimal ACK generation. > > We break out of its loop at line 1246 > > if (!desc->count) /* this test is always true */ > break; > > (__tcp_splice_read() set count to 0, right before calling tcp_read_sock()) > > So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least. That's a very interesting discovery that you made here. I have made mesurements with this line commented out just to get an idea. The hardest part was to find a CPU-bound machine. Finally I slowed my laptop down to 300 MHz (in fact, 600 with throttle 50%, but let's call that 300). That way, I cannot saturate the PCI-based tg3 and I can observe the effects of various changes on the data rate. - original tcp_splice_read(), with "!timeo" : 24.1 MB/s - modified tcp_splice_read(), without "!timeo" : 32.5 MB/s (+34%) - original with line #1246 commented out : 34.5 MB/s (+43%) So you're right, avoiding calling tcp_read_sock() all the time gives a nice performance boost. Also, I found that tcp_splice_read() behaves like this when breaking out of the loop : lock_sock(); while () { ... __tcp_splice_read(); ... release_sock(); lock_sock(); if (break condition) break; } release_sock(); Which means that when breaking out of the loop on (!timeo) with ret > 0, we do release_sock/lock_sock/release_sock. So I tried a minor modification, consisting in moving the test before release_sock(), and leaving !timeo there with line #1246 commented out. That's a noticeable winner, as the data rate went up to 35.7 MB/s (+48%). Also, in your second mail, you're saying that your change might return more data than requested by the user. I can't find why, could you please explain to me, as I'm still quite ignorant in this area ? Thanks, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
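The lock-juggling change Willy measured (35.7 MB/s, +48% in his setup) amounts to testing the exit conditions before the release_sock()/lock_sock() pair rather than after it. A rough sketch of the idea, not the literal diff:

lock_sock(sk);
while (tss.len) {
	ret = __tcp_splice_read(sk, &tss);
	/* ... error and no-data handling as before ... */
	tss.len -= ret;
	spliced += ret;

	/* Exit conditions checked while still holding the lock, so
	 * breaking out does not cost a spurious release/lock cycle. */
	if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
	    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
	    signal_pending(current))
		break;

	release_sock(sk);
	lock_sock(sk);
}
release_sock(sk);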
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 18:54 ` Willy Tarreau @ 2009-01-09 20:51 ` Eric Dumazet 2009-01-09 21:24 ` Willy Tarreau 0 siblings, 1 reply; 190+ messages in thread From: Eric Dumazet @ 2009-01-09 20:51 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Willy Tarreau a écrit : > Hi Eric, > > On Fri, Jan 09, 2009 at 04:42:44PM +0100, Eric Dumazet wrote: > (...) >> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call >> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed) >> >> I wonder if the right fix should be done in tcp_read_sock() : this is the >> one who should eat several skbs IMHO, if we want optimal ACK generation. >> >> We break out of its loop at line 1246 >> >> if (!desc->count) /* this test is always true */ >> break; >> >> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock()) >> >> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least. > > That's a very interesting discovery that you made here. I have made > mesurements with this line commented out just to get an idea. The > hardest part was to find a CPU-bound machine. Finally I slowed my > laptop down to 300 MHz (in fact, 600 with throttle 50%, but let's > call that 300). That way, I cannot saturate the PCI-based tg3 and > I can observe the effects of various changes on the data rate. > > - original tcp_splice_read(), with "!timeo" : 24.1 MB/s > - modified tcp_splice_read(), without "!timeo" : 32.5 MB/s (+34%) > - original with line #1246 commented out : 34.5 MB/s (+43%) > > So you're right, avoiding calling tcp_read_sock() all the time > gives a nice performance boost. > > Also, I found that tcp_splice_read() behaves like this when breaking > out of the loop : > > lock_sock(); > while () { > ... > __tcp_splice_read(); > ... > release_sock(); > lock_sock(); > if (break condition) > break; > } > release_sock(); > > Which means that when breaking out of the loop on (!timeo) > with ret > 0, we do release_sock/lock_sock/release_sock. > > So I tried a minor modification, consisting in moving the > test before release_sock(), and leaving !timeo there with > line #1246 commented out. That's a noticeable winner, as > the data rate went up to 35.7 MB/s (+48%). > > Also, in your second mail, you're saying that your change > might return more data than requested by the user. I can't > find why, could you please explain to me, as I'm still quite > ignorant in this area ? Well, I just tested various user programs and indeed got this strange result : Here I call splice() with len=1000 (0x3e8), and you can see it gives a result of 1460 at the second call. I suspect a bug in splice code, that my patch just exposed. pipe([3, 4]) = 0 open("/dev/null", O_RDWR|O_CREAT|O_TRUNC, 0644) = 5 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6 bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 listen(6, 5) = 0 accept(6, 0, NULL) = 7 setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [1000], 4) = 0 poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1 splice(0x7, 0, 0x4, 0, 0x3e8, 0x3) = 1000 OK splice(0x3, 0, 0x5, 0, 0x3e8, 0x5) = 1000 poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1 splice(0x7, 0, 0x4, 0, 0x3e8, 0x3) = 1460 Oh well... 1460 > 1000 !!! 
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x3e8, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 20:51 ` Eric Dumazet @ 2009-01-09 21:24 ` Willy Tarreau 2009-01-09 22:02 ` Eric Dumazet 2009-01-09 22:07 ` Willy Tarreau 0 siblings, 2 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-09 21:24 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote: (...) > > Also, in your second mail, you're saying that your change > > might return more data than requested by the user. I can't > > find why, could you please explain to me, as I'm still quite > > ignorant in this area ? > > Well, I just tested various user programs and indeed got this > strange result : > > Here I call splice() with len=1000 (0x3e8), and you can see > it gives a result of 1460 at the second call. huh, not nice indeed! While looking at the code to see how this could be possible, I came across this minor thing (unrelated IMHO) : if (__skb_splice_bits(skb, &offset, &tlen, &spd)) goto done; >>>>>> else if (!tlen) <<<<<< goto done; /* * now see if we have a frag_list to map */ if (skb_shinfo(skb)->frag_list) { struct sk_buff *list = skb_shinfo(skb)->frag_list; for (; list && tlen; list = list->next) { if (__skb_splice_bits(list, &offset, &tlen, &spd)) break; } } done: Above on the enlighted line, we'd better remove the else and leave a plain "if (!tlen)". Otherwise, when the first call to __skb_splice_bits() zeroes tlen, we still enter the if and evaluate the for condition for nothing. But let's leave that for later. > I suspect a bug in splice code, that my patch just exposed. I've checked in skb_splice_bits() and below and can't see how we can move more than the requested len. However, with your change, I don't clearly see how we break out of the loop in tcp_read_sock(). Maybe we first read 1000 then loop again and read remaining data ? I suspect that we should at least exit when ((struct tcp_splice_state *)desc->arg.data)->len = 0. At least that's something easy to add just before or after !desc->count for a test. Regards, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
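For clarity, the minor cleanup Willy points out in skb_splice_bits() would simply be (a sketch, not a submitted patch):

if (__skb_splice_bits(skb, &offset, &tlen, &spd))
	goto done;
if (!tlen)		/* nothing left to map, skip the frag_list walk */
	goto done;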
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 21:24 ` Willy Tarreau @ 2009-01-09 22:02 ` Eric Dumazet 2009-01-09 22:09 ` Willy Tarreau 2009-01-09 22:07 ` Willy Tarreau 1 sibling, 1 reply; 190+ messages in thread From: Eric Dumazet @ 2009-01-09 22:02 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Willy Tarreau a écrit : > On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote: > (...) >>> Also, in your second mail, you're saying that your change >>> might return more data than requested by the user. I can't >>> find why, could you please explain to me, as I'm still quite >>> ignorant in this area ? >> Well, I just tested various user programs and indeed got this >> strange result : >> >> Here I call splice() with len=1000 (0x3e8), and you can see >> it gives a result of 1460 at the second call. > > huh, not nice indeed! > > While looking at the code to see how this could be possible, I > came across this minor thing (unrelated IMHO) : > > if (__skb_splice_bits(skb, &offset, &tlen, &spd)) > goto done; >>>>>>> else if (!tlen) <<<<<< > goto done; > > /* > * now see if we have a frag_list to map > */ > if (skb_shinfo(skb)->frag_list) { > struct sk_buff *list = skb_shinfo(skb)->frag_list; > > for (; list && tlen; list = list->next) { > if (__skb_splice_bits(list, &offset, &tlen, &spd)) > break; > } > } > > done: > > Above on the enlighted line, we'd better remove the else and leave a plain > "if (!tlen)". Otherwise, when the first call to __skb_splice_bits() zeroes > tlen, we still enter the if and evaluate the for condition for nothing. But > let's leave that for later. > >> I suspect a bug in splice code, that my patch just exposed. > > I've checked in skb_splice_bits() and below and can't see how we can move > more than the requested len. > > However, with your change, I don't clearly see how we break out of > the loop in tcp_read_sock(). Maybe we first read 1000 then loop again > and read remaining data ? I suspect that we should at least exit when > ((struct tcp_splice_state *)desc->arg.data)->len = 0. > > At least that's something easy to add just before or after !desc->count > for a test. > I believe the bug is in tcp_splice_data_recv() I am going to test a new patch, but here is the thing I found: diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index bd6ff90..fbbddf4 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -523,7 +523,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, { struct tcp_splice_state *tss = rd_desc->arg.data; - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags); + return skb_splice_bits(skb, offset, tss->pipe, len, tss->flags); } static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) ^ permalink raw reply related [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 22:02 ` Eric Dumazet @ 2009-01-09 22:09 ` Willy Tarreau 0 siblings, 0 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-09 22:09 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 09, 2009 at 11:02:06PM +0100, Eric Dumazet wrote: > Willy Tarreau a écrit : > > On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote: > > (...) > >>> Also, in your second mail, you're saying that your change > >>> might return more data than requested by the user. I can't > >>> find why, could you please explain to me, as I'm still quite > >>> ignorant in this area ? > >> Well, I just tested various user programs and indeed got this > >> strange result : > >> > >> Here I call splice() with len=1000 (0x3e8), and you can see > >> it gives a result of 1460 at the second call. > > > > huh, not nice indeed! > > > > While looking at the code to see how this could be possible, I > > came across this minor thing (unrelated IMHO) : > > > > if (__skb_splice_bits(skb, &offset, &tlen, &spd)) > > goto done; > >>>>>>> else if (!tlen) <<<<<< > > goto done; > > > > /* > > * now see if we have a frag_list to map > > */ > > if (skb_shinfo(skb)->frag_list) { > > struct sk_buff *list = skb_shinfo(skb)->frag_list; > > > > for (; list && tlen; list = list->next) { > > if (__skb_splice_bits(list, &offset, &tlen, &spd)) > > break; > > } > > } > > > > done: > > > > Above on the enlighted line, we'd better remove the else and leave a plain > > "if (!tlen)". Otherwise, when the first call to __skb_splice_bits() zeroes > > tlen, we still enter the if and evaluate the for condition for nothing. But > > let's leave that for later. > > > >> I suspect a bug in splice code, that my patch just exposed. > > > > I've checked in skb_splice_bits() and below and can't see how we can move > > more than the requested len. > > > > However, with your change, I don't clearly see how we break out of > > the loop in tcp_read_sock(). Maybe we first read 1000 then loop again > > and read remaining data ? I suspect that we should at least exit when > > ((struct tcp_splice_state *)desc->arg.data)->len = 0. > > > > At least that's something easy to add just before or after !desc->count > > for a test. > > > > I believe the bug is in tcp_splice_data_recv() yes, see my other mail. 
> I am going to test a new patch, but here is the thing I found: > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index bd6ff90..fbbddf4 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -523,7 +523,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, > { > struct tcp_splice_state *tss = rd_desc->arg.data; > > - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags); > + return skb_splice_bits(skb, offset, tss->pipe, len, tss->flags); > } it will not work, len is always 1000 in your case, for every call, as I had in my logs : kernel : tcp_splice_data_recv: skb=ed3d3480 offset=0 len=1460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3 tcp_sendpage: going through do_tcp_sendpages tcp_splice_data_recv: skb=ed3d3480 offset=1000 len=460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3 tcp_splice_data_recv: skb=ed3d3540 offset=0 len=1460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3 tcp_sendpage: going through do_tcp_sendpages tcp_sendpage: going through do_tcp_sendpages tcp_splice_data_recv: skb=ed3d3540 offset=1000 len=460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3 tcp_splice_data_recv: skb=ed3d3600 offset=0 len=1176 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3 tcp_sendpage: going through do_tcp_sendpages tcp_sendpage: going through do_tcp_sendpages tcp_splice_data_recv: skb=ed3d3600 offset=1000 len=176 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3 tcp_sendpage: going through do_tcp_sendpages program : accept(3, 0, NULL) = 4 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5 connect(5, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("10.0.3.1")}, 16) = 0 fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 fcntl64(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 pipe([6, 7]) = 0 select(6, [5], [], NULL, NULL) = 1 (in [5]) splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000 select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4]) splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000 splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1460 select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4]) splice(0x6, 0, 0x4, 0, 0x5b4, 0x3) = 1460 splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1460 select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4]) splice(0x6, 0, 0x4, 0, 0x5b4, 0x3) = 1460 splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 176 select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4]) splice(0x6, 0, 0x4, 0, 0xb0, 0x3) = 176 splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = -1 EAGAIN (Resource temporarily unavailable) close(5) = 0 close(4) = 0 exit_group(0) = ? 
Now with the fix : accept(3, 0, NULL) = 4 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5 connect(5, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("10.0.3.1")}, 16) = 0 fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 fcntl64(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 pipe([6, 7]) = 0 select(6, [5], [], NULL, NULL) = 1 (in [5]) splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000 select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4]) splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000 splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000 select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4]) splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000 splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000 select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4]) splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000 splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000 select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4]) splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000 splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 96 select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4]) splice(0x6, 0, 0x4, 0, 0x60, 0x3) = 96 splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = -1 EAGAIN (Resource temporarily unavailable) close(5) = 0 close(4) = 0 exit_group(0) = ? Regards, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 21:24 ` Willy Tarreau 2009-01-09 22:02 ` Eric Dumazet @ 2009-01-09 22:07 ` Willy Tarreau 2009-01-09 22:12 ` Eric Dumazet 1 sibling, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-09 22:07 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote: > On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote: > (...) > > > Also, in your second mail, you're saying that your change > > > might return more data than requested by the user. I can't > > > find why, could you please explain to me, as I'm still quite > > > ignorant in this area ? > > > > Well, I just tested various user programs and indeed got this > > strange result : > > > > Here I call splice() with len=1000 (0x3e8), and you can see > > it gives a result of 1460 at the second call. OK finally I could reproduce it and found why we have this. It's expected in fact. The problem when we loop in tcp_read_sock() is that tss->len is not decremented by the amount of bytes read, this one is done only in tcp_splice_read() which is outer. The solution I found was to do just like other callers, which means use desc->count to keep the remaining number of bytes we want to read. In fact, tcp_read_sock() is designed to use that one as a stop condition, which explains why you first had to hide it. Now with the attached patch as a replacement for my previous one, both issues are solved : - I splice 1000 bytes if I ask to do so - I splice as much as possible if available (typically 23 kB). My observed performances are still at the top of earlier results and IMHO that way of counting bytes makes sense for an actor called from tcp_read_sock(). diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 35bcddf..51ff3aa 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len) { struct tcp_splice_state *tss = rd_desc->arg.data; + int ret; - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags); + ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags); + if (ret > 0) + rd_desc->count -= ret; + return ret; } static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) /* Store TCP splice context information in read_descriptor_t. */ read_descriptor_t rd_desc = { .arg.data = tss, + .count = tss->len, }; return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); Regards, Willy ^ permalink raw reply related [flat|nested] 190+ messages in thread
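Roughly, the accounting with this patch works out as follows for the case traced earlier (splice() asks for len = 1000 and the first queued skb holds 1460 bytes); this is a worked illustration, not an excerpt:

/*
 *   rd_desc.count = 1000              set up in __tcp_splice_read()
 *   skb_splice_bits(.., 1000, ..)     moves 1000 bytes, returns 1000
 *   rd_desc->count -= 1000  ->  0     in tcp_splice_data_recv()
 *   tcp_read_sock() sees the skb was only partially consumed and stops;
 *   the remaining 460 bytes stay queued for the next call.
 *
 * The caller can no longer receive more than it asked for, while a larger
 * request keeps desc->count above zero and still drains several skbs in a
 * single system call.
 */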
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 22:07 ` Willy Tarreau @ 2009-01-09 22:12 ` Eric Dumazet 2009-01-09 22:17 ` Willy Tarreau 0 siblings, 1 reply; 190+ messages in thread From: Eric Dumazet @ 2009-01-09 22:12 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Willy Tarreau a écrit : > On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote: >> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote: >> (...) >>>> Also, in your second mail, you're saying that your change >>>> might return more data than requested by the user. I can't >>>> find why, could you please explain to me, as I'm still quite >>>> ignorant in this area ? >>> Well, I just tested various user programs and indeed got this >>> strange result : >>> >>> Here I call splice() with len=1000 (0x3e8), and you can see >>> it gives a result of 1460 at the second call. > > OK finally I could reproduce it and found why we have this. It's > expected in fact. > > The problem when we loop in tcp_read_sock() is that tss->len is > not decremented by the amount of bytes read, this one is done > only in tcp_splice_read() which is outer. > > The solution I found was to do just like other callers, which means > use desc->count to keep the remaining number of bytes we want to > read. In fact, tcp_read_sock() is designed to use that one as a stop > condition, which explains why you first had to hide it. > > Now with the attached patch as a replacement for my previous one, > both issues are solved : > - I splice 1000 bytes if I ask to do so > - I splice as much as possible if available (typically 23 kB). > > My observed performances are still at the top of earlier results > and IMHO that way of counting bytes makes sense for an actor called > from tcp_read_sock(). > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 35bcddf..51ff3aa 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, > unsigned int offset, size_t len) > { > struct tcp_splice_state *tss = rd_desc->arg.data; > + int ret; > > - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags); > + ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags); > + if (ret > 0) > + rd_desc->count -= ret; > + return ret; > } > > static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) > @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) > /* Store TCP splice context information in read_descriptor_t. */ > read_descriptor_t rd_desc = { > .arg.data = tss, > + .count = tss->len, > }; > > return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); > OK, I came to a different patch. Please check other tcp_read_sock() callers in tree :) diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c index 23808df..96b49e1 100644 --- a/drivers/scsi/iscsi_tcp.c +++ b/drivers/scsi/iscsi_tcp.c @@ -100,13 +100,11 @@ static void iscsi_sw_tcp_data_ready(struct sock *sk, int flag) /* * Use rd_desc to pass 'conn' to iscsi_tcp_recv. - * We set count to 1 because we want the network layer to - * hand us all the skbs that are available. iscsi_tcp_recv - * handled pdus that cross buffers or pdus that still need data. + * iscsi_tcp_recv handled pdus that cross buffers or pdus that + * still need data. 
*/ rd_desc.arg.data = conn; - rd_desc.count = 1; - tcp_read_sock(sk, &rd_desc, iscsi_sw_tcp_recv); + tcp_read_sock(sk, &rd_desc, iscsi_sw_tcp_recv, 65536); read_unlock(&sk->sk_callback_lock); diff --git a/include/net/tcp.h b/include/net/tcp.h index 218235d..b1facd1 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -490,7 +490,7 @@ extern void tcp_get_info(struct sock *, struct tcp_info *); typedef int (*sk_read_actor_t)(read_descriptor_t *, struct sk_buff *, unsigned int, size_t); extern int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, - sk_read_actor_t recv_actor); + sk_read_actor_t recv_actor, size_t tlen); extern void tcp_initialize_rcv_mss(struct sock *sk); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index bd6ff90..fbbddf4 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -523,7 +523,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, { struct tcp_splice_state *tss = rd_desc->arg.data; - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags); + return skb_splice_bits(skb, offset, tss->pipe, len, tss->flags); } static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) @@ -533,7 +533,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) .arg.data = tss, }; - return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); + return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv, tss->len); } /** @@ -611,11 +611,13 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, tss.len -= ret; spliced += ret; + if (!timeo) + break; release_sock(sk); lock_sock(sk); if (sk->sk_err || sk->sk_state == TCP_CLOSE || - (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo || + (sk->sk_shutdown & RCV_SHUTDOWN) || signal_pending(current)) break; } @@ -1193,7 +1195,7 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) * (although both would be easy to implement). */ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, - sk_read_actor_t recv_actor) + sk_read_actor_t recv_actor, size_t tlen) { struct sk_buff *skb; struct tcp_sock *tp = tcp_sk(sk); @@ -1209,6 +1211,8 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, size_t len; len = skb->len - offset; + if (len > tlen) + len = tlen; /* Stop reading if we hit a patch of urgent data */ if (tp->urg_data) { u32 urg_offset = tp->urg_seq - seq; @@ -1226,6 +1230,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, seq += used; copied += used; offset += used; + tlen -= used; } /* * If recv_actor drops the lock (e.g. TCP splice @@ -1243,7 +1248,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, break; } sk_eat_skb(sk, skb, 0); - if (!desc->count) + if (!tlen) break; } tp->copied_seq = seq; diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index 5cbb404..75f8e83 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -1109,8 +1109,7 @@ static void xs_tcp_data_ready(struct sock *sk, int bytes) /* We use rd_desc to pass struct xprt to xs_tcp_data_recv */ rd_desc.arg.data = xprt; do { - rd_desc.count = 65536; - read = tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv); + read = tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv, 65536); } while (read > 0); out: read_unlock(&sk->sk_callback_lock); ^ permalink raw reply related [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 22:12 ` Eric Dumazet @ 2009-01-09 22:17 ` Willy Tarreau 2009-01-09 22:42 ` Evgeniy Polyakov 2009-01-09 22:45 ` Eric Dumazet 0 siblings, 2 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-09 22:17 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 09, 2009 at 11:12:09PM +0100, Eric Dumazet wrote: > Willy Tarreau a écrit : > > On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote: > >> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote: > >> (...) > >>>> Also, in your second mail, you're saying that your change > >>>> might return more data than requested by the user. I can't > >>>> find why, could you please explain to me, as I'm still quite > >>>> ignorant in this area ? > >>> Well, I just tested various user programs and indeed got this > >>> strange result : > >>> > >>> Here I call splice() with len=1000 (0x3e8), and you can see > >>> it gives a result of 1460 at the second call. > > > > OK finally I could reproduce it and found why we have this. It's > > expected in fact. > > > > The problem when we loop in tcp_read_sock() is that tss->len is > > not decremented by the amount of bytes read, this one is done > > only in tcp_splice_read() which is outer. > > > > The solution I found was to do just like other callers, which means > > use desc->count to keep the remaining number of bytes we want to > > read. In fact, tcp_read_sock() is designed to use that one as a stop > > condition, which explains why you first had to hide it. > > > > Now with the attached patch as a replacement for my previous one, > > both issues are solved : > > - I splice 1000 bytes if I ask to do so > > - I splice as much as possible if available (typically 23 kB). > > > > My observed performances are still at the top of earlier results > > and IMHO that way of counting bytes makes sense for an actor called > > from tcp_read_sock(). > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index 35bcddf..51ff3aa 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, > > unsigned int offset, size_t len) > > { > > struct tcp_splice_state *tss = rd_desc->arg.data; > > + int ret; > > > > - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags); > > + ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags); > > + if (ret > 0) > > + rd_desc->count -= ret; > > + return ret; > > } > > > > static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) > > @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) > > /* Store TCP splice context information in read_descriptor_t. */ > > read_descriptor_t rd_desc = { > > .arg.data = tss, > > + .count = tss->len, > > }; > > > > return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); > > > > OK, I came to a different patch. Please check other tcp_read_sock() callers in tree :) I've seen the other callers, but they all use desc->count for their own purpose. That's how I understood what it was used for :-) I think it's better not to change the API here and use tcp_read_sock() how it's supposed to be used. Also, the less parameters to the function, the better. However I'm OK for the !timeo before release_sock/lock_sock. I just don't know if we can put the rest of the if above or not. 
I don't know what changes we're supposed to pick up by doing the release_sock/lock_sock before the if(). Regards, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 21:24 ` Willy Tarreau 2009-01-09 22:02 ` Eric Dumazet @ 2009-01-09 22:07 ` Willy Tarreau 2009-01-09 22:12 ` Eric Dumazet 1 sibling, 1 reply; 190+ messages in thread (threading header as in the archive) * Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 22:17 ` Willy Tarreau @ 2009-01-09 22:42 ` Evgeniy Polyakov 2009-01-09 22:50 ` Willy Tarreau 2009-01-10 7:40 ` Eric Dumazet 1 sibling, 2 replies; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-09 22:42 UTC (permalink / raw) To: Willy Tarreau Cc: Eric Dumazet, David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 09, 2009 at 11:17:44PM +0100, Willy Tarreau (w@1wt.eu) wrote: > However I'm OK for the !timeo before release_sock/lock_sock. I just > don't know if we can put the rest of the if above or not. I don't > know what changes we're supposed to collect by doing release_sock/ > lock_sock before the if(). Not to interrupt the discussion, but for clarification: that release_sock/lock_sock is used to process the backlog accumulated while the socket was locked. And while dropping the additional pair before the final release is OK, moving the pair itself should be thought through twice. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
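For readers who have not looked at this corner of the socket code, the locking/backlog interaction described here can be summarised by the following schematic pseudocode (not compilable kernel source: more_to_read() and consume_one_chunk() are invented placeholders, while lock_sock()/release_sock() and the sk fields are the real names):

/* Schematic only.  While a process owns the socket lock, softirq input
 * cannot run the normal receive path; incoming segments are parked on
 * sk->sk_backlog instead.  release_sock() replays that backlog, so
 * dropping and re-taking the lock inside a long receive loop lets new
 * segments be processed (and ACKed) while already-queued data is still
 * being consumed. */
lock_sock(sk);
while (more_to_read(sk)) {
	consume_one_chunk(sk);      /* e.g. one skb via skb_splice_bits() */

	release_sock(sk);           /* run the backlog: new data, ACKs */
	lock_sock(sk);

	if (sk->sk_err || (sk->sk_shutdown & RCV_SHUTDOWN) ||
	    signal_pending(current))
		break;
}
release_sock(sk);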
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 22:42 ` Evgeniy Polyakov @ 2009-01-09 22:50 ` Willy Tarreau 2009-01-09 23:01 ` Evgeniy Polyakov 1 sibling, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-09 22:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Eric Dumazet, David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Sat, Jan 10, 2009 at 01:42:58AM +0300, Evgeniy Polyakov wrote: > On Fri, Jan 09, 2009 at 11:17:44PM +0100, Willy Tarreau (w@1wt.eu) wrote: > > However I'm OK for the !timeo before release_sock/lock_sock. I just > > don't know if we can put the rest of the if above or not. I don't > > know what changes we're supposed to collect by doing release_sock/ > > lock_sock before the if(). > > Not to interrupt the discussion, but for the clarification, that > release_sock/lock_sock is used to process the backlog accumulated while > socket was locked. And while dropping additional pair before the final > release is ok, but moving this itself should be thought of twice. Nice, thanks Evgeniy. So it makes sense to move only the !timeo test above since it's not dependent on the socket, and leave the rest of the test where it currently is. That's what Eric has proposed in his latest patch. Well, I'm now trying to educate myself on the send part. It's still not very clear to me and I'd like to understand a little bit better why we have this corruption problem and why there is a difference between sending segments from memory and sending them from another socket where they were already waiting. I think I'll put printks everywhere and see what I can observe. Knowing about the GSO/SG workaround already helps me enable/disable the bug. Regards, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 22:50 ` Willy Tarreau @ 2009-01-09 23:01 ` Evgeniy Polyakov 2009-01-09 23:06 ` Willy Tarreau 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-09 23:01 UTC (permalink / raw) To: Willy Tarreau Cc: Eric Dumazet, David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 09, 2009 at 11:50:10PM +0100, Willy Tarreau (w@1wt.eu) wrote: > Well, I'm now trying to educate myself on the send part. It's still > not very clear to me and I'd like to understand a little bit better > why we have this corruption problem and why there is a difference > between sending segments from memory and sending them from another > socket where they were already waiting. printks are the best choice, since you will get exactly what you are looking for instead of deciphering what developer or code told you. > I think I'll put printks everywhere and see what I can observe. > Knowing about the GSO/SG workaround already helps me enable/disable > the bug. I wish I could also be capable to disable the bugs... :) -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 23:01 ` Evgeniy Polyakov @ 2009-01-09 23:06 ` Willy Tarreau 0 siblings, 0 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-09 23:06 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Eric Dumazet, David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Sat, Jan 10, 2009 at 02:01:24AM +0300, Evgeniy Polyakov wrote: > On Fri, Jan 09, 2009 at 11:50:10PM +0100, Willy Tarreau (w@1wt.eu) wrote: > > Well, I'm now trying to educate myself on the send part. It's still > > not very clear to me and I'd like to understand a little bit better > > why we have this corruption problem and why there is a difference > > between sending segments from memory and sending them from another > > socket where they were already waiting. > > printks are the best choice, since you will get exactly what you are > looking for instead of deciphering what developer or code told you. I know, but as I told in an earlier mail, it was not even clear to me what function were called. I have made guesses but that's quite hard when you don't know the entry point. I think I'm not too far from having discovered the call chain. Understanding it will be another story, of course ;-) > > I think I'll put printks everywhere and see what I can observe. > > Knowing about the GSO/SG workaround already helps me enable/disable > > the bug. > > I wish I could also be capable to disable the bugs... :) Me too :-) Here we have an opportunity to turn it on/off, let's take it ! Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 22:42 ` Evgeniy Polyakov 2009-01-09 22:50 ` Willy Tarreau @ 2009-01-10 7:40 ` Eric Dumazet 2009-01-11 12:58 ` Evgeniy Polyakov 1 sibling, 1 reply; 190+ messages in thread From: Eric Dumazet @ 2009-01-10 7:40 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Evgeniy Polyakov a écrit : > On Fri, Jan 09, 2009 at 11:17:44PM +0100, Willy Tarreau (w@1wt.eu) wrote: >> However I'm OK for the !timeo before release_sock/lock_sock. I just >> don't know if we can put the rest of the if above or not. I don't >> know what changes we're supposed to collect by doing release_sock/ >> lock_sock before the if(). > > Not to interrupt the discussion, but for the clarification, that > release_sock/lock_sock is used to process the backlog accumulated while > socket was locked. And while dropping additional pair before the final > release is ok, but moving this itself should be thought of twice. > Hum... I just caught the release_sock(sk)/lock_sock(sk) done in skb_splice_bits() So : 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary to process backlog. It's already done in skb_splice_bits() 2) If we loop in tcp_read_sock() calling skb_splice_bits() several times then we should perform the following tests inside this loop ? if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) || signal_pending(current)) break; And remove them from tcp_splice_read() ? ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-10 7:40 ` Eric Dumazet @ 2009-01-11 12:58 ` Evgeniy Polyakov 2009-01-11 13:14 ` Eric Dumazet 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-11 12:58 UTC (permalink / raw) To: Eric Dumazet Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Hi Eric. On Sat, Jan 10, 2009 at 08:40:05AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote: > > Not to interrupt the discussion, but for the clarification, that > > release_sock/lock_sock is used to process the backlog accumulated while > > socket was locked. And while dropping additional pair before the final > > release is ok, but moving this itself should be thought of twice. > > > > Hum... I just caught the release_sock(sk)/lock_sock(sk) done in skb_splice_bits() > > So : > > 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary > to process backlog. Its already done in skb_splice_bits() Yes, in tcp_splice_read() they were added to avoid a deadlock. > 2) If we loop in tcp_read_sock() calling skb_splice_bits() several times > then we should perform the following tests inside this loop ? > > if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) || > signal_pending(current)) break; > > And removie them from tcp_splice_read() ? It could be done, but for what reason? To detect a disconnected socket early? Is it worth the changes? -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-11 12:58 ` Evgeniy Polyakov @ 2009-01-11 13:14 ` Eric Dumazet 2009-01-11 13:35 ` Evgeniy Polyakov 0 siblings, 1 reply; 190+ messages in thread From: Eric Dumazet @ 2009-01-11 13:14 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Evgeniy Polyakov a écrit : > Hi Eric. > > On Sat, Jan 10, 2009 at 08:40:05AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote: >>> Not to interrupt the discussion, but for the clarification, that >>> release_sock/lock_sock is used to process the backlog accumulated while >>> socket was locked. And while dropping additional pair before the final >>> release is ok, but moving this itself should be thought of twice. >>> >> Hum... I just caught the release_sock(sk)/lock_sock(sk) done in skb_splice_bits() >> >> So : >> >> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary >> to process backlog. Its already done in skb_splice_bits() > > Yes, in the tcp_splice_read() they are added to remove a deadlock. Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ? > >> 2) If we loop in tcp_read_sock() calling skb_splice_bits() several times >> then we should perform the following tests inside this loop ? >> >> if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) || >> signal_pending(current)) break; >> >> And removie them from tcp_splice_read() ? > > It could be done, but for what reason? To detect disconnected socket early? > Does it worth the changes? > I was thinking about the case your thread is doing a splice() from tcp socket to a pipe, while another thread is doing the splice from this pipe to something else. Once patched, tcp_read_sock() could loop a long time... ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-11 13:14 ` Eric Dumazet @ 2009-01-11 13:35 ` Evgeniy Polyakov 2009-01-11 16:00 ` Eric Dumazet 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-11 13:35 UTC (permalink / raw) To: Eric Dumazet Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Sun, Jan 11, 2009 at 02:14:57PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote: > >> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary > >> to process backlog. Its already done in skb_splice_bits() > > > > Yes, in the tcp_splice_read() they are added to remove a deadlock. > > Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ? Sorry, I meant that we drop lock in skb_splice_bits() to prevent the deadlock, and tcp_splice_read() needs it to process the backlog. I think that even with non-blocking splice that release_sock/lock_sock is needed, since we are able to do a parallel job: to receive new data (scheduled by early release_sock backlog processing) in bh and to process already received data via splice codepath. Maybe in non-blocking splice mode this is not an issue though, but for the blocking mode this allows to grab more skbs at once in skb_splice_bits. > > > >> 2) If we loop in tcp_read_sock() calling skb_splice_bits() several times > >> then we should perform the following tests inside this loop ? > >> > >> if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) || > >> signal_pending(current)) break; > >> > >> And removie them from tcp_splice_read() ? > > > > It could be done, but for what reason? To detect disconnected socket early? > > Does it worth the changes? > > I was thinking about the case your thread is doing a splice() from tcp socket to > a pipe, while another thread is doing the splice from this pipe to something else. > > Once patched, tcp_read_sock() could loop a long time... Well, it maybe a good idea... Can not say anything against it :) -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-11 13:35 ` Evgeniy Polyakov @ 2009-01-11 16:00 ` Eric Dumazet 2009-01-11 16:05 ` Evgeniy Polyakov 0 siblings, 1 reply; 190+ messages in thread From: Eric Dumazet @ 2009-01-11 16:00 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Evgeniy Polyakov a écrit : > On Sun, Jan 11, 2009 at 02:14:57PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote: >>>> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary >>>> to process backlog. Its already done in skb_splice_bits() >>> Yes, in the tcp_splice_read() they are added to remove a deadlock. >> Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ? > > Sorry, I meant that we drop lock in skb_splice_bits() to prevent the deadlock, > and tcp_splice_read() needs it to process the backlog. While we drop lock in skb_splice_bits() to prevent the deadlock, we also process backlog at this stage. No need to process backlog again in the higher level function. > > I think that even with non-blocking splice that release_sock/lock_sock > is needed, since we are able to do a parallel job: to receive new data > (scheduled by early release_sock backlog processing) in bh and to > process already received data via splice codepath. > Maybe in non-blocking splice mode this is not an issue though, but for > the blocking mode this allows to grab more skbs at once in skb_splice_bits. skb_splice_bits() operates on one skb, you lost me :) ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-11 16:00 ` Eric Dumazet @ 2009-01-11 16:05 ` Evgeniy Polyakov 2009-01-14 0:07 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-11 16:05 UTC (permalink / raw) To: Eric Dumazet Cc: Willy Tarreau, David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Sun, Jan 11, 2009 at 05:00:37PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote: > >>>> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary > >>>> to process backlog. Its already done in skb_splice_bits() > >>> Yes, in the tcp_splice_read() they are added to remove a deadlock. > >> Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ? > > > > Sorry, I meant that we drop lock in skb_splice_bits() to prevent the deadlock, > > and tcp_splice_read() needs it to process the backlog. > > While we drop lock in skb_splice_bits() to prevent the deadlock, we > also process backlog at this stage. No need to process backlog > again in the higher level function. Yes, but having it earlier allows us to receive a new skb while we are processing the ones already received. > > I think that even with non-blocking splice that release_sock/lock_sock > > is needed, since we are able to do a parallel job: to receive new data > > (scheduled by early release_sock backlog processing) in bh and to > > process already received data via splice codepath. > > Maybe in non-blocking splice mode this is not an issue though, but for > > the blocking mode this allows to grab more skbs at once in skb_splice_bits. > > skb_splice_bits() operates on one skb, you lost me :) Exactly, and to get that we release the socket earlier so that the data can be acked, and while we copy it or do anything else, the next skb can be received. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-11 16:05 ` Evgeniy Polyakov @ 2009-01-14 0:07 ` David Miller 2009-01-14 0:13 ` Evgeniy Polyakov 0 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-14 0:07 UTC (permalink / raw) To: zbr; +Cc: dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Evgeniy Polyakov <zbr@ioremap.net> Date: Sun, 11 Jan 2009 19:05:48 +0300 > On Sun, Jan 11, 2009 at 05:00:37PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote: > > While we drop lock in skb_splice_bits() to prevent the deadlock, we > > also process backlog at this stage. No need to process backlog > > again in the higher level function. > > Yes, but having it earlier allows to receive new skb while processing > already received. > > > > I think that even with non-blocking splice that release_sock/lock_sock > > > is needed, since we are able to do a parallel job: to receive new data > > > (scheduled by early release_sock backlog processing) in bh and to > > > process already received data via splice codepath. > > > Maybe in non-blocking splice mode this is not an issue though, but for > > > the blocking mode this allows to grab more skbs at once in skb_splice_bits. > > > > skb_splice_bits() operates on one skb, you lost me :) > > Exactly, and to have it we earlier release a socket so that it could be > acked and while we copy it or doing anything else, the next one would > received. I think the socket release in skb_splice_bits() (although necessary) just muddies the waters, and whether the extra one done in tcp_splice_read() helps at all is open to debate. That skb_clone() done by skb_splice_bits() pisses me off too, we really ought to fix that. And we also have that data corruption bug to cure too. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 0:07 ` David Miller @ 2009-01-14 0:13 ` Evgeniy Polyakov 2009-01-14 0:16 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-14 0:13 UTC (permalink / raw) To: David Miller Cc: dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 13, 2009 at 04:07:21PM -0800, David Miller (davem@davemloft.net) wrote: > > Exactly, and to have it we earlier release a socket so that it could be > > acked and while we copy it or doing anything else, the next one would > > received. > > I think the socket release in skb_splice_bits() (although necessary) > just muddies the waters, and whether the extra one done in > tcp_splice_read() helps at all is open to debate. Well, yes, a simple performance test with and without it will probably clarify things. > That skb_clone() done by skb_splice_bits() pisses me off too, > we really ought to fix that. And we also have that data corruption > bug to cure too. The clone is needed since TCP expects to own the skb and frees it unconditionally via __kfree_skb(). What is the best solution for the data corruption bug? To copy the data all the time, or to implement our own allocator to be used in alloc_skb() and friends to allocate the head? I think it can be done transparently for the drivers. I can volunteer for this :) The first one is more appropriate for the current bugfix-only stage, but it will result in the whole release being too slow. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 0:13 ` Evgeniy Polyakov @ 2009-01-14 0:16 ` David Miller 2009-01-14 0:22 ` Evgeniy Polyakov 2009-01-14 0:51 ` Herbert Xu 0 siblings, 2 replies; 190+ messages in thread From: David Miller @ 2009-01-14 0:16 UTC (permalink / raw) To: zbr; +Cc: dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Evgeniy Polyakov <zbr@ioremap.net> Date: Wed, 14 Jan 2009 03:13:45 +0300 > What is the best solution for the data corruption bug? To copy the data > all the time or implement own allocator to be used in alloc_skb and > friends to allocate the head? I think it can be done transparently for > the drivers. I can volunteer for this :) > The first one is more appropriate for the current bugfix-only stage, > but this will result in the whole release being too slow. I am looking into this right now. I wish there were some way we could make this code grab and release a reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to accomplish its goals. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 0:16 ` David Miller @ 2009-01-14 0:22 ` Evgeniy Polyakov 2009-01-14 0:37 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-14 0:22 UTC (permalink / raw) To: David Miller Cc: dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 13, 2009 at 04:16:25PM -0800, David Miller (davem@davemloft.net) wrote: > I wish there were some way we could make this code grab and release a > reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to > accomplish it's goals. Ugh... Clone without cloning, but by increasing the dataref. Given that splice only needs the skb to track the head of the original, this may really work. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 0:22 ` Evgeniy Polyakov @ 2009-01-14 0:37 ` David Miller 2009-01-14 3:51 ` Herbert Xu 0 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-14 0:37 UTC (permalink / raw) To: zbr; +Cc: dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Evgeniy Polyakov <zbr@ioremap.net> Date: Wed, 14 Jan 2009 03:22:52 +0300 > On Tue, Jan 13, 2009 at 04:16:25PM -0800, David Miller (davem@davemloft.net) wrote: > > I wish there were some way we could make this code grab and release a > > reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to > > accomplish it's goals. > > Ugh... Clone without cloninig, but by increasing the dataref. Getting > that splice only needs that skb to track the head of the original, this > may really work. Here is something I scrambled together, it is largely based upon Jarek's patch: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 5110b35..05126da 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -70,12 +70,17 @@ static struct kmem_cache *skbuff_head_cache __read_mostly; static struct kmem_cache *skbuff_fclone_cache __read_mostly; +static void skb_release_data(struct sk_buff *skb); + static void sock_pipe_buf_release(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { struct sk_buff *skb = (struct sk_buff *) buf->private; - kfree_skb(skb); + if (skb) + skb_release_data(skb); + else + put_page(buf->page); } static void sock_pipe_buf_get(struct pipe_inode_info *pipe, @@ -83,7 +88,10 @@ static void sock_pipe_buf_get(struct pipe_inode_info *pipe, { struct sk_buff *skb = (struct sk_buff *) buf->private; - skb_get(skb); + if (skb) + atomic_inc(&skb_shinfo(skb)->dataref); + else + get_page(buf->page); } static int sock_pipe_buf_steal(struct pipe_inode_info *pipe, @@ -1336,7 +1344,10 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) { struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private; - kfree_skb(skb); + if (skb) + skb_release_data(skb); + else + put_page(spd->pages[i]); } /* @@ -1344,7 +1355,7 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) */ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, unsigned int len, unsigned int offset, - struct sk_buff *skb) + struct sk_buff *skb, int linear) { if (unlikely(spd->nr_pages == PIPE_BUFFERS)) return 1; @@ -1352,8 +1363,15 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, spd->pages[spd->nr_pages] = page; spd->partial[spd->nr_pages].len = len; spd->partial[spd->nr_pages].offset = offset; - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); + spd->partial[spd->nr_pages].private = + (unsigned long) (linear ? 
skb : NULL); spd->nr_pages++; + + if (linear) + atomic_inc(&skb_shinfo(skb)->dataref); + else + get_page(page); + return 0; } @@ -1369,7 +1387,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, static inline int __splice_segment(struct page *page, unsigned int poff, unsigned int plen, unsigned int *off, unsigned int *len, struct sk_buff *skb, - struct splice_pipe_desc *spd) + struct splice_pipe_desc *spd, int linear) { if (!*len) return 1; @@ -1392,7 +1410,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, /* the linear region may spread across several pages */ flen = min_t(unsigned int, flen, PAGE_SIZE - poff); - if (spd_fill_page(spd, page, flen, poff, skb)) + if (spd_fill_page(spd, page, flen, poff, skb, linear)) return 1; __segment_seek(&page, &poff, &plen, flen); @@ -1419,7 +1437,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, if (__splice_segment(virt_to_page(skb->data), (unsigned long) skb->data & (PAGE_SIZE - 1), skb_headlen(skb), - offset, len, skb, spd)) + offset, len, skb, spd, 1)) return 1; /* @@ -1429,7 +1447,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; if (__splice_segment(f->page, f->page_offset, f->size, - offset, len, skb, spd)) + offset, len, skb, spd, 0)) return 1; } ^ permalink raw reply related [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 0:37 ` David Miller @ 2009-01-14 3:51 ` Herbert Xu 2009-01-14 4:25 ` David Miller 2009-01-14 7:27 ` David Miller 0 siblings, 2 replies; 190+ messages in thread From: Herbert Xu @ 2009-01-14 3:51 UTC (permalink / raw) To: David Miller Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe David Miller <davem@davemloft.net> wrote: > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 5110b35..05126da 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -70,12 +70,17 @@ > static struct kmem_cache *skbuff_head_cache __read_mostly; > static struct kmem_cache *skbuff_fclone_cache __read_mostly; > > +static void skb_release_data(struct sk_buff *skb); > + > static void sock_pipe_buf_release(struct pipe_inode_info *pipe, > struct pipe_buffer *buf) > { > struct sk_buff *skb = (struct sk_buff *) buf->private; > > - kfree_skb(skb); > + if (skb) > + skb_release_data(skb); > + else > + put_page(buf->page); > } Unfortunately this won't work, not even for network destinations. The reason is that this gets called as soon as the destination's splice hook returns, for networking that means when sendpage returns. So by that time we'll still be left with just a page reference on a page where the slab memory may already have been freed. To make this work we need to get the destination's splice hooks to acquire this reference. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 3:51 ` Herbert Xu @ 2009-01-14 4:25 ` David Miller 2009-01-14 7:27 ` David Miller 1 sibling, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-14 4:25 UTC (permalink / raw) To: herbert Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 14 Jan 2009 14:51:24 +1100 > Unfortunately this won't work, not even for network destinations. > > The reason is that this gets called as soon as the destination's > splice hook returns, for networking that means when sendpage returns. > > So by that time we'll still be left with just a page reference > on a page where the slab memory may already have been freed. > > To make this work we need to get the destination's splice hooks > to acquire this reference. Yes I realized this after your earlier posting today. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 3:51 ` Herbert Xu 2009-01-14 4:25 ` David Miller @ 2009-01-14 7:27 ` David Miller 2009-01-14 8:26 ` Herbert Xu 1 sibling, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-14 7:27 UTC (permalink / raw) To: herbert Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 14 Jan 2009 14:51:24 +1100 > Unfortunately this won't work, not even for network destinations. > > The reason is that this gets called as soon as the destination's > splice hook returns, for networking that means when sendpage returns. > > So by that time we'll still be left with just a page reference > on a page where the slab memory may already have been freed. > > To make this work we need to get the destination's splice hooks > to acquire this reference. So while trying to figure out a sane way to fix this, I found another bug: /* * map the linear part */ if (__splice_segment(virt_to_page(skb->data), (unsigned long) skb->data & (PAGE_SIZE - 1), skb_headlen(skb), offset, len, skb, spd)) return 1; This will explode if the SLAB cache for skb->head is using compound (ie. order > 0) pages. For example, if this is an order-1 page being used for the skb->head data (which would be true on most systems for jumbo MTU frames being received into a linear SKB), the offset will be wrong and depending upon skb_headlen() we could reference past the end of that non-compound page we will end up grabbing a reference to. And then we'll end up with a compound page in an skb_shinfo() frag array, which is illegal. Well, at least, I can list several drivers that will barf when trying to TX that (Acenic, atlx1, cassini, jme, sungem), since they use pci_map_page(... virt_to_page(skb->data)) or similar. The core KMAP'ing support for SKBs will also not be able to grok such a beastly SKB. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 7:27 ` David Miller @ 2009-01-14 8:26 ` Herbert Xu 2009-01-14 8:53 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-01-14 8:26 UTC (permalink / raw) To: David Miller Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 13, 2009 at 11:27:10PM -0800, David Miller wrote: > > So while trying to figure out a sane way to fix this, I found > another bug: > > /* > * map the linear part > */ > if (__splice_segment(virt_to_page(skb->data), > (unsigned long) skb->data & (PAGE_SIZE - 1), > skb_headlen(skb), > offset, len, skb, spd)) > return 1; > > This will explode if the SLAB cache for skb->head is using compound > (ie. order > 0) pages. > > For example, if this is an order-1 page being used for the skb->head > data (which would be true on most systems for jumbo MTU frames being > received into a linear SKB), the offset will be wrong and depending > upon skb_headlen() we could reference past the end of that > non-compound page we will end up grabbing a reference to. I'm actually not worried so much about these packets since these drivers should be converted to skb frags as otherwise they'll probably stop working after a while due to memory fragmentation. But yeah for correctness we definitely should address this in skb_splice_bits. I still think Jarek's approach (the copying one) is probably the easiest for now until we can find a better way. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 8:26 ` Herbert Xu @ 2009-01-14 8:53 ` Jarek Poplawski 2009-01-14 9:29 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-14 8:53 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Jan 14, 2009 at 07:26:30PM +1100, Herbert Xu wrote: > On Tue, Jan 13, 2009 at 11:27:10PM -0800, David Miller wrote: > > > > So while trying to figure out a sane way to fix this, I found > > another bug: > > > > /* > > * map the linear part > > */ > > if (__splice_segment(virt_to_page(skb->data), > > (unsigned long) skb->data & (PAGE_SIZE - 1), > > skb_headlen(skb), > > offset, len, skb, spd)) > > return 1; > > > > This will explode if the SLAB cache for skb->head is using compound > > (ie. order > 0) pages. > > > > For example, if this is an order-1 page being used for the skb->head > > data (which would be true on most systems for jumbo MTU frames being > > received into a linear SKB), the offset will be wrong and depending > > upon skb_headlen() we could reference past the end of that > > non-compound page we will end up grabbing a reference to. > > I'm actually not worried so much about these packets since these > drivers should be converted to skb frags as otherwise they'll > probably stop working after a while due to memory fragmentation. > > But yeah for correctness we definitely should address this in > skb_splice_bits. > > I still think Jarek's approach (the copying one) is probably the > easiest for now until we can find a better way. > Actually, I still think my second approach (the PageSlab) is probably (if tested) the easiest for now, because it should fix the reported (Willy's) problem, without any change or copy overhead for splice to file (which could be still wrong, but not obviously wrong). Then we could search for the only right way (which is most probably around Herbert's new skb page allocator. IMHO "my" "copying approach" is too risky e.g. for stable etc. because of unknown memory requirements, especially for some larger size page configs/systems. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 8:53 ` Jarek Poplawski @ 2009-01-14 9:29 ` David Miller 2009-01-14 9:42 ` Jarek Poplawski ` (3 more replies) 0 siblings, 4 replies; 190+ messages in thread From: David Miller @ 2009-01-14 9:29 UTC (permalink / raw) To: jarkao2 Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Wed, 14 Jan 2009 08:53:08 +0000 > Actually, I still think my second approach (the PageSlab) is probably > (if tested) the easiest for now, because it should fix the reported > (Willy's) problem, without any change or copy overhead for splice to > file (which could be still wrong, but not obviously wrong). It's a simple fix, but as Herbert stated it leaves other ->sendpage() implementations exposed to data corruption when the from side of the pipe buffer is a socket. That, to me, is almost worse than a bad fix. It's definitely worse than a slower but full fix, which the copy patch is. Therefore what I'll likely do is push Jarek's copy based cure, and meanwhile we can brainstorm some more on how to fix this properly in the long term. So, I've put together a full commit message and Jarek's patch below. One thing I notice is that the silly skb_clone() done by SKB splicing is no longer necessary. We could get rid of that to offset (some) of the cost we are adding with this bug fix. Comments? net: Fix data corruption when splicing from sockets. From: Jarek Poplawski <jarkao2@gmail.com> The trick in socket splicing where we try to convert the skb->data into a page based reference using virt_to_page() does not work so well. The idea is to pass the virt_to_page() reference via the pipe buffer, and refcount the buffer using a SKB reference. But if we are splicing from a socket to a socket (via sendpage) this doesn't work. The from side processing will grab the page (and SKB) references. The sendpage() calls will grab page references only, return, and then the from side processing completes and drops the SKB ref. The page based reference to skb->data is not enough to keep the kmalloc() buffer backing it from being reused. Yet, that is all that the socket send side has at this point. This leads to data corruption if the skb->data buffer is reused by SLAB before the send side socket actually gets the TX packet out to the device. The fix employed here is to simply allocate a page and copy the skb->data bytes into that page. This will hurt performance, but there is no clear way to fix this properly without a copy at the present time, and it is important to get rid of the data corruption. Signed-off-by: David S. 
Miller <davem@davemloft.net> diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 5110b35..6e43d52 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly; static void sock_pipe_buf_release(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - struct sk_buff *skb = (struct sk_buff *) buf->private; - - kfree_skb(skb); + put_page(buf->page); } static void sock_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - struct sk_buff *skb = (struct sk_buff *) buf->private; - - skb_get(skb); + get_page(buf->page); } static int sock_pipe_buf_steal(struct pipe_inode_info *pipe, @@ -1334,9 +1330,19 @@ fault: */ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) { - struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private; + put_page(spd->pages[i]); +} - kfree_skb(skb); +static inline struct page *linear_to_page(struct page *page, unsigned int len, + unsigned int offset) +{ + struct page *p = alloc_pages(GFP_KERNEL, 0); + + if (!p) + return NULL; + memcpy(page_address(p) + offset, page_address(page) + offset, len); + + return p; } /* @@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) */ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, unsigned int len, unsigned int offset, - struct sk_buff *skb) + struct sk_buff *skb, int linear) { if (unlikely(spd->nr_pages == PIPE_BUFFERS)) return 1; + if (linear) { + page = linear_to_page(page, len, offset); + if (!page) + return 1; + } + spd->pages[spd->nr_pages] = page; spd->partial[spd->nr_pages].len = len; spd->partial[spd->nr_pages].offset = offset; - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); spd->nr_pages++; + get_page(page); + return 0; } @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, static inline int __splice_segment(struct page *page, unsigned int poff, unsigned int plen, unsigned int *off, unsigned int *len, struct sk_buff *skb, - struct splice_pipe_desc *spd) + struct splice_pipe_desc *spd, int linear) { if (!*len) return 1; @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, /* the linear region may spread across several pages */ flen = min_t(unsigned int, flen, PAGE_SIZE - poff); - if (spd_fill_page(spd, page, flen, poff, skb)) + if (spd_fill_page(spd, page, flen, poff, skb, linear)) return 1; __segment_seek(&page, &poff, &plen, flen); @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, if (__splice_segment(virt_to_page(skb->data), (unsigned long) skb->data & (PAGE_SIZE - 1), skb_headlen(skb), - offset, len, skb, spd)) + offset, len, skb, spd, 1)) return 1; /* @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; if (__splice_segment(f->page, f->page_offset, f->size, - offset, len, skb, spd)) + offset, len, skb, spd, 0)) return 1; } -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 9:29 ` David Miller @ 2009-01-14 9:42 ` Jarek Poplawski 2009-01-14 10:06 ` David Miller 2009-01-14 9:54 ` Jarek Poplawski ` (2 subsequent siblings) 3 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-14 9:42 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Wed, 14 Jan 2009 08:53:08 +0000 > > > Actually, I still think my second approach (the PageSlab) is probably > > (if tested) the easiest for now, because it should fix the reported > > (Willy's) problem, without any change or copy overhead for splice to > > file (which could be still wrong, but not obviously wrong). > > It's a simple fix, but as Herbert stated it leaves other ->sendpage() > implementations exposed to data corruption when the from side of the > pipe buffer is a socket. I don't think Herbert meant other ->sendpage() implementations, but I could miss something. > That, to me, is almost worse than a bad fix. > > It's definitely worse than a slower but full fix, which the copy > patch is. Sorry, I can't see how this patch could make sendpage worse. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 9:42 ` Jarek Poplawski @ 2009-01-14 10:06 ` David Miller 2009-01-14 10:47 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-14 10:06 UTC (permalink / raw) To: jarkao2 Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Wed, 14 Jan 2009 09:42:16 +0000 > On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: > > From: Jarek Poplawski <jarkao2@gmail.com> > > Date: Wed, 14 Jan 2009 08:53:08 +0000 > > > > > Actually, I still think my second approach (the PageSlab) is probably > > > (if tested) the easiest for now, because it should fix the reported > > > (Willy's) problem, without any change or copy overhead for splice to > > > file (which could be still wrong, but not obviously wrong). > > > > It's a simple fix, but as Herbert stated it leaves other ->sendpage() > > implementations exposed to data corruption when the from side of the > > pipe buffer is a socket. > > I don't think Herbert meant other ->sendpage() implementations, but I > could miss something. I think he did :-) Or, more generally, he could have been referring to splice pipe outputs. All of these things grab references to pages and expect that to keep the underlying data from being reallocated. That doesn't work for this skb->data case. > > That, to me, is almost worse than a bad fix. > > > > It's definitely worse than a slower but full fix, which the copy > > patch is. > > Sorry, I can't see how this patch could make sendpage worse. Because that patch only fixes TCP's ->sendpage() implementation. There are others out there which could end up experiencing similar data corruption. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 10:06 ` David Miller @ 2009-01-14 10:47 ` Jarek Poplawski 2009-01-14 11:29 ` Herbert Xu 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-14 10:47 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Jan 14, 2009 at 02:06:37AM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Wed, 14 Jan 2009 09:42:16 +0000 > > > On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: > > > From: Jarek Poplawski <jarkao2@gmail.com> > > > Date: Wed, 14 Jan 2009 08:53:08 +0000 > > > > > > > Actually, I still think my second approach (the PageSlab) is probably > > > > (if tested) the easiest for now, because it should fix the reported > > > > (Willy's) problem, without any change or copy overhead for splice to > > > > file (which could be still wrong, but not obviously wrong). > > > > > > It's a simple fix, but as Herbert stated it leaves other ->sendpage() > > > implementations exposed to data corruption when the from side of the > > > pipe buffer is a socket. > > > > I don't think Herbert meant other ->sendpage() implementations, but I > > could miss something. > > I think he did :-) I hope Herbert will make it clear. > > Or, more generally, he could have been referring to splice pipe > outputs. All of these things grab references to pages and > expect that to keep the underlying data from being reallocated. > > That doesn't work for this skb->data case. > > > > That, to me, is almost worse than a bad fix. > > > > > > It's definitely worse than a slower but full fix, which the copy > > > patch is. > > > > Sorry, I can't see how this patch could make sendpage worse. > > Because that patch only fixes TCP's ->sendpage() implementation. Yes, it fixes (I hope) the only reported implementation. > > There are others out there which could end up experiencing similar > data corruption. Since this fix is very simple (and IMHO safe) it could be probably used elsewhere too, but I doubt we should care at the moment. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 10:47 ` Jarek Poplawski @ 2009-01-14 11:29 ` Herbert Xu 2009-01-14 11:40 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-01-14 11:29 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote: > > > > I don't think Herbert meant other ->sendpage() implementations, but I > > > could miss something. > > > > I think he did :-) > > I hope Herbert will make it clear. Yes I did mean the other splice paths, and in particular, the path to disk. I hope that's clear enough :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 11:29 ` Herbert Xu @ 2009-01-14 11:40 ` Jarek Poplawski 2009-01-14 11:45 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-14 11:40 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Jan 14, 2009 at 10:29:03PM +1100, Herbert Xu wrote: > On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote: > > > > > > I don't think Herbert meant other ->sendpage() implementations, but I > > > > could miss something. > > > > > > I think he did :-) > > > > I hope Herbert will make it clear. > > Yes I did mean the other splice paths, and in particular, the path > to disk. I hope that's clear enough :) So, I think I got it right: otherwise it would be enough to make this copy to a new page only before ->sendpage() calls e.g. in generic_splice_sendpage(). Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 11:40 ` Jarek Poplawski @ 2009-01-14 11:45 ` Jarek Poplawski 0 siblings, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-01-14 11:45 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Jan 14, 2009 at 11:40:59AM +0000, Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 10:29:03PM +1100, Herbert Xu wrote: > > On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote: > > > > > > > > I don't think Herbert meant other ->sendpage() implementations, but I > > > > > could miss something. > > > > > > > > I think he did :-) > > > > > > I hope Herbert will make it clear. > > > > Yes I did mean the other splice paths, and in particular, the path > > to disk. I hope that's clear enough :) > > So, I think I got it right: otherwise it would be enough to make this > copy to a new page only before ->sendpage() calls e.g. in > generic_splice_sendpage(). Hmm... Actually in pipe_to_sendpage(). Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 9:29 ` David Miller 2009-01-14 9:42 ` Jarek Poplawski @ 2009-01-14 9:54 ` Jarek Poplawski 2009-01-14 10:01 ` Willy Tarreau ` (2 more replies) 2009-01-14 11:28 ` Herbert Xu 2009-01-15 23:03 ` Willy Tarreau 3 siblings, 3 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-01-14 9:54 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: ... > Therefore what I'll likely do is push Jarek's copy based cure, > and meanwhile we can brainstorm some more on how to fix this > properly in the long term. > > So, I've put together a full commit message and Jarek's patch > below. One thing I notice is that the silly skb_clone() done > by SKB splicing is no longer necessary. > > We could get rid of that to offset (some) of the cost we are > adding with this bug fix. > > Comments? Yes, this should lessen the overhead a bit. > > net: Fix data corruption when splicing from sockets. > > From: Jarek Poplawski <jarkao2@gmail.com> > > The trick in socket splicing where we try to convert the skb->data > into a page based reference using virt_to_page() does not work so > well. > > The idea is to pass the virt_to_page() reference via the pipe > buffer, and refcount the buffer using a SKB reference. > > But if we are splicing from a socket to a socket (via sendpage) > this doesn't work. > > The from side processing will grab the page (and SKB) references. > The sendpage() calls will grab page references only, return, and > then the from side processing completes and drops the SKB ref. > > The page based reference to skb->data is not enough to keep the > kmalloc() buffer backing it from being reused. Yet, that is > all that the socket send side has at this point. > > This leads to data corruption if the skb->data buffer is reused > by SLAB before the send side socket actually gets the TX packet > out to the device. > > The fix employed here is to simply allocate a page and copy the > skb->data bytes into that page. > > This will hurt performance, but there is no clear way to fix this > properly without a copy at the present time, and it is important > to get rid of the data corruption. > > Signed-off-by: David S. Miller <davem@davemloft.net> You are very brave! I'd prefer to wait for at least minimal testing by Willy... Thanks, Jarek P. BTW, an skb parameter could be removed from spd_fill_page() to make it even faster... ... 
> static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > unsigned int len, unsigned int offset, > - struct sk_buff *skb) > + struct sk_buff *skb, int linear) > { > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > return 1; > > + if (linear) { > + page = linear_to_page(page, len, offset); > + if (!page) > + return 1; > + } > + > spd->pages[spd->nr_pages] = page; > spd->partial[spd->nr_pages].len = len; > spd->partial[spd->nr_pages].offset = offset; > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > spd->nr_pages++; > + get_page(page); > + > return 0; > } > > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, > static inline int __splice_segment(struct page *page, unsigned int poff, > unsigned int plen, unsigned int *off, > unsigned int *len, struct sk_buff *skb, > - struct splice_pipe_desc *spd) > + struct splice_pipe_desc *spd, int linear) > { > if (!*len) > return 1; > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, > /* the linear region may spread across several pages */ > flen = min_t(unsigned int, flen, PAGE_SIZE - poff); > > - if (spd_fill_page(spd, page, flen, poff, skb)) > + if (spd_fill_page(spd, page, flen, poff, skb, linear)) > return 1; > > __segment_seek(&page, &poff, &plen, flen); > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > if (__splice_segment(virt_to_page(skb->data), > (unsigned long) skb->data & (PAGE_SIZE - 1), > skb_headlen(skb), > - offset, len, skb, spd)) > + offset, len, skb, spd, 1)) > return 1; > > /* > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; > > if (__splice_segment(f->page, f->page_offset, f->size, > - offset, len, skb, spd)) > + offset, len, skb, spd, 0)) > return 1; > } > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 9:54 ` Jarek Poplawski @ 2009-01-14 10:01 ` Willy Tarreau 2009-01-14 12:06 ` Jarek Poplawski 2009-01-14 12:15 ` Jarek Poplawski 2 siblings, 0 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-14 10:01 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote: > You are very brave! I'd prefer to wait for at least minimal testing > by Willy... Will test, probably this evening. Thanks guys, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 9:54 ` Jarek Poplawski 2009-01-14 10:01 ` Willy Tarreau @ 2009-01-14 12:06 ` Jarek Poplawski 2009-01-14 12:15 ` Jarek Poplawski 2 siblings, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-01-14 12:06 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe, Changli Gao On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: ... > > net: Fix data corruption when splicing from sockets. > > > > From: Jarek Poplawski <jarkao2@gmail.com> ... > > > > Signed-off-by: David S. Miller <davem@davemloft.net> > > You are very brave! I'd prefer to wait for at least minimal testing > by Willy... On the other hand, I can't waste your trust like this (plus I'm not suicidal), so a little update: Based on a review by Changli Gao <xiaosuo@gmail.com>: http://lkml.org/lkml/2008/2/26/210 Foreseen-by: Changli Gao <xiaosuo@gmail.com> Diagnosed-by: Willy Tarreau <w@1wt.eu> Reported-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> > > Thanks, > Jarek P. > > BTW, an skb parameter could be removed from spd_fill_page() to make it > even faster... > > ... > > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > > unsigned int len, unsigned int offset, > > - struct sk_buff *skb) > > + struct sk_buff *skb, int linear) > > { > > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > > return 1; > > > > + if (linear) { > > + page = linear_to_page(page, len, offset); > > + if (!page) > > + return 1; > > + } > > + > > spd->pages[spd->nr_pages] = page; > > spd->partial[spd->nr_pages].len = len; > > spd->partial[spd->nr_pages].offset = offset; > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > > spd->nr_pages++; > > + get_page(page); > > + > > return 0; > > } > > > > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, > > static inline int __splice_segment(struct page *page, unsigned int poff, > > unsigned int plen, unsigned int *off, > > unsigned int *len, struct sk_buff *skb, > > - struct splice_pipe_desc *spd) > > + struct splice_pipe_desc *spd, int linear) > > { > > if (!*len) > > return 1; > > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, > > /* the linear region may spread across several pages */ > > flen = min_t(unsigned int, flen, PAGE_SIZE - poff); > > > > - if (spd_fill_page(spd, page, flen, poff, skb)) > > + if (spd_fill_page(spd, page, flen, poff, skb, linear)) > > return 1; > > > > __segment_seek(&page, &poff, &plen, flen); > > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > > if (__splice_segment(virt_to_page(skb->data), > > (unsigned long) skb->data & (PAGE_SIZE - 1), > > skb_headlen(skb), > > - offset, len, skb, spd)) > > + offset, len, skb, spd, 1)) > > return 1; > > > > /* > > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > > const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; > > > > if (__splice_segment(f->page, f->page_offset, f->size, > > - offset, len, skb, spd)) > > + offset, len, skb, spd, 0)) > > return 1; > > } > > > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ 
permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 9:54 ` Jarek Poplawski 2009-01-14 10:01 ` Willy Tarreau 2009-01-14 12:06 ` Jarek Poplawski @ 2009-01-14 12:15 ` Jarek Poplawski 2 siblings, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-01-14 12:15 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe, Changli Gao Sorry: take 2: On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: ... > > net: Fix data corruption when splicing from sockets. > > > > From: Jarek Poplawski <jarkao2@gmail.com> ... > > > > Signed-off-by: David S. Miller <davem@davemloft.net> > > You are very brave! I'd prefer to wait for at least minimal testing > by Willy... On the other hand, I can't waste your trust like this (plus I'm not suicidal), so a little update: Based on a review by Changli Gao <xiaosuo@gmail.com>: http://lkml.org/lkml/2008/2/26/210 Foreseen-by: Changli Gao <xiaosuo@gmail.com> Diagnosed-by: Willy Tarreau <w@1wt.eu> Reported-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Fixed-by: Jens Axboe <jens.axboe@oracle.com> > > Thanks, > Jarek P. > > BTW, an skb parameter could be removed from spd_fill_page() to make it > even faster... > > ... > > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > > unsigned int len, unsigned int offset, > > - struct sk_buff *skb) > > + struct sk_buff *skb, int linear) > > { > > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > > return 1; > > > > + if (linear) { > > + page = linear_to_page(page, len, offset); > > + if (!page) > > + return 1; > > + } > > + > > spd->pages[spd->nr_pages] = page; > > spd->partial[spd->nr_pages].len = len; > > spd->partial[spd->nr_pages].offset = offset; > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > > spd->nr_pages++; > > + get_page(page); > > + > > return 0; > > } > > > > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, > > static inline int __splice_segment(struct page *page, unsigned int poff, > > unsigned int plen, unsigned int *off, > > unsigned int *len, struct sk_buff *skb, > > - struct splice_pipe_desc *spd) > > + struct splice_pipe_desc *spd, int linear) > > { > > if (!*len) > > return 1; > > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, > > /* the linear region may spread across several pages */ > > flen = min_t(unsigned int, flen, PAGE_SIZE - poff); > > > > - if (spd_fill_page(spd, page, flen, poff, skb)) > > + if (spd_fill_page(spd, page, flen, poff, skb, linear)) > > return 1; > > > > __segment_seek(&page, &poff, &plen, flen); > > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > > if (__splice_segment(virt_to_page(skb->data), > > (unsigned long) skb->data & (PAGE_SIZE - 1), > > skb_headlen(skb), > > - offset, len, skb, spd)) > > + offset, len, skb, spd, 1)) > > return 1; > > > > /* > > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > > const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; > > > > if (__splice_segment(f->page, f->page_offset, f->size, > > - offset, len, skb, spd)) > > + offset, len, skb, spd, 0)) > > return 1; > > } > > > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo 
info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 9:29 ` David Miller 2009-01-14 9:42 ` Jarek Poplawski 2009-01-14 9:54 ` Jarek Poplawski @ 2009-01-14 11:28 ` Herbert Xu 2009-01-15 23:03 ` Willy Tarreau 3 siblings, 0 replies; 190+ messages in thread From: Herbert Xu @ 2009-01-14 11:28 UTC (permalink / raw) To: David Miller Cc: jarkao2, zbr, dada1, w, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: > > It's a simple fix, but as Herbert stated it leaves other ->sendpage() > implementations exposed to data corruption when the from side of the > pipe buffer is a socket. > > That, to me, is almost worse than a bad fix. Yep, so far nobody has verified the disk path at all. So for all we know, if there is a delay on the way to disk, the exact same thing can occur. Besides, the PageSlab thing is going to copy for network to network anyway. > So, I've put together a full commit message and Jarek's patch > below. One thing I notice is that the silly skb_clone() done > by SKB splicing is no longer necessary. > > We could get rid of that to offset (some) of the cost we are > adding with this bug fix. > > Comments? Yes that's probably a good idea. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
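For readers trying to picture the "PageSlab" alternative that keeps being referred to, a minimal sketch of what such a check might look like inside spd_fill_page() follows; this is an assumption about the shape of Jarek's unposted approach, not code from this thread:

    /* Hypothetical sketch: copy only when the source page backs a slab
     * allocation (i.e. it is skb->data), and keep the zero-copy
     * reference for real page frags.
     */
    if (PageSlab(page)) {
            page = linear_to_page(page, len, offset);  /* private copy */
            if (!page)
                    return 1;
    } else {
            get_page(page);            /* frag page: safe to just pin */
    }

As Herbert notes above, even this variant still copies the linear part for network-to-network splicing.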
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 9:29 ` David Miller ` (2 preceding siblings ...) 2009-01-14 11:28 ` Herbert Xu @ 2009-01-15 23:03 ` Willy Tarreau 2009-01-15 23:19 ` David Miller 2009-01-15 23:19 ` Herbert Xu 3 siblings, 2 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-15 23:03 UTC (permalink / raw) To: David Miller Cc: jarkao2, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: > Therefore what I'll likely do is push Jarek's copy based cure, > and meanwhile we can brainstorm some more on how to fix this > properly in the long term. > > So, I've put together a full commit message and Jarek's patch > below. One thing I notice is that the silly skb_clone() done > by SKB splicing is no longer necessary. > > We could get rid of that to offset (some) of the cost we are > adding with this bug fix. > > Comments? David, please don't merge it as-is. I've just tried it and got an OOM in __alloc_pages_internal after a few seconds of data transfer. I'm leaving the patch below for comments, maybe someone will spot something ? Don't we need at least one kfree() somewhere to match alloc_pages() ? Regards, Willy -- > > net: Fix data corruption when splicing from sockets. > > From: Jarek Poplawski <jarkao2@gmail.com> > > The trick in socket splicing where we try to convert the skb->data > into a page based reference using virt_to_page() does not work so > well. > > The idea is to pass the virt_to_page() reference via the pipe > buffer, and refcount the buffer using a SKB reference. > > But if we are splicing from a socket to a socket (via sendpage) > this doesn't work. > > The from side processing will grab the page (and SKB) references. > The sendpage() calls will grab page references only, return, and > then the from side processing completes and drops the SKB ref. > > The page based reference to skb->data is not enough to keep the > kmalloc() buffer backing it from being reused. Yet, that is > all that the socket send side has at this point. > > This leads to data corruption if the skb->data buffer is reused > by SLAB before the send side socket actually gets the TX packet > out to the device. > > The fix employed here is to simply allocate a page and copy the > skb->data bytes into that page. > > This will hurt performance, but there is no clear way to fix this > properly without a copy at the present time, and it is important > to get rid of the data corruption. > > Signed-off-by: David S. 
Miller <davem@davemloft.net> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 5110b35..6e43d52 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly; > static void sock_pipe_buf_release(struct pipe_inode_info *pipe, > struct pipe_buffer *buf) > { > - struct sk_buff *skb = (struct sk_buff *) buf->private; > - > - kfree_skb(skb); > + put_page(buf->page); > } > > static void sock_pipe_buf_get(struct pipe_inode_info *pipe, > struct pipe_buffer *buf) > { > - struct sk_buff *skb = (struct sk_buff *) buf->private; > - > - skb_get(skb); > + get_page(buf->page); > } > > static int sock_pipe_buf_steal(struct pipe_inode_info *pipe, > @@ -1334,9 +1330,19 @@ fault: > */ > static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) > { > - struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private; > + put_page(spd->pages[i]); > +} > > - kfree_skb(skb); > +static inline struct page *linear_to_page(struct page *page, unsigned int len, > + unsigned int offset) > +{ > + struct page *p = alloc_pages(GFP_KERNEL, 0); > + > + if (!p) > + return NULL; > + memcpy(page_address(p) + offset, page_address(page) + offset, len); > + > + return p; > } > > /* > @@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) > */ > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > unsigned int len, unsigned int offset, > - struct sk_buff *skb) > + struct sk_buff *skb, int linear) > { > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > return 1; > > + if (linear) { > + page = linear_to_page(page, len, offset); > + if (!page) > + return 1; > + } > + > spd->pages[spd->nr_pages] = page; > spd->partial[spd->nr_pages].len = len; > spd->partial[spd->nr_pages].offset = offset; > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > spd->nr_pages++; > + get_page(page); > + > return 0; > } > > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, > static inline int __splice_segment(struct page *page, unsigned int poff, > unsigned int plen, unsigned int *off, > unsigned int *len, struct sk_buff *skb, > - struct splice_pipe_desc *spd) > + struct splice_pipe_desc *spd, int linear) > { > if (!*len) > return 1; > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, > /* the linear region may spread across several pages */ > flen = min_t(unsigned int, flen, PAGE_SIZE - poff); > > - if (spd_fill_page(spd, page, flen, poff, skb)) > + if (spd_fill_page(spd, page, flen, poff, skb, linear)) > return 1; > > __segment_seek(&page, &poff, &plen, flen); > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > if (__splice_segment(virt_to_page(skb->data), > (unsigned long) skb->data & (PAGE_SIZE - 1), > skb_headlen(skb), > - offset, len, skb, spd)) > + offset, len, skb, spd, 1)) > return 1; > > /* > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; > > if (__splice_segment(f->page, f->page_offset, f->size, > - offset, len, skb, spd)) > + offset, len, skb, spd, 0)) > return 1; > } > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 190+ messages in 
thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:03 ` Willy Tarreau @ 2009-01-15 23:19 ` David Miller 2009-01-15 23:19 ` Herbert Xu 1 sibling, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-15 23:19 UTC (permalink / raw) To: w Cc: jarkao2, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Fri, 16 Jan 2009 00:03:31 +0100 > please don't merge it as-is. I've just tried it and got an OOM > in __alloc_pages_internal after a few seconds of data transfer. > > I'm leaving the patch below for comments, maybe someone will spot > something ? Don't we need at least one kfree() somewhere to match > alloc_pages() ? What is needed is a put_page(). But that should be happening as a result of the splice callback. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:03 ` Willy Tarreau 2009-01-15 23:19 ` David Miller @ 2009-01-15 23:19 ` Herbert Xu 2009-01-15 23:26 ` David Miller 2009-01-15 23:32 ` [PATCH] " Willy Tarreau 1 sibling, 2 replies; 190+ messages in thread From: Herbert Xu @ 2009-01-15 23:19 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote: > > I'm leaving the patch below for comments, maybe someone will spot > something ? Don't we need at least one kfree() somewhere to match > alloc_pages() ? Indeed. > > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > > unsigned int len, unsigned int offset, > > - struct sk_buff *skb) > > + struct sk_buff *skb, int linear) > > { > > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > > return 1; > > > > + if (linear) { > > + page = linear_to_page(page, len, offset); > > + if (!page) > > + return 1; > > + } > > + > > spd->pages[spd->nr_pages] = page; > > spd->partial[spd->nr_pages].len = len; > > spd->partial[spd->nr_pages].offset = offset; > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > > spd->nr_pages++; > > + get_page(page); This get_page needs to be moved into an else clause of the previous if block. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
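Spelled out, the reference-count reasoning behind Herbert's remark (a sketch of the logic; the corrected patch follows in the next message): alloc_pages() already returns the new page with one reference held, so only the borrowed, non-copied page needs an extra get_page().

    if (linear) {
            page = linear_to_page(page, len, offset);
            if (!page)
                    return 1;
            /* freshly allocated: already carries the single reference
             * that the pipe buffer release (put_page()) will drop */
    } else {
            /* existing frag page: we must take our own reference */
            get_page(page);
    }

With the unconditional get_page(), every copied page ended up with a reference count of two but only one put_page(), which is consistent with the OOM Willy reported.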
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:19 ` Herbert Xu @ 2009-01-15 23:26 ` David Miller 2009-01-15 23:32 ` Herbert Xu 2009-01-20 8:37 ` Jarek Poplawski 2009-01-15 23:32 ` [PATCH] " Willy Tarreau 1 sibling, 2 replies; 190+ messages in thread From: David Miller @ 2009-01-15 23:26 UTC (permalink / raw) To: herbert Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Herbert Xu <herbert@gondor.apana.org.au> Date: Fri, 16 Jan 2009 10:19:35 +1100 > On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote: > > > + if (linear) { > > > + page = linear_to_page(page, len, offset); > > > + if (!page) > > > + return 1; > > > + } > > > + > > > spd->pages[spd->nr_pages] = page; > > > spd->partial[spd->nr_pages].len = len; > > > spd->partial[spd->nr_pages].offset = offset; > > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > > > spd->nr_pages++; > > > + get_page(page); > > This get_page needs to be moved into an else clause of the previous > if block. Yep, good catch Herbert. New patch, this has the SKB clone removal as well: net: Fix data corruption when splicing from sockets. From: Jarek Poplawski <jarkao2@gmail.com> The trick in socket splicing where we try to convert the skb->data into a page based reference using virt_to_page() does not work so well. The idea is to pass the virt_to_page() reference via the pipe buffer, and refcount the buffer using a SKB reference. But if we are splicing from a socket to a socket (via sendpage) this doesn't work. The from side processing will grab the page (and SKB) references. The sendpage() calls will grab page references only, return, and then the from side processing completes and drops the SKB ref. The page based reference to skb->data is not enough to keep the kmalloc() buffer backing it from being reused. Yet, that is all that the socket send side has at this point. This leads to data corruption if the skb->data buffer is reused by SLAB before the send side socket actually gets the TX packet out to the device. The fix employed here is to simply allocate a page and copy the skb->data bytes into that page. This will hurt performance, but there is no clear way to fix this properly without a copy at the present time, and it is important to get rid of the data corruption. With fixes from Herbert Xu. Signed-off-by: David S. 
Miller <davem@davemloft.net> diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 65eac77..56272ac 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly; static void sock_pipe_buf_release(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - struct sk_buff *skb = (struct sk_buff *) buf->private; - - kfree_skb(skb); + put_page(buf->page); } static void sock_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - struct sk_buff *skb = (struct sk_buff *) buf->private; - - skb_get(skb); + get_page(buf->page); } static int sock_pipe_buf_steal(struct pipe_inode_info *pipe, @@ -1334,9 +1330,19 @@ fault: */ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) { - struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private; + put_page(spd->pages[i]); +} - kfree_skb(skb); +static inline struct page *linear_to_page(struct page *page, unsigned int len, + unsigned int offset) +{ + struct page *p = alloc_pages(GFP_KERNEL, 0); + + if (!p) + return NULL; + memcpy(page_address(p) + offset, page_address(page) + offset, len); + + return p; } /* @@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) */ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, unsigned int len, unsigned int offset, - struct sk_buff *skb) + struct sk_buff *skb, int linear) { if (unlikely(spd->nr_pages == PIPE_BUFFERS)) return 1; + if (linear) { + page = linear_to_page(page, len, offset); + if (!page) + return 1; + } else + get_page(page); + spd->pages[spd->nr_pages] = page; spd->partial[spd->nr_pages].len = len; spd->partial[spd->nr_pages].offset = offset; - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); spd->nr_pages++; + return 0; } @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, static inline int __splice_segment(struct page *page, unsigned int poff, unsigned int plen, unsigned int *off, unsigned int *len, struct sk_buff *skb, - struct splice_pipe_desc *spd) + struct splice_pipe_desc *spd, int linear) { if (!*len) return 1; @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, /* the linear region may spread across several pages */ flen = min_t(unsigned int, flen, PAGE_SIZE - poff); - if (spd_fill_page(spd, page, flen, poff, skb)) + if (spd_fill_page(spd, page, flen, poff, skb, linear)) return 1; __segment_seek(&page, &poff, &plen, flen); @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, if (__splice_segment(virt_to_page(skb->data), (unsigned long) skb->data & (PAGE_SIZE - 1), skb_headlen(skb), - offset, len, skb, spd)) + offset, len, skb, spd, 1)) return 1; /* @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; if (__splice_segment(f->page, f->page_offset, f->size, - offset, len, skb, spd)) + offset, len, skb, spd, 0)) return 1; } @@ -1442,7 +1455,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, * the frag list, if such a thing exists. We'd probably need to recurse to * handle that cleanly. 
*/ -int skb_splice_bits(struct sk_buff *__skb, unsigned int offset, +int skb_splice_bits(struct sk_buff *skb, unsigned int offset, struct pipe_inode_info *pipe, unsigned int tlen, unsigned int flags) { @@ -1455,16 +1468,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset, .ops = &sock_pipe_buf_ops, .spd_release = sock_spd_release, }; - struct sk_buff *skb; - - /* - * I'd love to avoid the clone here, but tcp_read_sock() - * ignores reference counts and unconditonally kills the sk_buff - * on return from the actor. - */ - skb = skb_clone(__skb, GFP_KERNEL); - if (unlikely(!skb)) - return -ENOMEM; /* * __skb_splice_bits() only fails if the output has no room left, @@ -1488,15 +1491,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset, } done: - /* - * drop our reference to the clone, the pipe consumption will - * drop the rest. - */ - kfree_skb(skb); - if (spd.nr_pages) { + struct sock *sk = skb->sk; int ret; - struct sock *sk = __skb->sk; /* * Drop the socket lock, otherwise we have reverse ^ permalink raw reply related [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:26 ` David Miller @ 2009-01-15 23:32 ` Herbert Xu 2009-01-15 23:34 ` David Miller 2009-01-20 8:37 ` Jarek Poplawski 1 sibling, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-01-15 23:32 UTC (permalink / raw) To: David Miller Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote: > > New patch, this has the SKB clone removal as well: Thanks Dave! Something else just came to mind though. > +static inline struct page *linear_to_page(struct page *page, unsigned int len, > + unsigned int offset) > +{ > + struct page *p = alloc_pages(GFP_KERNEL, 0); > + > + if (!p) > + return NULL; > + memcpy(page_address(p) + offset, page_address(page) + offset, len); This won't work very well if skb->head is longer than a page. We'll need to divide it up into individual pages. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:32 ` Herbert Xu @ 2009-01-15 23:34 ` David Miller 2009-01-15 23:42 ` Willy Tarreau 2009-01-19 6:16 ` David Miller 0 siblings, 2 replies; 190+ messages in thread From: David Miller @ 2009-01-15 23:34 UTC (permalink / raw) To: herbert Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Herbert Xu <herbert@gondor.apana.org.au> Date: Fri, 16 Jan 2009 10:32:05 +1100 > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote: > > +static inline struct page *linear_to_page(struct page *page, unsigned int len, > > + unsigned int offset) > > +{ > > + struct page *p = alloc_pages(GFP_KERNEL, 0); > > + > > + if (!p) > > + return NULL; > > + memcpy(page_address(p) + offset, page_address(page) + offset, len); > > This won't work very well if skb->head is longer than a page. > > We'll need to divide it up into individual pages. Oh yes the same bug I pointed out the other day. But Willy can test this patch as-is, since he is not using jumbo frames in linear SKBs. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:34 ` David Miller @ 2009-01-15 23:42 ` Willy Tarreau 2009-01-15 23:44 ` Willy Tarreau 2009-01-19 6:16 ` David Miller 1 sibling, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-15 23:42 UTC (permalink / raw) To: David Miller Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Fri, 16 Jan 2009 10:32:05 +1100 > > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote: > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len, > > > + unsigned int offset) > > > +{ > > > + struct page *p = alloc_pages(GFP_KERNEL, 0); > > > + > > > + if (!p) > > > + return NULL; > > > + memcpy(page_address(p) + offset, page_address(page) + offset, len); > > > > This won't work very well if skb->head is longer than a page. > > > > We'll need to divide it up into individual pages. > > Oh yes the same bug I pointed out the other day. > > But Willy can test this patch as-is, Hey, nice work Dave. +3% performance from your previous patch (31.6 MB/s). It's going fine and stable here. > since he is not using jumbo frames in linear SKBs. If you're interested, this week-end I can do some tests on my myri10ge NICs which support LRO. I frequently observe 23 kB packets there, and they also support jumbo frames. Those should cover the case above. I'm afraid that's all for me for this evening, I have to get some sleep before going to work. If you want to cook up more patches, I'll be able to do a bit of testing in 5 hours now. Cheers! Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:42 ` Willy Tarreau @ 2009-01-15 23:44 ` Willy Tarreau 2009-01-15 23:54 ` David Miller 2009-01-16 6:51 ` Jarek Poplawski 0 siblings, 2 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-15 23:44 UTC (permalink / raw) To: David Miller Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 16, 2009 at 12:42:55AM +0100, Willy Tarreau wrote: > On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote: > > From: Herbert Xu <herbert@gondor.apana.org.au> > > Date: Fri, 16 Jan 2009 10:32:05 +1100 > > > > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote: > > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len, > > > > + unsigned int offset) > > > > +{ > > > > + struct page *p = alloc_pages(GFP_KERNEL, 0); > > > > + > > > > + if (!p) > > > > + return NULL; > > > > + memcpy(page_address(p) + offset, page_address(page) + offset, len); > > > > > > This won't work very well if skb->head is longer than a page. > > > > > > We'll need to divide it up into individual pages. > > > > Oh yes the same bug I pointed out the other day. > > > > But Willy can test this patch as-is, > > Hey, nice work Dave. +3% performance from your previous patch > (31.6 MB/s). It's going fine and stable here. And BTW feel free to add my Tested-by if you want in case you merge this fix. Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:44 ` Willy Tarreau @ 2009-01-15 23:54 ` David Miller 2009-01-19 0:42 ` Willy Tarreau 2009-01-16 6:51 ` Jarek Poplawski 1 sibling, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-15 23:54 UTC (permalink / raw) To: w Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Fri, 16 Jan 2009 00:44:08 +0100 > And BTW feel free to add my Tested-by if you want in case you merge > this fix. Done, thanks Willy. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:54 ` David Miller @ 2009-01-19 0:42 ` Willy Tarreau 2009-01-19 3:08 ` Herbert Xu ` (2 more replies) 0 siblings, 3 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-19 0:42 UTC (permalink / raw) To: David Miller Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe Hi guys, On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Fri, 16 Jan 2009 00:44:08 +0100 > > > And BTW feel free to add my Tested-by if you want in case you merge > > this fix. > > Done, thanks Willy. Just for the record, I've now re-integrated those changes in a test kernel that I booted on my 10gig machines. I have updated my user-space code in haproxy to run a new series of tests. Even though there is a memcpy(), the results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) : - 4.8 Gbps at 100% CPU using MTU=1500 without LRO (3.2 Gbps at 100% CPU without splice) - 9.2 Gbps at 50% CPU using MTU=1500 with LRO - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without splice) - 10 Gbps at 15% CPU using MTU=9000 with LRO These last ones are really impressive. While I had already observed such performance on the Myri10GE with Tux, it's the first time I can reach that level with so little CPU usage in haproxy ! So I think that the memcpy() workaround might be a non-issue for some time. I agree it's not beautiful but it works pretty well for now. The 3 patches I used on top of 2.6.27.10 were the fix to return 0 instead of -EAGAIN on end of read, the one to process multiple skbs at once, and Dave's last patch based on Jarek's workaround for the corruption issue. Cheers, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 0:42 ` Willy Tarreau @ 2009-01-19 3:08 ` Herbert Xu 2009-01-19 3:27 ` David Miller 2009-01-19 3:28 ` David Miller 2009-01-20 12:01 ` Ben Mansell 2 siblings, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-01-19 3:08 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote: > > Just for the record, I've now re-integrated those changes in a test kernel > that I booted on my 10gig machines. I have updated my user-space code in > haproxy to run a new series of tests. Eventhough there is a memcpy(), the > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) : > > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO > (3.2 Gbps at 100% CPU without splice) One thing to note is that Myricom's driver probably uses page frags which means that you're not actually triggering the copy. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
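The distinction Herbert is pointing at is visible in the patch earlier in the thread: only the linear skb->data region is spliced with the copy flag set, while frag pages keep the zero-copy path, so a driver that receives mostly into page frags (as myri10ge does) rarely hits the new memcpy(). Condensed from __skb_splice_bits() in that patch (error handling omitted):

    /* linear skb->data: copied into a fresh page by linear_to_page() */
    __splice_segment(virt_to_page(skb->data),
                     (unsigned long) skb->data & (PAGE_SIZE - 1),
                     skb_headlen(skb), offset, len, skb, spd, 1);

    /* page frags: only a page reference is taken, no copy */
    for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) {
            const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

            __splice_segment(f->page, f->page_offset, f->size,
                             offset, len, skb, spd, 0);
    }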
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 3:08 ` Herbert Xu @ 2009-01-19 3:27 ` David Miller 2009-01-19 6:14 ` Willy Tarreau 0 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-19 3:27 UTC (permalink / raw) To: herbert Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Herbert Xu <herbert@gondor.apana.org.au> Date: Mon, 19 Jan 2009 14:08:44 +1100 > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote: > > > > Just for the record, I've now re-integrated those changes in a test kernel > > that I booted on my 10gig machines. I have updated my user-space code in > > haproxy to run a new series of tests. Eventhough there is a memcpy(), the > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) : > > > > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO > > (3.2 Gbps at 100% CPU without splice) > > One thing to note is that Myricom's driver probably uses page > frags which means that you're not actually triggering the copy. Right. And this is also the only reason why jumbo MTU worked :-) ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 3:27 ` David Miller @ 2009-01-19 6:14 ` Willy Tarreau 2009-01-19 6:19 ` David Miller 2009-01-19 8:40 ` Jarek Poplawski 0 siblings, 2 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-19 6:14 UTC (permalink / raw) To: David Miller Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Mon, 19 Jan 2009 14:08:44 +1100 > > > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote: > > > > > > Just for the record, I've now re-integrated those changes in a test kernel > > > that I booted on my 10gig machines. I have updated my user-space code in > > > haproxy to run a new series of tests. Eventhough there is a memcpy(), the > > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) : > > > > > > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO > > > (3.2 Gbps at 100% CPU without splice) > > > > One thing to note is that Myricom's driver probably uses page > > frags which means that you're not actually triggering the copy. So does this mean that the corruption problem should still be there for such a driver ? I'm asking before testing, because at these speeds, validity tests are not that easy ;-) > Right. > > And this is also the only reason why jumbo MTU worked :-) What should we expect from other drivers with jumbo frames ? Hangs, corruption, errors, packet loss ? Thanks, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 6:14 ` Willy Tarreau @ 2009-01-19 6:19 ` David Miller 2009-01-19 6:45 ` Willy Tarreau 2009-01-19 10:19 ` Herbert Xu 2009-01-19 8:40 ` Jarek Poplawski 1 sibling, 2 replies; 190+ messages in thread From: David Miller @ 2009-01-19 6:19 UTC (permalink / raw) To: w Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Mon, 19 Jan 2009 07:14:20 +0100 > On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote: > > From: Herbert Xu <herbert@gondor.apana.org.au> > > Date: Mon, 19 Jan 2009 14:08:44 +1100 > > > > > One thing to note is that Myricom's driver probably uses page > > > frags which means that you're not actually triggering the copy. > > So does this mean that the corruption problem should still there for > such a driver ? I'm asking before testing, because at these speeds, > validity tests are not that easy ;-) It ought not to, but it seems that is the case where you saw the original corruptions, so hmmm... Actually, I see, the myri10ge driver does put up to 64 bytes of the initial packet into the linear area. If the IPV4 + TCP headers are less than this, you will hit the corruption case even with the myri10ge driver. So everything checks out. > > And this is also the only reason why jumbo MTU worked :-) > > What should we expect from other drivers with jumbo frames ? Hangs, > corruption, errors, packet loss ? Upon recent review I think jumbo frames in such drivers should actually be fine. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 6:19 ` David Miller @ 2009-01-19 6:45 ` Willy Tarreau 2009-01-19 10:19 ` Herbert Xu 1 sibling, 0 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-19 6:45 UTC (permalink / raw) To: David Miller Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Mon, 19 Jan 2009 07:14:20 +0100 > > > On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote: > > > From: Herbert Xu <herbert@gondor.apana.org.au> > > > Date: Mon, 19 Jan 2009 14:08:44 +1100 > > > > > > > One thing to note is that Myricom's driver probably uses page > > > > frags which means that you're not actually triggering the copy. > > > > So does this mean that the corruption problem should still there for > > such a driver ? I'm asking before testing, because at these speeds, > > validity tests are not that easy ;-) > > It ought not to, but it seems that is the case where you > saw the original corruptions, so hmmm... > > Actually, I see, the myri10ge driver does put up to > 64 bytes of the initial packet into the linear area. > If the IPV4 + TCP headers are less than this, you will > hit the corruption case even with the myri10ge driver. > > So everything checks out. OK, so I will modify my tools in order to perform a few checks. Thanks! Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 6:19 ` David Miller 2009-01-19 6:45 ` Willy Tarreau @ 2009-01-19 10:19 ` Herbert Xu 2009-01-19 20:59 ` David Miller 1 sibling, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-01-19 10:19 UTC (permalink / raw) To: David Miller Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote: > > Actually, I see, the myri10ge driver does put up to > 64 bytes of the initial packet into the linear area. > If the IPV4 + TCP headers are less than this, you will > hit the corruption case even with the myri10ge driver. I thought splice only mapped the payload areas, no? So we should probably test with a non-paged driver to be totally sure. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 10:19 ` Herbert Xu @ 2009-01-19 20:59 ` David Miller 2009-01-19 21:24 ` Herbert Xu 2009-01-25 21:03 ` Willy Tarreau 0 siblings, 2 replies; 190+ messages in thread From: David Miller @ 2009-01-19 20:59 UTC (permalink / raw) To: herbert Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Herbert Xu <herbert@gondor.apana.org.au> Date: Mon, 19 Jan 2009 21:19:24 +1100 > On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote: > > > > Actually, I see, the myri10ge driver does put up to > > 64 bytes of the initial packet into the linear area. > > If the IPV4 + TCP headers are less than this, you will > > hit the corruption case even with the myri10ge driver. > > I thought splice only mapped the payload areas, no? And the difference between 64 and IPV4+TCP header len becomes the payload, don't you see? :-) myri10ge just pulls min(64, skb->len) bytes from the SKB frags into the linear area, unconditionally. So a small number of payload bytes can in fact end up there. Otherwise Willy could never have triggered this bug. ^ permalink raw reply [flat|nested] 190+ messages in thread
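A concrete instance of the arithmetic above, assuming minimal 20-byte IPv4 and 20-byte TCP headers and that the 64 pulled bytes are counted from the start of the IP header (both are assumptions; real headers can be longer with options):

    unsigned int pulled  = 64;          /* bytes myri10ge pulls into the linear area */
    unsigned int hdr_len = 20 + 20;     /* minimal IPv4 + TCP headers, no options    */
    unsigned int payload_in_linear = pulled - hdr_len;   /* 24 bytes */

Those 24 payload bytes land in the kmalloc()ed skb->data area, which is exactly the region the corruption fix is about.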
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 20:59 ` David Miller @ 2009-01-19 21:24 ` Herbert Xu 2009-01-25 21:03 ` Willy Tarreau 1 sibling, 0 replies; 190+ messages in thread From: Herbert Xu @ 2009-01-19 21:24 UTC (permalink / raw) To: David Miller Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Mon, Jan 19, 2009 at 12:59:41PM -0800, David Miller wrote: . > And the difference between 64 and IPV4+TCP header len becomes the > payload, don't you see? :-) > > myri10ge just pulls min(64, skb->len) bytes from the SKB frags into > the linear area, unconditionally. So a small number of payload bytes > can in fact end up there. Ah, I thought they were doing a precise division. Yes pulling a constant length will trigger it if it's big enough. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 20:59 ` David Miller 2009-01-19 21:24 ` Herbert Xu @ 2009-01-25 21:03 ` Willy Tarreau 2009-01-26 7:59 ` Jarek Poplawski 1 sibling, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-25 21:03 UTC (permalink / raw) To: David Miller Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe Hi David, On Mon, Jan 19, 2009 at 12:59:41PM -0800, David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Mon, 19 Jan 2009 21:19:24 +1100 > > > On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote: > > > > > > Actually, I see, the myri10ge driver does put up to > > > 64 bytes of the initial packet into the linear area. > > > If the IPV4 + TCP headers are less than this, you will > > > hit the corruption case even with the myri10ge driver. > > > > I thought splice only mapped the payload areas, no? > > And the difference between 64 and IPV4+TCP header len becomes the > payload, don't you see? :-) > > myri10ge just pulls min(64, skb->len) bytes from the SKB frags into > the linear area, unconditionally. So a small number of payload bytes > can in fact end up there. > > Otherwise Willy could never have triggered this bug. Just FWIW, I've updated my tools in order to perform content checks more easily. I cannot reproduce the issue at all with the myri10ge NICs, neither with large frames nor with tiny ones (8 bytes). However, I have noticed that the load is now sensitive to the number of concurrent sessions. I'm using 2.6.29-rc2 with the perfcounters patches, and I'm not sure whether the difference in behaviour came with the data corruption fixes or with the new kernel (which has some profiling options turned on). Basically, below 800-1000 concurrent sessions, I have no problem reaching 10 Gbps with LRO and MTU=1500, with about 60% CPU. Above this number of sessions, the CPU suddenly jumps to 100% and the data rate drops to about 6.7 Gbps. I spent a long time trying to figure out what it was, but I think I have found it. Kerneltop reports different figures above and below the limit.
1) below the limit :

 1429.00 - 00000000784a7840 : tcp_sendpage
  561.00 - 00000000784a6580 : tcp_read_sock
  485.00 - 00000000f87e13c0 : myri10ge_xmit [myri10ge]
  433.00 - 00000000781a40c0 : sys_splice
  411.00 - 00000000784a6eb0 : tcp_poll
  344.00 - 000000007847bcf0 : dev_queue_xmit
  342.00 - 0000000078470be0 : __skb_splice_bits
  319.00 - 0000000078472950 : __alloc_skb
  310.00 - 0000000078185870 : kmem_cache_alloc
  285.00 - 00000000784b2260 : tcp_transmit_skb
  285.00 - 000000007850cac0 : _spin_lock
  250.00 - 00000000781afda0 : sys_epoll_ctl
  238.00 - 000000007810334c : system_call
  232.00 - 000000007850ac20 : schedule
  230.00 - 000000007850cc10 : _spin_lock_bh
  222.00 - 00000000784705f0 : __skb_clone
  220.00 - 000000007850cbc0 : _spin_lock_irqsave
  213.00 - 00000000784a08f0 : ip_queue_xmit
  211.00 - 0000000078185ea0 : __kmalloc_track_caller

2) above the limit :

 1778.00 - 00000000784a7840 : tcp_sendpage
 1281.00 - 0000000078472950 : __alloc_skb
  639.00 - 00000000784a6780 : sk_stream_alloc_skb
  507.00 - 0000000078185ea0 : __kmalloc_track_caller
  484.00 - 0000000078185870 : kmem_cache_alloc
  476.00 - 00000000784a6580 : tcp_read_sock
  451.00 - 00000000784a08f0 : ip_queue_xmit
  421.00 - 00000000f87e13c0 : myri10ge_xmit [myri10ge]
  374.00 - 00000000781852e0 : __slab_alloc
  361.00 - 00000000781a40c0 : sys_splice
  273.00 - 0000000078470be0 : __skb_splice_bits
  231.00 - 000000007850cac0 : _spin_lock
  206.00 - 0000000078168b30 : get_pageblock_flags_group
  165.00 - 00000000784a0260 : ip_finish_output
  165.00 - 00000000784b2260 : tcp_transmit_skb
  161.00 - 0000000078470460 : __copy_skb_header
  153.00 - 000000007816d6d0 : put_page
  144.00 - 000000007850cbc0 : _spin_lock_irqsave
  137.00 - 0000000078189be0 : fget_light

The memory allocation clearly is the culprit here. I'll try Jarek's patch which reduces memory allocation to see if that changes something, as I'm sure we can do fairly better, given how it behaves with limited sessions.

Regards,
Willy

PS: this thread is long, if some of the people in CC want to get off the thread, please complain.

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-25 21:03 ` Willy Tarreau @ 2009-01-26 7:59 ` Jarek Poplawski 2009-01-26 8:12 ` Willy Tarreau 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-26 7:59 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Sun, Jan 25, 2009 at 10:03:25PM +0100, Willy Tarreau wrote: ... > The memory allocation clearly is the culprit here. I'll try Jarek's > patch which reduces memory allocation to see if that changes something, > as I'm sure we can do fairly better, given how it behaves with limited > sessions. I think you are right, but I wonder whether it might be better to hold off on further profiling until this splicing code is really reworked. Regards, Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-26 7:59 ` Jarek Poplawski @ 2009-01-26 8:12 ` Willy Tarreau 0 siblings, 0 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-26 8:12 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Mon, Jan 26, 2009 at 07:59:33AM +0000, Jarek Poplawski wrote: > On Sun, Jan 25, 2009 at 10:03:25PM +0100, Willy Tarreau wrote: > ... > > The memory allocation clearly is the culprit here. I'll try Jarek's > > patch which reduces memory allocation to see if that changes something, > > as I'm sure we can do fairly better, given how it behaves with limited > > sessions. > > I think you are right, but I wonder if it's not better to wait with > more profiling until this splicing is really redone. Agreed. In fact I have run a few tests even with your patch and I could see no obvious figure starting to appear. I'll go back to 2.6.27-stable + the fixes because I don't really know what I'm testing under 2.6.29-rc2. Once I'm able to get reproducible reference numbers, I'll test again with the latest kernel. Regards, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 6:14 ` Willy Tarreau 2009-01-19 6:19 ` David Miller @ 2009-01-19 8:40 ` Jarek Poplawski 1 sibling, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-01-19 8:40 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Mon, Jan 19, 2009 at 07:14:20AM +0100, Willy Tarreau wrote: > On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote: > > From: Herbert Xu <herbert@gondor.apana.org.au> > > Date: Mon, 19 Jan 2009 14:08:44 +1100 > > > > > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote: > > > > > > > > Just for the record, I've now re-integrated those changes in a test kernel > > > > that I booted on my 10gig machines. I have updated my user-space code in > > > > haproxy to run a new series of tests. Eventhough there is a memcpy(), the > > > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) : > > > > > > > > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO > > > > (3.2 Gbps at 100% CPU without splice) > > > > > > One thing to note is that Myricom's driver probably uses page > > > frags which means that you're not actually triggering the copy. > > So does this mean that the corruption problem should still there for > such a driver ? I'm asking before testing, because at these speeds, > validity tests are not that easy ;-) I guess, David meant the performance: there is not much change because only a small part could be copied. The most harmed should be jumbo frames in linear only skbs. But no corruption is expected. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 0:42 ` Willy Tarreau 2009-01-19 3:08 ` Herbert Xu @ 2009-01-19 3:28 ` David Miller 2009-01-19 6:11 ` Willy Tarreau 2009-01-24 21:23 ` Willy Tarreau 2009-01-20 12:01 ` Ben Mansell 2 siblings, 2 replies; 190+ messages in thread From: David Miller @ 2009-01-19 3:28 UTC (permalink / raw) To: w Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Mon, 19 Jan 2009 01:42:06 +0100 > Hi guys, > > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote: > > From: Willy Tarreau <w@1wt.eu> > > Date: Fri, 16 Jan 2009 00:44:08 +0100 > > > > > And BTW feel free to add my Tested-by if you want in case you merge > > > this fix. > > > > Done, thanks Willy. > > Just for the record, I've now re-integrated those changes in a test kernel > that I booted on my 10gig machines. I have updated my user-space code in > haproxy to run a new series of tests. Eventhough there is a memcpy(), the > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) : > > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO > (3.2 Gbps at 100% CPU without splice) > > - 9.2 Gbps at 50% CPU using MTU=1500 with LRO > > - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without > splice) > > - 10 Gbps at 15% CPU using MTU=9000 with LRO Thanks for the numbers. We can almost certainly do a lot better, so if you have the time and can get some oprofile dumps for these various cases that would be useful to us. Thanks again. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 3:28 ` David Miller @ 2009-01-19 6:11 ` Willy Tarreau 2009-01-24 21:23 ` Willy Tarreau 1 sibling, 0 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-19 6:11 UTC (permalink / raw) To: David Miller Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Sun, Jan 18, 2009 at 07:28:15PM -0800, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Mon, 19 Jan 2009 01:42:06 +0100 > > > Hi guys, > > > > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote: > > > From: Willy Tarreau <w@1wt.eu> > > > Date: Fri, 16 Jan 2009 00:44:08 +0100 > > > > > > > And BTW feel free to add my Tested-by if you want in case you merge > > > > this fix. > > > > > > Done, thanks Willy. > > > > Just for the record, I've now re-integrated those changes in a test kernel > > that I booted on my 10gig machines. I have updated my user-space code in > > haproxy to run a new series of tests. Eventhough there is a memcpy(), the > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) : > > > > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO > > (3.2 Gbps at 100% CPU without splice) > > > > - 9.2 Gbps at 50% CPU using MTU=1500 with LRO > > > > - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without > > splice) > > > > - 10 Gbps at 15% CPU using MTU=9000 with LRO > > Thanks for the numbers. > > We can almost certainly do a lot better, so if you have the time and > can get some oprofile dumps for these various cases that would be > useful to us. No problem, of course. It's just a matter of time, but if we can push the numbers further, let's try. Regards, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 3:28 ` David Miller 2009-01-19 6:11 ` Willy Tarreau @ 2009-01-24 21:23 ` Willy Tarreau 1 sibling, 0 replies; 190+ messages in thread From: Willy Tarreau @ 2009-01-24 21:23 UTC (permalink / raw) To: David Miller Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe Hi David, On Sun, Jan 18, 2009 at 07:28:15PM -0800, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Mon, 19 Jan 2009 01:42:06 +0100 > > > Hi guys, > > > > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote: > > > From: Willy Tarreau <w@1wt.eu> > > > Date: Fri, 16 Jan 2009 00:44:08 +0100 > > > > > > > And BTW feel free to add my Tested-by if you want in case you merge > > > > this fix. > > > > > > Done, thanks Willy. > > > > Just for the record, I've now re-integrated those changes in a test kernel > > that I booted on my 10gig machines. I have updated my user-space code in > > haproxy to run a new series of tests. Eventhough there is a memcpy(), the > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) : > > > > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO > > (3.2 Gbps at 100% CPU without splice) > > > > - 9.2 Gbps at 50% CPU using MTU=1500 with LRO > > > > - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without > > splice) > > > > - 10 Gbps at 15% CPU using MTU=9000 with LRO > > Thanks for the numbers. > > We can almost certainly do a lot better, so if you have the time and > can get some oprofile dumps for these various cases that would be > useful to us. Well, I tried to get oprofile to work on those machines, but I'm surely missing something, as "opreport" always complains : opreport error: No sample file found: try running opcontrol --dump or specify a session containing sample files I've straced opcontrol --dump, opcontrol --stop, and I see nothing creating any file in the samples directory. I thought it would be opjitconv which would do it, but it's hard to debug, and I haven't used oprofile for a 2/3 years now. I followed exactly the instructions in the kernel doc, as well as a howto found on the net, but I remain out of luck. I just see a "complete_dump" file created with only two bytes. It's not easy to debug on those machines are they're diskless and PXE-booted from squashfs root images. However, upon Ingo's suggestion I tried his perfcounters while running a test at 5 Gbps with haproxy running alone on one core, and IRQs on the other one. No LRO was used and MTU was 1500. 
For instance, kerneltop tells how many CPU cycles are spent in each function :

  # kerneltop -e 0 -d 1 -c 1000000 -C 1

   events    RIP                kernel function
   ______    ________________   _______________

  1864.00 - 00000000f87be000 : init_module   [myri10ge]
  1078.00 - 00000000784a6580 : tcp_read_sock
   901.00 - 00000000784a7840 : tcp_sendpage
   857.00 - 0000000078470be0 : __skb_splice_bits
   617.00 - 00000000784b2260 : tcp_transmit_skb
   604.00 - 000000007849fdf0 : __ip_local_out
   581.00 - 0000000078470460 : __copy_skb_header
   569.00 - 000000007850cac0 : _spin_lock   [myri10ge]
   472.00 - 0000000078185150 : __slab_free
   443.00 - 000000007850cc10 : _spin_lock_bh   [sky2]
   434.00 - 00000000781852e0 : __slab_alloc
   408.00 - 0000000078488620 : __qdisc_run
   355.00 - 0000000078185b20 : kmem_cache_free
   352.00 - 0000000078472950 : __alloc_skb
   348.00 - 00000000784705f0 : __skb_clone
   337.00 - 0000000078185870 : kmem_cache_alloc   [myri10ge]
   302.00 - 0000000078472150 : skb_release_data
   297.00 - 000000007847bcf0 : dev_queue_xmit
   285.00 - 00000000784a08f0 : ip_queue_xmit

You should ignore the lines for init_module, _spin_lock, etc. (in fact, all lines indicating a module), as something is wrong there: they always report the name of the last module loaded, and the name changes when the module is unloaded.

I also tried dumping the number of cache misses per function :

  ------------------------------------------------------------------------------
   KernelTop: 1146 irqs/sec [NMI, 1000 cache-misses], (all, cpu: 1)
  ------------------------------------------------------------------------------

   events    RIP                kernel function
   ______    ________________   _______________

  7512.00 - 00000000784a6580 : tcp_read_sock
  2716.00 - 0000000078470be0 : __skb_splice_bits
  2516.00 - 00000000784a08f0 : ip_queue_xmit
   986.00 - 00000000784a7840 : tcp_sendpage
   587.00 - 00000000781a40c0 : sys_splice
   462.00 - 00000000781856b0 : kfree   [myri10ge]
   451.00 - 000000007849fdf0 : __ip_local_out
   242.00 - 0000000078185b20 : kmem_cache_free
   205.00 - 00000000784b1b90 : __tcp_select_window
   153.00 - 000000007850cac0 : _spin_lock   [myri10ge]
   142.00 - 000000007849f6a0 : ip_fragment
   119.00 - 0000000078188950 : rw_verify_area
   117.00 - 00000000784a99e0 : tcp_rcv_space_adjust
   107.00 - 000000007850cc10 : _spin_lock_bh   [sky2]
   104.00 - 00000000781852e0 : __slab_alloc

There are other options to combine events, but I don't understand the output when I enable them. I think that when properly used, these tools can report useful information.

Regards,
Willy

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 0:42 ` Willy Tarreau 2009-01-19 3:08 ` Herbert Xu 2009-01-19 3:28 ` David Miller @ 2009-01-20 12:01 ` Ben Mansell 2009-01-20 12:11 ` Evgeniy Polyakov 2 siblings, 1 reply; 190+ messages in thread From: Ben Mansell @ 2009-01-20 12:01 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, herbert, jarkao2, zbr, dada1, mingo, linux-kernel, netdev, jens.axboe Willy Tarreau wrote: > Hi guys, > > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote: >> From: Willy Tarreau <w@1wt.eu> >> Date: Fri, 16 Jan 2009 00:44:08 +0100 >> >>> And BTW feel free to add my Tested-by if you want in case you merge >>> this fix. >> Done, thanks Willy. > > Just for the record, I've now re-integrated those changes in a test kernel > that I booted on my 10gig machines. I have updated my user-space code in > haproxy to run a new series of tests. Eventhough there is a memcpy(), the > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) : > > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO > (3.2 Gbps at 100% CPU without splice) > > - 9.2 Gbps at 50% CPU using MTU=1500 with LRO > > - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without > splice) > > - 10 Gbps at 15% CPU using MTU=9000 with LRO > > These last ones are really impressive. While I had already observed such > performance on the Myri10GE with Tux, it's the first time I can reach that > level with so little CPU usage in haproxy ! > > So I think that the memcpy() workaround might be a non-issue for some time. > I agree it's not beautiful but it works pretty well for now. > > The 3 patches I used on top of 2.6.27.10 were the fix to return 0 intead of > -EAGAIN on end of read, the one to process multiple skbs at once, and Dave's > last patch based on Jarek's workaround for the corruption issue. I've also tested on the same three patches (against 2.6.27.2 here), and the patches appear to work just fine. I'm running a similar proxy benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3, 2xforcedeth). splice is working OK now, although I get identical results when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user, 98% system). I may be hitting a h/w limitation which prevents any higher throughput, but I'm a little surprised that splice() didn't use less CPU time. Anyway, the splice code is working which is the important part! Ben ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-20 12:01 ` Ben Mansell @ 2009-01-20 12:11 ` Evgeniy Polyakov 2009-01-20 13:43 ` Ben Mansell 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-20 12:11 UTC (permalink / raw) To: Ben Mansell Cc: Willy Tarreau, David Miller, herbert, jarkao2, dada1, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 20, 2009 at 12:01:13PM +0000, Ben Mansell (ben@zeus.com) wrote: > I've also tested on the same three patches (against 2.6.27.2 here), and > the patches appear to work just fine. I'm running a similar proxy > benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3, > 2xforcedeth). splice is working OK now, although I get identical results > when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user, > 98% system). With a small MTU, or when the driver does not support fragmented allocation (IIRC at least forcedeth does not), the skb will contain all the data in the linear part and thus will be copied in the kernel. read()/write() does effectively the same, but in userspace. This should only affect splice usage which involves socket->pipe data transfer. > I may be hitting a h/w limitation which prevents any higher throughput, > but I'm a little surprised that splice() didn't use less CPU time. > Anyway, the splice code is working which is the important part! Does splice without the patches (but with the performance improvement for non-blocking splice) have the same performance? It does not copy data, but you may hit the data corruption. If performance is the same, this may indeed be a HW limitation. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
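For context, the socket-to-socket forwarding being benchmarked in this thread drives splice() through a pipe, socket -> pipe -> socket. Below is a minimal user-space sketch of such a loop; it assumes src and dst are already-connected, non-blocking TCP sockets set up elsewhere, trims all real error handling, and is an illustration in the spirit of the proxies discussed here, not the actual haproxy code.

/* Minimal sketch of a socket-to-socket forwarder using splice(2).
 * "src" and "dst" are assumed to be connected, non-blocking sockets. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

#define CHUNK (256 * 1024)

static int forward(int src, int dst)
{
    int pfd[2];
    ssize_t in, out;

    if (pipe(pfd) < 0)
        return -1;

    for (;;) {
        /* socket -> pipe: with the fix discussed in this thread, one
         * call can move many segments instead of just one. */
        in = splice(src, NULL, pfd[1], NULL, CHUNK,
                    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
        if (in == 0)                    /* peer closed the connection */
            break;
        if (in < 0) {
            if (errno == EAGAIN)        /* no data yet: poll() here   */
                continue;
            break;
        }
        /* pipe -> socket */
        while (in > 0) {
            out = splice(pfd[0], NULL, dst, NULL, in,
                         SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out <= 0)               /* real code would wait for POLLOUT on EAGAIN */
                goto out;
            in -= out;
        }
    }
out:
    close(pfd[0]);
    close(pfd[1]);
    return 0;
}

The point of the non-blocking flag plus the loop fix is visible here: each splice() call on the read side can drain as much as CHUNK bytes from the socket, so the syscall rate drops sharply compared with one call per segment.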
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-20 12:11 ` Evgeniy Polyakov @ 2009-01-20 13:43 ` Ben Mansell 2009-01-20 14:06 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Ben Mansell @ 2009-01-20 13:43 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Willy Tarreau, David Miller, herbert, jarkao2, dada1, mingo, linux-kernel, netdev, jens.axboe Evgeniy Polyakov wrote: > On Tue, Jan 20, 2009 at 12:01:13PM +0000, Ben Mansell (ben@zeus.com) wrote: >> I've also tested on the same three patches (against 2.6.27.2 here), and >> the patches appear to work just fine. I'm running a similar proxy >> benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3, >> 2xforcedeth). splice is working OK now, although I get identical results >> when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user, >> 98% system). > > With small MTU or when driver does not support fragmented allocation > (iirc at least forcedeth does not) skb will contain all the data in the > linear part and thus will be copied in the kernel. read()/write() does > effectively the same, but in userspace. > This should only affect splice usage which involves socket->pipe data > transfer. I'll try with some larger MTUs and see if that helps - it should also give an improvement if I'm hitting a limit on the number of packets/second that the cards can process, regardless of splice... >> I may be hitting a h/w limitation which prevents any higher throughput, >> but I'm a little surprised that splice() didn't use less CPU time. >> Anyway, the splice code is working which is the important part! > > Does splice without patches (but with performance improvement for > non-blocking splice) has the same performance? It does not copy data, > but you may hit the data corruption? If performance is the same, this > maybe indeed HW limitation. With an unpatched kernel, the splice performance was worse (due to the one packet per-splice issues). With the small patch to fix that, I was getting around 2 Gbps performance, although oddly enough, I could only get 2 Gbps with read()/write() then as well... I'll try and do some tests on a machine that hopefully doesn't have the bottlenecks (and one that uses different NICs) Ben ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-20 13:43 ` Ben Mansell @ 2009-01-20 14:06 ` Jarek Poplawski 0 siblings, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-01-20 14:06 UTC (permalink / raw) To: Ben Mansell Cc: Evgeniy Polyakov, Willy Tarreau, David Miller, herbert, dada1, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 20, 2009 at 01:43:52PM +0000, Ben Mansell wrote: ... > With an unpatched kernel, the splice performance was worse (due to the > one packet per-splice issues). With the small patch to fix that, I was > getting around 2 Gbps performance, although oddly enough, I could only > get 2 Gbps with read()/write() then as well... > > I'll try and do some tests on a machine that hopefully doesn't have the > bottlenecks (and one that uses different NICs) I guess you should especially check if SG and checksums are on, and it could depend on a chip within those NICs as well. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
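The SG and checksum settings Jarek mentions can be checked with ethtool, or programmatically through the standard SIOCETHTOOL ioctl. Below is a minimal sketch of the latter; the interface name "eth0" is only a placeholder.

/* Query scatter-gather and checksum offload state for an interface,
 * roughly what "ethtool -k eth0" reports.  Minimal sketch. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int query(int fd, struct ifreq *ifr, __u32 cmd, const char *name)
{
    struct ethtool_value eval = { .cmd = cmd };

    ifr->ifr_data = (void *)&eval;
    if (ioctl(fd, SIOCETHTOOL, ifr) < 0) {
        perror(name);
        return -1;
    }
    printf("%s: %s\n", name, eval.data ? "on" : "off");
    return 0;
}

int main(void)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* placeholder NIC */

    query(fd, &ifr, ETHTOOL_GSG, "scatter-gather");
    query(fd, &ifr, ETHTOOL_GTXCSUM, "tx-checksumming");
    query(fd, &ifr, ETHTOOL_GRXCSUM, "rx-checksumming");

    close(fd);
    return 0;
}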
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:44 ` Willy Tarreau 2009-01-15 23:54 ` David Miller @ 2009-01-16 6:51 ` Jarek Poplawski 2009-01-19 6:08 ` David Miller 1 sibling, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-16 6:51 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 16, 2009 at 12:44:08AM +0100, Willy Tarreau wrote: > On Fri, Jan 16, 2009 at 12:42:55AM +0100, Willy Tarreau wrote: > > On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote: > > > From: Herbert Xu <herbert@gondor.apana.org.au> > > > Date: Fri, 16 Jan 2009 10:32:05 +1100 > > > > > > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote: > > > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len, > > > > > + unsigned int offset) > > > > > +{ > > > > > + struct page *p = alloc_pages(GFP_KERNEL, 0); > > > > > + > > > > > + if (!p) > > > > > + return NULL; > > > > > + memcpy(page_address(p) + offset, page_address(page) + offset, len); > > > > > > > > This won't work very well if skb->head is longer than a page. > > > > > > > > We'll need to divide it up into individual pages. > > > > > > Oh yes the same bug I pointed out the other day. > > > > > > But Willy can test this patch as-is, > > > > Hey, nice work Dave. +3% performance from your previous patch > > (31.6 MB/s). It's going fine and stable here. > > And BTW feel free to add my Tested-by if you want in case you merge > this fix. > > Willy > Herbert, good catch! David, if it's not too late I think more credits are needed, especially for Willy. He did "a bit" more than testing. Alas, I can't see this problem with skb->head longer than page. There is even some comment on this in __splice_segment(), but I can miss something. I'm more concerned with memory usage if these skbs are not acked for some reason. Isn't there some DOS issue possible? Thanks everybody, Jarek P. ---------> Based on a review by Changli Gao <xiaosuo@gmail.com>: http://lkml.org/lkml/2008/2/26/210 Foreseen-by: Changli Gao <xiaosuo@gmail.com> Diagnosed-by: Willy Tarreau <w@1wt.eu> Reported-by: Willy Tarreau <w@1wt.eu> Fixed-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-16 6:51 ` Jarek Poplawski @ 2009-01-19 6:08 ` David Miller 0 siblings, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-19 6:08 UTC (permalink / raw) To: jarkao2 Cc: w, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Fri, 16 Jan 2009 06:51:08 +0000 > David, if it's not too late I think more credits are needed, > especially for Willy. He did "a bit" more than testing. ... > Foreseen-by: Changli Gao <xiaosuo@gmail.com> > Diagnosed-by: Willy Tarreau <w@1wt.eu> > Reported-by: Willy Tarreau <w@1wt.eu> > Fixed-by: Jens Axboe <jens.axboe@oracle.com> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> I will be sure to add these before committing, thanks Jarek. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:34 ` David Miller 2009-01-15 23:42 ` Willy Tarreau @ 2009-01-19 6:16 ` David Miller 2009-01-19 10:20 ` Herbert Xu 1 sibling, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-19 6:16 UTC (permalink / raw) To: herbert Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: David Miller <davem@davemloft.net> Date: Thu, 15 Jan 2009 15:34:49 -0800 (PST) > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Fri, 16 Jan 2009 10:32:05 +1100 > > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote: > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len, > > > + unsigned int offset) > > > +{ > > > + struct page *p = alloc_pages(GFP_KERNEL, 0); > > > + > > > + if (!p) > > > + return NULL; > > > + memcpy(page_address(p) + offset, page_address(page) + offset, len); > > > > This won't work very well if skb->head is longer than a page. > > > > We'll need to divide it up into individual pages. > > Oh yes the same bug I pointed out the other day. > > But Willy can test this patch as-is, since he is not using > jumbo frames in linear SKBs. Actually, Herbert, it turns out this case should be OK. Look a level or two higher, at __splice_segment(), it even has a comment :-) -------------------- do { unsigned int flen = min(*len, plen); /* the linear region may spread across several pages */ flen = min_t(unsigned int, flen, PAGE_SIZE - poff); if (spd_fill_page(spd, page, flen, poff, skb, linear)) return 1; __segment_seek(&page, &poff, &plen, flen); *len -= flen; } while (*len && plen); -------------------- That code should work and do what we need. ^ permalink raw reply [flat|nested] 190+ messages in thread
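The arithmetic of the loop David quotes can be illustrated in isolation: each chunk handed to spd_fill_page() is clamped so it never crosses a page boundary, so a linear region longer than a page simply becomes several entries. The demo below is a standalone userspace sketch of just that splitting; the function name and PAGE_SIZE constant are local to the example, not kernel API.

/* Userspace illustration of how __splice_segment() walks a linear
 * region: each chunk is clamped to the end of the current page, so a
 * region longer than a page is emitted as several pieces. */
#include <stdio.h>

#define PAGE_SIZE 4096u

static void split_linear(unsigned int poff, unsigned int plen)
{
    while (plen) {
        unsigned int flen = plen;

        /* the linear region may spread across several pages */
        if (flen > PAGE_SIZE - poff)
            flen = PAGE_SIZE - poff;

        printf("chunk: page offset %u, length %u\n", poff, flen);

        poff = (poff + flen) % PAGE_SIZE;   /* seek into the next page */
        plen -= flen;
    }
}

int main(void)
{
    /* e.g. a 9000-byte jumbo-frame payload starting 200 bytes into a page */
    split_linear(200, 9000);
    return 0;
}

Running this prints three chunks (3896, 4096 and 1008 bytes), which is exactly the per-page division Herbert was asking about.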
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-19 6:16 ` David Miller @ 2009-01-19 10:20 ` Herbert Xu 0 siblings, 0 replies; 190+ messages in thread From: Herbert Xu @ 2009-01-19 10:20 UTC (permalink / raw) To: David Miller Cc: w, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Sun, Jan 18, 2009 at 10:16:03PM -0800, David Miller wrote: > > Actually, Herbert, it turns out this case should be OK. > > Look a level or two higher, at __splice_segment(), it even has a > comment :-) Aha, that should take care of it. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:26 ` David Miller 2009-01-15 23:32 ` Herbert Xu @ 2009-01-20 8:37 ` Jarek Poplawski 2009-01-20 9:33 ` [PATCH v2] " Jarek Poplawski 1 sibling, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-20 8:37 UTC (permalink / raw) To: David Miller Cc: herbert, w, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote: ... > net: Fix data corruption when splicing from sockets. > > From: Jarek Poplawski <jarkao2@gmail.com> > > The trick in socket splicing where we try to convert the skb->data > into a page based reference using virt_to_page() does not work so > well. ... > Signed-off-by: David S. Miller <davem@davemloft.net> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 65eac77..56272ac 100644 ... Here is a tiny upgrade to save some memory by reusing a page for more chunks if possible, which I think could be considered, after the testing of the main patch is finished. (There could be also added an additional freeing of this cached page before socket destruction, maybe in tcp_splice_read(), if somebody finds good place.) Thanks, Jarek P. --- include/net/sock.h | 4 ++++ net/core/skbuff.c | 32 ++++++++++++++++++++++++++------ net/core/sock.c | 2 ++ net/ipv4/tcp_ipv4.c | 8 ++++++++ 4 files changed, 40 insertions(+), 6 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 5a3a151..4ded741 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -190,6 +190,8 @@ struct sock_common { * @sk_user_data: RPC layer private data * @sk_sndmsg_page: cached page for sendmsg * @sk_sndmsg_off: cached offset for sendmsg + * @sk_splice_page: cached page for splice + * @sk_splice_off: cached offset for splice * @sk_send_head: front of stuff to transmit * @sk_security: used by security modules * @sk_mark: generic packet mark @@ -279,6 +281,8 @@ struct sock { struct page *sk_sndmsg_page; struct sk_buff *sk_send_head; __u32 sk_sndmsg_off; + struct page *sk_splice_page; + __u32 sk_splice_off; int sk_write_pending; #ifdef CONFIG_SECURITY void *sk_security; diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 56272ac..74adc79 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1334,13 +1334,33 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) } static inline struct page *linear_to_page(struct page *page, unsigned int len, - unsigned int offset) + unsigned int *offset, + struct sk_buff *skb) { - struct page *p = alloc_pages(GFP_KERNEL, 0); + struct sock *sk = skb->sk; + struct page *p = sk->sk_splice_page; + unsigned int off; - if (!p) - return NULL; - memcpy(page_address(p) + offset, page_address(page) + offset, len); + if (!p) { +new_page: + p = sk->sk_splice_page = alloc_pages(sk->sk_allocation, 0); + if (!p) + return NULL; + + off = sk->sk_splice_off = 0; + /* hold this page until it's full or unneeded */ + get_page(p); + } else { + off = sk->sk_splice_off; + if (off + len > PAGE_SIZE) { + put_page(p); + goto new_page; + } + } + + memcpy(page_address(p) + off, page_address(page) + *offset, len); + sk->sk_splice_off += len; + *offset = off; return p; } @@ -1356,7 +1376,7 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, return 1; if (linear) { - page = linear_to_page(page, len, offset); + page = linear_to_page(page, len, &offset, skb); if (!page) return 1; } else diff --git a/net/core/sock.c b/net/core/sock.c index f3a0d08..6b258a9 100644 --- a/net/core/sock.c +++ 
b/net/core/sock.c @@ -1732,6 +1732,8 @@ void sock_init_data(struct socket *sock, struct sock *sk) sk->sk_sndmsg_page = NULL; sk->sk_sndmsg_off = 0; + sk->sk_splice_page = NULL; + sk->sk_splice_off = 0; sk->sk_peercred.pid = 0; sk->sk_peercred.uid = -1; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 19d7b42..cf3d367 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1848,6 +1848,14 @@ void tcp_v4_destroy_sock(struct sock *sk) sk->sk_sndmsg_page = NULL; } + /* + * If splice cached page exists, toss it. + */ + if (sk->sk_splice_page) { + __free_page(sk->sk_splice_page); + sk->sk_splice_page = NULL; + } + percpu_counter_dec(&tcp_sockets_allocated); } ^ permalink raw reply related [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-20 8:37 ` Jarek Poplawski @ 2009-01-20 9:33 ` Jarek Poplawski 2009-01-20 10:00 ` Evgeniy Polyakov 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-20 9:33 UTC (permalink / raw) To: David Miller Cc: herbert, w, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 20, 2009 at 08:37:26AM +0000, Jarek Poplawski wrote: ... > Here is a tiny upgrade to save some memory by reusing a page for more > chunks if possible, which I think could be considered, after the > testing of the main patch is finished. (There could be also added an > additional freeing of this cached page before socket destruction, > maybe in tcp_splice_read(), if somebody finds good place.) OOPS! I did it again... Here is better refcounting. Jarek P. --- (take 2) include/net/sock.h | 4 ++++ net/core/skbuff.c | 32 ++++++++++++++++++++++++++------ net/core/sock.c | 2 ++ net/ipv4/tcp_ipv4.c | 8 ++++++++ 4 files changed, 40 insertions(+), 6 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 5a3a151..4ded741 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -190,6 +190,8 @@ struct sock_common { * @sk_user_data: RPC layer private data * @sk_sndmsg_page: cached page for sendmsg * @sk_sndmsg_off: cached offset for sendmsg + * @sk_splice_page: cached page for splice + * @sk_splice_off: cached offset for splice * @sk_send_head: front of stuff to transmit * @sk_security: used by security modules * @sk_mark: generic packet mark @@ -279,6 +281,8 @@ struct sock { struct page *sk_sndmsg_page; struct sk_buff *sk_send_head; __u32 sk_sndmsg_off; + struct page *sk_splice_page; + __u32 sk_splice_off; int sk_write_pending; #ifdef CONFIG_SECURITY void *sk_security; diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 56272ac..02a1a6c 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1334,13 +1334,33 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) } static inline struct page *linear_to_page(struct page *page, unsigned int len, - unsigned int offset) + unsigned int *offset, + struct sk_buff *skb) { - struct page *p = alloc_pages(GFP_KERNEL, 0); + struct sock *sk = skb->sk; + struct page *p = sk->sk_splice_page; + unsigned int off; - if (!p) - return NULL; - memcpy(page_address(p) + offset, page_address(page) + offset, len); + if (!p) { +new_page: + p = sk->sk_splice_page = alloc_pages(sk->sk_allocation, 0); + if (!p) + return NULL; + + off = sk->sk_splice_off = 0; + /* we hold one ref to this page until it's full or unneeded */ + } else { + off = sk->sk_splice_off; + if (off + len > PAGE_SIZE) { + put_page(p); + goto new_page; + } + } + + memcpy(page_address(p) + off, page_address(page) + *offset, len); + sk->sk_splice_off += len; + *offset = off; + get_page(p); return p; } @@ -1356,7 +1376,7 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, return 1; if (linear) { - page = linear_to_page(page, len, offset); + page = linear_to_page(page, len, &offset, skb); if (!page) return 1; } else diff --git a/net/core/sock.c b/net/core/sock.c index f3a0d08..6b258a9 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1732,6 +1732,8 @@ void sock_init_data(struct socket *sock, struct sock *sk) sk->sk_sndmsg_page = NULL; sk->sk_sndmsg_off = 0; + sk->sk_splice_page = NULL; + sk->sk_splice_off = 0; sk->sk_peercred.pid = 0; sk->sk_peercred.uid = -1; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 19d7b42..cf3d367 100644 --- 
a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1848,6 +1848,14 @@ void tcp_v4_destroy_sock(struct sock *sk) sk->sk_sndmsg_page = NULL; } + /* + * If splice cached page exists, toss it. + */ + if (sk->sk_splice_page) { + __free_page(sk->sk_splice_page); + sk->sk_splice_page = NULL; + } + percpu_counter_dec(&tcp_sockets_allocated); } ^ permalink raw reply related [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-20 9:33 ` [PATCH v2] " Jarek Poplawski @ 2009-01-20 10:00 ` Evgeniy Polyakov 2009-01-20 10:20 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-20 10:00 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe Hi Jarek. On Tue, Jan 20, 2009 at 09:33:52AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > Here is a tiny upgrade to save some memory by reusing a page for more > > chunks if possible, which I think could be considered, after the > > testing of the main patch is finished. (There could be also added an > > additional freeing of this cached page before socket destruction, > > maybe in tcp_splice_read(), if somebody finds good place.) > > OOPS! I did it again... Here is better refcounting. > > Jarek P. > > --- (take 2) > > include/net/sock.h | 4 ++++ > net/core/skbuff.c | 32 ++++++++++++++++++++++++++------ > net/core/sock.c | 2 ++ > net/ipv4/tcp_ipv4.c | 8 ++++++++ > 4 files changed, 40 insertions(+), 6 deletions(-) > > diff --git a/include/net/sock.h b/include/net/sock.h > index 5a3a151..4ded741 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -190,6 +190,8 @@ struct sock_common { > * @sk_user_data: RPC layer private data > * @sk_sndmsg_page: cached page for sendmsg > * @sk_sndmsg_off: cached offset for sendmsg > + * @sk_splice_page: cached page for splice > + * @sk_splice_off: cached offset for splice Ugh, increase every socket by 16 bytes... Does TCP one still fit the page? -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
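Whether the enlarged socket structure still fits in a page can be checked mechanically, for example with pahole on struct tcp_sock or with a BUILD_BUG_ON() in the kernel. The userspace sketch below shows the same idea with a stand-in struct and a compile-time assertion; the struct is invented purely for illustration.

/* Compile-time guard against a struct outgrowing a page, in the
 * spirit of the concern above.  The kernel equivalent would be
 * BUILD_BUG_ON(sizeof(struct tcp_sock) > PAGE_SIZE). */
#include <stdio.h>

#define PAGE_SIZE 4096

struct example_sock {
    char common[512];
    void *sk_sndmsg_page;
    unsigned int sk_sndmsg_off;
    /* adding fields here grows every socket */
};

_Static_assert(sizeof(struct example_sock) <= PAGE_SIZE,
               "socket struct no longer fits in a page");

int main(void)
{
    printf("sizeof(struct example_sock) = %zu\n", sizeof(struct example_sock));
    return 0;
}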
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-20 10:00 ` Evgeniy Polyakov @ 2009-01-20 10:20 ` Jarek Poplawski 2009-01-20 10:31 ` Evgeniy Polyakov 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-20 10:20 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 20, 2009 at 01:00:43PM +0300, Evgeniy Polyakov wrote: > Hi Jarek. Hi Evgeniy. ... > > diff --git a/include/net/sock.h b/include/net/sock.h > > index 5a3a151..4ded741 100644 > > --- a/include/net/sock.h > > +++ b/include/net/sock.h > > @@ -190,6 +190,8 @@ struct sock_common { > > * @sk_user_data: RPC layer private data > > * @sk_sndmsg_page: cached page for sendmsg > > * @sk_sndmsg_off: cached offset for sendmsg > > + * @sk_splice_page: cached page for splice > > + * @sk_splice_off: cached offset for splice > > Ugh, increase every socket by 16 bytes... Does TCP one still fit the > page? Good question! Alas I can't check this soon, but if it's really like this, of course this needs some better idea and rework. (BTW, I'd like to prevent here as much as possible some strange activities like 1 byte (payload) packets getting full pages without any accounting.) Thanks, Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-20 10:20 ` Jarek Poplawski @ 2009-01-20 10:31 ` Evgeniy Polyakov 2009-01-20 11:01 ` Jarek Poplawski 2009-01-26 8:20 ` [PATCH v2] " Jarek Poplawski 0 siblings, 2 replies; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-20 10:31 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > Good question! Alas I can't check this soon, but if it's really like > this, of course this needs some better idea and rework. (BTW, I'd like > to prevent here as much as possible some strange activities like 1 > byte (payload) packets getting full pages without any accounting.) I believe approach to meet all our goals is to have own network memory allocator, so that each skb could have its payload in the fragments, we would not suffer from the heavy fragmentation and power-of-two overhead for the larger MTUs, have a reserve for the OOM condition and generally do not depend on the main system behaviour. I will resurrect to some point my network allocator to check how things go in the modern environment, if no one will beat this idea first :) 1. Network (tree) allocator http://www.ioremap.net/projects/nta -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-20 10:31 ` Evgeniy Polyakov @ 2009-01-20 11:01 ` Jarek Poplawski 2009-01-20 17:16 ` David Miller 2009-01-26 8:20 ` [PATCH v2] " Jarek Poplawski 1 sibling, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-20 11:01 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote: > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > Good question! Alas I can't check this soon, but if it's really like > > this, of course this needs some better idea and rework. (BTW, I'd like > > to prevent here as much as possible some strange activities like 1 > > byte (payload) packets getting full pages without any accounting.) > > I believe approach to meet all our goals is to have own network memory > allocator, so that each skb could have its payload in the fragments, we > would not suffer from the heavy fragmentation and power-of-two overhead > for the larger MTUs, have a reserve for the OOM condition and generally > do not depend on the main system behaviour. 100% right! But I guess we need this current fix for -stable, and I'm a bit worried about safety. > > I will resurrect to some point my network allocator to check how things > go in the modern environment, if no one will beat this idea first :) I can't see too much beating of ideas around this problem now... I Wish you luck! > > 1. Network (tree) allocator > http://www.ioremap.net/projects/nta > Great, I'll try to learn a bit btw., Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-20 11:01 ` Jarek Poplawski @ 2009-01-20 17:16 ` David Miller 2009-01-21 9:54 ` Jarek Poplawski 2009-01-22 9:04 ` [PATCH v3] " Jarek Poplawski 0 siblings, 2 replies; 190+ messages in thread From: David Miller @ 2009-01-20 17:16 UTC (permalink / raw) To: jarkao2 Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Tue, 20 Jan 2009 11:01:44 +0000 > On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote: > > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > > Good question! Alas I can't check this soon, but if it's really like > > > this, of course this needs some better idea and rework. (BTW, I'd like > > > to prevent here as much as possible some strange activities like 1 > > > byte (payload) packets getting full pages without any accounting.) > > > > I believe approach to meet all our goals is to have own network memory > > allocator, so that each skb could have its payload in the fragments, we > > would not suffer from the heavy fragmentation and power-of-two overhead > > for the larger MTUs, have a reserve for the OOM condition and generally > > do not depend on the main system behaviour. > > 100% right! But I guess we need this current fix for -stable, and I'm > a bit worried about safety. Jarek, we already have a page and offset you can use. It's called sk_sndmsg_page but that is just the (current) name. Nothing prevents you from reusing it for your purposes here. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-20 17:16 ` David Miller @ 2009-01-21 9:54 ` Jarek Poplawski 2009-01-22 9:04 ` [PATCH v3] " Jarek Poplawski 1 sibling, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-01-21 9:54 UTC (permalink / raw) To: David Miller Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 20, 2009 at 09:16:16AM -0800, David Miller wrote: ... > Jarek, we already have a page and offset you can use. > > It's called sk_sndmsg_page but that is just the (current) name. > Nothing prevents you from reusing it for your purposes here. I'm trying to get some know-how about this field. Thanks, Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-20 17:16 ` David Miller 2009-01-21 9:54 ` Jarek Poplawski @ 2009-01-22 9:04 ` Jarek Poplawski 2009-01-26 5:22 ` David Miller 1 sibling, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-22 9:04 UTC (permalink / raw) To: David Miller Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 20, 2009 at 09:16:16AM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Tue, 20 Jan 2009 11:01:44 +0000 > > > On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote: > > > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > > > Good question! Alas I can't check this soon, but if it's really like > > > > this, of course this needs some better idea and rework. (BTW, I'd like > > > > to prevent here as much as possible some strange activities like 1 > > > > byte (payload) packets getting full pages without any accounting.) > > > > > > I believe approach to meet all our goals is to have own network memory > > > allocator, so that each skb could have its payload in the fragments, we > > > would not suffer from the heavy fragmentation and power-of-two overhead > > > for the larger MTUs, have a reserve for the OOM condition and generally > > > do not depend on the main system behaviour. > > > > 100% right! But I guess we need this current fix for -stable, and I'm > > a bit worried about safety. > > Jarek, we already have a page and offset you can use. > > It's called sk_sndmsg_page but that is just the (current) name. > Nothing prevents you from reusing it for your purposes here. It seems this sk_sndmsg_page usage (refcounting) isn't consistent. I used here tcp_sndmsg() way, but I think I'll go back to this question soon. Thanks, Jarek P. ------------> take 3 net: Optimize memory usage when splicing from sockets. The recent fix of data corruption when splicing from sockets uses memory very inefficiently allocating a new page to copy each chunk of linear part of skb. This patch uses the same page until it's full (almost) by caching the page in sk_sndmsg_page field. With changes from David S. Miller <davem@davemloft.net> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Tested-by: needed... 
--- net/core/skbuff.c | 45 +++++++++++++++++++++++++++++++++++---------- 1 files changed, 35 insertions(+), 10 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 2e5f2ca..2e64c1b 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1333,14 +1333,39 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) put_page(spd->pages[i]); } -static inline struct page *linear_to_page(struct page *page, unsigned int len, - unsigned int offset) -{ - struct page *p = alloc_pages(GFP_KERNEL, 0); +static inline struct page *linear_to_page(struct page *page, unsigned int *len, + unsigned int *offset, + struct sk_buff *skb) +{ + struct sock *sk = skb->sk; + struct page *p = sk->sk_sndmsg_page; + unsigned int off; + + if (!p) { +new_page: + p = sk->sk_sndmsg_page = alloc_pages(sk->sk_allocation, 0); + if (!p) + return NULL; - if (!p) - return NULL; - memcpy(page_address(p) + offset, page_address(page) + offset, len); + off = sk->sk_sndmsg_off = 0; + /* hold one ref to this page until it's full */ + } else { + unsigned int mlen; + + off = sk->sk_sndmsg_off; + mlen = PAGE_SIZE - off; + if (mlen < 64 && mlen < *len) { + put_page(p); + goto new_page; + } + + *len = min_t(unsigned int, *len, mlen); + } + + memcpy(page_address(p) + off, page_address(page) + *offset, *len); + sk->sk_sndmsg_off += *len; + *offset = off; + get_page(p); return p; } @@ -1349,21 +1374,21 @@ static inline struct page *linear_to_page(struct page *page, unsigned int len, * Fill page/offset/length into spd, if it can hold more pages. */ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, - unsigned int len, unsigned int offset, + unsigned int *len, unsigned int offset, struct sk_buff *skb, int linear) { if (unlikely(spd->nr_pages == PIPE_BUFFERS)) return 1; if (linear) { - page = linear_to_page(page, len, offset); + page = linear_to_page(page, len, &offset, skb); if (!page) return 1; } else get_page(page); spd->pages[spd->nr_pages] = page; - spd->partial[spd->nr_pages].len = len; + spd->partial[spd->nr_pages].len = *len; spd->partial[spd->nr_pages].offset = offset; spd->nr_pages++; @@ -1405,7 +1430,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, /* the linear region may spread across several pages */ flen = min_t(unsigned int, flen, PAGE_SIZE - poff); - if (spd_fill_page(spd, page, flen, poff, skb, linear)) + if (spd_fill_page(spd, page, &flen, poff, skb, linear)) return 1; __segment_seek(&page, &poff, &plen, flen); ^ permalink raw reply related [flat|nested] 190+ messages in thread
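Separately from the kernel plumbing, the reuse policy in linear_to_page() above can be summarised as: keep packing chunks into the cached page while the leftover space is at least 64 bytes or large enough for the whole chunk, otherwise drop the cached page and start a fresh one, clamping a chunk that does not fully fit. The following is a standalone userspace model of just that policy; allocation and refcounting are simulated, and none of the names are kernel APIs.

/* Standalone model of the page-caching policy used by
 * linear_to_page() in the patch above. */
#include <stdio.h>

#define PAGE_SIZE 4096u

static unsigned int cached_off = PAGE_SIZE;  /* "no cached page" initially */
static unsigned int pages_used;

static void add_chunk(unsigned int len)
{
    unsigned int room = PAGE_SIZE - cached_off;

    if (room < 64 && room < len) {
        /* drop the cached page, start a fresh one */
        cached_off = 0;
        pages_used++;
        room = PAGE_SIZE;
    }
    if (len > room)
        len = room;      /* the caller would resubmit the remainder */

    printf("chunk of %4u bytes at page #%u offset %4u\n",
           len, pages_used, cached_off);
    cached_off += len;
}

int main(void)
{
    /* many small linear chunks now share pages instead of each
     * consuming a freshly allocated page */
    add_chunk(1500); add_chunk(1500); add_chunk(1500);
    add_chunk(60);   add_chunk(1500);
    printf("pages consumed: %u\n", pages_used);
    return 0;
}

With per-chunk allocation, the five chunks above would have cost five pages; with the cached page they fit in two, which is the memory saving the patch aims for.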
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-22 9:04 ` [PATCH v3] " Jarek Poplawski @ 2009-01-26 5:22 ` David Miller 2009-01-27 7:11 ` Herbert Xu 0 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-26 5:22 UTC (permalink / raw) To: jarkao2 Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Thu, 22 Jan 2009 09:04:42 +0000 > It seems this sk_sndmsg_page usage (refcounting) isn't consistent. > I used here tcp_sndmsg() way, but I think I'll go back to this question > soon. Indeed, it is something to look into, as well as locking. I'll try to find some time for this, thanks Jarek. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-26 5:22 ` David Miller @ 2009-01-27 7:11 ` Herbert Xu 2009-01-27 7:54 ` Jarek Poplawski 2009-02-01 8:41 ` David Miller 0 siblings, 2 replies; 190+ messages in thread From: Herbert Xu @ 2009-01-27 7:11 UTC (permalink / raw) To: David Miller Cc: jarkao2, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe David Miller <davem@davemloft.net> wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Thu, 22 Jan 2009 09:04:42 +0000 > >> It seems this sk_sndmsg_page usage (refcounting) isn't consistent. >> I used here tcp_sndmsg() way, but I think I'll go back to this question >> soon. > > Indeed, it is something to look into, as well as locking. > > I'll try to find some time for this, thanks Jarek. After a quick look it seems to be OK to me. The code in the patch is called from tcp_splice_read, which holds the socket lock. So as long as the patch uses the usual TCP convention it should work. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-27 7:11 ` Herbert Xu @ 2009-01-27 7:54 ` Jarek Poplawski 2009-01-27 10:09 ` Herbert Xu 2009-02-01 8:41 ` David Miller 1 sibling, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-27 7:54 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 27, 2009 at 06:11:30PM +1100, Herbert Xu wrote: > David Miller <davem@davemloft.net> wrote: > > From: Jarek Poplawski <jarkao2@gmail.com> > > Date: Thu, 22 Jan 2009 09:04:42 +0000 > > > >> It seems this sk_sndmsg_page usage (refcounting) isn't consistent. > >> I used here tcp_sndmsg() way, but I think I'll go back to this question > >> soon. > > > > Indeed, it is something to look into, as well as locking. > > > > I'll try to find some time for this, thanks Jarek. > > After a quick look it seems to be OK to me. The code in the patch > is called from tcp_splice_read, which holds the socket lock. So as > long as the patch uses the usual TCP convention it should work. Yes, but ip_append_data() (and skb_append_datato_frags() for NETIF_F_UFO only, so currently not a problem), uses this differently, and these pages in sk->sk_sndmsg_page could leak or be used after kfree. (I didn't track locking in these other places). Thanks, Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-27 7:54 ` Jarek Poplawski @ 2009-01-27 10:09 ` Herbert Xu 2009-01-27 10:35 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-01-27 10:09 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote: > > Yes, but ip_append_data() (and skb_append_datato_frags() for > NETIF_F_UFO only, so currently not a problem), uses this differently, > and these pages in sk->sk_sndmsg_page could leak or be used after > kfree. (I didn't track locking in these other places). It'll be freed when the socket is freed so that should be fine. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-27 10:09 ` Herbert Xu @ 2009-01-27 10:35 ` Jarek Poplawski 2009-01-27 10:57 ` Jarek Poplawski 2009-01-27 11:48 ` Herbert Xu 0 siblings, 2 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-01-27 10:35 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 27, 2009 at 09:09:58PM +1100, Herbert Xu wrote: > On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote: > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for > > NETIF_F_UFO only, so currently not a problem), uses this differently, > > and these pages in sk->sk_sndmsg_page could leak or be used after > > kfree. (I didn't track locking in these other places). > > It'll be freed when the socket is freed so that should be fine. > I don't think so: these places can overwrite sk->sk_sndmsg_page left after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new pointer without put_page() (they only reference copied chunks and expect auto freeing). On the other hand, if tcp_sendmsg() reads after them it could use a pointer after the page is freed, I guess. Cheers, Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-27 10:35 ` Jarek Poplawski @ 2009-01-27 10:57 ` Jarek Poplawski 2009-01-27 11:48 ` Herbert Xu 1 sibling, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-01-27 10:57 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote: > On Tue, Jan 27, 2009 at 09:09:58PM +1100, Herbert Xu wrote: > > On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote: > > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for > > > NETIF_F_UFO only, so currently not a problem), uses this differently, > > > and these pages in sk->sk_sndmsg_page could leak or be used after > > > kfree. (I didn't track locking in these other places). > > > > It'll be freed when the socket is freed so that should be fine. > > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new > pointer without put_page() (they only reference copied chunks and > expect auto freeing). On the other hand, if tcp_sendmsg() reads after > them it could use a pointer after the page is freed, I guess. tcp_v4_destroy_sock() looks like vulnerable too. BTW, skb_append_datato_frags() currently doesn't need to use this sk->sk_sndmsg_page at all - it doesn't use caching between calls. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-27 10:35 ` Jarek Poplawski 2009-01-27 10:57 ` Jarek Poplawski @ 2009-01-27 11:48 ` Herbert Xu 2009-01-27 12:16 ` Jarek Poplawski 1 sibling, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-01-27 11:48 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote: > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for > > > NETIF_F_UFO only, so currently not a problem), uses this differently, > > > and these pages in sk->sk_sndmsg_page could leak or be used after > > > kfree. (I didn't track locking in these other places). > > > > It'll be freed when the socket is freed so that should be fine. > > I don't think so: these places can overwrite sk->sk_sndmsg_page left > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new > pointer without put_page() (they only reference copied chunks and > expect auto freeing). On the other hand, if tcp_sendmsg() reads after > them it could use a pointer after the page is freed, I guess. I wasn't referring to the first part of your sentence. That can't happen because they're only used for UDP sockets, this is a TCP socket. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-27 11:48 ` Herbert Xu @ 2009-01-27 12:16 ` Jarek Poplawski 2009-01-27 12:31 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-27 12:16 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote: > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote: > > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for > > > > NETIF_F_UFO only, so currently not a problem), uses this differently, > > > > and these pages in sk->sk_sndmsg_page could leak or be used after > > > > kfree. (I didn't track locking in these other places). > > > > > > It'll be freed when the socket is freed so that should be fine. > > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new > > pointer without put_page() (they only reference copied chunks and > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after > > them it could use a pointer after the page is freed, I guess. > > I wasn't referring to the first part of your sentence. That can't > happen because they're only used for UDP sockets, this is a TCP > socket. Do you mean this part from ip_append_data() isn't used for TCP?: 1007 1008 if (page && (left = PAGE_SIZE - off) > 0) { 1009 if (copy >= left) 1010 copy = left; 1011 if (page != frag->page) { 1012 if (i == MAX_SKB_FRAGS) { 1013 err = -EMSGSIZE; 1014 goto error; 1015 } 1016 get_page(page); 1017 skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0); 1018 frag = &skb_shinfo(skb)->frags[i]; 1019 } 1020 } else if (i < MAX_SKB_FRAGS) { 1021 if (copy > PAGE_SIZE) 1022 copy = PAGE_SIZE; 1023 page = alloc_pages(sk->sk_allocation, 0); 1024 if (page == NULL) { 1025 err = -ENOMEM; 1026 goto error; 1027 } 1028 sk->sk_sndmsg_page = page; 1029 sk->sk_sndmsg_off = 0; 1030 1031 skb_fill_page_desc(skb, i, page, 0, 0); 1032 frag = &skb_shinfo(skb)->frags[i]; 1033 } else { Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-27 12:16 ` Jarek Poplawski @ 2009-01-27 12:31 ` Jarek Poplawski 2009-01-27 17:06 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-27 12:31 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote: > On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote: > > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote: > > > > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for > > > > > NETIF_F_UFO only, so currently not a problem), uses this differently, > > > > > and these pages in sk->sk_sndmsg_page could leak or be used after > > > > > kfree. (I didn't track locking in these other places). > > > > > > > > It'll be freed when the socket is freed so that should be fine. > > > > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left > > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new > > > pointer without put_page() (they only reference copied chunks and > > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after > > > them it could use a pointer after the page is freed, I guess. > > > > I wasn't referring to the first part of your sentence. That can't > > happen because they're only used for UDP sockets, this is a TCP > > socket. > > Do you mean this part from ip_append_data() isn't used for TCP?: Actually, the beginning part of ip_append_data() should be enough too. So I guess I missed your point... Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-27 12:31 ` Jarek Poplawski @ 2009-01-27 17:06 ` David Miller 2009-01-28 8:10 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-27 17:06 UTC (permalink / raw) To: jarkao2 Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Tue, 27 Jan 2009 12:31:11 +0000 > On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote: > > On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote: > > > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote: > > > > > > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for > > > > > > NETIF_F_UFO only, so currently not a problem), uses this differently, > > > > > > and these pages in sk->sk_sndmsg_page could leak or be used after > > > > > > kfree. (I didn't track locking in these other places). > > > > > > > > > > It'll be freed when the socket is freed so that should be fine. > > > > > > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left > > > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new > > > > pointer without put_page() (they only reference copied chunks and > > > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after > > > > them it could use a pointer after the page is freed, I guess. > > > > > > I wasn't referring to the first part of your sentence. That can't > > > happen because they're only used for UDP sockets, this is a TCP > > > socket. > > > > Do you mean this part from ip_append_data() isn't used for TCP?: > > Actually, the beginning part of ip_append_data() should be enough too. > So I guess I missed your point... TCP doesn't use ip_append_data(), period. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-27 17:06 ` David Miller @ 2009-01-28 8:10 ` Jarek Poplawski 0 siblings, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-01-28 8:10 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Jan 27, 2009 at 09:06:51AM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Tue, 27 Jan 2009 12:31:11 +0000 > > > On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote: > > > On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote: > > > > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote: > > > > > > > > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for > > > > > > > NETIF_F_UFO only, so currently not a problem), uses this differently, > > > > > > > and these pages in sk->sk_sndmsg_page could leak or be used after > > > > > > > kfree. (I didn't track locking in these other places). > > > > > > > > > > > > It'll be freed when the socket is freed so that should be fine. > > > > > > > > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left > > > > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new > > > > > pointer without put_page() (they only reference copied chunks and > > > > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after > > > > > them it could use a pointer after the page is freed, I guess. > > > > > > > > I wasn't referring to the first part of your sentence. That can't > > > > happen because they're only used for UDP sockets, this is a TCP > > > > socket. > > > > > > Do you mean this part from ip_append_data() isn't used for TCP?: > > > > Actually, the beginning part of ip_append_data() should be enough too. > > So I guess I missed your point... > > TCP doesn't use ip_append_data(), period. Hmm... I see: TCP does use ip_send_reply(), so ip_append_data() too, but with a special socket. Thanks for the explanations, Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v3] tcp: splice as many packets as possible at once 2009-01-27 7:11 ` Herbert Xu 2009-01-27 7:54 ` Jarek Poplawski @ 2009-02-01 8:41 ` David Miller 1 sibling, 0 replies; 190+ messages in thread From: David Miller @ 2009-02-01 8:41 UTC (permalink / raw) To: herbert Cc: jarkao2, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Herbert Xu <herbert@gondor.apana.org.au> Date: Tue, 27 Jan 2009 18:11:30 +1100 > David Miller <davem@davemloft.net> wrote: > > From: Jarek Poplawski <jarkao2@gmail.com> > > Date: Thu, 22 Jan 2009 09:04:42 +0000 > > > >> It seems this sk_sndmsg_page usage (refcounting) isn't consistent. > >> I used here tcp_sndmsg() way, but I think I'll go back to this question > >> soon. > > > > Indeed, it is something to look into, as well as locking. > > > > I'll try to find some time for this, thanks Jarek. > > After a quick look it seems to be OK to me. The code in the patch > is called from tcp_splice_read, which holds the socket lock. So as > long as the patch uses the usual TCP convention it should work. I've tossed Jarek's patch into net-next-2.6, thanks everyone. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-20 10:31 ` Evgeniy Polyakov 2009-01-20 11:01 ` Jarek Poplawski @ 2009-01-26 8:20 ` Jarek Poplawski 2009-01-26 21:21 ` Evgeniy Polyakov 1 sibling, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-26 8:20 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On 20-01-2009 11:31, Evgeniy Polyakov wrote: ... > I believe approach to meet all our goals is to have own network memory > allocator, so that each skb could have its payload in the fragments, we > would not suffer from the heavy fragmentation and power-of-two overhead > for the larger MTUs, have a reserve for the OOM condition and generally > do not depend on the main system behaviour. > > I will resurrect to some point my network allocator to check how things > go in the modern environment, if no one will beat this idea first :) > > 1. Network (tree) allocator > http://www.ioremap.net/projects/nta I looked at this a bit, but alas I didn't find much for this Herbert's idea of payload in fragments/pages. Maybe some kind of API RFC is needed before this resurrection? Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-26 8:20 ` [PATCH v2] " Jarek Poplawski @ 2009-01-26 21:21 ` Evgeniy Polyakov 2009-01-27 6:10 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-01-26 21:21 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe Hi Jarek. On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > 1. Network (tree) allocator > > http://www.ioremap.net/projects/nta > > I looked at this a bit, but alas I didn't find much for this Herbert's > idea of payload in fragments/pages. Maybe some kind of API RFC is > needed before this resurrection? The basic idea is to steal some (probably a lot of) pages from the slab allocator and put network buffers there without the strict need for power-of-two alignment and the possible wraps when we add skb_shared_info at the end (the old e1000 driver, for instance, required order-4 allocations for jumbo frames). We can do that in alloc_skb() and friends, put the returned buffers into the skb's fraglist, update the reference counters for those pages, and add a copy of the network headers into skb->head. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-26 21:21 ` Evgeniy Polyakov @ 2009-01-27 6:10 ` David Miller 2009-01-27 7:40 ` Jarek Poplawski 2009-01-27 18:42 ` David Miller 0 siblings, 2 replies; 190+ messages in thread From: David Miller @ 2009-01-27 6:10 UTC (permalink / raw) To: zbr Cc: jarkao2, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Evgeniy Polyakov <zbr@ioremap.net> Date: Tue, 27 Jan 2009 00:21:30 +0300 > Hi Jarek. > > On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > > 1. Network (tree) allocator > > > http://www.ioremap.net/projects/nta > > > > I looked at this a bit, but alas I didn't find much for this Herbert's > > idea of payload in fragments/pages. Maybe some kind of API RFC is > > needed before this resurrection? > > Basic idea is to steal some (probably a lot) pages from the slab > allocator and put network buffers there without strict need for > power-of-two alignment and possible wraps when we add skb_shared_info at > the end, so that old e1000 driver required order-4 allocations for the > jumbo frames. We can do that in alloc_skb() and friends and put returned > buffers into skb's fraglist and updated reference counters for those > pages; and with additional copy of the network headers into skb->head. We are going back and forth saying the same thing, I think :-) (BTW, I think NTA is cool and we might do something like that eventually) The basic thing we have to do is make the drivers receive into pages, and then slide the network headers (only) into the linear SKB data area. Even for drivers like NIU and myri10ge that do this, they only use heuristics or some fixed minimum to decide how much to move to the linear area. Result? Some data payload bits end up there because it overshoots. Since we have pskb_may_pull() calls everywhere necessary, which means not in eth_type_trans(), we could just make these drivers (and future drivers converted to operate in this way) only put the ethernet headers there initially. Then the rest of the stack will take care of moving the network and transport payloads there, as necessary. I bet it won't even hurt latency or routing/firewall performance. I did test this with the NIU driver at one point, and it did not change TCP latency nor throughput at all even at 10g speeds. ^ permalink raw reply [flat|nested] 190+ messages in thread
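For readers following along: pskb_may_pull() is the helper that makes this work when the payload lives in page fragments, because it linearizes only the bytes a given layer asks for. Below is a simplified sketch of the kind of check the IPv4 input path performs; the real ip_rcv() validates much more, and toy_ip_header_check() is just an illustrative name.

#include <linux/errno.h>
#include <linux/ip.h>
#include <linux/skbuff.h>

/*
 * Simplified version of the check the IPv4 input path performs.
 * pskb_may_pull() linearizes only the requested number of bytes,
 * pulling them out of the page frags when the linear area is too
 * short (here: when the driver only put the Ethernet header there).
 */
static int toy_ip_header_check(struct sk_buff *skb)
{
        const struct iphdr *iph;

        if (!pskb_may_pull(skb, sizeof(struct iphdr)))
                return -EINVAL;

        iph = ip_hdr(skb);      /* network header offset was set by the core RX path */

        /* IP options, if present, must be made linear too before parsing */
        if (!pskb_may_pull(skb, iph->ihl * 4))
                return -EINVAL;

        return 0;
}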
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-27 6:10 ` David Miller @ 2009-01-27 7:40 ` Jarek Poplawski 2009-01-30 21:42 ` David Miller 2009-01-27 18:42 ` David Miller 1 sibling, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-01-27 7:40 UTC (permalink / raw) To: David Miller Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Mon, Jan 26, 2009 at 10:10:56PM -0800, David Miller wrote: > From: Evgeniy Polyakov <zbr@ioremap.net> > Date: Tue, 27 Jan 2009 00:21:30 +0300 > > > Hi Jarek. > > > > On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > > > 1. Network (tree) allocator > > > > http://www.ioremap.net/projects/nta > > > > > > I looked at this a bit, but alas I didn't find much for this Herbert's > > > idea of payload in fragments/pages. Maybe some kind of API RFC is > > > needed before this resurrection? > > > > Basic idea is to steal some (probably a lot) pages from the slab > > allocator and put network buffers there without strict need for > > power-of-two alignment and possible wraps when we add skb_shared_info at > > the end, so that old e1000 driver required order-4 allocations for the > > jumbo frames. We can do that in alloc_skb() and friends and put returned > > buffers into skb's fraglist and updated reference counters for those > > pages; and with additional copy of the network headers into skb->head. I think the main problem is to respect put_page() more, and maybe you mean to add this to your allocator too, but using slab pages for this looks a bit complex to me, but I can miss something. > We are going back and forth saying the same thing, I think :-) > (BTW, I think NTA is cool and we might do something like that > eventually) > > The basic thing we have to do is make the drivers receive into > pages, and then slide the network headers (only) into the linear > SKB data area. As a matter of fact, I wonder if these headers should be always separated. Their "chunk" could be refcounted as well, I guess. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-27 7:40 ` Jarek Poplawski @ 2009-01-30 21:42 ` David Miller 2009-01-30 21:59 ` Willy Tarreau ` (2 more replies) 0 siblings, 3 replies; 190+ messages in thread From: David Miller @ 2009-01-30 21:42 UTC (permalink / raw) To: jarkao2 Cc: zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Tue, 27 Jan 2009 07:40:48 +0000 > I think the main problem is to respect put_page() more, and maybe you > mean to add this to your allocator too, but using slab pages for this > looks a bit complex to me, but I can miss something. Hmmm, Jarek's comments here made me realize that we might be able to do some hack with cooperation with SLAB. Basically the idea is that if the page count of a SLAB page is greater than one, SLAB will not use that page for new allocations. It's cheesy and the SLAB developers will likely barf at the idea, but it would certainly work. Back to real life, I think long term the thing to do is to just do the cached page allocator thing we'll be doing after Jarek's socket page patch is integrated, and for best performance the driver has to receive it's data into pages, only explicitly pulling the ethernet header into the linear area, like NIU does. ^ permalink raw reply [flat|nested] 190+ messages in thread
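Purely to make the "cheesy" SLAB idea concrete, and not as a claim about how SLAB actually works internally: the test would amount to nothing more than looking at the backing page's refcount, since a splice pipe buffer holding spliced data keeps an extra reference on that page. The helper below is invented for illustration.

#include <linux/mm.h>
#include <linux/types.h>

/*
 * Illustration only, not real SLAB code: a slab page whose refcount
 * is above the allocator's own single reference is still pinned by
 * someone else (e.g. a splice pipe buffer), so new objects would not
 * be handed out from it.
 */
static bool slab_page_still_pinned(void *obj)
{
        return page_count(virt_to_page(obj)) > 1;
}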
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-30 21:42 ` David Miller @ 2009-01-30 21:59 ` Willy Tarreau 2009-01-30 22:03 ` David Miller 2009-01-30 22:16 ` Herbert Xu 2009-02-03 11:38 ` Nick Piggin 2 siblings, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-30 21:59 UTC (permalink / raw) To: David Miller Cc: jarkao2, zbr, herbert, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Tue, 27 Jan 2009 07:40:48 +0000 > > > I think the main problem is to respect put_page() more, and maybe you > > mean to add this to your allocator too, but using slab pages for this > > looks a bit complex to me, but I can miss something. > > Hmmm, Jarek's comments here made me realize that we might be > able to do some hack with cooperation with SLAB. > > Basically the idea is that if the page count of a SLAB page > is greater than one, SLAB will not use that page for new > allocations. I thought it was the standard behaviour. That may explain why I did not understand much of previous discussion then :-/ > It's cheesy and the SLAB developers will likely barf at the > idea, but it would certainly work. Maybe that would be enough as a definitive fix for a stable release, so that we can go on with deeper changes in newer versions ? > Back to real life, I think long term the thing to do is to just do the > cached page allocator thing we'll be doing after Jarek's socket page > patch is integrated, and for best performance the driver has to > receive it's data into pages, only explicitly pulling the ethernet > header into the linear area, like NIU does. Are there NICs out there able to do that themselves or does the driver need to rely on complex hacks in order to achieve this ? Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-30 21:59 ` Willy Tarreau @ 2009-01-30 22:03 ` David Miller 2009-01-30 22:13 ` Willy Tarreau 0 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-01-30 22:03 UTC (permalink / raw) To: w Cc: jarkao2, zbr, herbert, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Fri, 30 Jan 2009 22:59:20 +0100 > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > > It's cheesy and the SLAB developers will likely barf at the > > idea, but it would certainly work. > > Maybe that would be enough as a definitive fix for a stable > release, so that we can go on with deeper changes in newer > versions ? Such a check could have performance ramifications, I wouldn't risk it and already I intend to push Jarek's page allocator splice fix back to -stable eventually. > > Back to real life, I think long term the thing to do is to just do the > > cached page allocator thing we'll be doing after Jarek's socket page > > patch is integrated, and for best performance the driver has to > > receive it's data into pages, only explicitly pulling the ethernet > > header into the linear area, like NIU does. > > Are there NICs out there able to do that themselves or does the > driver need to rely on complex hacks in order to achieve this ? Any NIC, even the dumbest ones, can be made to receive into pages. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-30 22:03 ` David Miller @ 2009-01-30 22:13 ` Willy Tarreau 2009-01-30 22:15 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-30 22:13 UTC (permalink / raw) To: David Miller Cc: jarkao2, zbr, herbert, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 30, 2009 at 02:03:46PM -0800, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Fri, 30 Jan 2009 22:59:20 +0100 > > > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > > > It's cheesy and the SLAB developers will likely barf at the > > > idea, but it would certainly work. > > > > Maybe that would be enough as a definitive fix for a stable > > release, so that we can go on with deeper changes in newer > > versions ? > > Such a check could have performance ramifications, I wouldn't > risk it and already I intend to push Jarek's page allocator > splice fix back to -stable eventually. OK. > > > Back to real life, I think long term the thing to do is to just do the > > > cached page allocator thing we'll be doing after Jarek's socket page > > > patch is integrated, and for best performance the driver has to > > > receive it's data into pages, only explicitly pulling the ethernet > > > header into the linear area, like NIU does. > > > > Are there NICs out there able to do that themselves or does the > > driver need to rely on complex hacks in order to achieve this ? > > Any NIC, even the dumbest ones, can be made to receive into pages. OK I thought that it was not always easy to split between headers and payload. I know that myri10ge can be configured to receive into either skbs or pages, but I was not sure about the real implications behind that. Thanks Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-30 22:13 ` Willy Tarreau @ 2009-01-30 22:15 ` David Miller 0 siblings, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-30 22:15 UTC (permalink / raw) To: w Cc: jarkao2, zbr, herbert, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Fri, 30 Jan 2009 23:13:46 +0100 > On Fri, Jan 30, 2009 at 02:03:46PM -0800, David Miller wrote: > > Any NIC, even the dumbest ones, can be made to receive into pages. > > OK I thought that it was not always easy to split between headers > and payload. I know that myri10ge can be configured to receive into > either skbs or pages, but I was not sure about the real implications > behind that. For a dumb NIC you wouldn't split, you'd receive the entire packet directly into part of a page. Then right before you give it to the stack, you pull the ethernet header from the page into the linear area. ^ permalink raw reply [flat|nested] 190+ messages in thread
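A compressed sketch of that receive path, with every driver-specific detail invented for illustration. The frame has already been DMAed into part of a page; the only copy is the 14-byte Ethernet header, which goes into the linear area so eth_type_trans() and the rest of the stack behave as usual, and the skb takes over the driver's reference on the page.

#include <linux/etherdevice.h>
#include <linux/if_ether.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/*
 * Sketch only: hand a frame that the NIC wrote into "page" at "offset"
 * to the stack, copying just the Ethernet header into the linear area.
 */
static void toy_rx_frame(struct net_device *dev, struct page *page,
                         unsigned int offset, unsigned int len)
{
        struct sk_buff *skb;

        skb = netdev_alloc_skb(dev, ETH_HLEN + NET_IP_ALIGN);
        if (!skb)
                return;         /* drop; a real driver would recycle the page */

        skb_reserve(skb, NET_IP_ALIGN);

        /* linear area: the Ethernet header only */
        memcpy(skb_put(skb, ETH_HLEN), page_address(page) + offset, ETH_HLEN);

        /* payload stays where the NIC wrote it, as a page fragment */
        skb_fill_page_desc(skb, 0, page, offset + ETH_HLEN, len - ETH_HLEN);
        skb->len      += len - ETH_HLEN;
        skb->data_len += len - ETH_HLEN;
        skb->truesize += len - ETH_HLEN;

        skb->protocol = eth_type_trans(skb, dev);
        netif_receive_skb(skb);
}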
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-30 21:42 ` David Miller 2009-01-30 21:59 ` Willy Tarreau @ 2009-01-30 22:16 ` Herbert Xu 2009-02-02 8:08 ` Jarek Poplawski 2009-02-03 11:38 ` Nick Piggin 2 siblings, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-01-30 22:16 UTC (permalink / raw) To: David Miller Cc: jarkao2, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > > Hmmm, Jarek's comments here made me realize that we might be > able to do some hack with cooperation with SLAB. > > Basically the idea is that if the page count of a SLAB page > is greater than one, SLAB will not use that page for new > allocations. > > It's cheesy and the SLAB developers will likely barf at the > idea, but it would certainly work. I'm not going anywhere near that discussion :) > Back to real life, I think long term the thing to do is to just do the > cached page allocator thing we'll be doing after Jarek's socket page > patch is integrated, and for best performance the driver has to > receive it's data into pages, only explicitly pulling the ethernet > header into the linear area, like NIU does. Yes that sounds like the way to go. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-30 22:16 ` Herbert Xu @ 2009-02-02 8:08 ` Jarek Poplawski 2009-02-02 8:18 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-02-02 8:08 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote: > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: ... > > Back to real life, I think long term the thing to do is to just do the > > cached page allocator thing we'll be doing after Jarek's socket page > > patch is integrated, and for best performance the driver has to > > receive it's data into pages, only explicitly pulling the ethernet > > header into the linear area, like NIU does. > > Yes that sounds like the way to go. Looks like a lot of changes in drivers, plus: would it work with jumbo frames? I wonder why the linear area can't be allocated as paged, and freed with put_page() instead of kfree(skb->head) in skb_release_data(). Actually, at least for some time, there could be used both of these methods (on paged alloc failure) with some ifs in skb_release_data() and spd_fill_page() (to check if linear_to_page() is needed). Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
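A minimal sketch of what Jarek is suggesting, assuming a hypothetical marker that records whether skb->head came from a page pool rather than kmalloc(): the free path could then drop a page reference instead of calling kfree(), and splice() could keep the page alive with a reference of its own rather than copying. The real skb_release_data() releases the frags and frag list and then kfree()s skb->head.

#include <linux/mm.h>
#include <linux/skbuff.h>
#include <linux/slab.h>

/*
 * Sketch: "head_is_paged" is a made-up flag meaning skb->head was
 * carved out of a refcounted page instead of kmalloc()ed memory.
 */
static void toy_free_head(struct sk_buff *skb, bool head_is_paged)
{
        if (head_is_paged)
                put_page(virt_to_page(skb->head));      /* page pool case */
        else
                kfree(skb->head);                       /* classic kmalloc() case */
}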
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-02 8:08 ` Jarek Poplawski @ 2009-02-02 8:18 ` David Miller 2009-02-02 8:43 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-02-02 8:18 UTC (permalink / raw) To: jarkao2 Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Mon, 2 Feb 2009 08:08:55 +0000 > On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote: > > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > ... > > > Back to real life, I think long term the thing to do is to just do the > > > cached page allocator thing we'll be doing after Jarek's socket page > > > patch is integrated, and for best performance the driver has to > > > receive it's data into pages, only explicitly pulling the ethernet > > > header into the linear area, like NIU does. > > > > Yes that sounds like the way to go. > > Looks like a lot of changes in drivers, plus: would it work with jumbo > frames? I wonder why the linear area can't be allocated as paged, and > freed with put_page() instead of kfree(skb->head) in skb_release_data(). Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-02 8:18 ` David Miller @ 2009-02-02 8:43 ` Jarek Poplawski 2009-02-03 7:50 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-02-02 8:43 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Mon, 2 Feb 2009 08:08:55 +0000 > > > On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote: > > > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > > ... > > > > Back to real life, I think long term the thing to do is to just do the > > > > cached page allocator thing we'll be doing after Jarek's socket page > > > > patch is integrated, and for best performance the driver has to > > > > receive it's data into pages, only explicitly pulling the ethernet > > > > header into the linear area, like NIU does. > > > > > > Yes that sounds like the way to go. > > > > Looks like a lot of changes in drivers, plus: would it work with jumbo > > frames? I wonder why the linear area can't be allocated as paged, and > > freed with put_page() instead of kfree(skb->head) in skb_release_data(). > > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful. I mean allocating chunks of cached pages similarly to sk_sndmsg_page way. I guess the similar problem is to be worked out in any case. But it seems doing it on the linear area requires less changes in other places. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
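For context, the pattern being referred to (tcp_sendmsg() keeps a partially used page cached in sk->sk_sndmsg_page and carves chunks out of it) looks roughly like the sketch below. The structure and function names are invented, this is not the actual tcp_sendmsg() code, the cache is assumed to start zero-initialized, and chunks are assumed to never exceed PAGE_SIZE.

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Minimal sketch of the "cached page" chunk allocator pattern: each
 * chunk pins the backing page with its own reference, so consumers
 * just call put_page() when done and the page is freed once the last
 * chunk owner drops its reference.
 */
struct page_cache {
        struct page *page;      /* current partially used page */
        unsigned int off;       /* first free byte in that page */
};

static void *chunk_alloc(struct page_cache *pc, unsigned int len,
                         struct page **pagep, unsigned int *offp)
{
        if (!pc->page || pc->off + len > PAGE_SIZE) {
                if (pc->page)
                        put_page(pc->page);     /* drop the cache's own reference */
                pc->page = alloc_page(GFP_KERNEL);
                if (!pc->page)
                        return NULL;
                pc->off = 0;
        }

        get_page(pc->page);                     /* reference owned by the chunk */
        *pagep = pc->page;
        *offp  = pc->off;
        pc->off += len;

        return (char *)page_address(pc->page) + *offp;
}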
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-02 8:43 ` Jarek Poplawski @ 2009-02-03 7:50 ` David Miller 2009-02-03 9:41 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-02-03 7:50 UTC (permalink / raw) To: jarkao2 Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Mon, 2 Feb 2009 08:43:58 +0000 > On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote: > > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful. > > I mean allocating chunks of cached pages similarly to sk_sndmsg_page > way. I guess the similar problem is to be worked out in any case. But > it seems doing it on the linear area requires less changes in other > places. This is a very interesting idea, but it has some drawbacks: 1) Just like any other allocator we'll need to find a way to handle > PAGE_SIZE allocations, and thus add handling for compound pages etc. And exactly the drivers that want such huge SKB data areas on receive should be converted to use scatter gather page vectors in order to avoid multi-order pages and thus strains on the page allocator. 2) Space wastage and poor packing can be an issue. Even with SLAB/SLUB we get poor packing, look at Evgeniy's graphs that he made when writing his NTA patches. Now, when choosing a way to move forward, I'm willing to accept a little bit of the issues in #2 for the sake of avoiding the issues in #1 above. Jarek, note that we can just keep your current splice() copy hacks in there. And as a result we can have an easier to handle migration path. We just do the page RX allocation conversions in the drivers where performance really matters, for hardware a lot of people have. That's a lot smoother and has fewer issues than converting the system wide SKB allocator upside down. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 7:50 ` David Miller @ 2009-02-03 9:41 ` Jarek Poplawski 2009-02-03 11:10 ` Evgeniy Polyakov ` (2 more replies) 0 siblings, 3 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-02-03 9:41 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Mon, Feb 02, 2009 at 11:50:17PM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Mon, 2 Feb 2009 08:43:58 +0000 > > > On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote: > > > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful. > > > > I mean allocating chunks of cached pages similarly to sk_sndmsg_page > > way. I guess the similar problem is to be worked out in any case. But > > it seems doing it on the linear area requires less changes in other > > places. > > This is a very interesting idea, but it has some drawbacks: > > 1) Just like any other allocator we'll need to find a way to > handle > PAGE_SIZE allocations, and thus add handling for > compound pages etc. > > And exactly the drivers that want such huge SKB data areas > on receive should be converted to use scatter gather page > vectors in order to avoid multi-order pages and thus strains > on the page allocator. I guess compound pages are handled by put_page() enough, but I don't think they should be main argument here, and I agree: scatter gather should be used where possible. > > 2) Space wastage and poor packing can be an issue. > > Even with SLAB/SLUB we get poor packing, look at Evegeniy's > graphs that he made when writing his NTA patches. I'm a bit lost here: could you "remind" the way page space would be used/saved in your paged variant e.g. for ~1500B skbs? > > Now, when choosing a way to move forward, I'm willing to accept a > little bit of the issues in #2 for the sake of avoiding the > issues in #1 above. > > Jarek, note that we can just keep your current splice() copy hacks in > there. And as a result we can have an easier to handle migration > path. We just do the page RX allocation conversions in the drivers > where performance really matters, for hardware a lot of people have. > > That's a lot smoother and has less issues that converting the system > wide SKB allocator upside down. > Yes, this looks reasonable. On the other hand, I think it would be nice to get some opinions of slab folks (incl. Evgeniy) on the expected efficiency of such a solution. (It seems releasing with put_page() will always have some cost with delayed reusing and/or waste of space.) Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 9:41 ` Jarek Poplawski @ 2009-02-03 11:10 ` Evgeniy Polyakov 2009-02-03 11:24 ` Herbert Xu ` (2 more replies) 2009-02-04 7:56 ` Jarek Poplawski 2009-02-06 7:52 ` David Miller 2 siblings, 3 replies; 190+ messages in thread From: Evgeniy Polyakov @ 2009-02-03 11:10 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 09:41:08AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > 1) Just like any other allocator we'll need to find a way to > > handle > PAGE_SIZE allocations, and thus add handling for > > compound pages etc. > > > > And exactly the drivers that want such huge SKB data areas > > on receive should be converted to use scatter gather page > > vectors in order to avoid multi-order pages and thus strains > > on the page allocator. > > I guess compound pages are handled by put_page() enough, but I don't > think they should be main argument here, and I agree: scatter gather > should be used where possible. Problem is to allocate them, since with the time memory will be quite fragmented, which will not allow to find a big enough page. NTA tried to solve this by not allowing to free the data allocated on the different CPU, contrary to what SLAB does. Modulo cache coherency improvements, it allows to combine freed chunks back into the pages and combine them in turn to get bigger contiguous areas suitable for the drivers which were not converted to use the scatter gather approach. I even believe that for some hardware it is the only way to deal with the jumbo frames. > > 2) Space wastage and poor packing can be an issue. > > > > Even with SLAB/SLUB we get poor packing, look at Evegeniy's > > graphs that he made when writing his NTA patches. > > I'm a bit lost here: could you "remind" the way page space would be > used/saved in your paged variant e.g. for ~1500B skbs? At least in NTA I used cache line alignment for smaller chunks, while SLAB uses power of two. Thus for 1500 MTU SLAB wastes about 500 bytes per packet (modulo size of the shared info structure). > Yes, this looks reasonable. On the other hand, I think it would be > nice to get some opinions of slab folks (incl. Evgeniy) on the expected > efficiency of such a solution. (It seems releasing with put_page() will > always have some cost with delayed reusing and/or waste of space.) Well, my opinion is rather biased here :) -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
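To put rough numbers on the power-of-two point: the sizes below are assumptions for illustration only (padding and struct skb_shared_info vary by kernel version and architecture), but they show why a ~1500-byte frame lands in the 2048-byte kmalloc class while a cache-line-granular allocator only rounds up to the next 64 bytes.

#include <stdio.h>

/* Illustrative numbers only; real values depend on kernel and arch. */
#define MTU             1500
#define PAD             32      /* headroom, e.g. NET_SKB_PAD (assumed) */
#define SHINFO          300     /* ~sizeof(struct skb_shared_info) (assumed) */
#define CACHE_LINE      64

static unsigned int pow2_roundup(unsigned int x)
{
        unsigned int s = 32;

        while (s < x)
                s <<= 1;
        return s;
}

int main(void)
{
        unsigned int need = MTU + PAD + SHINFO;
        unsigned int slab = pow2_roundup(need);                      /* kmalloc class */
        unsigned int line = (need + CACHE_LINE - 1) & ~(CACHE_LINE - 1u);

        printf("needed %u bytes: power-of-two alloc %u (waste %u), "
               "cache-line alloc %u (waste %u)\n",
               need, slab, slab - need, line, line - need);
        return 0;
}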
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 11:10 ` Evgeniy Polyakov @ 2009-02-03 11:24 ` Herbert Xu 2009-02-03 11:49 ` Evgeniy Polyakov 2009-02-04 0:46 ` David Miller 2009-02-03 12:36 ` Jarek Poplawski 2009-02-04 0:46 ` David Miller 2 siblings, 2 replies; 190+ messages in thread From: Herbert Xu @ 2009-02-03 11:24 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 02:10:12PM +0300, Evgeniy Polyakov wrote: > > I even believe that for some hardware it is the only way to deal > with the jumbo frames. Not necessarily. Even if the hardware can only DMA into contiguous memory, we can always allocate a sufficient number of contiguous buffers initially, and then always copy them into fragmented skbs at receive time. This way the contiguous buffers are never depleted. Granted copying sucks, but this is really because the underlying hardware is badly designed. Also copying is way better than not receiving at all due to memory fragmentation. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 11:24 ` Herbert Xu @ 2009-02-03 11:49 ` Evgeniy Polyakov 2009-02-03 11:53 ` Herbert Xu 2009-02-03 13:05 ` david 1 sibling, 2 replies; 190+ messages in thread From: Evgeniy Polyakov @ 2009-02-03 11:49 UTC (permalink / raw) To: Herbert Xu Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 10:24:31PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > > I even believe that for some hardware it is the only way to deal > > with the jumbo frames. > > Not necessarily. Even if the hardware can only DMA into contiguous > memory, we can always allocate a sufficient number of contiguous > buffers initially, and then always copy them into fragmented skbs > at receive time. This way the contiguous buffers are never > depleted. How many such preallocated frames are enough? Is it enough to have all sockets' recv buffer sizes divided by the MTU size? Or just some of them, or... That will work, but there are way too many corner cases. > Granted copying sucks, but this is really because the underlying > hardware is badly designed. Also copying is way better than > not receiving at all due to memory fragmentation. Maybe just do not allow jumbo frames when memory is fragmented enough, and fall back to a smaller MTU in this case? With LRO/GRO there should not be that much overhead compared to multiple-page copies. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 11:49 ` Evgeniy Polyakov @ 2009-02-03 11:53 ` Herbert Xu 2009-02-03 12:07 ` Evgeniy Polyakov 2009-02-03 13:05 ` david 1 sibling, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-02-03 11:53 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 02:49:44PM +0300, Evgeniy Polyakov wrote: > How many such preallocated frames is enough? Does it enough to have all > sockets recv buffer sizes divided by the MTU size? Or just some of them, > or... That will work but there are way too many corner cases. Easy, the driver is already allocating them right now so we don't have to change a thing :) All we have to do is change the refill mechanism to always allocate a replacement skb in the rx path, and if that fails, allocate a fragmented skb instead and copy the received data into it so that the contiguous skb can be reused. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 11:53 ` Herbert Xu @ 2009-02-03 12:07 ` Evgeniy Polyakov 2009-02-03 12:12 ` Herbert Xu 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-02-03 12:07 UTC (permalink / raw) To: Herbert Xu Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 10:53:13PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > > How many such preallocated frames is enough? Does it enough to have all > > sockets recv buffer sizes divided by the MTU size? Or just some of them, > > or... That will work but there are way too many corner cases. > > Easy, the driver is already allocating them right now so we don't > have to change a thing :) How many? A hundred or so descriptors (or even several thousand) - this really does not scale for somewhat loaded IO servers; that's why we frequently get questions about dmesg being filled with order-3 and higher allocation failure dumps. > All we have to do is change the refill mechanism to always allocate > a replacement skb in the rx path, and if that fails, allocate a > fragmented skb instead and copy the received data into it so that > the contiguous skb can be reused. Having a 'reserve' skb per socket is a good idea, but what if the number of sockets is way too big? -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:07 ` Evgeniy Polyakov @ 2009-02-03 12:12 ` Herbert Xu 2009-02-03 12:18 ` Evgeniy Polyakov 0 siblings, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-02-03 12:12 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 03:07:15PM +0300, Evgeniy Polyakov wrote: > > How many? A hundred or so descriptors (or even several thousands) - > this really does not scale for the somewhat loaded IO servers, that's > why we frequently get questions why dmesg is filler with order-3 and > higher allocation failure dumps. I think you've misunderstood my suggested scheme. I'm suggesting that we keep the driver initialisation path as is, so however many skb's the driver is allocating at open() time remains unchanged. Usually this would be the number of entries on the ring buffer. We can't do any better than that since if the hardware can't do SG then you'll just have to find this many contiguous buffers. The only change we need to make is at receive time. Instead of always pushing the received skb into the stack, we should try to allocate a linear replacement skb, and if that fails, allocate a fragmented skb and copy the data into it. That way we can always push a linear skb back into the ring buffer. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
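In driver terms the scheme reads roughly like the sketch below. Everything here is an assumption made for illustration: struct rx_ring, ring_post_buffer(), the 128-byte header guess, and the omission of DMA mapping and error handling. The one property that matters is that a linear buffer always goes back onto the ring.

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct rx_ring {
        struct net_device *dev;
        unsigned int buf_size;          /* e.g. jumbo MTU plus slop */
};

/* Copy a received frame out of the contiguous RX buffer into a paged
 * skb: a small linear header area, the rest in page fragments. */
static struct sk_buff *paged_copy(struct net_device *dev,
                                  const unsigned char *data, unsigned int len)
{
        unsigned int hdr = min_t(unsigned int, len, 128);   /* header guess */
        struct sk_buff *skb = netdev_alloc_skb(dev, hdr + NET_IP_ALIGN);
        unsigned int off = hdr;
        int i = 0;

        if (!skb)
                return NULL;
        skb_reserve(skb, NET_IP_ALIGN);
        memcpy(skb_put(skb, hdr), data, hdr);

        while (off < len) {
                unsigned int chunk = min_t(unsigned int, len - off, PAGE_SIZE);
                struct page *page = alloc_page(GFP_ATOMIC);

                if (!page) {
                        kfree_skb(skb);
                        return NULL;
                }
                memcpy(page_address(page), data + off, chunk);
                skb_fill_page_desc(skb, i++, page, 0, chunk);
                skb->len      += chunk;
                skb->data_len += chunk;
                skb->truesize += PAGE_SIZE;
                off += chunk;
        }
        return skb;
}

/* Per-frame receive decision: always put a linear buffer back on the
 * ring, falling back to a copy only when no replacement can be had. */
static struct sk_buff *rx_linear_or_copy(struct rx_ring *ring,
                                         struct sk_buff *rx_skb,
                                         unsigned int len)
{
        struct sk_buff *replacement, *skb;

        replacement = netdev_alloc_skb(ring->dev, ring->buf_size);
        if (replacement) {
                ring_post_buffer(ring, replacement);    /* hypothetical helper */
                skb_put(rx_skb, len);
                return rx_skb;                          /* fast path */
        }

        skb = paged_copy(ring->dev, rx_skb->data, len);
        ring_post_buffer(ring, rx_skb);                 /* reuse the old buffer */
        return skb;                                     /* NULL means drop */
}

The copy path is the slow path by construction; it only runs when memory is already too fragmented to hand the NIC a fresh contiguous buffer, which is exactly the situation being discussed.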
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:12 ` Herbert Xu @ 2009-02-03 12:18 ` Evgeniy Polyakov 2009-02-03 12:25 ` Willy Tarreau 2009-02-03 12:27 ` Herbert Xu 0 siblings, 2 replies; 190+ messages in thread From: Evgeniy Polyakov @ 2009-02-03 12:18 UTC (permalink / raw) To: Herbert Xu Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 11:12:09PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > The only change we need to make is at receive time. Instead of > always pushing the received skb into the stack, we should try to > allocate a linear replacement skb, and if that fails, allocate > a fragmented skb and copy the data into it. That way we can > always push a linear skb back into the ring buffer. Yes, that was the part about the 'reserve' buffer for the sockets you cut :) I agree that this will work and will be better than nothing, but copying 9kB into 3 pages is a rather CPU-hungry operation, and I think (though I have no numbers) that the system will behave faster if the MTU is reduced to the standard one. Another solution is to have a proper allocator which will be able to defragment the data, if we are talking about alternatives to the drop. So: 1. copy the whole jumbo skb into a fragmented one 2. reduce the MTU 3. rely on the allocator For the 'good' hardware and drivers nothing from the above is really needed. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:18 ` Evgeniy Polyakov @ 2009-02-03 12:25 ` Willy Tarreau 2009-02-03 12:28 ` Herbert Xu 2009-02-04 0:47 ` David Miller 2009-02-03 12:27 ` Herbert Xu 1 sibling, 2 replies; 190+ messages in thread From: Willy Tarreau @ 2009-02-03 12:25 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Herbert Xu, Jarek Poplawski, David Miller, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 03:18:36PM +0300, Evgeniy Polyakov wrote: > On Tue, Feb 03, 2009 at 11:12:09PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > > The only change we need to make is at receive time. Instead of > > always pushing the received skb into the stack, we should try to > > allocate a linear replacement skb, and if that fails, allocate > > a fragmented skb and copy the data into it. That way we can > > always push a linear skb back into the ring buffer. > > Yes, that's was the part about 'reserve' buffer for the sockets you cut > :) > > I agree that this will work and will be better than nothing, but copying > 9kb into 3 pages is rather CPU hungry operation, and I think (but have > no numbers though) that system will behave faster if MTU is reduced to > the standard one. Well, FWIW, I've always observed better performance with 4k MTU (4080 to be precise) than with 9K, and I think that the overhead of allocating 3 contiguous pages is a major reason for this. > Another solution is to have a proper allocator which will be able to > defragment the data, if talking about the alternatives to the drop. > > So: > 1. copy the whole jumbo skb into fragmented one > 2. reduce the MTU you'll not reduce MTU of established connections though. And trying to advertise MSS changes in the middle of a TCP connection is an awful hack which I think will not work everywhere. > 3. rely on the allocator > > For the 'good' hardware and drivers nothing from the above is really needed. > > -- > Evgeniy Polyakov Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:25 ` Willy Tarreau @ 2009-02-03 12:28 ` Herbert Xu 2009-02-04 0:47 ` David Miller 1 sibling, 0 replies; 190+ messages in thread From: Herbert Xu @ 2009-02-03 12:28 UTC (permalink / raw) To: Willy Tarreau Cc: Evgeniy Polyakov, Jarek Poplawski, David Miller, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 01:25:35PM +0100, Willy Tarreau wrote: > > > So: > > 1. copy the whole jumbo skb into fragmented one > > 2. reduce the MTU > > you'll not reduce MTU of established connections though. And trying to > advertise MSS changes in the middle of a TCP connection is an awful > hack which I think will not work everywhere. Not to mention that Ethernet isn't just IP, so the protocol that is being used might not have a concept of path MTU discovery at all. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:25 ` Willy Tarreau 2009-02-03 12:28 ` Herbert Xu @ 2009-02-04 0:47 ` David Miller 2009-02-04 6:19 ` Willy Tarreau 1 sibling, 1 reply; 190+ messages in thread From: David Miller @ 2009-02-04 0:47 UTC (permalink / raw) To: w Cc: zbr, herbert, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Tue, 3 Feb 2009 13:25:35 +0100 > Well, FWIW, I've always observed better performance with 4k MTU (4080 to > be precise) than with 9K, and I think that the overhead of allocating 3 > contiguous pages is a major reason for this. With what hardware? If it's with myri10ge, that driver uses page frags so would not be using 3 contiguous pages even for jumbo frames. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 0:47 ` David Miller @ 2009-02-04 6:19 ` Willy Tarreau 2009-02-04 8:12 ` Evgeniy Polyakov 2009-02-04 9:12 ` David Miller 0 siblings, 2 replies; 190+ messages in thread From: Willy Tarreau @ 2009-02-04 6:19 UTC (permalink / raw) To: David Miller Cc: zbr, herbert, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 04:47:34PM -0800, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Tue, 3 Feb 2009 13:25:35 +0100 > > > Well, FWIW, I've always observed better performance with 4k MTU (4080 to > > be precise) than with 9K, and I think that the overhead of allocating 3 > > contiguous pages is a major reason for this. > > With what hardware? If it's with myri10ge, that driver uses page > frags so would not be using 3 contiguous pages even for jumbo frames. Yes myri10ge for the optimal 4080, but with e1000 too (though I don't remember the exact optimal value, I think it was slightly lower). For the myri10ge, could this be caused by the cache footprint then ? I can also retry with various values between 4 and 9k, including values close to 8k. Maybe the fact that 4k is better than 9 is because we get better filling of all pages ? I also remember having used a 7 kB MTU on e1000 and dl2k in the past. BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the allocation failures which were polluting the logs, so it's been running with that setting for years now. Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 6:19 ` Willy Tarreau @ 2009-02-04 8:12 ` Evgeniy Polyakov 2009-02-04 8:54 ` Willy Tarreau 2009-02-04 9:12 ` David Miller 1 sibling, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-02-04 8:12 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, herbert, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Feb 04, 2009 at 07:19:47AM +0100, Willy Tarreau (w@1wt.eu) wrote: > Yes myri10ge for the optimal 4080, but with e1000 too (though I don't > remember the exact optimal value, I think it was slightly lower). Very likely it is related to the allocator - the same allocation overhead to get a page, but 2.5 times bigger frame. > For the myri10ge, could this be caused by the cache footprint then ? > I can also retry with various values between 4 and 9k, including > values close to 8k. Maybe the fact that 4k is better than 9 is > because we get better filling of all pages ? > > I also remember having used a 7 kB MTU on e1000 and dl2k in the past. > BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the > allocation failures which were polluting the logs, so it's been running > with that setting for years now. Recent e1000 (e1000e) uses fragments, so it does not suffer from the high-order allocation failures. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 8:12 ` Evgeniy Polyakov @ 2009-02-04 8:54 ` Willy Tarreau 2009-02-04 8:59 ` Herbert Xu 0 siblings, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-02-04 8:54 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, herbert, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Feb 04, 2009 at 11:12:01AM +0300, Evgeniy Polyakov wrote: > On Wed, Feb 04, 2009 at 07:19:47AM +0100, Willy Tarreau (w@1wt.eu) wrote: > > Yes myri10ge for the optimal 4080, but with e1000 too (though I don't > > remember the exact optimal value, I think it was slightly lower). > > Very likely it is related to the allocator - the same allocation > overhead to get a page, but 2.5 times bigger frame. > > > For the myri10ge, could this be caused by the cache footprint then ? > > I can also retry with various values between 4 and 9k, including > > values close to 8k. Maybe the fact that 4k is better than 9 is > > because we get better filling of all pages ? > > > > I also remember having used a 7 kB MTU on e1000 and dl2k in the past. > > BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the > > allocation failures which were polluting the logs, so it's been running > > with that setting for years now. > > Recent e1000 (e1000e) uses fragments, so it does not suffer from the > high-order allocation failures. My server is running 2.4 :-), but I observed the same issues with older 2.6 as well. I can certainly imagine that things have changed a lot since, but the initial point remains : jumbo frames are expensive to deal with, and with recent NICs and drivers, we might get close performance for little additional cost. After all, initial justification for jumbo frames was the devastating interrupt rate and all NICs coalesce interrupts now. So if we can optimize all the infrastructure for extremely fast processing of standard frames (1500) and still support jumbo frames in a suboptimal mode, I think it could be a very good trade-off. Regards, willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 8:54 ` Willy Tarreau @ 2009-02-04 8:59 ` Herbert Xu 2009-02-04 9:01 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-02-04 8:59 UTC (permalink / raw) To: Willy Tarreau Cc: Evgeniy Polyakov, David Miller, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote: > > My server is running 2.4 :-), but I observed the same issues with older > 2.6 as well. I can certainly imagine that things have changed a lot since, > but the initial point remains : jumbo frames are expensive to deal with, > and with recent NICs and drivers, we might get close performance for > little additional cost. After all, initial justification for jumbo frames > was the devastating interrupt rate and all NICs coalesce interrupts now. This is total crap! Jumbo frames are way better than any of the hacks (such as GSO) that people have come up with to get around it. The only reason we are not using it as much is because of this nasty thing called the Internet. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 8:59 ` Herbert Xu @ 2009-02-04 9:01 ` David Miller 2009-02-04 9:12 ` Willy Tarreau 0 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-02-04 9:01 UTC (permalink / raw) To: herbert Cc: w, zbr, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 4 Feb 2009 19:59:07 +1100 > On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote: > > > > My server is running 2.4 :-), but I observed the same issues with older > > 2.6 as well. I can certainly imagine that things have changed a lot since, > > but the initial point remains : jumbo frames are expensive to deal with, > > and with recent NICs and drivers, we might get close performance for > > little additional cost. After all, initial justification for jumbo frames > > was the devastating interrupt rate and all NICs coalesce interrupts now. > > This is total crap! Jumbo frames are way better than any of the > hacks (such as GSO) that people have come up with to get around it. > The only reason we are not using it as much is because of this > nasty thing called the Internet. Completely agreed. If Jumbo frames are slower, it is NOT some fundamental issue. It is rather due to some misdesign of the hardware or its driver. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 9:01 ` David Miller @ 2009-02-04 9:12 ` Willy Tarreau 2009-02-04 9:15 ` David Miller ` (2 more replies) 0 siblings, 3 replies; 190+ messages in thread From: Willy Tarreau @ 2009-02-04 9:12 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Feb 04, 2009 at 01:01:46AM -0800, David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Wed, 4 Feb 2009 19:59:07 +1100 > > > On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote: > > > > > > My server is running 2.4 :-), but I observed the same issues with older > > > 2.6 as well. I can certainly imagine that things have changed a lot since, > > > but the initial point remains : jumbo frames are expensive to deal with, > > > and with recent NICs and drivers, we might get close performance for > > > little additional cost. After all, initial justification for jumbo frames > > > was the devastating interrupt rate and all NICs coalesce interrupts now. > > > > This is total crap! Jumbo frames are way better than any of the > > hacks (such as GSO) that people have come up with to get around it. > > The only reason we are not using it as much is because of this > > nasty thing called the Internet. > > Completely agreed. > > If Jumbo frames are slower, it is NOT some fundamental issue. It is > rather due to some misdesign of the hardware or it's driver. Agreed we can't use them *because* of the internet, but this limitation has forced hardware designers to find valid alternatives. For instance, having the ability to reach 10 Gbps with 1500 bytes frames on myri10ge with a low CPU usage is a real achievement. This is "only" 800 kpps after all. And the arbitrary choice of 9k for jumbo frames was total crap too. It's clear that no hardware designer was involved in the process. They have to stuff 16kB of RAM on a NIC to use only 9. And we need to allocate 3 pages for slightly more than 2. 7.5 kB would have been better in this regard. I still find it nice to lower CPU usage with frames larger than 1500, but given the fact that this is rarely used (even in datacenters), I think our efforts should concentrate on where the real users are, ie <1500. Regards, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 9:12 ` Willy Tarreau @ 2009-02-04 9:15 ` David Miller 0 siblings, 0 replies; 190+ messages in thread From: David Miller @ 2009-02-04 9:15 UTC (permalink / raw) To: w Cc: herbert, zbr, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Wed, 4 Feb 2009 10:12:17 +0100 > And the arbitrary choice of 9k for jumbo frames was total crap too. > It's clear that no hardware designer was involved in the process. Willy, do some research, this is also completely wrong. Alteon effectively created jumbo MTUs in their Acenic chips since that was the first chip to ever do it. Those were hardware engineers only making those design decisions. I think this is where I will stop taking part in this part of the discussion. Every posting is full of misinformation and I've got better things to do than to refute them every 5 minutes. :-/ ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 9:12 ` Willy Tarreau 2009-02-04 9:15 ` David Miller @ 2009-02-04 19:19 ` Roland Dreier 2009-02-04 19:28 ` Willy Tarreau 1 sibling, 1 reply; 190+ messages in thread From: Roland Dreier @ 2009-02-04 19:19 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, herbert, zbr, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe > And the arbitrary choice of 9k for jumbo frames was total crap too. > It's clear that no hardware designer was involved in the process. > They have to stuff 16kB of RAM on a NIC to use only 9. And we need > to allocate 3 pages for slightly more than 2. 7.5 kB would have been > better in this regard. 9K was not totally arbitrary. The CRC used for checksumming ethernet packets has a probability of undetected errors that goes up sharply at around 11-thousand-something bytes. So the real limit is ~11000 bytes, and I believe ~9000 was chosen to be able to carry 8K NFS payloads + all XDR and transport headers without fragmentation. - R. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 19:19 ` Roland Dreier @ 2009-02-04 19:28 ` Willy Tarreau 2009-02-04 19:48 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-02-04 19:28 UTC (permalink / raw) To: Roland Dreier Cc: David Miller, herbert, zbr, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Feb 04, 2009 at 11:19:06AM -0800, Roland Dreier wrote: > > And the arbitrary choice of 9k for jumbo frames was total crap too. > > It's clear that no hardware designer was involved in the process. > > They have to stuff 16kB of RAM on a NIC to use only 9. And we need > > to allocate 3 pages for slightly more than 2. 7.5 kB would have been > > better in this regard. > > 9K was not totally arbitrary. The CRC used for checksumming ethernet > packets has a probability of undetected errors that goes up about > 11-thousand something bytes. So the real limit is ~11000 bytes, and I > believe ~9000 was chosen to be able to carry 8K NFS payloads + all XDR > and transport headers without fragmentation. Yes I know that initial motivation. But IMHO it was a purely functional motivation without real considerations of the implications. When you read Alteon's initial proposal, there is even a biased analysis (they compare the fill ratio obtained with one 8k frame with that of 6 1.5k frames). Their own argument does not stand with their final proposal! I think that there might already have been people pushing for 8k and not 9 by then, but in order to get wide acceptance in datacenters, they had to please the NFS admins. Now it's useless to speculate on history and I agree with Davem that we're wasting our time with this discussion, let's go back to the keyboard ;-) Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 19:28 ` Willy Tarreau @ 2009-02-04 19:48 ` Jarek Poplawski 0 siblings, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-02-04 19:48 UTC (permalink / raw) To: Willy Tarreau Cc: Roland Dreier, David Miller, herbert, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, Feb 04, 2009 at 08:28:20PM +0100, Willy Tarreau wrote: ... > Now it's useless to speculate on history and I agree with Davem that > we're wasting our time with this discussion, let's go back to the > keyboard ;-) Davem knows everything, so he is ...wrong. This is a very useful discussion. Go back to the keyboards, please :-) Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 9:12 ` Willy Tarreau 2009-02-04 9:15 ` David Miller 2009-02-04 19:19 ` Roland Dreier @ 2009-02-05 8:32 ` Bill Fink 2 siblings, 0 replies; 190+ messages in thread From: Bill Fink @ 2009-02-05 8:32 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, herbert, zbr, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Wed, 4 Feb 2009, Willy Tarreau wrote: > On Wed, Feb 04, 2009 at 01:01:46AM -0800, David Miller wrote: > > From: Herbert Xu <herbert@gondor.apana.org.au> > > Date: Wed, 4 Feb 2009 19:59:07 +1100 > > > > > On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote: > > > > > > > > My server is running 2.4 :-), but I observed the same issues with older > > > > 2.6 as well. I can certainly imagine that things have changed a lot since, > > > > but the initial point remains : jumbo frames are expensive to deal with, > > > > and with recent NICs and drivers, we might get close performance for > > > > little additional cost. After all, initial justification for jumbo frames > > > > was the devastating interrupt rate and all NICs coalesce interrupts now. > > > > > > This is total crap! Jumbo frames are way better than any of the > > > hacks (such as GSO) that people have come up with to get around it. > > > The only reason we are not using it as much is because of this > > > nasty thing called the Internet. > > > > Completely agreed. > > > > If Jumbo frames are slower, it is NOT some fundamental issue. It is > > rather due to some misdesign of the hardware or it's driver. > > Agreed we can't use them *because* of the internet, but this > limitation has forced hardware designers to find valid alternatives. > For instance, having the ability to reach 10 Gbps with 1500 bytes > frames on myri10ge with a low CPU usage is a real achievement. This > is "only" 800 kpps after all. > > And the arbitrary choice of 9k for jumbo frames was total crap too. > It's clear that no hardware designer was involved in the process. > They have to stuff 16kB of RAM on a NIC to use only 9. And we need > to allocate 3 pages for slightly more than 2. 7.5 kB would have been > better in this regard. > > I still find it nice to lower CPU usage with frames larger than 1500, > but given the fact that this is rarely used (even in datacenters), I > think our efforts should concentrate on where the real users are, ie > <1500. Those in the HPC realm use 9000 byte jumbo frames because it makes a major performance difference, especially across large RTT paths, and the Internet2 backbone fully supports 9000 byte jumbo frames (with some wishing we could support much larger frame sizes). 
Local environment:

9000 byte jumbo frames:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.1875 MB / 10.01 sec = 9905.9707 Mbps 100 %TX 76 %RX 0 retrans 0.15 msRTT

4080 byte MTU:
[root@lang2 ~]# nuttcp -w10m 192.168.88.16
9171.6875 MB / 10.02 sec = 7680.7663 Mbps 100 %TX 99 %RX 0 retrans 0.19 msRTT

The performance impact is even more pronounced on a large RTT path such as the following netem emulated 80 ms RTT path:

9000 byte jumbo frames:
[root@lang2 ~]# nuttcp -T30 -w80m 192.168.89.15
25904.2500 MB / 30.16 sec = 7205.8755 Mbps 96 %TX 55 %RX 0 retrans 82.73 msRTT

4080 byte MTU:
[root@lang2 ~]# nuttcp -T30 -w80m 192.168.89.15
8650.0129 MB / 30.25 sec = 2398.8862 Mbps 33 %TX 19 %RX 2371 retrans 81.98 msRTT

And if there's any loss in the path, the performance difference is also dramatic, such as here across a real MAN environment with about a 1 ms RTT:

9000 byte jumbo frames:
[root@chance9 ~]# nuttcp -w20m 192.168.88.8
7711.8750 MB / 10.05 sec = 6436.2406 Mbps 82 %TX 96 %RX 261 retrans 0.92 msRTT

4080 byte MTU:
[root@chance9 ~]# nuttcp -w20m 192.168.88.8
4551.0625 MB / 10.08 sec = 3786.2108 Mbps 50 %TX 95 %RX 42 retrans 0.95 msRTT

All testing was with myri10ge on the transmitter side (2.6.20.7 kernel).

So my experience has definitely been that 9000 byte jumbo frames are a major performance win for high throughput applications.

-Bill

^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 6:19 ` Willy Tarreau 2009-02-04 8:12 ` Evgeniy Polyakov @ 2009-02-04 9:12 ` David Miller 1 sibling, 0 replies; 190+ messages in thread From: David Miller @ 2009-02-04 9:12 UTC (permalink / raw) To: w Cc: zbr, herbert, jarkao2, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Wed, 4 Feb 2009 07:19:47 +0100 > On Tue, Feb 03, 2009 at 04:47:34PM -0800, David Miller wrote: > > From: Willy Tarreau <w@1wt.eu> > > Date: Tue, 3 Feb 2009 13:25:35 +0100 > > > > > Well, FWIW, I've always observed better performance with 4k MTU (4080 to > > > be precise) than with 9K, and I think that the overhead of allocating 3 > > > contiguous pages is a major reason for this. > > > > With what hardware? If it's with myri10ge, that driver uses page > > frags so would not be using 3 contiguous pages even for jumbo frames. > > Yes myri10ge for the optimal 4080, but with e1000 too (though I don't > remember the exact optimal value, I think it was slightly lower). > > For the myri10ge, could this be caused by the cache footprint then ? > I can also retry with various values between 4 and 9k, including > values close to 8k. Maybe the fact that 4k is better than 9 is > because we get better filling of all pages ? Looking quickly, myri10ge's buffer manager is incredibly simplistic so it wastes a lot of memory and gives terrible cache behavior. When using JUMBO MTU it just gives whole pages to the chip. So it looks like, assuming 4096 byte PAGE_SIZE and 9000 byte jumbo MTU, the chip will allocate for a full size frame: FULL PAGE FULL PAGE FULL PAGE and only ~1K of that last full page will be utilized. The headers will therefore always land on the same cache lines, and PAGE_SIZE-~1K will be wasted. Whereas for < PAGE_SIZE mtu selections, it will give MTU sized blocks to the chip for packet data allocation. ^ permalink raw reply [flat|nested] 190+ messages in thread
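The waste described above is easy to quantify; a minimal sketch of the arithmetic, assuming 4096-byte pages and a 9000-byte frame:

#include <stdio.h>

int main(void)
{
	int page_size = 4096;
	int frame     = 9000;
	int pages     = (frame + page_size - 1) / page_size;	/* 3    */
	int unused    = pages * page_size - frame;		/* 3288 */

	printf("%d byte frame -> %d whole pages, %d bytes (%.0f%%) unused\n",
	       frame, pages, unused, 100.0 * unused / (pages * page_size));
	return 0;
}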
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:18 ` Evgeniy Polyakov 2009-02-03 12:25 ` Willy Tarreau @ 2009-02-03 12:27 ` Herbert Xu 1 sibling, 0 replies; 190+ messages in thread From: Herbert Xu @ 2009-02-03 12:27 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 03:18:36PM +0300, Evgeniy Polyakov wrote: > > I agree that this will work and will be better than nothing, but copying > 9kb into 3 pages is rather CPU hungry operation, and I think (but have > no numbers though) that system will behave faster if MTU is reduced to > the standard one. Reducing the MTU can create all sorts of problems so it should be avoided if at all possible. These days, path MTU discovery is haphazard at best. In fact MTU problems are the main reason why jumbo frames simply don't get deployed. > Another solution is to have a proper allocator which will be able to > defragment the data, if talking about the alternatives to the drop. Sure, if we can create an allocator that can guarantee contiguous allocations all the time then by all means go for it. But until we get there, doing what I suggested is way better than stopping the receiving process altogether. > So: > 1. copy the whole jumbo skb into fragmented one > 2. reduce the MTU > 3. rely on the allocator Yes, improving the allocator would obviously inrease the performance, however, there is nothing against employing both methods. I'd always avoid reducing the MTU at run-time though. > For the 'good' hardware and drivers nothing from the above is really needed. Right, that's why there is a point beyond which improving the allocator is no longer worthwhile. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 11:49 ` Evgeniy Polyakov 2009-02-03 11:53 ` Herbert Xu @ 2009-02-03 13:05 ` david 2009-02-03 12:12 ` Evgeniy Polyakov 1 sibling, 1 reply; 190+ messages in thread From: david @ 2009-02-03 13:05 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Herbert Xu, Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, 3 Feb 2009, Evgeniy Polyakov wrote: > On Tue, Feb 03, 2009 at 10:24:31PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: >>> I even believe that for some hardware it is the only way to deal >>> with the jumbo frames. >> >> Not necessarily. Even if the hardware can only DMA into contiguous >> memory, we can always allocate a sufficient number of contiguous >> buffers initially, and then always copy them into fragmented skbs >> at receive time. This way the contiguous buffers are never >> depleted. > > How many such preallocated frames is enough? Does it enough to have all > sockets recv buffer sizes divided by the MTU size? Or just some of them, > or... That will work but there are way too many corner cases. > >> Granted copying sucks, but this is really because the underlying >> hardware is badly designed. Also copying is way better than >> not receiving at all due to memory fragmentation. > > Maybe just do not allow jumbo frames when memory is fragmented enough > and fallback to the smaller MTU in this case? With LRO/GRO stuff there > should be not that much of the overhead compared to multiple-page > copies. 1. define 'fragmented enough' 2. the packet size was already negotiated on your existing connections, how are you going to change all those on the fly? 3. what do you do when a remote system sends you a large packet? drop it on the floor? having some pool of large buffers to receive into (and copy out of those buffers as quickly as possible) would cause a performance hit when things get bad, but isn't that better than dropping packets? as for the number of buffers to use. make a reasonable guess. if you only have a small number of packets around, use the buffers directly, as you use more of them start copying, as useage climbs attempt to allocate more. if you can't allocate more (and you have all of your existing ones in use) you will have to drop the packet, but at that point are you really in any worse shape than if you didn't have some mechanism to copy out of the large buffers? David Lang ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 13:05 ` david @ 2009-02-03 12:12 ` Evgeniy Polyakov 2009-02-03 12:18 ` Herbert Xu 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-02-03 12:12 UTC (permalink / raw) To: david Cc: Herbert Xu, Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 05:05:14AM -0800, david@lang.hm (david@lang.hm) wrote: > >Maybe just do not allow jumbo frames when memory is fragmented enough > >and fallback to the smaller MTU in this case? With LRO/GRO stuff there > >should be not that much of the overhead compared to multiple-page > >copies. > > > 1. define 'fragmented enough' When the allocator cannot provide the requested amount of data. > 2. the packet size was already negotiated on your existing connections, > how are you going to change all those on the fly? I.e., the MTU cannot be changed on the fly? Magic world. > 3. what do you do when a remote system sends you a large packet? drop it > on the floor? We already do just that when a jumbo frame cannot be allocated :) > having some pool of large buffers to receive into (and copy out of those > buffers as quickly as possible) would cause a performance hit when things > get bad, but isn't that better than dropping packets? It is a solution, but I think it will behave noticeably worse than with a decreased MTU. > as for the number of buffers to use. make a reasonable guess. if you only > have a small number of packets around, use the buffers directly, as you > use more of them start copying, as useage climbs attempt to allocate more. > if you can't allocate more (and you have all of your existing ones in use) > you will have to drop the packet, but at that point are you really in any > worse shape than if you didn't have some mechanism to copy out of the > large buffers? That's the main point: how to deal with broken hardware? I think (but have no strong numbers though) that having 6 packets with 1500 MTU combined into a GRO/LRO frame will be processed way faster than copying a 9k MTU frame into 3 pages and processing a single skb. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:12 ` Evgeniy Polyakov @ 2009-02-03 12:18 ` Herbert Xu 2009-02-03 12:30 ` Evgeniy Polyakov 2009-02-03 12:33 ` Nick Piggin 0 siblings, 2 replies; 190+ messages in thread From: Herbert Xu @ 2009-02-03 12:18 UTC (permalink / raw) To: Evgeniy Polyakov Cc: david, Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 03:12:19PM +0300, Evgeniy Polyakov wrote: > > It is a solution, but I think it will behave noticebly worse than > with decresed MTU. Not necessarily. Remember GSO/GRO in essence are just hacks to get around the fact that we can't increase the MTU to where we want it to be. MTU reduces the cost over the entire path while GRO/GSO only do so for the sender and the receiver. In other words when given the choice between a larger MTU with copying or GRO, the larger MTU will probably win anyway as it's optimising the entire path rather than just the receiver. > That's the main point: how to deal with broken hardware? I think (but > have no strong numbers though) that having 6 packets with 1500 MTU > combined into GRO/LRO frame will be processed way faster than copying 9k > MTU into 3 pages and process single skb. Please note that with my scheme, you'd only start copying if you can't allocate a linear skb. So if memory fragmentation doesn't happen then there is no copying at all. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:18 ` Herbert Xu @ 2009-02-03 12:30 ` Evgeniy Polyakov 2009-02-03 12:33 ` Herbert Xu 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-02-03 12:30 UTC (permalink / raw) To: Herbert Xu Cc: david, Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 11:18:08PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > > It is a solution, but I think it will behave noticebly worse than > > with decresed MTU. > > Not necessarily. Remember GSO/GRO in essence are just hacks to > get around the fact that we can't increase the MTU to where we > want it to be. MTU reduces the cost over the entire path while > GRO/GSO only do so for the sender and the receiver. > > In other words when given the choice between a larger MTU with > copying or GRO, the larger MTU will probably win anyway as it's > optimising the entire path rather than just the receiver. Well, we both do not have the data and very likely will not change our opinions :) But we can continue the discussion in case something interesting appears. For example I can hack up the e1000e driver to do a dumb copy of 9k each time it receives a jumbo frame and compare it with the usual 1.5k MTU performance. But given that modern CPUs are loafing with noticeably big IO chunks, this may only show that CPU usage increased with the copy. But it still may work. > > That's the main point: how to deal with broken hardware? I think (but > > have no strong numbers though) that having 6 packets with 1500 MTU > > combined into GRO/LRO frame will be processed way faster than copying 9k > > MTU into 3 pages and process single skb. > > Please note that with my scheme, you'd only start copying if you > can't allocate a linear skb. So if memory fragmentation doesn't > happen then there is no copying at all. Yes, absolutely. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:30 ` Evgeniy Polyakov @ 2009-02-03 12:33 ` Herbert Xu 0 siblings, 0 replies; 190+ messages in thread From: Herbert Xu @ 2009-02-03 12:33 UTC (permalink / raw) To: Evgeniy Polyakov Cc: david, Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 03:30:15PM +0300, Evgeniy Polyakov wrote: > > But we can proceed the discussion in case something interesting will > appear. For example I can hack up e1000e driver to do a dumb copy of 9k > each time it has received a jumbo frame and compare it with usual 1.5k > MTU performance. But getting that modern CPUs are loafing with noticebly > big IO chunks, this may only show that CPU was increased with the copy. > But still may work. Comparing performance is pointless because the only time you need to do the copy is when the allocator has failed. So there is *no* alternative to copying, regardless of how slow it is. You can always improve the allocator whether we do this copying fallback or not. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:18 ` Herbert Xu 2009-02-03 12:30 ` Evgeniy Polyakov @ 2009-02-03 12:33 ` Nick Piggin 1 sibling, 0 replies; 190+ messages in thread From: Nick Piggin @ 2009-02-03 12:33 UTC (permalink / raw) To: Herbert Xu Cc: Evgeniy Polyakov, david, Jarek Poplawski, David Miller, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tuesday 03 February 2009 23:18:08 Herbert Xu wrote: > On Tue, Feb 03, 2009 at 03:12:19PM +0300, Evgeniy Polyakov wrote: > > It is a solution, but I think it will behave noticebly worse than > > with decresed MTU. > > Not necessarily. Remember GSO/GRO in essence are just hacks to > get around the fact that we can't increase the MTU to where we > want it to be. MTU reduces the cost over the entire path while > GRO/GSO only do so for the sender and the receiver. > > In other words when given the choice between a larger MTU with > copying or GRO, the larger MTU will probably win anyway as it's > optimising the entire path rather than just the receiver. > > > That's the main point: how to deal with broken hardware? I think (but > > have no strong numbers though) that having 6 packets with 1500 MTU > > combined into GRO/LRO frame will be processed way faster than copying 9k > > MTU into 3 pages and process single skb. > > Please note that with my scheme, you'd only start copying if you > can't allocate a linear skb. So if memory fragmentation doesn't > happen then there is no copying at all. This sounds like a really nice idea (to the layman)! ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 11:24 ` Herbert Xu 2009-02-03 11:49 ` Evgeniy Polyakov @ 2009-02-04 0:46 ` David Miller 2009-02-04 9:41 ` Benny Amorsen 1 sibling, 1 reply; 190+ messages in thread From: David Miller @ 2009-02-04 0:46 UTC (permalink / raw) To: herbert Cc: zbr, jarkao2, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Herbert Xu <herbert@gondor.apana.org.au> Date: Tue, 3 Feb 2009 22:24:31 +1100 > Not necessarily. Even if the hardware can only DMA into contiguous > memory, we can always allocate a sufficient number of contiguous > buffers initially, and then always copy them into fragmented skbs > at receive time. This way the contiguous buffers are never > depleted. > > Granted copying sucks, but this is really because the underlying > hardware is badly designed. Also copying is way better than > not receiving at all due to memory fragmentation. This scheme sounds very reasonable. ^ permalink raw reply [flat|nested] 190+ messages in thread
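To make the scheme quoted above concrete, here is a minimal, hypothetical sketch: the hardware DMAs into a recycled contiguous buffer, and at receive time the frame is copied into a small linear skb plus page fragments, so the contiguous pool is never depleted. The function name and the 128-byte header cut-off are assumptions for illustration, not taken from any real driver, and error handling is minimal.

#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct sk_buff *rx_copy_to_paged_skb(const u8 *buf, unsigned int len)
{
	unsigned int linear = min_t(unsigned int, len, 128);
	unsigned int off = linear;
	struct sk_buff *skb;
	int i = 0;

	skb = alloc_skb(linear + NET_IP_ALIGN, GFP_ATOMIC);
	if (!skb)
		return NULL;

	skb_reserve(skb, NET_IP_ALIGN);
	/* protocol headers stay in the linear area for easy parsing */
	memcpy(skb_put(skb, linear), buf, linear);

	/* the rest of the frame goes into page fragments */
	while (off < len) {
		unsigned int chunk = min_t(unsigned int, len - off, PAGE_SIZE);
		struct page *page = alloc_page(GFP_ATOMIC);

		if (!page) {
			kfree_skb(skb);
			return NULL;
		}
		memcpy(page_address(page), buf + off, chunk);
		skb_fill_page_desc(skb, i++, page, 0, chunk);
		skb->len      += chunk;
		skb->data_len += chunk;
		skb->truesize += PAGE_SIZE;
		off += chunk;
	}
	return skb;
}

The contiguous buffer 'buf' can be handed straight back to the hardware once this returns, which is the whole point of the fallback.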
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 0:46 ` David Miller @ 2009-02-04 9:41 ` Benny Amorsen 2009-02-04 12:01 ` Herbert Xu 0 siblings, 1 reply; 190+ messages in thread From: Benny Amorsen @ 2009-02-04 9:41 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, jarkao2, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe David Miller <davem@davemloft.net> writes: > From: Herbert Xu <herbert@gondor.apana.org.au> >> Granted copying sucks, but this is really because the underlying >> hardware is badly designed. Also copying is way better than >> not receiving at all due to memory fragmentation. > > This scheme sounds very reasonable. Would it be possible to add a counter somewhere for this, or is that too expensive? /Benny ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 9:41 ` Benny Amorsen @ 2009-02-04 12:01 ` Herbert Xu 0 siblings, 0 replies; 190+ messages in thread From: Herbert Xu @ 2009-02-04 12:01 UTC (permalink / raw) To: Benny Amorsen Cc: davem, zbr, jarkao2, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe Benny Amorsen <benny+usenet@amorsen.dk> wrote: > > Would it be possible to add a counter somewhere for this, or is that > too expensive? Yes a counter would be useful and is reasonable. But you can probably deduce it by just looking at slabinfo. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 11:10 ` Evgeniy Polyakov 2009-02-03 11:24 ` Herbert Xu @ 2009-02-03 12:36 ` Jarek Poplawski 2009-02-03 13:06 ` Evgeniy Polyakov 2009-02-04 0:46 ` David Miller 2 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-02-03 12:36 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 02:10:12PM +0300, Evgeniy Polyakov wrote: > On Tue, Feb 03, 2009 at 09:41:08AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > > 1) Just like any other allocator we'll need to find a way to > > > handle > PAGE_SIZE allocations, and thus add handling for > > > compound pages etc. > > > > > > And exactly the drivers that want such huge SKB data areas > > > on receive should be converted to use scatter gather page > > > vectors in order to avoid multi-order pages and thus strains > > > on the page allocator. > > > > I guess compound pages are handled by put_page() enough, but I don't > > think they should be main argument here, and I agree: scatter gather > > should be used where possible. > > Problem is to allocate them, since with the time memory will be > quite fragmented, which will not allow to find a big enough page. Yes, it's a problem, but I don't think the main one. Since we're currently concerned with zero-copy for splice I think we could concentrate on most common cases, and treat jumbo frames with best effort only: if there are free compound pages - fine, otherwise we fallback to slab and copy in splice. > > NTA tried to solve this by not allowing to free the data allocated on > the different CPU, contrary to what SLAB does. Modulo cache coherency > improvements, it allows to combine freed chunks back into the pages and > combine them in turn to get bigger contiguous areas suitable for the > drivers which were not converted to use the scatter gather approach. > I even believe that for some hardware it is the only way to deal > with the jumbo frames. > > > > 2) Space wastage and poor packing can be an issue. > > > > > > Even with SLAB/SLUB we get poor packing, look at Evegeniy's > > > graphs that he made when writing his NTA patches. > > > > I'm a bit lost here: could you "remind" the way page space would be > > used/saved in your paged variant e.g. for ~1500B skbs? > > At least in NTA I used cache line alignment for smaller chunks, while > SLAB uses power of two. Thus for 1500 MTU SLAB wastes about 500 bytes > per packet (modulo size of the shared info structure). > > > Yes, this looks reasonable. On the other hand, I think it would be > > nice to get some opinions of slab folks (incl. Evgeniy) on the expected > > efficiency of such a solution. (It seems releasing with put_page() will > > always have some cost with delayed reusing and/or waste of space.) > > Well, my opinion is rather biased here :) I understand NTA could be better than slabs in above-mentioned cases, but I'm not sure you explaind enough your point on solving this zero-copy problem vs. NTA? Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
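A rough illustration of the power-of-two rounding waste mentioned above, assuming a ~300-byte skb_shared_info (the real size depends on kernel version and config, so the exact figure is an assumption):

#include <stdio.h>

static unsigned int kmalloc_bucket(unsigned int n)
{
	unsigned int size = 32;

	while (size < n)		/* power-of-two size classes */
		size <<= 1;
	return size;
}

int main(void)
{
	unsigned int mtu    = 1500;
	unsigned int shinfo = 300;	/* assumed skb_shared_info size */
	unsigned int bucket = kmalloc_bucket(mtu + shinfo);

	printf("kmalloc bucket: %u bytes; %u beyond the %u byte frame "
	       "(part of that is the shared info)\n",
	       bucket, bucket - mtu, mtu);
	return 0;
}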
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 12:36 ` Jarek Poplawski @ 2009-02-03 13:06 ` Evgeniy Polyakov 2009-02-03 13:25 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-02-03 13:06 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 12:36:28PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > I understand NTA could be better than slabs in above-mentioned cases, > but I'm not sure you explaind enough your point on solving this > zero-copy problem vs. NTA? NTA steals pages from the SLAB so we can maintain any reference counter logic in them, so linear part of the skb may be not really freed/reused until reference counter hits zero. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 13:06 ` Evgeniy Polyakov @ 2009-02-03 13:25 ` Jarek Poplawski 2009-02-03 14:20 ` Evgeniy Polyakov 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-02-03 13:25 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 04:06:06PM +0300, Evgeniy Polyakov wrote: > On Tue, Feb 03, 2009 at 12:36:28PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > I understand NTA could be better than slabs in above-mentioned cases, > > but I'm not sure you explaind enough your point on solving this > > zero-copy problem vs. NTA? > > NTA steals pages from the SLAB so we can maintain any reference counter > logic in them, so linear part of the skb may be not really freed/reused > until reference counter hits zero. Now it's clear. So this looks like one of the options considered by David. Then I wonder about details... It seems some kind of scheduled browsing for refcounts is needed or is there something better? Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 13:25 ` Jarek Poplawski @ 2009-02-03 14:20 ` Evgeniy Polyakov 0 siblings, 0 replies; 190+ messages in thread From: Evgeniy Polyakov @ 2009-02-03 14:20 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 01:25:37PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > Now it's clear. So this looks like one of the options considered by > David. Then I wonder about details... It seems some kind of scheduled > browsing for refcounts is needed or is there something better? It depends on the implementation; for example each kfree() may check the reference counter and return the page to the allocator when it is really free. Since a page may contain multiple objects, its reference counter may hit zero someday in the future, or never reach it if the data was not freed. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 11:10 ` Evgeniy Polyakov 2009-02-03 11:24 ` Herbert Xu 2009-02-03 12:36 ` Jarek Poplawski @ 2009-02-04 0:46 ` David Miller 2009-02-04 8:08 ` Evgeniy Polyakov 2 siblings, 1 reply; 190+ messages in thread From: David Miller @ 2009-02-04 0:46 UTC (permalink / raw) To: zbr Cc: jarkao2, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Evgeniy Polyakov <zbr@ioremap.net> Date: Tue, 3 Feb 2009 14:10:12 +0300 > NTA tried to solve this by not allowing to free the data allocated on > the different CPU, contrary to what SLAB does. Modulo cache coherency > improvements, This could kill performance on NUMA systems if we are not careful. If we ever consider NTA seriously, these issues would need to be performance tested. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 0:46 ` David Miller @ 2009-02-04 8:08 ` Evgeniy Polyakov 2009-02-04 9:23 ` Nick Piggin 0 siblings, 1 reply; 190+ messages in thread From: Evgeniy Polyakov @ 2009-02-04 8:08 UTC (permalink / raw) To: David Miller Cc: jarkao2, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 04:46:09PM -0800, David Miller (davem@davemloft.net) wrote: > > NTA tried to solve this by not allowing to free the data allocated on > > the different CPU, contrary to what SLAB does. Modulo cache coherency > > improvements, > > This could kill performance on NUMA systems if we are not careful. > > If we ever consider NTA seriously, these issues would need to > be performance tested. Quite the contrary, I think. Memory is allocated and freed on the same CPU, which means on the same memory domain, closest to the CPU in question. I did not test NUMA though, but NTA performance on the usual CPU (it is 2.5 years old already :) was noticeably good. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-04 8:08 ` Evgeniy Polyakov @ 2009-02-04 9:23 ` Nick Piggin 0 siblings, 0 replies; 190+ messages in thread From: Nick Piggin @ 2009-02-04 9:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, jarkao2, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Wednesday 04 February 2009 19:08:51 Evgeniy Polyakov wrote: > On Tue, Feb 03, 2009 at 04:46:09PM -0800, David Miller (davem@davemloft.net) wrote: > > > NTA tried to solve this by not allowing to free the data allocated on > > > the different CPU, contrary to what SLAB does. Modulo cache coherency > > > improvements, > > > > This could kill performance on NUMA systems if we are not careful. > > > > If we ever consider NTA seriously, these issues would need to > > be performance tested. > > Quite contrary I think. Memory is allocated and freed on the same CPU, > which means on the same memory domain, closest to the CPU in question. > > I did not test NUMA though, but NTA performance on the usual CPU (it is > 2.5 years old already :) was noticebly good. I had a quick look at NTA... I didn't understand much of it yet, but the remote freeing scheme is kind of like what I did for slqb. The freeing CPU queues objects back to the CPU that allocated them, which eventually checks the queue and frees them itself. I don't know how much cache coherency gains you get from this -- in most slab allocations, I think the object tends to be cache on on the CPU that frees it. I'm doing it mainly to try avoid locking... I guess that makes for cache coherency benefit itself. If NTA does significantly better than slab allocator, I would be quite interested. It might be something that we can learn from and use in the general slab allocator (or maybe something more network specific that NTA does). ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 9:41 ` Jarek Poplawski 2009-02-03 11:10 ` Evgeniy Polyakov @ 2009-02-04 7:56 ` Jarek Poplawski 2009-02-06 7:52 ` David Miller 2 siblings, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-02-04 7:56 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Tue, Feb 03, 2009 at 09:41:08AM +0000, Jarek Poplawski wrote: > On Mon, Feb 02, 2009 at 11:50:17PM -0800, David Miller wrote: > > From: Jarek Poplawski <jarkao2@gmail.com> > > Date: Mon, 2 Feb 2009 08:43:58 +0000 > > > > > On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote: > > > > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful. > > > > > > I mean allocating chunks of cached pages similarly to sk_sndmsg_page > > > way. I guess the similar problem is to be worked out in any case. But > > > it seems doing it on the linear area requires less changes in other > > > places. > > > > This is a very interesting idea, but it has some drawbacks: > > > > 1) Just like any other allocator we'll need to find a way to > > handle > PAGE_SIZE allocations, and thus add handling for > > compound pages etc. > > > > And exactly the drivers that want such huge SKB data areas > > on receive should be converted to use scatter gather page > > vectors in order to avoid multi-order pages and thus strains > > on the page allocator. > > I guess compound pages are handled by put_page() enough, but I don't > think they should be main argument here, and I agree: scatter gather > should be used where possible. > > > > > 2) Space wastage and poor packing can be an issue. > > > > Even with SLAB/SLUB we get poor packing, look at Evegeniy's > > graphs that he made when writing his NTA patches. > > I'm a bit lost here: could you "remind" the way page space would be > used/saved in your paged variant e.g. for ~1500B skbs? Here is some proof of concept to make sure I wasn't misunderstood. alloc_paged() is used only for "normal" size skbs (2x ~1500B per page; I think Herbert mentioned something like this at the beginning; it also avoids allocs other than GFP_ATOMIC and GFP_KERNEL for simplicity.) I guess it could be replaced with any other mechanizm allocting to a fragment or Evgeniy's allocator when it's ready. Alas it's not tested, but if it works, I think it should show how much gain is expected here for most common traffic. Jarek P. --- diff -Nurp a/include/linux/netdevice.h b/include/linux/netdevice.h --- a/include/linux/netdevice.h 2009-02-02 20:23:46.000000000 +0100 +++ b/include/linux/netdevice.h 2009-02-02 21:52:46.000000000 +0100 @@ -1154,6 +1154,9 @@ struct softnet_data struct sk_buff *completion_queue; struct napi_struct backlog; + + struct page *alloc_skb_page[2]; + unsigned int alloc_skb_off[2]; }; DECLARE_PER_CPU(struct softnet_data,softnet_data); diff -Nurp a/include/linux/skbuff.h b/include/linux/skbuff.h --- a/include/linux/skbuff.h 2009-02-02 20:23:46.000000000 +0100 +++ b/include/linux/skbuff.h 2009-02-02 22:12:04.000000000 +0100 @@ -144,7 +144,8 @@ struct skb_shared_info { unsigned short gso_size; /* Warning: this field is not always filled in (UFO)! 
*/ unsigned short gso_segs; - unsigned short gso_type; + __u8 gso_type; + __u8 alloc_paged; __be32 ip6_frag_id; #ifdef CONFIG_HAS_DMA unsigned int num_dma_maps; diff -Nurp a/net/core/dev.c b/net/core/dev.c --- a/net/core/dev.c 2009-02-02 19:37:33.000000000 +0100 +++ b/net/core/dev.c 2009-02-02 23:15:55.000000000 +0100 @@ -5243,6 +5243,9 @@ static int __init net_dev_init(void) queue->backlog.poll = process_backlog; queue->backlog.weight = weight_p; queue->backlog.gro_list = NULL; + + queue->alloc_skb_page[0] = NULL; + queue->alloc_skb_page[1] = NULL; } dev_boot_phase = 0; diff -Nurp a/net/core/skbuff.c b/net/core/skbuff.c --- a/net/core/skbuff.c 2009-02-02 20:23:46.000000000 +0100 +++ b/net/core/skbuff.c 2009-02-02 23:57:10.000000000 +0100 @@ -151,6 +151,55 @@ void skb_truesize_bug(struct sk_buff *sk } EXPORT_SYMBOL(skb_truesize_bug); +static inline void *alloc_paged(unsigned int size, gfp_t gfp_mask) +{ + struct softnet_data *sd; + unsigned long flags; + unsigned int off; + struct page *p; + void *ret; + int i; + + if (size < 1400 || size > 2000) + return NULL; + + if (gfp_mask == GFP_ATOMIC) + i = 0; + else if (gfp_mask == GFP_KERNEL) + i = 1; + else + return NULL; + + local_irq_save(flags); + sd = &__get_cpu_var(softnet_data); + p = sd->alloc_skb_page[i]; + + if (p) { + off = sd->alloc_skb_off[i]; + if (off + size > PAGE_SIZE) { + put_page(p); + goto new_page; + } + } else { +new_page: + p = sd->alloc_skb_page[i] = alloc_pages(gfp_mask, 0); + if (!p) { + ret = NULL; + goto out; + } + + off = 0; + /* hold one ref to this page until it's full */ + } + + get_page(p); + ret = page_address(p) + off; + sd->alloc_skb_off[i] = off + size; +out: + local_irq_restore(flags); + return ret; +} + /* Allocate a new skbuff. We do this ourselves so we can fill in a few * 'private' fields and also do memory statistics to find all the * [BEEP] leaks. @@ -178,7 +227,7 @@ struct sk_buff *__alloc_skb(unsigned int struct kmem_cache *cache; struct skb_shared_info *shinfo; struct sk_buff *skb; - u8 *data; + u8 *data, paged = 0; cache = fclone ? skbuff_fclone_cache : skbuff_head_cache; @@ -188,8 +237,13 @@ struct sk_buff *__alloc_skb(unsigned int goto out; size = SKB_DATA_ALIGN(size); - data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info), - gfp_mask, node); + data = alloc_paged(size + sizeof(struct skb_shared_info), gfp_mask); + if (data) + paged = 1; + else + data = kmalloc_node_track_caller(size + + sizeof(struct skb_shared_info), + gfp_mask, node); if (!data) goto nodata; @@ -214,6 +268,7 @@ struct sk_buff *__alloc_skb(unsigned int shinfo->gso_type = 0; shinfo->ip6_frag_id = 0; shinfo->frag_list = NULL; + shinfo->alloc_paged = paged; if (fclone) { struct sk_buff *child = skb + 1; @@ -341,7 +396,10 @@ static void skb_release_data(struct sk_b if (skb_shinfo(skb)->frag_list) skb_drop_fraglist(skb); - kfree(skb->head); + if (skb_shinfo(skb)->alloc_paged) + put_page(virt_to_page(skb->head)); + else + kfree(skb->head); } } @@ -1380,7 +1438,7 @@ static inline int spd_fill_page(struct s if (unlikely(spd->nr_pages == PIPE_BUFFERS)) return 1; - if (linear) { + if (linear && !skb_shinfo(skb)->alloc_paged) { page = linear_to_page(page, len, &offset, skb); if (!page) return 1; ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-03 9:41 ` Jarek Poplawski 2009-02-03 11:10 ` Evgeniy Polyakov 2009-02-04 7:56 ` Jarek Poplawski @ 2009-02-06 7:52 ` David Miller 2009-02-06 8:09 ` Herbert Xu ` (2 more replies) 2 siblings, 3 replies; 190+ messages in thread From: David Miller @ 2009-02-06 7:52 UTC (permalink / raw) To: jarkao2 Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Tue, 3 Feb 2009 09:41:08 +0000 > Yes, this looks reasonable. On the other hand, I think it would be > nice to get some opinions of slab folks (incl. Evgeniy) on the expected > efficiency of such a solution. (It seems releasing with put_page() will > always have some cost with delayed reusing and/or waste of space.) I think we can't avoid using carved up pages for skb->data in the end. The whole kernel wants to speak in pages and be able to grab and release them in one way and one way only (get_page() and put_page()). What do you think is more likely? Us teaching the whole entire kernel how to hold onto SKB linear data buffers, or the networking fixing itself to operate on pages for it's header metadata? :-) What we'll end up with is likely a hybrid scheme. High speed devices will receive into pages. And also the skb->data area will be page backed and held using get_page()/put_page() references. It is not even worth optimizing for skb->data holding the entire packet, that's not the case that matters. These skb->data areas will thus be 128 bytes plus the skb_shinfo structure blob. They also will be recycled often, rather than held onto for long periods of time. In fact we can optimize that even further in many ways, for example by dropping the skb->data backed memory once the skb is queued to the socket receive buffer. That will make skb->data buffer lifetimes miniscule even under heavy receive load. In that kind of situation, doing even the most stupidest page slicing algorithm, similar to what we do now with sk->sk_sndmsg_page, is more than adequate and things like NTA (purely to solve this problem) is overengineering. ^ permalink raw reply [flat|nested] 190+ messages in thread
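A minimal sketch of such a page-slicing allocator in the spirit of sk_sndmsg_page (and of Jarek's alloc_paged proof of concept earlier in the thread); the structure, names and locking are simplified assumptions for illustration, not a real kernel interface:

#include <linux/mm.h>
#include <linux/gfp.h>

/* One slicer per CPU (or per rx ring); assumed to be used under a
 * suitable lock or with interrupts disabled. */
struct data_slicer {
	struct page	*page;
	unsigned int	offset;
};

static void *slice_alloc(struct data_slicer *s, unsigned int size, gfp_t gfp)
{
	void *ptr;

	if (!s->page || s->offset + size > PAGE_SIZE) {
		if (s->page)
			put_page(s->page);	/* drop the carving reference */
		s->page = alloc_page(gfp);
		if (!s->page)
			return NULL;
		s->offset = 0;
	}

	get_page(s->page);			/* one reference per chunk */
	ptr = page_address(s->page) + s->offset;
	s->offset += size;
	return ptr;
}

/* Freeing a chunk is simply put_page(virt_to_page(ptr)); the page goes
 * back to the page allocator once the last chunk reference is dropped. */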
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 7:52 ` David Miller @ 2009-02-06 8:09 ` Herbert Xu 2009-02-06 9:10 ` Jarek Poplawski 2009-02-06 18:59 ` Jarek Poplawski 2 siblings, 0 replies; 190+ messages in thread From: Herbert Xu @ 2009-02-06 8:09 UTC (permalink / raw) To: David Miller Cc: jarkao2, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Thu, Feb 05, 2009 at 11:52:58PM -0800, David Miller wrote: > > In fact we can optimize that even further in many ways, for example by > dropping the skb->data backed memory once the skb is queued to the > socket receive buffer. That will make skb->data buffer lifetimes > miniscule even under heavy receive load. Indeed, while I was doing the tun accounting stuff and reviewing the rx accounting users, it was apparent that we don't need to carry most of this stuff in our receive queues. This is almost the opposite of the skbless data idea for TSO. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 7:52 ` David Miller 2009-02-06 8:09 ` Herbert Xu @ 2009-02-06 9:10 ` Jarek Poplawski 2009-02-06 9:17 ` David Miller 2009-02-06 9:23 ` Herbert Xu 2009-02-06 18:59 ` Jarek Poplawski 2 siblings, 2 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-02-06 9:10 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Thu, Feb 05, 2009 at 11:52:58PM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Tue, 3 Feb 2009 09:41:08 +0000 > > > Yes, this looks reasonable. On the other hand, I think it would be > > nice to get some opinions of slab folks (incl. Evgeniy) on the expected > > efficiency of such a solution. (It seems releasing with put_page() will > > always have some cost with delayed reusing and/or waste of space.) > > I think we can't avoid using carved up pages for skb->data in the end. > The whole kernel wants to speak in pages and be able to grab and > release them in one way and one way only (get_page() and put_page()). > > What do you think is more likely? Us teaching the whole entire kernel > how to hold onto SKB linear data buffers, or the networking fixing > itself to operate on pages for it's header metadata? :-) This idea looks very reasonable, except I wander why nobody else didn't need this kind of mm interface. Another question is it seems many mechanisms like fast searching, defragmentation etc. could be reused. > What we'll end up with is likely a hybrid scheme. High speed devices > will receive into pages. And also the skb->data area will be page > backed and held using get_page()/put_page() references. > > It is not even worth optimizing for skb->data holding the entire > packet, that's not the case that matters. > > These skb->data areas will thus be 128 bytes plus the skb_shinfo > structure blob. They also will be recycled often, rather than held > onto for long periods of time. Looks fine, except: you mentioned dumb NICs, which would need this page space on receive, anyway. BTW, don't they need this on transmit again? > In fact we can optimize that even further in many ways, for example by > dropping the skb->data backed memory once the skb is queued to the > socket receive buffer. That will make skb->data buffer lifetimes > miniscule even under heavy receive load. > > In that kind of situation, doing even the most stupidest page slicing > algorithm, similar to what we do now with sk->sk_sndmsg_page, is > more than adequate and things like NTA (purely to solve this problem) > is overengineering. Hmm... I don't get it. It seems these slabs do a lot of advanced work, and still some people like Evgeniy or Nick thought it's not enough, and even found it worth of their time to rework this. There is also a question of memory accounting: do you think admins don't care if we give away say 25% additionally? Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 9:10 ` Jarek Poplawski @ 2009-02-06 9:17 ` David Miller 2009-02-06 9:42 ` Jarek Poplawski 2009-02-06 9:23 ` Herbert Xu 1 sibling, 1 reply; 190+ messages in thread From: David Miller @ 2009-02-06 9:17 UTC (permalink / raw) To: jarkao2 Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Fri, 6 Feb 2009 09:10:34 +0000 > Hmm... I don't get it. It seems these slabs do a lot of advanced work, > and still some people like Evgeniy or Nick thought it's not enough, > and even found it worth of their time to rework this. Note that, at least to some extent, the memory allocators are duplicating some of the locality and NUMA logic that's already present in the page allocator itself. Except that they are handling the fact that objects are moving around instead of pages. Also keep in mind that we might also want to encourage drivers to make use of the SKB recycling mechanisms we have. So this will decrease lifetimes, and thus the wastage and locality issues immensely. We truly want something different from what the general purpose allocator provides. Namely, a reference countable buffer. And all I'm saying is that since the page allocator provides that facility, and using pages solves all of the splice() et al. problems, building something extremely simple on top of the page allocator seems to be a good way to go. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 9:17 ` David Miller @ 2009-02-06 9:42 ` Jarek Poplawski 2009-02-06 9:49 ` David Miller 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-02-06 9:42 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Feb 06, 2009 at 01:17:22AM -0800, David Miller wrote: ... > And all I'm saying is that since the page allocator provides that > facility, and using pages solves all of the splice() et al. problems, > building something extremely simple on top of the page allocator seems > to be a good way to go. This all is absolutely right if we can afford it: more simple - more memory wasted. I hope you're right, but on the other hand, people don't use slob by default. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 9:42 ` Jarek Poplawski @ 2009-02-06 9:49 ` David Miller 0 siblings, 0 replies; 190+ messages in thread From: David Miller @ 2009-02-06 9:49 UTC (permalink / raw) To: jarkao2 Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Jarek Poplawski <jarkao2@gmail.com> Date: Fri, 6 Feb 2009 09:42:53 +0000 > I hope you're right, but on the other hand, people don't use slob by > default. The use is different, so it's a bad comparison, really. SLAB has to satisfy all kinds of object lifetimes, all kinds of sizes and use cases, and perform generally well in all situations with zero knowledge about usage patterns. Networking is strictly the opposite of that set of requirements. We know all of these parameters (or their ranges) and can on top of that control them if necessary. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 9:10 ` Jarek Poplawski 2009-02-06 9:17 ` David Miller @ 2009-02-06 9:23 ` Herbert Xu 2009-02-06 9:51 ` Jarek Poplawski 1 sibling, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-02-06 9:23 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Feb 06, 2009 at 09:10:34AM +0000, Jarek Poplawski wrote: > > Looks fine, except: you mentioned dumb NICs, which would need this > page space on receive, anyway. BTW, don't they need this on transmit > again? A lot more NICs support SG on tx than rx. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 9:23 ` Herbert Xu @ 2009-02-06 9:51 ` Jarek Poplawski 2009-02-06 10:28 ` Herbert Xu 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-02-06 9:51 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Feb 06, 2009 at 08:23:26PM +1100, Herbert Xu wrote: > On Fri, Feb 06, 2009 at 09:10:34AM +0000, Jarek Poplawski wrote: > > > > Looks fine, except: you mentioned dumb NICs, which would need this > > page space on receive, anyway. BTW, don't they need this on transmit > > again? > > A lot more NICs support SG on tx than rx. OK, but since there is not so much difference, and we need to waste it in some cases anyway, plus handle it later some special way, I'm a bit in doubt. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 9:51 ` Jarek Poplawski @ 2009-02-06 10:28 ` Herbert Xu 2009-02-06 10:58 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-02-06 10:28 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote: > > OK, but since there is not so much difference, and we need to waste > it in some cases anyway, plus handle it later some special way, I'm > a bit in doubt. Well the thing is cards that don't support SG on tx probably don't support jumbo frames either. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 10:28 ` Herbert Xu @ 2009-02-06 10:58 ` Jarek Poplawski 2009-02-06 11:10 ` Willy Tarreau 0 siblings, 1 reply; 190+ messages in thread From: Jarek Poplawski @ 2009-02-06 10:58 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote: > On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote: > > > > OK, but since there is not so much difference, and we need to waste > > it in some cases anyway, plus handle it later some special way, I'm > > a bit in doubt. > > Well the thing is cards that don't support SG on tx probably > don't support jumbo frames either. ?? I mean this 128 byte chunk would be hard to reuse after copying to skb->data, and if reused, we could miss this for some NICs on TX, so the whole packet would need a copy. BTW, David mentioned something simple like sk_sndmsg_page would be enough, but I guess not for these non-SG NICs. We have to allocate bigger chunks for them, so more fragmentation to handle. Cheers, Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 10:58 ` Jarek Poplawski @ 2009-02-06 11:10 ` Willy Tarreau 2009-02-06 11:47 ` Jarek Poplawski 0 siblings, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-02-06 11:10 UTC (permalink / raw) To: Jarek Poplawski Cc: Herbert Xu, David Miller, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Feb 06, 2009 at 10:58:07AM +0000, Jarek Poplawski wrote: > On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote: > > On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote: > > > > > > OK, but since there is not so much difference, and we need to waste > > > it in some cases anyway, plus handle it later some special way, I'm > > > a bit in doubt. > > > > Well the thing is cards that don't support SG on tx probably > > don't support jumbo frames either. > > ?? I mean this 128 byte chunk would be hard to reuse after copying > to skb->data, and if reused, we could miss this for some NICs on TX, > so the whole packed would need a copy. couldn't we stuff up to 32 128-byte chunks in a page and use a 32-bit map to indicate which slot is used and which one is free ? This would just be a matter of calling ffz() to find one spare place in a page. Also, that bitmap might serve as a refcount, because if it drops to zero, it means all slots are unused. And -1 means all slots are used. This would reduce wastage if we need to allocate 128 bytes often. Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
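A minimal sketch of that bitmap scheme, assuming 128-byte slots and leaving out locking and the page's own refcounting:

#include <linux/mm.h>
#include <linux/bitops.h>

/* A page carved into 32 slots of 128 bytes; the occupancy map doubles
 * as the reference count: 0 means the page is completely idle. */
struct chunk_page {
	struct page	*page;
	u32		map;		/* bit set = slot in use */
};

static void *chunk_alloc(struct chunk_page *cp)
{
	unsigned long slot;

	if (cp->map == ~0U)		/* all 32 slots taken */
		return NULL;

	slot = ffz(cp->map);		/* first zero bit = first free slot */
	cp->map |= 1U << slot;
	return page_address(cp->page) + slot * 128;
}

static void chunk_free(struct chunk_page *cp, void *ptr)
{
	unsigned long slot = (ptr - page_address(cp->page)) / 128;

	cp->map &= ~(1U << slot);
	/* cp->map == 0 here means every slot is free again, so the page
	 * could be released or reused for a fresh run of chunks. */
}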
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 11:10 ` Willy Tarreau @ 2009-02-06 11:47 ` Jarek Poplawski 0 siblings, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-02-06 11:47 UTC (permalink / raw) To: Willy Tarreau Cc: Herbert Xu, David Miller, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Feb 06, 2009 at 12:10:15PM +0100, Willy Tarreau wrote: > On Fri, Feb 06, 2009 at 10:58:07AM +0000, Jarek Poplawski wrote: > > On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote: > > > On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote: > > > > > > > > OK, but since there is not so much difference, and we need to waste > > > > it in some cases anyway, plus handle it later some special way, I'm > > > > a bit in doubt. > > > > > > Well the thing is cards that don't support SG on tx probably > > > don't support jumbo frames either. > > > > ?? I mean this 128 byte chunk would be hard to reuse after copying > > to skb->data, and if reused, we could miss this for some NICs on TX, > > so the whole packed would need a copy. > > couldn't we stuff up to 32 128-byte chunks in a page and use a 32-bit > map to indicate which slot is used and which one is free ? This would > just be a matter of calling ffz() to find one spare place in a page. > Also, that bitmap might serve as a refcount, because if it drops to > zero, it means all slots are unused. And -1 means all slots are used. > > This would reduce wastage if wee need to allocate 128 bytes often. Something like this would be useful for SG NICs for the paged skb->data area. But I'm concerned with non-SG ones: if I got it right, for 1500 byte packets we need to allocate such a chunk, copy 128+ bytes to skb->data, and we have 128+ unused. If there are later such short packets, we don't need this space: why copy? But even if we find the way to use this, or don't reserve such space while receiving from SG NIC, than we will need to copy both chunks to another page for TX on non-SG NICs. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-02-06 7:52 ` David Miller 2009-02-06 8:09 ` Herbert Xu 2009-02-06 9:10 ` Jarek Poplawski @ 2009-02-06 18:59 ` Jarek Poplawski 2 siblings, 0 replies; 190+ messages in thread From: Jarek Poplawski @ 2009-02-06 18:59 UTC (permalink / raw) To: David Miller Cc: herbert, zbr, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe David Miller wrote, On 02/06/2009 08:52 AM: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Tue, 3 Feb 2009 09:41:08 +0000 > >> Yes, this looks reasonable. On the other hand, I think it would be >> nice to get some opinions of slab folks (incl. Evgeniy) on the expected >> efficiency of such a solution. (It seems releasing with put_page() will >> always have some cost with delayed reusing and/or waste of space.) > > I think we can't avoid using carved up pages for skb->data in the end. > The whole kernel wants to speak in pages and be able to grab and > release them in one way and one way only (get_page() and put_page()). > > What do you think is more likely? Us teaching the entire kernel > how to hold onto SKB linear data buffers, or the networking fixing > itself to operate on pages for its header metadata? :-) > > What we'll end up with is likely a hybrid scheme. High speed devices > will receive into pages. And also the skb->data area will be page > backed and held using get_page()/put_page() references. So, after a full awakening I think I got your point at last! I thought all along that we were trying to do something more general, while you're seemingly focused on SG capable NICs, with myri10ge or niu as the model to follow. I'm OK with this. Very nice idea and much less work! (It's enough just to CC all the maintainers!) > It is not even worth optimizing for skb->data holding the entire > packet, that's not the case that matters. > > These skb->data areas will thus be 128 bytes plus the skb_shinfo > structure blob. They also will be recycled often, rather than held > onto for long periods of time. > > In fact we can optimize that even further in many ways, for example by > dropping the skb->data backed memory once the skb is queued to the > socket receive buffer. That will make skb->data buffer lifetimes > minuscule even under heavy receive load. > > In that kind of situation, doing even the stupidest page slicing > algorithm, similar to what we do now with sk->sk_sndmsg_page, is > more than adequate, and things like NTA (purely to solve this problem) > are overengineering. This is 100% right, except if we try to do something with non-SG and/or jumbos - IMHO this requires some overengineering. Jarek P. ^ permalink raw reply [flat|nested] 190+ messages in thread
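The sk->sk_sndmsg_page slicing mentioned above follows this pattern (a simplified sketch of the idea; releasing the previously cached page and all locking are elided):

/* Carve sequential chunks off a cached per-socket page; each chunk
 * pins the page with get_page(), and a fresh page is picked up once
 * the current one is exhausted.
 */
static void *page_slice(struct sock *sk, unsigned int size)
{
	struct page *page = sk->sk_sndmsg_page;
	unsigned int off = sk->sk_sndmsg_off;

	if (!page || off + size > PAGE_SIZE) {
		page = alloc_page(sk->sk_allocation);
		if (!page)
			return NULL;
		sk->sk_sndmsg_page = page;
		off = 0;
	}
	get_page(page);			/* one reference per carved chunk */
	sk->sk_sndmsg_off = off + size;
	return page_address(page) + off;
}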
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-30 21:42 ` David Miller 2009-01-30 21:59 ` Willy Tarreau 2009-01-30 22:16 ` Herbert Xu @ 2009-02-03 11:38 ` Nick Piggin 2 siblings, 0 replies; 190+ messages in thread From: Nick Piggin @ 2009-02-03 11:38 UTC (permalink / raw) To: David Miller Cc: jarkao2, zbr, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Saturday 31 January 2009 08:42:27 David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Tue, 27 Jan 2009 07:40:48 +0000 > > > I think the main problem is to respect put_page() more, and maybe you > > mean to add this to your allocator too, but using slab pages for this > > looks a bit complex to me, but I may be missing something. > > Hmmm, Jarek's comments here made me realize that we might be > able to do some hack in cooperation with SLAB. > > Basically the idea is that if the page count of a SLAB page > is greater than one, SLAB will not use that page for new > allocations. Wouldn't your caller need to know what objects are already allocated in that page too? > It's cheesy and the SLAB developers will likely barf at the > idea, but it would certainly work. It is nasty, yes. Using the page allocator directly seems like a better approach. And btw, be careful of using page->_count for anything, due to speculative page references... basically it is useful only to test zero or non-zero refcount. If designing a new scheme for the network layer, it would be nicer to begin by using, say, _mapcount or private or some other field in there for a refcount (and I have a patch to avoid the atomic put_page_testzero in page freeing for a caller that does their own refcounting, so don't fear that extra overhead too much :)). ^ permalink raw reply [flat|nested] 190+ messages in thread
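Nick's suggestion boils down to keeping the private counter in a field the speculative lookup path never touches, roughly along these lines. This is purely illustrative (the helper names are invented, and the owner would have to initialize the field first, since the mm normally keeps _mapcount at -1 for unmapped pages):

static inline void net_page_get(struct page *page)
{
	atomic_inc(&page->_mapcount);
}

static inline int net_page_put(struct page *page)
{
	/* true when the last private reference is dropped */
	return atomic_dec_and_test(&page->_mapcount);
}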
* Re: [PATCH v2] tcp: splice as many packets as possible at once 2009-01-27 6:10 ` David Miller 2009-01-27 7:40 ` Jarek Poplawski @ 2009-01-27 18:42 ` David Miller 1 sibling, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-27 18:42 UTC (permalink / raw) To: zbr Cc: jarkao2, herbert, w, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: David Miller <davem@davemloft.net> Date: Mon, 26 Jan 2009 22:10:56 -0800 (PST) > Even for drivers like NIU and myri10ge that do this, they only > use heuristics or some fixed minimum to decide how much to > move to the linear area. > > Result? Some data payload bits end up there because it overshoots. ... > I did test this with the NIU driver at one point, and it did not > change TCP latency nor throughput at all even at 10g speeds. As a followup, it turns out that NIU right now does this properly. It only pulls a maximum of ETH_HLEN into the linear area before giving the SKB to netif_receive_skb(). ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:19 ` Herbert Xu 2009-01-15 23:26 ` David Miller @ 2009-01-15 23:32 ` Willy Tarreau 2009-01-15 23:35 ` David Miller 1 sibling, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-15 23:32 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 16, 2009 at 10:19:35AM +1100, Herbert Xu wrote: > On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote: > > > > I'm leaving the patch below for comments, maybe someone will spot > > something? Don't we need at least one kfree() somewhere to match > > alloc_pages()? > > Indeed. > > > > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > > > unsigned int len, unsigned int offset, > > > - struct sk_buff *skb) > > > + struct sk_buff *skb, int linear) > > > { > > > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > > > return 1; > > > > > > + if (linear) { > > > + page = linear_to_page(page, len, offset); > > > + if (!page) > > > + return 1; > > > + } > > > + > > > spd->pages[spd->nr_pages] = page; > > > spd->partial[spd->nr_pages].len = len; > > > spd->partial[spd->nr_pages].offset = offset; > > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > > > spd->nr_pages++; > > > + get_page(page); > > This get_page needs to be moved into an else clause of the previous > if block. Good catch, Herbert, it's working fine now. The performance I get (30.6 MB/s) is between the original splice version (24.1) and the recently optimized one (35.7). That's not that bad considering we're copying the data. Cheers, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
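With Herbert's correction applied, the hunk would read roughly as follows (a sketch of the idea rather than the final committed code): linear_to_page() hands back a page that already carries the reference passed to the pipe, so get_page() must only be taken on the non-linear path.

static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
				unsigned int len, unsigned int offset,
				struct sk_buff *skb, int linear)
{
	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
		return 1;

	if (linear) {
		/* data copied into a fresh page that already holds
		 * the one reference we hand over to the pipe
		 */
		page = linear_to_page(page, len, offset);
		if (!page)
			return 1;
	} else {
		get_page(page);		/* non-linear: take our own reference */
	}

	spd->pages[spd->nr_pages] = page;
	spd->partial[spd->nr_pages].len = len;
	spd->partial[spd->nr_pages].offset = offset;
	spd->nr_pages++;
	return 0;
}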
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-15 23:32 ` [PATCH] " Willy Tarreau @ 2009-01-15 23:35 ` David Miller 0 siblings, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-15 23:35 UTC (permalink / raw) To: w Cc: herbert, jarkao2, zbr, dada1, ben, mingo, linux-kernel, netdev, jens.axboe From: Willy Tarreau <w@1wt.eu> Date: Fri, 16 Jan 2009 00:32:20 +0100 > Good catch Herbert, it's working fine now. The performance I get > (30.6 MB/s) is between the original splice version (24.1) and the > recently optimized one (35.7). That's not that bad considering > we're copying the data. Willy, see how much my skb clone removal in the patch I just posted helps, if at all. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 0:16 ` David Miller 2009-01-14 0:22 ` Evgeniy Polyakov @ 2009-01-14 0:51 ` Herbert Xu 2009-01-14 1:24 ` David Miller 1 sibling, 1 reply; 190+ messages in thread From: Herbert Xu @ 2009-01-14 0:51 UTC (permalink / raw) To: David Miller Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe David Miller <davem@davemloft.net> wrote: > > I wish there were some way we could make this code grab and release a > reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to > accomplish its goals. We can probably do that for spliced data that ends up going to the networking stack again. However, as splice is generic, the data may be headed to a destination other than the network stack. In that case to make dataref work we'd need some mechanism that allows non-network entities to get/drop this ref count. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 190+ messages in thread
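One possible shape for such a mechanism is a helper pair around skb_shinfo(skb)->dataref that any consumer could call. This is a hypothetical API, ignoring the header/payload split that dataref actually encodes, and the final release is only sketched:

static inline void skb_dataref_get(struct sk_buff *skb)
{
	atomic_inc(&skb_shinfo(skb)->dataref);
}

static inline void skb_dataref_put(struct sk_buff *skb)
{
	if (atomic_dec_and_test(&skb_shinfo(skb)->dataref)) {
		/* last reference gone: skb->head could be freed here */
	}
}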
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-14 0:51 ` Herbert Xu @ 2009-01-14 1:24 ` David Miller 0 siblings, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-14 1:24 UTC (permalink / raw) To: herbert Cc: zbr, dada1, w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 14 Jan 2009 11:51:10 +1100 > We can probably do that for spliced data that end up going to > the networking stack again. However, as splice is generic the > data may be headed to a destination other than the network stack. > > In that case to make dataref work we'd need some mechanism > that allows non-network entities to get/drop this ref count. I see. I'll think about this some more. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 22:17 ` Willy Tarreau 2009-01-09 22:42 ` Evgeniy Polyakov @ 2009-01-09 22:45 ` Eric Dumazet 2009-01-09 22:53 ` Willy Tarreau 1 sibling, 1 reply; 190+ messages in thread From: Eric Dumazet @ 2009-01-09 22:45 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Willy Tarreau wrote: > On Fri, Jan 09, 2009 at 11:12:09PM +0100, Eric Dumazet wrote: >> Willy Tarreau wrote: >>> On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote: >>>> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote: >>>> (...) >>>>>> Also, in your second mail, you're saying that your change >>>>>> might return more data than requested by the user. I can't >>>>>> find why, could you please explain to me, as I'm still quite >>>>>> ignorant in this area? >>>>> Well, I just tested various user programs and indeed got this >>>>> strange result: >>>>> >>>>> Here I call splice() with len=1000 (0x3e8), and you can see >>>>> it gives a result of 1460 at the second call. >>> OK, finally I could reproduce it and found why we have this. It's >>> expected in fact. >>> >>> The problem when we loop in tcp_read_sock() is that tss->len is >>> not decremented by the number of bytes read; that is done >>> only in tcp_splice_read(), which is the outer loop. >>> >>> The solution I found was to do just like the other callers, which means >>> use desc->count to keep the remaining number of bytes we want to >>> read. In fact, tcp_read_sock() is designed to use that one as a stop >>> condition, which explains why you first had to hide it. >>> >>> Now with the attached patch as a replacement for my previous one, >>> both issues are solved: >>> - I splice 1000 bytes if I ask to do so >>> - I splice as much as possible if available (typically 23 kB). >>> >>> My observed performance is still at the top of earlier results >>> and IMHO that way of counting bytes makes sense for an actor called >>> from tcp_read_sock(). >>> >>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c >>> index 35bcddf..51ff3aa 100644 >>> --- a/net/ipv4/tcp.c >>> +++ b/net/ipv4/tcp.c >>> @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, >>> unsigned int offset, size_t len) >>> { >>> struct tcp_splice_state *tss = rd_desc->arg.data; >>> + int ret; >>> >>> - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags); >>> + ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags); >>> + if (ret > 0) >>> + rd_desc->count -= ret; >>> + return ret; >>> } >>> >>> static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) >>> @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) >>> /* Store TCP splice context information in read_descriptor_t. */ >>> read_descriptor_t rd_desc = { >>> .arg.data = tss, >>> + .count = tss->len, >>> }; >>> >>> return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); >>> >> OK, I came to a different patch. Please check other tcp_read_sock() callers in the tree :) > > I've seen the other callers, but they all use desc->count for their own > purpose. That's how I understood what it was used for :-) Ah yes, I reread your patch and you are right. > > I think it's better not to change the API here and use tcp_read_sock() > how it's supposed to be used. Also, the fewer parameters to the function, > the better. > > However I'm OK for the !timeo before release_sock/lock_sock. I just > don't know if we can put the rest of the if above or not. I don't > know what changes we're supposed to collect by doing release_sock/ > lock_sock before the if(). Only the (!timeo) can be above. Other conditions must be checked after the release/lock. Thank you ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 22:45 ` Eric Dumazet @ 2009-01-09 22:53 ` Willy Tarreau 2009-01-09 23:34 ` Eric Dumazet 0 siblings, 1 reply; 190+ messages in thread From: Willy Tarreau @ 2009-01-09 22:53 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe On Fri, Jan 09, 2009 at 11:45:02PM +0100, Eric Dumazet wrote: > Willy Tarreau wrote: > > On Fri, Jan 09, 2009 at 11:12:09PM +0100, Eric Dumazet wrote: > >> Willy Tarreau wrote: > >>> On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote: > >>>> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote: > >>>> (...) > >>>>>> Also, in your second mail, you're saying that your change > >>>>>> might return more data than requested by the user. I can't > >>>>>> find why, could you please explain to me, as I'm still quite > >>>>>> ignorant in this area? > >>>>> Well, I just tested various user programs and indeed got this > >>>>> strange result: > >>>>> > >>>>> Here I call splice() with len=1000 (0x3e8), and you can see > >>>>> it gives a result of 1460 at the second call. > >>> OK, finally I could reproduce it and found why we have this. It's > >>> expected in fact. > >>> > >>> The problem when we loop in tcp_read_sock() is that tss->len is > >>> not decremented by the number of bytes read; that is done > >>> only in tcp_splice_read(), which is the outer loop. > >>> > >>> The solution I found was to do just like the other callers, which means > >>> use desc->count to keep the remaining number of bytes we want to > >>> read. In fact, tcp_read_sock() is designed to use that one as a stop > >>> condition, which explains why you first had to hide it. > >>> > >>> Now with the attached patch as a replacement for my previous one, > >>> both issues are solved: > >>> - I splice 1000 bytes if I ask to do so > >>> - I splice as much as possible if available (typically 23 kB). > >>> > >>> My observed performance is still at the top of earlier results > >>> and IMHO that way of counting bytes makes sense for an actor called > >>> from tcp_read_sock(). > >>> > >>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > >>> index 35bcddf..51ff3aa 100644 > >>> --- a/net/ipv4/tcp.c > >>> +++ b/net/ipv4/tcp.c > >>> @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, > >>> unsigned int offset, size_t len) > >>> { > >>> struct tcp_splice_state *tss = rd_desc->arg.data; > >>> + int ret; > >>> > >>> - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags); > >>> + ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags); > >>> + if (ret > 0) > >>> + rd_desc->count -= ret; > >>> + return ret; > >>> } > >>> > >>> static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) > >>> @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) > >>> /* Store TCP splice context information in read_descriptor_t. */ > >>> read_descriptor_t rd_desc = { > >>> .arg.data = tss, > >>> + .count = tss->len, > >>> }; > >>> > >>> return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); > >>> > >> OK, I came to a different patch. Please check other tcp_read_sock() callers in the tree :) > > > > I've seen the other callers, but they all use desc->count for their own > > purpose. That's how I understood what it was used for :-) > > Ah yes, I reread your patch and you are right. > > > > I think it's better not to change the API here and use tcp_read_sock() > > how it's supposed to be used. Also, the fewer parameters to the function, > > the better. > > > > However I'm OK for the !timeo before release_sock/lock_sock. I just > > don't know if we can put the rest of the if above or not. I don't > > know what changes we're supposed to collect by doing release_sock/ > > lock_sock before the if(). > > Only the (!timeo) can be above. Other conditions must be checked after > the release/lock. Yes, that's what Evgeniy explained too. I smelled something like this but did not know. Care to redo the whole patch, since you already have all code parts at hand as well as some fragments of commit messages? You can even add my Tested-by if you want. Finally it was nice that Dave asked for this explanation because it drove our nose to the fishy parts ;-) Thanks, Willy ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 22:53 ` Willy Tarreau @ 2009-01-09 23:34 ` Eric Dumazet 2009-01-13 5:45 ` David Miller 2009-01-14 0:05 ` David Miller 0 siblings, 2 replies; 190+ messages in thread From: Eric Dumazet @ 2009-01-09 23:34 UTC (permalink / raw) To: Willy Tarreau Cc: David Miller, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe Willy Tarreau wrote: > On Fri, Jan 09, 2009 at 11:45:02PM +0100, Eric Dumazet wrote: >> Only the (!timeo) can be above. Other conditions must be checked after >> the release/lock. > > Yes, that's what Evgeniy explained too. I smelled something like this > but did not know. > > Care to redo the whole patch, since you already have all code parts > at hand as well as some fragments of commit messages? You can even > add my Tested-by if you want. Finally it was nice that Dave asked > for this explanation because it drove our nose to the fishy parts ;-) Sure, here it is: David, do you think we still must call __tcp_splice_read() only once in tcp_splice_read() if SPLICE_F_NONBLOCK is set? With the following patch, a splice() call is limited to 16 frames, typically 16*1460 = 23360 bytes. Removing the test as Willy did in his patch could return the exact length requested by the user (limited to 16 pages), giving nice blocks if feeding a file on disk... Thank you From: Willy Tarreau <w@1wt.eu> [PATCH] tcp: splice as many packets as possible at once As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not optimal. It processes at most one segment per call. This results in low performance and very high overhead due to syscall rate when splicing from interfaces which do not support LRO. Willy provided a patch inside tcp_splice_read(), but a better fix is to let tcp_read_sock() process as many segments as possible, so that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less often. With this change, splice() behaves like tcp_recvmsg(), being able to consume many skbs in one system call. With typical 1460 bytes of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return 16*1460 = 23360 bytes. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index bd6ff90..1233835 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len) { struct tcp_splice_state *tss = rd_desc->arg.data; + int ret; - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags); + ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags); + if (ret > 0) + rd_desc->count -= ret; + return ret; } static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) /* Store TCP splice context information in read_descriptor_t. */ read_descriptor_t rd_desc = { .arg.data = tss, + .count = tss->len, }; return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); @@ -611,11 +616,13 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, tss.len -= ret; spliced += ret; + if (!timeo) + break; release_sock(sk); lock_sock(sk); if (sk->sk_err || sk->sk_state == TCP_CLOSE || - (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo || + (sk->sk_shutdown & RCV_SHUTDOWN) || signal_pending(current)) break; } ^ permalink raw reply related [flat|nested] 190+ messages in thread
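From user space, the effect is easy to picture: a non-blocking forwarding loop such as this purely illustrative one can now move many segments per call instead of a single one (with 16 pipe buffers, typically up to 16*1460 = 23360 bytes):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Drain whatever is ready on a TCP socket into a pipe; each splice()
 * returns 0 on EOF or -1 with EAGAIN once the socket is empty.
 */
static ssize_t drain_socket(int sock_fd, int pipe_wr)
{
	ssize_t total = 0, ret;

	for (;;) {
		ret = splice(sock_fd, NULL, pipe_wr, NULL, 65536,
			     SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
		if (ret <= 0)
			break;
		total += ret;
	}
	return total;
}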
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 23:34 ` Eric Dumazet @ 2009-01-13 5:45 ` David Miller 2009-01-14 0:05 ` David Miller 1 sibling, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-13 5:45 UTC (permalink / raw) To: dada1; +Cc: w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Eric Dumazet <dada1@cosmosbay.com> Date: Sat, 10 Jan 2009 00:34:40 +0100 > David, do you think we still must call __tcp_splice_read() only once > in tcp_splice_read() if SPLICE_F_NONBLOCK is set ? Eric, I'll get to this thread as soon as I can, perhaps tomorrow. I want to get all of the build fallout and bug fixes for 2.6.29-rcX sorted before everyone heads off to LCA in the next week or so :-) ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 23:34 ` Eric Dumazet 2009-01-13 5:45 ` David Miller @ 2009-01-14 0:05 ` David Miller 1 sibling, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-14 0:05 UTC (permalink / raw) To: dada1; +Cc: w, ben, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Eric Dumazet <dada1@cosmosbay.com> Date: Sat, 10 Jan 2009 00:34:40 +0100 > David, do you think we still must call __tcp_splice_read() only once > in tcp_splice_read() if SPLICE_F_NONBLOCK is set ? You seem to be working that out in another thread :-) > [PATCH] tcp: splice as many packets as possible at once > > As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not > optimal. It processes at most one segment per call. > This results in low performance and very high overhead due to syscall rate > when splicing from interfaces which do not support LRO. > > Willy provided a patch inside tcp_splice_read(), but a better fix > is to let tcp_read_sock() process as many segments as possible, so > that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less > often. > > With this change, splice() behaves like tcp_recvmsg(), being able > to consume many skbs in one system call. With typical 1460 bytes > of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return > 16*1460 = 23360 bytes. > > Signed-off-by: Willy Tarreau <w@1wt.eu> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> I've applied this, thanks! ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 15:42 ` Eric Dumazet 2009-01-09 17:57 ` Eric Dumazet 2009-01-09 18:54 ` Willy Tarreau @ 2009-01-13 23:31 ` David Miller 2 siblings, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-13 23:31 UTC (permalink / raw) To: dada1; +Cc: ben, w, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Eric Dumazet <dada1@cosmosbay.com> Date: Fri, 09 Jan 2009 16:42:44 +0100 > David, if you referred to code at line 1374 of net/ipv4/tcp.c, I > believe there is no issue with it. We really want to break from this > loop if !timeo. Correct, I agree, and I gave some detailed analysis of this in another response :-) > Willy's patch makes splice() behave like tcp_recvmsg(), but we might call > tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed) "Like", sure, but not the same since splice() lacks the low-water and backlog checks. > I wonder if the right fix should be done in tcp_read_sock(): this is the > one that should eat several skbs IMHO, if we want optimal ACK generation. > > We break out of its loop at line 1246 > > if (!desc->count) /* this test is always true */ > break; > > (__tcp_splice_read() sets count to 0, right before calling tcp_read_sock()) > > So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least. Yes, that's very odd. ^ permalink raw reply [flat|nested] 190+ messages in thread
* Re: [PATCH] tcp: splice as many packets as possible at once 2009-01-09 6:47 ` Eric Dumazet 2009-01-09 7:04 ` Willy Tarreau 2009-01-09 15:42 ` Eric Dumazet @ 2009-01-13 23:26 ` David Miller 2 siblings, 0 replies; 190+ messages in thread From: David Miller @ 2009-01-13 23:26 UTC (permalink / raw) To: dada1; +Cc: ben, w, jarkao2, mingo, linux-kernel, netdev, jens.axboe From: Eric Dumazet <dada1@cosmosbay.com> Date: Fri, 09 Jan 2009 07:47:16 +0100 > I found this patch useful in my tests, but had a feeling something > was not complete. If the goal is to reduce the number of splice() calls, > we should also reduce the number of wakeups. If splice() is used in > non-blocking mode, nothing we can do here of course, since the application > will use a poll()/select()/epoll() event before calling splice(). A > good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things. Splice read does not handle SO_RCVLOWAT like tcp_recvmsg() does. We should probably add a: target = sock_rcvlowat(sk, flags & MSG_WAITALL, len); and check 'target' against 'spliced' in the main loop of tcp_splice_read(). > About tcp_recvmsg(), we might also remove the "!timeo" test as well, > more testing is needed. But bear in mind that if an application provides > a large buffer to the tcp_recvmsg() call, removing the test will reduce > the number of syscalls but might use more DCACHE. It could reduce > performance on old cpus. With a splice() call, we expect to not > copy memory and trash DCACHE, and pipe buffers being limited to 16, > we cope with a limited working set. I sometimes have a suspicion we can remove this test too, but it's not really that clear. If an application is doing non-blocking reads and they care about latency, they shouldn't be providing huge buffers. This much I agree with, but... If you look at where this check is placed in the recvmsg() case, it is done after we have verified that there is no socket backlog. if (copied >= target && !sk->sk_backlog.tail) break; if (copied) { if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo || signal_pending(current)) break; } else { So either: 1) We haven't met the target. And note that target is one unless the user makes an explicit receive low-water setting. 2) Or there is no backlog. When we get to the 'if (copied)' check. You can view this "!timeo" check as meaning "non-blocking". When we get to it, we are guaranteed that we haven't met the target and we have no backlog. So it is absolutely appropriate to break out of recvmsg() processing here if non-blocking. There is a lot of logic and feature handling in tcp_splice_read() and that's why the semantics of "!timeo" cases are not being handled properly here. ^ permalink raw reply [flat|nested] 190+ messages in thread
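In tcp_splice_read() that suggestion would amount to something like the following fragment (an untested sketch; since splice() has no MSG_WAITALL equivalent, the waitall argument is simply 0 here):

	/* before the receive loop */
	int target = sock_rcvlowat(sk, 0, len);

	/* ... in the loop body, once ret bytes have been spliced ... */
	tss.len -= ret;
	spliced += ret;
	if (spliced >= target)		/* SO_RCVLOWAT satisfied: stop early */
		break;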
end of thread, other threads:[~2009-02-06 18:59 UTC | newest] Thread overview: 190+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-01-08 17:30 [PATCH] tcp: splice as many packets as possible at once Willy Tarreau 2009-01-08 19:44 ` Jens Axboe 2009-01-08 22:03 ` Willy Tarreau 2009-01-08 21:50 ` Ben Mansell 2009-01-08 21:55 ` David Miller 2009-01-08 22:20 ` Willy Tarreau 2009-01-13 23:08 ` David Miller 2009-01-09 6:47 ` Eric Dumazet 2009-01-09 7:04 ` Willy Tarreau 2009-01-09 7:28 ` Eric Dumazet 2009-01-09 7:42 ` Willy Tarreau 2009-01-13 23:27 ` David Miller 2009-01-13 23:35 ` Eric Dumazet 2009-01-09 15:42 ` Eric Dumazet 2009-01-09 17:57 ` Eric Dumazet 2009-01-09 18:54 ` Willy Tarreau 2009-01-09 20:51 ` Eric Dumazet 2009-01-09 21:24 ` Willy Tarreau 2009-01-09 22:02 ` Eric Dumazet 2009-01-09 22:09 ` Willy Tarreau 2009-01-09 22:07 ` Willy Tarreau 2009-01-09 22:12 ` Eric Dumazet 2009-01-09 22:17 ` Willy Tarreau 2009-01-09 22:42 ` Evgeniy Polyakov 2009-01-09 22:50 ` Willy Tarreau 2009-01-09 23:01 ` Evgeniy Polyakov 2009-01-09 23:06 ` Willy Tarreau 2009-01-10 7:40 ` Eric Dumazet 2009-01-11 12:58 ` Evgeniy Polyakov 2009-01-11 13:14 ` Eric Dumazet 2009-01-11 13:35 ` Evgeniy Polyakov 2009-01-11 16:00 ` Eric Dumazet 2009-01-11 16:05 ` Evgeniy Polyakov 2009-01-14 0:07 ` David Miller 2009-01-14 0:13 ` Evgeniy Polyakov 2009-01-14 0:16 ` David Miller 2009-01-14 0:22 ` Evgeniy Polyakov 2009-01-14 0:37 ` David Miller 2009-01-14 3:51 ` Herbert Xu 2009-01-14 4:25 ` David Miller 2009-01-14 7:27 ` David Miller 2009-01-14 8:26 ` Herbert Xu 2009-01-14 8:53 ` Jarek Poplawski 2009-01-14 9:29 ` David Miller 2009-01-14 9:42 ` Jarek Poplawski 2009-01-14 10:06 ` David Miller 2009-01-14 10:47 ` Jarek Poplawski 2009-01-14 11:29 ` Herbert Xu 2009-01-14 11:40 ` Jarek Poplawski 2009-01-14 11:45 ` Jarek Poplawski 2009-01-14 9:54 ` Jarek Poplawski 2009-01-14 10:01 ` Willy Tarreau 2009-01-14 12:06 ` Jarek Poplawski 2009-01-14 12:15 ` Jarek Poplawski 2009-01-14 11:28 ` Herbert Xu 2009-01-15 23:03 ` Willy Tarreau 2009-01-15 23:19 ` David Miller 2009-01-15 23:19 ` Herbert Xu 2009-01-15 23:26 ` David Miller 2009-01-15 23:32 ` Herbert Xu 2009-01-15 23:34 ` David Miller 2009-01-15 23:42 ` Willy Tarreau 2009-01-15 23:44 ` Willy Tarreau 2009-01-15 23:54 ` David Miller 2009-01-19 0:42 ` Willy Tarreau 2009-01-19 3:08 ` Herbert Xu 2009-01-19 3:27 ` David Miller 2009-01-19 6:14 ` Willy Tarreau 2009-01-19 6:19 ` David Miller 2009-01-19 6:45 ` Willy Tarreau 2009-01-19 10:19 ` Herbert Xu 2009-01-19 20:59 ` David Miller 2009-01-19 21:24 ` Herbert Xu 2009-01-25 21:03 ` Willy Tarreau 2009-01-26 7:59 ` Jarek Poplawski 2009-01-26 8:12 ` Willy Tarreau 2009-01-19 8:40 ` Jarek Poplawski 2009-01-19 3:28 ` David Miller 2009-01-19 6:11 ` Willy Tarreau 2009-01-24 21:23 ` Willy Tarreau 2009-01-20 12:01 ` Ben Mansell 2009-01-20 12:11 ` Evgeniy Polyakov 2009-01-20 13:43 ` Ben Mansell 2009-01-20 14:06 ` Jarek Poplawski 2009-01-16 6:51 ` Jarek Poplawski 2009-01-19 6:08 ` David Miller 2009-01-19 6:16 ` David Miller 2009-01-19 10:20 ` Herbert Xu 2009-01-20 8:37 ` Jarek Poplawski 2009-01-20 9:33 ` [PATCH v2] " Jarek Poplawski 2009-01-20 10:00 ` Evgeniy Polyakov 2009-01-20 10:20 ` Jarek Poplawski 2009-01-20 10:31 ` Evgeniy Polyakov 2009-01-20 11:01 ` Jarek Poplawski 2009-01-20 17:16 ` David Miller 2009-01-21 9:54 ` Jarek Poplawski 2009-01-22 9:04 ` [PATCH v3] " Jarek Poplawski 2009-01-26 5:22 ` David Miller 2009-01-27 7:11 ` Herbert Xu 2009-01-27 7:54 ` Jarek Poplawski 2009-01-27 10:09 ` Herbert Xu 
2009-01-27 10:35 ` Jarek Poplawski 2009-01-27 10:57 ` Jarek Poplawski 2009-01-27 11:48 ` Herbert Xu 2009-01-27 12:16 ` Jarek Poplawski 2009-01-27 12:31 ` Jarek Poplawski 2009-01-27 17:06 ` David Miller 2009-01-28 8:10 ` Jarek Poplawski 2009-02-01 8:41 ` David Miller 2009-01-26 8:20 ` [PATCH v2] " Jarek Poplawski 2009-01-26 21:21 ` Evgeniy Polyakov 2009-01-27 6:10 ` David Miller 2009-01-27 7:40 ` Jarek Poplawski 2009-01-30 21:42 ` David Miller 2009-01-30 21:59 ` Willy Tarreau 2009-01-30 22:03 ` David Miller 2009-01-30 22:13 ` Willy Tarreau 2009-01-30 22:15 ` David Miller 2009-01-30 22:16 ` Herbert Xu 2009-02-02 8:08 ` Jarek Poplawski 2009-02-02 8:18 ` David Miller 2009-02-02 8:43 ` Jarek Poplawski 2009-02-03 7:50 ` David Miller 2009-02-03 9:41 ` Jarek Poplawski 2009-02-03 11:10 ` Evgeniy Polyakov 2009-02-03 11:24 ` Herbert Xu 2009-02-03 11:49 ` Evgeniy Polyakov 2009-02-03 11:53 ` Herbert Xu 2009-02-03 12:07 ` Evgeniy Polyakov 2009-02-03 12:12 ` Herbert Xu 2009-02-03 12:18 ` Evgeniy Polyakov 2009-02-03 12:25 ` Willy Tarreau 2009-02-03 12:28 ` Herbert Xu 2009-02-04 0:47 ` David Miller 2009-02-04 6:19 ` Willy Tarreau 2009-02-04 8:12 ` Evgeniy Polyakov 2009-02-04 8:54 ` Willy Tarreau 2009-02-04 8:59 ` Herbert Xu 2009-02-04 9:01 ` David Miller 2009-02-04 9:12 ` Willy Tarreau 2009-02-04 9:15 ` David Miller 2009-02-04 19:19 ` Roland Dreier 2009-02-04 19:28 ` Willy Tarreau 2009-02-04 19:48 ` Jarek Poplawski 2009-02-05 8:32 ` Bill Fink 2009-02-04 9:12 ` David Miller 2009-02-03 12:27 ` Herbert Xu 2009-02-03 13:05 ` david 2009-02-03 12:12 ` Evgeniy Polyakov 2009-02-03 12:18 ` Herbert Xu 2009-02-03 12:30 ` Evgeniy Polyakov 2009-02-03 12:33 ` Herbert Xu 2009-02-03 12:33 ` Nick Piggin 2009-02-04 0:46 ` David Miller 2009-02-04 9:41 ` Benny Amorsen 2009-02-04 12:01 ` Herbert Xu 2009-02-03 12:36 ` Jarek Poplawski 2009-02-03 13:06 ` Evgeniy Polyakov 2009-02-03 13:25 ` Jarek Poplawski 2009-02-03 14:20 ` Evgeniy Polyakov 2009-02-04 0:46 ` David Miller 2009-02-04 8:08 ` Evgeniy Polyakov 2009-02-04 9:23 ` Nick Piggin 2009-02-04 7:56 ` Jarek Poplawski 2009-02-06 7:52 ` David Miller 2009-02-06 8:09 ` Herbert Xu 2009-02-06 9:10 ` Jarek Poplawski 2009-02-06 9:17 ` David Miller 2009-02-06 9:42 ` Jarek Poplawski 2009-02-06 9:49 ` David Miller 2009-02-06 9:23 ` Herbert Xu 2009-02-06 9:51 ` Jarek Poplawski 2009-02-06 10:28 ` Herbert Xu 2009-02-06 10:58 ` Jarek Poplawski 2009-02-06 11:10 ` Willy Tarreau 2009-02-06 11:47 ` Jarek Poplawski 2009-02-06 18:59 ` Jarek Poplawski 2009-02-03 11:38 ` Nick Piggin 2009-01-27 18:42 ` David Miller 2009-01-15 23:32 ` [PATCH] " Willy Tarreau 2009-01-15 23:35 ` David Miller 2009-01-14 0:51 ` Herbert Xu 2009-01-14 1:24 ` David Miller 2009-01-09 22:45 ` Eric Dumazet 2009-01-09 22:53 ` Willy Tarreau 2009-01-09 23:34 ` Eric Dumazet 2009-01-13 5:45 ` David Miller 2009-01-14 0:05 ` David Miller 2009-01-13 23:31 ` David Miller 2009-01-13 23:26 ` David Miller