* Stable regression with 'tcp: allow splice() to build full TSO packets'
@ 2012-05-17 12:18 Willy Tarreau
  2012-05-17 15:01 ` Willy Tarreau
  2012-05-17 20:41 ` Eric Dumazet
  0 siblings, 2 replies; 61+ messages in thread

From: Willy Tarreau @ 2012-05-17 12:18 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

Hi Eric,

I'm facing a regression in stable 3.2.17 and 3.0.31 which is exhibited by
your patch 'tcp: allow splice() to build full TSO packets', which,
unfortunately, I am very interested in !

What I'm observing is that TCP transmits using splice() stall quite quickly
when I'm using pipes larger than 64kB (even 65537 is enough to reliably
observe the stall).

I'm seeing this on haproxy running on a small ARM machine (a dockstar),
which exchanges data through a gig switch with my development PC. The NIC
(mv643xx) doesn't support TSO but has GSO enabled. If I disable GSO, the
problem remains. I can however make the problem disappear by disabling SG
or Tx checksumming. BTW, using recv()/send() instead of splice() also gets
rid of the problem.

I can also reduce the risk of seeing the problem by increasing the default
TCP buffer sizes in tcp_wmem. By default I'm running with 16kB buffers, but
if I increase the output buffer size above the pipe size, the problem
*seems* to disappear, though I can't be certain, since larger buffers
generally mean the problem takes longer to appear, probably because the
buffers don't need to be filled. Still, I'm certain that with 64k TCP
buffers and 128k pipes I'm still seeing it.

With strace, I'm seeing data fill up the pipe, with the splice() call
responsible for pushing the data to the output socket returning -1 EAGAIN.
During this time, the client receives no data.
Something bugs me: I have tested with a dummy server of mine, httpterm, which
uses tee()+splice() to push data out, and it has no problem filling the gig
pipe, and correctly recovers from the EAGAIN :

send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
tee(0x3, 0x6, 0x10000, 0x2)             = 42552
splice(0x5, 0, 0xd, 0, 0xa00000, 0x2)   = 14440
tee(0x3, 0x6, 0x10000, 0x2)             = 13880
splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
...
tee(0x3, 0x6, 0x10000, 0x2)             = 13880
splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = 51100
tee(0x3, 0x6, 0x10000, 0x2)             = 50744
splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2)   = 32120
tee(0x3, 0x6, 0x10000, 0x2)             = 30264
splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
etc...

It's only with haproxy, which uses splice() to copy data between two sockets,
that I'm getting the issue (data forwarded from fd 0xe to fd 0x6) :

16:03:17.797144 pipe([36, 37])          = 0
16:03:17.797318 fcntl64(36, 0x407 /* F_???
*/, 0x20000) = 131072            ## note: fcntl(F_SETPIPE_SZ, 128k)
16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220
16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3)   = 10220
16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0
16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8)  = 7
16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0
16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060
16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3)   = 16060
16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0
16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8)  = 4
16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0
16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520
16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3)   = 17520
16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0
16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8)  = 1
16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0
16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500
16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3)   = 23360
16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378)      = 0   ## note: epoll_ctl(ADD, fd=6, dir=OUT)
16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0
16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8)  = 1
16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0
16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900
16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3)   = -1 EAGAIN (Resource temporarily unavailable)

So output fd 6 hangs here and does not appear anymore until this point, where
I pressed Ctrl-C to stop the test :

16:03:24.740985 gettimeofday({1324652604, 741042}, NULL) = 0
16:03:24.741148 epoll_wait(0x3, 0x99250, 0x16, 0x3e8)  = 7
16:03:24.951762 gettimeofday({1324652604, 951838}, NULL) = 0
16:03:24.951956 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3)   = -1 EPIPE (Broken pipe)

I tried disabling LRO/GRO on the input interface (which happens to be the
same one) to see if fragmentation of input data had any impact on this, but
nothing changes.

Please note that I'm not even certain the patch is the culprit; I suspect
that by improving splice() efficiency, it might make a latent issue more
visible. I have no data to back this feeling, but nothing strikes me in your
patch.

I don't know what I can do to troubleshoot this issue. I don't want to
pollute the list with network captures nor strace outputs, but I have them if
you're interested in verifying a few things.

I have another platform available for a test (Atom + 82574L supporting TSO).
I'll rebuild and boot on this one to see if I observe the same behaviour. If
you have any suggestion about things to check or tweaks to change in the
code, I'm quite open to experiment.

Best regards,
Willy
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 12:18 Stable regression with 'tcp: allow splice() to build full TSO packets' Willy Tarreau
@ 2012-05-17 15:01 ` Willy Tarreau
  2012-05-17 15:43   ` Eric Dumazet
  2012-05-17 19:55   ` David Miller
  2012-05-17 20:41 ` Eric Dumazet
  1 sibling, 2 replies; 61+ messages in thread

From: Willy Tarreau @ 2012-05-17 15:01 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

Hi Eric,

On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> Hi Eric,
>
> I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> exhibited by your patch 'tcp: allow splice() to build full TSO
> packets' which unfortunately I am very interested in !
>
> What I'm observing is that TCP transmits using splice() stall
> quite quickly if I'm using pipes larger than 64kB (even 65537
> is enough to reliably observe the stall).
(...)

I managed to fix the issue, and I really think that the fix makes sense.
I'm appending the patch; could you please review it and, if approved, push
it for inclusion ?

BTW, your patch significantly improves performance here. On this machine I
was reaching at most 515 Mbps of proxied traffic, and with the patch I reach
665 Mbps with the same test ! I think you managed to fix what caused
splice() to always be slightly slower than recv()/send() in my tests with a
number of NICs for a long time !

Thanks,
Willy

-------

From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 16:48:56 +0200
Subject: [PATCH] tcp: force push data out when buffers are missing

Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
significantly improved splice() performance for some workloads but
caused stalls when pipe buffers were larger than socket buffers.

The issue seems to happen when no data can be copied at all due to
lack of buffers, which results in pending data never being pushed.

This change checks whether all pending data has been pushed, and pushes
it when waiting for send buffers.
---
 net/ipv4/tcp.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 80b988f..f6d9e5f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -850,7 +850,7 @@ new_segment:
 wait_for_sndbuf:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
-		if (copied)
+		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
 			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
 
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
-- 
1.7.2.1.45.g54fbc
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 15:01 ` Willy Tarreau
@ 2012-05-17 15:43 ` Eric Dumazet
  2012-05-17 15:56   ` Willy Tarreau
  2012-05-17 19:55 ` David Miller
  1 sibling, 1 reply; 61+ messages in thread

From: Eric Dumazet @ 2012-05-17 15:43 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 17:01 +0200, Willy Tarreau wrote:
> Hi Eric,
>
> On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> > Hi Eric,
> >
> > I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> > exhibited by your patch 'tcp: allow splice() to build full TSO
> > packets' which unfortunately I am very interested in !
> >
> > What I'm observing is that TCP transmits using splice() stall
> > quite quickly if I'm using pipes larger than 64kB (even 65537
> > is enough to reliably observe the stall).
> (...)
>
> I managed to fix the issue and I really think that the fix makes sense.
> I'm appending the patch, please could you review it and if approved,
> push it for inclusion ?
>
> BTW, your patch significantly improves performance here. On this
> machine I was reaching max 515 Mbps of proxied traffic, and with
> the patch I reach 665 Mbps with the same test ! I think you managed
> to fix what caused splice() to always be slightly slower than
> recv/send() for a long time in my tests with a number of NICs !

What NIC do you use exactly ?

With commit 1d0c0b328a6 in net-next
(net: makes skb_splice_bits() aware of skb->head_frag)
you'll be able to get even more speed, if the NIC uses a frag to hold the
frame.

> Thanks,
> Willy
>
> -------
>
> From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 16:48:56 +0200
> Subject: [PATCH] tcp: force push data out when buffers are missing
>
> Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> significantly improved splice() performance for some workloads but
> caused stalls when pipe buffers were larger than socket buffers.

But... it seems this bug was already there ? Only having more data in the
pipe triggers it more often ?

> The issue seems to happen when no data can be copied at all due to
> lack of buffers, which results in pending data never being pushed.
>
> This change checks if all pending data has been pushed or not and
> pushes them when waiting for send buffers.
> ---
>  net/ipv4/tcp.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 80b988f..f6d9e5f 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -850,7 +850,7 @@ new_segment:
>  wait_for_sndbuf:
>  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  wait_for_memory:
> -		if (copied)
> +		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
>  			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
>
>  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)

Can you check you have commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2
( tcp: tcp_sendpages() should call tcp_push() once ) ?
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 15:43 ` Eric Dumazet
@ 2012-05-17 15:56 ` Willy Tarreau
  2012-05-17 16:33   ` Eric Dumazet
  2012-05-17 18:38   ` Stable regression with 'tcp: allow splice() to build full TSO packets' Ben Hutchings
  0 siblings, 2 replies; 61+ messages in thread

From: Willy Tarreau @ 2012-05-17 15:56 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 17:01 +0200, Willy Tarreau wrote:
> > Hi Eric,
> >
> > On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> > > Hi Eric,
> > >
> > > I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> > > exhibited by your patch 'tcp: allow splice() to build full TSO
> > > packets' which unfortunately I am very interested in !
> > >
> > > What I'm observing is that TCP transmits using splice() stall
> > > quite quickly if I'm using pipes larger than 64kB (even 65537
> > > is enough to reliably observe the stall).
> > (...)
> >
> > I managed to fix the issue and I really think that the fix makes sense.
> > I'm appending the patch, please could you review it and if approved,
> > push it for inclusion ?
> >
> > BTW, your patch significantly improves performance here. On this
> > machine I was reaching max 515 Mbps of proxied traffic, and with
> > the patch I reach 665 Mbps with the same test ! I think you managed
> > to fix what caused splice() to always be slightly slower than
> > recv/send() for a long time in my tests with a number of NICs !
>
> What NIC do you use exactly ?

It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
driver is mv643xx. It's the same hardware you find in sheevaplugs,
guruplugs, dreamplugs and almost all ARM-based cheap NAS boxes.

> With commit 1d0c0b328a6 in net-next
> (net: makes skb_splice_bits() aware of skb->head_frag)
> You'll be able to get even more speed, if NIC uses frag to hold frame.

I'm going to check this now, sounds interesting :-)

The NIC does not support TSO, but I've seen an alternate driver for it which
pretends to do TSO and in fact builds header frags so that the NIC is able
to send all frames at once. I think that's already what GSO is doing, but
I'm wondering whether it would be possible to get more speed by doing this
than by relying on GSO to (possibly) split the frags earlier. Note that I'm
not quite skilled on this, so do not hesitate to correct me if I'm saying
stupid things.

(...)

> > Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> > significantly improved splice() performance for some workloads but
> > caused stalls when pipe buffers were larger than socket buffers.
>
> But... it seems this bug was already there ? Only having more data in the
> pipe triggers it more often ?

I honestly don't know. As I said in the initial mail, I suspect this is the
case because I don't see in your patch what could cause the bug, but I
suspect it makes it more frequent. Still, I could not manage to reproduce
the bug without your patch. So maybe we switched from just below a limit to
just over it.

> > The issue seems to happen when no data can be copied at all due to
> > lack of buffers, which results in pending data never being pushed.
> >
> > This change checks if all pending data has been pushed or not and
> > pushes them when waiting for send buffers.
> > ---
> >  net/ipv4/tcp.c | 2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 80b988f..f6d9e5f 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -850,7 +850,7 @@ new_segment:
> >  wait_for_sndbuf:
> >  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> >  wait_for_memory:
> > -		if (copied)
> > +		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
> >  			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
> >
> >  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
>
> Can you check you have commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2
> ( tcp: tcp_sendpages() should call tcp_push() once ) ?

Yes I have it; in fact I'm on -stable (3.0.31), which includes your backport
of the two patches in a single one. It's just a few lines below, out of the
context of this patch.

Best regards,
Willy
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 15:56 ` Willy Tarreau
@ 2012-05-17 16:33 ` Eric Dumazet
  2012-05-17 16:40   ` Willy Tarreau
  2012-05-17 18:38 ` Stable regression with 'tcp: allow splice() to build full TSO packets' Ben Hutchings
  1 sibling, 1 reply; 61+ messages in thread

From: Eric Dumazet @ 2012-05-17 16:33 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
>
> It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> driver is mv643xx. It's the same hardware you find in sheevaplugs,
> guruplugs, dreamplugs and almost all ARM-based cheap NAS boxes.
>
> > With commit 1d0c0b328a6 in net-next
> > (net: makes skb_splice_bits() aware of skb->head_frag)
> > You'll be able to get even more speed, if NIC uses frag to hold frame.
>
> I'm going to check this now, sounds interesting :-)

splice(socket -> pipe) does a copy of the frame from skb->head to a page
fragment.

With the latest patches (net-next), we can provide a frag for skb->head and
avoid this copy in splice(). You would have a true zero-copy socket->socket
mode.

The tg3 driver uses this new thing; mv643xx_eth.c could be changed as well.
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 16:33 ` Eric Dumazet
@ 2012-05-17 16:40 ` Willy Tarreau
  2012-05-17 16:47   ` Eric Dumazet
  2012-05-17 16:49   ` Eric Dumazet
  0 siblings, 2 replies; 61+ messages in thread

From: Willy Tarreau @ 2012-05-17 16:40 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Thu, May 17, 2012 at 06:33:24PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> >
> > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> > driver is mv643xx. It's the same hardware you find in sheevaplugs,
> > guruplugs, dreamplugs and almost all ARM-based cheap NAS boxes.
> >
> > > With commit 1d0c0b328a6 in net-next
> > > (net: makes skb_splice_bits() aware of skb->head_frag)
> > > You'll be able to get even more speed, if NIC uses frag to hold frame.
> >
> > I'm going to check this now, sounds interesting :-)
>
> splice(socket -> pipe) does a copy of frame from skb->head to a page
> fragment.
>
> With latest patches (net-next), we can provide a frag for skb->head and
> avoid this copy in splice(). You would have a true zero copy
> socket->socket mode.

That's what I've read in your commit messages. Indeed, we've been waiting
for this to happen for a long time. It should bring a real benefit on small
devices like the one I'm playing with, which have very low memory bandwidth,
because it's a shame to copy the data and thrash the cache when we don't
need to see them.

I'm currently trying to get mainline to boot, then I'll switch to net-next.

> tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well.

I'll try to port your patch (8d4057 "tg3: provide frags as skb head") to
mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
send a patch :-)

Cheers,
Willy
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 16:40 ` Willy Tarreau
@ 2012-05-17 16:47 ` Eric Dumazet
  2012-05-17 16:49 ` Eric Dumazet
  1 sibling, 0 replies; 61+ messages in thread

From: Eric Dumazet @ 2012-05-17 16:47 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote:
> On Thu, May 17, 2012 at 06:33:24PM +0200, Eric Dumazet wrote:
> > On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> > > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> > >
> > > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> > > driver is mv643xx. It's the same hardware you find in sheevaplugs,
> > > guruplugs, dreamplugs and almost all ARM-based cheap NAS boxes.
> > >
> > > > With commit 1d0c0b328a6 in net-next
> > > > (net: makes skb_splice_bits() aware of skb->head_frag)
> > > > You'll be able to get even more speed, if NIC uses frag to hold frame.
> > >
> > > I'm going to check this now, sounds interesting :-)
> >
> > splice(socket -> pipe) does a copy of frame from skb->head to a page
> > fragment.
> >
> > With latest patches (net-next), we can provide a frag for skb->head and
> > avoid this copy in splice(). You would have a true zero copy
> > socket->socket mode.
>
> That's what I've read in your commit messages. Indeed, we've been waiting
> for this to happen for a long time. It should bring a real benefit on small
> devices like the one I'm playing with which have very low memory bandwidth,
> because it's a shame to copy the data and thrash the cache when we don't
> need to see them.
>
> I'm currently trying to get mainline to boot then I'll switch to net-next.
>
> > tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well.
>
> I'll try to port your patch (8d4057 tg3: provide frags as skb head) to
> mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
> send a patch :-)

In fact I could post a generic patch to make netdev_alloc_skb() do this
automatically for all network drivers. It's probably less than 30 lines of
code.
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 16:40 ` Willy Tarreau
  2012-05-17 16:47   ` Eric Dumazet
@ 2012-05-17 16:49 ` Eric Dumazet
  2012-05-17 17:22   ` Willy Tarreau
  2012-05-17 17:34   ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
  1 sibling, 2 replies; 61+ messages in thread

From: Eric Dumazet @ 2012-05-17 16:49 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote:
> try to port your patch (8d4057 tg3: provide frags as skb head) to
> mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
> send a patch :-)

tg3 uses build_skb(), that's an API that allocates the skb after the frame
is filled by the device, to have fewer cache misses in the RX path.

But for most devices, netdev_alloc_skb() would be more than enough.

I'll send a patch once tested.

Thanks
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 16:49 ` Eric Dumazet
@ 2012-05-17 17:22 ` Willy Tarreau
  2012-05-17 17:34 ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
  1 sibling, 0 replies; 61+ messages in thread

From: Willy Tarreau @ 2012-05-17 17:22 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Thu, May 17, 2012 at 06:49:47PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote:
> > try to port your patch (8d4057 tg3: provide frags as skb head) to
> > mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
> > send a patch :-)
>
> tg3 uses build_skb(), thats an API that allocates the skb after frame is
> filled by the device, to have less cache misses in RX path.
>
> But for most devices, netdev_alloc_skb() would be more than enough.

OK, so that's out of reach for me right now.

> I'll send a patch once tested.

I'll happily give it a try.

BTW, net-next gave me around a 13-14% improvement over 3.0/3.2 + the splice
patches ! Now I'm seeing splice() being 75% faster than recv/send(), while
it used to be slower on this hardware not that long ago ! That's really
impressive, great work !

Willy
* [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 16:49 ` Eric Dumazet
  2012-05-17 17:22   ` Willy Tarreau
@ 2012-05-17 17:34 ` Eric Dumazet
  2012-05-17 17:45   ` Willy Tarreau
    ` (2 more replies)
  1 sibling, 3 replies; 61+ messages in thread

From: Eric Dumazet @ 2012-05-17 17:34 UTC (permalink / raw)
To: Willy Tarreau, David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Please note I haven't tested this patch yet, lacking hardware for it.

(tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
ixgbe uses fragments...)

Any volunteer ?

Thanks

[PATCH net-next] net: netdev_alloc_skb() use build_skb()

netdev_alloc_skb() is used by network drivers in their RX path to
allocate an skb to receive an incoming frame.

With the recent skb->head_frag infrastructure, it makes sense to change
netdev_alloc_skb() to use build_skb() and a frag allocator.

This permits a zero copy splice(socket->pipe), and better GRO or TCP
coalescing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/skbuff.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2a18719..c02a8ec 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -293,6 +293,12 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 }
 EXPORT_SYMBOL(build_skb);
 
+struct netdev_alloc_cache {
+	struct page *page;
+	unsigned int offset;
+};
+static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
+
 /**
  * __netdev_alloc_skb - allocate an skbuff for rx on a specific device
  * @dev: network device to receive on
@@ -310,8 +316,32 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 		unsigned int length, gfp_t gfp_mask)
 {
 	struct sk_buff *skb;
+	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
+			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
+	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+		struct netdev_alloc_cache *nc;
+		void *data = NULL;
+
+		nc = &get_cpu_var(netdev_alloc_cache);
+		if (!nc->page) {
+refill:			nc->page = alloc_page(gfp_mask);
+			nc->offset = 0;
+		}
+		if (likely(nc->page)) {
+			if (nc->offset + fragsz > PAGE_SIZE) {
+				put_page(nc->page);
+				goto refill;
+			}
+			data = page_address(nc->page) + nc->offset;
+			nc->offset += fragsz;
+			get_page(nc->page);
+		}
+		put_cpu_var(netdev_alloc_cache);
+		skb = data ? build_skb(data, fragsz) : NULL;
+	} else {
+		skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
+	}
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 17:34 ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
@ 2012-05-17 17:45 ` Willy Tarreau
  2012-06-04 12:39   ` Michael S. Tsirkin
  2012-05-17 19:53 ` David Miller
  2012-06-04 12:37 ` Michael S. Tsirkin
  2 siblings, 1 reply; 61+ messages in thread

From: Willy Tarreau @ 2012-05-17 17:45 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev

On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Please note I haven't tested this patch yet, lacking hardware for it.
>
> (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
> ixgbe uses fragments...)
>
> Any volunteer ?
>
> Thanks
>
> [PATCH net-next] net: netdev_alloc_skb() use build_skb()
>
> netdev_alloc_skb() is used by network drivers in their RX path to
> allocate an skb to receive an incoming frame.
>
> With the recent skb->head_frag infrastructure, it makes sense to change
> netdev_alloc_skb() to use build_skb() and a frag allocator.
>
> This permits a zero copy splice(socket->pipe), and better GRO or TCP
> coalescing.

Impressed !

For the first time I could proxy HTTP traffic at gigabit speed on this
little box powered by USB ! I've long believed that proper splicing would
make this possible, and now I'm seeing it is. Congrats Eric !

I'm still observing some stalls on medium-sized files (eg: 100k, smaller
than the pipe size; I don't know yet if there is any relation). I'll check
closer and try to report something more exploitable.

Cheers,
Willy
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 17:45 ` Willy Tarreau
@ 2012-06-04 12:39 ` Michael S. Tsirkin
  2012-06-04 12:44   ` Willy Tarreau
  0 siblings, 1 reply; 61+ messages in thread

From: Michael S. Tsirkin @ 2012-06-04 12:39 UTC (permalink / raw)
To: Willy Tarreau; +Cc: Eric Dumazet, David Miller, netdev

On Thu, May 17, 2012 at 07:45:51PM +0200, Willy Tarreau wrote:
> Impressed !
>
> For the first time I could proxy HTTP traffic at gigabit speed on this
> little box powered by USB ! I've long believed that proper splicing
> would make this possible and now I'm seeing it is. Congrats Eric !

Which userspace do you use? Anything I can try?
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 12:39 ` Michael S. Tsirkin
@ 2012-06-04 12:44 ` Willy Tarreau
  0 siblings, 0 replies; 61+ messages in thread

From: Willy Tarreau @ 2012-06-04 12:44 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Eric Dumazet, David Miller, netdev

On Mon, Jun 04, 2012 at 03:39:12PM +0300, Michael S. Tsirkin wrote:
> On Thu, May 17, 2012 at 07:45:51PM +0200, Willy Tarreau wrote:
> > Impressed !
> >
> > For the first time I could proxy HTTP traffic at gigabit speed on this
> > little box powered by USB ! I've long believed that proper splicing
> > would make this possible and now I'm seeing it is. Congrats Eric !
>
> which userspace do you use?

It's haproxy-1.5-dev with splicing enabled.

> anything I can try?

Yes, feel free to download -dev11, build it for kernels >= 2.6.28 and make a
small config to relay TCP/HTTP to another host. Of course you need a
gigabit-capable client and server.

Willy
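For anyone wanting to reproduce the test, a minimal haproxy TCP relay
configuration might look like the sketch below. The addresses, port, and
names are made up; `option splice-request`/`option splice-response` are the
1.5-dev directives that enable splice() forwarding, per the haproxy
documentation:

```
defaults
    mode tcp
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend relay
    bind :8000
    option splice-request
    option splice-response
    default_backend servers

backend servers
    server s1 192.168.0.2:80
```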
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 17:34 ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
  2012-05-17 17:45   ` Willy Tarreau
@ 2012-05-17 19:53 ` David Miller
  2012-05-18  4:41   ` Eric Dumazet
  2012-06-04 12:37 ` Michael S. Tsirkin
  2 siblings, 1 reply; 61+ messages in thread

From: David Miller @ 2012-05-17 19:53 UTC (permalink / raw)
To: eric.dumazet; +Cc: w, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 17 May 2012 19:34:16 +0200

> [PATCH net-next] net: netdev_alloc_skb() use build_skb()
>
> netdev_alloc_skb() is used by network drivers in their RX path to
> allocate an skb to receive an incoming frame.
>
> With the recent skb->head_frag infrastructure, it makes sense to change
> netdev_alloc_skb() to use build_skb() and a frag allocator.
>
> This permits a zero copy splice(socket->pipe), and better GRO or TCP
> coalescing.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, we can sort out any fallout very easily before 3.5 is released.

Awesome work Eric.
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 19:53 ` David Miller
@ 2012-05-18  4:41 ` Eric Dumazet
  0 siblings, 0 replies; 61+ messages in thread

From: Eric Dumazet @ 2012-05-18 4:41 UTC (permalink / raw)
To: David Miller; +Cc: w, netdev

On Thu, 2012-05-17 at 15:53 -0400, David Miller wrote:
> Applied, we can sort out any fallout very easily before 3.5 is released.
>
> Awesome work Eric.

Thanks David

I'll send a followup patch to fix some issues :

- must be IRQ safe (non-NAPI drivers can use it)
- must not leak the frag if build_skb() fails to allocate the sk_buff
- factorize code so that dev_alloc_skb()/__dev_alloc_skb() are a wrapper
  around __netdev_alloc_skb()
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 17:34 ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
  2012-05-17 17:45   ` Willy Tarreau
  2012-05-17 19:53   ` David Miller
@ 2012-06-04 12:37 ` Michael S. Tsirkin
  2012-06-04 13:06   ` Eric Dumazet
  2 siblings, 1 reply; 61+ messages in thread

From: Michael S. Tsirkin @ 2012-06-04 12:37 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Please note I haven't tested this patch yet, lacking hardware for it.
>
> (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
> ixgbe uses fragments...)

virtio-net uses netdev_alloc_skb, but maybe it should call build_skb
instead?

Also, it's not uncommon for drivers to copy short packets out to be able to
reuse pages. virtio does this, but I am guessing the logic is not really
virtio-specific.

We could do:

	if (len < GOOD_COPY_LEN)
		netdev_alloc_skb
		memmove
	else
		build_skb

but maybe it makes sense to put this logic in build_skb?

-- 
MST
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 12:37 ` Michael S. Tsirkin @ 2012-06-04 13:06 ` Eric Dumazet 2012-06-04 13:41 ` Michael S. Tsirkin 0 siblings, 1 reply; 61+ messages in thread From: Eric Dumazet @ 2012-06-04 13:06 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Mon, 2012-06-04 at 15:37 +0300, Michael S. Tsirkin wrote: > On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote: > > From: Eric Dumazet <edumazet@google.com> > > > > Please note I havent tested yet this patch, lacking hardware for this. > > > > (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames, > > ixgbe uses fragments...) > > virtio-net uses netdev_alloc_skb but maybe it should call > build_skb instead? > > Also, it's not uncommon for drivers to copy short packets out to be able > to reuse pages. virtio does this but I am guessing the logic is not > really virtio specific. > > We could do > if (len < GOOD_COPY_LEN) > netdev_alloc_skb > memmov > else > build_skb > > but maybe it makes sense to put this logic in build_skb? > > I am not sure I understand the question. If virtio-net uses netdev_alloc_skb(), all is good, you have nothing to change. build_skb() is for drivers that allocate the memory to hold the frame, and wait for NIC completion before allocating/populating the skb itself. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 13:06 ` Eric Dumazet @ 2012-06-04 13:41 ` Michael S. Tsirkin 2012-06-04 14:01 ` Eric Dumazet 0 siblings, 1 reply; 61+ messages in thread From: Michael S. Tsirkin @ 2012-06-04 13:41 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev On Mon, Jun 04, 2012 at 03:06:53PM +0200, Eric Dumazet wrote: > On Mon, 2012-06-04 at 15:37 +0300, Michael S. Tsirkin wrote: > > On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote: > > > From: Eric Dumazet <edumazet@google.com> > > > > > > Please note I havent tested yet this patch, lacking hardware for this. > > > > > > (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames, > > > ixgbe uses fragments...) > > > > virtio-net uses netdev_alloc_skb but maybe it should call > > build_skb instead? > > > > Also, it's not uncommon for drivers to copy short packets out to be able > > to reuse pages. virtio does this but I am guessing the logic is not > > really virtio specific. > > > > We could do > > if (len < GOOD_COPY_LEN) > > netdev_alloc_skb > > memmov > > else > > build_skb > > > > but maybe it makes sense to put this logic in build_skb? > > > > > > I am not sure to understand the question. > > If virtio-net uses netdev_alloc_skb(), all is good, you have nothing to > change. > > build_skb() is for drivers that allocate the memory to hold frame, and > wait for NIC completion before allocating/populating the skb itself. > This is generally what virtio does, take a look: page_to_skb fills the first fragment and receive_mergeable fills the rest (other modes are for legacy hardware). 
The way the hypervisor now works is this (we call it mergeable buffers): - pages are passed to hardware - the hypervisor puts virtio-specific stuff in the first 12 bytes of the first page - following this, the rest of the first page and all following pages have data The driver gets the 1st page, allocates the skb, copies out the 12-byte header and copies the first 128 bytes of data into the skb. The rest, if any, is populated by the pages. So I guess I'm asking for advice: would it make sense to switch to build_skb and how best to handle the data copying above? Maybe it would help if we changed the hypervisor to write the 12 bytes separately? -- MST ^ permalink raw reply [flat|nested] 61+ messages in thread
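The copy-short-frames-or-build-an-skb decision discussed above can be written out as a tiny standalone model. Note that GOOD_COPY_LEN's value, the enum, and choose_rx_path() are all illustrative names for this sketch, not kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical threshold mirroring virtio-net's copy-out heuristic:
 * frames shorter than this get copied into a freshly allocated skb so
 * the receive page can be reused immediately; longer frames would be
 * turned into the skb head directly (build_skb() in a real driver). */
#define GOOD_COPY_LEN 128

enum rx_path { RX_COPY_SKB, RX_BUILD_SKB };

/* Decide which RX path a frame of @len bytes should take. */
static enum rx_path choose_rx_path(size_t len)
{
    return (len < GOOD_COPY_LEN) ? RX_COPY_SKB : RX_BUILD_SKB;
}
```

In a real driver the RX_COPY_SKB branch would pair netdev_alloc_skb() with a memmove() of the payload, and the RX_BUILD_SKB branch would hand the page over to build_skb().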
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 13:41 ` Michael S. Tsirkin @ 2012-06-04 14:01 ` Eric Dumazet 2012-06-04 14:09 ` Eric Dumazet 2012-06-04 14:17 ` Michael S. Tsirkin 0 siblings, 2 replies; 61+ messages in thread From: Eric Dumazet @ 2012-06-04 14:01 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Mon, 2012-06-04 at 16:41 +0300, Michael S. Tsirkin wrote: > This is generally what virtio does, take a look: > page_to_skb fills the first fragment and receive_mergeable fills the > rest (other modes are for legacy hardware). > > The way hypervisor now works is this (we call it mergeable buffers): > > - pages are passed to hardware > - hypervisor puts virtio specific stuff in first 12 bytes > on first page > - following this, the rest of the first page and all following > pages have data > > The driver gets the 1st page, allocates the skb, copies out the 12 byte > header and copies the first 128 bytes of data into skb. > The rest if any is populated by the pages. > > So I guess I'm asking for advice, would it make sense to switch to build_skb > and how best to handle the data copying above? Maybe it would help > if we changed the hypervisor to write the 12 bytes separately? > Thanks for these details. Not sure 12 bytes of headroom would be enough (instead of the NET_SKB_PAD reserved in netdev_alloc_skb_ip_align()), but what could be done is to use the first page as the skb->head, i.e. using build_skb(), removing one fragment, one (small) copy and one {put|get}_page() pair. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 14:01 ` Eric Dumazet @ 2012-06-04 14:09 ` Eric Dumazet 2012-06-04 14:17 ` Michael S. Tsirkin 1 sibling, 0 replies; 61+ messages in thread From: Eric Dumazet @ 2012-06-04 14:09 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Mon, 2012-06-04 at 16:01 +0200, Eric Dumazet wrote: > Not sure 12 bytes of headroom would be enough (instead of the > NET_SKB_PAD reserved in netdev_alloc_skb_ip_align(), but what could be > done indeed is to use the first page as the skb->head, so using > build_skb() indeed, removing one fragment, one (small) copy and one > {put|get}_page() pair. It would also avoid 'pulling' tcp data payload in linear part. page_to_skb() does : copy = len; if (copy > skb_tailroom(skb)) copy = skb_tailroom(skb); memcpy(skb_put(skb, copy), p, copy); This means GRO or TCP coalescing (or splice()) has to handle two segments to fetch data. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 14:01 ` Eric Dumazet 2012-06-04 14:09 ` Eric Dumazet @ 2012-06-04 14:17 ` Michael S. Tsirkin 2012-06-04 15:01 ` Eric Dumazet 1 sibling, 1 reply; 61+ messages in thread From: Michael S. Tsirkin @ 2012-06-04 14:17 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev On Mon, Jun 04, 2012 at 04:01:41PM +0200, Eric Dumazet wrote: > On Mon, 2012-06-04 at 16:41 +0300, Michael S. Tsirkin wrote: > > > This is generally what virtio does, take a look: > > page_to_skb fills the first fragment and receive_mergeable fills the > > rest (other modes are for legacy hardware). > > > > The way hypervisor now works is this (we call it mergeable buffers): > > > > - pages are passed to hardware > > - hypervisor puts virtio specific stuff in first 12 bytes > > on first page > > - following this, the rest of the first page and all following > > pages have data > > > > The driver gets the 1st page, allocates the skb, copies out the 12 byte > > header and copies the first 128 bytes of data into skb. > > The rest if any is populated by the pages. > > > > So I guess I'm asking for advice, would it make sense to switch to build_skb > > and how best to handle the data copying above? Maybe it would help > > if we changed the hypervisor to write the 12 bytes separately? > > > > Thanks for these details. > > Not sure 12 bytes of headroom would be enough (instead of the > NET_SKB_PAD reserved in netdev_alloc_skb_ip_align(), but what could be > done indeed is to use the first page as the skb->head, so using > build_skb() indeed, removing one fragment, one (small) copy and one > {put|get}_page() pair. > bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD after build_skb. You are saying it's not a must? Hmm so maybe we should teach the hypervisor to write data out at an offset. Interesting. Another question is about very small packets truesize. 
build_skb sets truesize to frag_size but isn't this too small? We keep the whole page around, no? ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 14:17 ` Michael S. Tsirkin @ 2012-06-04 15:01 ` Eric Dumazet 2012-06-04 17:20 ` Michael S. Tsirkin 2012-06-04 18:16 ` Michael S. Tsirkin 0 siblings, 2 replies; 61+ messages in thread From: Eric Dumazet @ 2012-06-04 15:01 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Mon, 2012-06-04 at 17:17 +0300, Michael S. Tsirkin wrote: > bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD > after build_skb. You are saying it's not a must? > 32 would be the minimum. NET_SKB_PAD is using a cache line (64 bytes on most current x86 cpus) to avoid using half a cache line. > Hmm so maybe we should teach the hypervisor to write data > out at an offset. Interesting. > > Another question is about very small packets truesize. > build_skb sets truesize to frag_size but isn't > this too small? We keep the whole page around, no? We keep one page per cpu, at most. For example on MTU=1500 and PAGE_SIZE=4096, one page is split into two fragments of 1500 + NET_SKB_PAD + align(shared_info), so it's good enough (this is very close to the 2048 'truesize') But yes, for some uses (wifi for example), we might use a full page per skb, yet underestimate skb->truesize. Hopefully we can track these uses and fix them. ath9k for example could be changed, to be able to reassemble up to 3 frags instead of 2 frags, ie extending what did commit 0d95521ea74735826cb2e28bebf6a07392c75bfa (ath9k: use split rx buffers to get rid of order-1 skb allocations) ^ permalink raw reply [flat|nested] 61+ messages in thread
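The "two fragments per 4096-byte page" arithmetic above can be checked with a small model. The constants here (64-byte cache lines, a 64-byte NET_SKB_PAD, a roughly 320-byte aligned skb_shared_info on 64-bit) are the values quoted in this thread, not pulled from kernel headers:

```c
#include <assert.h>

/* Values as quoted in the thread; real kernels derive these from
 * SMP_CACHE_BYTES and sizeof(struct skb_shared_info). */
#define SMP_CACHE_BYTES   64
#define NET_SKB_PAD       64
#define SHINFO_SIZE       320  /* aligned skb_shared_info size, approx. */
#define SKB_DATA_ALIGN(x) (((x) + (SMP_CACHE_BYTES - 1)) & ~(SMP_CACHE_BYTES - 1))

/* Per-buffer footprint of one receive fragment: headroom + MTU worth
 * of payload, cache-line aligned, plus the shared info at the tail. */
static unsigned int rx_frag_footprint(unsigned int mtu)
{
    return SKB_DATA_ALIGN(NET_SKB_PAD + mtu) + SKB_DATA_ALIGN(SHINFO_SIZE);
}
```

With these numbers, rx_frag_footprint(1500) comes out at 1920 bytes, so two receive buffers fit in one 4096-byte page, consistent with the "very close to 2048 'truesize'" figure Eric gives.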
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 15:01 ` Eric Dumazet @ 2012-06-04 17:20 ` Michael S. Tsirkin 2012-06-04 17:44 ` Eric Dumazet 2012-06-04 18:16 ` Michael S. Tsirkin 1 sibling, 1 reply; 61+ messages in thread From: Michael S. Tsirkin @ 2012-06-04 17:20 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev On Mon, Jun 04, 2012 at 05:01:04PM +0200, Eric Dumazet wrote: > On Mon, 2012-06-04 at 17:17 +0300, Michael S. Tsirkin wrote: > > > bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD > > after build_skb. You are saying it's not a must? > > > > 32 would be the minimum. NETS_SKB_PAD is using a cache line (64 bytes > on most x86 current cpus) to avoid using half a cache line. OK so if we want to use build_skb we need to pad by at least 32 - 12 bytes or more likely 64-12. We lose 52 bytes per page but save a copy of 128 bytes for medium sized packets. > > Hmm so maybe we should teach the hypervisor to write data > > out at an offset. Interesting. > > > > Another question is about very small packets truesize. > > build_skb sets truesize to frag_size but isn't > > this too small? We keep the whole page around, no? > > We keep one page per cpu, at most. > > For example on MTU=1500 and PAGE_SIZE=4096, one page is splitted into > two fragments, of 1500 + NET_SKB_PAD + align(shared_info), so its good > enough (this is very close from 2048 'truesize') I see. virtio allocates full 4K pages which works well for GSO but it means smaller packets would get large truesize wasting lots of memory. So maybe we should keep copying packets < 128 bytes. > But yes, for some uses (wifi for example), we might use a full page per > skb, yet underestimate skb->truesize. Hopefully we can track these uses > and fix them. 
> > ath9k for example could be changed, to be able to reassemble up to 3 > frags instead of 2 frags, ie extending what did commit > 0d95521ea74735826cb2e28bebf6a07392c75bfa (ath9k: use split rx > buffers to get rid of order-1 skb allocations) > > ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 17:20 ` Michael S. Tsirkin @ 2012-06-04 17:44 ` Eric Dumazet 2012-06-04 18:16 ` Michael S. Tsirkin 0 siblings, 1 reply; 61+ messages in thread From: Eric Dumazet @ 2012-06-04 17:44 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Mon, 2012-06-04 at 20:20 +0300, Michael S. Tsirkin wrote: > I see. virtio allocates full 4K pages which works well for GSO > but it means smaller packets would get large truesize > wasting lots of memory. > So maybe we should keep copying packets < 128 bytes. Yes, but restrict the copy to the bare minimum if > 128 bytes. See the following commit for details (actual changelog bigger than the patch itself) commit 56138f50d1900b0c3d8647376e37b488b23ba53d Author: Eric Dumazet <edumazet@google.com> Date: Fri May 18 04:48:33 2012 +0000 iwlwifi: dont pull too much payload in skb head As iwlwifi uses fat skbs, it should not pull too much data in skb->head, and particularly no tcp data payload, or splice() is slower, and TCP coalescing is disabled. Copying payload to userland also involves at least two copies (part from header, part from fragment) Each layer will pull its header from the fragment as needed. (on 64bit arches, skb_tailroom(skb) at this point is 192 bytes) With this patch applied, I have a major reduction of collapsed/pruned TCP packets, a nice increase of TCPRcvCoalesce counter, and overall better Internet User experience. Small packets are still using a fragless skb, so that page can be reused by the driver. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 17:44 ` Eric Dumazet @ 2012-06-04 18:16 ` Michael S. Tsirkin 2012-06-04 19:24 ` Eric Dumazet 0 siblings, 1 reply; 61+ messages in thread From: Michael S. Tsirkin @ 2012-06-04 18:16 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev On Mon, Jun 04, 2012 at 07:44:50PM +0200, Eric Dumazet wrote: > On Mon, 2012-06-04 at 20:20 +0300, Michael S. Tsirkin wrote: > > > I see. virtio allocates full 4K pages which works well for GSO > > but it means smaller packets would get large truesize > > wasting lots of memory. > > So maybe we should keep copying packets < 128 bytes. > > Yet, but restrict the copy to bare minimum if > 128 bytes > > See following commit for details (actual changelog bigger than patch > itself) > > commit 56138f50d1900b0c3d8647376e37b488b23ba53d > Author: Eric Dumazet <edumazet@google.com> > Date: Fri May 18 04:48:33 2012 +0000 > > iwlwifi: dont pull too much payload in skb head > > As iwlwifi use fat skbs, it should not pull too much data in skb->head, > and particularly no tcp data payload, or splice() is slower, and TCP > coalescing is disabled. Copying payload to userland also involves at > least two copies (part from header, part from fragment) > > Each layer will pull its header from the fragment as needed. > > (on 64bit arches, skb_tailroom(skb) at this point is 192 bytes) > > With this patch applied, I have a major reduction of collapsed/pruned > TCP packets, a nice increase of TCPRcvCoalesce counter, and overall > better Internet User experience. > > Small packets are still using a fragless skb, so that page can be reused > by the driver. > Will take a look, thanks. By the way, this comment at build_skb: * @frag_size: size of fragment, or 0 if head was kmalloced is not very clear to me. Could you clarify what exactly size of fragment means in this context? ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 18:16 ` Michael S. Tsirkin @ 2012-06-04 19:24 ` Eric Dumazet 2012-06-04 19:48 ` Michael S. Tsirkin 0 siblings, 1 reply; 61+ messages in thread From: Eric Dumazet @ 2012-06-04 19:24 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote: > Will take a look, thanks. > By the way, this comment at build_skb: > * @frag_size: size of fragment, or 0 if head was kmalloced > > is not very clear to me. Could you clarify what exactly size > of fragment means in this context? > If your driver did: data = kmalloc(100), then you use @frag_size=0, so that build_skb() does the ksize(data) to fetch the real size (it can depend on the slab/slub/slob allocator). If you used netdev_alloc_frag(128), then you use 128 because there is no way build_skb() can guess the size of the fragment. It's also how we signal to build_skb() that skb->head_frag is set to 1. __netdev_alloc_skb() for example does : void *data = netdev_alloc_frag(fragsz); skb = build_skb(data, fragsz); ^ permalink raw reply [flat|nested] 61+ messages in thread
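The @frag_size convention explained above can be modelled in a few lines. Everything here is illustrative: fake_ksize() merely stands in for the real ksize(), and the struct and function names are made up for this sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for ksize(): slab allocators typically round a kmalloc()
 * request up to a power-of-two size class. Not the real kernel code. */
static size_t fake_ksize(size_t requested)
{
    size_t sz = 8;
    while (sz < requested)
        sz <<= 1;
    return sz;
}

struct head_info {
    size_t size;      /* usable size of the skb head */
    int    head_frag; /* 1 if head came from a page fragment */
};

/* Model of how build_skb() interprets @frag_size: 0 means "head was
 * kmalloc()ed, ask the allocator for the real size"; non-zero means
 * "head is a page fragment of exactly this size, set skb->head_frag". */
static struct head_info interpret_frag_size(size_t frag_size,
                                            size_t kmalloc_req)
{
    struct head_info hi;

    if (frag_size) {
        hi.size = frag_size;  /* caller-provided fragment size */
        hi.head_frag = 1;
    } else {
        hi.size = fake_ksize(kmalloc_req);  /* allocator's actual size */
        hi.head_frag = 0;
    }
    return hi;
}
```

So the netdev_alloc_frag(128) example in the mail corresponds to interpret_frag_size(128, ...), while a kmalloc(100) head corresponds to interpret_frag_size(0, 100), where the allocator may quietly give back a 128-byte object.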
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 19:24 ` Eric Dumazet @ 2012-06-04 19:48 ` Michael S. Tsirkin 2012-06-04 19:56 ` Eric Dumazet 0 siblings, 1 reply; 61+ messages in thread From: Michael S. Tsirkin @ 2012-06-04 19:48 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev On Mon, Jun 04, 2012 at 09:24:02PM +0200, Eric Dumazet wrote: > On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote: > > > Will take a look, thanks. > > By the way, this comment at build_skb: > > * @frag_size: size of fragment, or 0 if head was kmalloced > > > > is not very clear to me. Could you clarify what exactly size > > of fragment means in this context? > > > > > If your driver did : > > data = kmalloc(100) then you use @frag_size=0, so that build_skb() does > the ksize(data) to fetch real size (It can depend on slab/slub/sob > allocator) > > > If you used netdev_alloc_frag(128), then you use 128 because there is no > way build_skb() can guess the size of the fragment. Its also how we > signal to build_skb() that skb->head_frag is set to 1. > > __netdev_alloc_skb() for example does : > > void *data = netdev_alloc_frag(fragsz); > skb = build_skb(data, fragsz); > If I do this what will truesize be? 128, no? ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 19:48 ` Michael S. Tsirkin @ 2012-06-04 19:56 ` Eric Dumazet 2012-06-04 21:20 ` Michael S. Tsirkin 0 siblings, 1 reply; 61+ messages in thread From: Eric Dumazet @ 2012-06-04 19:56 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Mon, 2012-06-04 at 22:48 +0300, Michael S. Tsirkin wrote: > If I do this what will truesize be? 128, no? My example was not correct, since you must have enough room for the SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) ( 320 bytes ) So it would be 128 + 320 = 448 ^ permalink raw reply [flat|nested] 61+ messages in thread
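The 128 + 320 = 448 figure follows directly from adding the aligned shared-info area to the fragment size. A minimal check, using the 320-byte SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) value quoted in the mail rather than the real kernel definition:

```c
#include <assert.h>

#define SMP_CACHE_BYTES   64
#define SHINFO_SIZE       320  /* aligned skb_shared_info size, per the thread */
#define SKB_DATA_ALIGN(x) (((x) + (SMP_CACHE_BYTES - 1)) & ~(SMP_CACHE_BYTES - 1))

/* truesize of an skb whose head is a page fragment of @fragsz bytes:
 * the fragment itself plus room for the shared info at the tail. */
static unsigned int build_skb_truesize(unsigned int fragsz)
{
    return fragsz + SKB_DATA_ALIGN(SHINFO_SIZE);
}
```

This is why a 128-byte fragment cannot stand alone: the shared info always rides along in the same allocation.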
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 19:56 ` Eric Dumazet @ 2012-06-04 21:20 ` Michael S. Tsirkin 2012-06-05 2:50 ` Eric Dumazet 0 siblings, 1 reply; 61+ messages in thread From: Michael S. Tsirkin @ 2012-06-04 21:20 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev On Mon, Jun 04, 2012 at 09:56:52PM +0200, Eric Dumazet wrote: > On Mon, 2012-06-04 at 22:48 +0300, Michael S. Tsirkin wrote: > > > If I do this what will truesize be? 128, no? > > My example was not correct, since you must have enough room for the > SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) ( 320 bytes ) > > So it would be 128 + 320 = 448 > Ugh. I forgot about that. shinfo goes into the same page, so we'll also have to make all frags shorter by 320 to leave space for shinfo at the tail. Overall it looks like we need hypervisor extensions if we want to use build_skb ... ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 21:20 ` Michael S. Tsirkin @ 2012-06-05 2:50 ` Eric Dumazet 0 siblings, 0 replies; 61+ messages in thread From: Eric Dumazet @ 2012-06-05 2:50 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Tue, 2012-06-05 at 00:20 +0300, Michael S. Tsirkin wrote: > On Mon, Jun 04, 2012 at 09:56:52PM +0200, Eric Dumazet wrote: > > On Mon, 2012-06-04 at 22:48 +0300, Michael S. Tsirkin wrote: > > > > > If I do this what will truesize be? 128, no? > > > > My example was not correct, since you must have enough room for the > > SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) ( 320 bytes ) > > > > So it would be 128 + 320 = 448 > > > > > Ugh. I forgot about that. shinfo goes into the same page, > so we'll have to also make all frags shorter by 320 > to leave space for shinfo at tail. > overall looks like we need hyprevisor extensions if > we want to use build_skb ... Maybe not. If you provided a 2048 bytes block and hypervisor filled one (small) frame, there might be available room at the end of the block for the shinfo already. If yes : You can use build_skb() If not : netdev_alloc_skb_ip_align() I mentioned this trick for ixgbe driver. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 15:01 ` Eric Dumazet 2012-06-04 17:20 ` Michael S. Tsirkin @ 2012-06-04 18:16 ` Michael S. Tsirkin 2012-06-04 19:29 ` Eric Dumazet 1 sibling, 1 reply; 61+ messages in thread From: Michael S. Tsirkin @ 2012-06-04 18:16 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev On Mon, Jun 04, 2012 at 05:01:04PM +0200, Eric Dumazet wrote: > On Mon, 2012-06-04 at 17:17 +0300, Michael S. Tsirkin wrote: > > > bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD > > after build_skb. You are saying it's not a must? > > > > 32 would be the minimum. NETS_SKB_PAD is using a cache line (64 bytes > on most x86 current cpus) to avoid using half a cache line. > > > Hmm so maybe we should teach the hypervisor to write data > > out at an offset. Interesting. > > > > Another question is about very small packets truesize. > > build_skb sets truesize to frag_size but isn't > > this too small? We keep the whole page around, no? > > We keep one page per cpu, at most. > > For example on MTU=1500 and PAGE_SIZE=4096, one page is splitted into > two fragments, of 1500 + NET_SKB_PAD + align(shared_info), so its good > enough (this is very close from 2048 'truesize') Yes but if a tcp socket then hangs on, on one of the fragments, while the other has been freed, the whole page is still never reused, right? Doesn't this mean truesize should be 4K? -- MST ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 18:16 ` Michael S. Tsirkin @ 2012-06-04 19:29 ` Eric Dumazet 2012-06-04 19:43 ` Michael S. Tsirkin 0 siblings, 1 reply; 61+ messages in thread From: Eric Dumazet @ 2012-06-04 19:29 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote: > Yes but if a tcp socket then hangs on, on one of the fragments, > while the other has been freed, the whole page is still > never reused, right? > > Doesn't this mean truesize should be 4K? > Yes, or more exactly PAGE_SIZE, but then performance would really go down on machines with 64KB pages. Maybe we should make the whole frag head idea enabled only for PAGE_SIZE=4096. Not sure we want to track precise truesize, as the minimum truesize is SKB_DATA_ALIGN(length + NET_SKB_PAD) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) (64 + 64 + 320) = 448 It's not like buggy drivers that used truesize = length ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 19:29 ` Eric Dumazet @ 2012-06-04 19:43 ` Michael S. Tsirkin 2012-06-04 19:52 ` Eric Dumazet 2012-06-04 19:56 ` Michael S. Tsirkin 0 siblings, 2 replies; 61+ messages in thread From: Michael S. Tsirkin @ 2012-06-04 19:43 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev On Mon, Jun 04, 2012 at 09:29:45PM +0200, Eric Dumazet wrote: > On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote: > > > Yes but if a tcp socket then hangs on, on one of the fragments, > > while the other has been freed, the whole page is still > > never reused, right? > > > > Doesn't this mean truesize should be 4K? > > > > Yes, or more exactly PAGE_SIZE, but then performance would really go > down on machines with 64KB pages. > Maybe we should make the whole frag > head idea enabled only for PAGE_SIZE=4096. > > Not sure we want to track precise truesize, as the minimum truesize is > SKB_DATA_ALIGN(length + NET_SKB_PAD) + SKB_DATA_ALIGN(sizeof(struct > skb_shared_info)) (64 + 64 + 320) = 448 > > Its not like buggy drivers that used truesize = length > > Interesting. But where's the threshold? -- MST ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 19:43 ` Michael S. Tsirkin @ 2012-06-04 19:52 ` Eric Dumazet 2012-06-04 21:54 ` Michael S. Tsirkin 2012-06-04 19:56 ` Michael S. Tsirkin 1 sibling, 1 reply; 61+ messages in thread From: Eric Dumazet @ 2012-06-04 19:52 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Mon, 2012-06-04 at 22:43 +0300, Michael S. Tsirkin wrote: > On Mon, Jun 04, 2012 at 09:29:45PM +0200, Eric Dumazet wrote: > > On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote: > > > > > Yes but if a tcp socket then hangs on, on one of the fragments, > > > while the other has been freed, the whole page is still > > > never reused, right? > > > > > > Doesn't this mean truesize should be 4K? > > > > > > > Yes, or more exactly PAGE_SIZE, but then performance would really go > > down on machines with 64KB pages. > > Maybe we should make the whole frag > > head idea enabled only for PAGE_SIZE=4096. > > > > Not sure we want to track precise truesize, as the minimum truesize is > > SKB_DATA_ALIGN(length + NET_SKB_PAD) + SKB_DATA_ALIGN(sizeof(struct > > skb_shared_info)) (64 + 64 + 320) = 448 > > > > Its not like buggy drivers that used truesize = length > > > > > > Interesting. But where's the threshold? > It all depends on the global limit you have on your machine. If you allow tcp memory to use 10% of ram, then a systematic x4 error would allow it to use 40% of ram. Maybe not enough to crash. Now you have to find a real workload able to hit this limit for real... But, if you "allow" a driver to claim a truesize of 1 (instead of 4096), you can reach the limit and OOM faster You know, even the current page stored for each socket (sk_sndmsg_page) can be a problem if you set up 1.000.000 tcp sockets. 
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 19:52 ` Eric Dumazet @ 2012-06-04 21:54 ` Michael S. Tsirkin 2012-06-05 2:46 ` Eric Dumazet 0 siblings, 1 reply; 61+ messages in thread From: Michael S. Tsirkin @ 2012-06-04 21:54 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev On Mon, Jun 04, 2012 at 09:52:59PM +0200, Eric Dumazet wrote: > On Mon, 2012-06-04 at 22:43 +0300, Michael S. Tsirkin wrote: > > On Mon, Jun 04, 2012 at 09:29:45PM +0200, Eric Dumazet wrote: > > > On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote: > > > > > > > Yes but if a tcp socket then hangs on, on one of the fragments, > > > > while the other has been freed, the whole page is still > > > > never reused, right? > > > > > > > > Doesn't this mean truesize should be 4K? > > > > > > > > > > Yes, or more exactly PAGE_SIZE, but then performance would really go > > > down on machines with 64KB pages. > > > Maybe we should make the whole frag > > > head idea enabled only for PAGE_SIZE=4096. > > > > > > Not sure we want to track precise truesize, as the minimum truesize is > > > SKB_DATA_ALIGN(length + NET_SKB_PAD) + SKB_DATA_ALIGN(sizeof(struct > > > skb_shared_info)) (64 + 64 + 320) = 448 > > > > > > Its not like buggy drivers that used truesize = length > > > > > > > > > > Interesting. But where's the threshold? > > > > It all depends on the global limit you have on your machine. > > If you allow tcp memory to use 10% of ram, then a systematic x4 error > would allow it to use 40% of ram. Mabe not enough to crash. > > Now you have to find a real workload able to hit this limit for real... > > But, if you "allow" a driver to claim a truesize of 1 (instead of 4096), > you can reach the limit and OOM faster > > You know, even the current page stored for each socket (sk_sndmsg_page) > can be a problem if you setup 1.000.000 tcp sockets. 
That can consume > 4GB of ram (added to inode/sockets themselves) > This is not really taken into account right now... > > Yes, but what bugs me is that if the box is not under memory pressure, this overestimation limits buffers for no real gain. How about we teach tcp to use data_len for buffer limits normally and switch to truesize when low on memory? -- MST ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 21:54 ` Michael S. Tsirkin @ 2012-06-05 2:46 ` Eric Dumazet 0 siblings, 0 replies; 61+ messages in thread From: Eric Dumazet @ 2012-06-05 2:46 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Tue, 2012-06-05 at 00:54 +0300, Michael S. Tsirkin wrote: > Yes but what bugs me if the box is not under memory pressure > this overestimation limits buffers for no real gain. > How about we teach tcp to use data_len for buffer > limits normally and switch to truesize when low on memory? > You should first have evidence of the effect of this limitation. I see more evidence of poor choices at driver level than core level. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 19:43 ` Michael S. Tsirkin 2012-06-04 19:52 ` Eric Dumazet @ 2012-06-04 19:56 ` Michael S. Tsirkin 2012-06-04 20:05 ` Eric Dumazet 1 sibling, 1 reply; 61+ messages in thread From: Michael S. Tsirkin @ 2012-06-04 19:56 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev On Mon, Jun 04, 2012 at 10:43:30PM +0300, Michael S. Tsirkin wrote: > On Mon, Jun 04, 2012 at 09:29:45PM +0200, Eric Dumazet wrote: > > On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote: > > > > > Yes but if a tcp socket then hangs on, on one of the fragments, > > > while the other has been freed, the whole page is still > > > never reused, right? > > > > > > Doesn't this mean truesize should be 4K? > > > > > > > Yes, or more exactly PAGE_SIZE, but then performance would really go > > down on machines with 64KB pages. > > Maybe we should make the whole frag > > head idea enabled only for PAGE_SIZE=4096. > > > > Not sure we want to track precise truesize, as the minimum truesize is > > SKB_DATA_ALIGN(length + NET_SKB_PAD) + SKB_DATA_ALIGN(sizeof(struct > > skb_shared_info)) (64 + 64 + 320) = 448 > > > > Its not like buggy drivers that used truesize = length > > > > > > Interesting. But where's the threshold? And just to explain why I'm asking, if it's OK to declare 2K when you use 4K, we can do (effectively): + skb->truesize += PAGE_SIZE * nfrags; - skb->truesize += min(1500, skb->data_len); which means with 1500 byte packets we can use as much memory as we did before 4b727361f0bc7ee7378298941066d8aa15023ffb > -- > MST ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb() 2012-06-04 19:56 ` Michael S. Tsirkin @ 2012-06-04 20:05 ` Eric Dumazet 0 siblings, 0 replies; 61+ messages in thread From: Eric Dumazet @ 2012-06-04 20:05 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev On Mon, 2012-06-04 at 22:56 +0300, Michael S. Tsirkin wrote: > And just to explain why I'm asking, if it's OK to > declare 2K when you use 4K, we can do (effectively): > > + skb->truesize += PAGE_SIZE * nfrags; > - skb->truesize += min(1500, skb->data_len); > > which means with 1500 byte packets we can use > as much memory as we did before > 4b727361f0bc7ee7378298941066d8aa15023ffb > How are you sure data_len is >= 1500? If you know that you use a full page for a fragment, then PAGE_SIZE is better than 1500. If you share a page with 2/4 frames (but not one hundred), you can assume the risk of underestimation is really low. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 15:56 ` Willy Tarreau 2012-05-17 16:33 ` Eric Dumazet @ 2012-05-17 18:38 ` Ben Hutchings 1 sibling, 0 replies; 61+ messages in thread From: Ben Hutchings @ 2012-05-17 18:38 UTC (permalink / raw) To: Willy Tarreau; +Cc: Eric Dumazet, netdev On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote: [...] > The NIC does not support TSO but I've seen an alternate driver for this > NIC which pretends to do TSO and in fact builds header frags so that the > NIC is able to send all frames at once. I think it's already what GSO is > doing but I'm wondering whether it would be possible to get more speed > by doing this than by relying on GSO to (possibly) split the frags earlier. [...] Yes, GSO has some overhead for skb allocation and additional function calls that you can avoid by doing 'TSO' in the driver. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 15:01 ` Willy Tarreau 2012-05-17 15:43 ` Eric Dumazet @ 2012-05-17 19:55 ` David Miller 2012-05-17 20:04 ` Willy Tarreau 1 sibling, 1 reply; 61+ messages in thread From: David Miller @ 2012-05-17 19:55 UTC (permalink / raw) To: w; +Cc: eric.dumazet, netdev From: Willy Tarreau <w@1wt.eu> Date: Thu, 17 May 2012 17:01:57 +0200 >>From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001 > From: Willy Tarreau <w@1wt.eu> > Date: Thu, 17 May 2012 16:48:56 +0200 > Subject: [PATCH] tcp: force push data out when buffers are missing > > Commit 2f533844242 (tcp: allow splice() to build full TSO packets) > significantly improved splice() performance for some workloads but > caused stalls when pipe buffers were larger than socket buffers. > > The issue seems to happen when no data can be copied at all due to > lack of buffers, which results in pending data never being pushed. > > This change checks if all pending data has been pushed or not and > pushes them when waiting for send buffers. Eric, please indicate whether we need Willy's patch here. I want to propagate this fix as fast as possible if so. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 19:55 ` David Miller @ 2012-05-17 20:04 ` Willy Tarreau 2012-05-17 20:07 ` David Miller 0 siblings, 1 reply; 61+ messages in thread From: Willy Tarreau @ 2012-05-17 20:04 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev Hi David, On Thu, May 17, 2012 at 03:55:03PM -0400, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Thu, 17 May 2012 17:01:57 +0200 > > >>From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001 > > From: Willy Tarreau <w@1wt.eu> > > Date: Thu, 17 May 2012 16:48:56 +0200 > > Subject: [PATCH] tcp: force push data out when buffers are missing > > > > Commit 2f533844242 (tcp: allow splice() to build full TSO packets) > > significantly improved splice() performance for some workloads but > > caused stalls when pipe buffers were larger than socket buffers. > > > > The issue seems to happen when no data can be copied at all due to > > lack of buffers, which results in pending data never being pushed. > > > > This change checks if all pending data has been pushed or not and > > pushes them when waiting for send buffers. > > Eric, please indicate whether we need Willy's patch here. > > I want to propagate this fix as fast as possible if so. I think you should hold off for now, because it's possible that my patch hides another issue instead of fixing it. I'm having the same stall issue again since I applied Eric's build_skb patch, but not for all data sizes. So if the same issue is still there, it's possible that we're playing hide-and-seek with it. That's rather strange. Thanks, Willy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 20:04 ` Willy Tarreau @ 2012-05-17 20:07 ` David Miller 0 siblings, 0 replies; 61+ messages in thread From: David Miller @ 2012-05-17 20:07 UTC (permalink / raw) To: w; +Cc: eric.dumazet, netdev From: Willy Tarreau <w@1wt.eu> Date: Thu, 17 May 2012 22:04:04 +0200 > I'm having the same stall issue again since I applied Eric's build_skb > patch, but not for all data sizes. So if the same issue is still there, > it's possible that we're playing hide-and-seek with it. That's rather > strange. Ok, a Heisenbug :-) Let me know when you guys resolve this. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 12:18 Stable regression with 'tcp: allow splice() to build full TSO packets' Willy Tarreau 2012-05-17 15:01 ` Willy Tarreau @ 2012-05-17 20:41 ` Eric Dumazet 2012-05-17 21:14 ` Willy Tarreau 1 sibling, 1 reply; 61+ messages in thread From: Eric Dumazet @ 2012-05-17 20:41 UTC (permalink / raw) To: Willy Tarreau; +Cc: netdev On Thu, 2012-05-17 at 14:18 +0200, Willy Tarreau wrote: > Hi Eric, > > I'm facing a regression in stable 3.2.17 and 3.0.31 which is > exhibited by your patch 'tcp: allow splice() to build full TSO > packets' which unfortunately I am very interested in ! > > What I'm observing is that TCP transmits using splice() stall > quite quickly if I'm using pipes larger than 64kB (even 65537 > is enough to reliably observe the stall). > > I'm seeing this on haproxy running on a small ARM machine (a > dockstar), which exchanges data through a gig switch with my > development PC. The NIC (mv643xx) doesn't support TSO but has > GSO enabled. If I disable GSO, the problem remains. I can however > make the problem disappear by disabling SG or Tx checksumming. > BTW, using recv/send() instead of splice() also gets rid of the > problem. > > I can also reduce the risk of seeing the problem by increasing > the default TCP buffer sizes in tcp_wmem. By default I'm running > at 16kB, but if I increase the output buffer size above the pipe > size, the problem *seems* to disappear though I can't be certain, > since larger buffers generally means the problem takes longer to > appear, probably due to the fact that the buffers don't need to > be filled. Still I'm certain that with 64k TCP buffers and 128k > pipes I'm still seeing it. > > With strace, I'm seeing data fill up the pipe with the splice() > call responsible for pushing the data to the output socket returing > -1 EAGAIN. During this time, the client receives no data. 
> > Something bugs me, I have tested with a dummy server of mine, > httpterm, which uses tee+splice() to push data outside, and it > has no problem filling the gig pipe, and correctly recoverers > from the EAGAIN : > > send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160 > tee(0x3, 0x6, 0x10000, 0x2) = 42552 > splice(0x5, 0, 0xd, 0, 0xa00000, 0x2) = 14440 > tee(0x3, 0x6, 0x10000, 0x2) = 13880 > splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2) = -1 EAGAIN (Resource temporarily unavailable) > ... > tee(0x3, 0x6, 0x10000, 0x2) = 13880 > splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2) = 51100 > tee(0x3, 0x6, 0x10000, 0x2) = 50744 > splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2) = 32120 > tee(0x3, 0x6, 0x10000, 0x2) = 30264 > splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2) = -1 EAGAIN (Resource temporarily unavailable) > > etc... > > It's only with haproxy which uses splice() to copy data between > two sockets that I'm getting the issue (data forwarded from fd 0xe > to fd 0x6) : > > 16:03:17.797144 pipe([36, 37]) = 0 > 16:03:17.797318 fcntl64(36, 0x407 /* F_??? 
*/, 0x20000) = 131072 ## note: fcntl(F_SETPIPE_SZ, 128k) > 16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220 > 16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3) = 10220 > 16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0 > 16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7 > 16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0 > 16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060 > 16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3) = 16060 > 16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0 > 16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 4 > 16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0 > 16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520 > 16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3) = 17520 > 16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0 > 16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1 > 16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0 > 16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500 > 16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3) = 23360 > 16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378) = 0 ## note: epoll_ctl(ADD, fd=6, dir=OUT). > 16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0 > 16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1 > 16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0 > 16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900 > 16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EAGAIN (Resource temporarily unavailable) > Willy you say output to fd 6 hangs, but splice() returns EAGAIN here ? (because socket buffer is full) > So output fd 6 hangs here and will not appear anymore until > here where I pressed Ctrl-C to stop the test : > I just want to make sure its not a userland error that triggers now much faster than with prior kernels. 
You drain bytes from fd 0xe to pipe buffers, but I don't see you check write ability on the destination socket prior to the splice(pipe -> socket). ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 20:41 ` Eric Dumazet @ 2012-05-17 21:14 ` Willy Tarreau 2012-05-17 21:40 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 61+ messages in thread From: Willy Tarreau @ 2012-05-17 21:14 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Thu, May 17, 2012 at 10:41:19PM +0200, Eric Dumazet wrote: > On Thu, 2012-05-17 at 14:18 +0200, Willy Tarreau wrote: > > Hi Eric, > > > > I'm facing a regression in stable 3.2.17 and 3.0.31 which is > > exhibited by your patch 'tcp: allow splice() to build full TSO > > packets' which unfortunately I am very interested in ! > > > > What I'm observing is that TCP transmits using splice() stall > > quite quickly if I'm using pipes larger than 64kB (even 65537 > > is enough to reliably observe the stall). > > > > I'm seeing this on haproxy running on a small ARM machine (a > > dockstar), which exchanges data through a gig switch with my > > development PC. The NIC (mv643xx) doesn't support TSO but has > > GSO enabled. If I disable GSO, the problem remains. I can however > > make the problem disappear by disabling SG or Tx checksumming. > > BTW, using recv/send() instead of splice() also gets rid of the > > problem. > > > > I can also reduce the risk of seeing the problem by increasing > > the default TCP buffer sizes in tcp_wmem. By default I'm running > > at 16kB, but if I increase the output buffer size above the pipe > > size, the problem *seems* to disappear though I can't be certain, > > since larger buffers generally means the problem takes longer to > > appear, probably due to the fact that the buffers don't need to > > be filled. Still I'm certain that with 64k TCP buffers and 128k > > pipes I'm still seeing it. > > > > With strace, I'm seeing data fill up the pipe with the splice() > > call responsible for pushing the data to the output socket returing > > -1 EAGAIN. During this time, the client receives no data. 
> > > > Something bugs me, I have tested with a dummy server of mine, > > httpterm, which uses tee+splice() to push data outside, and it > > has no problem filling the gig pipe, and correctly recoverers > > from the EAGAIN : > > > > send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160 > > tee(0x3, 0x6, 0x10000, 0x2) = 42552 > > splice(0x5, 0, 0xd, 0, 0xa00000, 0x2) = 14440 > > tee(0x3, 0x6, 0x10000, 0x2) = 13880 > > splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2) = -1 EAGAIN (Resource temporarily unavailable) > > ... > > tee(0x3, 0x6, 0x10000, 0x2) = 13880 > > splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2) = 51100 > > tee(0x3, 0x6, 0x10000, 0x2) = 50744 > > splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2) = 32120 > > tee(0x3, 0x6, 0x10000, 0x2) = 30264 > > splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2) = -1 EAGAIN (Resource temporarily unavailable) > > > > etc... > > > > It's only with haproxy which uses splice() to copy data between > > two sockets that I'm getting the issue (data forwarded from fd 0xe > > to fd 0x6) : > > > > 16:03:17.797144 pipe([36, 37]) = 0 > > 16:03:17.797318 fcntl64(36, 0x407 /* F_??? 
*/, 0x20000) = 131072 ## note: fcntl(F_SETPIPE_SZ, 128k) > > 16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220 > > 16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3) = 10220 > > 16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0 > > 16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7 > > 16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0 > > 16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060 > > 16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3) = 16060 > > 16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0 > > 16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 4 > > 16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0 > > 16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520 > > 16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3) = 17520 > > 16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0 > > 16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1 > > 16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0 > > 16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500 > > 16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3) = 23360 > > 16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378) = 0 ## note: epoll_ctl(ADD, fd=6, dir=OUT). > > 16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0 > > 16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1 > > 16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0 > > 16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900 > > 16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EAGAIN (Resource temporarily unavailable) > > > > Willy you say output to fd 6 hangs, but splice() returns EAGAIN here ? > (because socket buffer is full) Exactly. > > So output fd 6 hangs here and will not appear anymore until > > here where I pressed Ctrl-C to stop the test : > > > > I just want to make sure its not a userland error that triggers now much > faster than with prior kernels. 
I understand and that could be possible indeed. Still, this precise code has been used for a few years now in prod at 10+ Gbps, so while that does not mean it's exempt from any race condition or bug, we have not observed this behaviour earlier. In fact, what I've not tested much was the small ARM-based machine which is just a convenient development system to try to optimize network performance. Among the differences I see with usual deployments is that the NIC doesn't support TSO, while I've been used to enabling splicing only where TSO was supported, because before your recent optimizations, it was less performant than recv/send. > You drain bytes from fd 0xe to pipe buffers, but I dont see you check > write ability on destination socket prior the splice(pipe -> socket) I don't, I only rely on EAGAIN to re-enable polling for write (otherwise it becomes a real mess of epoll_ctl which noticeably hurts performance). I only re-enable polling if FDs can't move anymore. Before doing a splice(read), if any data are left pending in the pipe, I first try a splice(write) to try to flush the pipe, then I perform the splice(read), then try to flush the pipe again using a splice(write). Then polling is enabled if we block on EAGAIN. I could fix the issue here by reworking my first patch. I think I was too conservative, because if we leave do_tcp_sendpages() on OOM with copied == 0, in my opinion we never push. My first attempt tried to call tcp_push() only once but it seems like this is a wrong idea because it doesn't allow new attempts if, for example, tcp_write_xmit() cannot send upon the first attempt. After calling tcp_push() unconditionally on OOM, I cannot reproduce the issue at all, and the traffic reaches a steady 950 Mbps in each direction. I'm appending the patch, you'll know better than me if it's correct or not. 
Best regards, Willy --- >From 39c3f73176118a274ec9dea9c22c83d97a7fbfe0 Mon Sep 17 00:00:00 2001 From: Willy Tarreau <w@1wt.eu> Date: Thu, 17 May 2012 22:43:20 +0200 Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions Since recent changes on TCP splicing (starting with commits 2f533844 and 35f9c09f), I started seeing massive stalls when forwarding traffic between two sockets using splice() when pipe buffers were larger than socket buffers. Latest changes (net: netdev_alloc_skb() use build_skb()) made the problem even more apparent. The reason seems to be that if do_tcp_sendpages() fails on out of memory condition without being able to send at least one byte, tcp_push() is not called and the buffers cannot be flushed. After applying the attached patch, I cannot reproduce the stalls at all and the data rate is perfectly stable and steady under any condition which previously caused the problem to be permanent. The issue seems to have been there since before the kernel migrated to git, which makes me think that the stalls I occasionally experienced with tux during stress-tests years ago were probably related to the same issue. This issue was first encountered on 3.0.31 and 3.2.17, so please backport to -stable. Signed-off-by: Willy Tarreau <w@1wt.eu> Cc: <stable@vger.kernel.org> --- net/ipv4/tcp.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 63ddaee..231fe53 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -917,8 +917,7 @@ new_segment: wait_for_sndbuf: set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); wait_for_memory: - if (copied) - tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); + tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) goto do_error; -- 1.7.2.1.45.g54fbc ^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 21:14 ` Willy Tarreau @ 2012-05-17 21:40 ` Eric Dumazet 2012-05-17 21:50 ` Eric Dumazet 2012-05-17 21:54 ` Willy Tarreau 2012-05-17 21:47 ` Willy Tarreau 2012-05-17 22:14 ` Eric Dumazet 2 siblings, 2 replies; 61+ messages in thread From: Eric Dumazet @ 2012-05-17 21:40 UTC (permalink / raw) To: Willy Tarreau; +Cc: netdev On Thu, 2012-05-17 at 23:14 +0200, Willy Tarreau wrote: > I understand and that could be possible indeed. Still, this precise code > has been used for a few years now in prod at 10+ Gbps, so while that does > not mean it's exempt from any race condition or bug, we have not observed > this behaviour earlier. In fact, what I've not tested much was the small > ARM based machine which is just a convenient development system to try to > optimize network performance. Among the differences I see with usual > deployments is that the NIC doesn't support TSO, while I've been used to > enable splicing only where TSO was supported, because before your recent > optimizations, it was less performant than recv/send. > > > You drain bytes from fd 0xe to pipe buffers, but I dont see you check > > write ability on destination socket prior the splice(pipe -> socket) > > I don't, I only rely on EAGAIN to re-enable polling for write (otherwise > it becomes a real mess of epoll_ctl which sensibly hurts performance). I > only re-enable polling if FDs can't move anymore. > > Before doing a splice(read), if any data are left pending in the pipe, I > first try a splice(write) to try to flush the pipe, then I perform the > splice(read) then try to flush the pipe again using a splice(write). > Then polling is enabled if we block on EAGAIN. > > I could fix the issue here by reworking my first patch. I think I was > too much conservative, because if we leave do_tcp_sendpages() on OOM > with copied == 0, in my opinion we never push. 
My first attempt tried > to call tcp_push() only once but it seems like this is a wrong idea > because it doesn't allow new attempts if for example tcp_write_xmit() > cannot send upon first attempt. > > After calling tcp_push() inconditionnally on OOM, I cannot reproduce > the issue at all, and the traffic reaches a steady 950 Mbps in each > direction. > > I'm appending the patch, you'll know better than me if it's correct or > not. > > Best regards, > Willy > > --- > > From 39c3f73176118a274ec9dea9c22c83d97a7fbfe0 Mon Sep 17 00:00:00 2001 > From: Willy Tarreau <w@1wt.eu> > Date: Thu, 17 May 2012 22:43:20 +0200 > Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions > > Since recent changes on TCP splicing (starting with commits 2f533844 > and 35f9c09f), I started seeing massive stalls when forwarding traffic > between two sockets using splice() when pipe buffers were larger than > socket buffers. > > Latest changes (net: netdev_alloc_skb() use build_skb()) made the > problem even more apparent. > > The reason seems to be that if do_tcp_sendpages() fails on out of memory > condition without being able to send at least one byte, tcp_push() is not > called and the buffers cannot be flushed. > > After applying the attached patch, I cannot reproduce the stalls at all > and the data rate it perfectly stable and steady under any condition > which previously caused the problem to be permanent. > > The issue seems to have been there since before the kernel migrated to > git, which makes me think that the stalls I occasionally experienced > with tux during stress-tests years ago were probably related to the > same issue. > > This issue was first encountered on 3.0.31 and 3.2.17, so please backport > to -stable. 
> > Signed-off-by: Willy Tarreau <w@1wt.eu> > Cc: <stable@vger.kernel.org> > --- > net/ipv4/tcp.c | 3 +-- > 1 files changed, 1 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 63ddaee..231fe53 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -917,8 +917,7 @@ new_segment: > wait_for_sndbuf: > set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); > wait_for_memory: > - if (copied) > - tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); > + tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); > > if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) > goto do_error; I don't understand why we should tcp_push() if we sent 0 bytes in this splice() call. The push() should have been done already by a prior splice() call, don't you think? out: if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) tcp_push(sk, flags, mss_now, tp->nonagle); ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 21:40 ` Eric Dumazet @ 2012-05-17 21:50 ` Eric Dumazet 2012-05-17 21:57 ` Willy Tarreau 2012-05-17 22:01 ` Eric Dumazet 2012-05-17 21:54 ` Willy Tarreau 1 sibling, 2 replies; 61+ messages in thread From: Eric Dumazet @ 2012-05-17 21:50 UTC (permalink / raw) To: Willy Tarreau; +Cc: netdev On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote: > I dont understand why we should tcp_push() if we sent 0 bytes in this > splice() call. > > The push() should have be done already by prior splice() call, dont you > think ? > > out: > if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > tcp_push(sk, flags, mss_now, tp->nonagle); > I think I now understand One splice() syscall actually calls do_tcp_sendpages() several times (N). Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set And last one with MSG_SENDPAGE_NOTLAST unset So maybe we should replace this test by : if (!(flags & MSG_SENDPAGE_NOTLAST)) tcp_push(...); ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 21:50 ` Eric Dumazet @ 2012-05-17 21:57 ` Willy Tarreau 2012-05-17 22:01 ` Eric Dumazet 1 sibling, 0 replies; 61+ messages in thread From: Willy Tarreau @ 2012-05-17 21:57 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Thu, May 17, 2012 at 11:50:10PM +0200, Eric Dumazet wrote: > On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote: > > > I dont understand why we should tcp_push() if we sent 0 bytes in this > > splice() call. > > > > The push() should have be done already by prior splice() call, dont you > > think ? > > > > out: > > if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > > tcp_push(sk, flags, mss_now, tp->nonagle); > > > > I think I now understand > > One splice() syscall actually calls do_tcp_sendpages() several times > (N). > > Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set > > And last one with MSG_SENDPAGE_NOTLAST unset > > So maybe we should replace this test by : > > if (!(flags & MSG_SENDPAGE_NOTLAST)) > tcp_push(...); Just tested, and it unfortunately does not fix the issue :-( I'll see if I can log errors from lower layers. Willy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 21:50 ` Eric Dumazet 2012-05-17 21:57 ` Willy Tarreau @ 2012-05-17 22:01 ` Eric Dumazet 2012-05-17 22:10 ` Eric Dumazet 2012-05-17 22:16 ` Willy Tarreau 1 sibling, 2 replies; 61+ messages in thread From: Eric Dumazet @ 2012-05-17 22:01 UTC (permalink / raw) To: Willy Tarreau; +Cc: netdev On Thu, 2012-05-17 at 23:50 +0200, Eric Dumazet wrote: > On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote: > > > I dont understand why we should tcp_push() if we sent 0 bytes in this > > splice() call. > > > > The push() should have be done already by prior splice() call, dont you > > think ? > > > > out: > > if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > > tcp_push(sk, flags, mss_now, tp->nonagle); > > > > I think I now understand > > One splice() syscall actually calls do_tcp_sendpages() several times > (N). > > Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set > > And last one with MSG_SENDPAGE_NOTLAST unset > > So maybe we should replace this test by : > > if (!(flags & MSG_SENDPAGE_NOTLAST)) > tcp_push(...); > > It's trickier than that. If we return 0 from do_tcp_sendpages(), __splice_from_pipe() won't call us again, but we need to flush(). (This bug is indeed very old) So maybe use: if ((copied && !(flags & MSG_SENDPAGE_NOTLAST)) || !copied) tcp_push(...); ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 22:01 ` Eric Dumazet @ 2012-05-17 22:10 ` Eric Dumazet 0 siblings, 0 replies; 61+ messages in thread From: Eric Dumazet @ 2012-05-17 22:10 UTC (permalink / raw) To: Willy Tarreau; +Cc: netdev On Fri, 2012-05-18 at 00:02 +0200, Eric Dumazet wrote: > On Thu, 2012-05-17 at 23:50 +0200, Eric Dumazet wrote: > > On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote: > > > > > I dont understand why we should tcp_push() if we sent 0 bytes in this > > > splice() call. > > > > > > The push() should have be done already by prior splice() call, dont you > > > think ? > > > > > > out: > > > if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > > > tcp_push(sk, flags, mss_now, tp->nonagle); > > > > > > > I think I now understand > > > > One splice() syscall actually calls do_tcp_sendpages() several times > > (N). > > > > Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set > > > > And last one with MSG_SENDPAGE_NOTLAST unset > > > > So maybe we should replace this test by : > > > > if (!(flags & MSG_SENDPAGE_NOTLAST)) > > tcp_push(...); > > > > > > Its more tricky than that. > > If we return 0 from do_tcp_sendpages(), __splice_from_pipe() wont call > us again, but we need to flush(). (This bug is indeed very old) > > So maybe use : > > if ((copied && !(flags & MSG_SENDPAGE_NOTLAST)) || > !copied) > tcp_push(...); But then your patch is equivalent, so I'm going to Ack it ;) ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 22:01 ` Eric Dumazet 2012-05-17 22:10 ` Eric Dumazet @ 2012-05-17 22:16 ` Willy Tarreau 2012-05-17 22:22 ` Eric Dumazet 1 sibling, 1 reply; 61+ messages in thread From: Willy Tarreau @ 2012-05-17 22:16 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Fri, May 18, 2012 at 12:01:59AM +0200, Eric Dumazet wrote: > On Thu, 2012-05-17 at 23:50 +0200, Eric Dumazet wrote: > > On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote: > > > > > I dont understand why we should tcp_push() if we sent 0 bytes in this > > > splice() call. > > > > > > The push() should have be done already by prior splice() call, dont you > > > think ? > > > > > > out: > > > if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > > > tcp_push(sk, flags, mss_now, tp->nonagle); > > > > > > > I think I now understand > > > > One splice() syscall actually calls do_tcp_sendpages() several times > > (N). > > > > Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set > > > > And last one with MSG_SENDPAGE_NOTLAST unset > > > > So maybe we should replace this test by : > > > > if (!(flags & MSG_SENDPAGE_NOTLAST)) > > tcp_push(...); > > > > > > Its more tricky than that. > > If we return 0 from do_tcp_sendpages(), __splice_from_pipe() wont call > us again, but we need to flush(). (This bug is indeed very old) > > So maybe use : > > if ((copied && !(flags & MSG_SENDPAGE_NOTLAST)) || > !copied) > tcp_push(...); That's what I wanted to do at first but was still worried about the situations where we fail upon first call due to lack of memory. And indeed that did not fix the issue either :-( At least now I've checked that we fail here in do_tcp_sendpages() : if (!sk_stream_memory_free(sk)) { printk(KERN_DEBUG "%s:%d copied=%d\n", __FUNCTION__, __LINE__, copied); goto wait_for_sndbuf; } Well, finally I just put the same test as yours above for the call to tcp_push() upon out of memory and it fixed it too. 
I have simplified the expression, since ((A && !B) || !A) == !(A && B). With the updated patch here it works for me. I don't know if it's better. Willy ----- >From 1f26263bd4f8827bbca4cef77a99fac128cf8ccc Mon Sep 17 00:00:00 2001 From: Willy Tarreau <w@1wt.eu> Date: Thu, 17 May 2012 22:43:20 +0200 Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions Since recent changes on TCP splicing (starting with commits 2f533844 and 35f9c09f), I started seeing massive stalls when forwarding traffic between two sockets using splice() when pipe buffers were larger than socket buffers. Latest changes (net: netdev_alloc_skb() use build_skb()) made the problem even more apparent. The reason seems to be that if do_tcp_sendpages() fails on out of memory condition without being able to send at least one byte, tcp_push() is not called and the buffers cannot be flushed. After applying the attached patch, I cannot reproduce the stalls at all and the data rate is perfectly stable and steady under any condition which previously caused the problem to be permanent. The issue seems to have been there since before the kernel migrated to git, which makes me think that the stalls I occasionally experienced with tux during stress-tests years ago were probably related to the same issue. This issue was first encountered on 3.0.31 and 3.2.17, so please backport to -stable. 
Signed-off-by: Willy Tarreau <w@1wt.eu> Cc: <stable@vger.kernel.org> --- net/ipv4/tcp.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 63ddaee..cfe47c1 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -917,7 +917,7 @@ new_segment: wait_for_sndbuf: set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); wait_for_memory: - if (copied) + if (!(copied && (flags & MSG_SENDPAGE_NOTLAST))) tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) @@ -927,7 +927,7 @@ wait_for_memory: } out: - if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) + if (!(copied && (flags & MSG_SENDPAGE_NOTLAST))) tcp_push(sk, flags, mss_now, tp->nonagle); return copied; -- 1.7.2.1.45.g54fbc ^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 22:16 ` Willy Tarreau @ 2012-05-17 22:22 ` Eric Dumazet 2012-05-17 22:24 ` Willy Tarreau 2012-05-17 22:27 ` Joe Perches 0 siblings, 2 replies; 61+ messages in thread From: Eric Dumazet @ 2012-05-17 22:22 UTC (permalink / raw) To: Willy Tarreau; +Cc: netdev On Fri, 2012-05-18 at 00:16 +0200, Willy Tarreau wrote: > On Fri, May 18, 2012 at 12:01:59AM +0200, Eric Dumazet wrote: > > On Thu, 2012-05-17 at 23:50 +0200, Eric Dumazet wrote: > > > On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote: > > > > > > > I dont understand why we should tcp_push() if we sent 0 bytes in this > > > > splice() call. > > > > > > > > The push() should have be done already by prior splice() call, dont you > > > > think ? > > > > > > > > out: > > > > if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > > > > tcp_push(sk, flags, mss_now, tp->nonagle); > > > > > > > > > > I think I now understand > > > > > > One splice() syscall actually calls do_tcp_sendpages() several times > > > (N). > > > > > > Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set > > > > > > And last one with MSG_SENDPAGE_NOTLAST unset > > > > > > So maybe we should replace this test by : > > > > > > if (!(flags & MSG_SENDPAGE_NOTLAST)) > > > tcp_push(...); > > > > > > > > > > Its more tricky than that. > > > > If we return 0 from do_tcp_sendpages(), __splice_from_pipe() wont call > > us again, but we need to flush(). (This bug is indeed very old) > > > > So maybe use : > > > > if ((copied && !(flags & MSG_SENDPAGE_NOTLAST)) || > > !copied) > > tcp_push(...); > > That's what I wanted to do at first but was still worried about the > situations where we fail upon first call due to lack of memory. 
And > indeed that did not fix the issue either :-( > > At least now I've checked that we fail here in do_tcp_sendpages() : > > if (!sk_stream_memory_free(sk)) { > printk(KERN_DEBUG "%s:%d copied=%d\n", __FUNCTION__, __LINE__, copied); > goto wait_for_sndbuf; > } > > Well, finally I just put the same test as yours above for the call to > tcp_push() upon out of memory and it fixed it too. I have simplified > the expression, since ((A && !B) || !A) == !(A & B). > > With the updated patch here it works for me. I don't know if it's > better. > > Willy > > ----- > > From 1f26263bd4f8827bbca4cef77a99fac128cf8ccc Mon Sep 17 00:00:00 2001 > From: Willy Tarreau <w@1wt.eu> > Date: Thu, 17 May 2012 22:43:20 +0200 > Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions > > Since recent changes on TCP splicing (starting with commits 2f533844 > and 35f9c09f), I started seeing massive stalls when forwarding traffic > between two sockets using splice() when pipe buffers were larger than > socket buffers. > > Latest changes (net: netdev_alloc_skb() use build_skb()) made the > problem even more apparent. > > The reason seems to be that if do_tcp_sendpages() fails on out of memory > condition without being able to send at least one byte, tcp_push() is not > called and the buffers cannot be flushed. > > After applying the attached patch, I cannot reproduce the stalls at all > and the data rate it perfectly stable and steady under any condition > which previously caused the problem to be permanent. > > The issue seems to have been there since before the kernel migrated to > git, which makes me think that the stalls I occasionally experienced > with tux during stress-tests years ago were probably related to the > same issue. > > This issue was first encountered on 3.0.31 and 3.2.17, so please backport > to -stable. 
> > Signed-off-by: Willy Tarreau <w@1wt.eu> > Cc: <stable@vger.kernel.org> > --- > net/ipv4/tcp.c | 4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 63ddaee..cfe47c1 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -917,7 +917,7 @@ new_segment: > wait_for_sndbuf: > set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); > wait_for_memory: > - if (copied) > + if (!(copied && (flags & MSG_SENDPAGE_NOTLAST))) > tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); > > if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) > @@ -927,7 +927,7 @@ wait_for_memory: > } > > out: > - if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > + if (!(copied && (flags & MSG_SENDPAGE_NOTLAST))) > tcp_push(sk, flags, mss_now, tp->nonagle); > return copied; > Hmm... I believe I prefer your prior patch ( the one I Acked) ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 22:22 ` Eric Dumazet @ 2012-05-17 22:24 ` Willy Tarreau 2012-05-17 22:25 ` David Miller 2012-05-17 22:27 ` Joe Perches 1 sibling, 1 reply; 61+ messages in thread From: Willy Tarreau @ 2012-05-17 22:24 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Fri, May 18, 2012 at 12:22:31AM +0200, Eric Dumazet wrote: > On Fri, 2012-05-18 at 00:16 +0200, Willy Tarreau wrote: > > On Fri, May 18, 2012 at 12:01:59AM +0200, Eric Dumazet wrote: > > > On Thu, 2012-05-17 at 23:50 +0200, Eric Dumazet wrote: > > > > On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote: > > > > > > > > > I dont understand why we should tcp_push() if we sent 0 bytes in this > > > > > splice() call. > > > > > > > > > > The push() should have be done already by prior splice() call, dont you > > > > > think ? > > > > > > > > > > out: > > > > > if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > > > > > tcp_push(sk, flags, mss_now, tp->nonagle); > > > > > > > > > > > > > I think I now understand > > > > > > > > One splice() syscall actually calls do_tcp_sendpages() several times > > > > (N). > > > > > > > > Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set > > > > > > > > And last one with MSG_SENDPAGE_NOTLAST unset > > > > > > > > So maybe we should replace this test by : > > > > > > > > if (!(flags & MSG_SENDPAGE_NOTLAST)) > > > > tcp_push(...); > > > > > > > > > > > > > > Its more tricky than that. > > > > > > If we return 0 from do_tcp_sendpages(), __splice_from_pipe() wont call > > > us again, but we need to flush(). (This bug is indeed very old) > > > > > > So maybe use : > > > > > > if ((copied && !(flags & MSG_SENDPAGE_NOTLAST)) || > > > !copied) > > > tcp_push(...); > > > > That's what I wanted to do at first but was still worried about the > > situations where we fail upon first call due to lack of memory. 
And > > indeed that did not fix the issue either :-( > > > > At least now I've checked that we fail here in do_tcp_sendpages() : > > > > if (!sk_stream_memory_free(sk)) { > > printk(KERN_DEBUG "%s:%d copied=%d\n", __FUNCTION__, __LINE__, copied); > > goto wait_for_sndbuf; > > } > > > > Well, finally I just put the same test as yours above for the call to > > tcp_push() upon out of memory and it fixed it too. I have simplified > > the expression, since ((A && !B) || !A) == !(A & B). > > > > With the updated patch here it works for me. I don't know if it's > > better. > > > > Willy > > > > ----- > > > > From 1f26263bd4f8827bbca4cef77a99fac128cf8ccc Mon Sep 17 00:00:00 2001 > > From: Willy Tarreau <w@1wt.eu> > > Date: Thu, 17 May 2012 22:43:20 +0200 > > Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions > > > > Since recent changes on TCP splicing (starting with commits 2f533844 > > and 35f9c09f), I started seeing massive stalls when forwarding traffic > > between two sockets using splice() when pipe buffers were larger than > > socket buffers. > > > > Latest changes (net: netdev_alloc_skb() use build_skb()) made the > > problem even more apparent. > > > > The reason seems to be that if do_tcp_sendpages() fails on out of memory > > condition without being able to send at least one byte, tcp_push() is not > > called and the buffers cannot be flushed. > > > > After applying the attached patch, I cannot reproduce the stalls at all > > and the data rate it perfectly stable and steady under any condition > > which previously caused the problem to be permanent. > > > > The issue seems to have been there since before the kernel migrated to > > git, which makes me think that the stalls I occasionally experienced > > with tux during stress-tests years ago were probably related to the > > same issue. > > > > This issue was first encountered on 3.0.31 and 3.2.17, so please backport > > to -stable. 
> > > > Signed-off-by: Willy Tarreau <w@1wt.eu> > > Cc: <stable@vger.kernel.org> > > --- > > net/ipv4/tcp.c | 4 ++-- > > 1 files changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index 63ddaee..cfe47c1 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -917,7 +917,7 @@ new_segment: > > wait_for_sndbuf: > > set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); > > wait_for_memory: > > - if (copied) > > + if (!(copied && (flags & MSG_SENDPAGE_NOTLAST))) > > tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); > > > > if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) > > @@ -927,7 +927,7 @@ wait_for_memory: > > } > > > > out: > > - if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > > + if (!(copied && (flags & MSG_SENDPAGE_NOTLAST))) > > tcp_push(sk, flags, mss_now, tp->nonagle); > > return copied; > > > > > Hmm... I believe I prefer your prior patch ( the one I Acked) I too, less changes are better :-) Thanks, Willy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 22:24 ` Willy Tarreau @ 2012-05-17 22:25 ` David Miller 2012-05-17 22:30 ` Willy Tarreau 0 siblings, 1 reply; 61+ messages in thread From: David Miller @ 2012-05-17 22:25 UTC (permalink / raw) To: w; +Cc: eric.dumazet, netdev From: Willy Tarreau <w@1wt.eu> Date: Fri, 18 May 2012 00:24:31 +0200 > On Fri, May 18, 2012 at 12:22:31AM +0200, Eric Dumazet wrote: >> Hmm... I believe I prefer your prior patch ( the one I Acked) > > I too, less changes are better :-) And I think that's the one I'm going to apply after I audit this code a little bit myself. Thanks! ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 22:25 ` David Miller @ 2012-05-17 22:30 ` Willy Tarreau 2012-05-17 22:35 ` David Miller 0 siblings, 1 reply; 61+ messages in thread From: Willy Tarreau @ 2012-05-17 22:30 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev On Thu, May 17, 2012 at 06:25:45PM -0400, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Fri, 18 May 2012 00:24:31 +0200 > > > On Fri, May 18, 2012 at 12:22:31AM +0200, Eric Dumazet wrote: > >> Hmm... I believe I prefer your prior patch ( the one I Acked) > > > > I too, less changes are better :-) > > And I think that's the one I'm going to apply after I audit this > code a little bit myself. > > Thanks! Sorry for the double mail, I didn't know if you were following the thread when I forwarded it to you. Cheers, Willy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 22:30 ` Willy Tarreau @ 2012-05-17 22:35 ` David Miller 2012-05-17 22:49 ` Willy Tarreau 0 siblings, 1 reply; 61+ messages in thread From: David Miller @ 2012-05-17 22:35 UTC (permalink / raw) To: w; +Cc: eric.dumazet, netdev From: Willy Tarreau <w@1wt.eu> Date: Fri, 18 May 2012 00:30:28 +0200 > On Thu, May 17, 2012 at 06:25:45PM -0400, David Miller wrote: >> From: Willy Tarreau <w@1wt.eu> >> Date: Fri, 18 May 2012 00:24:31 +0200 >> >> > On Fri, May 18, 2012 at 12:22:31AM +0200, Eric Dumazet wrote: >> >> Hmm... I believe I prefer your prior patch ( the one I Acked) >> > >> > I too, less changes are better :-) >> >> And I think that's the one I'm going to apply after I audit this >> code a little bit myself. >> >> Thanks! > > Sorry for the double mail, I didn't know if you were following the thread > when I forwarded it to you. No problem. BTW, this issue goes back all the way to the original implementation of do_tcp_sendpages(). ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 22:35 ` David Miller @ 2012-05-17 22:49 ` Willy Tarreau 0 siblings, 0 replies; 61+ messages in thread From: Willy Tarreau @ 2012-05-17 22:49 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev On Thu, May 17, 2012 at 06:35:20PM -0400, David Miller wrote: > BTW, this issue goes back all the way to the original implementation > of do_tcp_sendpages(). I'm not surprised, I remember tux on 2.4 sometimes leaving frozen connections behind it when pushed to extreme speeds back in the days when it was audacious to pull 2 Gbps out of a Pentium3. I like it when we find very old issues when improving the code, it generally means we're stressing it a bit more and making more efficient use of it. Cheers, Willy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 22:22 ` Eric Dumazet 2012-05-17 22:24 ` Willy Tarreau @ 2012-05-17 22:27 ` Joe Perches 1 sibling, 0 replies; 61+ messages in thread From: Joe Perches @ 2012-05-17 22:27 UTC (permalink / raw) To: Eric Dumazet; +Cc: Willy Tarreau, netdev On Fri, 2012-05-18 at 00:22 +0200, Eric Dumazet wrote: > On Fri, 2012-05-18 at 00:16 +0200, Willy Tarreau wrote: > > I have simplified the expression, > > since ((A && !B) || !A) == !(A && B). Perhaps it's clearer for a reader the first way and any decent compiler (even gcc) could probably do that optimization. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 21:40 ` Eric Dumazet 2012-05-17 21:50 ` Eric Dumazet @ 2012-05-17 21:54 ` Willy Tarreau 1 sibling, 0 replies; 61+ messages in thread From: Willy Tarreau @ 2012-05-17 21:54 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Thu, May 17, 2012 at 11:40:22PM +0200, Eric Dumazet wrote: > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index 63ddaee..231fe53 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -917,8 +917,7 @@ new_segment: > > wait_for_sndbuf: > > set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); > > wait_for_memory: > > - if (copied) > > - tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); > > + tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); > > > > if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) > > goto do_error; > > > I dont understand why we should tcp_push() if we sent 0 bytes in this > splice() call. > > The push() should have be done already by prior splice() call, dont you > think ? > > out: > if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > tcp_push(sk, flags, mss_now, tp->nonagle); > I do think so, but my guess is that something fails below us at some point. I tracked the tcp_push() chain down to tcp_write_xmit() which I *think* might fail. I'm not certain about the conditions which can make this happen though. However what I'm almost sure is that if we fail one tcp_write_xmit() from a call to tcp_push(), next time we come here in do_tcp_sendpages(), the buffers won't have moved and this second splice() will immediately go to out with copied == 0. Maybe I just worked around the consequence of a deeper issue and something else needs to enable tcp_push() when we fail, but it's not very clear to me. I felt like the call to tcp_check_probe_timer() in __tcp_push_pending_frames() is there for this purpose, but I'm not sure. Willy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 21:14 ` Willy Tarreau 2012-05-17 21:40 ` Eric Dumazet @ 2012-05-17 21:47 ` Willy Tarreau 2012-05-17 22:14 ` Eric Dumazet 2 siblings, 0 replies; 61+ messages in thread From: Willy Tarreau @ 2012-05-17 21:47 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Thu, May 17, 2012 at 11:14:14PM +0200, Willy Tarreau wrote: > On Thu, May 17, 2012 at 10:41:19PM +0200, Eric Dumazet wrote: > > You drain bytes from fd 0xe to pipe buffers, but I dont see you check > > write ability on destination socket prior the splice(pipe -> socket) > I don't, I only rely on EAGAIN to re-enable polling for write (otherwise > it becomes a real mess of epoll_ctl which noticeably hurts performance). I > only re-enable polling if FDs can't move anymore. > > Before doing a splice(read), if any data are left pending in the pipe, I > first try a splice(write) to try to flush the pipe, then I perform the > splice(read) then try to flush the pipe again using a splice(write). > Then polling is enabled if we block on EAGAIN. Just for the sake of documentation, I have rebuilt an up-to-date strace so that we get the details of the epoll_ctl(). I've restarted with only the classical epoll (without the events cache) so that it's easier to trace event status with strace. 
Here we clearly see that the output FD (6) was not reported as being writable for a long period, until haproxy's timeout struck : 15:07:50.130499 gettimeofday({1324652870, 130772}, NULL) = 0 15:07:50.130937 gettimeofday({1324652870, 131071}, NULL) = 0 15:07:50.131195 epoll_wait(3, {}, 6, 1000) = 0 15:07:51.132529 gettimeofday({1324652871, 132654}, NULL) = 0 15:07:51.132777 gettimeofday({1324652871, 132895}, NULL) = 0 15:07:51.133013 epoll_wait(3, {{EPOLLIN, {u32=4, u64=4}}}, 6, 1000) = 1 15:07:51.561594 gettimeofday({1324652871, 561723}, NULL) = 0 15:07:51.561866 accept(4, {sa_family=AF_INET, sin_port=htons(56441), sin_addr=inet_addr("192.168.0.31")}, [16]) = 6 15:07:51.562249 fcntl64(6, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 15:07:51.562582 setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0 15:07:51.562844 accept(4, 0x7ea65a64, [128]) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.563142 epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=6, u64=6}}) = 0 15:07:51.563421 gettimeofday({1324652871, 563540}, NULL) = 0 15:07:51.563657 epoll_wait(3, {{EPOLLIN, {u32=6, u64=6}}}, 7, 1000) = 1 15:07:51.563923 gettimeofday({1324652871, 564041}, NULL) = 0 15:07:51.564180 recv(6, "GET /?s=1g HTTP/1.0\r\nUser-Agent:"..., 7000, 0) = 126 15:07:51.564574 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 7 15:07:51.564817 fcntl64(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 15:07:51.565040 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0 15:07:51.565277 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0 15:07:51.565524 connect(7, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("192.168.0.31")}, 16) = -1 EINPROGRESS (Operation now in progress) 15:07:51.565955 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0 15:07:51.566235 epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=6, u64=6}}) = 0 15:07:51.566494 gettimeofday({1324652871, 566613}, NULL) = 0 15:07:51.566730 epoll_wait(3, {{EPOLLOUT, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.566994 gettimeofday({1324652871, 567114}, NULL) = 0 
15:07:51.567246 send(7, "GET /?s=1g HTTP/1.0\r\nUser-Agent:"..., 107, MSG_DONTWAIT|MSG_NOSIGNAL) = 107 15:07:51.567589 epoll_ctl(3, EPOLL_CTL_MOD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0 15:07:51.567907 gettimeofday({1324652871, 568031}, NULL) = 0 15:07:51.568148 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.568471 gettimeofday({1324652871, 568593}, NULL) = 0 15:07:51.568715 recv(7, "HTTP/1.1 200\r\nConnection: close\r"..., 7000, 0) = 4380 15:07:51.569057 recv(7, "89.12345678\n.123456789.123456789"..., 2620, 0) = 2620 15:07:51.569425 epoll_ctl(3, EPOLL_CTL_DEL, 7, {0, {u32=7, u64=7}}) = 0 15:07:51.569745 epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLOUT, {u32=6, u64=6}}) = 0 15:07:51.570013 gettimeofday({1324652871, 570133}, NULL) = 0 15:07:51.570251 epoll_wait(3, {{EPOLLOUT, {u32=6, u64=6}}}, 8, 1000) = 1 15:07:51.570514 gettimeofday({1324652871, 570634}, NULL) = 0 15:07:51.570754 send(6, "HTTP/1.1 200\r\nConnection: close\r"..., 7000, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE) = 7000 15:07:51.571244 epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=6, u64=6}}) = 0 15:07:51.571518 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0 15:07:51.571825 gettimeofday({1324652871, 571947}, NULL) = 0 15:07:51.572064 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.572355 gettimeofday({1324652871, 572476}, NULL) = 0 15:07:51.572617 pipe([8, 9]) = 0 15:07:51.572863 fcntl64(8, 0x407 /* F_??? 
*/, 0x40000) = 262144 15:07:51.573107 splice(7, NULL, 9, NULL, 1073734988, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 9060 15:07:51.573408 splice(7, NULL, 9, NULL, 1073725928, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 1460 15:07:51.573696 splice(7, NULL, 9, NULL, 1073724468, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.573942 splice(8, NULL, 6, NULL, 10520, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 10520 15:07:51.574328 gettimeofday({1324652871, 574452}, NULL) = 0 15:07:51.574572 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.574988 gettimeofday({1324652871, 575112}, NULL) = 0 15:07:51.575235 splice(7, NULL, 9, NULL, 1073724468, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 16060 15:07:51.575513 splice(8, NULL, 6, NULL, 16060, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 16060 15:07:51.575878 gettimeofday({1324652871, 576003}, NULL) = 0 15:07:51.576142 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.576510 gettimeofday({1324652871, 576631}, NULL) = 0 15:07:51.576753 splice(7, NULL, 9, NULL, 1073708408, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 11680 15:07:51.577027 splice(8, NULL, 6, NULL, 11680, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 11680 15:07:51.577358 gettimeofday({1324652871, 577480}, NULL) = 0 15:07:51.577614 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.577959 gettimeofday({1324652871, 578080}, NULL) = 0 15:07:51.578201 splice(7, NULL, 9, NULL, 1073696728, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 13140 15:07:51.578492 splice(8, NULL, 6, NULL, 13140, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 13140 15:07:51.578799 gettimeofday({1324652871, 578921}, NULL) = 0 15:07:51.579057 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.579413 gettimeofday({1324652871, 579534}, NULL) = 0 15:07:51.579654 splice(7, NULL, 9, NULL, 1073683588, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 14600 15:07:51.579927 splice(8, NULL, 6, NULL, 14600, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 14600 15:07:51.580260 gettimeofday({1324652871, 580389}, NULL) = 0 
15:07:51.580526 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.580952 gettimeofday({1324652871, 581073}, NULL) = 0 15:07:51.581195 splice(7, NULL, 9, NULL, 1073668988, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 16060 15:07:51.581491 splice(8, NULL, 6, NULL, 16060, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 16060 15:07:51.581801 gettimeofday({1324652871, 581930}, NULL) = 0 15:07:51.582173 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.582475 gettimeofday({1324652871, 582597}, NULL) = 0 15:07:51.582719 splice(7, NULL, 9, NULL, 1073652928, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 17520 15:07:51.582998 splice(8, NULL, 6, NULL, 17520, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 17520 15:07:51.583345 gettimeofday({1324652871, 583476}, NULL) = 0 15:07:51.583725 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.584001 gettimeofday({1324652871, 584131}, NULL) = 0 15:07:51.584329 splice(7, NULL, 9, NULL, 1073635408, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 29200 15:07:51.584599 splice(8, NULL, 6, NULL, 29200, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 24820 15:07:51.584968 epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLOUT, {u32=6, u64=6}}) = 0 Here FD6 was enabled on output since some data remained in the pipe. 
15:07:51.585244 gettimeofday({1324652871, 585374}, NULL) = 0 15:07:51.585583 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.585852 gettimeofday({1324652871, 585972}, NULL) = 0 15:07:51.586093 splice(7, NULL, 9, NULL, 1073606208, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 45260 15:07:51.586384 splice(8, NULL, 6, NULL, 49640, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.586668 gettimeofday({1324652871, 586789}, NULL) = 0 15:07:51.586908 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.587246 gettimeofday({1324652871, 587367}, NULL) = 0 15:07:51.587488 splice(7, NULL, 9, NULL, 1073560948, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 20440 15:07:51.587764 splice(8, NULL, 6, NULL, 70080, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.588045 gettimeofday({1324652871, 588166}, NULL) = 0 15:07:51.588284 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.588641 gettimeofday({1324652871, 588762}, NULL) = 0 15:07:51.588882 splice(7, NULL, 9, NULL, 1073540508, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 23360 15:07:51.589157 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.589438 gettimeofday({1324652871, 589558}, NULL) = 0 15:07:51.589677 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.590041 gettimeofday({1324652871, 590163}, NULL) = 0 15:07:51.590284 splice(7, NULL, 9, NULL, 1073517148, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.590533 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.590781 epoll_ctl(3, EPOLL_CTL_DEL, 7, {0, {u32=7, u64=7}}) = 0 15:07:51.591054 gettimeofday({1324652871, 591173}, NULL) = 0 15:07:51.591290 epoll_wait(3, {{EPOLLOUT, {u32=6, u64=6}}}, 8, 1000) = 1 write reported on FD6 15:07:51.623172 gettimeofday({1324652871, 623304}, NULL) = 0 
15:07:51.623430 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 74460 74kB sent, 18980 bytes remaining, FD status not changed. 15:07:51.623739 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0 15:07:51.624009 gettimeofday({1324652871, 624128}, NULL) = 0 15:07:51.624246 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.624510 gettimeofday({1324652871, 624629}, NULL) = 0 15:07:51.624749 splice(7, NULL, 9, NULL, 1073517148, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 59860 15:07:51.625067 splice(8, NULL, 6, NULL, 78840, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.625357 gettimeofday({1324652871, 625479}, NULL) = 0 15:07:51.625597 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.625970 gettimeofday({1324652871, 626092}, NULL) = 0 15:07:51.626213 splice(7, NULL, 9, NULL, 1073457288, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 14600 15:07:51.626490 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.626773 gettimeofday({1324652871, 626894}, NULL) = 0 15:07:51.627013 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1 15:07:51.627387 gettimeofday({1324652871, 627509}, NULL) = 0 15:07:51.627630 splice(7, NULL, 9, NULL, 1073442688, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.627883 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:51.628131 epoll_ctl(3, EPOLL_CTL_DEL, 7, {0, {u32=7, u64=7}}) = 0 And here we're blocked on both FDs. We disable polling for reads on the input FD since the pipe is full. We have still not disabled events on FD6, which is not reported anymore. 
15:07:51.628389 gettimeofday({1324652871, 628508}, NULL) = 0 15:07:51.628625 epoll_wait(3, {}, 8, 1000) = 0 15:07:52.629938 gettimeofday({1324652872, 630103}, NULL) = 0 15:07:52.630232 gettimeofday({1324652872, 630364}, NULL) = 0 15:07:52.630481 epoll_wait(3, {}, 8, 1000) = 0 15:07:53.631760 gettimeofday({1324652873, 631895}, NULL) = 0 15:07:53.632018 gettimeofday({1324652873, 632146}, NULL) = 0 15:07:53.632289 epoll_wait(3, {}, 8, 1000) = 0 15:07:54.633641 gettimeofday({1324652874, 633770}, NULL) = 0 15:07:54.633896 gettimeofday({1324652874, 634017}, NULL) = 0 15:07:54.634135 epoll_wait(3, {}, 8, 1000) = 0 15:07:55.635457 gettimeofday({1324652875, 635582}, NULL) = 0 15:07:55.635705 gettimeofday({1324652875, 635824}, NULL) = 0 15:07:55.635943 epoll_wait(3, {}, 8, 930) = 0 15:07:56.567197 gettimeofday({1324652876, 567322}, NULL) = 0 15:07:56.567448 gettimeofday({1324652876, 567568}, NULL) = 0 15:07:56.567685 epoll_wait(3, {}, 8, 1000) = 0 15:07:57.568855 gettimeofday({1324652877, 569012}, NULL) = 0 15:07:57.569145 gettimeofday({1324652877, 569278}, NULL) = 0 15:07:57.569396 epoll_wait(3, {}, 8, 1000) = 0 15:07:58.570760 gettimeofday({1324652878, 570899}, NULL) = 0 15:07:58.571023 gettimeofday({1324652878, 571154}, NULL) = 0 15:07:58.571270 epoll_wait(3, {}, 8, 999) = 0 15:07:59.571418 gettimeofday({1324652879, 571546}, NULL) = 0 Timeout strikes, we're about to close. It happens that the FD is re-enabled at this point. That doesn't really matter anyway. 
15:07:59.571688 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0 15:07:59.571964 gettimeofday({1324652879, 572084}, NULL) = 0 15:07:59.572227 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 57) = 1 15:07:59.572499 gettimeofday({1324652879, 572619}, NULL) = 0 15:07:59.572742 splice(7, NULL, 9, NULL, 1073442688, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 15:07:59.573010 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) The FD 6 was still blocked. 15:07:59.573269 epoll_ctl(3, EPOLL_CTL_DEL, 7, {0, {u32=7, u64=7}}) = 0 15:07:59.573529 gettimeofday({1324652879, 573649}, NULL) = 0 15:07:59.573766 epoll_wait(3, {}, 8, 56) = 0 15:07:59.630081 gettimeofday({1324652879, 630202}, NULL) = 0 15:07:59.630327 setsockopt(6, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0 15:07:59.630594 close(6) = 0 15:07:59.630909 close(7) = 0 FDs are closed. 15:07:59.631424 close(9) = 0 15:07:59.631746 close(8) = 0 And pipe is killed because it contains remaining data. Regards, Willy ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets' 2012-05-17 21:14 ` Willy Tarreau 2012-05-17 21:40 ` Eric Dumazet 2012-05-17 21:47 ` Willy Tarreau @ 2012-05-17 22:14 ` Eric Dumazet 2012-05-17 22:29 ` Willy Tarreau 2 siblings, 1 reply; 61+ messages in thread From: Eric Dumazet @ 2012-05-17 22:14 UTC (permalink / raw) To: Willy Tarreau; +Cc: netdev On Thu, 2012-05-17 at 23:14 +0200, Willy Tarreau wrote: > --- > > From 39c3f73176118a274ec9dea9c22c83d97a7fbfe0 Mon Sep 17 00:00:00 2001 > From: Willy Tarreau <w@1wt.eu> > Date: Thu, 17 May 2012 22:43:20 +0200 > Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions > > Since recent changes on TCP splicing (starting with commits 2f533844 > and 35f9c09f), I started seeing massive stalls when forwarding traffic > between two sockets using splice() when pipe buffers were larger than > socket buffers. > > Latest changes (net: netdev_alloc_skb() use build_skb()) made the > problem even more apparent. > > The reason seems to be that if do_tcp_sendpages() fails on out of memory > condition without being able to send at least one byte, tcp_push() is not > called and the buffers cannot be flushed. > > After applying the attached patch, I cannot reproduce the stalls at all > and the data rate it perfectly stable and steady under any condition > which previously caused the problem to be permanent. > > The issue seems to have been there since before the kernel migrated to > git, which makes me think that the stalls I occasionally experienced > with tux during stress-tests years ago were probably related to the > same issue. > > This issue was first encountered on 3.0.31 and 3.2.17, so please backport > to -stable. 
>
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> Cc: <stable@vger.kernel.org>
> ---
>  net/ipv4/tcp.c |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 63ddaee..231fe53 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -917,8 +917,7 @@ new_segment:
>  wait_for_sndbuf:
>  	set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  wait_for_memory:
> -	if (copied)
> -		tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
> +	tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
>
>  	if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
>  		goto do_error;

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:14             ` Eric Dumazet
@ 2012-05-17 22:29               ` Willy Tarreau
  0 siblings, 0 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 22:29 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, netdev

David,

After several tests, we finally agreed on this one. I can't make the
transfers fail anymore. And they're damn fast thanks to Eric's recent
work!

Best regards,
Willy

---
From 39c3f73176118a274ec9dea9c22c83d97a7fbfe0 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 22:43:20 +0200
Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions

Since recent changes on TCP splicing (starting with commits 2f533844
and 35f9c09f), I started seeing massive stalls when forwarding traffic
between two sockets using splice() when pipe buffers were larger than
socket buffers.

Latest changes (net: netdev_alloc_skb() use build_skb()) made the
problem even more apparent.

The reason seems to be that if do_tcp_sendpages() fails on an out-of-memory
condition without being able to send at least one byte, tcp_push() is not
called and the buffers cannot be flushed.

After applying the attached patch, I cannot reproduce the stalls at all
and the data rate is perfectly stable and steady under any condition
which previously caused the problem to be permanent.

The issue seems to have been there since before the kernel migrated to
git, which makes me think that the stalls I occasionally experienced
with tux during stress-tests years ago were probably related to the
same issue.

This issue was first encountered on 3.0.31 and 3.2.17, so please backport
to -stable.
Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: <stable@vger.kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 63ddaee..231fe53 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -917,8 +917,7 @@ new_segment:
 wait_for_sndbuf:
 	set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
-	if (copied)
-		tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
+	tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);

 	if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 		goto do_error;

^ permalink raw reply related	[flat|nested] 61+ messages in thread
end of thread, other threads:[~2012-06-05  2:50 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-17 12:18 Stable regression with 'tcp: allow splice() to build full TSO packets' Willy Tarreau
2012-05-17 15:01 ` Willy Tarreau
2012-05-17 15:43 ` Eric Dumazet
2012-05-17 15:56 ` Willy Tarreau
2012-05-17 16:33 ` Eric Dumazet
2012-05-17 16:40 ` Willy Tarreau
2012-05-17 16:47 ` Eric Dumazet
2012-05-17 16:49 ` Eric Dumazet
2012-05-17 17:22 ` Willy Tarreau
2012-05-17 17:34 ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
2012-05-17 17:45 ` Willy Tarreau
2012-06-04 12:39 ` Michael S. Tsirkin
2012-06-04 12:44 ` Willy Tarreau
2012-05-17 19:53 ` David Miller
2012-05-18  4:41 ` Eric Dumazet
2012-06-04 12:37 ` Michael S. Tsirkin
2012-06-04 13:06 ` Eric Dumazet
2012-06-04 13:41 ` Michael S. Tsirkin
2012-06-04 14:01 ` Eric Dumazet
2012-06-04 14:09 ` Eric Dumazet
2012-06-04 14:17 ` Michael S. Tsirkin
2012-06-04 15:01 ` Eric Dumazet
2012-06-04 17:20 ` Michael S. Tsirkin
2012-06-04 17:44 ` Eric Dumazet
2012-06-04 18:16 ` Michael S. Tsirkin
2012-06-04 19:24 ` Eric Dumazet
2012-06-04 19:48 ` Michael S. Tsirkin
2012-06-04 19:56 ` Eric Dumazet
2012-06-04 21:20 ` Michael S. Tsirkin
2012-06-05  2:50 ` Eric Dumazet
2012-06-04 18:16 ` Michael S. Tsirkin
2012-06-04 19:29 ` Eric Dumazet
2012-06-04 19:43 ` Michael S. Tsirkin
2012-06-04 19:52 ` Eric Dumazet
2012-06-04 21:54 ` Michael S. Tsirkin
2012-06-05  2:46 ` Eric Dumazet
2012-06-04 19:56 ` Michael S. Tsirkin
2012-06-04 20:05 ` Eric Dumazet
2012-05-17 18:38 ` Stable regression with 'tcp: allow splice() to build full TSO packets' Ben Hutchings
2012-05-17 19:55 ` David Miller
2012-05-17 20:04 ` Willy Tarreau
2012-05-17 20:07 ` David Miller
2012-05-17 20:41 ` Eric Dumazet
2012-05-17 21:14 ` Willy Tarreau
2012-05-17 21:40 ` Eric Dumazet
2012-05-17 21:50 ` Eric Dumazet
2012-05-17 21:57 ` Willy Tarreau
2012-05-17 22:01 ` Eric Dumazet
2012-05-17 22:10 ` Eric Dumazet
2012-05-17 22:16 ` Willy Tarreau
2012-05-17 22:22 ` Eric Dumazet
2012-05-17 22:24 ` Willy Tarreau
2012-05-17 22:25 ` David Miller
2012-05-17 22:30 ` Willy Tarreau
2012-05-17 22:35 ` David Miller
2012-05-17 22:49 ` Willy Tarreau
2012-05-17 22:27 ` Joe Perches
2012-05-17 21:54 ` Willy Tarreau
2012-05-17 21:47 ` Willy Tarreau
2012-05-17 22:14 ` Eric Dumazet
2012-05-17 22:29 ` Willy Tarreau