* Stable regression with 'tcp: allow splice() to build full TSO packets'
@ 2012-05-17 12:18 Willy Tarreau
  2012-05-17 15:01 ` Willy Tarreau
  2012-05-17 20:41 ` Eric Dumazet
  0 siblings, 2 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 12:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Hi Eric,

I'm facing a regression in stable 3.2.17 and 3.0.31 which is
exhibited by your patch 'tcp: allow splice() to build full TSO
packets' which unfortunately I am very interested in !

What I'm observing is that TCP transmits using splice() stall
quite quickly if I'm using pipes larger than 64kB (even 65537
is enough to reliably observe the stall).
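
For reference, the oversized pipe is simply obtained with fcntl(F_SETPIPE_SZ),
which is the "fcntl64(36, 0x407, 0x20000) = 131072" call visible in the haproxy
trace below. A minimal sketch (not haproxy's actual code, the 128kB size is
only an example) :

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int pfd[2];
        int sz;

        if (pipe(pfd) < 0)
                return 1;

        /* returns the size actually granted (may be rounded up) */
        sz = fcntl(pfd[0], F_SETPIPE_SZ, 128 * 1024);
        if (sz < 0)
                perror("F_SETPIPE_SZ");
        else
                printf("pipe grown to %d bytes\n", sz);

        close(pfd[0]);
        close(pfd[1]);
        return 0;
}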

I'm seeing this on haproxy running on a small ARM machine (a
dockstar), which exchanges data through a gig switch with my
development PC. The NIC (mv643xx) doesn't support TSO but has
GSO enabled. If I disable GSO, the problem remains. I can however
make the problem disappear by disabling SG or Tx checksumming.
BTW, using recv/send() instead of splice() also gets rid of the
problem.

I can also reduce the risk of seeing the problem by increasing
the default TCP buffer sizes in tcp_wmem. By default I'm running
at 16kB, but if I increase the output buffer size above the pipe
size, the problem *seems* to disappear though I can't be certain,
since larger buffers generally mean the problem takes longer to
appear, probably because the buffers don't need to
be filled. Still I'm certain that with 64k TCP buffers and 128k
pipes I'm still seeing it.
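
To make the comparison concrete, both sizes can be read back at runtime like
this (an illustrative sketch only; report_sizes and the fd names are made up) :

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>

/* sock : connected TCP socket, pipe_rfd : read side of the splice pipe */
static void report_sizes(int sock, int pipe_rfd)
{
        int sndbuf = 0;
        socklen_t len = sizeof(sndbuf);

        /* reflects the kernel's current send buffer (starts at tcp_wmem[1],
         * auto-tuned unless SO_SNDBUF was explicitly set)
         */
        getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);

        printf("sndbuf=%d pipe=%d\n", sndbuf, fcntl(pipe_rfd, F_GETPIPE_SZ));
}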

With strace, I'm seeing data fill up the pipe with the splice()
call responsible for pushing the data to the output socket returning
-1 EAGAIN. During this time, the client receives no data.

Something bugs me, I have tested with a dummy server of mine,
httpterm, which uses tee+splice() to push data outside, and it
has no problem filling the gig pipe, and correctly recovers
from the EAGAIN :

  send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
  tee(0x3, 0x6, 0x10000, 0x2)             = 42552
  splice(0x5, 0, 0xd, 0, 0xa00000, 0x2)   = 14440
  tee(0x3, 0x6, 0x10000, 0x2)             = 13880
  splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
  ...
  tee(0x3, 0x6, 0x10000, 0x2)             = 13880
  splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = 51100
  tee(0x3, 0x6, 0x10000, 0x2)             = 50744
  splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2)   = 32120
  tee(0x3, 0x6, 0x10000, 0x2)             = 30264
  splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)

etc...
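
The logic behind that trace is roughly the following (a simplified sketch,
not the actual httpterm code; the fd names are illustrative) :

#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

/* ref_r : read side of the pipe holding the canned response,
 * work_w/work_r : scratch pipe, client : non-blocking client socket.
 */
static ssize_t push_chunk(int ref_r, int work_w, int work_r, int client)
{
        ssize_t teed, sent;

        /* duplicate up to 64kB of the reference data without consuming it */
        teed = tee(ref_r, work_w, 65536, SPLICE_F_NONBLOCK);
        if (teed <= 0)
                return teed;

        /* move it from the scratch pipe to the socket; EAGAIN means the
         * socket buffer is full and we must wait before retrying.
         */
        sent = splice(work_r, NULL, client, NULL, teed, SPLICE_F_NONBLOCK);
        if (sent < 0 && errno == EAGAIN)
                return 0;
        return sent;
}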

It's only with haproxy which uses splice() to copy data between
two sockets that I'm getting the issue (data forwarded from fd 0xe
to fd 0x6) :

  16:03:17.797144 pipe([36, 37])          = 0
  16:03:17.797318 fcntl64(36, 0x407 /* F_??? */, 0x20000) = 131072 ## note: fcntl(F_SETPIPE_SZ, 128k)
  16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220
  16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3) = 10220
  16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0
  16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
  16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0
  16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060
  16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3) = 16060
  16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0
  16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 4
  16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0
  16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520
  16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3) = 17520
  16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0
  16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
  16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0
  16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500
  16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3) = 23360
  16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378) = 0       ## note: epoll_ctl(ADD, fd=6, dir=OUT).
  16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0
  16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
  16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0
  16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900
  16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EAGAIN (Resource temporarily unavailable)

So output fd 6 hangs here and does not appear in the trace again until
this point, where I pressed Ctrl-C to stop the test :

  16:03:24.740985 gettimeofday({1324652604, 741042}, NULL) = 0
  16:03:24.741148 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
  16:03:24.951762 gettimeofday({1324652604, 951838}, NULL) = 0
  16:03:24.951956 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EPIPE (Broken pipe)

I tried disabling LRO/GRO at the input interface (which happens to be
the same) to see if fragmentation of input data had any impact on this
but nothing changes.

Please note that I'm not even certain the patch is the culprit; I
suspect that by improving splice() efficiency, it might make a
latent issue more visible. I have no data to back this
feeling, but nothing strikes me in your patch.

I don't know what I can do to troubleshoot this issue. I don't want
to pollute the list with network captures nor strace outputs, but I
have them if you're interested in verifying a few things.

I have another platform available for a test (Atom+82574L supporting
TSO). I'll rebuild and boot on this one to see if I observe the same
behaviour.

If you have any suggestions about things to check or tweaks to change
in the code, I'm quite open to experiment.

Best regards,
Willy


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 12:18 Stable regression with 'tcp: allow splice() to build full TSO packets' Willy Tarreau
@ 2012-05-17 15:01 ` Willy Tarreau
  2012-05-17 15:43   ` Eric Dumazet
  2012-05-17 19:55   ` David Miller
  2012-05-17 20:41 ` Eric Dumazet
  1 sibling, 2 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 15:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Hi Eric,

On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> Hi Eric,
> 
> I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> exhibited by your patch 'tcp: allow splice() to build full TSO
> packets' which unfortunately I am very interested in !
> 
> What I'm observing is that TCP transmits using splice() stall
> quite quickly if I'm using pipes larger than 64kB (even 65537
> is enough to reliably observe the stall).
(...)

I managed to fix the issue and I really think that the fix makes sense.
I'm appending the patch, please could you review it and if approved,
push it for inclusion ?

BTW, your patch significantly improves performance here. On this
machine I was reaching max 515 Mbps of proxied traffic, and with
the patch I reach 665 Mbps with the same test ! I think you managed
to fix what caused splice() to always be slightly slower than
recv/send() for a long time in my tests with a number of NICs !

Thanks,
Willy

-------

From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 16:48:56 +0200
Subject: [PATCH] tcp: force push data out when buffers are missing

Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
significantly improved splice() performance for some workloads but
caused stalls when pipe buffers were larger than socket buffers.

The issue seems to happen when no data can be copied at all due to
lack of buffers, which results in pending data never being pushed.

This change checks if all pending data has been pushed or not and
pushes them when waiting for send buffers.
---
 net/ipv4/tcp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 80b988f..f6d9e5f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -850,7 +850,7 @@ new_segment:
 wait_for_sndbuf:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
-		if (copied)
+		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
 			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
 
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
-- 
1.7.2.1.45.g54fbc


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 15:01 ` Willy Tarreau
@ 2012-05-17 15:43   ` Eric Dumazet
  2012-05-17 15:56     ` Willy Tarreau
  2012-05-17 19:55   ` David Miller
  1 sibling, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 15:43 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 17:01 +0200, Willy Tarreau wrote:
> Hi Eric,
> 
> On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> > exhibited by your patch 'tcp: allow splice() to build full TSO
> > packets' which unfortunately I am very interested in !
> > 
> > What I'm observing is that TCP transmits using splice() stall
> > quite quickly if I'm using pipes larger than 64kB (even 65537
> > is enough to reliably observe the stall).
> (...)
> 
> I managed to fix the issue and I really think that the fix makes sense.
> I'm appending the patch, please could you review it and if approved,
> push it for inclusion ?
> 
> BTW, your patch significantly improves performance here. On this
> machine I was reaching max 515 Mbps of proxied traffic, and with
> the patch I reach 665 Mbps with the same test ! I think you managed
> to fix what caused splice() to always be slightly slower than
> recv/send() for a long time in my tests with a number of NICs !
> 

What NIC do you use exactly ?

With commit 1d0c0b328a6 in net-next
(net: makes skb_splice_bits() aware of skb->head_frag)
you'll be able to get even more speed, if the NIC uses a frag to hold the frame.

> Thanks,
> Willy
> 
> -------
> 
> From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 16:48:56 +0200
> Subject: [PATCH] tcp: force push data out when buffers are missing
> 
> Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> significantly improved splice() performance for some workloads but
> caused stalls when pipe buffers were larger than socket buffers.
> 

But... it seems this bug was already there ? Having more data in the pipe
only triggers it more often ?

> The issue seems to happen when no data can be copied at all due to
> lack of buffers, which results in pending data never being pushed.
> 
> This change checks if all pending data has been pushed or not and
> pushes them when waiting for send buffers.
> ---
>  net/ipv4/tcp.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 80b988f..f6d9e5f 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -850,7 +850,7 @@ new_segment:
>  wait_for_sndbuf:
>  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  wait_for_memory:
> -		if (copied)
> +		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
>  			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
>  
>  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)

Can you check you have commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 
( tcp: tcp_sendpages() should call tcp_push() once )


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 15:43   ` Eric Dumazet
@ 2012-05-17 15:56     ` Willy Tarreau
  2012-05-17 16:33       ` Eric Dumazet
  2012-05-17 18:38       ` Stable regression with 'tcp: allow splice() to build full TSO packets' Ben Hutchings
  0 siblings, 2 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 15:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 17:01 +0200, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> > > Hi Eric,
> > > 
> > > I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> > > exhibited by your patch 'tcp: allow splice() to build full TSO
> > > packets' which unfortunately I am very interested in !
> > > 
> > > What I'm observing is that TCP transmits using splice() stall
> > > quite quickly if I'm using pipes larger than 64kB (even 65537
> > > is enough to reliably observe the stall).
> > (...)
> > 
> > I managed to fix the issue and I really think that the fix makes sense.
> > I'm appending the patch, please could you review it and if approved,
> > push it for inclusion ?
> > 
> > BTW, your patch significantly improves performance here. On this
> > machine I was reaching max 515 Mbps of proxied traffic, and with
> > the patch I reach 665 Mbps with the same test ! I think you managed
> > to fix what caused splice() to always be slightly slower than
> > recv/send() for a long time in my tests with a number of NICs !
> > 
> 
> What NIC do you use exactly ?

It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
dreamplugs and almost all ARM-based cheap NAS boxes.

> With commit 1d0c0b328a6 in net-next
> (net: makes skb_splice_bits() aware of skb->head_frag)
> You'll be able to get even more speed, if NIC uses frag to hold frame.

I'm going to check this now, sounds interesting :-)

The NIC does not support TSO but I've seen an alternate driver for this
NIC which pretends to do TSO and in fact builds header frags so that the
NIC is able to send all frames at once. I think it's already what GSO is
doing but I'm wondering whether it would be possible to get more speed
by doing this than by relying on GSO to (possibly) split the frags earlier.

Note that I'm not quite skilled on this, so do not hesitate to correct me
if I'm saying stupid things.

(...)

> > Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> > significantly improved splice() performance for some workloads but
> > caused stalls when pipe buffers were larger than socket buffers.
> 
> But... This seems bug was already there ? Only having more data in pipe
> trigger it more often ?

I honestly don't know. As I said in the initial mail, I suspect this is
the case because I don't see in your patch what could cause the bug, but
I suspect it makes it more frequent. Still, I could not manage to reproduce
the bug without your patch. So maybe we switched from just below a limit
to just over.

> > The issue seems to happen when no data can be copied at all due to
> > lack of buffers, which results in pending data never being pushed.
> > 
> > This change checks if all pending data has been pushed or not and
> > pushes them when waiting for send buffers.
> > ---
> >  net/ipv4/tcp.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 80b988f..f6d9e5f 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -850,7 +850,7 @@ new_segment:
> >  wait_for_sndbuf:
> >  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> >  wait_for_memory:
> > -		if (copied)
> > +		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
> >  			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
> >  
> >  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
> 
> Can you check you have commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 
> ( tcp: tcp_sendpages() should call tcp_push() once )

Yes I have it, in fact I'm on -stable (3.0.31) which includes your backport
of the two patches in a single one. It's just a few lines below, out of the
context of this patch.

Best regards,
Willy


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 15:56     ` Willy Tarreau
@ 2012-05-17 16:33       ` Eric Dumazet
  2012-05-17 16:40         ` Willy Tarreau
  2012-05-17 18:38       ` Stable regression with 'tcp: allow splice() to build full TSO packets' Ben Hutchings
  1 sibling, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 16:33 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:

> It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
> dreamplugs and almost all ARM-based cheap NAS boxes.
> 
> > With commit 1d0c0b328a6 in net-next
> > (net: makes skb_splice_bits() aware of skb->head_frag)
> > You'll be able to get even more speed, if NIC uses frag to hold frame.
> 
> I'm going to check this now, sounds interesting :-)

splice(socket -> pipe) does a copy of the frame from skb->head to a page
fragment.

With the latest patches (net-next), we can provide a frag for skb->head and
avoid this copy in splice(). You would have a true zero copy
socket->socket mode.

The tg3 driver uses this new thing; mv643xx_eth.c could be changed as well.


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 16:33       ` Eric Dumazet
@ 2012-05-17 16:40         ` Willy Tarreau
  2012-05-17 16:47           ` Eric Dumazet
  2012-05-17 16:49           ` Eric Dumazet
  0 siblings, 2 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 16:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, May 17, 2012 at 06:33:24PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> 
> > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> > driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
> > dreamplugs and almost all ARM-based cheap NAS boxes.
> > 
> > > With commit 1d0c0b328a6 in net-next
> > > (net: makes skb_splice_bits() aware of skb->head_frag)
> > > You'll be able to get even more speed, if NIC uses frag to hold frame.
> > 
> > I'm going to check this now, sounds interesting :-)
> 
> splice(socket -> pipe) does a copy of frame from skb->head to a page
> fragment.
> 
> With latest patches (net-next), we can provide a frag for skb->head and
> avoid this copy in splice(). You would have a true zero copy
> socket->socket mode.

That's what I've read in your commit messages. Indeed, we've been waiting
for this to happen for a long time. It should bring a real benefit on small
devices like the one I'm playing with which have very low memory bandwidth,
because it's a shame to copy the data and thrash the cache when we don't
need to see them.

I'm currently trying to get mainline to boot then I'll switch to net-next.

> tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well.

I'll try to port your patch (8d4057 tg3: provide frags as skb head) to
mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
send a patch :-)

Cheers,
Willy


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 16:40         ` Willy Tarreau
@ 2012-05-17 16:47           ` Eric Dumazet
  2012-05-17 16:49           ` Eric Dumazet
  1 sibling, 0 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 16:47 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote:
> On Thu, May 17, 2012 at 06:33:24PM +0200, Eric Dumazet wrote:
> > On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> > > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> > 
> > > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> > > driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
> > > dreamplugs and almost all ARM-based cheap NAS boxes.
> > > 
> > > > With commit 1d0c0b328a6 in net-next
> > > > (net: makes skb_splice_bits() aware of skb->head_frag)
> > > > You'll be able to get even more speed, if NIC uses frag to hold frame.
> > > 
> > > I'm going to check this now, sounds interesting :-)
> > 
> > splice(socket -> pipe) does a copy of frame from skb->head to a page
> > fragment.
> > 
> > With latest patches (net-next), we can provide a frag for skb->head and
> > avoid this copy in splice(). You would have a true zero copy
> > socket->socket mode.
> 
> That's what I've read in your commit messages. Indeed, we've been waiting
> for this to happen for a long time. It should bring a real benefit on small
> devices like the one I'm playing with which have very low memory bandwidth,
> because it's a shame to copy the data and thrash the cache when we don't
> need to see them.
> 
> I'm currently trying to get mainline to boot then I'll switch to net-next.
> 
> > tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well.
> 
> I'll try to port your patch (8d4057 tg3: provide frags as skb head) to
> mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
> send a patch :-)

In fact I could post a generic patch to make netdev_alloc_skb() do this
automatically for all network drivers.

It's probably less than 30 lines of code.


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 16:40         ` Willy Tarreau
  2012-05-17 16:47           ` Eric Dumazet
@ 2012-05-17 16:49           ` Eric Dumazet
  2012-05-17 17:22             ` Willy Tarreau
  2012-05-17 17:34             ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
  1 sibling, 2 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 16:49 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote:
>  try to port your patch (8d4057 tg3: provide frags as skb head) to
> mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
> send a patch :-)


tg3 uses build_skb(), that's an API that allocates the skb after the frame
has been filled by the device, to get fewer cache misses in the RX path.
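
For illustration, the driver-side pattern looks roughly like this (a sketch
only, not the actual tg3 code; struct my_rx_buf is hypothetical):

#include <linux/skbuff.h>
#include <linux/etherdevice.h>

struct my_rx_buf {
        void *data;             /* start of the fragment; the NIC DMAed the
                                 * frame at data + NET_SKB_PAD */
        unsigned int frag_size; /* frame room + SKB_DATA_ALIGN(shared info) */
        struct net_device *dev;
};

static struct sk_buff *my_rx_to_skb(struct my_rx_buf *buf, unsigned int len)
{
        struct sk_buff *skb = build_skb(buf->data, buf->frag_size);

        if (!skb)
                return NULL;            /* caller recycles the fragment */

        skb_reserve(skb, NET_SKB_PAD);  /* headroom reserved at alloc time */
        skb_put(skb, len);              /* frame length reported by the NIC */
        skb->protocol = eth_type_trans(skb, buf->dev);
        return skb;
}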

But for most devices, netdev_alloc_skb() would be more than enough.

I'll send a patch once tested.

Thanks


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 16:49           ` Eric Dumazet
@ 2012-05-17 17:22             ` Willy Tarreau
  2012-05-17 17:34             ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
  1 sibling, 0 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 17:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, May 17, 2012 at 06:49:47PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote:
> >  try to port your patch (8d4057 tg3: provide frags as skb head) to
> > mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
> > send a patch :-)
> 
> 
> tg3 uses build_skb(), thats an API that allocates the skb after frame is
> filled by the device, to have less cache misses in RX path.
> 
> But for most devices, netdev_alloc_skb() would be more than enough.

OK, so that's out of reach for me right now.

> I'll send a patch once tested.

I'll happily give it a try.

BTW, net-next gave me around 13-14% improvement over 3.0/3.2+splice
patches ! Now I'm seeing splice() being 75% faster than recv/send, while
it used to be slower on this hardware not that long ago! That's really
impressive, great work!

Willy


* [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 16:49           ` Eric Dumazet
  2012-05-17 17:22             ` Willy Tarreau
@ 2012-05-17 17:34             ` Eric Dumazet
  2012-05-17 17:45               ` Willy Tarreau
                                 ` (2 more replies)
  1 sibling, 3 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 17:34 UTC (permalink / raw)
  To: Willy Tarreau, David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Please note I haven't tested this patch yet, lacking hardware for this.

(tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
ixgbe uses fragments...)

Any volunteer ?

Thanks

[PATCH net-next] net: netdev_alloc_skb() use build_skb()

netdev_alloc_skb() is used by network drivers in their RX path to
allocate an skb to receive an incoming frame.

With recent skb->head_frag infrastructure, it makes sense to change
netdev_alloc_skb() to use build_skb() and a frag allocator.

This permits a zero copy splice(socket->pipe), and better GRO or TCP
coalescing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/skbuff.c |   32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2a18719..c02a8ec 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -293,6 +293,12 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 }
 EXPORT_SYMBOL(build_skb);
 
+struct netdev_alloc_cache {
+	struct page *page;
+	unsigned int offset;
+};
+static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
+
 /**
  *	__netdev_alloc_skb - allocate an skbuff for rx on a specific device
  *	@dev: network device to receive on
@@ -310,8 +316,32 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 		unsigned int length, gfp_t gfp_mask)
 {
 	struct sk_buff *skb;
+	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
+			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
+	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+		struct netdev_alloc_cache *nc;
+		void *data = NULL;
+
+		nc = &get_cpu_var(netdev_alloc_cache);
+		if (!nc->page) {
+refill:			nc->page = alloc_page(gfp_mask);
+			nc->offset = 0;
+		}
+		if (likely(nc->page)) {
+			if (nc->offset + fragsz > PAGE_SIZE) {
+				put_page(nc->page);
+				goto refill;
+			}
+			data = page_address(nc->page) + nc->offset;
+			nc->offset += fragsz;
+			get_page(nc->page);
+		}
+		put_cpu_var(netdev_alloc_cache);
+		skb = data ? build_skb(data, fragsz) : NULL;
+	} else {
+		skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
+	}
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;


* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 17:34             ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
@ 2012-05-17 17:45               ` Willy Tarreau
  2012-06-04 12:39                 ` Michael S. Tsirkin
  2012-05-17 19:53               ` David Miller
  2012-06-04 12:37               ` Michael S. Tsirkin
  2 siblings, 1 reply; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 17:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Please note I havent tested yet this patch, lacking hardware for this.
> 
> (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
> ixgbe uses fragments...)
> 
> Any volunteer ?
> 
> Thanks
> 
> [PATCH net-next] net: netdev_alloc_skb() use build_skb()
> 
> netdev_alloc_skb() is used by networks driver in their RX path to
> allocate an skb to receive an incoming frame.
> 
> With recent skb->head_frag infrastructure, it makes sense to change
> netdev_alloc_skb() to use build_skb() and a frag allocator.
> 
> This permits a zero copy splice(socket->pipe), and better GRO or TCP
> coalescing.

Impressed !

For the first time I could proxy HTTP traffic at gigabit speed on this
little box powered by USB ! I've long believed that proper splicing
would make this possible and now I'm seeing it is. Congrats Eric !

I'm still observing some stalls on medium-sized files (eg: 100k,
smaller than the pipe size, don't know yet if there is any relation).
I'll check closer and try to report something more exploitable.

Cheers,
Willy


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 15:56     ` Willy Tarreau
  2012-05-17 16:33       ` Eric Dumazet
@ 2012-05-17 18:38       ` Ben Hutchings
  1 sibling, 0 replies; 61+ messages in thread
From: Ben Hutchings @ 2012-05-17 18:38 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Eric Dumazet, netdev

On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
[...]
> The NIC does not support TSO but I've seen an alternate driver for this
> NIC which pretends to do TSO and in fact builds header frags so that the
> NIC is able to send all frames at once. I think it's already what GSO is
> doing but I'm wondering whether it would be possible to get more speed
> by doing this than by relying on GSO to (possibly) split the frags earlier.
[...]

Yes, GSO has some overhead for skb allocation and additional function
calls that you can avoid by doing 'TSO' in the driver.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 17:34             ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
  2012-05-17 17:45               ` Willy Tarreau
@ 2012-05-17 19:53               ` David Miller
  2012-05-18  4:41                 ` Eric Dumazet
  2012-06-04 12:37               ` Michael S. Tsirkin
  2 siblings, 1 reply; 61+ messages in thread
From: David Miller @ 2012-05-17 19:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: w, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 17 May 2012 19:34:16 +0200

> [PATCH net-next] net: netdev_alloc_skb() use build_skb()
> 
> netdev_alloc_skb() is used by networks driver in their RX path to
> allocate an skb to receive an incoming frame.
> 
> With recent skb->head_frag infrastructure, it makes sense to change
> netdev_alloc_skb() to use build_skb() and a frag allocator.
> 
> This permits a zero copy splice(socket->pipe), and better GRO or TCP
> coalescing.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, we can sort out any fallout very easily before 3.5 is released.

Awesome work Eric.


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 15:01 ` Willy Tarreau
  2012-05-17 15:43   ` Eric Dumazet
@ 2012-05-17 19:55   ` David Miller
  2012-05-17 20:04     ` Willy Tarreau
  1 sibling, 1 reply; 61+ messages in thread
From: David Miller @ 2012-05-17 19:55 UTC (permalink / raw)
  To: w; +Cc: eric.dumazet, netdev

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 17:01:57 +0200

>>From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 16:48:56 +0200
> Subject: [PATCH] tcp: force push data out when buffers are missing
> 
> Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> significantly improved splice() performance for some workloads but
> caused stalls when pipe buffers were larger than socket buffers.
> 
> The issue seems to happen when no data can be copied at all due to
> lack of buffers, which results in pending data never being pushed.
> 
> This change checks if all pending data has been pushed or not and
> pushes them when waiting for send buffers.

Eric, please indicate whether we need Willy's patch here.

I want to propagate this fix as fast as possible if so.


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 19:55   ` David Miller
@ 2012-05-17 20:04     ` Willy Tarreau
  2012-05-17 20:07       ` David Miller
  0 siblings, 1 reply; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 20:04 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

Hi David,

On Thu, May 17, 2012 at 03:55:03PM -0400, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 17:01:57 +0200
> 
> >>From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Thu, 17 May 2012 16:48:56 +0200
> > Subject: [PATCH] tcp: force push data out when buffers are missing
> > 
> > Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> > significantly improved splice() performance for some workloads but
> > caused stalls when pipe buffers were larger than socket buffers.
> > 
> > The issue seems to happen when no data can be copied at all due to
> > lack of buffers, which results in pending data never being pushed.
> > 
> > This change checks if all pending data has been pushed or not and
> > pushes them when waiting for send buffers.
> 
> Eric, please indicate whether we need Willy's patch here.
> 
> I want to propagate this fix as fast as possible if so.

I think you should hold off for now, because it's possible that my patch
hides another issue instead of fixing it.

I'm having the same stall issue again since I applied Eric's build_skb
patch, but not for all data sizes. So if the same issue is still there,
it's possible that we're playing hide-and-seek with it. That's rather
strange.

Thanks,
Willy


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 20:04     ` Willy Tarreau
@ 2012-05-17 20:07       ` David Miller
  0 siblings, 0 replies; 61+ messages in thread
From: David Miller @ 2012-05-17 20:07 UTC (permalink / raw)
  To: w; +Cc: eric.dumazet, netdev

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 22:04:04 +0200

> I'm having the same stall issue again since I applied Eric's build_skb
> patch, but not for all data sizes. So if the same issue is still there,
> it's possible that we're playing hide-and-seek with it. That's rather
> strange.

Ok, a Heisenbug :-)  Let me know when you guys resolve this.


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 12:18 Stable regression with 'tcp: allow splice() to build full TSO packets' Willy Tarreau
  2012-05-17 15:01 ` Willy Tarreau
@ 2012-05-17 20:41 ` Eric Dumazet
  2012-05-17 21:14   ` Willy Tarreau
  1 sibling, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 20:41 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 14:18 +0200, Willy Tarreau wrote:
> Hi Eric,
> 
> I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> exhibited by your patch 'tcp: allow splice() to build full TSO
> packets' which unfortunately I am very interested in !
> 
> What I'm observing is that TCP transmits using splice() stall
> quite quickly if I'm using pipes larger than 64kB (even 65537
> is enough to reliably observe the stall).
> 
> I'm seeing this on haproxy running on a small ARM machine (a
> dockstar), which exchanges data through a gig switch with my
> development PC. The NIC (mv643xx) doesn't support TSO but has
> GSO enabled. If I disable GSO, the problem remains. I can however
> make the problem disappear by disabling SG or Tx checksumming.
> BTW, using recv/send() instead of splice() also gets rid of the
> problem.
> 
> I can also reduce the risk of seeing the problem by increasing
> the default TCP buffer sizes in tcp_wmem. By default I'm running
> at 16kB, but if I increase the output buffer size above the pipe
> size, the problem *seems* to disappear though I can't be certain,
> since larger buffers generally means the problem takes longer to
> appear, probably due to the fact that the buffers don't need to
> be filled. Still I'm certain that with 64k TCP buffers and 128k
> pipes I'm still seeing it.
> 
> With strace, I'm seeing data fill up the pipe with the splice()
> call responsible for pushing the data to the output socket returing
> -1 EAGAIN. During this time, the client receives no data.
> 
> Something bugs me, I have tested with a dummy server of mine,
> httpterm, which uses tee+splice() to push data outside, and it
> has no problem filling the gig pipe, and correctly recoverers
> from the EAGAIN :
> 
>   send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
>   tee(0x3, 0x6, 0x10000, 0x2)             = 42552
>   splice(0x5, 0, 0xd, 0, 0xa00000, 0x2)   = 14440
>   tee(0x3, 0x6, 0x10000, 0x2)             = 13880
>   splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
>   ...
>   tee(0x3, 0x6, 0x10000, 0x2)             = 13880
>   splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = 51100
>   tee(0x3, 0x6, 0x10000, 0x2)             = 50744
>   splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2)   = 32120
>   tee(0x3, 0x6, 0x10000, 0x2)             = 30264
>   splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
> 
> etc...
> 
> It's only with haproxy which uses splice() to copy data between
> two sockets that I'm getting the issue (data forwarded from fd 0xe
> to fd 0x6) :
> 
>   16:03:17.797144 pipe([36, 37])          = 0
>   16:03:17.797318 fcntl64(36, 0x407 /* F_??? */, 0x20000) = 131072 ## note: fcntl(F_SETPIPE_SZ, 128k)
>   16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220
>   16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3) = 10220
>   16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0
>   16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
>   16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0
>   16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060
>   16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3) = 16060
>   16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0
>   16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 4
>   16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0
>   16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520
>   16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3) = 17520
>   16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0
>   16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
>   16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0
>   16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500
>   16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3) = 23360
>   16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378) = 0       ## note: epoll_ctl(ADD, fd=6, dir=OUT).
>   16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0
>   16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
>   16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0
>   16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900
>   16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EAGAIN (Resource temporarily unavailable)
> 

Willy, you say output to fd 6 hangs, but splice() returns EAGAIN here ?
(because the socket buffer is full)

> So output fd 6 hangs here and will not appear anymore until
> here where I pressed Ctrl-C to stop the test :
> 

I just want to make sure it's not a userland error that now triggers much
faster than with prior kernels.

You drain bytes from fd 0xe to pipe buffers, but I don't see you check
write ability on the destination socket prior to the splice(pipe -> socket)


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 20:41 ` Eric Dumazet
@ 2012-05-17 21:14   ` Willy Tarreau
  2012-05-17 21:40     ` Eric Dumazet
                       ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 21:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, May 17, 2012 at 10:41:19PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 14:18 +0200, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> > exhibited by your patch 'tcp: allow splice() to build full TSO
> > packets' which unfortunately I am very interested in !
> > 
> > What I'm observing is that TCP transmits using splice() stall
> > quite quickly if I'm using pipes larger than 64kB (even 65537
> > is enough to reliably observe the stall).
> > 
> > I'm seeing this on haproxy running on a small ARM machine (a
> > dockstar), which exchanges data through a gig switch with my
> > development PC. The NIC (mv643xx) doesn't support TSO but has
> > GSO enabled. If I disable GSO, the problem remains. I can however
> > make the problem disappear by disabling SG or Tx checksumming.
> > BTW, using recv/send() instead of splice() also gets rid of the
> > problem.
> > 
> > I can also reduce the risk of seeing the problem by increasing
> > the default TCP buffer sizes in tcp_wmem. By default I'm running
> > at 16kB, but if I increase the output buffer size above the pipe
> > size, the problem *seems* to disappear though I can't be certain,
> > since larger buffers generally means the problem takes longer to
> > appear, probably due to the fact that the buffers don't need to
> > be filled. Still I'm certain that with 64k TCP buffers and 128k
> > pipes I'm still seeing it.
> > 
> > With strace, I'm seeing data fill up the pipe with the splice()
> > call responsible for pushing the data to the output socket returing
> > -1 EAGAIN. During this time, the client receives no data.
> > 
> > Something bugs me, I have tested with a dummy server of mine,
> > httpterm, which uses tee+splice() to push data outside, and it
> > has no problem filling the gig pipe, and correctly recoverers
> > from the EAGAIN :
> > 
> >   send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
> >   tee(0x3, 0x6, 0x10000, 0x2)             = 42552
> >   splice(0x5, 0, 0xd, 0, 0xa00000, 0x2)   = 14440
> >   tee(0x3, 0x6, 0x10000, 0x2)             = 13880
> >   splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
> >   ...
> >   tee(0x3, 0x6, 0x10000, 0x2)             = 13880
> >   splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = 51100
> >   tee(0x3, 0x6, 0x10000, 0x2)             = 50744
> >   splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2)   = 32120
> >   tee(0x3, 0x6, 0x10000, 0x2)             = 30264
> >   splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
> > 
> > etc...
> > 
> > It's only with haproxy which uses splice() to copy data between
> > two sockets that I'm getting the issue (data forwarded from fd 0xe
> > to fd 0x6) :
> > 
> >   16:03:17.797144 pipe([36, 37])          = 0
> >   16:03:17.797318 fcntl64(36, 0x407 /* F_??? */, 0x20000) = 131072 ## note: fcntl(F_SETPIPE_SZ, 128k)
> >   16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220
> >   16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3) = 10220
> >   16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0
> >   16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
> >   16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0
> >   16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060
> >   16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3) = 16060
> >   16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0
> >   16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 4
> >   16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0
> >   16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520
> >   16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3) = 17520
> >   16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0
> >   16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
> >   16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0
> >   16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500
> >   16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3) = 23360
> >   16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378) = 0       ## note: epoll_ctl(ADD, fd=6, dir=OUT).
> >   16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0
> >   16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
> >   16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0
> >   16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900
> >   16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EAGAIN (Resource temporarily unavailable)
> > 
> 
> Willy you say output to fd 6 hangs, but splice() returns EAGAIN here ?
> (because socket buffer is full)

Exactly.

> > So output fd 6 hangs here and will not appear anymore until
> > here where I pressed Ctrl-C to stop the test :
> > 
> 
> I just want to make sure its not a userland error that triggers now much
> faster than with prior kernels.

I understand and that could be possible indeed. Still, this precise code
has been used for a few years now in prod at 10+ Gbps, so while that does
not mean it's exempt from any race condition or bug, we have not observed
this behaviour earlier. In fact, what I've not tested much was the small
ARM-based machine, which is just a convenient development system to try to
optimize network performance. Among the differences I see with usual
deployments is that the NIC doesn't support TSO, while I've been used to
enabling splicing only where TSO was supported, because before your recent
optimizations, it was less performant than recv/send.

> You drain bytes from fd 0xe to pipe buffers, but I dont see you check
> write ability on destination socket prior the splice(pipe -> socket)

I don't, I only rely on EAGAIN to re-enable polling for write (otherwise
it becomes a real mess of epoll_ctl calls which noticeably hurts performance). I
only re-enable polling if FDs can't move anymore.

Before doing a splice(read), if any data are left pending in the pipe, I
first try a splice(write) to flush the pipe, then I perform the
splice(read), then try to flush the pipe again using a splice(write).
Then polling is enabled if we block on EAGAIN.
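
In rough pseudo-C, each round of the loop looks like this (a simplified
sketch, not the actual haproxy code; the names and the single 'pending'
counter are illustrative) :

#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>
#include <stdbool.h>

static long pending;    /* bytes currently sitting in the pipe */

/* returns true if we must wait for POLLOUT on dst before trying again */
static bool forward(int src, int dst, int pipe_r, int pipe_w)
{
        ssize_t ret;

        /* 1. try to flush what is already buffered in the pipe */
        while (pending > 0) {
                ret = splice(pipe_r, NULL, dst, NULL, pending,
                             SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
                if (ret <= 0)
                        break;
                pending -= ret;
        }

        /* 2. pull more data from the source socket into the pipe */
        ret = splice(src, NULL, pipe_w, NULL, 1 << 20,
                     SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
        if (ret > 0)
                pending += ret;

        /* 3. flush again; EAGAIN here means the socket buffer is full,
         *    so re-enable write polling on dst and wait.
         */
        while (pending > 0) {
                ret = splice(pipe_r, NULL, dst, NULL, pending,
                             SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
                if (ret < 0)
                        return errno == EAGAIN;
                if (ret == 0)
                        break;
                pending -= ret;
        }
        return false;
}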

I could fix the issue here by reworking my first patch. I think I was
too conservative, because if we leave do_tcp_sendpages() on OOM
with copied == 0, in my opinion we never push. My first attempt tried
to call tcp_push() only once but it seems like this is a wrong idea
because it doesn't allow new attempts if for example tcp_write_xmit()
cannot send upon first attempt.

After calling tcp_push() unconditionally on OOM, I cannot reproduce
the issue at all, and the traffic reaches a steady 950 Mbps in each
direction.

I'm appending the patch, you'll know better than me if it's correct or
not.

Best regards,
Willy

---

From 39c3f73176118a274ec9dea9c22c83d97a7fbfe0 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 22:43:20 +0200
Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions

Since recent changes on TCP splicing (starting with commits 2f533844
and 35f9c09f), I started seeing massive stalls when forwarding traffic
between two sockets using splice() when pipe buffers were larger than
socket buffers.

Latest changes (net: netdev_alloc_skb() use build_skb()) made the
problem even more apparent.

The reason seems to be that if do_tcp_sendpages() fails on out of memory
condition without being able to send at least one byte, tcp_push() is not
called and the buffers cannot be flushed.

After applying the attached patch, I cannot reproduce the stalls at all
and the data rate is perfectly stable and steady under any condition
which previously caused the problem to be permanent.

The issue seems to have been there since before the kernel migrated to
git, which makes me think that the stalls I occasionally experienced
with tux during stress-tests years ago were probably related to the
same issue.

This issue was first encountered on 3.0.31 and 3.2.17, so please backport
to -stable.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: <stable@vger.kernel.org>
---
 net/ipv4/tcp.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 63ddaee..231fe53 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -917,8 +917,7 @@ new_segment:
 wait_for_sndbuf:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
-		if (copied)
-			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
+		tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
 
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 			goto do_error;
-- 
1.7.2.1.45.g54fbc


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 21:14   ` Willy Tarreau
@ 2012-05-17 21:40     ` Eric Dumazet
  2012-05-17 21:50       ` Eric Dumazet
  2012-05-17 21:54       ` Willy Tarreau
  2012-05-17 21:47     ` Willy Tarreau
  2012-05-17 22:14     ` Eric Dumazet
  2 siblings, 2 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 21:40 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 23:14 +0200, Willy Tarreau wrote:

> I understand and that could be possible indeed. Still, this precise code
> has been used for a few years now in prod at 10+ Gbps, so while that does
> not mean it's exempt from any race condition or bug, we have not observed
> this behaviour earlier. In fact, what I've not tested much was the small
> ARM based machine which is just a convenient development system to try to
> optimize network performance. Among the differences I see with usual
> deployments is that the NIC doesn't support TSO, while I've been used to
> enable splicing only where TSO was supported, because before your recent
> optimizations, it was less performant than recv/send.
> 
> > You drain bytes from fd 0xe to pipe buffers, but I dont see you check
> > write ability on destination socket prior the splice(pipe -> socket)
> 
> I don't, I only rely on EAGAIN to re-enable polling for write (otherwise
> it becomes a real mess of epoll_ctl which sensibly hurts performance). I
> only re-enable polling if FDs can't move anymore.
> 
> Before doing a splice(read), if any data are left pending in the pipe, I
> first try a splice(write) to try to flush the pipe, then I perform the
> splice(read) then try to flush the pipe again using a splice(write).
> Then polling is enabled if we block on EAGAIN.
> 
> I could fix the issue here by reworking my first patch. I think I was
> too much conservative, because if we leave do_tcp_sendpages() on OOM
> with copied == 0, in my opinion we never push. My first attempt tried
> to call tcp_push() only once but it seems like this is a wrong idea
> because it doesn't allow new attempts if for example tcp_write_xmit()
> cannot send upon first attempt.
> 
> After calling tcp_push() inconditionnally on OOM, I cannot reproduce
> the issue at all, and the traffic reaches a steady 950 Mbps in each
> direction.
> 
> I'm appending the patch, you'll know better than me if it's correct or
> not.
> 
> Best regards,
> Willy
> 
> ---
> 
> From 39c3f73176118a274ec9dea9c22c83d97a7fbfe0 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 22:43:20 +0200
> Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions
> 
> Since recent changes on TCP splicing (starting with commits 2f533844
> and 35f9c09f), I started seeing massive stalls when forwarding traffic
> between two sockets using splice() when pipe buffers were larger than
> socket buffers.
> 
> Latest changes (net: netdev_alloc_skb() use build_skb()) made the
> problem even more apparent.
> 
> The reason seems to be that if do_tcp_sendpages() fails on out of memory
> condition without being able to send at least one byte, tcp_push() is not
> called and the buffers cannot be flushed.
> 
> After applying the attached patch, I cannot reproduce the stalls at all
> and the data rate it perfectly stable and steady under any condition
> which previously caused the problem to be permanent.
> 
> The issue seems to have been there since before the kernel migrated to
> git, which makes me think that the stalls I occasionally experienced
> with tux during stress-tests years ago were probably related to the
> same issue.
> 
> This issue was first encountered on 3.0.31 and 3.2.17, so please backport
> to -stable.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> Cc: <stable@vger.kernel.org>
> ---
>  net/ipv4/tcp.c |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 63ddaee..231fe53 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -917,8 +917,7 @@ new_segment:
>  wait_for_sndbuf:
>  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  wait_for_memory:
> -		if (copied)
> -			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
> +		tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
>  
>  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
>  			goto do_error;


I don't understand why we should tcp_push() if we sent 0 bytes in this
splice() call.

The push() should have been done already by the prior splice() call, don't you
think ?

out:
	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
		tcp_push(sk, flags, mss_now, tp->nonagle);


* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 21:14   ` Willy Tarreau
  2012-05-17 21:40     ` Eric Dumazet
@ 2012-05-17 21:47     ` Willy Tarreau
  2012-05-17 22:14     ` Eric Dumazet
  2 siblings, 0 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 21:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, May 17, 2012 at 11:14:14PM +0200, Willy Tarreau wrote:
> On Thu, May 17, 2012 at 10:41:19PM +0200, Eric Dumazet wrote:
> > You drain bytes from fd 0xe to pipe buffers, but I dont see you check
> > write ability on destination socket prior the splice(pipe -> socket)
> 
> I don't, I only rely on EAGAIN to re-enable polling for write (otherwise
> it becomes a real mess of epoll_ctl which sensibly hurts performance). I
> only re-enable polling if FDs can't move anymore.
> 
> Before doing a splice(read), if any data are left pending in the pipe, I
> first try a splice(write) to try to flush the pipe, then I perform the
> splice(read) then try to flush the pipe again using a splice(write).
> Then polling is enabled if we block on EAGAIN.

Just for the sake of documentation, I have rebuilt an up-to-date strace
so that we get the details of the epoll_ctl(). I've restarted with only
the classical epoll (without the events cache) so that it's easier to
trace event status with strace.

Here we clearly see that the output FD (6) was not reported as being
writable for a long period, until haproxy's timeout struck :

15:07:50.130499 gettimeofday({1324652870, 130772}, NULL) = 0
15:07:50.130937 gettimeofday({1324652870, 131071}, NULL) = 0
15:07:50.131195 epoll_wait(3, {}, 6, 1000) = 0
15:07:51.132529 gettimeofday({1324652871, 132654}, NULL) = 0
15:07:51.132777 gettimeofday({1324652871, 132895}, NULL) = 0
15:07:51.133013 epoll_wait(3, {{EPOLLIN, {u32=4, u64=4}}}, 6, 1000) = 1
15:07:51.561594 gettimeofday({1324652871, 561723}, NULL) = 0
15:07:51.561866 accept(4, {sa_family=AF_INET, sin_port=htons(56441), sin_addr=inet_addr("192.168.0.31")}, [16]) = 6
15:07:51.562249 fcntl64(6, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
15:07:51.562582 setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
15:07:51.562844 accept(4, 0x7ea65a64, [128]) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.563142 epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=6, u64=6}}) = 0
15:07:51.563421 gettimeofday({1324652871, 563540}, NULL) = 0
15:07:51.563657 epoll_wait(3, {{EPOLLIN, {u32=6, u64=6}}}, 7, 1000) = 1
15:07:51.563923 gettimeofday({1324652871, 564041}, NULL) = 0
15:07:51.564180 recv(6, "GET /?s=1g HTTP/1.0\r\nUser-Agent:"..., 7000, 0) = 126
15:07:51.564574 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
15:07:51.564817 fcntl64(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
15:07:51.565040 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
15:07:51.565277 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
15:07:51.565524 connect(7, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("192.168.0.31")}, 16) = -1 EINPROGRESS (Operation now in progress)
15:07:51.565955 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
15:07:51.566235 epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=6, u64=6}}) = 0
15:07:51.566494 gettimeofday({1324652871, 566613}, NULL) = 0
15:07:51.566730 epoll_wait(3, {{EPOLLOUT, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.566994 gettimeofday({1324652871, 567114}, NULL) = 0
15:07:51.567246 send(7, "GET /?s=1g HTTP/1.0\r\nUser-Agent:"..., 107, MSG_DONTWAIT|MSG_NOSIGNAL) = 107
15:07:51.567589 epoll_ctl(3, EPOLL_CTL_MOD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
15:07:51.567907 gettimeofday({1324652871, 568031}, NULL) = 0
15:07:51.568148 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.568471 gettimeofday({1324652871, 568593}, NULL) = 0
15:07:51.568715 recv(7, "HTTP/1.1 200\r\nConnection: close\r"..., 7000, 0) = 4380
15:07:51.569057 recv(7, "89.12345678\n.123456789.123456789"..., 2620, 0) = 2620
15:07:51.569425 epoll_ctl(3, EPOLL_CTL_DEL, 7, {0, {u32=7, u64=7}}) = 0
15:07:51.569745 epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLOUT, {u32=6, u64=6}}) = 0
15:07:51.570013 gettimeofday({1324652871, 570133}, NULL) = 0
15:07:51.570251 epoll_wait(3, {{EPOLLOUT, {u32=6, u64=6}}}, 8, 1000) = 1
15:07:51.570514 gettimeofday({1324652871, 570634}, NULL) = 0
15:07:51.570754 send(6, "HTTP/1.1 200\r\nConnection: close\r"..., 7000, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_MORE) = 7000
15:07:51.571244 epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=6, u64=6}}) = 0
15:07:51.571518 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
15:07:51.571825 gettimeofday({1324652871, 571947}, NULL) = 0
15:07:51.572064 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.572355 gettimeofday({1324652871, 572476}, NULL) = 0
15:07:51.572617 pipe([8, 9])            = 0
15:07:51.572863 fcntl64(8, 0x407 /* F_??? */, 0x40000) = 262144
15:07:51.573107 splice(7, NULL, 9, NULL, 1073734988, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 9060
15:07:51.573408 splice(7, NULL, 9, NULL, 1073725928, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 1460
15:07:51.573696 splice(7, NULL, 9, NULL, 1073724468, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.573942 splice(8, NULL, 6, NULL, 10520, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 10520
15:07:51.574328 gettimeofday({1324652871, 574452}, NULL) = 0
15:07:51.574572 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.574988 gettimeofday({1324652871, 575112}, NULL) = 0
15:07:51.575235 splice(7, NULL, 9, NULL, 1073724468, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 16060
15:07:51.575513 splice(8, NULL, 6, NULL, 16060, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 16060
15:07:51.575878 gettimeofday({1324652871, 576003}, NULL) = 0
15:07:51.576142 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.576510 gettimeofday({1324652871, 576631}, NULL) = 0
15:07:51.576753 splice(7, NULL, 9, NULL, 1073708408, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 11680
15:07:51.577027 splice(8, NULL, 6, NULL, 11680, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 11680
15:07:51.577358 gettimeofday({1324652871, 577480}, NULL) = 0
15:07:51.577614 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.577959 gettimeofday({1324652871, 578080}, NULL) = 0
15:07:51.578201 splice(7, NULL, 9, NULL, 1073696728, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 13140
15:07:51.578492 splice(8, NULL, 6, NULL, 13140, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 13140
15:07:51.578799 gettimeofday({1324652871, 578921}, NULL) = 0
15:07:51.579057 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.579413 gettimeofday({1324652871, 579534}, NULL) = 0
15:07:51.579654 splice(7, NULL, 9, NULL, 1073683588, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 14600
15:07:51.579927 splice(8, NULL, 6, NULL, 14600, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 14600
15:07:51.580260 gettimeofday({1324652871, 580389}, NULL) = 0
15:07:51.580526 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.580952 gettimeofday({1324652871, 581073}, NULL) = 0
15:07:51.581195 splice(7, NULL, 9, NULL, 1073668988, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 16060
15:07:51.581491 splice(8, NULL, 6, NULL, 16060, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 16060
15:07:51.581801 gettimeofday({1324652871, 581930}, NULL) = 0
15:07:51.582173 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.582475 gettimeofday({1324652871, 582597}, NULL) = 0
15:07:51.582719 splice(7, NULL, 9, NULL, 1073652928, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 17520
15:07:51.582998 splice(8, NULL, 6, NULL, 17520, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 17520
15:07:51.583345 gettimeofday({1324652871, 583476}, NULL) = 0
15:07:51.583725 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.584001 gettimeofday({1324652871, 584131}, NULL) = 0
15:07:51.584329 splice(7, NULL, 9, NULL, 1073635408, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 29200
15:07:51.584599 splice(8, NULL, 6, NULL, 29200, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 24820
15:07:51.584968 epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLOUT, {u32=6, u64=6}}) = 0

Here FD6 was enabled on output since some data remained in the pipe.

15:07:51.585244 gettimeofday({1324652871, 585374}, NULL) = 0
15:07:51.585583 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.585852 gettimeofday({1324652871, 585972}, NULL) = 0
15:07:51.586093 splice(7, NULL, 9, NULL, 1073606208, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 45260
15:07:51.586384 splice(8, NULL, 6, NULL, 49640, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.586668 gettimeofday({1324652871, 586789}, NULL) = 0
15:07:51.586908 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.587246 gettimeofday({1324652871, 587367}, NULL) = 0
15:07:51.587488 splice(7, NULL, 9, NULL, 1073560948, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 20440
15:07:51.587764 splice(8, NULL, 6, NULL, 70080, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.588045 gettimeofday({1324652871, 588166}, NULL) = 0
15:07:51.588284 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.588641 gettimeofday({1324652871, 588762}, NULL) = 0
15:07:51.588882 splice(7, NULL, 9, NULL, 1073540508, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 23360
15:07:51.589157 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.589438 gettimeofday({1324652871, 589558}, NULL) = 0
15:07:51.589677 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.590041 gettimeofday({1324652871, 590163}, NULL) = 0
15:07:51.590284 splice(7, NULL, 9, NULL, 1073517148, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.590533 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.590781 epoll_ctl(3, EPOLL_CTL_DEL, 7, {0, {u32=7, u64=7}}) = 0
15:07:51.591054 gettimeofday({1324652871, 591173}, NULL) = 0
15:07:51.591290 epoll_wait(3, {{EPOLLOUT, {u32=6, u64=6}}}, 8, 1000) = 1

write reported on FD6

15:07:51.623172 gettimeofday({1324652871, 623304}, NULL) = 0
15:07:51.623430 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 74460

74kB sent, 18980 bytes remaining, FD status not changed.

15:07:51.623739 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
15:07:51.624009 gettimeofday({1324652871, 624128}, NULL) = 0
15:07:51.624246 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.624510 gettimeofday({1324652871, 624629}, NULL) = 0
15:07:51.624749 splice(7, NULL, 9, NULL, 1073517148, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 59860
15:07:51.625067 splice(8, NULL, 6, NULL, 78840, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.625357 gettimeofday({1324652871, 625479}, NULL) = 0
15:07:51.625597 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.625970 gettimeofday({1324652871, 626092}, NULL) = 0
15:07:51.626213 splice(7, NULL, 9, NULL, 1073457288, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 14600
15:07:51.626490 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.626773 gettimeofday({1324652871, 626894}, NULL) = 0
15:07:51.627013 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 1000) = 1
15:07:51.627387 gettimeofday({1324652871, 627509}, NULL) = 0
15:07:51.627630 splice(7, NULL, 9, NULL, 1073442688, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.627883 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:51.628131 epoll_ctl(3, EPOLL_CTL_DEL, 7, {0, {u32=7, u64=7}}) = 0

And here we're blocked on both FDs. We disable polling for reads on the input
FD since the pipe is full. We still have not disabled events on FD6, which is
no longer reported.

15:07:51.628389 gettimeofday({1324652871, 628508}, NULL) = 0
15:07:51.628625 epoll_wait(3, {}, 8, 1000) = 0
15:07:52.629938 gettimeofday({1324652872, 630103}, NULL) = 0
15:07:52.630232 gettimeofday({1324652872, 630364}, NULL) = 0
15:07:52.630481 epoll_wait(3, {}, 8, 1000) = 0
15:07:53.631760 gettimeofday({1324652873, 631895}, NULL) = 0
15:07:53.632018 gettimeofday({1324652873, 632146}, NULL) = 0
15:07:53.632289 epoll_wait(3, {}, 8, 1000) = 0
15:07:54.633641 gettimeofday({1324652874, 633770}, NULL) = 0
15:07:54.633896 gettimeofday({1324652874, 634017}, NULL) = 0
15:07:54.634135 epoll_wait(3, {}, 8, 1000) = 0
15:07:55.635457 gettimeofday({1324652875, 635582}, NULL) = 0
15:07:55.635705 gettimeofday({1324652875, 635824}, NULL) = 0
15:07:55.635943 epoll_wait(3, {}, 8, 930) = 0
15:07:56.567197 gettimeofday({1324652876, 567322}, NULL) = 0
15:07:56.567448 gettimeofday({1324652876, 567568}, NULL) = 0
15:07:56.567685 epoll_wait(3, {}, 8, 1000) = 0
15:07:57.568855 gettimeofday({1324652877, 569012}, NULL) = 0
15:07:57.569145 gettimeofday({1324652877, 569278}, NULL) = 0
15:07:57.569396 epoll_wait(3, {}, 8, 1000) = 0
15:07:58.570760 gettimeofday({1324652878, 570899}, NULL) = 0
15:07:58.571023 gettimeofday({1324652878, 571154}, NULL) = 0
15:07:58.571270 epoll_wait(3, {}, 8, 999) = 0
15:07:59.571418 gettimeofday({1324652879, 571546}, NULL) = 0

Timeout strikes, we're about to close. It happens that the FD is re-enabled
at this point. That doesn't really matter anyway.

15:07:59.571688 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
15:07:59.571964 gettimeofday({1324652879, 572084}, NULL) = 0
15:07:59.572227 epoll_wait(3, {{EPOLLIN, {u32=7, u64=7}}}, 8, 57) = 1
15:07:59.572499 gettimeofday({1324652879, 572619}, NULL) = 0
15:07:59.572742 splice(7, NULL, 9, NULL, 1073442688, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
15:07:59.573010 splice(8, NULL, 6, NULL, 93440, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)

The FD 6 was still blocked.

15:07:59.573269 epoll_ctl(3, EPOLL_CTL_DEL, 7, {0, {u32=7, u64=7}}) = 0
15:07:59.573529 gettimeofday({1324652879, 573649}, NULL) = 0
15:07:59.573766 epoll_wait(3, {}, 8, 56) = 0
15:07:59.630081 gettimeofday({1324652879, 630202}, NULL) = 0
15:07:59.630327 setsockopt(6, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
15:07:59.630594 close(6)                = 0
15:07:59.630909 close(7)                = 0

FDs are closed.

15:07:59.631424 close(9)                = 0
15:07:59.631746 close(8)                = 0

And pipe is killed because it contains remaining data.

Regards,
Willy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 21:40     ` Eric Dumazet
@ 2012-05-17 21:50       ` Eric Dumazet
  2012-05-17 21:57         ` Willy Tarreau
  2012-05-17 22:01         ` Eric Dumazet
  2012-05-17 21:54       ` Willy Tarreau
  1 sibling, 2 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 21:50 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote:

> I dont understand why we should tcp_push() if we sent 0 bytes in this
> splice() call.
> 
> The push() should have be done already by prior splice() call, dont you
> think ?
> 
> out:
> 	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> 		tcp_push(sk, flags, mss_now, tp->nonagle);
> 

I think I now understand

One splice() syscall actually calls do_tcp_sendpages() several times
(N).

Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set

And last one with MSG_SENDPAGE_NOTLAST unset

So maybe we should replace this test by :

	if (!(flags & MSG_SENDPAGE_NOTLAST))
		tcp_push(...);

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 21:40     ` Eric Dumazet
  2012-05-17 21:50       ` Eric Dumazet
@ 2012-05-17 21:54       ` Willy Tarreau
  1 sibling, 0 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 21:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, May 17, 2012 at 11:40:22PM +0200, Eric Dumazet wrote:
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 63ddaee..231fe53 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -917,8 +917,7 @@ new_segment:
> >  wait_for_sndbuf:
> >  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> >  wait_for_memory:
> > -		if (copied)
> > -			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
> > +		tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
> >  
> >  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
> >  			goto do_error;
> 
> 
> I dont understand why we should tcp_push() if we sent 0 bytes in this
> splice() call.
> 
> The push() should have be done already by prior splice() call, dont you
> think ?
> 
> out:
> 	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> 		tcp_push(sk, flags, mss_now, tp->nonagle);
> 

I do think so, but my guess is that something fails below us at some point.
I tracked the tcp_push() chain down to tcp_write_xmit() which I *think* might
fail. I'm not certain about the conditions which can make this happen though.
However, what I'm almost sure of is that if we fail one tcp_write_xmit() from a
call to tcp_push(), next time we come here in do_tcp_sendpages(), the buffers
won't have moved and this second splice() will immediately go to out with
copied == 0.

Maybe I just worked around the consequence of a deeper issue and something
else needs to enable tcp_push() when we fail, but it's not very clear to me.
I felt like the call to tcp_check_probe_timer() in __tcp_push_pending_frames()
is there for this purpose, but I'm not sure.

Willy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 21:50       ` Eric Dumazet
@ 2012-05-17 21:57         ` Willy Tarreau
  2012-05-17 22:01         ` Eric Dumazet
  1 sibling, 0 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 21:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, May 17, 2012 at 11:50:10PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote:
> 
> > I dont understand why we should tcp_push() if we sent 0 bytes in this
> > splice() call.
> > 
> > The push() should have be done already by prior splice() call, dont you
> > think ?
> > 
> > out:
> > 	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> > 		tcp_push(sk, flags, mss_now, tp->nonagle);
> > 
> 
> I think I now understand
> 
> One splice() syscall actually calls do_tcp_sendpages() several times
> (N).
> 
> Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set
> 
> And last one with MSG_SENDPAGE_NOTLAST unset
> 
> So maybe we should replace this test by :
> 
> 	if (!(flags & MSG_SENDPAGE_NOTLAST))
> 		tcp_push(...);

Just tested, and it unfortunately does not fix the issue :-(

I'll see if I can log errors from lower layers.
Willy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 21:50       ` Eric Dumazet
  2012-05-17 21:57         ` Willy Tarreau
@ 2012-05-17 22:01         ` Eric Dumazet
  2012-05-17 22:10           ` Eric Dumazet
  2012-05-17 22:16           ` Willy Tarreau
  1 sibling, 2 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 22:01 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 23:50 +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote:
> 
> > I dont understand why we should tcp_push() if we sent 0 bytes in this
> > splice() call.
> > 
> > The push() should have be done already by prior splice() call, dont you
> > think ?
> > 
> > out:
> > 	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> > 		tcp_push(sk, flags, mss_now, tp->nonagle);
> > 
> 
> I think I now understand
> 
> One splice() syscall actually calls do_tcp_sendpages() several times
> (N).
> 
> Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set
> 
> And last one with MSG_SENDPAGE_NOTLAST unset
> 
> So maybe we should replace this test by :
> 
> 	if (!(flags & MSG_SENDPAGE_NOTLAST))
> 		tcp_push(...);
> 
> 

Its more tricky than that.

If we return 0 from do_tcp_sendpages(), __splice_from_pipe() wont call
us again, but we need to flush().  (This bug is indeed very old)

So maybe use :

if ((copied && !(flags & MSG_SENDPAGE_NOTLAST)) ||
    !copied)
	tcp_push(...);

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:01         ` Eric Dumazet
@ 2012-05-17 22:10           ` Eric Dumazet
  2012-05-17 22:16           ` Willy Tarreau
  1 sibling, 0 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 22:10 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Fri, 2012-05-18 at 00:02 +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 23:50 +0200, Eric Dumazet wrote:
> > On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote:
> > 
> > > I dont understand why we should tcp_push() if we sent 0 bytes in this
> > > splice() call.
> > > 
> > > The push() should have be done already by prior splice() call, dont you
> > > think ?
> > > 
> > > out:
> > > 	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> > > 		tcp_push(sk, flags, mss_now, tp->nonagle);
> > > 
> > 
> > I think I now understand
> > 
> > One splice() syscall actually calls do_tcp_sendpages() several times
> > (N).
> > 
> > Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set
> > 
> > And last one with MSG_SENDPAGE_NOTLAST unset
> > 
> > So maybe we should replace this test by :
> > 
> > 	if (!(flags & MSG_SENDPAGE_NOTLAST))
> > 		tcp_push(...);
> > 
> > 
> 
> Its more tricky than that.
> 
> If we return 0 from do_tcp_sendpages(), __splice_from_pipe() wont call
> us again, but we need to flush().  (This bug is indeed very old)
> 
> So maybe use :
> 
> if ((copied && !(flags & MSG_SENDPAGE_NOTLAST)) ||
>     !copied)
> 	tcp_push(...);


But then your patch is equivalent, so I'm going to Ack it ;)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 21:14   ` Willy Tarreau
  2012-05-17 21:40     ` Eric Dumazet
  2012-05-17 21:47     ` Willy Tarreau
@ 2012-05-17 22:14     ` Eric Dumazet
  2012-05-17 22:29       ` Willy Tarreau
  2 siblings, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 22:14 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Thu, 2012-05-17 at 23:14 +0200, Willy Tarreau wrote:

> ---
> 
> From 39c3f73176118a274ec9dea9c22c83d97a7fbfe0 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 22:43:20 +0200
> Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions
> 
> Since recent changes on TCP splicing (starting with commits 2f533844
> and 35f9c09f), I started seeing massive stalls when forwarding traffic
> between two sockets using splice() when pipe buffers were larger than
> socket buffers.
> 
> Latest changes (net: netdev_alloc_skb() use build_skb()) made the
> problem even more apparent.
> 
> The reason seems to be that if do_tcp_sendpages() fails on out of memory
> condition without being able to send at least one byte, tcp_push() is not
> called and the buffers cannot be flushed.
> 
> After applying the attached patch, I cannot reproduce the stalls at all
> and the data rate it perfectly stable and steady under any condition
> which previously caused the problem to be permanent.
> 
> The issue seems to have been there since before the kernel migrated to
> git, which makes me think that the stalls I occasionally experienced
> with tux during stress-tests years ago were probably related to the
> same issue.
> 
> This issue was first encountered on 3.0.31 and 3.2.17, so please backport
> to -stable.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> Cc: <stable@vger.kernel.org>
> ---
>  net/ipv4/tcp.c |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 63ddaee..231fe53 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -917,8 +917,7 @@ new_segment:
>  wait_for_sndbuf:
>  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  wait_for_memory:
> -		if (copied)
> -			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
> +		tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
>  
>  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
>  			goto do_error;



Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:01         ` Eric Dumazet
  2012-05-17 22:10           ` Eric Dumazet
@ 2012-05-17 22:16           ` Willy Tarreau
  2012-05-17 22:22             ` Eric Dumazet
  1 sibling, 1 reply; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 22:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Fri, May 18, 2012 at 12:01:59AM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 23:50 +0200, Eric Dumazet wrote:
> > On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote:
> > 
> > > I dont understand why we should tcp_push() if we sent 0 bytes in this
> > > splice() call.
> > > 
> > > The push() should have be done already by prior splice() call, dont you
> > > think ?
> > > 
> > > out:
> > > 	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> > > 		tcp_push(sk, flags, mss_now, tp->nonagle);
> > > 
> > 
> > I think I now understand
> > 
> > One splice() syscall actually calls do_tcp_sendpages() several times
> > (N).
> > 
> > Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set
> > 
> > And last one with MSG_SENDPAGE_NOTLAST unset
> > 
> > So maybe we should replace this test by :
> > 
> > 	if (!(flags & MSG_SENDPAGE_NOTLAST))
> > 		tcp_push(...);
> > 
> > 
> 
> Its more tricky than that.
> 
> If we return 0 from do_tcp_sendpages(), __splice_from_pipe() wont call
> us again, but we need to flush().  (This bug is indeed very old)
> 
> So maybe use :
> 
> if ((copied && !(flags & MSG_SENDPAGE_NOTLAST)) ||
>     !copied)
> 	tcp_push(...);

That's what I wanted to do at first but was still worried about the
situations where we fail upon first call due to lack of memory. And
indeed that did not fix the issue either :-(

At least now I've checked that we fail here in do_tcp_sendpages() :

	if (!sk_stream_memory_free(sk)) {
		printk(KERN_DEBUG "%s:%d copied=%d\n", __FUNCTION__, __LINE__, copied);
		goto wait_for_sndbuf;
	}

Well, finally I just put the same test as yours above for the call to
tcp_push() upon out of memory, and it fixed it too. I have simplified
the expression, since ((A && !B) || !A) == !(A && B).
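
A quick userspace check of that equivalence (purely illustrative; the
variable names below are mine and simply mirror 'copied' and
MSG_SENDPAGE_NOTLAST):

#include <stdio.h>

int main(void)
{
	int copied, notlast;

	/* copied:  did this do_tcp_sendpages() call queue any bytes?
	 * notlast: was MSG_SENDPAGE_NOTLAST set, i.e. not the last
	 *          sub-call of the current splice() syscall?
	 */
	for (copied = 0; copied <= 1; copied++) {
		for (notlast = 0; notlast <= 1; notlast++) {
			int orig_test  = copied && !notlast;              /* current test at out: */
			int proposed   = (copied && !notlast) || !copied; /* suggested earlier */
			int simplified = !(copied && notlast);            /* rewritten form */

			printf("copied=%d notlast=%d  orig=%d proposed=%d simplified=%d\n",
			       copied, notlast, orig_test, proposed, simplified);
		}
	}
	return 0;
}

The proposed and simplified columns match for every combination, and they
differ from the original test exactly in the copied == 0 case.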

With the updated patch here it works for me. I don't know if it's
better.

Willy

-----

>From 1f26263bd4f8827bbca4cef77a99fac128cf8ccc Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 22:43:20 +0200
Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions

Since recent changes on TCP splicing (starting with commits 2f533844
and 35f9c09f), I started seeing massive stalls when forwarding traffic
between two sockets using splice() when pipe buffers were larger than
socket buffers.

Latest changes (net: netdev_alloc_skb() use build_skb()) made the
problem even more apparent.

The reason seems to be that if do_tcp_sendpages() fails on out of memory
condition without being able to send at least one byte, tcp_push() is not
called and the buffers cannot be flushed.

After applying the attached patch, I cannot reproduce the stalls at all
and the data rate is perfectly stable and steady under any condition
which previously caused the problem to be permanent.

The issue seems to have been there since before the kernel migrated to
git, which makes me think that the stalls I occasionally experienced
with tux during stress-tests years ago were probably related to the
same issue.

This issue was first encountered on 3.0.31 and 3.2.17, so please backport
to -stable.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: <stable@vger.kernel.org>
---
 net/ipv4/tcp.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 63ddaee..cfe47c1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -917,7 +917,7 @@ new_segment:
 wait_for_sndbuf:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
-		if (copied)
+		if (!(copied && (flags & MSG_SENDPAGE_NOTLAST)))
 			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
 
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
@@ -927,7 +927,7 @@ wait_for_memory:
 	}
 
 out:
-	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
+	if (!(copied && (flags & MSG_SENDPAGE_NOTLAST)))
 		tcp_push(sk, flags, mss_now, tp->nonagle);
 	return copied;
 
-- 
1.7.2.1.45.g54fbc

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:16           ` Willy Tarreau
@ 2012-05-17 22:22             ` Eric Dumazet
  2012-05-17 22:24               ` Willy Tarreau
  2012-05-17 22:27               ` Joe Perches
  0 siblings, 2 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-05-17 22:22 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Fri, 2012-05-18 at 00:16 +0200, Willy Tarreau wrote:
> On Fri, May 18, 2012 at 12:01:59AM +0200, Eric Dumazet wrote:
> > On Thu, 2012-05-17 at 23:50 +0200, Eric Dumazet wrote:
> > > On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote:
> > > 
> > > > I dont understand why we should tcp_push() if we sent 0 bytes in this
> > > > splice() call.
> > > > 
> > > > The push() should have be done already by prior splice() call, dont you
> > > > think ?
> > > > 
> > > > out:
> > > > 	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> > > > 		tcp_push(sk, flags, mss_now, tp->nonagle);
> > > > 
> > > 
> > > I think I now understand
> > > 
> > > One splice() syscall actually calls do_tcp_sendpages() several times
> > > (N).
> > > 
> > > Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set
> > > 
> > > And last one with MSG_SENDPAGE_NOTLAST unset
> > > 
> > > So maybe we should replace this test by :
> > > 
> > > 	if (!(flags & MSG_SENDPAGE_NOTLAST))
> > > 		tcp_push(...);
> > > 
> > > 
> > 
> > Its more tricky than that.
> > 
> > If we return 0 from do_tcp_sendpages(), __splice_from_pipe() wont call
> > us again, but we need to flush().  (This bug is indeed very old)
> > 
> > So maybe use :
> > 
> > if ((copied && !(flags & MSG_SENDPAGE_NOTLAST)) ||
> >     !copied)
> > 	tcp_push(...);
> 
> That's what I wanted to do at first but was still worried about the
> situations where we fail upon first call due to lack of memory. And
> indeed that did not fix the issue either :-(
> 
> At least now I've checked that we fail here in do_tcp_sendpages() :
> 
> 	if (!sk_stream_memory_free(sk)) {
> 		printk(KERN_DEBUG "%s:%d copied=%d\n", __FUNCTION__, __LINE__, copied);
> 		goto wait_for_sndbuf;
> 	}
> 
> Well, finally I just put the same test as yours above for the call to
> tcp_push() upon out of memory and it fixed it too. I have simplified
> the expression, since ((A && !B) || !A) == !(A & B).
> 
> With the updated patch here it works for me. I don't know if it's
> better.
> 
> Willy
> 
> -----
> 
> From 1f26263bd4f8827bbca4cef77a99fac128cf8ccc Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 22:43:20 +0200
> Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions
> 
> Since recent changes on TCP splicing (starting with commits 2f533844
> and 35f9c09f), I started seeing massive stalls when forwarding traffic
> between two sockets using splice() when pipe buffers were larger than
> socket buffers.
> 
> Latest changes (net: netdev_alloc_skb() use build_skb()) made the
> problem even more apparent.
> 
> The reason seems to be that if do_tcp_sendpages() fails on out of memory
> condition without being able to send at least one byte, tcp_push() is not
> called and the buffers cannot be flushed.
> 
> After applying the attached patch, I cannot reproduce the stalls at all
> and the data rate it perfectly stable and steady under any condition
> which previously caused the problem to be permanent.
> 
> The issue seems to have been there since before the kernel migrated to
> git, which makes me think that the stalls I occasionally experienced
> with tux during stress-tests years ago were probably related to the
> same issue.
> 
> This issue was first encountered on 3.0.31 and 3.2.17, so please backport
> to -stable.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> Cc: <stable@vger.kernel.org>
> ---
>  net/ipv4/tcp.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 63ddaee..cfe47c1 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -917,7 +917,7 @@ new_segment:
>  wait_for_sndbuf:
>  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  wait_for_memory:
> -		if (copied)
> +		if (!(copied && (flags & MSG_SENDPAGE_NOTLAST)))
>  			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
>  
>  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
> @@ -927,7 +927,7 @@ wait_for_memory:
>  	}
>  
>  out:
> -	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> +	if (!(copied && (flags & MSG_SENDPAGE_NOTLAST)))
>  		tcp_push(sk, flags, mss_now, tp->nonagle);
>  	return copied;
>  


Hmm... I believe I prefer your prior patch ( the one I Acked)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:22             ` Eric Dumazet
@ 2012-05-17 22:24               ` Willy Tarreau
  2012-05-17 22:25                 ` David Miller
  2012-05-17 22:27               ` Joe Perches
  1 sibling, 1 reply; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 22:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Fri, May 18, 2012 at 12:22:31AM +0200, Eric Dumazet wrote:
> On Fri, 2012-05-18 at 00:16 +0200, Willy Tarreau wrote:
> > On Fri, May 18, 2012 at 12:01:59AM +0200, Eric Dumazet wrote:
> > > On Thu, 2012-05-17 at 23:50 +0200, Eric Dumazet wrote:
> > > > On Thu, 2012-05-17 at 23:40 +0200, Eric Dumazet wrote:
> > > > 
> > > > > I dont understand why we should tcp_push() if we sent 0 bytes in this
> > > > > splice() call.
> > > > > 
> > > > > The push() should have be done already by prior splice() call, dont you
> > > > > think ?
> > > > > 
> > > > > out:
> > > > > 	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> > > > > 		tcp_push(sk, flags, mss_now, tp->nonagle);
> > > > > 
> > > > 
> > > > I think I now understand
> > > > 
> > > > One splice() syscall actually calls do_tcp_sendpages() several times
> > > > (N).
> > > > 
> > > > Problem is N-1 calls are done with MSG_SENDPAGE_NOTLAST set
> > > > 
> > > > And last one with MSG_SENDPAGE_NOTLAST unset
> > > > 
> > > > So maybe we should replace this test by :
> > > > 
> > > > 	if (!(flags & MSG_SENDPAGE_NOTLAST))
> > > > 		tcp_push(...);
> > > > 
> > > > 
> > > 
> > > Its more tricky than that.
> > > 
> > > If we return 0 from do_tcp_sendpages(), __splice_from_pipe() wont call
> > > us again, but we need to flush().  (This bug is indeed very old)
> > > 
> > > So maybe use :
> > > 
> > > if ((copied && !(flags & MSG_SENDPAGE_NOTLAST)) ||
> > >     !copied)
> > > 	tcp_push(...);
> > 
> > That's what I wanted to do at first but was still worried about the
> > situations where we fail upon first call due to lack of memory. And
> > indeed that did not fix the issue either :-(
> > 
> > At least now I've checked that we fail here in do_tcp_sendpages() :
> > 
> > 	if (!sk_stream_memory_free(sk)) {
> > 		printk(KERN_DEBUG "%s:%d copied=%d\n", __FUNCTION__, __LINE__, copied);
> > 		goto wait_for_sndbuf;
> > 	}
> > 
> > Well, finally I just put the same test as yours above for the call to
> > tcp_push() upon out of memory and it fixed it too. I have simplified
> > the expression, since ((A && !B) || !A) == !(A & B).
> > 
> > With the updated patch here it works for me. I don't know if it's
> > better.
> > 
> > Willy
> > 
> > -----
> > 
> > From 1f26263bd4f8827bbca4cef77a99fac128cf8ccc Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Thu, 17 May 2012 22:43:20 +0200
> > Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions
> > 
> > Since recent changes on TCP splicing (starting with commits 2f533844
> > and 35f9c09f), I started seeing massive stalls when forwarding traffic
> > between two sockets using splice() when pipe buffers were larger than
> > socket buffers.
> > 
> > Latest changes (net: netdev_alloc_skb() use build_skb()) made the
> > problem even more apparent.
> > 
> > The reason seems to be that if do_tcp_sendpages() fails on out of memory
> > condition without being able to send at least one byte, tcp_push() is not
> > called and the buffers cannot be flushed.
> > 
> > After applying the attached patch, I cannot reproduce the stalls at all
> > and the data rate it perfectly stable and steady under any condition
> > which previously caused the problem to be permanent.
> > 
> > The issue seems to have been there since before the kernel migrated to
> > git, which makes me think that the stalls I occasionally experienced
> > with tux during stress-tests years ago were probably related to the
> > same issue.
> > 
> > This issue was first encountered on 3.0.31 and 3.2.17, so please backport
> > to -stable.
> > 
> > Signed-off-by: Willy Tarreau <w@1wt.eu>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  net/ipv4/tcp.c |    4 ++--
> >  1 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 63ddaee..cfe47c1 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -917,7 +917,7 @@ new_segment:
> >  wait_for_sndbuf:
> >  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> >  wait_for_memory:
> > -		if (copied)
> > +		if (!(copied && (flags & MSG_SENDPAGE_NOTLAST)))
> >  			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
> >  
> >  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
> > @@ -927,7 +927,7 @@ wait_for_memory:
> >  	}
> >  
> >  out:
> > -	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> > +	if (!(copied && (flags & MSG_SENDPAGE_NOTLAST)))
> >  		tcp_push(sk, flags, mss_now, tp->nonagle);
> >  	return copied;
> >  
> 
> 
> Hmm... I believe I prefer your prior patch ( the one I Acked)

I too, less changes are better :-)

Thanks,
Willy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:24               ` Willy Tarreau
@ 2012-05-17 22:25                 ` David Miller
  2012-05-17 22:30                   ` Willy Tarreau
  0 siblings, 1 reply; 61+ messages in thread
From: David Miller @ 2012-05-17 22:25 UTC (permalink / raw)
  To: w; +Cc: eric.dumazet, netdev

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 18 May 2012 00:24:31 +0200

> On Fri, May 18, 2012 at 12:22:31AM +0200, Eric Dumazet wrote:
>> Hmm... I believe I prefer your prior patch ( the one I Acked)
> 
> I too, less changes are better :-)

And I think that's the one I'm going to apply after I audit this
code a little bit myself.

Thanks!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:22             ` Eric Dumazet
  2012-05-17 22:24               ` Willy Tarreau
@ 2012-05-17 22:27               ` Joe Perches
  1 sibling, 0 replies; 61+ messages in thread
From: Joe Perches @ 2012-05-17 22:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, netdev

On Fri, 2012-05-18 at 00:22 +0200, Eric Dumazet wrote:
> On Fri, 2012-05-18 at 00:16 +0200, Willy Tarreau wrote:
> > I have simplified the expression,
> > since ((A && !B) || !A) == !(A & B).

Perhaps it's clearer for a reader the first way and
any decent compiler (even gcc) could probably do that
optimization.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:14     ` Eric Dumazet
@ 2012-05-17 22:29       ` Willy Tarreau
  0 siblings, 0 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 22:29 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, netdev

David,

After several tests, we finally agreed on this one. I can't make the
transfers fail anymore. And they're damn fast thanks to Eric's recent
work!

Best regards,
Willy

---

>From 39c3f73176118a274ec9dea9c22c83d97a7fbfe0 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 22:43:20 +0200
Subject: [PATCH] tcp: do_tcp_sendpages() must try to push data out on oom conditions

Since recent changes on TCP splicing (starting with commits 2f533844
and 35f9c09f), I started seeing massive stalls when forwarding traffic
between two sockets using splice() when pipe buffers were larger than
socket buffers.

Latest changes (net: netdev_alloc_skb() use build_skb()) made the
problem even more apparent.

The reason seems to be that if do_tcp_sendpages() fails on out of memory
condition without being able to send at least one byte, tcp_push() is not
called and the buffers cannot be flushed.

After applying the attached patch, I cannot reproduce the stalls at all
and the data rate is perfectly stable and steady under any condition
which previously caused the problem to be permanent.

The issue seems to have been there since before the kernel migrated to
git, which makes me think that the stalls I occasionally experienced
with tux during stress-tests years ago were probably related to the
same issue.

This issue was first encountered on 3.0.31 and 3.2.17, so please backport
to -stable.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: <stable@vger.kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 63ddaee..231fe53 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -917,8 +917,7 @@ new_segment:
 wait_for_sndbuf:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
-		if (copied)
-			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
+		tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
 
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 			goto do_error;

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:25                 ` David Miller
@ 2012-05-17 22:30                   ` Willy Tarreau
  2012-05-17 22:35                     ` David Miller
  0 siblings, 1 reply; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 22:30 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Thu, May 17, 2012 at 06:25:45PM -0400, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Fri, 18 May 2012 00:24:31 +0200
> 
> > On Fri, May 18, 2012 at 12:22:31AM +0200, Eric Dumazet wrote:
> >> Hmm... I believe I prefer your prior patch ( the one I Acked)
> > 
> > I too, less changes are better :-)
> 
> And I think that's the one I'm going to apply after I audit this
> code a little bit myself.
> 
> Thanks!

Sorry for the double mail, I didn't know if you were following the thread
when I forwarded it to you.

Cheers,
Willy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:30                   ` Willy Tarreau
@ 2012-05-17 22:35                     ` David Miller
  2012-05-17 22:49                       ` Willy Tarreau
  0 siblings, 1 reply; 61+ messages in thread
From: David Miller @ 2012-05-17 22:35 UTC (permalink / raw)
  To: w; +Cc: eric.dumazet, netdev

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 18 May 2012 00:30:28 +0200

> On Thu, May 17, 2012 at 06:25:45PM -0400, David Miller wrote:
>> From: Willy Tarreau <w@1wt.eu>
>> Date: Fri, 18 May 2012 00:24:31 +0200
>> 
>> > On Fri, May 18, 2012 at 12:22:31AM +0200, Eric Dumazet wrote:
>> >> Hmm... I believe I prefer your prior patch ( the one I Acked)
>> > 
>> > I too, less changes are better :-)
>> 
>> And I think that's the one I'm going to apply after I audit this
>> code a little bit myself.
>> 
>> Thanks!
> 
> Sorry for the double mail, I didn't know if you were following the thread
> when I forwarded it to you.

No problem.

BTW, this issue goes back all the way to the original implementation
of do_tcp_sendpages().

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
  2012-05-17 22:35                     ` David Miller
@ 2012-05-17 22:49                       ` Willy Tarreau
  0 siblings, 0 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-05-17 22:49 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Thu, May 17, 2012 at 06:35:20PM -0400, David Miller wrote:
> BTW, this issue goes back all the way to the original implementation
> of do_tcp_sendpages().

I'm not surprised, I remember tux on 2.4 sometimes leaving frozen
connections behind it when pushed to extreme speeds, back in the days when
it was audacious to pull 2 Gbps out of a Pentium 3.

I like it when we find very old issues while improving the code; it generally
means we're stressing it a bit more and making more efficient use of it.

Cheers,
Willy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 19:53               ` David Miller
@ 2012-05-18  4:41                 ` Eric Dumazet
  0 siblings, 0 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-05-18  4:41 UTC (permalink / raw)
  To: David Miller; +Cc: w, netdev

On Thu, 2012-05-17 at 15:53 -0400, David Miller wrote:

> Applied, we can sort out any fallout very easily before 3.5 is released.
> 
> Awesome work Eric.

Thanks David

I'll send a followup patch to fix issues :

- Must be IRQ safe (non NAPI drivers can use it)
- Must not leak the frag if build_skb() fails to allocate the sk_buff

- Factorize code so that dev_alloc_skb()/__dev_alloc_skb() are a wrapper
around __netdev_alloc_skb()

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 17:34             ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
  2012-05-17 17:45               ` Willy Tarreau
  2012-05-17 19:53               ` David Miller
@ 2012-06-04 12:37               ` Michael S. Tsirkin
  2012-06-04 13:06                 ` Eric Dumazet
  2 siblings, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 12:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Please note I havent tested yet this patch, lacking hardware for this.
> 
> (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
> ixgbe uses fragments...)

virtio-net uses netdev_alloc_skb but maybe it should call
build_skb instead?

Also, it's not uncommon for drivers to copy short packets out to be able
to reuse pages.  virtio does this but I am guessing the logic is not
really virtio specific.

We could do
	if (len < GOOD_COPY_LEN)
		netdev_alloc_skb
		memmove
	else
		build_skb

but maybe it makes sense to put this logic in build_skb?


-- 
MST

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-05-17 17:45               ` Willy Tarreau
@ 2012-06-04 12:39                 ` Michael S. Tsirkin
  2012-06-04 12:44                   ` Willy Tarreau
  0 siblings, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 12:39 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Eric Dumazet, David Miller, netdev

On Thu, May 17, 2012 at 07:45:51PM +0200, Willy Tarreau wrote:
> Impressed !
> 
> For the first time I could proxy HTTP traffic at gigabit speed on this
> little box powered by USB ! I've long believed that proper splicing
> would make this possible and now I'm seeing it is. Congrats Eric !

which userspace do you use?
anything I can try?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 12:39                 ` Michael S. Tsirkin
@ 2012-06-04 12:44                   ` Willy Tarreau
  0 siblings, 0 replies; 61+ messages in thread
From: Willy Tarreau @ 2012-06-04 12:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Eric Dumazet, David Miller, netdev

On Mon, Jun 04, 2012 at 03:39:12PM +0300, Michael S. Tsirkin wrote:
> On Thu, May 17, 2012 at 07:45:51PM +0200, Willy Tarreau wrote:
> > Impressed !
> > 
> > For the first time I could proxy HTTP traffic at gigabit speed on this
> > little box powered by USB ! I've long believed that proper splicing
> > would make this possible and now I'm seeing it is. Congrats Eric !
> 
> which userspace do you use?

It's haproxy-1.5-dev with splicing enabled.

> anything I can try?

Yes, feel free to download -dev11, build it for kernels >= 2.6.28 and
make a small config to relay TCP/HTTP to another host. Of course you
need a gigabit-capable client and server.

Willy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 12:37               ` Michael S. Tsirkin
@ 2012-06-04 13:06                 ` Eric Dumazet
  2012-06-04 13:41                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2012-06-04 13:06 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Mon, 2012-06-04 at 15:37 +0300, Michael S. Tsirkin wrote:
> On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > Please note I havent tested yet this patch, lacking hardware for this.
> > 
> > (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
> > ixgbe uses fragments...)
> 
> virtio-net uses netdev_alloc_skb but maybe it should call
> build_skb instead?
> 
> Also, it's not uncommon for drivers to copy short packets out to be able
> to reuse pages.  virtio does this but I am guessing the logic is not
> really virtio specific.
> 
> We could do
> 	if (len < GOOD_COPY_LEN)
> 		netdev_alloc_skb
> 		memmov
> 	else
> 		build_skb
> 
> but maybe it makes sense to put this logic in build_skb?
> 
> 

I am not sure I understand the question.

If virtio-net uses netdev_alloc_skb(), all is good, you have nothing to
change.

build_skb() is for drivers that allocate the memory to hold frame, and
wait for NIC completion before allocating/populating the skb itself.
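
A rough sketch of that pattern, for readers not familiar with it; this is
illustrative pseudo-driver code, not taken from any real driver, and the
2048-byte fragment size is just an assumption:

#include <linux/skbuff.h>

/* The driver hands 'buf' to the NIC at ring-refill time; only when the RX
 * completion arrives is an sk_buff built around that same memory.
 */
static struct sk_buff *rx_complete(void *buf, unsigned int len)
{
	struct sk_buff *skb;

	/* build_skb() wraps an skb around 'buf'; the skb_shared_info is
	 * carved out of the tail of the 2048-byte area, so no second
	 * allocation is needed for the data.
	 */
	skb = build_skb(buf, 2048);
	if (!skb)
		return NULL;		/* caller must recycle/free buf */

	skb_reserve(skb, NET_SKB_PAD);	/* headroom kept in front of the frame */
	skb_put(skb, len);		/* frame bytes were already DMAed there */
	return skb;
}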

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 13:06                 ` Eric Dumazet
@ 2012-06-04 13:41                   ` Michael S. Tsirkin
  2012-06-04 14:01                     ` Eric Dumazet
  0 siblings, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 13:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Mon, Jun 04, 2012 at 03:06:53PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 15:37 +0300, Michael S. Tsirkin wrote:
> > On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > > 
> > > Please note I havent tested yet this patch, lacking hardware for this.
> > > 
> > > (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
> > > ixgbe uses fragments...)
> > 
> > virtio-net uses netdev_alloc_skb but maybe it should call
> > build_skb instead?
> > 
> > Also, it's not uncommon for drivers to copy short packets out to be able
> > to reuse pages.  virtio does this but I am guessing the logic is not
> > really virtio specific.
> > 
> > We could do
> > 	if (len < GOOD_COPY_LEN)
> > 		netdev_alloc_skb
> > 		memmov
> > 	else
> > 		build_skb
> > 
> > but maybe it makes sense to put this logic in build_skb?
> > 
> > 
> 
> I am not sure to understand the question.
> 
> If virtio-net uses netdev_alloc_skb(), all is good, you have nothing to
> change.
> 
> build_skb() is for drivers that allocate the memory to hold frame, and
> wait for NIC completion before allocating/populating the skb itself.
> 


This is generally what virtio does, take a look:
page_to_skb fills the first fragment and receive_mergeable fills the
rest (other modes are for legacy hardware).

The way the hypervisor now works is this (we call it mergeable buffers):

- pages are passed to hardware
- the hypervisor puts virtio-specific stuff in the first 12 bytes
  of the first page
- following this, the rest of the first page and all following
  pages hold data

The driver gets the first page, allocates the skb, copies out the 12-byte
header and copies the first 128 bytes of data into the skb.
The rest, if any, is populated by the pages.

So I guess I'm asking for advice: would it make sense to switch to build_skb,
and how best to handle the data copying above? Maybe it would help
if we changed the hypervisor to write the 12 bytes separately?
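
For concreteness, the copy-out step described above looks roughly like the
sketch below. The constants and helper name are placeholders, not the
actual virtio_net code, and the page-fragment attachment is left out:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define VNET_HDR_LEN	12	/* hypervisor header at the start of page 0 */
#define GOOD_COPY_LEN	128	/* small head copied into the linear area */

static struct sk_buff *sketch_rx(struct net_device *dev,
				 struct page *page, unsigned int len)
{
	char *p = page_address(page);
	unsigned int copy;
	struct sk_buff *skb;

	skb = netdev_alloc_skb_ip_align(dev, GOOD_COPY_LEN);
	if (!skb)
		return NULL;

	p   += VNET_HDR_LEN;	/* skip the 12-byte device header */
	len -= VNET_HDR_LEN;

	copy = min_t(unsigned int, len, GOOD_COPY_LEN);
	memcpy(skb_put(skb, copy), p, copy);	/* head copied inline */

	/* the remainder (len - copy), plus any following pages, would be
	 * attached to the skb as page fragments -- omitted here
	 */
	return skb;
}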
  


-- 
MST

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 13:41                   ` Michael S. Tsirkin
@ 2012-06-04 14:01                     ` Eric Dumazet
  2012-06-04 14:09                       ` Eric Dumazet
  2012-06-04 14:17                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-06-04 14:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Mon, 2012-06-04 at 16:41 +0300, Michael S. Tsirkin wrote:

> This is generally what virtio does, take a look:
> page_to_skb fills the first fragment and receive_mergeable fills the
> rest (other modes are for legacy hardware).
> 
> The way hypervisor now works is this (we call it mergeable buffers):
> 
> - pages are passed to hardware
> - hypervisor puts virtio specific stuff in first 12 bytes
>   on first page
> - following this, the rest of the first page and all following
>   pages have data
> 
> The driver gets the 1st page, allocates the skb, copies out the 12 byte
> header and copies the first 128 bytes of data into skb.
> The rest if any is populated by the pages.
> 
> So I guess I'm asking for advice, would it make sense to switch to build_skb
> and how best to handle the data copying above? Maybe it would help
> if we changed the hypervisor to write the 12 bytes separately?
>   

Thanks for these details.

Not sure 12 bytes of headroom would be enough (instead of the
NET_SKB_PAD reserved in netdev_alloc_skb_ip_align()), but what could be
done is to use the first page as the skb->head, i.e. using
build_skb(), removing one fragment, one (small) copy and one
{put|get}_page() pair.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 14:01                     ` Eric Dumazet
@ 2012-06-04 14:09                       ` Eric Dumazet
  2012-06-04 14:17                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-06-04 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Mon, 2012-06-04 at 16:01 +0200, Eric Dumazet wrote:

> Not sure 12 bytes of headroom would be enough (instead of the
> NET_SKB_PAD reserved in netdev_alloc_skb_ip_align(), but what could be
> done indeed is to use the first page as the skb->head, so using
> build_skb() indeed, removing one fragment, one (small) copy and one
> {put|get}_page() pair.

It would also avoid 'pulling' TCP data payload into the linear part.

page_to_skb() does :

copy = len;
if (copy > skb_tailroom(skb))
	copy = skb_tailroom(skb);
memcpy(skb_put(skb, copy), p, copy);

This means GRO or TCP coalescing (or splice()) has to handle two
segments to fetch data.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 14:01                     ` Eric Dumazet
  2012-06-04 14:09                       ` Eric Dumazet
@ 2012-06-04 14:17                       ` Michael S. Tsirkin
  2012-06-04 15:01                         ` Eric Dumazet
  1 sibling, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 14:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Mon, Jun 04, 2012 at 04:01:41PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 16:41 +0300, Michael S. Tsirkin wrote:
> 
> > This is generally what virtio does, take a look:
> > page_to_skb fills the first fragment and receive_mergeable fills the
> > rest (other modes are for legacy hardware).
> > 
> > The way hypervisor now works is this (we call it mergeable buffers):
> > 
> > - pages are passed to hardware
> > - hypervisor puts virtio specific stuff in first 12 bytes
> >   on first page
> > - following this, the rest of the first page and all following
> >   pages have data
> > 
> > The driver gets the 1st page, allocates the skb, copies out the 12 byte
> > header and copies the first 128 bytes of data into skb.
> > The rest if any is populated by the pages.
> > 
> > So I guess I'm asking for advice, would it make sense to switch to build_skb
> > and how best to handle the data copying above? Maybe it would help
> > if we changed the hypervisor to write the 12 bytes separately?
> >   
> 
> Thanks for these details.
> 
> Not sure 12 bytes of headroom would be enough (instead of the
> NET_SKB_PAD reserved in netdev_alloc_skb_ip_align(), but what could be
> done indeed is to use the first page as the skb->head, so using
> build_skb() indeed, removing one fragment, one (small) copy and one
> {put|get}_page() pair.
> 

bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD
after build_skb. You are saying it's not a must?

Hmm so maybe we should teach the hypervisor to write data
out at an offset. Interesting.

Another question is about very small packets truesize.
build_skb sets truesize to frag_size but isn't
this too small? We keep the whole page around, no?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 14:17                       ` Michael S. Tsirkin
@ 2012-06-04 15:01                         ` Eric Dumazet
  2012-06-04 17:20                           ` Michael S. Tsirkin
  2012-06-04 18:16                           ` Michael S. Tsirkin
  0 siblings, 2 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-06-04 15:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Mon, 2012-06-04 at 17:17 +0300, Michael S. Tsirkin wrote:

> bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD
> after build_skb. You are saying it's not a must?
> 

32 would be the minimum. NET_SKB_PAD uses a full cache line (64 bytes
on most current x86 CPUs) to avoid using half a cache line.

> Hmm so maybe we should teach the hypervisor to write data
> out at an offset. Interesting.
> 
> Another question is about very small packets truesize.
> build_skb sets truesize to frag_size but isn't
> this too small? We keep the whole page around, no?

We keep one page per cpu, at most.

For example, with MTU=1500 and PAGE_SIZE=4096, one page is split into
two fragments of 1500 + NET_SKB_PAD + align(shared_info), so it's good
enough (this is very close to the 2048 'truesize').
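
As a back-of-the-envelope check, assuming the usual x86_64 values
(NET_SKB_PAD = 64, SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) = 320)
and a 1514-byte Ethernet frame for MTU 1500:

	  SKB_DATA_ALIGN(NET_SKB_PAD + 1514)			/* 1600 */
	+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info))	/*  320 */
	= 1920, which fits in a 2048-byte half page.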

But yes, for some uses (wifi for example), we might use a full page per
skb, yet underestimate skb->truesize. Hopefully we can track these uses
and fix them.

ath9k, for example, could be changed to be able to reassemble up to 3
frags instead of 2, i.e. extending what was done in commit
0d95521ea74735826cb2e28bebf6a07392c75bfa (ath9k: use split rx
buffers to get rid of order-1 skb allocations)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 15:01                         ` Eric Dumazet
@ 2012-06-04 17:20                           ` Michael S. Tsirkin
  2012-06-04 17:44                             ` Eric Dumazet
  2012-06-04 18:16                           ` Michael S. Tsirkin
  1 sibling, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 17:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Mon, Jun 04, 2012 at 05:01:04PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 17:17 +0300, Michael S. Tsirkin wrote:
> 
> > bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD
> > after build_skb. You are saying it's not a must?
> > 
> 
> 32 would be the minimum. NET_SKB_PAD uses a whole cache line (64 bytes
> on most current x86 cpus) to avoid using only half a cache line.

OK, so if we want to use build_skb we need to pad by at least
32 - 12 bytes, or more likely 64 - 12. We lose 52 bytes per page
but save a copy of 128 bytes for medium-sized packets.

> > Hmm so maybe we should teach the hypervisor to write data
> > out at an offset. Interesting.
> > 
> > Another question is about very small packets truesize.
> > build_skb sets truesize to frag_size but isn't
> > this too small? We keep the whole page around, no?
> 
> We keep one page per cpu, at most.
> 
> For example, with MTU=1500 and PAGE_SIZE=4096, one page is split into
> two fragments of 1500 + NET_SKB_PAD + align(shared_info), so it's good
> enough (this is very close to the 2048 'truesize').

I see. virtio allocates full 4K pages which works well for GSO
but it means smaller packets would get large truesize
wasting lots of memory.
So maybe we should keep copying packets < 128 bytes.

> But yes, for some uses (wifi for example), we might use a full page per
> skb, yet underestimate skb->truesize. Hopefully we can track these uses
> and fix them.
> 
> ath9k for example could be changed, to be able to reassemble up to 3
> frags instead of 2 frags, ie extending what did commit
> 0d95521ea74735826cb2e28bebf6a07392c75bfa (ath9k: use split rx
> buffers to get rid of order-1 skb allocations)
> 
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 17:20                           ` Michael S. Tsirkin
@ 2012-06-04 17:44                             ` Eric Dumazet
  2012-06-04 18:16                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2012-06-04 17:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Mon, 2012-06-04 at 20:20 +0300, Michael S. Tsirkin wrote:

> I see. virtio allocates full 4K pages which works well for GSO
> but it means smaller packets would get large truesize
> wasting lots of memory.
> So maybe we should keep copying packets < 128 bytes.

Yes, but restrict the copy to the bare minimum if > 128 bytes

See the following commit for details (the actual changelog is bigger
than the patch itself)

commit 56138f50d1900b0c3d8647376e37b488b23ba53d
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 18 04:48:33 2012 +0000

    iwlwifi: dont pull too much payload in skb head
    
    As iwlwifi use fat skbs, it should not pull too much data in skb->head,
    and particularly no tcp data payload, or splice() is slower, and TCP
    coalescing is disabled. Copying payload to userland also involves at
    least two copies (part from header, part from fragment)
    
    Each layer will pull its header from the fragment as needed.
    
    (on 64bit arches, skb_tailroom(skb) at this point is 192 bytes)
    
    With this patch applied, I have a major reduction of collapsed/pruned
    TCP packets, a nice increase of TCPRcvCoalesce counter, and overall
    better Internet User experience.
    
    Small packets are still using a fragless skb, so that page can be reused
    by the driver.
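
For illustration, the pattern that changelog describes looks roughly like
this (names and the 128-byte copy-break threshold are made up, this is
not the iwlwifi patch itself): tiny frames are copied into a small
fragless skb so the page can be recycled at once; larger frames keep only
the headers in skb->head and leave the payload in a page fragment, which
keeps splice() and TCP coalescing happy.

static struct sk_buff *rx_frame_to_skb(struct net_device *dev,
				       struct page *page, unsigned int offset,
				       unsigned int len, unsigned int hdr_len)
{
	void *data = page_address(page) + offset;
	struct sk_buff *skb;

	if (len <= 128) {				/* copy-break: fragless skb */
		skb = netdev_alloc_skb_ip_align(dev, len);
		if (!skb)
			return NULL;
		memcpy(skb_put(skb, len), data, len);
		return skb;				/* page not referenced, can be reused */
	}

	skb = netdev_alloc_skb_ip_align(dev, hdr_len);	/* headers only in skb->head */
	if (!skb)
		return NULL;
	memcpy(skb_put(skb, hdr_len), data, hdr_len);
	get_page(page);					/* the frag below keeps a reference */
	skb_add_rx_frag(skb, 0, page, offset + hdr_len, len - hdr_len,
			PAGE_SIZE);			/* charge a full page to truesize */
	return skb;
}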
    

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 15:01                         ` Eric Dumazet
  2012-06-04 17:20                           ` Michael S. Tsirkin
@ 2012-06-04 18:16                           ` Michael S. Tsirkin
  2012-06-04 19:29                             ` Eric Dumazet
  1 sibling, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 18:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Mon, Jun 04, 2012 at 05:01:04PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 17:17 +0300, Michael S. Tsirkin wrote:
> 
> > bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD
> > after build_skb. You are saying it's not a must?
> > 
> 
> 32 would be the minimum. NET_SKB_PAD uses a whole cache line (64 bytes
> on most current x86 cpus) to avoid using only half a cache line.
> 
> > Hmm so maybe we should teach the hypervisor to write data
> > out at an offset. Interesting.
> > 
> > Another question is about very small packets truesize.
> > build_skb sets truesize to frag_size but isn't
> > this too small? We keep the whole page around, no?
> 
> We keep one page per cpu, at most.
> 
> For example, with MTU=1500 and PAGE_SIZE=4096, one page is split into
> two fragments of 1500 + NET_SKB_PAD + align(shared_info), so it's good
> enough (this is very close to the 2048 'truesize').

Yes, but if a tcp socket then hangs on to one of the fragments
while the other has been freed, the whole page is still
never reused, right?

Doesn't this mean truesize should be 4K?

-- 
MST

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 17:44                             ` Eric Dumazet
@ 2012-06-04 18:16                               ` Michael S. Tsirkin
  2012-06-04 19:24                                 ` Eric Dumazet
  0 siblings, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 18:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Mon, Jun 04, 2012 at 07:44:50PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 20:20 +0300, Michael S. Tsirkin wrote:
> 
> > I see. virtio allocates full 4K pages which works well for GSO
> > but it means smaller packets would get large truesize
> > wasting lots of memory.
> > So maybe we should keep copying packets < 128 bytes.
> 
> Yes, but restrict the copy to the bare minimum if > 128 bytes
> 
> See the following commit for details (the actual changelog is bigger
> than the patch itself)
> 
> commit 56138f50d1900b0c3d8647376e37b488b23ba53d
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Fri May 18 04:48:33 2012 +0000
> 
>     iwlwifi: dont pull too much payload in skb head
>     
>     As iwlwifi use fat skbs, it should not pull too much data in skb->head,
>     and particularly no tcp data payload, or splice() is slower, and TCP
>     coalescing is disabled. Copying payload to userland also involves at
>     least two copies (part from header, part from fragment)
>     
>     Each layer will pull its header from the fragment as needed.
>     
>     (on 64bit arches, skb_tailroom(skb) at this point is 192 bytes)
>     
>     With this patch applied, I have a major reduction of collapsed/pruned
>     TCP packets, a nice increase of TCPRcvCoalesce counter, and overall
>     better Internet User experience.
>     
>     Small packets are still using a fragless skb, so that page can be reused
>     by the driver.
>     

Will take a look, thanks.
By the way, this comment at build_skb:
* @frag_size: size of fragment, or 0 if head was kmalloced

is not very clear to me. Could you clarify what exactly size
of fragment means in this context?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 18:16                               ` Michael S. Tsirkin
@ 2012-06-04 19:24                                 ` Eric Dumazet
  2012-06-04 19:48                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2012-06-04 19:24 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote:

> Will take a look, thanks.
> By the way, this comment at build_skb:
> * @frag_size: size of fragment, or 0 if head was kmalloced
> 
> is not very clear to me. Could you clarify what exactly size
> of fragment means in this context?
> 


If your driver did :

data = kmalloc(100), then you use @frag_size=0, so that build_skb() does
the ksize(data) to fetch the real size (it can depend on the
slab/slub/slob allocator)


If you used netdev_alloc_frag(128), then you use 128 because there is no
way build_skb() can guess the size of the fragment. It's also how we
signal to build_skb() that skb->head_frag is set to 1.

__netdev_alloc_skb() for example does :

void *data = netdev_alloc_frag(fragsz);
skb = build_skb(data, fragsz);
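
Spelling out the kmalloc() case as well (a sketch only; note the
allocation must leave room for the shared info at the tail):

	void *data = kmalloc(100 + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
			     GFP_ATOMIC);
	struct sk_buff *skb = data ? build_skb(data, 0) : NULL;
	/* frag_size == 0: build_skb() uses ksize(data) to find the real size,
	 * and skb->head_frag stays 0, so the head is later kfree()d rather
	 * than released with put_page().
	 */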

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 18:16                           ` Michael S. Tsirkin
@ 2012-06-04 19:29                             ` Eric Dumazet
  2012-06-04 19:43                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2012-06-04 19:29 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote:

> Yes but if a tcp socket then hangs on, on one of the fragments,
> while the other has been freed, the whole page is still
> never reused, right?
> 
> Doesn't this mean truesize should be 4K?
> 

Yes, or more exactly PAGE_SIZE, but then performance would really go
down on machines with 64KB pages. Maybe we should enable the whole
frag-head idea only for PAGE_SIZE=4096.

Not sure we want to track precise truesize, as the minimum truesize is
SKB_DATA_ALIGN(length + NET_SKB_PAD) + SKB_DATA_ALIGN(sizeof(struct
skb_shared_info)) (64 + 64 + 320) = 448

It's not like the buggy drivers that used truesize = length

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 19:29                             ` Eric Dumazet
@ 2012-06-04 19:43                               ` Michael S. Tsirkin
  2012-06-04 19:52                                 ` Eric Dumazet
  2012-06-04 19:56                                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 19:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Mon, Jun 04, 2012 at 09:29:45PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote:
> 
> > Yes but if a tcp socket then hangs on, on one of the fragments,
> > while the other has been freed, the whole page is still
> > never reused, right?
> > 
> > Doesn't this mean truesize should be 4K?
> > 
> 
> Yes, or more exactly PAGE_SIZE, but then performance would really go
> down on machines with 64KB pages.
> Maybe we should make the whole frag
> head idea enabled only for PAGE_SIZE=4096.
> 
> Not sure we want to track precise truesize, as the minimum truesize is
> SKB_DATA_ALIGN(length + NET_SKB_PAD) + SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info)) (64 + 64 + 320) = 448
> 
> Its not like buggy drivers that used truesize = length
> 
> 

Interesting. But where's the threshold?

-- 
MST

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 19:24                                 ` Eric Dumazet
@ 2012-06-04 19:48                                   ` Michael S. Tsirkin
  2012-06-04 19:56                                     ` Eric Dumazet
  0 siblings, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 19:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Mon, Jun 04, 2012 at 09:24:02PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote:
> 
> > Will take a look, thanks.
> > By the way, this comment at build_skb:
> > * @frag_size: size of fragment, or 0 if head was kmalloced
> > 
> > is not very clear to me. Could you clarify what exactly size
> > of fragment means in this context?
> > 
> 
> 
> If your driver did :
> 
> data = kmalloc(100), then you use @frag_size=0, so that build_skb() does
> the ksize(data) to fetch the real size (it can depend on the
> slab/slub/slob allocator)
> 
> 
> If you used netdev_alloc_frag(128), then you use 128 because there is no
> way build_skb() can guess the size of the fragment. It's also how we
> signal to build_skb() that skb->head_frag is set to 1.
> 
> __netdev_alloc_skb() for example does :
> 
> void *data = netdev_alloc_frag(fragsz);
> skb = build_skb(data, fragsz);
> 

If I do this what will truesize be? 128, no?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 19:43                               ` Michael S. Tsirkin
@ 2012-06-04 19:52                                 ` Eric Dumazet
  2012-06-04 21:54                                   ` Michael S. Tsirkin
  2012-06-04 19:56                                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2012-06-04 19:52 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Mon, 2012-06-04 at 22:43 +0300, Michael S. Tsirkin wrote:
> On Mon, Jun 04, 2012 at 09:29:45PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote:
> > 
> > > Yes but if a tcp socket then hangs on, on one of the fragments,
> > > while the other has been freed, the whole page is still
> > > never reused, right?
> > > 
> > > Doesn't this mean truesize should be 4K?
> > > 
> > 
> > Yes, or more exactly PAGE_SIZE, but then performance would really go
> > down on machines with 64KB pages.
> > Maybe we should make the whole frag
> > head idea enabled only for PAGE_SIZE=4096.
> > 
> > Not sure we want to track precise truesize, as the minimum truesize is
> > SKB_DATA_ALIGN(length + NET_SKB_PAD) + SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info)) (64 + 64 + 320) = 448
> > 
> > Its not like buggy drivers that used truesize = length
> > 
> > 
> 
> Interesting. But where's the threshold?
> 

It all depends on the global limit you have on your machine.

If you allow tcp memory to use 10% of ram, then a systematic x4 error
would allow it to use 40% of ram. Maybe not enough to crash.

Now you have to find a real workload able to hit this limit for real...

But, if you "allow" a driver to claim a truesize of 1 (instead of 4096),
you can reach the limit and OOM faster

You know, even the current page stored for each socket (sk_sndmsg_page)
can be a problem if you set up 1,000,000 tcp sockets. That can consume
4GB of ram (added to the inodes/sockets themselves).
This is not really taken into account right now...

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 19:43                               ` Michael S. Tsirkin
  2012-06-04 19:52                                 ` Eric Dumazet
@ 2012-06-04 19:56                                 ` Michael S. Tsirkin
  2012-06-04 20:05                                   ` Eric Dumazet
  1 sibling, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 19:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Mon, Jun 04, 2012 at 10:43:30PM +0300, Michael S. Tsirkin wrote:
> On Mon, Jun 04, 2012 at 09:29:45PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote:
> > 
> > > Yes but if a tcp socket then hangs on, on one of the fragments,
> > > while the other has been freed, the whole page is still
> > > never reused, right?
> > > 
> > > Doesn't this mean truesize should be 4K?
> > > 
> > 
> > Yes, or more exactly PAGE_SIZE, but then performance would really go
> > down on machines with 64KB pages.
> > Maybe we should make the whole frag
> > head idea enabled only for PAGE_SIZE=4096.
> > 
> > Not sure we want to track precise truesize, as the minimum truesize is
> > SKB_DATA_ALIGN(length + NET_SKB_PAD) + SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info)) (64 + 64 + 320) = 448
> > 
> > Its not like buggy drivers that used truesize = length
> > 
> > 
> 
> Interesting. But where's the threshold?

And just to explain why I'm asking, if it's OK to
declare 2K when you use 4K, we can do (effectively):

+       skb->truesize += PAGE_SIZE * nfrags;
-       skb->truesize += min(1500, skb->data_len);

which means with 1500 byte packets we can use
as much memory as we did before
4b727361f0bc7ee7378298941066d8aa15023ffb

> -- 
> MST

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 19:48                                   ` Michael S. Tsirkin
@ 2012-06-04 19:56                                     ` Eric Dumazet
  2012-06-04 21:20                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2012-06-04 19:56 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Mon, 2012-06-04 at 22:48 +0300, Michael S. Tsirkin wrote:

> If I do this what will truesize be? 128, no?

My example was not correct, since you must have enough room for the
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) (320 bytes).

So it would be 128 + 320 = 448
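
In driver terms the corrected example becomes (sketch only; the 320/448
figures assume a 64-bit build, so better use the macro than the
constant):

	unsigned int fragsz = 128 +
			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	void *data = netdev_alloc_frag(fragsz);	/* 128 + 320 = 448 on x86_64 */
	struct sk_buff *skb = data ? build_skb(data, fragsz) : NULL;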

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 19:56                                 ` Michael S. Tsirkin
@ 2012-06-04 20:05                                   ` Eric Dumazet
  0 siblings, 0 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-06-04 20:05 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Mon, 2012-06-04 at 22:56 +0300, Michael S. Tsirkin wrote:

> And just to explain why I'm asking, if it's OK to
> declare 2K when you use 4K, we can do (effectively):
> 
> +       skb->truesize += PAGE_SIZE * nfrags;
> -       skb->truesize += min(1500, skb->data_len);
> 
> which means with 1500 byte packets we can use
> as much memory as we did before
> 4b727361f0bc7ee7378298941066d8aa15023ffb
> 

How are you sure data_len is >= 1500 ?

If you know that you use a full page for a fragment, then PAGE_SIZE is
better than 1500


If you share a page with 2/4 frames (but not one hundred), you can
assume the risk of underestimation is really low.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 19:56                                     ` Eric Dumazet
@ 2012-06-04 21:20                                       ` Michael S. Tsirkin
  2012-06-05  2:50                                         ` Eric Dumazet
  0 siblings, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 21:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Mon, Jun 04, 2012 at 09:56:52PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 22:48 +0300, Michael S. Tsirkin wrote:
> 
> > If I do this what will truesize be? 128, no?
> 
> My example was not correct, since you must have enough room for the
> SKB_DATA_ALIGN(sizeof(struct skb_shared_info))  ( 320 bytes )
> 
> So it would be 128 + 320 = 448
> 


Ugh. I forgot about that. The shinfo goes into the same page,
so we'll also have to make all frags shorter by 320
to leave space for the shinfo at the tail.
Overall it looks like we need hypervisor extensions if
we want to use build_skb ...

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 19:52                                 ` Eric Dumazet
@ 2012-06-04 21:54                                   ` Michael S. Tsirkin
  2012-06-05  2:46                                     ` Eric Dumazet
  0 siblings, 1 reply; 61+ messages in thread
From: Michael S. Tsirkin @ 2012-06-04 21:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev

On Mon, Jun 04, 2012 at 09:52:59PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 22:43 +0300, Michael S. Tsirkin wrote:
> > On Mon, Jun 04, 2012 at 09:29:45PM +0200, Eric Dumazet wrote:
> > > On Mon, 2012-06-04 at 21:16 +0300, Michael S. Tsirkin wrote:
> > > 
> > > > Yes but if a tcp socket then hangs on, on one of the fragments,
> > > > while the other has been freed, the whole page is still
> > > > never reused, right?
> > > > 
> > > > Doesn't this mean truesize should be 4K?
> > > > 
> > > 
> > > Yes, or more exactly PAGE_SIZE, but then performance would really go
> > > down on machines with 64KB pages.
> > > Maybe we should make the whole frag
> > > head idea enabled only for PAGE_SIZE=4096.
> > > 
> > > Not sure we want to track precise truesize, as the minimum truesize is
> > > SKB_DATA_ALIGN(length + NET_SKB_PAD) + SKB_DATA_ALIGN(sizeof(struct
> > > skb_shared_info)) (64 + 64 + 320) = 448
> > > 
> > > Its not like buggy drivers that used truesize = length
> > > 
> > > 
> > 
> > Interesting. But where's the threshold?
> > 
> 
> It all depends on the global limit you have on your machine.
> 
> If you allow tcp memory to use 10% of ram, then a systematic x4 error
> would allow it to use 40% of ram. Maybe not enough to crash.
> 
> Now you have to find a real workload able to hit this limit for real...
> 
> But, if you "allow" a driver to claim a truesize of 1 (instead of 4096),
> you can reach the limit and OOM faster
> 
> You know, even the current page stored for each socket (sk_sndmsg_page)
> can be a problem if you setup 1.000.000 tcp sockets. That can consume
> 4GB of ram (added to inode/sockets themselves)
> This is not really taken into account right now...
> 
> 

Yes, but what bugs me is that if the box is not under memory pressure,
this overestimation limits buffers for no real gain.
How about we teach tcp to use data_len for buffer
limits normally and switch to truesize when low on memory?

-- 
MST

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 21:54                                   ` Michael S. Tsirkin
@ 2012-06-05  2:46                                     ` Eric Dumazet
  0 siblings, 0 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-06-05  2:46 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Tue, 2012-06-05 at 00:54 +0300, Michael S. Tsirkin wrote:

> Yes, but what bugs me is that if the box is not under memory pressure,
> this overestimation limits buffers for no real gain.
> How about we teach tcp to use data_len for buffer
> limits normally and switch to truesize when low on memory?
> 

You should first have evidence of the effect of this limitation.

I see more evidence of poor choices at the driver level than at the
core level.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
  2012-06-04 21:20                                       ` Michael S. Tsirkin
@ 2012-06-05  2:50                                         ` Eric Dumazet
  0 siblings, 0 replies; 61+ messages in thread
From: Eric Dumazet @ 2012-06-05  2:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev

On Tue, 2012-06-05 at 00:20 +0300, Michael S. Tsirkin wrote:
> On Mon, Jun 04, 2012 at 09:56:52PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-06-04 at 22:48 +0300, Michael S. Tsirkin wrote:
> > 
> > > If I do this what will truesize be? 128, no?
> > 
> > My example was not correct, since you must have enough room for the
> > SKB_DATA_ALIGN(sizeof(struct skb_shared_info))  ( 320 bytes )
> > 
> > So it would be 128 + 320 = 448
> > 
> 
> 
> Ugh. I forgot about that. The shinfo goes into the same page,
> so we'll also have to make all frags shorter by 320
> to leave space for the shinfo at the tail.
> Overall it looks like we need hypervisor extensions if
> we want to use build_skb ...

Maybe not.

If you provided a 2048-byte block and the hypervisor filled one (small)
frame, there might already be room available at the end of the block
for the shinfo.

If yes: you can use build_skb()
If not: netdev_alloc_skb_ip_align()

I mentioned this trick for the ixgbe driver.
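
Roughly (a sketch under those assumptions, with a hypothetical helper,
not the real virtio receive path):

static struct sk_buff *rx_block_to_skb(struct net_device *dev, char *block,
				       unsigned int headroom,
				       unsigned int hdr_len,
				       unsigned int data_len)
{
	unsigned int used = headroom + hdr_len + data_len;
	struct sk_buff *skb;

	if (used + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) <= 2048) {
		/* room left at the end of the 2048-byte block for the shinfo */
		skb = build_skb(block, 2048);
		if (!skb)
			return NULL;
		skb_reserve(skb, headroom + hdr_len);
		skb_put(skb, data_len);
		return skb;
	}

	/* not enough tail room: copy into a normally allocated skb instead */
	skb = netdev_alloc_skb_ip_align(dev, data_len);
	if (!skb)
		return NULL;
	memcpy(skb_put(skb, data_len), block + headroom + hdr_len, data_len);
	return skb;
}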

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2012-06-05  2:50 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-17 12:18 Stable regression with 'tcp: allow splice() to build full TSO packets' Willy Tarreau
2012-05-17 15:01 ` Willy Tarreau
2012-05-17 15:43   ` Eric Dumazet
2012-05-17 15:56     ` Willy Tarreau
2012-05-17 16:33       ` Eric Dumazet
2012-05-17 16:40         ` Willy Tarreau
2012-05-17 16:47           ` Eric Dumazet
2012-05-17 16:49           ` Eric Dumazet
2012-05-17 17:22             ` Willy Tarreau
2012-05-17 17:34             ` [PATCH net-next] net: netdev_alloc_skb() use build_skb() Eric Dumazet
2012-05-17 17:45               ` Willy Tarreau
2012-06-04 12:39                 ` Michael S. Tsirkin
2012-06-04 12:44                   ` Willy Tarreau
2012-05-17 19:53               ` David Miller
2012-05-18  4:41                 ` Eric Dumazet
2012-06-04 12:37               ` Michael S. Tsirkin
2012-06-04 13:06                 ` Eric Dumazet
2012-06-04 13:41                   ` Michael S. Tsirkin
2012-06-04 14:01                     ` Eric Dumazet
2012-06-04 14:09                       ` Eric Dumazet
2012-06-04 14:17                       ` Michael S. Tsirkin
2012-06-04 15:01                         ` Eric Dumazet
2012-06-04 17:20                           ` Michael S. Tsirkin
2012-06-04 17:44                             ` Eric Dumazet
2012-06-04 18:16                               ` Michael S. Tsirkin
2012-06-04 19:24                                 ` Eric Dumazet
2012-06-04 19:48                                   ` Michael S. Tsirkin
2012-06-04 19:56                                     ` Eric Dumazet
2012-06-04 21:20                                       ` Michael S. Tsirkin
2012-06-05  2:50                                         ` Eric Dumazet
2012-06-04 18:16                           ` Michael S. Tsirkin
2012-06-04 19:29                             ` Eric Dumazet
2012-06-04 19:43                               ` Michael S. Tsirkin
2012-06-04 19:52                                 ` Eric Dumazet
2012-06-04 21:54                                   ` Michael S. Tsirkin
2012-06-05  2:46                                     ` Eric Dumazet
2012-06-04 19:56                                 ` Michael S. Tsirkin
2012-06-04 20:05                                   ` Eric Dumazet
2012-05-17 18:38       ` Stable regression with 'tcp: allow splice() to build full TSO packets' Ben Hutchings
2012-05-17 19:55   ` David Miller
2012-05-17 20:04     ` Willy Tarreau
2012-05-17 20:07       ` David Miller
2012-05-17 20:41 ` Eric Dumazet
2012-05-17 21:14   ` Willy Tarreau
2012-05-17 21:40     ` Eric Dumazet
2012-05-17 21:50       ` Eric Dumazet
2012-05-17 21:57         ` Willy Tarreau
2012-05-17 22:01         ` Eric Dumazet
2012-05-17 22:10           ` Eric Dumazet
2012-05-17 22:16           ` Willy Tarreau
2012-05-17 22:22             ` Eric Dumazet
2012-05-17 22:24               ` Willy Tarreau
2012-05-17 22:25                 ` David Miller
2012-05-17 22:30                   ` Willy Tarreau
2012-05-17 22:35                     ` David Miller
2012-05-17 22:49                       ` Willy Tarreau
2012-05-17 22:27               ` Joe Perches
2012-05-17 21:54       ` Willy Tarreau
2012-05-17 21:47     ` Willy Tarreau
2012-05-17 22:14     ` Eric Dumazet
2012-05-17 22:29       ` Willy Tarreau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).