* [PATCH] tcp: allow splice() to build full TSO packets
@ 2012-04-03 19:37 Eric Dumazet
  2012-04-03 21:21 ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2012-04-03 19:37 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Tom Herbert, Yuchung Cheng,
	H.K. Jerry Chu, Maciej Żenczykowski, Mahesh Bandewar,
	Ilpo Järvinen, Nandita Dukkipati

vmsplice()/splice(pipe, socket) calls do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb (assuming PAGE_SIZE=4096).

The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when the pipe is not already filled, and tso_fragment()
tries to split these skbs into MSS multiples.

4096 bytes are usually split into an skb with 2 MSS and a remaining
sub-MSS skb (assuming MTU=1500).

This makes slow start suboptimal because many small frames are sent to
the qdisc/driver layers instead of big ones (constrained by cwnd and
packets in flight, of course).

In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.

Call tcp_push() only if MSG_MORE is not set in the flags parameter.

This bit is automatically provided by splice() internals on all pages
but the last, or on all pages if the user specified the SPLICE_F_MORE
splice() flag.

In some workloads, this can reduce the number of sent logical packets by
an order of magnitude, making zero-copy TCP actually faster than
one-copy :)

Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/tcp.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cfd7edd..2ff6f45 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -860,7 +860,7 @@ wait_for_memory:
 	}
 
 out:
-	if (copied)
+	if (copied && !(flags & MSG_MORE))
 		tcp_push(sk, flags, mss_now, tp->nonagle);
 	return copied;
 


* Re: [PATCH] tcp: allow splice() to build full TSO packets
  2012-04-03 19:37 [PATCH] tcp: allow splice() to build full TSO packets Eric Dumazet
@ 2012-04-03 21:21 ` David Miller
  2012-04-03 21:31   ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2012-04-03 21:21 UTC (permalink / raw)
  To: eric.dumazet
  Cc: netdev, ncardwell, therbert, ycheng, hkchu, maze, maheshb,
	ilpo.jarvinen, nanditad

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 21:37:01 +0200

> vmsplice()/splice(pipe, socket) calls do_tcp_sendpages() one page at a
> time, adding at most 4096 bytes to an skb (assuming PAGE_SIZE=4096).
> 
> The call to tcp_push() at the end of do_tcp_sendpages() forces an
> immediate xmit when the pipe is not already filled, and tso_fragment()
> tries to split these skbs into MSS multiples.
> 
> 4096 bytes are usually split into an skb with 2 MSS and a remaining
> sub-MSS skb (assuming MTU=1500).

Interesting.

But why doesn't TCP_NAGLE_CORK save us?  That gets passed down into
the push pending frames logic when MSG_MORE is specified.

As far as I can tell, the combination of TCP_NAGLE_CORK and the TSO
deferral logic should do the right thing here.

Obviously you see different behavior, but why?

Also, by eliding the tcp_push() call you are introducing other side
effects:

1) we won't do the tcp_mark_push logic

2) we don't set the URG seq

I think #2 can never happen in the vmsplice/splice path, but #1 might
matter.

That's why I want to concentrate on why the tcp_push() path doesn't
behave properly when MSG_MORE is set.


* Re: [PATCH] tcp: allow splice() to build full TSO packets
  2012-04-03 21:21 ` David Miller
@ 2012-04-03 21:31   ` Eric Dumazet
  2012-04-03 21:36     ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2012-04-03 21:31 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ncardwell, therbert, ycheng, hkchu, maze, maheshb,
	ilpo.jarvinen, nanditad

On Tue, 2012-04-03 at 17:21 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 21:37:01 +0200
> 
> > vmsplice()/splice(pipe, socket) calls do_tcp_sendpages() one page at a
> > time, adding at most 4096 bytes to an skb (assuming PAGE_SIZE=4096).
> > 
> > The call to tcp_push() at the end of do_tcp_sendpages() forces an
> > immediate xmit when the pipe is not already filled, and tso_fragment()
> > tries to split these skbs into MSS multiples.
> > 
> > 4096 bytes are usually split into an skb with 2 MSS and a remaining
> > sub-MSS skb (assuming MTU=1500).
> 
> Interesting.
> 
> But why doesn't TCP_NAGLE_CORK save us?  That gets passed down into
> the push pending frames logic when MSG_MORE is specified.
> 
> As far as I can tell, the combination of TCP_NAGLE_CORK and the TSO
> deferral logic should do the right thing here.
> 
> Obviously you see different behavior, but why?
> 
> Also, by eliding the tcp_push() call you are introducing other side
> effects:
> 
> 1) we won't do the tcp_mark_push logic
> 
> 2) we don't set the URG seq
> 
> I think #2 can never happen in the vmsplice/splice path, but #1 might
> matter.
> 
> That's why I want to concentrate on why the tcp_push() path doesn't
> behave properly when MSG_MORE is set.

It behaves properly, I think, but only from the tcp_sendmsg() perspective.

The code in tcp_sendmsg() and do_tcp_sendpages() is similar (actually
probably copy/pasted), but the thing is tcp_sendmsg() is called once per
sendmsg() call (and the push logic is OK at the end of it), while a
single splice() system call can call do_tcp_sendpages() 16 times (or
even more if the pipe buffer was extended by fcntl(F_SETPIPE_SZ)).

Maybe a real fix would be to call do_tcp_sendpages() exactly once, but I
tried this today and found the needed surgery was complex. Also this
would lock the socket for a long period and could add latencies because
of backlog processing.


* Re: [PATCH] tcp: allow splice() to build full TSO packets
  2012-04-03 21:31   ` Eric Dumazet
@ 2012-04-03 21:36     ` David Miller
  2012-04-03 22:50       ` David Miller
  2012-04-05 13:05       ` Eric Dumazet
  0 siblings, 2 replies; 10+ messages in thread
From: David Miller @ 2012-04-03 21:36 UTC (permalink / raw)
  To: eric.dumazet
  Cc: netdev, ncardwell, therbert, ycheng, hkchu, maze, maheshb,
	ilpo.jarvinen, nanditad

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 23:31:29 +0200

> The code in tcp_sendmsg() and do_tcp_sendpages() is similar (actually
> probably copy/pasted), but the thing is tcp_sendmsg() is called once per
> sendmsg() call (and the push logic is OK at the end of it), while a
> single splice() system call can call do_tcp_sendpages() 16 times (or
> even more if the pipe buffer was extended by fcntl(F_SETPIPE_SZ)).

Ok, so this means that in essence the tcp_mark_push should also only
be done in the final sendpage call.

And since I'm wholly convinced that the URG stuff is a complete
"don't care" for this path, I'm convinced your patch is the right
thing to do.

Applied to 'net' and queued up for -stable, thanks Eric.

> Maybe a real fix would be to call do_tcp_sendpages() exactly once, but I
> tried this today and found the needed surgery was complex. Also this
> would lock the socket for a long period and could add latencies because
> of backlog processing.

I don't think this is a good idea.  Maybe we can do some level of
batching at some point, but it would need to have a limit.


* Re: [PATCH] tcp: allow splice() to build full TSO packets
  2012-04-03 21:36     ` David Miller
@ 2012-04-03 22:50       ` David Miller
  2012-04-04  6:15         ` Eric Dumazet
  2012-04-05 13:05       ` Eric Dumazet
  1 sibling, 1 reply; 10+ messages in thread
From: David Miller @ 2012-04-03 22:50 UTC (permalink / raw)
  To: eric.dumazet
  Cc: netdev, ncardwell, therbert, ycheng, hkchu, maze, maheshb,
	ilpo.jarvinen, nanditad

From: David Miller <davem@davemloft.net>
Date: Tue, 03 Apr 2012 17:36:14 -0400 (EDT)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 23:31:29 +0200
> 
>> The code in tcp_sendmsg() and do_tcp_sendpages() is similar (actually
>> probably copy/pasted), but the thing is tcp_sendmsg() is called once per
>> sendmsg() call (and the push logic is OK at the end of it), while a
>> single splice() system call can call do_tcp_sendpages() 16 times (or
>> even more if the pipe buffer was extended by fcntl(F_SETPIPE_SZ)).
> 
> Ok, so this means that in essence the tcp_mark_push should also only
> be done in the final sendpage call.
> 
> And since I'm wholly convinced that the URG stuff is a complete
> "don't care" for this path, I'm convinced your patch is the right
> thing to do.
> 
> Applied to 'net' and queued up for -stable, thanks Eric.

Eric, sorry to be a pain, but to clear my conscience can you tell me
if the sendfile() path does the right thing with your change too?

I'm concerned about returning back into userspace without at least
one tcp_push() at the end.


* Re: [PATCH] tcp: allow splice() to build full TSO packets
  2012-04-03 22:50       ` David Miller
@ 2012-04-04  6:15         ` Eric Dumazet
  2012-04-04  6:20           ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2012-04-04  6:15 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ncardwell, therbert, ycheng, hkchu, maze, maheshb,
	ilpo.jarvinen, nanditad

On Tue, 2012-04-03 at 18:50 -0400, David Miller wrote:

> Eric, sorry to be a pain, but to clear my conscience can you tell me
> if the sendfile() path does the right thing with your change too?
> 
> I'm concerned about returning back into userspace without at least
> one tcp_push() at the end.

Absolutely, this patch also handles the sendfile() path.

do_sendfile() ->
 do_splice_direct() ->
  splice_direct_to_actor() -> 
   ...
    splice_from_pipe() ->
     splice_from_pipe_feed() ->
      pipe_to_sendpage()  more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len
       sock_sendpage() if (more) flags |= MSG_MORE;
        kernel_sendpage()
         inet_sendpage()
          tcp_sendpage()

So the push is performed at the end (the last page doesn't have the MSG_MORE flag).

Here is what we got _before_ the patch using "netperf -t TCP_SENDFILE --
-m 16K"


07:50:14.097137 IP A > B: Flags [S], seq 2635999430, win 14600, options [mss 1460,sackOK,TS val 56315934 ecr 0,nop,wscale 7], length 0
07:50:14.107211 IP B > A: Flags [S.], seq 2438979637, ack 2635999431, win 14480, options [mss 1460,sackOK,TS val 56315935 ecr 56315934,nop,wscale 7], length 0
07:50:14.117318 IP A > B: Flags [.], ack 1, win 115, options [nop,nop,TS val 56315936 ecr 56315935], length 0
07:50:14.117534 IP A > B: Flags [.], seq 1:2897, ack 1, win 115, options [nop,nop,TS val 56315937 ecr 56315935], length 2896
07:50:14.117543 IP A > B: Flags [.], seq 2897:4345, ack 1, win 115, options [nop,nop,TS val 56315937 ecr 56315935], length 1448
07:50:14.127612 IP B > A: Flags [.], ack 2897, win 136, options [nop,nop,TS val 56315938 ecr 56315937], length 0
07:50:14.127618 IP B > A: Flags [.], ack 4345, win 159, options [nop,nop,TS val 56315938 ecr 56315937], length 0
07:50:14.137698 IP A > B: Flags [.], seq 4345:8689, ack 1, win 115, options [nop,nop,TS val 56315939 ecr 56315938], length 4344
07:50:14.137731 IP A > B: Flags [.], seq 8689:11585, ack 1, win 115, options [nop,nop,TS val 56315939 ecr 56315938], length 2896
07:50:14.147906 IP B > A: Flags [.], ack 8689, win 181, options [nop,nop,TS val 56315940 ecr 56315939], length 0
07:50:14.147910 IP B > A: Flags [.], ack 11585, win 204, options [nop,nop,TS val 56315940 ecr 56315939], length 0
07:50:14.158012 IP A > B: Flags [.], seq 11585:17377, ack 1, win 115, options [nop,nop,TS val 56315941 ecr 56315940], length 5792
07:50:14.158022 IP A > B: Flags [P.], seq 17377:18825, ack 1, win 115, options [nop,nop,TS val 56315941 ecr 56315940], length 1448
07:50:14.158171 IP A > B: Flags [.], seq 18825:21721, ack 1, win 115, options [nop,nop,TS val 56315941 ecr 56315940], length 2896
07:50:14.168361 IP B > A: Flags [.], ack 17377, win 227, options [nop,nop,TS val 56315942 ecr 56315941], length 0
07:50:14.168366 IP B > A: Flags [.], ack 18825, win 249, options [nop,nop,TS val 56315942 ecr 56315941], length 0
07:50:14.168424 IP B > A: Flags [.], ack 21721, win 272, options [nop,nop,TS val 56315942 ecr 56315941], length 0
07:50:14.178460 IP A > B: Flags [.], seq 21721:28961, ack 1, win 115, options [nop,nop,TS val 56315943 ecr 56315942], length 7240
07:50:14.178466 IP A > B: Flags [.], seq 28961:34753, ack 1, win 115, options [nop,nop,TS val 56315943 ecr 56315942], length 5792
07:50:14.188668 IP B > A: Flags [.], ack 28961, win 295, options [nop,nop,TS val 56315944 ecr 56315943], length 0
07:50:14.188675 IP B > A: Flags [.], ack 34753, win 317, options [nop,nop,TS val 56315944 ecr 56315943], length 0
07:50:14.198772 IP A > B: Flags [.], seq 34753:44889, ack 1, win 115, options [nop,nop,TS val 56315945 ecr 56315944], length 10136
07:50:14.198778 IP A > B: Flags [P.], seq 44889:47785, ack 1, win 115, options [nop,nop,TS val 56315945 ecr 56315944], length 2896
07:50:14.208990 IP B > A: Flags [.], ack 44889, win 340, options [nop,nop,TS val 56315946 ecr 56315945], length 0
07:50:14.208995 IP B > A: Flags [.], ack 47785, win 362, options [nop,nop,TS val 56315946 ecr 56315945], length 0
07:50:14.219081 IP A > B: Flags [.], seq 47785:60817, ack 1, win 115, options [nop,nop,TS val 56315947 ecr 56315946], length 13032
07:50:14.219090 IP A > B: Flags [.], seq 60817:68057, ack 1, win 115, options [nop,nop,TS val 56315947 ecr 56315946], length 7240
07:50:14.229357 IP B > A: Flags [.], ack 60817, win 385, options [nop,nop,TS val 56315948 ecr 56315947], length 0
07:50:14.229361 IP B > A: Flags [.], ack 68057, win 408, options [nop,nop,TS val 56315948 ecr 56315947], length 0
07:50:14.239450 IP A > B: Flags [.], seq 68057:73849, ack 1, win 115, options [nop,nop,TS val 56315949 ecr 56315948], length 5792
07:50:14.239458 IP A > B: Flags [.], seq 73849:82537, ack 1, win 115, options [nop,nop,TS val 56315949 ecr 56315948], length 8688
07:50:14.239492 IP A > B: Flags [.], seq 82537:86881, ack 1, win 115, options [nop,nop,TS val 56315949 ecr 56315948], length 4344
07:50:14.249694 IP B > A: Flags [.], ack 73849, win 430, options [nop,nop,TS val 56315950 ecr 56315949], length 0
07:50:14.249699 IP B > A: Flags [.], ack 82537, win 453, options [nop,nop,TS val 56315950 ecr 56315949], length 0
07:50:14.249753 IP B > A: Flags [.], ack 86881, win 476, options [nop,nop,TS val 56315950 ecr 56315949], length 0
07:50:14.259790 IP A > B: Flags [.], seq 86881:98465, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 11584
07:50:14.259796 IP A > B: Flags [.], seq 98465:99913, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 1448
07:50:14.259826 IP A > B: Flags [.], seq 99913:108601, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 8688
07:50:14.259830 IP A > B: Flags [P.], seq 108601:112945, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 4344


* Re: [PATCH] tcp: allow splice() to build full TSO packets
  2012-04-04  6:15         ` Eric Dumazet
@ 2012-04-04  6:20           ` David Miller
  0 siblings, 0 replies; 10+ messages in thread
From: David Miller @ 2012-04-04  6:20 UTC (permalink / raw)
  To: eric.dumazet
  Cc: netdev, ncardwell, therbert, ycheng, hkchu, maze, maheshb,
	ilpo.jarvinen, nanditad

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 04 Apr 2012 08:15:28 +0200

> On Tue, 2012-04-03 at 18:50 -0400, David Miller wrote:
> 
>> Eric, sorry to be a pain, but to clear my conscience can you tell me
>> if the sendfile() path does the right thing with your change too?
>> 
>> I'm concerned about returning back into userspace without at least
>> one tcp_push() at the end.
> 
> Absolutely, this patch also handles the sendfile() path.
> 
> do_sendfile() ->
>  do_splice_direct() ->
>   splice_direct_to_actor() -> 
>    ...
>     splice_from_pipe() ->
>      splice_from_pipe_feed() ->
>       pipe_to_sendpage()  more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len
>        sock_sendpage() if (more) flags |= MSG_MORE;
>         kernel_sendpage()
>          inet_sendpage()
>           tcp_sendpage()
> 
> So the push is performed at the end (the last page doesn't have the MSG_MORE flag).

Perfect, thanks for double checking.


* Re: [PATCH] tcp: allow splice() to build full TSO packets
  2012-04-03 21:36     ` David Miller
  2012-04-03 22:50       ` David Miller
@ 2012-04-05 13:05       ` Eric Dumazet
  2012-04-05 23:05         ` David Miller
  1 sibling, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2012-04-05 13:05 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ncardwell, therbert, ycheng, hkchu, maze, maheshb,
	ilpo.jarvinen, nanditad

On Tue, 2012-04-03 at 17:36 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 23:31:29 +0200
> 
> > The code in tcp_sendmsg() and do_tcp_sendpages() is similar (actually
> > probably copy/pasted), but the thing is tcp_sendmsg() is called once per
> > sendmsg() call (and the push logic is OK at the end of it), while a
> > single splice() system call can call do_tcp_sendpages() 16 times (or
> > even more if the pipe buffer was extended by fcntl(F_SETPIPE_SZ)).
> 
> Ok, so this means that in essence the tcp_mark_push should also only
> be done in the final sendpage call.
> 
> And since I'm wholly convinced that the URG stuff is a complete
> "don't care" for this path, I'm convinced your patch is the right
> thing to do.
> 
> Applied to 'net' and queued up for -stable, thanks Eric.

Hmm, thinking again about this, I did more tests and it appears we need
to differentiate the SPLICE_F_MORE flag (a user request) from the
internal marker provided by splice logic (handling a batch of pages).

A program doing splice(... SPLICE_F_MORE) should really call tcp_push()
at the end of its work.

Thanks

[PATCH] tcp: tcp_sendpages() should call tcp_push() once

commit 2f533844242 (tcp: allow splice() to build full TSO packets) added
a regression for splice() calls using SPLICE_F_MORE.

We need to call tcp_push() at the end of the last page processed in
tcp_sendpages(), or else transmits can be deferred and future sends
stall.

Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE,
but with different semantics.

For all sendpage() providers, it's a transparent change. Only
sock_sendpage() and tcp_sendpages() can differentiate the two different
flags provided by pipe_to_sendpage().

Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/splice.c            |    5 ++++-
 include/linux/socket.h |    2 +-
 net/ipv4/tcp.c         |    2 +-
 net/socket.c           |    6 +++---
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 5f883de..f847684 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -30,6 +30,7 @@
 #include <linux/uio.h>
 #include <linux/security.h>
 #include <linux/gfp.h>
+#include <linux/socket.h>
 
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
@@ -690,7 +691,9 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
 	if (!likely(file->f_op && file->f_op->sendpage))
 		return -EINVAL;
 
-	more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
+	more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
+	if (sd->len < sd->total_len)
+		more |= MSG_SENDPAGE_NOTLAST;
 	return file->f_op->sendpage(file, buf->page, buf->offset,
 				    sd->len, &pos, more);
 }
diff --git a/include/linux/socket.h b/include/linux/socket.h
index da2d3e2..b84bbd4 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -265,7 +265,7 @@ struct ucred {
 #define MSG_NOSIGNAL	0x4000	/* Do not generate SIGPIPE */
 #define MSG_MORE	0x8000	/* Sender will send more */
 #define MSG_WAITFORONE	0x10000	/* recvmmsg(): block until 1+ packets avail */
-
+#define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
 #define MSG_EOF         MSG_FIN
 
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exit for file
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2ff6f45..5d54ed3 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -860,7 +860,7 @@ wait_for_memory:
 	}
 
 out:
-	if (copied && !(flags & MSG_MORE))
+	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
 		tcp_push(sk, flags, mss_now, tp->nonagle);
 	return copied;
 
diff --git a/net/socket.c b/net/socket.c
index 484cc69..851edcd 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -811,9 +811,9 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
 
 	sock = file->private_data;
 
-	flags = !(file->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
-	if (more)
-		flags |= MSG_MORE;
+	flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+	/* more is a combination of MSG_MORE and MSG_SENDPAGE_NOTLAST */
+	flags |= more;
 
 	return kernel_sendpage(sock, page, offset, size, flags);
 }


* Re: [PATCH] tcp: allow splice() to build full TSO packets
  2012-04-05 13:05       ` Eric Dumazet
@ 2012-04-05 23:05         ` David Miller
  2012-04-06  1:59           ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2012-04-05 23:05 UTC (permalink / raw)
  To: eric.dumazet
  Cc: netdev, ncardwell, therbert, ycheng, hkchu, maze, maheshb,
	ilpo.jarvinen, nanditad

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 05 Apr 2012 15:05:35 +0200

> Hmm, thinking again about this, I did more tests and it appears we need
> to differentiate the SPLICE_F_MORE flag (a user request) from the
> internal marker provided by splice logic (handling a batch of pages).
> 
> A program doing splice(... SPLICE_F_MORE) should really call tcp_push()
> at the end of its work.

This is the kind of problem I was hoping we weren't introducing
when I asked about sendfile() et al. the other day :-)

> [PATCH] tcp: tcp_sendpages() should call tcp_push() once
> 
> commit 2f533844242 (tcp: allow splice() to build full TSO packets) added
> a regression for splice() calls using SPLICE_F_MORE.
> 
> We need to call tcp_push() at the end of the last page processed in
> tcp_sendpages(), or else transmits can be deferred and future sends
> stall.
> 
> Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE,
> but with different semantics.
> 
> For all sendpage() providers, it's a transparent change. Only
> sock_sendpage() and tcp_sendpages() can differentiate the two different
> flags provided by pipe_to_sendpage().
> 
> Reported-by: Tom Herbert <therbert@google.com>
> Cc: Nandita Dukkipati <nanditad@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: H.K. Jerry Chu <hkchu@google.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Mahesh Bandewar <maheshb@google.com>
> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.


* Re: [PATCH] tcp: allow splice() to build full TSO packets
  2012-04-05 23:05         ` David Miller
@ 2012-04-06  1:59           ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2012-04-06  1:59 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ncardwell, therbert, ycheng, hkchu, maze, maheshb,
	ilpo.jarvinen, nanditad

On Thu, 2012-04-05 at 19:05 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 05 Apr 2012 15:05:35 +0200
> 
> > Hmm, thinking again about this, I did more tests and it appears we need
> > to differentiate the SPLICE_F_MORE flag (a user request) from the
> > internal marker provided by splice logic (handling a batch of pages).
> > 
> > A program doing splice(... SPLICE_F_MORE) should really call tcp_push()
> > at the end of its work.
> 
> This is the kind of problem I was hoping we weren't introducing
> when I asked about sendfile() et al. the other day :-)

Yes, sorry for this.

Yet sendfile() did not have this problem (or so I believe); only the
splice(SPLICE_F_MORE) case did.

Thanks


end of thread, other threads:[~2012-04-06  1:59 UTC | newest]

Thread overview: 10+ messages
2012-04-03 19:37 [PATCH] tcp: allow splice() to build full TSO packets Eric Dumazet
2012-04-03 21:21 ` David Miller
2012-04-03 21:31   ` Eric Dumazet
2012-04-03 21:36     ` David Miller
2012-04-03 22:50       ` David Miller
2012-04-04  6:15         ` Eric Dumazet
2012-04-04  6:20           ` David Miller
2012-04-05 13:05       ` Eric Dumazet
2012-04-05 23:05         ` David Miller
2012-04-06  1:59           ` Eric Dumazet
