linux-kernel.vger.kernel.org archive mirror
* [RFC] [TCP 0/3] Receive from socket into bio without copying
@ 2012-06-29 14:53 Andreas Gruenbacher
  2012-06-29 15:08 ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Gruenbacher @ 2012-06-29 14:53 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: Herbert Xu, David S. Miller

Hello,

I'm (still) trying to pass data from the network to the block layer without
copying. The block layer needs blocks to be contiguous in memory, and may have
some alignment restrictions as well.  A lot of modern network hardware will
receive large packets into separate buffers, so individual large packets will
end up in contiguous, aligned buffers.  I would like to make use of that, but
tcp currently doesn't allow me to control what ends up in which packets.

This patch series introduces a new flag for indicating to tcp when it should
start a new segment. Using that on the sender side, I can get data over the
network with no cpu copying at all.

[My last posting on this topic from May 8 is archived here:
 http://www.spinics.net/lists/netdev/msg197788.html ]

Thanks,
Andreas

Andreas Gruenbacher (3):
  tcp: Add MSG_NEW_PACKET flag to indicate preferable packet boundaries
  tcp: Zero-copy receive from a socket into a bio
  fs: Export bio_release_pages()

 fs/bio.c               |    3 +-
 include/linux/bio.h    |    1 +
 include/linux/socket.h |    1 +
 include/net/tcp.h      |    3 +
 net/ipv4/Makefile      |    3 +-
 net/ipv4/tcp.c         |    5 +-
 net/ipv4/tcp_recvbio.c |  168 ++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 180 insertions(+), 4 deletions(-)
 create mode 100644 net/ipv4/tcp_recvbio.c

-- 
1.7.10.2


* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
  2012-06-29 14:53 [RFC] [TCP 0/3] Receive from socket into bio without copying Andreas Gruenbacher
@ 2012-06-29 15:08 ` Eric Dumazet
  2012-07-02 11:45   ` Andreas Gruenbacher
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2012-06-29 15:08 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Fri, 2012-06-29 at 16:53 +0200, Andreas Gruenbacher wrote:
> Hello,
> 
> I'm (still) trying to pass data from the network to the block layer without
> copying. The block layer needs blocks to be contiguous in memory, and may have
> some alignment restrictions as well.  A lot of modern network hardware will
> receive large packets into separate buffers, so individual large packets will
> end up in contiguous, aligned buffers.  I would like to make use of that, but
> tcp currently doesn't allow me to control what ends up in which packets.
> 
> This patch series introduces a new flag for indicating to tcp when it should
> start a new segment. Using that on the sender side, I can get data over the
> network with no cpu copying at all.
> 
> [My last posting on this topic from May 8 is archived here:
>  http://www.spinics.net/lists/netdev/msg197788.html ]
> 
> Thanks,
> Andreas
> 
> Andreas Gruenbacher (3):
>   tcp: Add MSG_NEW_PACKET flag to indicate preferable packet boundaries
>   tcp: Zero-copy receive from a socket into a bio
>   fs: Export bio_release_pages()

This looks like yet another zero copy, needing another couple of hundred
lines.

Why doesn't the splice infrastructure fit your needs?




* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
  2012-06-29 15:08 ` Eric Dumazet
@ 2012-07-02 11:45   ` Andreas Gruenbacher
  2012-07-02 12:36     ` Eric Dumazet
  2012-07-02 13:39     ` saeed bishara
  0 siblings, 2 replies; 11+ messages in thread
From: Andreas Gruenbacher @ 2012-07-02 11:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Fri, 2012-06-29 at 17:08 +0200, Eric Dumazet wrote:
> This looks like yet another zero copy, needing another couple of hundred
> lines.

Kind of, yes.  We really want to make no copies at all though; the cpu
just passes buffers from one device to the other.

> Why doesn't the splice infrastructure fit your needs?

The pipe api that splice is based on saves a copy between the kernel and
user space, but it currently writes to files, going through the page
cache.  For that, the alignment of data in the network receive buffers
doesn't matter.

We want to go directly to the block layer instead.  This requires that
the network hardware receives the data into sector aligned buffers.
Hence the proposed MSG_NEW_PACKET flag.

With that, it might be possible to implement a pipe "sink" that goes to
a bio instead of writing to a file.  Going through the pipe
infrastructure doesn't actually help in this case though, it's just
overhead.

Thanks,
Andreas


* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
  2012-07-02 11:45   ` Andreas Gruenbacher
@ 2012-07-02 12:36     ` Eric Dumazet
  2012-07-02 13:02       ` Andreas Gruenbacher
  2012-07-02 13:39     ` saeed bishara
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2012-07-02 12:36 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, 2012-07-02 at 13:45 +0200, Andreas Gruenbacher wrote:
> On Fri, 2012-06-29 at 17:08 +0200, Eric Dumazet wrote:
> > This looks like yet another zero copy, needing another couple of hundred
> > lines.
> 
> Kind of, yes.  We really want to make no copies at all though; the cpu
> just passes buffers from one device to the other.
> 
> > Why doesn't the splice infrastructure fit your needs?
> 
> The pipe api that splice is based on saves a copy between the kernel and
> user space, but it currently writes to files, going through the page
> cache.  For that, the alignment of data in the network receive buffers
> doesn't matter.
> 

No files or page cache are needed for splice() usage, for example from
socket to another socket.

It just works (check haproxy for an example), with 10Gb performance out
of the box.

The pipe is only a container for buffers, in case the data fetched from
producer cannot be fully sent to consumer. You don't want to lose this
data.
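
A minimal sketch of that pattern, assuming Linux and glibc's splice()
wrapper. relay_once() is a hypothetical helper name chosen for
illustration; it is not taken from haproxy or from this patch series.
In the socket-to-socket case both descriptors would be sockets, and
anything the consumer cannot take yet simply stays in the pipe:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move up to `len` bytes from fd_in to fd_out through the pipe `pfd`.
 * Returns the number of bytes forwarded, 0 on EOF, or -1 on error.
 * On error, data already spliced into the pipe remains queued there.
 */
static ssize_t relay_once(int fd_in, int fd_out, int pfd[2], size_t len)
{
	ssize_t in, out, left;

	assert(len > 0);

	/* producer -> pipe: only page references are queued, no data copy */
	in = splice(fd_in, NULL, pfd[1], NULL, len,
		    SPLICE_F_MOVE | SPLICE_F_MORE);
	if (in <= 0)
		return in;

	/* pipe -> consumer: loop, since the consumer may accept less */
	for (left = in; left > 0; left -= out) {
		out = splice(pfd[0], NULL, fd_out, NULL, (size_t)left,
			     SPLICE_F_MOVE | SPLICE_F_MORE);
		if (out <= 0)
			return -1;
	}
	return in;
}
```

The sketch leaves out EAGAIN handling for non-blocking descriptors; a
real caller would keep the pipe around and retry the second splice()
later instead of giving up.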


> We want to go directly to the block layer instead.  This requires that
> the network hardware receives the data into sector aligned buffers.
> Hence the proposed MSG_NEW_PACKET flag.
> 

This is only a hint that something is wrong with the approach.

> With that, it might be possible to implement a pipe "sink" that goes to
> a bio instead of writing to a file.  Going through the pipe
> infrastructure doesn't actually help in this case though, it's just
> overhead.

There is no expensive overhead in the splice() infrastructure, only some
small details that should eventually be solved instead of designing a
new zero copy mode.

You didn't actually try splice() if you believe a regular file is
needed.

You only need proper splice() support (from pipe to bio), if not already
there.




* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
  2012-07-02 12:36     ` Eric Dumazet
@ 2012-07-02 13:02       ` Andreas Gruenbacher
  2012-07-02 13:54         ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Gruenbacher @ 2012-07-02 13:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, 2012-07-02 at 14:36 +0200, Eric Dumazet wrote:
> No files or page cache are needed for splice() usage, for example from
> socket to another socket.
> 
> It just works (check haproxy for an example), with 10Gb performance out
> of the box.

bio_vec's have some alignment requirements that must be met, and
anything that doesn't meet those requirements can't be passed to the
block layer (without copying it first).  Additional layers between the
network and block layers, like a pipe, won't make that problem go away.

> The pipe is only a container for buffers, in case the data fetched from
> producer cannot be fully sent to consumer. You don't want to lose this
> data.

Stuff that isn't pulled out of a socket receive buffer will stay there,
it won't magically be lost.

> > We want to go directly to the block layer instead.  This requires that
> > the network hardware receives the data into sector aligned buffers.
> > Hence the proposed MSG_NEW_PACKET flag.
> 
> This is only a hint that something is wrong with the approach.

It just means that I'm trying to do something that isn't currently
supported.

> You only need proper splice() support (from pipe to bio), if not already
> there.

It's not already there; it requires the alignment issue to be addressed
first.

Andreas


* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
  2012-07-02 11:45   ` Andreas Gruenbacher
  2012-07-02 12:36     ` Eric Dumazet
@ 2012-07-02 13:39     ` saeed bishara
  1 sibling, 0 replies; 11+ messages in thread
From: saeed bishara @ 2012-07-02 13:39 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Eric Dumazet, netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, Jul 2, 2012 at 2:45 PM, Andreas Gruenbacher <agruen@linbit.com> wrote:

>
> We want to go directly to the block layer instead.  This requires that
> the network hardware receives the data into sector aligned buffers.
> Hence the proposed MSG_NEW_PACKET flag.
Andreas,
 I didn't read your patches, but what you are looking for can't be
achieved using "normal" NICs; for that you need to use an RDMA protocol
with hardware that supports RDMA.

saeed

* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
  2012-07-02 13:02       ` Andreas Gruenbacher
@ 2012-07-02 13:54         ` Eric Dumazet
  2012-07-02 16:06           ` Andreas Gruenbacher
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2012-07-02 13:54 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, 2012-07-02 at 15:02 +0200, Andreas Gruenbacher wrote:

> bio_vec's have some alignment requirements that must be met, and
> anything that doesn't meet those requirements can't be passed to the
> block layer (without copying it first).  Additional layers between the
> network and block layers, like a pipe, won't make that problem go away.
> 

What exactly are the "some alignment requirements", and how exactly do
you use TCP to meet them? (MSS = multiple of 512?)

I believe you are trying to escape from the real problem.

If the NIC driver provides non-aligned data, neither splice() nor your
new stuff will magically align it. You _need_ a copy in either case.

If the NIC driver provides aligned data, splice(socket -> pipe) will keep
this alignment for you at 0 cost.

> It's not already there; it requires the alignment issue to be addressed
> first.


There is no guarantee that TCP payload is aligned to a bio, ever, in the
Linux ethernet/ip/tcp stack.

Really, your patches work for you by pure luck, because you use one
particular NIC driver that happens to prepare things for you (presumably
doing header split). Nothing guarantees this won't change, even for the
same hardware, in linux-3.8.

So I will just say no to your patches, unless you demonstrate the
splice() problems, and how you can fix the alignment problem in a new
layer instead of in the existing zero copy standard one.




* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
  2012-07-02 13:54         ` Eric Dumazet
@ 2012-07-02 16:06           ` Andreas Gruenbacher
  2012-07-02 19:41             ` chetan loke
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Gruenbacher @ 2012-07-02 16:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, 2012-07-02 at 15:54 +0200, Eric Dumazet wrote:
> On Mon, 2012-07-02 at 15:02 +0200, Andreas Gruenbacher wrote:
> > bio_vec's have some alignment requirements that must be met, and
> > anything that doesn't meet those requirements can't be passed to the
> > block layer (without copying it first). Additional layers between
> > the
> > network and block layers, like a pipe, won't make that problem go
> > away.
> >
> 
> What exactly are the "some alignment requirements", and how exactly do
> you use TCP to meet them? (MSS = multiple of 512?)

Sectors of 512 bytes must be contiguous; some devices have additional
requirements (like 4k sectors).  I'm not sure if sectors always need to be
aligned, but if buffers are allocated page wise and handed out as half /
full pages, you get that automatically.
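
To make the constraint concrete, here is a minimal sketch of the check a
receive path would have to pass before a page fragment could go to the
block layer without copying. frag_is_sector_aligned() is a hypothetical
helper, not a kernel function, and it only models the "starts and ends
on a sector boundary" rule described above; real devices may impose
further limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if a fragment that starts `offset` bytes into a page and
 * is `len` bytes long covers only whole, sector-aligned sectors.
 * `sector_size` must be a power of two (512 or 4096, for example).
 */
static bool frag_is_sector_aligned(size_t offset, size_t len,
				   size_t sector_size)
{
	size_t mask;

	assert(sector_size != 0 && (sector_size & (sector_size - 1)) == 0);
	mask = sector_size - 1;

	return len != 0 && (offset & mask) == 0 && (len & mask) == 0;
}
```

A half page at offset 2048 passes for 512-byte sectors but not for 4k
sectors, and a fragment that starts right after in-line packet headers
(say at offset 66) fails either way.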

> I believe you are trying to escape from the real problem.
> 
> If the NIC driver provides non-aligned data, neither splice() nor your
> new stuff will magically align it. You _need_ a copy in either case.

Yes, the NIC must provide aligned data.  A prerequisite for that is that the
NIC knows how to align things.  With no knowledge of the application protocol,
the NIC can only use the packet boundaries as hints.  I'm trying to get tcp
to start new packets at specific points in the protocol so that the packet
boundaries will coincide with alignment boundaries.  With that, NICs that do
header splitting can receive packets into appropriately aligned buffers, and
the problem is solved.

> If the NIC driver provides aligned data, splice(socket -> pipe) will keep
> this alignment for you at 0 cost.

Yes of course.  That is not the real issue here though.

> > It's not already there; it requires the alignment issue to be
> > addressed first.
> 
> There is no guarantee that TCP payload is aligned to a bio, ever, in the
> Linux ethernet/ip/tcp stack.
> 
> Really, your patches work for you by pure luck, because you use one
> particular NIC driver that happens to prepare things for you
> (presumably doing header split). Nothing guarantees this won't change,
> even for the same hardware, in linux-3.8.

NICs with header splitting are common enough that you don't have to resort
to pure luck to get one.

> So I will just say no to your patches, unless you demonstrate the
> splice() problems, and how you can fix the alignment problem in a new
> layer instead of in the existing zero copy standard one.

Again, splice or not is not the issue here. It does not, by itself, allow zero
copy from the network directly to disk but it could likely be made to support
that if we can get the alignment right first.  The proposed MSG_NEW_PACKET flag
helps with that, but maybe someone has a better idea.

This doesn't have to work with arbitrary NICs and it most likely will hurt with
small MTUs, but then you can still choose not to use it.  It just has to almost
always work with some particular NICs and with large MTUs.
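
A rough sketch of the arithmetic behind the MTU remark. The header
budget is an assumption for illustration (IPv4 without options plus a
TCP header carrying the 12-byte timestamp option, 52 bytes in total),
and aligned_payload() is a hypothetical helper, not part of the patches:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Largest number of payload bytes per segment that is still a whole
 * multiple of `sector_size`, given the link MTU and the bytes consumed
 * by IP and TCP headers. Returns 0 if not even one sector fits.
 */
static size_t aligned_payload(size_t mtu, size_t hdr_bytes,
			      size_t sector_size)
{
	size_t payload;

	assert(sector_size > 0 && mtu > hdr_bytes);
	payload = mtu - hdr_bytes;

	return payload - payload % sector_size;
}
```

Under those assumptions a 1500-byte MTU leaves 1448 payload bytes, of
which only 1024 are usable with 512-byte sectors (424 bytes, about 29%,
go unused) and none with 4k sectors. A 9000-byte MTU leaves 8948 bytes,
of which 8704 are usable with 512-byte sectors (under 3% unused) or
8192 with 4k sectors.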

Andreas


* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
  2012-07-02 16:06           ` Andreas Gruenbacher
@ 2012-07-02 19:41             ` chetan loke
  2012-07-02 21:37               ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: chetan loke @ 2012-07-02 19:41 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Eric Dumazet, netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, Jul 2, 2012 at 12:06 PM, Andreas Gruenbacher <agruen@linbit.com> wrote:
> On Mon, 2012-07-02 at 15:54 +0200, Eric Dumazet wrote:
>> So I will just say no to your patches, unless you demonstrate the
>> splice() problems, and how you can fix the alignment problem in a new
>> layer instead of in the existing zero copy standard one.
>
> Again, splice or not is not the issue here. It does not, by itself, allow zero
> copy from the network directly to disk but it could likely be made to support
> that if we can get the alignment right first.  The proposed MSG_NEW_PACKET flag
> helps with that, but maybe someone has a better idea.
>

Eric - by using splice do you mean something like:

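#define PIPE_SIZE (64*1024)

int filedes[2];
ssize_t ret;

pipe(filedes);

/* socket -> pipe; a socket is not seekable, so off_in must be NULL */
ret = splice(sock_fd_from, NULL, filedes[1], NULL, PIPE_SIZE,
             SPLICE_F_MORE | SPLICE_F_MOVE);

/* pipe -> file; write what was just received at to_offset */
if (ret > 0)
        ret = splice(filedes[0], NULL, file_fd_to, &to_offset, ret,
                     SPLICE_F_MORE | SPLICE_F_MOVE);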


i.e. splice-in from socket to pipe, and splice-out from pipe to destination?

Andreas - if the above assumption is true, then can you apply
'MSG_NEW_PACKET' on the sender and see if the above pseudo-splice code
achieves something similar to what you expect on the receive side? (You
can also play w/ F_SETPIPE_SZ, although I found very little
reduction in CPU usage.) Note: in my personal experience, using splice
from an input-file-A to output-file-B bought very minimal cpu
reduction (yes, both the files used O_DIRECT). Instead, a simple
read/write w/ O_DIRECT from file-A to file-B was much, much faster.


Chetan

* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
  2012-07-02 19:41             ` chetan loke
@ 2012-07-02 21:37               ` Eric Dumazet
  2012-07-03  0:02                 ` Willy Tarreau
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2012-07-02 21:37 UTC (permalink / raw)
  To: chetan loke
  Cc: Andreas Gruenbacher, netdev, linux-kernel, Herbert Xu,
	David S. Miller, Willy Tarreau

On Mon, 2012-07-02 at 15:41 -0400, chetan loke wrote:
> On Mon, Jul 2, 2012 at 12:06 PM, Andreas Gruenbacher <agruen@linbit.com> wrote:
> > On Mon, 2012-07-02 at 15:54 +0200, Eric Dumazet wrote:
> >> So I will just say no to your patches, unless you demonstrate the
> >> splice() problems, and how you can fix the alignment problem in a new
> >> layer instead of in the existing zero copy standard one.
> >
> > Again, splice or not is not the issue here. It does not, by itself, allow zero
> > copy from the network directly to disk but it could likely be made to support
> > that if we can get the alignment right first.  The proposed MSG_NEW_PACKET flag
> > helps with that, but maybe someone has a better idea.
> >
> 
> Eric - by using splice do you mean something like:
> 
> int filedes[2];
> PIPE_SIZE (64*1024)
> pipe(filedes);
> ret = splice (sock_fd_from, &from_offset, filedes [1], NULL, PIPE_SIZE,
>                      SPLICE_F_MORE | SPLICE_F_MOVE);
> 
> 
> ret = splice (filedes [0], NULL, file_fd_to,
>                          &to_offset, ret,
>                          SPLICE_F_MORE | SPLICE_F_MOVE);
> 

Yes, that's more or less the plan. You can also play with a bigger
PIPE_SIZE if needed.

> 
> i.e. splice-in from socket to pipe, and splice-out from pipe to destination?
> 
> Andreas - if the above assumption is true then can you apply the
> 'MSG_NEW_PACKET' on the sender and see if the above pseudo-splice code
> achieves something similar to what you expect on the receive side(you
> can also play w/ F_SETPIPE_SZ -  although I found very little
> reduction in CPU usage)? Note: My personal experience - using splice
> from an input-file-A to output-file-B bought very minimal cpu
> reduction(yes, both the files used O_DIRECT). Instead, a simple
> read/write w/ O_DIRECT from file-A to file-B was much much faster.

splice() performance from socket to pipe has improved a lot in
linux-3.5.

It was not true zero copy until very recent patches.
(It was zero copy only on a certain class of NIC, not on the ones found
on appliances or cheap platforms.)

Willy Tarreau mentioned a nice boost of performance with haproxy.

Willy wanted to work on a direct splice from socket to socket, but
I am not sure it'll bring major speed improvement.




* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
  2012-07-02 21:37               ` Eric Dumazet
@ 2012-07-03  0:02                 ` Willy Tarreau
  0 siblings, 0 replies; 11+ messages in thread
From: Willy Tarreau @ 2012-07-03  0:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: chetan loke, Andreas Gruenbacher, netdev, linux-kernel,
	Herbert Xu, David S. Miller

Hi Eric,

On Mon, Jul 02, 2012 at 11:37:04PM +0200, Eric Dumazet wrote:
> On Mon, 2012-07-02 at 15:41 -0400, chetan loke wrote:
> > On Mon, Jul 2, 2012 at 12:06 PM, Andreas Gruenbacher <agruen@linbit.com> wrote:
> > > On Mon, 2012-07-02 at 15:54 +0200, Eric Dumazet wrote:
> > >> So I will just say no to your patches, unless you demonstrate the
> > >> splice() problems, and how you can fix the alignment problem in a new
> > >> layer instead of in the existing zero copy standard one.
> > >
> > > Again, splice or not is not the issue here. It does not, by itself, allow zero
> > > copy from the network directly to disk but it could likely be made to support
> > > that if we can get the alignment right first.  The proposed MSG_NEW_PACKET flag
> > > helps with that, but maybe someone has a better idea.
> > >
> > 
> > Eric - by using splice do you mean something like:
> > 
> > int filedes[2];
> > PIPE_SIZE (64*1024)
> > pipe(filedes);
> > ret = splice (sock_fd_from, &from_offset, filedes [1], NULL, PIPE_SIZE,
> >                      SPLICE_F_MORE | SPLICE_F_MOVE);
> > 
> > 
> > ret = splice (filedes [0], NULL, file_fd_to,
> >                          &to_offset, ret,
> >                          SPLICE_F_MORE | SPLICE_F_MOVE);
> > 
> 
> Yes, that's more or less the plan. You can also play with a bigger
> PIPE_SIZE if needed.

I confirm, this is recommended at high bit rates if you're working with
large windows.
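
For reference, a minimal sketch of enlarging the pipe with the
F_SETPIPE_SZ fcntl mentioned earlier in the thread, assuming Linux
2.6.35 or later. grow_pipe() is a hypothetical helper name, and the
kernel may round the request up or refuse it if it exceeds
/proc/sys/fs/pipe-max-size for an unprivileged process:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to resize the pipe behind `pipe_fd` to at least
 * `want_bytes`. Returns the resulting capacity in bytes, or -1 on error.
 */
static int grow_pipe(int pipe_fd, int want_bytes)
{
	assert(want_bytes > 0);

	if (fcntl(pipe_fd, F_SETPIPE_SZ, want_bytes) < 0)
		return -1;

	/* read back the capacity the kernel actually chose */
	return fcntl(pipe_fd, F_GETPIPE_SZ);
}
```

Either end of the pipe can be passed in; the capacity belongs to the
pipe itself rather than to one descriptor.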

> > i.e. splice-in from socket to pipe, and splice-out from pipe to destination?
> > 
> > Andreas - if the above assumption is true then can you apply the
> > 'MSG_NEW_PACKET' on the sender and see if the above pseudo-splice code
> > achieves something similar to what you expect on the receive side(you
> > can also play w/ F_SETPIPE_SZ -  although I found very little
> > reduction in CPU usage)? Note: My personal experience - using splice
> > from an input-file-A to output-file-B bought very minimal cpu
> > reduction(yes, both the files used O_DIRECT). Instead, a simple
> > read/write w/ O_DIRECT from file-A to file-B was much much faster.
> 
> splice() performance from socket to pipe has improved a lot in
> linux-3.5.
> 
> It was not true zero copy until very recent patches.

In fact it was true zero copy in 2.6.25, until we faced a large
amount of data corruption and zero copy was disabled in 2.6.25.X.
It remained that way until you brought your patches to reinstate it.

> (It was zero copy only on a certain class of NIC, not on the ones found
> on appliances or cheap platforms.)
> 
> Willy Tarreau mentioned a nice boost of performance with haproxy.

Yes definitely. The savings are more noticeable on small systems where
memory bandwidth is limited. On a small ARM system bound by RAM bandwidth,
the performance was basically doubled. But I also observed nice savings
on a core2duo equipped with 2 myricom 10Gig NICs forwarding at line rate.

> Willy wanted to work on a direct splice from socket to socket, but
> I am not sure it'll bring major speed improvement.

I'm not sure at all either, I'm betting a few percent saved from the
reduction of syscalls, not much more. This is why I'll probably check
this when I have enough time to kill.

Regards,
Willy

