* [RFC] [TCP 0/3] Receive from socket into bio without copying
From: Andreas Gruenbacher @ 2012-06-29 14:53 UTC
To: netdev, linux-kernel; +Cc: Herbert Xu, David S. Miller

Hello,

I'm (still) trying to pass data from the network to the block layer without
copying. The block layer needs blocks to be contiguous in memory, and may have
some alignment restrictions as well. A lot of modern network hardware will
receive large packets into separate buffers, so individual large packets will
end up in contiguous, aligned buffers. I would like to make use of that, but
tcp currently doesn't allow me to control what ends up in which packets.

This patch series introduces a new flag for indicating to tcp when it should
start a new segment. Using that on the sender side, I can get data over the
network with no cpu copying at all.

[My last posting on this topic from May 8 is archived here:
 http://www.spinics.net/lists/netdev/msg197788.html ]

Thanks,
Andreas

Andreas Gruenbacher (3):
  tcp: Add MSG_NEW_PACKET flag to indicate preferable packet boundaries
  tcp: Zero-copy receive from a socket into a bio
  fs: Export bio_release_pages()

 fs/bio.c               |    3 +-
 include/linux/bio.h    |    1 +
 include/linux/socket.h |    1 +
 include/net/tcp.h      |    3 +
 net/ipv4/Makefile      |    3 +-
 net/ipv4/tcp.c         |    5 +-
 net/ipv4/tcp_recvbio.c |  168 ++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 180 insertions(+), 4 deletions(-)
 create mode 100644 net/ipv4/tcp_recvbio.c

-- 
1.7.10.2

* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
From: Eric Dumazet @ 2012-06-29 15:08 UTC
To: Andreas Gruenbacher; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Fri, 2012-06-29 at 16:53 +0200, Andreas Gruenbacher wrote:
> Hello,
>
> I'm (still) trying to pass data from the network to the block layer without
> copying. The block layer needs blocks to be contiguous in memory, and may have
> some alignment restrictions as well. A lot of modern network hardware will
> receive large packets into separate buffers, so individual large packets will
> end up in contiguous, aligned buffers. I would like to make use of that, but
> tcp currently doesn't allow me to control what ends up in which packets.
>
> This patch series introduces a new flag for indicating to tcp when it should
> start a new segment. Using that on the sender side, I can get data over the
> network with no cpu copying at all.
>
> [My last posting on this topic from May 8 is archived here:
>  http://www.spinics.net/lists/netdev/msg197788.html ]
>
> Thanks,
> Andreas
>
> Andreas Gruenbacher (3):
>   tcp: Add MSG_NEW_PACKET flag to indicate preferable packet boundaries
>   tcp: Zero-copy receive from a socket into a bio
>   fs: Export bio_release_pages()

This looks like yet another zero copy, needing another couple of hundred
lines.

Why doesn't the splice infrastructure fit your needs?

* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
From: Andreas Gruenbacher @ 2012-07-02 11:45 UTC
To: Eric Dumazet; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Fri, 2012-06-29 at 17:08 +0200, Eric Dumazet wrote:
> This looks like yet another zero copy, needing another couple of hundred
> lines.

Kind of, yes. We really want to make no copies at all though; the cpu
just passes buffers from one device to the other.

> Why doesn't the splice infrastructure fit your needs?

The pipe api that splice is based on saves a copy between the kernel and
user space, but it currently writes to files, going through the page
cache. For that, the alignment of data in the network receive buffers
doesn't matter.

We want to go directly to the block layer instead. This requires that
the network hardware receives the data into sector aligned buffers.
Hence the proposed MSG_NEW_PACKET flag.

With that, it might be possible to implement a pipe "sink" that goes to
a bio instead of writing to a file. Going through the pipe
infrastructure doesn't actually help in this case though, it's just
overhead.

Thanks,
Andreas

* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
From: Eric Dumazet @ 2012-07-02 12:36 UTC
To: Andreas Gruenbacher; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, 2012-07-02 at 13:45 +0200, Andreas Gruenbacher wrote:
> On Fri, 2012-06-29 at 17:08 +0200, Eric Dumazet wrote:
> > This looks like yet another zero copy, needing another couple of hundred
> > lines.
>
> Kind of, yes. We really want to make no copies at all though; the cpu
> just passes buffers from one device to the other.
>
> > Why doesn't the splice infrastructure fit your needs?
>
> The pipe api that splice is based on saves a copy between the kernel and
> user space, but it currently writes to files, going through the page
> cache. For that, the alignment of data in the network receive buffers
> doesn't matter.
>

No files or page cache are needed for splice() usage, for example from
socket to another socket.

It just works (check haproxy for an example), with 10Gb performance out
of the box.

The pipe is only a container for buffers, in case the data fetched from
producer cannot be fully sent to consumer. You don't want to lose this
data.

> We want to go directly to the block layer instead. This requires that
> the network hardware receives the data into sector aligned buffers.
> Hence the proposed MSG_NEW_PACKET flag.
>

This is only a hint that something is wrong with the approach.

> With that, it might be possible to implement a pipe "sink" that goes to
> a bio instead of writing to a file. Going through the pipe
> infrastructure doesn't actually help in this case though, it's just
> overhead.

There is no expensive overhead in splice() infrastructure, only some
small details that should eventually be solved instead of designing a
new zero copy mode.

You didn't actually try splice() if you believe a regular file is
needed.

You only need proper splice() support (from pipe to bio), if not already
there.

* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
From: Andreas Gruenbacher @ 2012-07-02 13:02 UTC
To: Eric Dumazet; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, 2012-07-02 at 14:36 +0200, Eric Dumazet wrote:
> No files or page cache are needed for splice() usage, for example from
> socket to another socket.
>
> It just works (check haproxy for an example), with 10Gb performance out
> of the box.

bio_vec's have some alignment requirements that must be met, and
anything that doesn't meet those requirements can't be passed to the
block layer (without copying it first). Additional layers between the
network and block layers, like a pipe, won't make that problem go away.

> The pipe is only a container for buffers, in case the data fetched from
> producer cannot be fully sent to consumer. You don't want to lose this
> data.

Stuff that isn't pulled out of a socket receive buffer will stay there,
it won't magically be lost.

> > We want to go directly to the block layer instead. This requires that
> > the network hardware receives the data into sector aligned buffers.
> > Hence the proposed MSG_NEW_PACKET flag.
>
> This is only a hint that something is wrong with the approach.

It just means that I'm trying to do something that isn't currently
supported.

> You only need proper splice() support (from pipe to bio), if not already
> there.

It's not already there, it requires the alignment issue to be addressed
first.

Andreas

* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
From: Eric Dumazet @ 2012-07-02 13:54 UTC
To: Andreas Gruenbacher; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, 2012-07-02 at 15:02 +0200, Andreas Gruenbacher wrote:
> bio_vec's have some alignment requirements that must be met, and
> anything that doesn't meet those requirements can't be passed to the
> block layer (without copying it first). Additional layers between the
> network and block layers, like a pipe, won't make that problem go away.
>

What are the "some alignment requirements" exactly, and how do you use
TCP exactly to meet them? (MSS = multiple of 512?)

I believe you are trying to escape from the real problem.

If the NIC driver provides non aligned data, neither splice() nor your
new stuff will magically align it. You _need_ a copy in either case.

If NIC driver provides aligned data, splice(socket -> pipe) will keep
this alignment for you at 0 cost.

> It's not already there, it requires the alignment issue to be addressed
> first.

There is no guarantee TCP payload is aligned to a bio, ever, in linux
ethernet/ip/tcp stack.

Really, your patches work for you, by pure luck, because you use one
particular NIC driver that happens to prepare things for you
(presumably doing header split). Nothing guarantees this won't change
even for the same hardware in linux-3.8

So I will just say no to your patches, unless you demonstrate the
splice() problems, and how you can fix the alignment problem in a new
layer instead of in the existing zero copy standard one.

* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
From: Andreas Gruenbacher @ 2012-07-02 16:06 UTC
To: Eric Dumazet; +Cc: netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, 2012-07-02 at 15:54 +0200, Eric Dumazet wrote:
> On Mon, 2012-07-02 at 15:02 +0200, Andreas Gruenbacher wrote:
> > bio_vec's have some alignment requirements that must be met, and
> > anything that doesn't meet those requirements can't be passed to the
> > block layer (without copying it first). Additional layers between the
> > network and block layers, like a pipe, won't make that problem go
> > away.
> >
>
> What are the "some alignment requirements" exactly, and how do you use
> TCP exactly to meet them? (MSS = multiple of 512?)

Sectors of 512 bytes must be contiguous; some devices have additional
requirements (like 4k sectors). I'm not sure if sectors always need to
be aligned, but if buffers are allocated page wise and handed out as
half / full pages, you get that automatically.

> I believe you are trying to escape from the real problem.
>
> If the NIC driver provides non aligned data, neither splice() nor your
> new stuff will magically align it. You _need_ a copy in either case.

Yes, the NIC must provide aligned data. A prerequisite for that is that
the NIC knows how to align things. With no knowledge of the application
protocol, the NIC can only use the packet boundaries as hints. I'm
trying to get tcp to start new packets at specific points in the
protocol so that the packet boundaries will coincide with alignment
boundaries. With that, NICs that do header splitting can receive
packets into appropriately aligned buffers, and the problem is solved.

> If NIC driver provides aligned data, splice(socket -> pipe) will keep
> this alignment for you at 0 cost.

Yes of course. That is not the real issue here though.

> > It's not already there, it requires the alignment issue to be
> > addressed first.
>
> There is no guarantee TCP payload is aligned to a bio, ever, in linux
> ethernet/ip/tcp stack.
>
> Really, your patches work for you, by pure luck, because you use one
> particular NIC driver that happens to prepare things for you
> (presumably doing header split). Nothing guarantees this won't change
> even for the same hardware in linux-3.8

NICs with header splitting are common enough that you don't have to
resort to pure luck to get one.

> So I will just say no to your patches, unless you demonstrate the
> splice() problems, and how you can fix the alignment problem in a new
> layer instead of in the existing zero copy standard one.

Again, splice or not is not the issue here. It does not, by itself,
allow zero copy from the network directly to disk but it could likely be
made to support that if we can get the alignment right first. The
proposed MSG_NEW_PACKET flag helps with that, but maybe someone has a
better idea.

This doesn't have to work with arbitrary NICs and it most likely will
hurt with small MTUs, but then you can still choose not to use it. It
just has to almost always work with some particular NICs and with large
MTUs.

Andreas

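The alignment constraint discussed in the message above can be made concrete with a small check. This is only an illustrative sketch: `struct rx_segment` and `segment_is_sector_aligned()` are hypothetical names, not kernel APIs, and the rule it encodes (both the offset within the page and the length must be multiples of the sector size) is an assumption that is stricter than what the thread confirms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one received payload fragment. */
struct rx_segment {
    size_t page_offset; /* offset of the payload within its page */
    size_t length;      /* number of payload bytes in the fragment */
};

/*
 * Assumed rule: a fragment can go to the block layer without a copy
 * only if it starts on a sector boundary and covers whole sectors.
 */
bool segment_is_sector_aligned(const struct rx_segment *seg, size_t sector_size)
{
    assert(seg != NULL);
    if (sector_size == 0 || seg->length == 0)
        return false;
    return seg->page_offset % sector_size == 0 &&
           seg->length % sector_size == 0;
}
```

Under this assumption, a payload that follows the protocol headers in the same buffer (for example at offset 66) fails the check, which is why header splitting on the NIC matters to the proposal.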
* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
From: chetan loke @ 2012-07-02 19:41 UTC
To: Andreas Gruenbacher
Cc: Eric Dumazet, netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, Jul 2, 2012 at 12:06 PM, Andreas Gruenbacher <agruen@linbit.com> wrote:
> On Mon, 2012-07-02 at 15:54 +0200, Eric Dumazet wrote:
>> So I will just say no to your patches, unless you demonstrate the
>> splice() problems, and how you can fix the alignment problem in a new
>> layer instead of in the existing zero copy standard one.
>
> Again, splice or not is not the issue here. It does not, by itself, allow zero
> copy from the network directly to disk but it could likely be made to support
> that if we can get the alignment right first. The proposed MSG_NEW_PACKET flag
> helps with that, but maybe someone has a better idea.
>

Eric - by using splice do you mean something like:

    int filedes[2];
    PIPE_SIZE (64*1024)
    pipe(filedes);
    ret = splice (sock_fd_from, &from_offset, filedes [1], NULL, PIPE_SIZE,
                  SPLICE_F_MORE | SPLICE_F_MOVE);

    ret = splice (filedes [0], NULL, file_fd_to,
                  &to_offset, ret,
                  SPLICE_F_MORE | SPLICE_F_MOVE);

i.e. splice-in from socket to pipe, and splice-out from pipe to destination?

Andreas - if the above assumption is true then can you apply the
'MSG_NEW_PACKET' on the sender and see if the above pseudo-splice code
achieves something similar to what you expect on the receive side (you
can also play w/ F_SETPIPE_SZ - although I found very little
reduction in CPU usage)? Note: My personal experience - using splice
from an input-file-A to output-file-B bought very minimal cpu
reduction (yes, both the files used O_DIRECT). Instead, a simple
read/write w/ O_DIRECT from file-A to file-B was much much faster.

Chetan

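The pseudo-splice code in the message above can be turned into a compilable sketch along the following lines. It is not taken from the patches and is not claimed to be what any poster ran: `relay_socket_to_file()` and `CHUNK_SIZE` are hypothetical names, and the SO_TYPE check is an added safeguard. One deliberate difference from the pseudocode is that the socket side is called with a NULL offset, since splice() rejects an offset on a non-seekable descriptor.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_SIZE (64 * 1024) /* matches the default pipe capacity */

/* Close both pipe ends without clobbering errno. */
static void close_pipe(int pipefd[2])
{
    int saved = errno;
    close(pipefd[0]);
    close(pipefd[1]);
    errno = saved;
}

/*
 * Move up to len bytes from a stream socket into out_fd at *out_off,
 * using a pipe as the intermediate buffer. *out_off is advanced by
 * splice(). Returns the number of bytes moved, or -1 on error.
 */
ssize_t relay_socket_to_file(int sock_fd, int out_fd, loff_t *out_off, size_t len)
{
    int pipefd[2], type;
    socklen_t tlen = sizeof type;
    ssize_t total = 0;

    assert(out_off != NULL);
    /* Added safeguard: only accept stream sockets as the source. */
    if (getsockopt(sock_fd, SOL_SOCKET, SO_TYPE, &type, &tlen) < 0)
        return -1;
    if (type != SOCK_STREAM) {
        errno = EINVAL;
        return -1;
    }
    if (pipe(pipefd) < 0)
        return -1;

    while ((size_t)total < len) {
        size_t want = len - (size_t)total;
        if (want > CHUNK_SIZE)
            want = CHUNK_SIZE;

        /* Socket -> pipe. Sockets are not seekable, so off_in is NULL. */
        ssize_t in = splice(sock_fd, NULL, pipefd[1], NULL, want,
                            SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in < 0 && errno == EINTR)
            continue;
        if (in < 0) {
            close_pipe(pipefd);
            return -1;
        }
        if (in == 0)
            break; /* peer closed the connection */

        /* Pipe -> file. Drain everything queued; short writes can happen. */
        while (in > 0) {
            ssize_t out = splice(pipefd[0], NULL, out_fd, out_off,
                                 (size_t)in, SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out < 0 && errno == EINTR)
                continue;
            if (out <= 0) {
                close_pipe(pipefd);
                return -1;
            }
            in -= out;
            total += out;
        }
    }
    close_pipe(pipefd);
    return total;
}
```

Whether the data actually moves without a copy depends on the kernel version and the NIC driver, as the replies in this thread point out; the sketch only shows the call sequence.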
* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
From: Eric Dumazet @ 2012-07-02 21:37 UTC
To: chetan loke
Cc: Andreas Gruenbacher, netdev, linux-kernel, Herbert Xu, David S. Miller, Willy Tarreau

On Mon, 2012-07-02 at 15:41 -0400, chetan loke wrote:
> On Mon, Jul 2, 2012 at 12:06 PM, Andreas Gruenbacher <agruen@linbit.com> wrote:
> > On Mon, 2012-07-02 at 15:54 +0200, Eric Dumazet wrote:
> >> So I will just say no to your patches, unless you demonstrate the
> >> splice() problems, and how you can fix the alignment problem in a new
> >> layer instead of in the existing zero copy standard one.
> >
> > Again, splice or not is not the issue here. It does not, by itself, allow zero
> > copy from the network directly to disk but it could likely be made to support
> > that if we can get the alignment right first. The proposed MSG_NEW_PACKET flag
> > helps with that, but maybe someone has a better idea.
> >
>
> Eric - by using splice do you mean something like:
>
> int filedes[2];
> PIPE_SIZE (64*1024)
> pipe(filedes);
> ret = splice (sock_fd_from, &from_offset, filedes [1], NULL, PIPE_SIZE,
>               SPLICE_F_MORE | SPLICE_F_MOVE);
>
>
> ret = splice (filedes [0], NULL, file_fd_to,
>               &to_offset, ret,
>               SPLICE_F_MORE | SPLICE_F_MOVE);
>

Yes, that's more or less the plan. You can also play with a bigger
PIPE_SIZE if needed.

>
> i.e. splice-in from socket to pipe, and splice-out from pipe to destination?
>
> Andreas - if the above assumption is true then can you apply the
> 'MSG_NEW_PACKET' on the sender and see if the above pseudo-splice code
> achieves something similar to what you expect on the receive side (you
> can also play w/ F_SETPIPE_SZ - although I found very little
> reduction in CPU usage)? Note: My personal experience - using splice
> from an input-file-A to output-file-B bought very minimal cpu
> reduction (yes, both the files used O_DIRECT). Instead, a simple
> read/write w/ O_DIRECT from file-A to file-B was much much faster.

splice() performance from socket to pipe has improved a lot in
linux-3.5

It was not true zero copy, until very recent patches.

(It was zero copy only on certain class of NIC, not on the ones found
on appliances or cheap platforms)

Willy Tarreau mentioned a nice boost of performance with haproxy.

Willy wanted to work on a direct splice from socket to socket, but
I am not sure it'll bring major speed improvement.

* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
From: Willy Tarreau @ 2012-07-03 0:02 UTC
To: Eric Dumazet
Cc: chetan loke, Andreas Gruenbacher, netdev, linux-kernel, Herbert Xu, David S. Miller

Hi Eric,

On Mon, Jul 02, 2012 at 11:37:04PM +0200, Eric Dumazet wrote:
> On Mon, 2012-07-02 at 15:41 -0400, chetan loke wrote:
> > On Mon, Jul 2, 2012 at 12:06 PM, Andreas Gruenbacher <agruen@linbit.com> wrote:
> > > On Mon, 2012-07-02 at 15:54 +0200, Eric Dumazet wrote:
> > >> So I will just say no to your patches, unless you demonstrate the
> > >> splice() problems, and how you can fix the alignment problem in a new
> > >> layer instead of in the existing zero copy standard one.
> > >
> > > Again, splice or not is not the issue here. It does not, by itself, allow zero
> > > copy from the network directly to disk but it could likely be made to support
> > > that if we can get the alignment right first. The proposed MSG_NEW_PACKET flag
> > > helps with that, but maybe someone has a better idea.
> > >
> >
> > Eric - by using splice do you mean something like:
> >
> > int filedes[2];
> > PIPE_SIZE (64*1024)
> > pipe(filedes);
> > ret = splice (sock_fd_from, &from_offset, filedes [1], NULL, PIPE_SIZE,
> >               SPLICE_F_MORE | SPLICE_F_MOVE);
> >
> >
> > ret = splice (filedes [0], NULL, file_fd_to,
> >               &to_offset, ret,
> >               SPLICE_F_MORE | SPLICE_F_MOVE);
> >
>
> Yes, that's more or less the plan. You can also play with a bigger
> PIPE_SIZE if needed.

I confirm, this is recommended at high bit rates if you're working with
large windows.

> > i.e. splice-in from socket to pipe, and splice-out from pipe to destination?
> >
> > Andreas - if the above assumption is true then can you apply the
> > 'MSG_NEW_PACKET' on the sender and see if the above pseudo-splice code
> > achieves something similar to what you expect on the receive side (you
> > can also play w/ F_SETPIPE_SZ - although I found very little
> > reduction in CPU usage)? Note: My personal experience - using splice
> > from an input-file-A to output-file-B bought very minimal cpu
> > reduction (yes, both the files used O_DIRECT). Instead, a simple
> > read/write w/ O_DIRECT from file-A to file-B was much much faster.
>
> splice() performance from socket to pipe has improved a lot in
> linux-3.5
>
> It was not true zero copy, until very recent patches.

In fact it was true zero copy in 2.6.25 until we faced a large
amount of data corruption and the zero copy was disabled in 2.6.25.X.
Since then it remained that way until you brought your patches to
re-instantiate it.

> (It was zero copy only on certain class of NIC, not on the ones found
> on appliances or cheap platforms)
>
> Willy Tarreau mentioned a nice boost of performance with haproxy.

Yes definitely. The savings are more noticeable on small systems where
memory bandwidth is limited. On a small ARM system bound by RAM
bandwidth, the performance was basically doubled. But I also observed
nice savings on a core2duo equipped with 2 myricom 10Gig NICs forwarding
at line rate.

> Willy wanted to work on a direct splice from socket to socket, but
> I am not sure it'll bring major speed improvement.

I'm not sure at all either, I'm betting a few percent saved from the
reduction of syscalls, not much more. This is why I'll probably check
this when I have enough time to kill.

Regards,
Willy

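The advice about a bigger pipe, and the F_SETPIPE_SZ hint quoted above, can be sketched as follows. `grow_pipe()` is a hypothetical helper name; the fcntl() commands are the standard Linux ones. The kernel may round the request up, and unprivileged processes are capped by /proc/sys/fs/pipe-max-size, so the sketch reports the capacity actually granted rather than assuming the request was honoured exactly.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to resize the pipe behind pipe_fd to hold at least
 * `bytes`. Returns the capacity reported afterwards, or -1 on error
 * (for example EPERM when the request exceeds pipe-max-size).
 */
int grow_pipe(int pipe_fd, int bytes)
{
    assert(bytes > 0);
    if (fcntl(pipe_fd, F_SETPIPE_SZ, bytes) < 0)
        return -1;
    return fcntl(pipe_fd, F_GETPIPE_SZ);
}
```

A relay loop would call this once on either end of the pipe right after pipe(), then use the returned value as the length argument for each socket-to-pipe splice().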
* Re: [RFC] [TCP 0/3] Receive from socket into bio without copying
From: saeed bishara @ 2012-07-02 13:39 UTC
To: Andreas Gruenbacher
Cc: Eric Dumazet, netdev, linux-kernel, Herbert Xu, David S. Miller

On Mon, Jul 2, 2012 at 2:45 PM, Andreas Gruenbacher <agruen@linbit.com> wrote:
>
> We want to go directly to the block layer instead. This requires that
> the network hardware receives the data into sector aligned buffers.
> Hence the proposed MSG_NEW_PACKET flag.

Andreas,
I didn't read your patches, but what you are looking for can't be
achieved using "normal" NICs; for that you need to use an RDMA protocol
with hardware that supports RDMA.

saeed