netdev.vger.kernel.org archive mirror
* Fwd: Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections
@ 2012-10-18 10:19 Weiping Pan
  2012-10-18 12:23 ` Bruce Curtis
  0 siblings, 1 reply; 15+ messages in thread
From: Weiping Pan @ 2012-10-18 10:19 UTC (permalink / raw)
  To: open list:NETWORKING [GENERAL]

Sorry, I forgot to cc the list.

2012/9/18 Bruce "Brutus" Curtis<brutus@google.com>:
>  From: "Bruce \"Brutus\" Curtis"<brutus@google.com>
>
>  TCP/IP loopback socket pair stack bypass, based on an idea by, and a
>  rough upstream patch from, David Miller <davem@davemloft.net> called
>  "friends"; the data structure modifications and connection scheme are
>  reused, with extensive data-path changes.

Hi, Bruce,

I found a bug in the tcp friends patch:
when I kill netperf at random, a panic occurs in tcp_close().

BUG: unable to handle kernel NULL pointer dereference at 0000000d
IP: [<c0835a9b>] tcp_close+0x7b/0x3a0
*pde = 00000000
Oops: 0000 [#1] SMP
Modules linked in: fuse 8021q garp stp llc ip6t_REJECT
nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4
ip6table_filter xt_state nf_conntrack ip6_tables ppdev parport_pc
pcspkr i2c_piix4 i2c_core parport microcode e1000 uinput
Pid: 16627, comm: netperf Not tainted 3.6.0+ #25 innotek GmbH VirtualBox
EIP: 0060:[<c0835a9b>] EFLAGS: 00010202 CPU: 1
EIP is at tcp_close+0x7b/0x3a0
EAX: f6f41240 EBX: c2a46f40 ECX: 00000000 EDX: 00000001
ESI: 00000000 EDI: c2a46f88 EBP: c2a1fd8c ESP: c2a1fd78
  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 0000000d CR3: 00c07000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process netperf (pid: 16627, ti=c2a1e000 task=f44cd780 task.ti=c2a1e000)
Stack:
  c0b4d880 00000000 c2a46f40 ed821080 f4de7f00 c2a1fd9c c0857cff ed821080
  00000000 c2a1fdb0 c07e3db0 00000000 f6f07540 00000008 c2a1fdbc c07e4127
  f4de7f00 c2a1fdec c05378e8 00000001 00000000 00000000 f54bf010 ed82109c
Call Trace:
  [<c0857cff>] inet_release+0x5f/0x70
  [<c07e3db0>] sock_release+0x20/0x80
  [<c07e4127>] sock_close+0x17/0x30
  [<c05378e8>] __fput+0x98/0x1f0
  [<c0537a4d>] ____fput+0xd/0x10
  [<c04580f1>] task_work_run+0x91/0xb0
  [<c0441157>] do_exit+0x177/0x7f0
  [<c0422c97>] ? smp_reschedule_interrupt+0x27/0x30
  [<c0441a67>] do_group_exit+0x37/0xa0
  [<c044e989>] get_signal_to_deliver+0x1c9/0x5b0
  [<c0471393>] ? update_curr+0x213/0x380
  [<c0402bca>] do_signal+0x2a/0x980
  [<c04026c7>] ? __switch_to+0xc7/0x340
  [<c08ea8c9>] ? __schedule+0x379/0x780
  [<c08ec3f8>] ? apic_timer_interrupt+0x34/0x3c
  [<c04ad0ae>] ? __audit_syscall_exit+0x36e/0x3a0
  [<c04ad0ae>] ? __audit_syscall_exit+0x36e/0x3a0
  [<c04036e5>] do_notify_resume+0x75/0xa0
  [<c08ec1c1>] work_notifysig+0x30/0x37
Code: 85 c0 74 3e 83 6b 50 01 8b 08 8b 50 04 c7 00 00 00 00 00 c7 40
04 00 00 00 00 89 51 04 89 0a 8b 88 9c 00 00 00 8b 50 34 2b 50 30<0f>
b6 49 0d 83 e1 01 29 ca 01 d6 e8 c5 64 fb ff 8b 43 48 39 f8
EIP: [<c0835a9b>] tcp_close+0x7b/0x3a0 SS:ESP 0068:c2a1fd78
CR2: 000000000000000d
---[ end trace 9f6d5c8fc973265c ]---


How to reproduce it:
1. Run netserver in a loop.
2. Run netperf with different modes in a loop.
3. Kill netperf at random.

I found it is easy to trigger the panic on VirtualBox, but I could not
reproduce it on a real machine.
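
The same pattern can be approximated without netperf; the stand-alone
sketch below (hypothetical, error handling omitted, port number
arbitrary) streams data over loopback and kills the sender mid-transfer,
so a socket with unread in-flight data goes through tcp_close():

#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(12865),
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};
	char buf[16384];
	int ls, as, one = 1;
	pid_t pid;

	ls = socket(AF_INET, SOCK_STREAM, 0);
	setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	bind(ls, (struct sockaddr *)&addr, sizeof(addr));
	listen(ls, 1);

	pid = fork();
	if (pid == 0) {
		/* "netperf": stream data over loopback until killed */
		int cs = socket(AF_INET, SOCK_STREAM, 0);

		connect(cs, (struct sockaddr *)&addr, sizeof(addr));
		memset(buf, 0, sizeof(buf));
		for (;;)
			send(cs, buf, sizeof(buf), 0);
	}

	/* "netserver": accept, read a little, then kill the sender so
	 * its sockets are torn down while data is still in flight */
	as = accept(ls, NULL, NULL);
	recv(as, buf, sizeof(buf), 0);
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	close(as);
	close(ls);
	return 0;
}

Whether this actually triggers the oops still depends on the friends
patch being applied and on timing, so run it in a loop.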

Any hints?

thanks
Weiping Pan


* Re: Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections
  2012-10-18 10:19 Fwd: Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections Weiping Pan
@ 2012-10-18 12:23 ` Bruce Curtis
  2012-12-05  2:54   ` [RFC PATCH net-next 0/3 V4] " Weiping Pan
  0 siblings, 1 reply; 15+ messages in thread
From: Bruce Curtis @ 2012-10-18 12:23 UTC (permalink / raw)
  To: Weiping Pan; +Cc: open list:NETWORKING [GENERAL]

Hi Weiping,

I also wasn't able to reproduce this on a real machine; I will try a VM.
It has to do with the "making friends" code paths, which have been
refactored since the last patch (due to Eric's comments about an
unreferenced skb->friend sock pointer). I'll be sending out a new patch
soon.

Thanks,
Bruce

On Thu, Oct 18, 2012 at 2:19 PM, Weiping Pan <panweiping3@gmail.com> wrote:
> Sorry, I forgot to cc the list.
>
> I found a bug in the tcp friends patch:
> when I kill netperf at random, a panic occurs in tcp_close().
>
> [full oops trace, reproduction steps, and list footer trimmed]


* [RFC PATCH net-next 0/3 V4] net-tcp: TCP/IP stack bypass for loopback connections
  2012-10-18 12:23 ` Bruce Curtis
@ 2012-12-05  2:54   ` Weiping Pan
  2012-12-05  2:54     ` [PATCH 1/3] Bruce's original tcp friend V3 Weiping Pan
                       ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Weiping Pan @ 2012-12-05  2:54 UTC (permalink / raw)
  To: netdev; +Cc: brutus, Weiping Pan

1 Patch overview
[PATCH 1/3] is the original V3 patch from Bruce (brutus@google.com);
I just rebased it on top of net-next
commit 03f52a0a5542 ("ip6mr: Add sizeof verification to MRT6_ASSERT and
MT6_PIM").
http://patchwork.ozlabs.org/patch/184523/

[PATCH 2/3] fixes the bug in tcp_close() triggered by [PATCH 1/3]:
a tcp friends data skb carries no TCP header and its transport_header is
NULL, so the kernel panics if we dereference tcp_hdr(skb) in tcp_close().

[PATCH 3/3] fixes the problem raised by Eric (eric.dumazet@gmail.com):
http://www.spinics.net/lists/netdev/msg210750.html

The sock pointed to by request_sock->friend may be freed, since no lock
protects it.
I simply delete request_sock->friend, since I think it is unnecessary.

sk_buff->friend has the same problem; there I use
"atomic_add(skb->truesize, &sk->sk_wmem_alloc)" to guarantee that the
sock cannot be freed before the skb is freed.
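
Schematically, the pinning works like this (a sketch of the idea, not a
hunk from the patches; the final-free step that sock_wfree() performs
when sk_wmem_alloc drops to zero is elided):

static void friend_skb_release(struct sk_buff *skb)
{
	/* Drop the truesize charge; only now may the sender's sock
	 * actually go away.
	 */
	atomic_sub(skb->truesize, &skb->friend->sk_wmem_alloc);
}

static void friend_skb_hold(struct sk_buff *skb, struct sock *sk)
{
	skb->friend = sk;
	/* While the charge is held, sk_wmem_alloc stays nonzero and
	 * __sk_free() cannot run, so skb->friend remains valid for
	 * the lifetime of the skb.
	 */
	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}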

With tcp friends enabled, the 3-way handshake then works as follows:
SYN->friend is NULL, SYN/ACK->friend is set in tcp_make_synack(),
and ACK->friend is set in tcp_send_ack().

For normal data and FIN skbs, the friend pointer is NULL.
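
As a stand-alone illustration of how the cmpxchg()/xchg() rendezvous in
[PATCH 1/3] resolves the race between the connector processing the
SYN/ACK and the acceptor cloning the child sock, here is a hypothetical
userspace analogue (ints stand in for struct sock; the real child ends
up installed in both interleavings):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int listener, child;		/* stand-ins for struct sock */
static _Atomic(int *) sk_friend;	/* the connector's sk_friend */

/* Connector, on the SYN/ACK: install the listener as a placeholder,
 * unless the acceptor already published the real child; this mirrors
 * cmpxchg(&sk->sk_friend, NULL, skb->friend).
 */
static void *connector(void *arg)
{
	int *expected = NULL;

	(void)arg;
	atomic_compare_exchange_strong(&sk_friend, &expected, &listener);
	return NULL;
}

/* Acceptor, on the final ACK: unconditionally publish the child, as in
 * xchg(&req->friend->sk_friend, newsk); a non-NULL old value means the
 * connector got there first and may be sleeping, so it would be woken.
 */
static void *acceptor(void *arg)
{
	int *was = atomic_exchange(&sk_friend, &child);

	(void)arg;
	if (was)
		printf("connector was first: wake it up\n");
	return NULL;
}

int main(void)
{
	pthread_t c, a;

	pthread_create(&c, NULL, connector, NULL);
	pthread_create(&a, NULL, acceptor, NULL);
	pthread_join(c, NULL);
	pthread_join(a, NULL);
	printf("friend is %s\n",
	       atomic_load(&sk_friend) == &child ? "child" : "listener");
	return 0;
}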

2 Performance analysis
In short, TCP_RR improves by 5-6x, TCP_CRR stays the same, and
TCP_SENDFILE and TCP_MAERTS are not stable (they sometimes improve and
sometimes regress), so we can regard them as unchanged.
TCP_STREAM depends on the message size: above 8192 bytes it improves,
below that it regresses.

Intel(R) Xeon(R) E5506, 2 sockets, 8 cores, 2.13GHz
Memory 4GB
--------------------------------------------------------------------------
TCP friends performance results start


BASE means normal TCP with friends DISABLED.
AF_UNIX means sockets for local interprocess communication, for reference.
FRIENDS means TCP with friends ENABLED.
I set -s 51882 -m 16384 -M 87380 for all three kinds of sockets by default.
The first percentage is FRIENDS/BASE.
The second percentage is FRIENDS/AF_UNIX.
We set -i 10,2 -I 95,20 to stabilize the statistics.
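
For reference, the invocations look roughly like the following
(reconstructed from the flags above, so treat the exact command lines
as approximate):

  netperf -H 127.0.0.1 -t TCP_STREAM -i 10,2 -I 95,20 \
          -- -s 51882 -m 16384 -M 87380
  netperf -H 127.0.0.1 -t TCP_RR -i 10,2 -I 95,20 \
          -- -s 51882 -r 16384,16384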



      BASE    AF_UNIX    FRIENDS               TCP_STREAM
  21741.94   30653.90   17115.66   78%   55%



      BASE    AF_UNIX    FRIENDS               TCP_MAERTS
  17464.98          -   17134.63   98%    -%



      BASE    AF_UNIX    FRIENDS             TCP_SENDFILE
     25707          -      30828  119%    -%


TCP_SENDFILE does not work with -i 10,2 -I 95,20 (strangely), so I use
the average instead.



        MS       BASE    AF_UNIX    FRIENDS            TCP_STREAM_MS
         1      15.64       5.90       5.12   32%   86%
         2      30.93       9.81      10.48   33%  106%
         4      58.22      19.70      21.29   36%  108%
         8     117.00      39.00      42.74   36%  109%
        16     231.08      84.59      83.90   36%   99%
        32     439.39     159.93     163.03   37%  101%
        64     879.13     323.31     322.78   36%   99%
       128    1617.55     632.50     646.34   39%  102%
       256    3091.72    1316.36    1206.93   39%   91%
       512    5077.18    2359.51    2342.00   46%   99%
      1024    7403.20    6302.20    3335.23   45%   52%
      2048   10194.40   13922.19    5751.23   56%   41%
      4096   13338.08   22566.45    9447.29   70%   41%
      8192   14467.93   28122.20   13758.43   95%   48%
     16384   22463.15   37522.42   26804.36  119%   71%
     32768   14743.58   30591.61   17040.15  115%   55%
     65536   24743.77   33855.93   40418.15  163%  119%
    131072   13925.14   31762.52   48292.60  346%  152%
    262144   16126.15   32912.89   25610.47  158%   77%
    524288   12080.51   35059.27   30608.31  253%   87%
   1048576   10539.06   28200.14   16953.69  160%   60%
MS means Message Size in bytes, that is -m -M for netperf



        RR       BASE    AF_UNIX    FRIENDS                TCP_RR_RR
         1   13064.17   95593.46   72982.11  558%   76%
         2   12000.95   95477.38   65203.37  543%   68%
         4   12560.45   90758.17   69983.71  557%   77%
         8   17991.62   96794.53   77293.14  429%   79%
        16   13015.98   89384.69   83125.91  638%   92%
        32   13863.00   89870.17   88986.21  641%   99%
        64   10632.42   88906.59   83055.69  781%   93%
       128   13673.29   85629.27   92984.32  680%  108%
       256   12965.59   88117.74   86155.43  664%   97%
       512   17158.55   90866.08   85498.26  498%   94%
      1024   16951.15   82982.26   82286.84  485%   99%
      2048   11814.75   76684.40   83154.99  703%  108%
      4096   10393.91   63204.65   68558.71  659%  108%
      8192    7757.81   50318.63   50270.39  647%   99%
     16384    8147.26   37392.42   38619.89  474%  103%
     32768    8846.85   24847.64   28412.23  321%  114%
     65536    4974.59   16717.47   17327.65  348%  103%
    131072    4148.19    9053.56    9402.89  226%  103%
    262144    3029.66    5575.51    6119.65  201%  109%
    524288     923.40    3271.52    3649.37  395%  111%
   1048576     385.47    1173.18    1017.43  263%   86%
RR means Request Response Message Size in bytes, that is -r req,resp for netperf



        RR       BASE    AF_UNIX    FRIENDS               TCP_CRR_RR
         1    3424.40          -    3608.92  105%    -%
         2    3355.94          -    3523.77  105%    -%
         4    3437.05          -    3538.48  102%    -%
         8    3465.41          -    3630.49  104%    -%
        16    3495.40          -    3516.93  100%    -%
        32    3425.78          -    3524.90  102%    -%
        64    3432.01          -    3628.25  105%    -%
       128    3434.69          -    3573.88  104%    -%
       256    3413.94          -    3616.94  105%    -%
       512    3457.32          -    3675.38  106%    -%
      1024    3476.01          -    3634.25  104%    -%
      2048    3484.38          -    3539.96  101%    -%
      4096    3304.86          -    3564.57  107%    -%
      8192    3420.40          -    3599.02  105%    -%
     16384    3358.47          -    3571.60  106%    -%
     32768    3299.75          -    3469.19  105%    -%
     65536    2635.22          -    3292.74  124%    -%
    131072     119.97          -    3008.15 2507%    -%
    262144     933.66          -    2189.83  234%    -%
    524288     175.82          -     607.32  345%    -%
   1048576      41.70          -     296.22  710%    -%
RR means Request Response Message Size in bytes, that is -r req,resp for netperf -H 127.0.0.1



TCP friends performance results end
--------------------------------------------------------------------------

In short, I think the performance of tcp friends is not overwhelmingly
better than plain loopback.

Friends vs AF_UNIX:
Their call paths are almost the same, but AF_UNIX uses its own send/recv
code with proper locking, so AF_UNIX performs much better than friends.

Friends vs normal TCP:
Friends adds each skb directly to the peer's sk_receive_queue once it
takes the lock, so the sender and the receiver contend heavily on that
single lock.

Normal TCP queues the skb on its own sk_write_queue, transmits it in
net_tx_action(), receives it in net_rx_action(), and only then adds it
to the peer's sk_receive_queue. The sender only needs to lock the write
queue and the receiver only the receive queue, so there is little lock
contention. A userspace sketch of this argument follows.
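
Here is a hypothetical userspace sketch of that argument (not kernel
code): a Michael/Scott-style two-lock queue, where a dummy head node
lets the enqueuer and dequeuer run under separate locks, loosely models
normal TCP's split write/receive queues, while forcing both sides
through a single lock models the friends path. Timing the two modes
with time(1) shows what the shared lock costs:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node { struct node *next; };

static pthread_mutex_t head_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t tail_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node dummy, *head = &dummy, *tail = &dummy;
static int shared;	/* 1 models friends, 0 models normal TCP */

static void enqueue(struct node *n)
{
	/* friends-style: take the same lock as the consumer */
	pthread_mutex_t *l = shared ? &head_lock : &tail_lock;

	n->next = NULL;
	pthread_mutex_lock(l);
	tail->next = n;		/* two-lock discipline per Michael/Scott */
	tail = n;
	pthread_mutex_unlock(l);
}

static struct node *dequeue(void)
{
	struct node *old, *next;

	pthread_mutex_lock(&head_lock);
	old = head;
	next = old->next;
	if (next)
		head = next;	/* old dummy is handed back for reuse */
	pthread_mutex_unlock(&head_lock);
	return next ? old : NULL;
}

static void *producer(void *arg)
{
	long i;

	(void)arg;
	for (i = 0; i < 1000000; i++)
		enqueue(malloc(sizeof(struct node)));
	return NULL;
}

static void *consumer(void *arg)
{
	long got = 0;

	(void)arg;
	while (got < 1000000) {
		struct node *n = dequeue();

		if (n) {
			if (n != &dummy)
				free(n);
			got++;
		}
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t p, c;

	(void)argv;
	shared = argc > 1;	/* any argument: single shared lock */
	pthread_create(&p, NULL, producer, NULL);
	pthread_create(&c, NULL, consumer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	printf("done in %s-lock mode\n", shared ? "shared" : "split");
	return 0;
}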

3 TODO
1. Confirm that the root cause of the regression in some cases is the
lock contention.

2. Find a better way to fix the regression.

Any hints?

thanks

Weiping Pan (3):
  Bruce's original tcp friend V3
  fix panic in tcp_close()
  delete request_sock->friend

 Documentation/networking/ip-sysctl.txt |    8 +
 include/linux/skbuff.h                 |    2 +
 include/net/inet_connection_sock.h     |    4 +
 include/net/sock.h                     |   32 ++-
 include/net/tcp.h                      |   13 +-
 net/core/skbuff.c                      |    1 +
 net/core/sock.c                        |    1 +
 net/core/stream.c                      |   36 ++
 net/ipv4/inet_connection_sock.c        |   38 ++
 net/ipv4/sysctl_net_ipv4.c             |    7 +
 net/ipv4/tcp.c                         |  610 +++++++++++++++++++++++++++-----
 net/ipv4/tcp_input.c                   |   12 +-
 net/ipv4/tcp_ipv4.c                    |    5 +
 net/ipv4/tcp_minisocks.c               |   11 +-
 net/ipv4/tcp_output.c                  |   19 +-
 15 files changed, 707 insertions(+), 92 deletions(-)

-- 
1.7.4.4


* [PATCH 1/3] Bruce's original tcp friend V3
  2012-12-05  2:54   ` [RFC PATCH net-next 0/3 V4] " Weiping Pan
@ 2012-12-05  2:54     ` Weiping Pan
  2012-12-05  2:54     ` [PATCH 2/3] fix panic in tcp_close() Weiping Pan
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Weiping Pan @ 2012-12-05  2:54 UTC (permalink / raw)
  To: netdev; +Cc: brutus, Weiping Pan

http://patchwork.ozlabs.org/patch/184523/

Rebased on top of commit 03f52a0a5542 ("ip6mr: Add sizeof verification
to MRT6_ASSERT and MT6_PIM").

Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 Documentation/networking/ip-sysctl.txt |    8 +
 include/linux/skbuff.h                 |    2 +
 include/net/request_sock.h             |    1 +
 include/net/sock.h                     |   32 ++-
 include/net/tcp.h                      |   13 +-
 net/core/skbuff.c                      |    1 +
 net/core/sock.c                        |    1 +
 net/core/stream.c                      |   36 ++
 net/ipv4/inet_connection_sock.c        |   20 +
 net/ipv4/sysctl_net_ipv4.c             |    7 +
 net/ipv4/tcp.c                         |  604 +++++++++++++++++++++++++++-----
 net/ipv4/tcp_input.c                   |   22 +-
 net/ipv4/tcp_ipv4.c                    |    2 +
 net/ipv4/tcp_minisocks.c               |    4 +
 net/ipv4/tcp_output.c                  |   16 +-
 net/ipv6/tcp_ipv6.c                    |    1 +
 16 files changed, 679 insertions(+), 91 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 98ac0d7..152f488 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -214,6 +214,14 @@ tcp_fack - BOOLEAN
 	Enable FACK congestion avoidance and fast retransmission.
 	The value is not used, if tcp_sack is not enabled.
 
+tcp_friends - BOOLEAN
+	If set, TCP loopback socket pair stack bypass is enabled such
+	that all data sent will be directly queued to the receiver's
+	socket for receive. Note, normal connection establishment and
+	finish are used to make friends, so any loopback interposer, e.g.
+	tcpdump, will see these TCP segments but no data segments.
+	Default: 1
+
 tcp_fin_timeout - INTEGER
 	Time to hold socket in state FIN-WAIT-2, if it was closed
 	by our side. Peer can be broken and never close its side,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2af494..c890f65 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -334,6 +334,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@cb: Control buffer. Free for use by every layer. Put private vars here
  *	@_skb_refdst: destination entry (with norefcount bit)
  *	@sp: the security path, used for xfrm
+ *	@friend: loopback friend socket
  *	@len: Length of actual data
  *	@data_len: Data length
  *	@mac_len: Length of link layer header
@@ -409,6 +410,7 @@ struct sk_buff {
 #ifdef CONFIG_XFRM
 	struct	sec_path	*sp;
 #endif
+	struct sock		*friend;
 	unsigned int		len,
 				data_len;
 	__u16			mac_len,
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index a51dbd1..c6dfa26 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -66,6 +66,7 @@ struct request_sock {
 	unsigned long			expires;
 	const struct request_sock_ops	*rsk_ops;
 	struct sock			*sk;
+	struct sock			*friend;
 	u32				secid;
 	u32				peer_secid;
 };
diff --git a/include/net/sock.h b/include/net/sock.h
index c945fba..778d8dd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -197,6 +197,7 @@ struct cg_proto;
   *	@sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
   *	@sk_lock:	synchronizer
   *	@sk_rcvbuf: size of receive buffer in bytes
+  *	@sk_friend: loopback friend socket
   *	@sk_wq: sock wait queue and async head
   *	@sk_rx_dst: receive input route used by early tcp demux
   *	@sk_dst_cache: destination cache
@@ -286,6 +287,14 @@ struct sock {
 	socket_lock_t		sk_lock;
 	struct sk_buff_head	sk_receive_queue;
 	/*
+	 * If socket has a friend (sk_friend != NULL) then a send skb is
+	 * enqueued directly to the friend's sk_receive_queue such that:
+	 *
+	 *        sk_sndbuf -> sk_sndbuf + sk_friend->sk_rcvbuf
+	 *   sk_wmem_queued -> sk_friend->sk_rmem_alloc
+	 */
+	struct sock		*sk_friend;
+	/*
 	 * The backlog queue is special, it is always used with
 	 * the per-socket spinlock held and requires low latency
 	 * access. Therefore we special case it's implementation.
@@ -703,24 +712,40 @@ static inline bool sk_acceptq_is_full(const struct sock *sk)
 	return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
 }
 
+static inline int sk_wmem_queued_get(const struct sock *sk)
+{
+	if (sk->sk_friend)
+		return atomic_read(&sk->sk_friend->sk_rmem_alloc);
+	else
+		return sk->sk_wmem_queued;
+}
+
+static inline int sk_sndbuf_get(const struct sock *sk)
+{
+	if (sk->sk_friend)
+		return sk->sk_sndbuf + sk->sk_friend->sk_rcvbuf;
+	else
+		return sk->sk_sndbuf;
+}
+
 /*
  * Compute minimal free write space needed to queue new packets.
  */
 static inline int sk_stream_min_wspace(const struct sock *sk)
 {
-	return sk->sk_wmem_queued >> 1;
+	return sk_wmem_queued_get(sk) >> 1;
 }
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
-	return sk->sk_sndbuf - sk->sk_wmem_queued;
+	return sk_sndbuf_get(sk) - sk_wmem_queued_get(sk);
 }
 
 extern void sk_stream_write_space(struct sock *sk);
 
 static inline bool sk_stream_memory_free(const struct sock *sk)
 {
-	return sk->sk_wmem_queued < sk->sk_sndbuf;
+	return sk_wmem_queued_get(sk) < sk_sndbuf_get(sk);
 }
 
 /* OOB backlog add */
@@ -829,6 +854,7 @@ static inline void sock_rps_reset_rxhash(struct sock *sk)
 	})
 
 extern int sk_stream_wait_connect(struct sock *sk, long *timeo_p);
+extern int sk_stream_wait_friend(struct sock *sk, long *timeo_p);
 extern int sk_stream_wait_memory(struct sock *sk, long *timeo_p);
 extern void sk_stream_wait_close(struct sock *sk, long timeo_p);
 extern int sk_stream_error(struct sock *sk, int flags, int err);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3202bde..5f82770 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -292,6 +292,7 @@ extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern int sysctl_tcp_friends;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -687,6 +688,15 @@ void tcp_send_window_probe(struct sock *sk);
 #define TCPHDR_ECE 0x40
 #define TCPHDR_CWR 0x80
 
+/* If skb->friend != NULL, TCP friends per packet state.
+ */
+struct friend_skb_parm {
+	bool	tail_inuse;		/* In use by skb_get_friend() send while */
+					/* on sk_receive_queue for tail put */
+};
+
+#define TCP_FRIEND_CB(tcb) (&(tcb)->header.hf)
+
 /* This is what the send packet queuing engine uses to pass
  * TCP per-packet control information to the transmission code.
  * We also store the host-order sequence numbers in here too.
@@ -699,6 +709,7 @@ struct tcp_skb_cb {
 #if IS_ENABLED(CONFIG_IPV6)
 		struct inet6_skb_parm	h6;
 #endif
+		struct friend_skb_parm	hf;
 	} header;	/* For incoming frames		*/
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
@@ -1041,7 +1052,7 @@ static inline bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (sysctl_tcp_low_latency || !tp->ucopy.task)
+	if (sysctl_tcp_low_latency || !tp->ucopy.task || sk->sk_friend)
 		return false;
 
 	__skb_queue_tail(&tp->ucopy.prequeue, skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 880722e2..665826a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -690,6 +690,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
+	new->friend		= old->friend;
 	memcpy(new->cb, old->cb, sizeof(old->cb));
 	new->csum		= old->csum;
 	new->local_df		= old->local_df;
diff --git a/net/core/sock.c b/net/core/sock.c
index a692ef4..a8f59a9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2225,6 +2225,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 #ifdef CONFIG_NET_DMA
 	skb_queue_head_init(&sk->sk_async_wait_queue);
 #endif
+	sk->sk_friend		=	NULL;
 
 	sk->sk_send_head	=	NULL;
 
diff --git a/net/core/stream.c b/net/core/stream.c
index f5df85d..85e5b03 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -83,6 +83,42 @@ int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
 EXPORT_SYMBOL(sk_stream_wait_connect);
 
 /**
+ * sk_stream_wait_friend - Wait for a socket to make friends
+ * @sk: sock to wait on
+ * @timeo_p: for how long to wait
+ *
+ * Must be called with the socket locked.
+ */
+int sk_stream_wait_friend(struct sock *sk, long *timeo_p)
+{
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	int done;
+
+	do {
+		int err = sock_error(sk);
+		if (err)
+			return err;
+		if (!sk->sk_friend)
+			return -EBADFD;
+		if (!*timeo_p)
+			return -EAGAIN;
+		if (signal_pending(tsk))
+			return sock_intr_errno(*timeo_p);
+
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		sk->sk_write_pending++;
+		done = sk_wait_event(sk, timeo_p,
+				     !sk->sk_err &&
+				     sk->sk_friend->sk_friend);
+		finish_wait(sk_sleep(sk), &wait);
+		sk->sk_write_pending--;
+	} while (!done);
+	return 0;
+}
+EXPORT_SYMBOL(sk_stream_wait_friend);
+
+/**
  * sk_stream_closing - Return 1 if we still have things to send in our buffers.
  * @sk: socket to verify
  */
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 2026542..ce4b79b 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -659,6 +659,26 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 	if (newsk != NULL) {
 		struct inet_connection_sock *newicsk = inet_csk(newsk);
 
+		if (req->friend) {
+			/*
+			 * Make friends with the requestor but the ACK of
+			 * the request is already in-flight so the race is
+			 * on to make friends before the ACK is processed.
+			 * If the requestor's sk_friend value is != NULL
+			 * then the requestor has already processed the
+			 * ACK so indicate state change to wake'm up.
+			 */
+			struct sock *was;
+
+			sock_hold(req->friend);
+			newsk->sk_friend = req->friend;
+			sock_hold(newsk);
+			was = xchg(&req->friend->sk_friend, newsk);
+			/* If requester already connect()ed, maybe sleeping */
+			if (was && !sock_flag(req->friend, SOCK_DEAD))
+				sk->sk_state_change(req->friend);
+		}
+
 		newsk->sk_state = TCP_SYN_RECV;
 		newicsk->icsk_bind_hash = NULL;
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d84400b..4ca53db 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -796,6 +796,13 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero
 	},
+	{
+		.procname	= "tcp_friends",
+		.data		= &sysctl_tcp_friends,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e6eace1..4327deb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -310,6 +310,56 @@ struct tcp_splice_state {
 };
 
 /*
+ * Validate friendp: if sk has no friend, return 0; if the friendship is
+ * already complete, return 1; otherwise *friendp still points to the
+ * listen()er placeholder, so wait for the real friend, update *friendp,
+ * and return 1. On error, return a -errno.
+ */
+static inline int tcp_friend_validate(struct sock *sk, struct sock **friendp,
+			      long *timeo)
+{
+	struct sock *friend = *friendp;
+
+	if (!friend)
+		return 0;
+	if (unlikely(!friend->sk_friend)) {
+		/* Friendship not complete, wait? */
+		int err;
+
+		if (!timeo)
+			return -EAGAIN;
+		err = sk_stream_wait_friend(sk, timeo);
+		if (err < 0)
+			return err;
+		*friendp = sk->sk_friend;
+	}
+	return 1;
+}
+
+static inline int tcp_friend_send_lock(struct sock *friend)
+{
+	int err = 0;
+
+	spin_lock_bh(&friend->sk_lock.slock);
+	if (unlikely(friend->sk_shutdown & RCV_SHUTDOWN)) {
+		spin_unlock_bh(&friend->sk_lock.slock);
+		err = -ECONNRESET;
+	}
+
+	return err;
+}
+
+static inline void tcp_friend_recv_lock(struct sock *friend)
+{
+	spin_lock_bh(&friend->sk_lock.slock);
+}
+
+static void tcp_friend_unlock(struct sock *friend)
+{
+	spin_unlock_bh(&friend->sk_lock.slock);
+}
+
+/*
  * Pressure flag: try to collapse.
  * Technical note: it is used by multiple contexts non atomically.
  * All the __sk_mem_schedule() is of this nature: accounting
@@ -589,6 +639,76 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(tcp_ioctl);
 
+/*
+ * Friend receive_queue tail skb space? If true, set tail_inuse.
+ * Else if RCV_SHUTDOWN, return *copy = -ECONNRESET.
+ */
+static inline struct sk_buff *tcp_friend_tail(struct sock *friend, int *copy)
+{
+	struct sk_buff	*skb = NULL;
+	int		sz = 0;
+
+	if (skb_peek_tail(&friend->sk_receive_queue)) {
+		sz = tcp_friend_send_lock(friend);
+		if (!sz) {
+			skb = skb_peek_tail(&friend->sk_receive_queue);
+			if (skb && skb->friend) {
+				if (!*copy)
+					sz = skb_tailroom(skb);
+				else {
+					sz = *copy - skb->len;
+					if (sz < 0)
+						sz = 0;
+				}
+				if (sz > 0)
+					TCP_FRIEND_CB(TCP_SKB_CB(skb))->
+							tail_inuse = true;
+			}
+			tcp_friend_unlock(friend);
+		}
+	}
+
+	*copy = sz;
+	return skb;
+}
+
+static inline void tcp_friend_seq(struct sock *sk, int copy, int charge)
+{
+	struct sock	*friend = sk->sk_friend;
+	struct tcp_sock *tp = tcp_sk(friend);
+
+	if (charge) {
+		sk_mem_charge(friend, charge);
+		atomic_add(charge, &friend->sk_rmem_alloc);
+	}
+	tp->rcv_nxt += copy;
+	tp->rcv_wup += copy;
+	tcp_friend_unlock(friend);
+
+	tp = tcp_sk(sk);
+	tp->snd_nxt += copy;
+	tp->pushed_seq += copy;
+	tp->snd_una += copy;
+	tp->snd_up += copy;
+}
+
+static inline bool tcp_friend_push(struct sock *sk, struct sk_buff *skb)
+{
+	struct sock	*friend = sk->sk_friend;
+	int		wait = false;
+
+	skb_set_owner_r(skb, friend);
+	__skb_queue_tail(&friend->sk_receive_queue, skb);
+	if (!sk_rmem_schedule(friend, skb, skb->truesize))
+		wait = true;
+
+	tcp_friend_seq(sk, skb->len, 0);
+	if (skb == skb_peek(&friend->sk_receive_queue))
+		friend->sk_data_ready(friend, 0);
+
+	return wait;
+}
+
 static inline void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb)
 {
 	TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
@@ -605,8 +725,13 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
-	skb->csum    = 0;
 	tcb->seq     = tcb->end_seq = tp->write_seq;
+	if (sk->sk_friend) {
+		skb->friend = sk;
+		TCP_FRIEND_CB(tcb)->tail_inuse = false;
+		return;
+	}
+	skb->csum    = 0;
 	tcb->tcp_flags = TCPHDR_ACK;
 	tcb->sacked  = 0;
 	skb_header_release(skb);
@@ -626,7 +751,10 @@ static inline void tcp_mark_urg(struct tcp_sock *tp, int flags)
 static inline void tcp_push(struct sock *sk, int flags, int mss_now,
 			    int nonagle)
 {
-	if (tcp_send_head(sk)) {
+	if (sk->sk_friend) {
+		if (skb_peek(&sk->sk_friend->sk_receive_queue))
+			sk->sk_friend->sk_data_ready(sk->sk_friend, 0);
+	} else if (tcp_send_head(sk)) {
 		struct tcp_sock *tp = tcp_sk(sk);
 
 		if (!(flags & MSG_MORE) || forced_push(tp))
@@ -758,6 +886,21 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 }
 EXPORT_SYMBOL(tcp_splice_read);
 
+static inline struct sk_buff *tcp_friend_alloc_skb(struct sock *sk, int size)
+{
+	struct sk_buff *skb;
+
+	skb = alloc_skb(size, sk->sk_allocation);
+	if (skb)
+		skb->avail_size = skb_tailroom(skb);
+	else {
+		sk->sk_prot->enter_memory_pressure(sk);
+		sk_stream_moderate_sndbuf(sk);
+	}
+
+	return skb;
+}
+
 struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
 {
 	struct sk_buff *skb;
@@ -821,12 +964,53 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	return max(xmit_size_goal, mss_now);
 }
 
+static unsigned int tcp_friend_xmit_size_goal(struct sock *sk, int size_goal)
+{
+	u32 size = SKB_DATA_ALIGN(size_goal);
+	u32 overhead = sizeof(struct skb_shared_info) + sizeof(struct sk_buff);
+
+	/*
+	 * If alloc >= largest skb use largest order, else check
+	 * for optimal tail fill size, else use largest order.
+	 */
+	if (size >= SKB_MAX_ORDER(0, 4))
+		size = SKB_MAX_ORDER(0, 4);
+	else if (size <= (SKB_MAX_ORDER(0, 0) >> 3))
+		size = SKB_MAX_ORDER(0, 0);
+	else if (size <= (SKB_MAX_ORDER(0, 1) >> 3))
+		size = SKB_MAX_ORDER(0, 1);
+	else if (size <= (SKB_MAX_ORDER(0, 0) >> 1))
+		size = SKB_MAX_ORDER(0, 0);
+	else if (size <= (SKB_MAX_ORDER(0, 1) >> 1))
+		size = SKB_MAX_ORDER(0, 1);
+	else if (size <= (SKB_MAX_ORDER(0, 2) >> 1))
+		size = SKB_MAX_ORDER(0, 2);
+	else if (size <= (SKB_MAX_ORDER(0, 3) >> 1))
+		size = SKB_MAX_ORDER(0, 3);
+	else
+		size = SKB_MAX_ORDER(0, 4);
+
+	/* Leave room for at least 2 true-sized skbs in sk_sndbuf */
+	if (size + overhead > (sk_sndbuf_get(sk) >> 1))
+		size = (sk_sndbuf_get(sk) >> 1) - overhead;
+
+	return size;
+}
+
 static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 {
 	int mss_now;
+	int tmp;
+
+	if (sk->sk_friend) {
+		mss_now = tcp_friend_xmit_size_goal(sk, *size_goal);
+		tmp = mss_now;
+	} else {
+		mss_now = tcp_current_mss(sk);
+		tmp = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
+	}
 
-	mss_now = tcp_current_mss(sk);
-	*size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
+	*size_goal = tmp;
 
 	return mss_now;
 }
@@ -834,8 +1018,9 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
 			 size_t psize, int flags)
 {
+	struct sock *friend = sk->sk_friend;
 	struct tcp_sock *tp = tcp_sk(sk);
-	int mss_now, size_goal;
+	int mss_now, size_goal = psize;
 	int err;
 	ssize_t copied;
 	long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
@@ -850,6 +1035,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 			goto out_err;
 	}
 
+	err = tcp_friend_validate(sk, &friend, &timeo);
+	if (err < 0)
+		goto out_err;
+
 	clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 
 	mss_now = tcp_send_mss(sk, &size_goal, flags);
@@ -860,25 +1049,47 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 		goto out_err;
 
 	while (psize > 0) {
-		struct sk_buff *skb = tcp_write_queue_tail(sk);
+		struct sk_buff *skb;
+		struct tcp_skb_cb *tcb;
 		struct page *page = pages[poffset / PAGE_SIZE];
 		int copy, i;
 		int offset = poffset % PAGE_SIZE;
 		int size = min_t(size_t, psize, PAGE_SIZE - offset);
 		bool can_coalesce;
 
-		if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
+		if (friend) {
+			copy = size_goal;
+			skb = tcp_friend_tail(friend, &copy);
+			if (copy < 0) {
+				sk->sk_err = -copy;
+				err = -EPIPE;
+				goto out_err;
+			}
+		} else if (!tcp_send_head(sk)) {
+			skb = NULL;
+			copy = 0;
+		} else {
+			skb = tcp_write_queue_tail(sk);
+			copy = size_goal - skb->len;
+		}
+
+		if (copy <= 0) {
 new_segment:
 			if (!sk_stream_memory_free(sk))
 				goto wait_for_sndbuf;
 
-			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation);
+			if (friend)
+				skb = tcp_friend_alloc_skb(sk, 0);
+			else
+				skb = sk_stream_alloc_skb(sk, 0,
+							  sk->sk_allocation);
 			if (!skb)
 				goto wait_for_memory;
 
 			skb_entail(sk, skb);
 			copy = size_goal;
 		}
+		tcb = TCP_SKB_CB(skb);
 
 		if (copy > size)
 			copy = size;
@@ -886,10 +1097,14 @@ new_segment:
 		i = skb_shinfo(skb)->nr_frags;
 		can_coalesce = skb_can_coalesce(skb, i, page, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
-			tcp_mark_push(tp, skb);
+			if (friend) {
+				if (TCP_FRIEND_CB(tcb)->tail_inuse)
+					TCP_FRIEND_CB(tcb)->tail_inuse = false;
+			} else
+				tcp_mark_push(tp, skb);
 			goto new_segment;
 		}
-		if (!sk_wmem_schedule(sk, copy))
+		if (!friend && !sk_wmem_schedule(sk, copy))
 			goto wait_for_memory;
 
 		if (can_coalesce) {
@@ -902,19 +1117,41 @@ new_segment:
 		skb->len += copy;
 		skb->data_len += copy;
 		skb->truesize += copy;
-		sk->sk_wmem_queued += copy;
-		sk_mem_charge(sk, copy);
-		skb->ip_summed = CHECKSUM_PARTIAL;
 		tp->write_seq += copy;
-		TCP_SKB_CB(skb)->end_seq += copy;
-		skb_shinfo(skb)->gso_segs = 0;
-
-		if (!copied)
-			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
 
 		copied += copy;
 		poffset += copy;
-		if (!(psize -= copy))
+		psize -= copy;
+
+		if (friend) {
+			err = tcp_friend_send_lock(friend);
+			if (err) {
+				sk->sk_err = -err;
+				err = -EPIPE;
+				goto out_err;
+			}
+			tcb->end_seq += copy;
+			if (TCP_FRIEND_CB(tcb)->tail_inuse) {
+				TCP_FRIEND_CB(tcb)->tail_inuse = false;
+				tcp_friend_seq(sk, copy, copy);
+			} else {
+				if (tcp_friend_push(sk, skb))
+					goto wait_for_sndbuf;
+			}
+			if (!psize)
+				goto out;
+			continue;
+		}
+
+		tcb->end_seq += copy;
+		skb_shinfo(skb)->gso_segs = 0;
+		sk->sk_wmem_queued += copy;
+		sk_mem_charge(sk, copy);
+		skb->ip_summed = CHECKSUM_PARTIAL;
+		if (copied == copy)
+			tcb->tcp_flags &= ~TCPHDR_PSH;
+
+		if (!psize)
 			goto out;
 
 		if (skb->len < size_goal || (flags & MSG_OOB))
@@ -935,7 +1172,8 @@ wait_for_memory:
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 			goto do_error;
 
-		mss_now = tcp_send_mss(sk, &size_goal, flags);
+		if (!friend)
+			mss_now = tcp_send_mss(sk, &size_goal, flags);
 	}
 
 out:
@@ -1026,10 +1264,12 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t size)
 {
 	struct iovec *iov;
+	struct sock *friend = sk->sk_friend;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
+	struct tcp_skb_cb *tcb;
 	int iovlen, flags, err, copied = 0;
-	int mss_now = 0, size_goal, copied_syn = 0, offset = 0;
+	int mss_now = 0, size_goal = size, copied_syn = 0, offset = 0;
 	bool sg;
 	long timeo;
 
@@ -1057,6 +1297,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			goto do_error;
 	}
 
+	err = tcp_friend_validate(sk, &friend, &timeo);
+	if (err < 0)
+		goto out;
+
 	if (unlikely(tp->repair)) {
 		if (tp->repair_queue == TCP_RECV_QUEUE) {
 			copied = tcp_send_rcvq(sk, msg, size);
@@ -1105,24 +1349,38 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			int copy = 0;
 			int max = size_goal;
 
-			skb = tcp_write_queue_tail(sk);
-			if (tcp_send_head(sk)) {
-				if (skb->ip_summed == CHECKSUM_NONE)
-					max = mss_now;
-				copy = max - skb->len;
+			if (friend) {
+				skb = tcp_friend_tail(friend, &copy);
+				if (copy < 0) {
+					sk->sk_err = -copy;
+					err = -EPIPE;
+					goto out_err;
+				}
+			} else {
+				skb = tcp_write_queue_tail(sk);
+				if (tcp_send_head(sk)) {
+					if (skb->ip_summed == CHECKSUM_NONE)
+						max = mss_now;
+					copy = max - skb->len;
+				}
 			}
 
 			if (copy <= 0) {
 new_segment:
-				/* Allocate new segment. If the interface is SG,
-				 * allocate skb fitting to single page.
-				 */
 				if (!sk_stream_memory_free(sk))
 					goto wait_for_sndbuf;
 
-				skb = sk_stream_alloc_skb(sk,
-							  select_size(sk, sg),
-							  sk->sk_allocation);
+				if (friend)
+					skb = tcp_friend_alloc_skb(sk, max);
+				else {
+					/* Allocate new segment. If the
+					 * interface is SG, allocate skb
+					 * fitting to single page.
+					 */
+					skb = sk_stream_alloc_skb(sk,
+							select_size(sk, sg),
+							sk->sk_allocation);
+				}
 				if (!skb)
 					goto wait_for_memory;
 
@@ -1136,6 +1394,7 @@ new_segment:
 				copy = size_goal;
 				max = size_goal;
 			}
+			tcb = TCP_SKB_CB(skb);
 
 			/* Try to append data to the end of skb. */
 			if (copy > seglen)
@@ -1153,6 +1412,8 @@ new_segment:
 				int i = skb_shinfo(skb)->nr_frags;
 				struct page_frag *pfrag = sk_page_frag(sk);
 
+				BUG_ON(friend);
+
 				if (!sk_page_frag_refill(sk, pfrag))
 					goto wait_for_memory;
 
@@ -1188,16 +1449,37 @@ new_segment:
 				pfrag->offset += copy;
 			}
 
-			if (!copied)
-				TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
-
 			tp->write_seq += copy;
-			TCP_SKB_CB(skb)->end_seq += copy;
-			skb_shinfo(skb)->gso_segs = 0;
 
 			from += copy;
 			copied += copy;
-			if ((seglen -= copy) == 0 && iovlen == 0)
+			seglen -= copy;
+
+			if (friend) {
+				err = tcp_friend_send_lock(friend);
+				if (err) {
+					sk->sk_err = -err;
+					err = -EPIPE;
+					goto out_err;
+				}
+				tcb->end_seq += copy;
+				if (TCP_FRIEND_CB(tcb)->tail_inuse) {
+					TCP_FRIEND_CB(tcb)->tail_inuse = false;
+					tcp_friend_seq(sk, copy, 0);
+				} else {
+					if (tcp_friend_push(sk, skb))
+						goto wait_for_sndbuf;
+				}
+				continue;
+			}
+
+			tcb->end_seq += copy;
+			skb_shinfo(skb)->gso_segs = 0;
+
+			if (copied == copy)
+				tcb->tcp_flags &= ~TCPHDR_PSH;
+
+			if (seglen == 0 && iovlen == 0)
 				goto out;
 
 			if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
@@ -1219,7 +1501,8 @@ wait_for_memory:
 			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 				goto do_error;
 
-			mss_now = tcp_send_mss(sk, &size_goal, flags);
+			if (!friend)
+				mss_now = tcp_send_mss(sk, &size_goal, flags);
 		}
 	}
 
@@ -1230,7 +1513,12 @@ out:
 	return copied + copied_syn;
 
 do_fault:
-	if (!skb->len) {
+	if (skb->friend) {
+		if (TCP_FRIEND_CB(tcb)->tail_inuse)
+			TCP_FRIEND_CB(tcb)->tail_inuse = false;
+		else
+			__kfree_skb(skb);
+	} else if (!skb->len) {
 		tcp_unlink_write_queue(skb, sk);
 		/* It is the one place in all of TCP, except connection
 		 * reset, where we can be unlinking the send_head.
@@ -1249,6 +1537,13 @@ out_err:
 }
 EXPORT_SYMBOL(tcp_sendmsg);
 
+static inline void tcp_friend_write_space(struct sock *sk)
+{
+	/* Queued data below 1/4th of sndbuf? */
+	if ((sk_sndbuf_get(sk) >> 2) > sk_wmem_queued_get(sk))
+		sk->sk_friend->sk_write_space(sk->sk_friend);
+}
+
 /*
  *	Handle reading urgent data. BSD has very simple semantics for
  *	this, no blocking and very strange errors 8)
@@ -1327,7 +1622,12 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool time_to_ack = false;
 
-	struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);
+	struct sk_buff *skb;
+
+	if (sk->sk_friend)
+		return;
+
+	skb = skb_peek(&sk->sk_receive_queue);
 
 	WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
 	     "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
@@ -1431,17 +1731,27 @@ static void tcp_service_net_dma(struct sock *sk, bool wait)
 }
 #endif
 
-static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
+static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off,
+					   size_t *len)
 {
 	struct sk_buff *skb;
 	u32 offset;
+	size_t avail;
 
 	skb_queue_walk(&sk->sk_receive_queue, skb) {
-		offset = seq - TCP_SKB_CB(skb)->seq;
-		if (tcp_hdr(skb)->syn)
-			offset--;
-		if (offset < skb->len || tcp_hdr(skb)->fin) {
+		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+		offset = seq - tcb->seq;
+		if (skb->friend)
+			avail = (u32)(tcb->end_seq - seq);
+		else {
+			if (tcp_hdr(skb)->syn)
+				offset--;
+			avail = skb->len - offset;
+		}
+		if (avail > 0 || (!skb->friend && tcp_hdr(skb)->fin)) {
 			*off = offset;
+			*len = avail;
 			return skb;
 		}
 	}
@@ -1467,15 +1777,23 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	u32 seq = tp->copied_seq;
 	u32 offset;
 	int copied = 0;
+	size_t len;
+	int err;
+	struct sock *friend = sk->sk_friend;
+	long timeo = sock_rcvtimeo(sk, false);
 
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
-		if (offset < skb->len) {
-			int used;
-			size_t len;
+	err = tcp_friend_validate(sk, &friend, &timeo);
+	if (err < 0)
+		return err;
+	if (friend)
+		tcp_friend_recv_lock(sk);
 
-			len = skb->len - offset;
+	while ((skb = tcp_recv_skb(sk, seq, &offset, &len)) != NULL) {
+		if (len > 0) {
+			int used;
+	again:
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
 				u32 urg_offset = tp->urg_seq - seq;
@@ -1484,6 +1802,10 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				if (!len)
 					break;
 			}
+
+			if (friend)
+				tcp_friend_unlock(sk);
+
 			used = recv_actor(desc, skb, offset, len);
 			if (used < 0) {
 				if (!copied)
@@ -1494,33 +1816,65 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				copied += used;
 				offset += used;
 			}
-			/*
-			 * If recv_actor drops the lock (e.g. TCP splice
-			 * receive) the skb pointer might be invalid when
-			 * getting here: tcp_collapse might have deleted it
-			 * while aggregating skbs from the socket queue.
-			 */
-			skb = tcp_recv_skb(sk, seq-1, &offset);
-			if (!skb || (offset+1 != skb->len))
-				break;
+
+			if (friend)
+				tcp_friend_recv_lock(sk);
+			if (skb->friend) {
+				len = (u32)(TCP_SKB_CB(skb)->end_seq - seq);
+				if (len > 0) {
+					/*
+					 * Friend did an skb_put() while we
+					 * were away so process the same skb.
+					 */
+					if (!desc->count)
+						break;
+					tp->copied_seq = seq;
+					goto again;
+				}
+			} else {
+				/*
+				 * If recv_actor drops the lock (e.g. TCP
+				 * splice receive) the skb pointer might be
+				 * invalid when getting here: tcp_collapse
+				 * might have deleted it while aggregating
+				 * skbs from the socket queue.
+				 */
+				skb = tcp_recv_skb(sk, seq-1, &offset, &len);
+				if (!skb || (offset+1 != skb->len))
+					break;
+			}
 		}
-		if (tcp_hdr(skb)->fin) {
+		if (!skb->friend && tcp_hdr(skb)->fin) {
 			sk_eat_skb(sk, skb, false);
 			++seq;
 			break;
 		}
-		sk_eat_skb(sk, skb, false);
+		if (skb->friend) {
+			if (!TCP_FRIEND_CB(TCP_SKB_CB(skb))->tail_inuse) {
+				__skb_unlink(skb, &sk->sk_receive_queue);
+				__kfree_skb(skb);
+				tcp_friend_write_space(sk);
+			}
+			tcp_friend_unlock(sk);
+			tcp_friend_recv_lock(sk);
+		} else
+			sk_eat_skb(sk, skb, 0);
 		if (!desc->count)
 			break;
 		tp->copied_seq = seq;
 	}
 	tp->copied_seq = seq;
 
-	tcp_rcv_space_adjust(sk);
+	if (friend) {
+		tcp_friend_unlock(sk);
+		tcp_friend_write_space(sk);
+	} else {
+		tcp_rcv_space_adjust(sk);
 
-	/* Clean up data we have read: This will do ACK frames. */
-	if (copied > 0)
-		tcp_cleanup_rbuf(sk, copied);
+		/* Clean up data we have read: This will do ACK frames. */
+		if (copied > 0)
+			tcp_cleanup_rbuf(sk, copied);
+	}
 	return copied;
 }
 EXPORT_SYMBOL(tcp_read_sock);
@@ -1536,6 +1890,7 @@ EXPORT_SYMBOL(tcp_read_sock);
 int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t len, int nonblock, int flags, int *addr_len)
 {
+	struct sock *friend = sk->sk_friend;
 	struct tcp_sock *tp = tcp_sk(sk);
 	int copied = 0;
 	u32 peek_seq;
@@ -1548,6 +1903,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	bool copied_early = false;
 	struct sk_buff *skb;
 	u32 urg_hole = 0;
+	bool locked = false;
 
 	lock_sock(sk);
 
@@ -1557,6 +1913,10 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 	timeo = sock_rcvtimeo(sk, nonblock);
 
+	err = tcp_friend_validate(sk, &friend, &timeo);
+	if (err < 0)
+		goto out;
+
 	/* Urgent data needs to be handled specially. */
 	if (flags & MSG_OOB)
 		goto recv_urg;
@@ -1595,7 +1955,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			available = TCP_SKB_CB(skb)->seq + skb->len - (*seq);
 		if ((available < target) &&
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
-		    !sysctl_tcp_low_latency &&
+		    !sysctl_tcp_low_latency && !friend &&
 		    net_dma_find_channel()) {
 			preempt_enable_no_resched();
 			tp->ucopy.pinned_list =
@@ -1606,7 +1966,10 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	}
 #endif
 
+	err = 0;
+
 	do {
+		struct tcp_skb_cb *tcb;
 		u32 offset;
 
 		/* Are we at urgent data? Stop if we have read anything or have SIGURG pending. */
@@ -1614,37 +1977,77 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			if (copied)
 				break;
 			if (signal_pending(current)) {
-				copied = timeo ? sock_intr_errno(timeo) : -EAGAIN;
+				err = timeo ? sock_intr_errno(timeo) : -EAGAIN;
 				break;
 			}
 		}
 
-		/* Next get a buffer. */
+		/*
+		 * Next get a buffer. Note, for friends sendmsg() queues
+		 * data directly to our sk_receive_queue by holding our
+		 * slock and either tail queuing a new skb or adding new
+		 * data to the tail skb. In the later case tail_inuse is
+		 * set, slock dropped, copyin, skb->len updated, re-hold
+		 * slock, end_seq updated, so we can only use the bytes
+		 * from *seq to end_seq!
+		 */
+		if (friend && !locked) {
+			tcp_friend_recv_lock(sk);
+			locked = true;
+		}
 
 		skb_queue_walk(&sk->sk_receive_queue, skb) {
+			tcb = TCP_SKB_CB(skb);
+			offset = *seq - tcb->seq;
+			if (friend) {
+				if (skb->friend) {
+					used = (u32)(tcb->end_seq - *seq);
+					if (used > 0) {
+						tcp_friend_unlock(sk);
+						locked = false;
+						/* Can use it all */
+						goto found_ok_skb;
+					}
+					/* No data to copyout */
+					if (flags & MSG_PEEK)
+						continue;
+					if (!TCP_FRIEND_CB(tcb)->tail_inuse)
+						goto unlink;
+					break;
+				}
+				tcp_friend_unlock(sk);
+				locked = false;
+			}
+
 			/* Now that we have two receive queues this
 			 * shouldn't happen.
 			 */
-			if (WARN(before(*seq, TCP_SKB_CB(skb)->seq),
+			if (WARN(before(*seq, tcb->seq),
 				 "recvmsg bug: copied %X seq %X rcvnxt %X fl %X\n",
-				 *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt,
-				 flags))
+				 *seq, tcb->seq, tp->rcv_nxt, flags))
 				break;
 
-			offset = *seq - TCP_SKB_CB(skb)->seq;
 			if (tcp_hdr(skb)->syn)
 				offset--;
-			if (offset < skb->len)
+			if (offset < skb->len) {
+				/* Ok so how much can we use? */
+				used = skb->len - offset;
 				goto found_ok_skb;
+			}
 			if (tcp_hdr(skb)->fin)
 				goto found_fin_ok;
 			WARN(!(flags & MSG_PEEK),
 			     "recvmsg bug 2: copied %X seq %X rcvnxt %X fl %X\n",
-			     *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt, flags);
+			     *seq, tcb->seq, tp->rcv_nxt, flags);
 		}
 
 		/* Well, if we have backlog, try to process it now yet. */
 
+		if (friend && locked) {
+			tcp_friend_unlock(sk);
+			locked = false;
+		}
+
 		if (copied >= target && !sk->sk_backlog.tail)
 			break;
 
@@ -1691,7 +2094,8 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 		tcp_cleanup_rbuf(sk, copied);
 
-		if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
+		if (!sysctl_tcp_low_latency && !friend &&
+		    tp->ucopy.task == user_recv) {
 			/* Install new reader */
 			if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) {
 				user_recv = current;
@@ -1791,8 +2195,6 @@ do_prequeue:
 		continue;
 
 	found_ok_skb:
-		/* Ok so how much can we use? */
-		used = skb->len - offset;
 		if (len < used)
 			used = len;
 
@@ -1849,7 +2251,7 @@ do_prequeue:
 				if (err) {
 					/* Exception. Bailout! */
 					if (!copied)
-						copied = -EFAULT;
+						copied = err;
 					break;
 				}
 			}
@@ -1858,6 +2260,7 @@ do_prequeue:
 		*seq += used;
 		copied += used;
 		len -= used;
+		offset += used;
 
 		tcp_rcv_space_adjust(sk);
 
@@ -1866,10 +2269,43 @@ skip_copy:
 			tp->urg_data = 0;
 			tcp_fast_path_check(sk);
 		}
-		if (used + offset < skb->len)
+
+		if (skb->friend) {
+			tcp_friend_recv_lock(sk);
+			locked = true;
+			used = (u32)(tcb->end_seq - *seq);
+			if (used) {
+				/*
+				 * Friend did an skb_put() while we were away
+				 * so if more to do process the same skb.
+				 */
+				if (len > 0) {
+					tcp_friend_unlock(sk);
+					locked = false;
+					goto found_ok_skb;
+				}
+				continue;
+			}
+			if (TCP_FRIEND_CB(tcb)->tail_inuse) {
+				/* Give sendmsg a chance */
+				tcp_friend_unlock(sk);
+				locked = false;
+				continue;
+			}
+			if (!(flags & MSG_PEEK)) {
+		unlink:
+				__skb_unlink(skb, &sk->sk_receive_queue);
+				__kfree_skb(skb);
+				tcp_friend_unlock(sk);
+				locked = false;
+				tcp_friend_write_space(sk);
+			}
 			continue;
+		}
 
-		if (tcp_hdr(skb)->fin)
+		if (offset < skb->len)
+			continue;
+		else if (tcp_hdr(skb)->fin)
 			goto found_fin_ok;
 		if (!(flags & MSG_PEEK)) {
 			sk_eat_skb(sk, skb, copied_early);
@@ -1887,6 +2323,9 @@ skip_copy:
 		break;
 	} while (len > 0);
 
+	if (friend && locked)
+		tcp_friend_unlock(sk);
+
 	if (user_recv) {
 		if (!skb_queue_empty(&tp->ucopy.prequeue)) {
 			int chunk;
@@ -2065,6 +2504,9 @@ void tcp_close(struct sock *sk, long timeout)
 		goto adjudge_to_death;
 	}
 
+	if (sk->sk_friend)
+		sock_put(sk->sk_friend);
+
 	/*  We need to flush the recv. buffs.  We do this only on the
 	 *  descriptor close, not protocol-sourced closes, because the
 	 *  reader process may not have drained the data yet!
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fc67831..9640a81 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -530,6 +530,9 @@ void tcp_rcv_space_adjust(struct sock *sk)
 	int time;
 	int space;
 
+	if (sk->sk_friend)
+		return;
+
 	if (tp->rcvq_space.time == 0)
 		goto new_measure;
 
@@ -4350,8 +4353,9 @@ static int tcp_prune_queue(struct sock *sk);
 static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
 				 unsigned int size)
 {
-	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	    !sk_rmem_schedule(sk, skb, size)) {
+	if (!sk->sk_friend &&
+	    (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+	    !sk_rmem_schedule(sk, skb, size))) {
 
 		if (tcp_prune_queue(sk) < 0)
 			return -1;
@@ -5722,6 +5726,16 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 *    state to ESTABLISHED..."
 		 */
 
+		if (skb->friend) {
+			/*
+			 * If friends haven't been made yet, our sk_friend
+			 * still == NULL, then update with the ACK's friend
+			 * value (the listen()er's sock addr) which is used
+			 * as a place holder.
+			 */
+			cmpxchg(&sk->sk_friend, NULL, skb->friend);
+		}
+
 		TCP_ECN_rcv_synack(tp, th);
 
 		tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
@@ -5797,9 +5811,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		    tcp_rcv_fastopen_synack(sk, skb, &foc))
 			return -1;
 
-		if (sk->sk_write_pending ||
+		if (!skb->friend && (sk->sk_write_pending ||
 		    icsk->icsk_accept_queue.rskq_defer_accept ||
-		    icsk->icsk_ack.pingpong) {
+		    icsk->icsk_ack.pingpong)) {
 			/* Save one ACK. Data will be ready after
 			 * several ticks, if write_pending is set.
 			 *
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 1ed2307..f494914 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1512,6 +1512,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
 #endif
 
+	req->friend = skb->friend;
+
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f35f2df..36d832a 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -270,6 +270,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 	const struct tcp_sock *tp = tcp_sk(sk);
 	bool recycle_ok = false;
 
+	if (sk->sk_friend)
+		goto out;
+
 	if (tcp_death_row.sysctl_tw_recycle && tp->rx_opt.ts_recent_stamp)
 		recycle_ok = tcp_remember_stamp(sk);
 
@@ -349,6 +352,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 	}
 
 	tcp_update_metrics(sk);
+out:
 	tcp_done(sk);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8ac0855..509c5e3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -65,6 +65,9 @@ int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+/* By default, TCP loopback bypass */
+int sysctl_tcp_friends __read_mostly = 1;
+
 int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
 EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);
 
@@ -1025,9 +1028,13 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	tcb = TCP_SKB_CB(skb);
 	memset(&opts, 0, sizeof(opts));
 
-	if (unlikely(tcb->tcp_flags & TCPHDR_SYN))
+	if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) {
+		/* Only try to make friends if enabled */
+		if (sysctl_tcp_friends)
+			skb->friend = sk;
+
 		tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
-	else
+	} else
 		tcp_options_size = tcp_established_options(sk, skb, &opts,
 							   &md5);
 	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
@@ -2725,6 +2732,11 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	}
 
 	memset(&opts, 0, sizeof(opts));
+
+	/* Only try to make friends if enabled */
+	if (sysctl_tcp_friends)
+		skb->friend = sk;
+
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(req->cookie_ts))
 		TCP_SKB_CB(skb)->when = cookie_init_timestamp(req);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6565cf5..828d5f7 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -969,6 +969,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
 #endif
 
+	req->friend = skb->friend;
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
-- 
1.7.4.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] fix panic in tcp_close()
  2012-12-05  2:54   ` [RFC PATCH net-next 0/3 V4] " Weiping Pan
  2012-12-05  2:54     ` [PATCH 1/3] Bruce's original tcp friend V3 Weiping Pan
@ 2012-12-05  2:54     ` Weiping Pan
  2012-12-05  2:54     ` [PATCH 3/3] delete request_sock->friend Weiping Pan
  2012-12-10 21:02     ` [RFC PATCH net-next 0/3 V4] net-tcp: TCP/IP stack bypass for loopback connections David Miller
  3 siblings, 0 replies; 15+ messages in thread
From: Weiping Pan @ 2012-12-05  2:54 UTC (permalink / raw)
  To: netdev; +Cc: brutus, Weiping Pan

A tcp friends data skb has no tcp header and its transport_header is NULL,
so we will panic if we dereference tcp_hdr(skb) in tcp_close().

So I add a check before we use tcp_hdr().

Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 net/ipv4/tcp.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4327deb..e9d82e0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2512,8 +2512,12 @@ void tcp_close(struct sock *sk, long timeout)
 	 *  reader process may not have drained the data yet!
 	 */
 	while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
-		u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
+		u32 len;
+		if (tcp_hdr(skb))
+			len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
 			  tcp_hdr(skb)->fin;
+		else
+			len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq;
 		data_was_unread += len;
 		__kfree_skb(skb);
 	}
-- 
1.7.4.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] delete request_sock->friend
  2012-12-05  2:54   ` [RFC PATCH net-next 0/3 V4] " Weiping Pan
  2012-12-05  2:54     ` [PATCH 1/3] Bruce's original tcp friend V3 Weiping Pan
  2012-12-05  2:54     ` [PATCH 2/3] fix panic in tcp_close() Weiping Pan
@ 2012-12-05  2:54     ` Weiping Pan
  2012-12-10 21:02     ` [RFC PATCH net-next 0/3 V4] net-tcp: TCP/IP stack bypass for loopback connections David Miller
  3 siblings, 0 replies; 15+ messages in thread
From: Weiping Pan @ 2012-12-05  2:54 UTC (permalink / raw)
  To: netdev; +Cc: brutus, Weiping Pan

The sock pointed to by request_sock->friend may be freed, since there is no
lock to protect it.
I just delete request_sock->friend, since I think it is useless.

sk_buff->friend has the same problem, but there we use
"atomic_add(skb->truesize, &sk->sk_wmem_alloc)" to guarantee that the sock
cannot be freed before the skb is freed.

Then for the 3-way handshake with tcp friends enabled,
SYN->friend is NULL, SYN/ACK->friend is set in tcp_make_synack(),
and ACK->friend is set in tcp_send_ack().
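
A minimal sketch of the lifetime rule this relies on, assuming the fields
from this series (skb->friend, sk->sk_wmem_alloc); the helper below is
hypothetical and just collects in one place the steps that the patch
spreads across tcp_make_synack(), tcp_send_ack() and tcp_v4_do_rcv():

static void tcp_friend_hold_skb(struct sk_buff *skb, struct sock *sk)
{
	/* Let the receiver find the sending sock. */
	skb->friend = sk;
	skb->sk = sk;
	/* Charge the skb to the sender's write allocation so the sock
	 * cannot be freed while this skb is still in flight.
	 */
	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
	/* sock_wfree() subtracts the same amount again (releasing the
	 * hold on the sock) when the skb is finally freed.
	 */
	skb->destructor = sock_wfree;
}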

Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 include/net/inet_connection_sock.h |    4 ++
 include/net/request_sock.h         |    1 -
 net/ipv4/inet_connection_sock.c    |   58 +++++++++++++++++++++++------------
 net/ipv4/tcp_input.c               |   10 ------
 net/ipv4/tcp_ipv4.c                |    7 +++-
 net/ipv4/tcp_minisocks.c           |    7 ++++-
 net/ipv4/tcp_output.c              |   21 ++++++++-----
 net/ipv6/tcp_ipv6.c                |    1 -
 8 files changed, 66 insertions(+), 43 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index ba1d361..883e029 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -147,6 +147,10 @@ static inline void *inet_csk_ca(const struct sock *sk)
 extern struct sock *inet_csk_clone_lock(const struct sock *sk,
 					const struct request_sock *req,
 					const gfp_t priority);
+extern struct sock *inet_csk_friend_clone_lock(const struct sock *sk,
+					const struct request_sock *req,
+					const struct sk_buff *skb,
+					const gfp_t priority);
 
 enum inet_csk_ack_state_t {
 	ICSK_ACK_SCHED	= 1,
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index c6dfa26..a51dbd1 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -66,7 +66,6 @@ struct request_sock {
 	unsigned long			expires;
 	const struct request_sock_ops	*rsk_ops;
 	struct sock			*sk;
-	struct sock			*friend;
 	u32				secid;
 	u32				peer_secid;
 };
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index ce4b79b..7af92ed 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -659,26 +659,6 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 	if (newsk != NULL) {
 		struct inet_connection_sock *newicsk = inet_csk(newsk);
 
-		if (req->friend) {
-			/*
-			 * Make friends with the requestor but the ACK of
-			 * the request is already in-flight so the race is
-			 * on to make friends before the ACK is processed.
-			 * If the requestor's sk_friend value is != NULL
-			 * then the requestor has already processed the
-			 * ACK so indicate state change to wake'm up.
-			 */
-			struct sock *was;
-
-			sock_hold(req->friend);
-			newsk->sk_friend = req->friend;
-			sock_hold(newsk);
-			was = xchg(&req->friend->sk_friend, newsk);
-			/* If requester already connect()ed, maybe sleeping */
-			if (was && !sock_flag(req->friend, SOCK_DEAD))
-				sk->sk_state_change(req->friend);
-		}
-
 		newsk->sk_state = TCP_SYN_RECV;
 		newicsk->icsk_bind_hash = NULL;
 
@@ -700,6 +680,44 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 }
 EXPORT_SYMBOL_GPL(inet_csk_clone_lock);
 
+/**
+ *	inet_csk_friend_clone_lock - clone an inet socket, and lock its clone
+ *	@sk: the socket to clone
+ *	@req: request_sock
+ *	@skb: who sends the request
+ *	@priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
+ *
+ *	Caller must unlock socket even in error path (bh_unlock_sock(newsk))
+ */
+struct sock *inet_csk_friend_clone_lock(const struct sock *sk,
+				 const struct request_sock *req,
+				 const struct sk_buff *skb,
+				 const gfp_t priority)
+{
+	struct sock *newsk = inet_csk_clone_lock(sk, req, priority);
+
+	if (newsk) {
+		struct sock *friend = skb->friend;
+		if (friend) {
+			/*
+			 * Make friends.
+			 */
+			struct sock *was;
+
+			sock_hold(friend);
+			newsk->sk_friend = friend;
+			sock_hold(newsk);
+			was = xchg(&friend->sk_friend, newsk);
+			/* If requester already connect()ed, maybe sleeping */
+			if (was && !sock_flag(friend, SOCK_DEAD))
+				sk->sk_state_change(friend);
+		}
+	}
+
+	return newsk;
+}
+EXPORT_SYMBOL_GPL(inet_csk_friend_clone_lock);
+
 /*
  * At this point, there should be no process reference to this
  * socket, and thus no user references at all.  Therefore we
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9640a81..39db09d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5726,16 +5726,6 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 *    state to ESTABLISHED..."
 		 */
 
-		if (skb->friend) {
-			/*
-			 * If friends haven't been made yet, our sk_friend
-			 * still == NULL, then update with the ACK's friend
-			 * value (the listen()er's sock addr) which is used
-			 * as a place holder.
-			 */
-			cmpxchg(&sk->sk_friend, NULL, skb->friend);
-		}
-
 		TCP_ECN_rcv_synack(tp, th);
 
 		tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f494914..8d61e4c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1512,8 +1512,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
 #endif
 
-	req->friend = skb->friend;
-
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
@@ -1873,6 +1871,11 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 	if (skb->len < tcp_hdrlen(skb) || tcp_checksum_complete(skb))
 		goto csum_err;
 
+	if (sysctl_tcp_friends && skb->friend) {
+		skb->sk = skb->friend;
+		skb->destructor = sock_wfree;
+	}
+
 	if (sk->sk_state == TCP_LISTEN) {
 		struct sock *nsk = tcp_v4_hnd_req(sk, skb);
 		if (!nsk)
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 36d832a..753126e 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -383,7 +383,12 @@ static inline void TCP_ECN_openreq_child(struct tcp_sock *tp,
  */
 struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, struct sk_buff *skb)
 {
-	struct sock *newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
+	struct sock *newsk = NULL;
+
+	if (sysctl_tcp_friends && skb->friend)
+		newsk = inet_csk_friend_clone_lock(sk, req, skb, GFP_ATOMIC);
+	else
+		newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
 
 	if (newsk != NULL) {
 		const struct inet_request_sock *ireq = inet_rsk(req);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 509c5e3..4d71549 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1028,13 +1028,9 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	tcb = TCP_SKB_CB(skb);
 	memset(&opts, 0, sizeof(opts));
 
-	if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) {
-		/* Only try to make friends if enabled */
-		if (sysctl_tcp_friends)
-			skb->friend = sk;
-
+	if (unlikely(tcb->tcp_flags & TCPHDR_SYN))
 		tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
-	} else
+	else
 		tcp_options_size = tcp_established_options(sk, skb, &opts,
 							   &md5);
 	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
@@ -1050,7 +1046,11 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 
 	skb_orphan(skb);
 	skb->sk = sk;
-	skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
+
+	if (skb->friend)
+		skb->destructor = NULL;
+	else
+		skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
 			  tcp_wfree : sock_wfree;
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 
@@ -2734,8 +2734,10 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	memset(&opts, 0, sizeof(opts));
 
 	/* Only try to make friends if enabled */
-	if (sysctl_tcp_friends)
+	if (sysctl_tcp_friends) {
 		skb->friend = sk;
+		atomic_add(skb->truesize, &sk->sk_wmem_alloc);
+	}
 
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(req->cookie_ts))
@@ -3120,6 +3122,9 @@ void tcp_send_ack(struct sock *sk)
 
 	/* Send it off, this clears delayed acks for us. */
 	TCP_SKB_CB(buff)->when = tcp_time_stamp;
+
+	if (sysctl_tcp_friends)
+		buff->friend = sk;
 	tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
 }
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 828d5f7..6565cf5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -969,7 +969,6 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
 #endif
 
-	req->friend = skb->friend;
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
-- 
1.7.4.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH net-next 0/3 V4] net-tcp: TCP/IP stack bypass for loopback connections
  2012-12-05  2:54   ` [RFC PATCH net-next 0/3 V4] " Weiping Pan
                       ` (2 preceding siblings ...)
  2012-12-05  2:54     ` [PATCH 3/3] delete request_sock->friend Weiping Pan
@ 2012-12-10 21:02     ` David Miller
  2012-12-12 14:13       ` Weiping Pan
       [not found]       ` <117a10f9575d95d6a9ea4602ea7376e2b6d5ccd1.1355320533.git.wpan@redhat.com>
  3 siblings, 2 replies; 15+ messages in thread
From: David Miller @ 2012-12-10 21:02 UTC (permalink / raw)
  To: wpan; +Cc: netdev, brutus

From: Weiping Pan <wpan@redhat.com>
Date: Wed,  5 Dec 2012 10:54:16 +0800

> Friends VS AF_UNIX
> Their call paths are almost the same, but AF_UNIX uses its own send/recv code
> with proper locks,
> so AF_UNIX's performance is much better than Friends.

While I understand the other portions of your analysis, this one
mystifies me.

In both cases, the sender has to queue the SKB onto the receiver's
queue.  And in both cases, the sender takes the lock on that queue.

So the locking contention really ought to be similar if not identical.

The only difference is that AF_UNIX takes the unix_sk()->lock of the
remote socket around these operations.

If that is enough of a synchronizer to "fix" the contention or reduce
it, then this would be very easy to test by adding a friend lock to
tcp_sk().
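
For reference, a minimal sketch of that experiment (the friend_lock field
and the helper below are hypothetical, not from any posted patch), mirroring
how AF_UNIX takes unix_sk(other)->lock around its queue operations:

/* hypothetical new field in struct tcp_sock:
 *	spinlock_t	friend_lock;
 */

static void tcp_friend_queue_skb(struct sock *friend, struct sk_buff *skb)
{
	spin_lock_bh(&tcp_sk(friend)->friend_lock);
	skb_set_owner_r(skb, friend);
	__skb_queue_tail(&friend->sk_receive_queue, skb);
	spin_unlock_bh(&tcp_sk(friend)->friend_lock);

	/* Wake the reader outside the lock. */
	friend->sk_data_ready(friend, 0);
}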

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH net-next 0/3 V4] net-tcp: TCP/IP stack bypass for loopback connections
  2012-12-10 21:02     ` [RFC PATCH net-next 0/3 V4] net-tcp: TCP/IP stack bypass for loopback connections David Miller
@ 2012-12-12 14:13       ` Weiping Pan
       [not found]       ` <117a10f9575d95d6a9ea4602ea7376e2b6d5ccd1.1355320533.git.wpan@redhat.com>
  1 sibling, 0 replies; 15+ messages in thread
From: Weiping Pan @ 2012-12-12 14:13 UTC (permalink / raw)
  To: David Miller; +Cc: wpan, netdev, brutus

On 12/11/2012 05:02 AM, David Miller wrote:
> From: Weiping Pan<wpan@redhat.com>
> Date: Wed,  5 Dec 2012 10:54:16 +0800
>
>> Friends VS AF_UNIX
>> Their call paths are almost the same, but AF_UNIX uses its own send/recv code
>> with proper locks,
>> so AF_UNIX's performance is much better than Friends.
Sorry, this statement is not correct.
In the TCP_STREAM case, if the message size is 16384, then AF_UNIX is much 
better than Friends.
If the message size is smaller, then Friends shows performance equal to 
AF_UNIX.
In TCP_RR, Friends shows performance equal to AF_UNIX, too.

> While I understand the other portions of your analysis, this one
> mystifies me.
>
> In both cases, the sender has to queue the SKB onto the receiver's
> queue.  And in both cases, the sender takes the lock on that queue.
>
> So the locking contention really ought to be similar if not identical.
>
> The only difference is that AF_UNIX takes the unix_sk()->lock of the
> remote socket around these operations.
>
> If that is enough of a synchronizer to "fix" the contention or reduce
> it, then this would be very easy to test by adding a friend lock to
> tcp_sk().

I make some experiments to reduce the use of lock,
some performance results will be followed up.

thanks
Weiping Pan

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH net-next 4/4 V4] try to fix performance regression
       [not found]       ` <117a10f9575d95d6a9ea4602ea7376e2b6d5ccd1.1355320533.git.wpan@redhat.com>
@ 2012-12-12 14:29         ` Weiping Pan
  2012-12-12 14:57           ` David Laight
  2012-12-12 16:25           ` Eric Dumazet
  0 siblings, 2 replies; 15+ messages in thread
From: Weiping Pan @ 2012-12-12 14:29 UTC (permalink / raw)
  To: davem; +Cc: brutus, netdev, Weiping Pan

1 do not share tail skb between sender and receiver
2 reduce the use of sock->sk_lock.slock

--------------------------------------------------------------------------
TCP friends performance results start


BASE means normal tcp with friends DISABLED.
AF_UNIX means sockets for local interprocess communication, for reference.
FRIENDS means tcp with friends ENABLED.
I set -s 51882 -m 16384 -M 87380 for all three kinds of sockets by default.
The first percentage number is FRIENDS/BASE.
The second percentage number is FRIENDS/AF_UNIX.
We set -i 10,2 -I 95,20 to stabilize the statistics.



      BASE    AF_UNIX    FRIENDS               TCP_STREAM
   7952.97   10864.86   13440.08  168%  123%



      BASE    AF_UNIX    FRIENDS               TCP_MAERTS
   6743.78          -   13809.97  204%    -%



      BASE    AF_UNIX    FRIENDS             TCP_SENDFILE
     11758          -      18483  157%    -%


TCP_SENDFILE cannot work with -i 10,2 -I 95,20 (strange), so I use the average.



        MS       BASE    AF_UNIX    FRIENDS            TCP_STREAM_MS
         1      10.70       5.40       4.02   37%   74%
         2      28.01       9.67       7.97   28%   82%
         4      55.53      19.78      16.48   29%   83%
         8     115.40      38.22      33.51   29%   87%
        16     227.31      81.06      67.70   29%   83%
        32     446.20     166.59     129.31   28%   77%
        64     849.04     336.77     259.43   30%   77%
       128    1440.50     661.88     530.43   36%   80%
       256    2404.70    1279.67    1029.15   42%   80%
       512    4331.53    2501.30    1942.21   44%   77%
      1024    6819.78    4622.37    4128.10   60%   89%
      2048   10544.60    6348.81    6349.59   60%  100%
      4096   12830.41    8324.43    7984.43   62%   95%
      8192   13462.65    8355.49   11079.37   82%  132%
     16384    9960.87   10840.13   13037.81  130%  120%
     32768    8749.31   11372.15   15087.08  172%  132%
     65536    7580.27   12150.23   14971.42  197%  123%
    131072    6727.74   11451.34   13604.78  202%  118%
    262144    7673.14   11613.10   11436.97  149%   98%
    524288    7366.17   11675.95   11559.43  156%   99%
   1048576    6608.57   11883.01   10103.20  152%   85%
MS means Message Size in bytes, that is -m -M for netperf



        RR       BASE    AF_UNIX    FRIENDS                TCP_RR_RR
         1   19716.88   34451.39   34574.12  175%  100%
         2   19836.74   34297.00   34671.29  174%  101%
         4   19874.71   34456.48   34552.13  173%  100%
         8   18882.93   34123.00   34661.48  183%  101%
        16   19179.09   34358.47   34599.16  180%  100%
        32   20140.08   34326.35   34616.30  171%  100%
        64   19473.39   34382.05   34583.10  177%  100%
       128   19699.62   34012.03   34566.14  175%  101%
       256   19740.44   34529.71   34624.07  175%  100%
       512   18929.46   33673.06   33932.83  179%  100%
      1024   18738.98   33724.78   33313.44  177%   98%
      2048   17315.61   32982.24   32361.39  186%   98%
      4096   16585.81   31345.85   31073.32  187%   99%
      8192   11933.16   27851.10   27166.94  227%   97%
     16384    9717.19   21746.12   22583.40  232%  103%
     32768    7044.35   12927.23   16253.26  230%  125%
     65536    5038.96    8945.74    7982.61  158%   89%
    131072    2860.64    4981.78    4417.16  154%   88%
    262144    1633.45    2765.27    2739.36  167%   99%
    524288     796.68    1429.79    1445.21  181%  101%
   1048576     379.78        per     730.05  192%     %
RR means Request Response Message Size in bytes, that is -r req,resp for netperf



        RR       BASE    AF_UNIX    FRIENDS               TCP_CRR_RR
         1    5531.49          -    5861.86  105%    -%
         2    5506.13          -    5845.53  106%    -%
         4    5523.27          -    5853.43  105%    -%
         8    5503.73          -    5836.44  106%    -%
        16    5516.23          -    5842.29  105%    -%
        32    5557.37          -    5858.29  105%    -%
        64    5517.51          -    5892.64  106%    -%
       128    5504.18          -    5841.44  106%    -%
       256    5512.82          -    5842.60  105%    -%
       512    5496.36          -    5837.72  106%    -%
      1024    5465.24          -    5827.99  106%    -%
      2048    5550.15          -    5812.88  104%    -%
      4096    5292.75          -    5824.45  110%    -%
      8192    4917.06          -    5705.12  116%    -%
     16384    4278.63          -    5318.39  124%    -%
     32768    3611.86          -    4930.30  136%    -%
     65536      77.35          -    3847.43 4974%    -%
    131072      47.65          -    2811.58 5900%    -%
    262144     805.13          -       4.88    0%    -%
    524288     583.08          -       4.78    0%    -%
   1048576     369.52          -       5.02    1%    -%
RR means Request Response Message Size in bytes, that is -r req,resp for netperf -H 127.0.0.1



TCP friends performance results end
--------------------------------------------------------------------------


Performance analysis:
1 Friends shows better performance than loopback in TCP_RR, TCP_MAERTS and
TCP_SENDFILE, and the same in TCP_CRR_RR.

2 In TCP_STREAM, Friends shows much worse performance (30%) than loopback if
the message size is small, and it shows worse performance (80%) than AF_UNIX.

3 Compared with the last performance report, Friends shows worse performance
in TCP_RR.

Friends VS AF_UNIX
I think the lock usage is very similar this time.
Maybe the locking contention is not the bottleneck?

Friends VS loopback
I have reduced the locking contention as much as possible,
but it still shows bad performance.
Maybe the locking contention is not the bottleneck?


Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 include/net/tcp.h |   10 --
 net/ipv4/tcp.c    |  327 ++++++++++++++++++++++-------------------------------
 2 files changed, 136 insertions(+), 201 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5f82770..80a8ec9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -688,15 +688,6 @@ void tcp_send_window_probe(struct sock *sk);
 #define TCPHDR_ECE 0x40
 #define TCPHDR_CWR 0x80
 
-/* If skb_get_friend() != NULL, TCP friends per packet state.
- */
-struct friend_skb_parm {
-	bool	tail_inuse;		/* In use by skb_get_friend() send while */
-					/* on sk_receive_queue for tail put */
-};
-
-#define TCP_FRIEND_CB(tcb) (&(tcb)->header.hf)
-
 /* This is what the send packet queuing engine uses to pass
  * TCP per-packet control information to the transmission code.
  * We also store the host-order sequence numbers in here too.
@@ -709,7 +700,6 @@ struct tcp_skb_cb {
 #if IS_ENABLED(CONFIG_IPV6)
 		struct inet6_skb_parm	h6;
 #endif
-		struct friend_skb_parm	hf;
 	} header;	/* For incoming frames		*/
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e9d82e0..f008d60 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -336,25 +336,24 @@ static inline int tcp_friend_validate(struct sock *sk, struct sock **friendp,
 	return 1;
 }
 
-static inline int tcp_friend_send_lock(struct sock *friend)
+static inline int tcp_friend_get_state(struct sock *friend)
 {
 	int err = 0;
 
 	spin_lock_bh(&friend->sk_lock.slock);
-	if (unlikely(friend->sk_shutdown & RCV_SHUTDOWN)) {
-		spin_unlock_bh(&friend->sk_lock.slock);
+	if (unlikely(friend->sk_shutdown & RCV_SHUTDOWN))
 		err = -ECONNRESET;
-	}
+	spin_unlock_bh(&friend->sk_lock.slock);
 
 	return err;
 }
 
-static inline void tcp_friend_recv_lock(struct sock *friend)
+static inline void tcp_friend_state_lock(struct sock *friend)
 {
 	spin_lock_bh(&friend->sk_lock.slock);
 }
 
-static void tcp_friend_unlock(struct sock *friend)
+static inline void tcp_friend_state_unlock(struct sock *friend)
 {
 	spin_unlock_bh(&friend->sk_lock.slock);
 }
@@ -639,71 +638,32 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(tcp_ioctl);
 
-/*
- * Friend receive_queue tail skb space? If true, set tail_inuse.
- * Else if RCV_SHUTDOWN, return *copy = -ECONNRESET.
- */
-static inline struct sk_buff *tcp_friend_tail(struct sock *friend, int *copy)
-{
-	struct sk_buff	*skb = NULL;
-	int		sz = 0;
-
-	if (skb_peek_tail(&friend->sk_receive_queue)) {
-		sz = tcp_friend_send_lock(friend);
-		if (!sz) {
-			skb = skb_peek_tail(&friend->sk_receive_queue);
-			if (skb && skb->friend) {
-				if (!*copy)
-					sz = skb_tailroom(skb);
-				else {
-					sz = *copy - skb->len;
-					if (sz < 0)
-						sz = 0;
-				}
-				if (sz > 0)
-					TCP_FRIEND_CB(TCP_SKB_CB(skb))->
-							tail_inuse = true;
-			}
-			tcp_friend_unlock(friend);
-		}
-	}
-
-	*copy = sz;
-	return skb;
-}
-
-static inline void tcp_friend_seq(struct sock *sk, int copy, int charge)
-{
-	struct sock	*friend = sk->sk_friend;
-	struct tcp_sock *tp = tcp_sk(friend);
-
-	if (charge) {
-		sk_mem_charge(friend, charge);
-		atomic_add(charge, &friend->sk_rmem_alloc);
-	}
-	tp->rcv_nxt += copy;
-	tp->rcv_wup += copy;
-	tcp_friend_unlock(friend);
-
-	tp = tcp_sk(sk);
-	tp->snd_nxt += copy;
-	tp->pushed_seq += copy;
-	tp->snd_una += copy;
-	tp->snd_up += copy;
-}
-
 static inline bool tcp_friend_push(struct sock *sk, struct sk_buff *skb)
 {
-	struct sock	*friend = sk->sk_friend;
-	int		wait = false;
+	struct sock *friend = sk->sk_friend;
+	struct tcp_sock *tp = NULL;
+	int wait = false;
+
+	tcp_friend_state_lock(friend);
 
 	skb_set_owner_r(skb, friend);
-	__skb_queue_tail(&friend->sk_receive_queue, skb);
 	if (!sk_rmem_schedule(friend, skb, skb->truesize))
 		wait = true;
+	__skb_queue_tail(&friend->sk_receive_queue, skb);
+
+	tcp_friend_state_unlock(friend);
 
-	tcp_friend_seq(sk, skb->len, 0);
-	if (skb == skb_peek(&friend->sk_receive_queue))
+	tp = tcp_sk(friend);
+	tp->rcv_nxt += skb->len;
+	tp->rcv_wup += skb->len;
+
+	tp = tcp_sk(sk);
+	tp->snd_nxt += skb->len;
+	tp->pushed_seq += skb->len;
+	tp->snd_una += skb->len;
+	tp->snd_up += skb->len;
+
+	if (skb_queue_len(&friend->sk_receive_queue) == 1)
 		friend->sk_data_ready(friend, 0);
 
 	return wait;
@@ -728,7 +688,6 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
 	tcb->seq     = tcb->end_seq = tp->write_seq;
 	if (sk->sk_friend) {
 		skb->friend = sk;
-		TCP_FRIEND_CB(tcb)->tail_inuse = false;
 		return;
 	}
 	skb->csum    = 0;
@@ -1048,8 +1007,17 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 		goto out_err;
 
+	if (friend) {
+		err = tcp_friend_get_state(friend);
+		if (err) {
+			sk->sk_err = -err;
+			err = -EPIPE;
+			goto out_err;
+		}
+	}
+
 	while (psize > 0) {
-		struct sk_buff *skb;
+		struct sk_buff *skb = NULL;
 		struct tcp_skb_cb *tcb;
 		struct page *page = pages[poffset / PAGE_SIZE];
 		int copy, i;
@@ -1059,12 +1027,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 
 		if (friend) {
 			copy = size_goal;
-			skb = tcp_friend_tail(friend, &copy);
-			if (copy < 0) {
-				sk->sk_err = -copy;
-				err = -EPIPE;
-				goto out_err;
-			}
+			if (skb)
+				copy = copy - skb->len;
+			else
+				copy = 0;
 		} else if (!tcp_send_head(sk)) {
 			skb = NULL;
 			copy = 0;
@@ -1078,9 +1044,17 @@ new_segment:
 			if (!sk_stream_memory_free(sk))
 				goto wait_for_sndbuf;
 
-			if (friend)
+			if (friend) {
+				if (skb) {
+					if (tcp_friend_push(sk, skb))
+						goto wait_for_sndbuf;
+				}
+
+				/*
+				 * new skb
+				 */
 				skb = tcp_friend_alloc_skb(sk, 0);
-			else
+			} else
 				skb = sk_stream_alloc_skb(sk, 0,
 							  sk->sk_allocation);
 			if (!skb)
@@ -1097,10 +1071,7 @@ new_segment:
 		i = skb_shinfo(skb)->nr_frags;
 		can_coalesce = skb_can_coalesce(skb, i, page, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
-			if (friend) {
-				if (TCP_FRIEND_CB(tcb)->tail_inuse)
-					TCP_FRIEND_CB(tcb)->tail_inuse = false;
-			} else
+			if (!friend)
 				tcp_mark_push(tp, skb);
 			goto new_segment;
 		}
@@ -1124,20 +1095,9 @@ new_segment:
 		psize -= copy;
 
 		if (friend) {
-			err = tcp_friend_send_lock(friend);
-			if (err) {
-				sk->sk_err = -err;
-				err = -EPIPE;
-				goto out_err;
-			}
 			tcb->end_seq += copy;
-			if (TCP_FRIEND_CB(tcb)->tail_inuse) {
-				TCP_FRIEND_CB(tcb)->tail_inuse = false;
-				tcp_friend_seq(sk, copy, copy);
-			} else {
-				if (tcp_friend_push(sk, skb))
-					goto wait_for_sndbuf;
-			}
+			if (tcp_friend_push(sk, skb))
+				goto wait_for_sndbuf;
 			if (!psize)
 				goto out;
 			continue;
@@ -1172,6 +1132,18 @@ wait_for_memory:
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 			goto do_error;
 
+		if (friend) {
+			if (skb) {
+				tcp_friend_state_lock(friend);
+				if (!sk_rmem_schedule(friend, skb, skb->truesize)) {
+					tcp_friend_state_unlock(friend);
+					goto wait_for_sndbuf;
+				}
+				tcp_friend_state_unlock(friend);
+				skb = NULL;
+			}
+		}
+
 		if (!friend)
 			mss_now = tcp_send_mss(sk, &size_goal, flags);
 	}
@@ -1266,7 +1238,7 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	struct iovec *iov;
 	struct sock *friend = sk->sk_friend;
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
+	struct sk_buff *skb = NULL;
 	struct tcp_skb_cb *tcb;
 	int iovlen, flags, err, copied = 0;
 	int mss_now = 0, size_goal = size, copied_syn = 0, offset = 0;
@@ -1330,6 +1302,15 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 	sg = !!(sk->sk_route_caps & NETIF_F_SG);
 
+	if (friend) {
+		err = tcp_friend_get_state(friend);
+		if (err) {
+			sk->sk_err = -err;
+			err = -EPIPE;
+			goto out_err;
+		}
+	}
+
 	while (--iovlen >= 0) {
 		size_t seglen = iov->iov_len;
 		unsigned char __user *from = iov->iov_base;
@@ -1350,12 +1331,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			int max = size_goal;
 
 			if (friend) {
-				skb = tcp_friend_tail(friend, &copy);
-				if (copy < 0) {
-					sk->sk_err = -copy;
-					err = -EPIPE;
-					goto out_err;
-				}
+				if (skb)
+					copy = skb_availroom(skb);
+				else
+					copy = 0;
 			} else {
 				skb = tcp_write_queue_tail(sk);
 				if (tcp_send_head(sk)) {
@@ -1370,9 +1349,21 @@ new_segment:
 				if (!sk_stream_memory_free(sk))
 					goto wait_for_sndbuf;
 
-				if (friend)
+				if (friend) {
+					if (skb) {
+						/*
+						 * Friend push old skb
+						 */
+
+						if (tcp_friend_push(sk, skb))
+							goto wait_for_sndbuf;
+					}
+
+					/*
+					 * new skb
+					 */
 					skb = tcp_friend_alloc_skb(sk, max);
-				else {
+				} else {
 					/* Allocate new segment. If the
 					 * interface is SG, allocate skb
 					 * fitting to single page.
@@ -1455,32 +1446,23 @@ new_segment:
 			copied += copy;
 			seglen -= copy;
 
-			if (friend) {
-				err = tcp_friend_send_lock(friend);
-				if (err) {
-					sk->sk_err = -err;
-					err = -EPIPE;
-					goto out_err;
-				}
-				tcb->end_seq += copy;
-				if (TCP_FRIEND_CB(tcb)->tail_inuse) {
-					TCP_FRIEND_CB(tcb)->tail_inuse = false;
-					tcp_friend_seq(sk, copy, 0);
-				} else {
-					if (tcp_friend_push(sk, skb))
-						goto wait_for_sndbuf;
-				}
-				continue;
-			}
-
 			tcb->end_seq += copy;
+
 			skb_shinfo(skb)->gso_segs = 0;
 
 			if (copied == copy)
 				tcb->tcp_flags &= ~TCPHDR_PSH;
 
-			if (seglen == 0 && iovlen == 0)
+			if (seglen == 0 && iovlen == 0) {
+				if (friend && skb) {
+					if (tcp_friend_push(sk, skb))
+						goto wait_for_sndbuf;
+				}
 				goto out;
+			}
+
+			if (friend)
+				continue;
 
 			if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
 				continue;
@@ -1501,6 +1483,17 @@ wait_for_memory:
 			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 				goto do_error;
 
+			if (friend) {
+				if (skb) {
+					tcp_friend_state_lock(friend);
+					if (!sk_rmem_schedule(friend, skb, skb->truesize)) {
+						tcp_friend_state_unlock(friend);
+						goto wait_for_sndbuf;
+					}
+					tcp_friend_state_unlock(friend);
+					skb = NULL;
+				}
+			}
 			if (!friend)
 				mss_now = tcp_send_mss(sk, &size_goal, flags);
 		}
@@ -1514,10 +1507,7 @@ out:
 
 do_fault:
 	if (skb->friend) {
-		if (TCP_FRIEND_CB(tcb)->tail_inuse)
-			TCP_FRIEND_CB(tcb)->tail_inuse = false;
-		else
-			__kfree_skb(skb);
+		__kfree_skb(skb);
 	} else if (!skb->len) {
 		tcp_unlink_write_queue(skb, sk);
 		/* It is the one place in all of TCP, except connection
@@ -1787,8 +1777,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	err = tcp_friend_validate(sk, &friend, &timeo);
 	if (err < 0)
 		return err;
-	if (friend)
-		tcp_friend_recv_lock(sk);
 
 	while ((skb = tcp_recv_skb(sk, seq, &offset, &len)) != NULL) {
 		if (len > 0) {
@@ -1803,9 +1791,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 					break;
 			}
 
-			if (friend)
-				tcp_friend_unlock(sk);
-
 			used = recv_actor(desc, skb, offset, len);
 			if (used < 0) {
 				if (!copied)
@@ -1817,21 +1802,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				offset += used;
 			}
 
-			if (friend)
-				tcp_friend_recv_lock(sk);
-			if (skb->friend) {
-				len = (u32)(TCP_SKB_CB(skb)->end_seq - seq);
-				if (len > 0) {
-					/*
-					 * Friend did an skb_put() while we
-					 * were away so process the same skb.
-					 */
-					if (!desc->count)
-						break;
-					tp->copied_seq = seq;
-					goto again;
-				}
-			} else {
+			if (!skb->friend) {
 				/*
 				 * If recv_actor drops the lock (e.g. TCP
 				 * splice receive) the skb pointer might be
@@ -1844,19 +1815,25 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 					break;
 			}
 		}
+
 		if (!skb->friend && tcp_hdr(skb)->fin) {
 			sk_eat_skb(sk, skb, false);
 			++seq;
 			break;
 		}
 		if (skb->friend) {
-			if (!TCP_FRIEND_CB(TCP_SKB_CB(skb))->tail_inuse) {
-				__skb_unlink(skb, &sk->sk_receive_queue);
-				__kfree_skb(skb);
-				tcp_friend_write_space(sk);
+			len = (u32)(TCP_SKB_CB(skb)->end_seq - seq);
+			if (len > 0) {
+				if (!desc->count)
+					break;
+				tp->copied_seq = seq;
+				goto again;
 			}
-			tcp_friend_unlock(sk);
-			tcp_friend_recv_lock(sk);
+			tcp_friend_state_lock(sk);
+			__skb_unlink(skb, &sk->sk_receive_queue);
+			__kfree_skb(skb);
+			tcp_friend_state_unlock(sk);
+			tcp_friend_write_space(sk);
 		} else
 			sk_eat_skb(sk, skb, 0);
 		if (!desc->count)
@@ -1866,7 +1843,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	tp->copied_seq = seq;
 
 	if (friend) {
-		tcp_friend_unlock(sk);
 		tcp_friend_write_space(sk);
 	} else {
 		tcp_rcv_space_adjust(sk);
@@ -1903,7 +1879,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	bool copied_early = false;
 	struct sk_buff *skb;
 	u32 urg_hole = 0;
-	bool locked = false;
 
 	lock_sock(sk);
 
@@ -1991,11 +1966,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		 * slock, end_seq updated, so we can only use the bytes
 		 * from *seq to end_seq!
 		 */
-		if (friend && !locked) {
-			tcp_friend_recv_lock(sk);
-			locked = true;
-		}
-
 		skb_queue_walk(&sk->sk_receive_queue, skb) {
 			tcb = TCP_SKB_CB(skb);
 			offset = *seq - tcb->seq;
@@ -2003,20 +1973,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 				if (skb->friend) {
 					used = (u32)(tcb->end_seq - *seq);
 					if (used > 0) {
-						tcp_friend_unlock(sk);
-						locked = false;
 						/* Can use it all */
 						goto found_ok_skb;
 					}
 					/* No data to copyout */
 					if (flags & MSG_PEEK)
 						continue;
-					if (!TCP_FRIEND_CB(tcb)->tail_inuse)
-						goto unlink;
-					break;
+					goto unlink;
 				}
-				tcp_friend_unlock(sk);
-				locked = false;
 			}
 
 			/* Now that we have two receive queues this
@@ -2043,11 +2007,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 		/* Well, if we have backlog, try to process it now yet. */
 
-		if (friend && locked) {
-			tcp_friend_unlock(sk);
-			locked = false;
-		}
-
 		if (copied >= target && !sk->sk_backlog.tail)
 			break;
 
@@ -2262,17 +2221,7 @@ do_prequeue:
 		len -= used;
 		offset += used;
 
-		tcp_rcv_space_adjust(sk);
-
-skip_copy:
-		if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
-			tp->urg_data = 0;
-			tcp_fast_path_check(sk);
-		}
-
 		if (skb->friend) {
-			tcp_friend_recv_lock(sk);
-			locked = true;
 			used = (u32)(tcb->end_seq - *seq);
 			if (used) {
 				/*
@@ -2280,29 +2229,28 @@ skip_copy:
 				 * so if more to do process the same skb.
 				 */
 				if (len > 0) {
-					tcp_friend_unlock(sk);
-					locked = false;
 					goto found_ok_skb;
 				}
 				continue;
 			}
-			if (TCP_FRIEND_CB(tcb)->tail_inuse) {
-				/* Give sendmsg a chance */
-				tcp_friend_unlock(sk);
-				locked = false;
-				continue;
-			}
 			if (!(flags & MSG_PEEK)) {
 		unlink:
+				tcp_friend_state_lock(sk);
 				__skb_unlink(skb, &sk->sk_receive_queue);
 				__kfree_skb(skb);
-				tcp_friend_unlock(sk);
-				locked = false;
+				tcp_friend_state_unlock(sk);
 				tcp_friend_write_space(sk);
 			}
 			continue;
 		}
 
+		tcp_rcv_space_adjust(sk);
+skip_copy:
+		if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
+			tp->urg_data = 0;
+			tcp_fast_path_check(sk);
+		}
+
 		if (offset < skb->len)
 			continue;
 		else if (tcp_hdr(skb)->fin)
@@ -2323,9 +2271,6 @@ skip_copy:
 		break;
 	} while (len > 0);
 
-	if (friend && locked)
-		tcp_friend_unlock(sk);
-
 	if (user_recv) {
 		if (!skb_queue_empty(&tp->ucopy.prequeue)) {
 			int chunk;
-- 
1.7.4.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* RE: [RFC PATCH net-next 4/4 V4] try to fix performance regression
  2012-12-12 14:29         ` [RFC PATCH net-next 4/4 V4] try to fix performance regression Weiping Pan
@ 2012-12-12 14:57           ` David Laight
  2012-12-13 14:05             ` Weiping Pan
  2012-12-12 16:25           ` Eric Dumazet
  1 sibling, 1 reply; 15+ messages in thread
From: David Laight @ 2012-12-12 14:57 UTC (permalink / raw)
  To: Weiping Pan, davem; +Cc: brutus, netdev

>         MS       BASE    AF_UNIX    FRIENDS            TCP_STREAM_MS
>          1      10.70       5.40       4.02   37%   74%
>          2      28.01       9.67       7.97   28%   82%
>          4      55.53      19.78      16.48   29%   83%
>          8     115.40      38.22      33.51   29%   87%
>         16     227.31      81.06      67.70   29%   83%
>         32     446.20     166.59     129.31   28%   77%
>         64     849.04     336.77     259.43   30%   77%
>        128    1440.50     661.88     530.43   36%   80%
>        256    2404.70    1279.67    1029.15   42%   80%
>        512    4331.53    2501.30    1942.21   44%   77%
>       1024    6819.78    4622.37    4128.10   60%   89%
>       2048   10544.60    6348.81    6349.59   60%  100%
>       4096   12830.41    8324.43    7984.43   62%   95%
>       8192   13462.65    8355.49   11079.37   82%  132%
>      16384    9960.87   10840.13   13037.81  130%  120%
>      32768    8749.31   11372.15   15087.08  172%  132%
>      65536    7580.27   12150.23   14971.42  197%  123%
>     131072    6727.74   11451.34   13604.78  202%  118%
>     262144    7673.14   11613.10   11436.97  149%   98%
>     524288    7366.17   11675.95   11559.43  156%   99%
>    1048576    6608.57   11883.01   10103.20  152%   85%
> MS means Message Size in bytes, that is -m -M for netperf

If I read that table correctly, it seems to imply that
something goes badly wrong for 'normal' TCP loopback
connections when the read/write size exceeds 8k.
Putting effort into fixing that would appear to be
more worthwhile than the 'friends' code.

	David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH net-next 4/4 V4] try to fix performance regression
  2012-12-12 14:29         ` [RFC PATCH net-next 4/4 V4] try to fix performance regression Weiping Pan
  2012-12-12 14:57           ` David Laight
@ 2012-12-12 16:25           ` Eric Dumazet
  2012-12-13 14:09             ` Weiping Pan
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2012-12-12 16:25 UTC (permalink / raw)
  To: Weiping Pan; +Cc: davem, brutus, netdev

On Wed, 2012-12-12 at 22:29 +0800, Weiping Pan wrote:

> 
>         MS       BASE    AF_UNIX    FRIENDS            TCP_STREAM_MS
>          1      10.70       5.40       4.02   37%   74%
>          2      28.01       9.67       7.97   28%   82%
>          4      55.53      19.78      16.48   29%   83%
>          8     115.40      38.22      33.51   29%   87%
>         16     227.31      81.06      67.70   29%   83%
>         32     446.20     166.59     129.31   28%   77%
>         64     849.04     336.77     259.43   30%   77%
>        128    1440.50     661.88     530.43   36%   80%
>        256    2404.70    1279.67    1029.15   42%   80%
>        512    4331.53    2501.30    1942.21   44%   77%
>       1024    6819.78    4622.37    4128.10   60%   89%
>       2048   10544.60    6348.81    6349.59   60%  100%
>       4096   12830.41    8324.43    7984.43   62%   95%
>       8192   13462.65    8355.49   11079.37   82%  132%
>      16384    9960.87   10840.13   13037.81  130%  120%
>      32768    8749.31   11372.15   15087.08  172%  132%
>      65536    7580.27   12150.23   14971.42  197%  123%
>     131072    6727.74   11451.34   13604.78  202%  118%
>     262144    7673.14   11613.10   11436.97  149%   98%
>     524288    7366.17   11675.95   11559.43  156%   99%
>    1048576    6608.57   11883.01   10103.20  152%   85%
> MS means Message Size in bytes, that is -m -M for netperf

I can't reproduce your strange numbers here; they make no sense to me.

for s in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768
65536 131072 262144 524288 1048576
do
 ./netperf -- -m $s -M $s | tail -n1
done

Results :

 87380  16384      1    10.00      34.68   
 87380  16384      2    10.00      68.07   
 87380  16384      4    10.00     126.27   
 87380  16384      8    10.00     284.50   
 87380  16384     16    10.00     574.38   
 87380  16384     32    10.00    1091.74   
 87380  16384     64    10.00    2130.23   
 87380  16384    128    10.00    4001.83   
 87380  16384    256    10.00    7666.01   
 87380  16384    512    10.00    13425.81   
 87380  16384   1024    10.00    21146.43   
 87380  16384   2048    10.00    28551.42   
 87380  16384   4096    10.00    37878.95   
 87380  16384   8192    10.00    42507.23   
 87380  16384  16384    10.00    46782.53   
 87380  16384  32768    10.00    42410.97   
 87380  16384  65536    10.00    43053.09   
 87380  16384 131072    10.00    44504.20   
 87380  16384 262144    10.00    50211.74   
 87380  16384 524288    10.00    54004.23   
 87380  16384 1048576    10.00    53852.26   

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH net-next 4/4 V4] try to fix performance regression
  2012-12-12 14:57           ` David Laight
@ 2012-12-13 14:05             ` Weiping Pan
  2012-12-13 18:25               ` Rick Jones
  0 siblings, 1 reply; 15+ messages in thread
From: Weiping Pan @ 2012-12-13 14:05 UTC (permalink / raw)
  To: David Laight; +Cc: davem, brutus, netdev

On 12/12/2012 10:57 PM, David Laight wrote:
>>          MS       BASE    AF_UNIX    FRIENDS            TCP_STREAM_MS
>>           1      10.70       5.40       4.02   37%   74%
>>           2      28.01       9.67       7.97   28%   82%
>>           4      55.53      19.78      16.48   29%   83%
>>           8     115.40      38.22      33.51   29%   87%
>>          16     227.31      81.06      67.70   29%   83%
>>          32     446.20     166.59     129.31   28%   77%
>>          64     849.04     336.77     259.43   30%   77%
>>         128    1440.50     661.88     530.43   36%   80%
>>         256    2404.70    1279.67    1029.15   42%   80%
>>         512    4331.53    2501.30    1942.21   44%   77%
>>        1024    6819.78    4622.37    4128.10   60%   89%
>>        2048   10544.60    6348.81    6349.59   60%  100%
>>        4096   12830.41    8324.43    7984.43   62%   95%
>>        8192   13462.65    8355.49   11079.37   82%  132%
>>       16384    9960.87   10840.13   13037.81  130%  120%
>>       32768    8749.31   11372.15   15087.08  172%  132%
>>       65536    7580.27   12150.23   14971.42  197%  123%
>>      131072    6727.74   11451.34   13604.78  202%  118%
>>      262144    7673.14   11613.10   11436.97  149%   98%
>>      524288    7366.17   11675.95   11559.43  156%   99%
>>     1048576    6608.57   11883.01   10103.20  152%   85%
>> MS means Message Size in bytes, that is -m -M for netperf
> If I read that table correctly, it seems to imply that
> something goes badly wrong for 'normal' TCP loopback
> connections when the read/write size exceeds 8k.
> Putting effort into fixing that would appear to be
> more worthwhile than the 'friends' code.
>
> 	David
>
Hi, David,

In my test program, I run normal tcp loopback and then friends for each 
message size, and that generates these strange numbers.

But if I just run normal tcp loopback for each message size, then the 
performance is stable.
[root@intel-s3e3432-01 ~]# cat base.sh
for s in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 
65536 131072 262144 524288 1048576
do
netperf -i -2,10 -I 95,20 -- -m $s -M $s | tail -n1
done


  87380  16384      1    10.09      15.51
  87380  16384      2    10.01      31.39
  87380  16384      4    10.00      55.78
  87380  16384      8    10.00     115.17
  87380  16384     16    10.00     231.66
  87380  16384     32    10.00     452.42
  87380  16384     64    10.00     859.92
  87380  16384    128    10.00    1464.91
  87380  16384    256    10.00    2613.12
  87380  16384    512    10.00    4338.88
  87380  16384   1024    10.00    7174.22
  87380  16384   2048    10.00    10452.84
  87380  16384   4096    10.00    11932.33
  87380  16384   8192    10.00    13750.49
  87380  16384  16384    10.00    13196.98
  87380  16384  32768    10.00    14881.25
  87380  16384  65536    10.00    13685.36
  87380  16384 131072    10.00    16088.71
  87380  16384 262144    10.00    17193.86
  87380  16384 524288    10.00    16696.07
  87380  16384 1048576    10.00    13638.13

thanks
Weiping Pan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH net-next 4/4 V4] try to fix performance regression
  2012-12-12 16:25           ` Eric Dumazet
@ 2012-12-13 14:09             ` Weiping Pan
  0 siblings, 0 replies; 15+ messages in thread
From: Weiping Pan @ 2012-12-13 14:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, brutus, netdev

On 12/13/2012 12:25 AM, Eric Dumazet wrote:
> On Wed, 2012-12-12 at 22:29 +0800, Weiping Pan wrote:
>
>>          MS       BASE    AF_UNIX    FRIENDS            TCP_STREAM_MS
>>           1      10.70       5.40       4.02   37%   74%
>>           2      28.01       9.67       7.97   28%   82%
>>           4      55.53      19.78      16.48   29%   83%
>>           8     115.40      38.22      33.51   29%   87%
>>          16     227.31      81.06      67.70   29%   83%
>>          32     446.20     166.59     129.31   28%   77%
>>          64     849.04     336.77     259.43   30%   77%
>>         128    1440.50     661.88     530.43   36%   80%
>>         256    2404.70    1279.67    1029.15   42%   80%
>>         512    4331.53    2501.30    1942.21   44%   77%
>>        1024    6819.78    4622.37    4128.10   60%   89%
>>        2048   10544.60    6348.81    6349.59   60%  100%
>>        4096   12830.41    8324.43    7984.43   62%   95%
>>        8192   13462.65    8355.49   11079.37   82%  132%
>>       16384    9960.87   10840.13   13037.81  130%  120%
>>       32768    8749.31   11372.15   15087.08  172%  132%
>>       65536    7580.27   12150.23   14971.42  197%  123%
>>      131072    6727.74   11451.34   13604.78  202%  118%
>>      262144    7673.14   11613.10   11436.97  149%   98%
>>      524288    7366.17   11675.95   11559.43  156%   99%
>>     1048576    6608.57   11883.01   10103.20  152%   85%
>> MS means Message Size in bytes, that is -m -M for netperf
> I cant reproduce your strange numbers here, they make no sense to me.
>
> for s in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768
> 65536 131072 262144 524288 1048576
> do
>   ./netperf -- -m $s -M $s | tail -n1
> done
>
> Results :
>
> 87380  16384      1    10.00      34.68
>   87380  16384      2    10.00      68.07
>   87380  16384      4    10.00     126.27
>   87380  16384      8    10.00     284.50
>   87380  16384     16    10.00     574.38
>   87380  16384     32    10.00    1091.74
>   87380  16384     64    10.00    2130.23
>   87380  16384    128    10.00    4001.83
>   87380  16384    256    10.00    7666.01
>   87380  16384    512    10.00    13425.81
>   87380  16384   1024    10.00    21146.43
>   87380  16384   2048    10.00    28551.42
>   87380  16384   4096    10.00    37878.95
>   87380  16384   8192    10.00    42507.23
>   87380  16384  16384    10.00    46782.53
>   87380  16384  32768    10.00    42410.97
>   87380  16384  65536    10.00    43053.09
>   87380  16384 131072    10.00    44504.20
>   87380  16384 262144    10.00    50211.74
>   87380  16384 524288    10.00    54004.23
>   87380  16384 1048576    10.00    53852.26
>
>
>
Hi, Eric,

In my test program, I run normal tcp loopback and then friends for each 
message size, and that generates these strange numbers.

But if I just run normal tcp loopback for each message size, then the 
performance is stable.

Maybe I should clean the environment before each test, for example by 
dropping caches.

thanks
Weiping Pan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH net-next 4/4 V4] try to fix performance regression
  2012-12-13 14:05             ` Weiping Pan
@ 2012-12-13 18:25               ` Rick Jones
  2012-12-14  5:53                 ` Weiping Pan
  0 siblings, 1 reply; 15+ messages in thread
From: Rick Jones @ 2012-12-13 18:25 UTC (permalink / raw)
  To: Weiping Pan; +Cc: David Laight, davem, brutus, netdev

On 12/13/2012 06:05 AM, Weiping Pan wrote:
> But if I just run normal tcp loopback for each message size, then the
> performance is stable.
> [root@intel-s3e3432-01 ~]# cat base.sh
> for s in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768
> 65536 131072 262144 524288 1048576
> do
> netperf -i -2,10 -I 95,20 -- -m $s -M $s | tail -n1
> done

The -i option goes max,min iterations:

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#index-g_t_002di_002c-Global-28

and src/netsh.c will apply some silent clipping to that:


     case 'i':
       /* set the iterations min and max for confidence intervals */
       break_args(optarg,arg1,arg2);
       if (arg1[0]) {
	iteration_max = convert(arg1);
       }
       if (arg2[0] ) {
	iteration_min = convert(arg2);
       }
       /* if the iteration_max is < iteration_min make iteration_max
	 equal iteration_min */
       if (iteration_max < iteration_min) iteration_max = iteration_min;
       /* limit minimum to 3 iterations */
       if (iteration_max < 3) iteration_max = 3;
       if (iteration_min < 3) iteration_min = 3;
       /* limit maximum to 30 iterations */
       if (iteration_max > 30) iteration_max = 30;
       if (iteration_min > 30) iteration_min = 30;
       if (confidence_level == 0) confidence_level = 99;
       if (interval == 0.0) interval = 0.05; /* five percent */
       break;

So, what will happen with your netperf command line above is it will set 
iteration max to 10 iterations and it will always run 10 iterations 
since min will equal max.  If you want it to possibly terminate sooner 
upon hitting the confidence intervals you would want to go with -i 10,3. 
  That will have netperf always run at least three and no more than 10 
iterations.

If I'm not mistaken, the use of the "| tail -n 1" there will cause the 
"classic" confidence intervals not met warning to be tossed (unless I 
suppose it is actually going to stderr?).

If you use the "omni" tests directly rather than via "migration" you 
will no longer get warnings about not hitting the confidence interval, 
but you can have netperf emit the confidence level it actually achieved 
as well as the number of iterations it took to get there.  You would use 
the omni output selection to do that.

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Omni-Output-Selection


These may have been mentioned before...

Judging from that command line you have the potential variability of the 
socket buffer auto-tuning.  Does AF_UNIX do the same sort of auto 
tuning?  It may be desirable to add some test-specific -s and -S options 
to have a fixed socket buffer size.

Since the MTU for loopback is ~16K, the send sizes below that will 
probably have differing interactions with the Nagle algorithm. 
Particularly as I suspect the timing will differ between friends and no 
friends.

I would guess the most "consistent" comparison with AF_UNIX would be 
when Nagle is disabled for the TCP_STREAM tests.  That would be a 
test-specific -D option.

Perhaps a more "stable" way to compare friends, no-friends and unix 
would be to use the _RR tests.  That will be a more direct, less-prone 
to other heuristics measure of path-length differences - both in the 
reported transactions per second and in any CPU utilization/service 
demand if you enable that via -c.  I'm not sure it would be necessary to 
take the request/response size out beyond a couple KB.  Take it out to 
the MB level and you will probably return to the question of auto-tuning 
of the socket buffer sizes.

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH net-next 4/4 V4] try to fix performance regression
  2012-12-13 18:25               ` Rick Jones
@ 2012-12-14  5:53                 ` Weiping Pan
  0 siblings, 0 replies; 15+ messages in thread
From: Weiping Pan @ 2012-12-14  5:53 UTC (permalink / raw)
  To: Rick Jones; +Cc: David Laight, davem, brutus, netdev

On 12/14/2012 02:25 AM, Rick Jones wrote:
> On 12/13/2012 06:05 AM, Weiping Pan wrote:
>> But if I just run normal tcp loopback for each message size, then the
>> performance is stable.
>> [root@intel-s3e3432-01 ~]# cat base.sh
>> for s in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768
>> 65536 131072 262144 524288 1048576
>> do
>> netperf -i -2,10 -I 95,20 -- -m $s -M $s | tail -n1
>> done
>
> The -i option goes max,min iterations:
>
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#index-g_t_002di_002c-Global-28 
>
>
> and src/netsh.c will apply some silent clipping to that:
>
>
>     case 'i':
>       /* set the iterations min and max for confidence intervals */
>       break_args(optarg,arg1,arg2);
>       if (arg1[0]) {
>     iteration_max = convert(arg1);
>       }
>       if (arg2[0] ) {
>     iteration_min = convert(arg2);
>       }
>       /* if the iteration_max is < iteration_min make iteration_max
>      equal iteration_min */
>       if (iteration_max < iteration_min) iteration_max = iteration_min;
>       /* limit minimum to 3 iterations */
>       if (iteration_max < 3) iteration_max = 3;
>       if (iteration_min < 3) iteration_min = 3;
>       /* limit maximum to 30 iterations */
>       if (iteration_max > 30) iteration_max = 30;
>       if (iteration_min > 30) iteration_min = 30;
>       if (confidence_level == 0) confidence_level = 99;
>       if (interval == 0.0) interval = 0.05; /* five percent */
>       break;
>
> So, what will happen with your netperf command line above is it will 
> set iteration max to 10 iterations and it will always run 10 
> iterations since min will equal max.  If you want it to possibly 
> terminate sooner upon hitting the confidence intervals you would want 
> to go with -i 10,3.  That will have netperf always run at least three 
> and no more than 10 iterations.
Yes, I misread the manual, it should be "-i 10,3".

>
> If I'm not mistaken, the use of the "| tail -n 1" there will cause the 
> "classic" confidence intervals not met warning to be tossed (unless I 
> suppose it is actually going to stderr?).
Yes, I saw that warning.
>
> If you use the "omni" tests directly rather than via "migration" you 
> will no longer get warnings about not hitting the confidence interval, 
> but you can have netperf emit the confidence level it actually 
> achieved as well as the number of iterations it took to get there.  
> You would use the omni output selection to do that.
>
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Omni-Output-Selection 
>
>
>
> These may have been mentioned before...
>
> Judging from that command line you have the potential variability of 
> the socket buffer auto-tuning.  Does AF_UNIX do the same sort of auto 
> tuning?  It may be desirable to add some test-specific -s and -S 
> options to have a fixed socket buffer size.

I set -s 51882 -m 16384 -M 87380 for all three kinds of sockets by 
default.
>
> Since the MTU for loopback is ~16K, the send sizes below that will 
> probably have differing interactions with the Nagle algorithm. 
> Particularly as I suspect the timing will differ between friends and 
> no friends.
>
> I would guess the most "consistent" comparison with AF_UNIX would be 
> when Nagle is disabled for the TCP_STREAM tests.  That would be a 
> test-specific -D option.
>
> Perhaps a more "stable" way to compare friends, no-friends and unix 
> would be to use the _RR tests.  That will be a more direct, less-prone 
> to other heuristics measure of path-length differences - both in the 
> reported transactions per second and in any CPU utilization/service 
> demand if you enable that via -c.  I'm not sure it would be necessary 
> to take the request/response size out beyond a couple KB.  Take it out 
> to the MB level and you will probably return to the question of 
> auto-tuning of the socket buffer sizes.
Good suggestion !
>
> happy benchmarking,
>
> rick jones

Rick, thanks !

Weiping Pan

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-12-14  5:53 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-18 10:19 Fwd: Re: [PATCH v3] net-tcp: TCP/IP stack bypass for loopback connections Weiping Pan
2012-10-18 12:23 ` Bruce Curtis
2012-12-05  2:54   ` [RFC PATCH net-next 0/3 V4] " Weiping Pan
2012-12-05  2:54     ` [PATCH 1/3] Bruce's original tcp friend V3 Weiping Pan
2012-12-05  2:54     ` [PATCH 2/3] fix panic in tcp_close() Weiping Pan
2012-12-05  2:54     ` [PATCH 3/3] delete request_sock->friend Weiping Pan
2012-12-10 21:02     ` [RFC PATCH net-next 0/3 V4] net-tcp: TCP/IP stack bypass for loopback connections David Miller
2012-12-12 14:13       ` Weiping Pan
     [not found]       ` <117a10f9575d95d6a9ea4602ea7376e2b6d5ccd1.1355320533.git.wpan@redhat.com>
2012-12-12 14:29         ` [RFC PATCH net-next 4/4 V4] try to fix performance regression Weiping Pan
2012-12-12 14:57           ` David Laight
2012-12-13 14:05             ` Weiping Pan
2012-12-13 18:25               ` Rick Jones
2012-12-14  5:53                 ` Weiping Pan
2012-12-12 16:25           ` Eric Dumazet
2012-12-13 14:09             ` Weiping Pan
