* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
@ 2019-04-07 21:28 richard.purdie
  2019-04-24 14:51 ` Bruno Prémont
  0 siblings, 1 reply; 14+ messages in thread
From: richard.purdie @ 2019-04-07 21:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, Yuchung Cheng, David S. Miller, netdev,
	Alexander Kanavin, Bruce Ashfield

Hi,

I've been chasing down why a Python test from the Python 3 test suite
started failing, and it seems to point to this kernel change in the
networking stack.

In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:

$ python3 -m test test_httplib -v
== CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
== Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
== cwd: /var/volatile/tmp/test_python_288
== CPU count: 1
== encodings: locale=UTF-8, FS=utf-8
[...]
test_response_fileno (test.test_httplib.BasicTest) ... 

and it hangs in test_response_fileno.

The test in question comes from Lib/test/test_httplib.py in the Python
source tree, and the code is:

    def test_response_fileno(self):
        # Make sure fd returned by fileno is valid.
        serv = socket.socket(
            socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
        self.addCleanup(serv.close)
        serv.bind((HOST, 0))
        serv.listen()

        result = None
        def run_server():
            [conn, address] = serv.accept()
            with conn, conn.makefile("rb") as reader:
                # Read the request header until a blank line
                while True:
                    line = reader.readline()
                    if not line.rstrip(b"\r\n"):
                        break
                conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
                nonlocal result
                result = reader.read()

        thread = threading.Thread(target=run_server)
        thread.start()
        self.addCleanup(thread.join, float(1))
        conn = client.HTTPConnection(*serv.getsockname())
        conn.request("CONNECT", "dummy:1234")
        response = conn.getresponse()
        try:
            self.assertEqual(response.status, client.OK)
            s = socket.socket(fileno=response.fileno())
            try:
                s.sendall(b"proxied data\n")
            finally:
                s.detach()
        finally:
            response.close()
            conn.close()
        thread.join()
        self.assertEqual(result, b"proxied data\n")

I was hoping someone with more understanding of the networking stack
could look at this and tell whether it's a bug in the Python test or in
the kernel change, or otherwise give a pointer to where the problem
might be? I'll freely admit this is not an area I know much about.

Cheers,

Richard



* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-07 21:28 [PATCH net-next 2/3] tcp: implement coalescing on backlog queue richard.purdie
@ 2019-04-24 14:51 ` Bruno Prémont
  2019-04-24 15:47   ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Bruno Prémont @ 2019-04-24 14:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: richard.purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

Hi Eric,

I'm seeing issues with this patch as well, not as regularly as Richard
but still fairly often (roughly one in 30-50 TCP sessions).

In my case I have a virtual machine (on VMware) with this patch where
NGINX, acting as a reverse proxy, misses part (the end) of the payload from
its upstream and times out on the upstream connection (while according to
tcpdump all packets, including the upstream's FIN, were sent and the
upstream did get ACKs from the VM).

From what browsers receive from NGINX it feels as if, at some point, reading
from the socket or waiting for data using select() never returned data that
had arrived, since more than just the EOF is missing.

The upstream is a hardware machine in the same subnet.

My VM is using the VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
as its network adapter, which lists the following features:

rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]


I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.

Cheers,
Bruno

On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:
> Hi,
>
> I've been chasing down why a python test from the python3 testsuite
> started failing and it seems to point to this kernel change in the
> networking stack.
>
> In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
>
> $ python3 -m test test_httplib -v
> == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> == cwd: /var/volatile/tmp/test_python_288
> == CPU count: 1
> == encodings: locale=UTF-8, FS=utf-8
> [...]
> test_response_fileno (test.test_httplib.BasicTest) ... 
>
> and it hangs in test_response_fileno.
>
> The test in question comes from Lib/test/test_httplib.py in the python
> source tree and the code is:
>
>     def test_response_fileno(self):
>         # Make sure fd returned by fileno is valid.
>         serv = socket.socket(
>             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
>         self.addCleanup(serv.close)
>         serv.bind((HOST, 0))
>         serv.listen()
>
>         result = None
>         def run_server():
>             [conn, address] = serv.accept()
>             with conn, conn.makefile("rb") as reader:
>                 # Read the request header until a blank line
>                 while True:
>                     line = reader.readline()
>                     if not line.rstrip(b"\r\n"):
>                         break
>                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
>                 nonlocal result
>                 result = reader.read()
>
>         thread = threading.Thread(target=run_server)
>         thread.start()
>         self.addCleanup(thread.join, float(1))
>         conn = client.HTTPConnection(*serv.getsockname())
>         conn.request("CONNECT", "dummy:1234")
>         response = conn.getresponse()
>         try:
>             self.assertEqual(response.status, client.OK)
>             s = socket.socket(fileno=response.fileno())
>             try:
>                 s.sendall(b"proxied data\n")
>             finally:
>                 s.detach()
>         finally:
>             response.close()
>             conn.close()
>         thread.join()
>         self.assertEqual(result, b"proxied data\n")
>
> I was hoping someone with more understanding of the networking stack
> could look at this and tell whether its a bug in the python test, the
> kernel change or otherwise give a pointer to where the problem might
> be? I'll freely admit this is not an area I know much about.
>
> Cheers,
>
> Richard
>
>
>


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-24 14:51 ` Bruno Prémont
@ 2019-04-24 15:47   ` Eric Dumazet
  2019-04-25  7:55     ` Bruno Prémont
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2019-04-24 15:47 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: richard.purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont <bonbons@sysophe.eu> wrote:
>
> Hi Eric,
>
> I'm seeing issues with this patch as well, not as regular as for
> Richard but still (about up to one in 30-50 TCP sessions).
>
> In my case I have a virtual machine (on VMWare) with this patch where
> NGINX as reverse proxy misses part (end) of payload from its upstream
> and times out on the upstream connection (while according to tcpdump all
> packets including upstream's FIN were sent and the upstream did get
> ACKs from the VM).
>
> From when browsers get from NGINX it feels as if at some point reading
> from the socket or waiting for data using select() never returned data
> that arrived as more than just EOF is missing.
>
> The upstream is a hardware machine in the same subnet.
>
> My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> as network adapter which lists the following features:
>

Hi Bruno.

I suspect an EPOLLIN notification being lost by the application.

The fact that the TCP backlog contains 1 instead of 2+ packets should not
change stack behavior; this packet should eventually land in the socket
receive queue.

Are you using epoll() in edge-triggered mode? You mention select(), but
select() is a rather old and inefficient API.

Could you watch/report the output of "ss -temoi" for the frozen TCP flow?

This might give us a clue about packets being dropped, say if the
accumulated packet became too big.
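
For reference, the read-side discipline an edge-triggered consumer has to
follow looks roughly like the sketch below (a generic illustration, not
nginx's actual code; the buffer size and error handling are placeholders).
With EPOLLET the fd is only reported again when new data arrives, so an
application that stops reading before hitting EAGAIN can be left sitting on
already-queued data, including a coalesced FIN, with no further wakeup:

/*
 * Minimal sketch, not nginx code: drain a non-blocking socket after an
 * EPOLLET notification.  Edge-triggered epoll only reports the fd again
 * when new data arrives; if the peer has already sent its last data and
 * its FIN, stopping before EAGAIN means no further wakeup will come.
 */
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

static void drain_until_eagain(int fd)
{
    char buf[4096];

    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);

        if (n > 0)
            continue;               /* consume and keep reading */
        if (n == 0) {               /* FIN: peer closed, everything read */
            close(fd);
            return;
        }
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return;                 /* drained; wait for the next edge */
        close(fd);                  /* real error */
        return;
    }
}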



> rx-checksumming: on
> tx-checksumming: on
>         tx-checksum-ipv4: off [fixed]
>         tx-checksum-ip-generic: on
>         tx-checksum-ipv6: off [fixed]
>         tx-checksum-fcoe-crc: off [fixed]
>         tx-checksum-sctp: off [fixed]
> scatter-gather: on
>         tx-scatter-gather: on
>         tx-scatter-gather-fraglist: off [fixed]
> tcp-segmentation-offload: on
>         tx-tcp-segmentation: on
>         tx-tcp-ecn-segmentation: off [fixed]
>         tx-tcp-mangleid-segmentation: off
>         tx-tcp6-segmentation: on
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: on
> rx-vlan-offload: on
> tx-vlan-offload: on
> ntuple-filters: off [fixed]
> receive-hashing: off [fixed]
> highdma: on
> rx-vlan-filter: on [fixed]
> vlan-challenged: off [fixed]
> tx-lockless: off [fixed]
> netns-local: off [fixed]
> tx-gso-robust: off [fixed]
> tx-fcoe-segmentation: off [fixed]
> tx-gre-segmentation: off [fixed]
> tx-gre-csum-segmentation: off [fixed]
> tx-ipxip4-segmentation: off [fixed]
> tx-ipxip6-segmentation: off [fixed]
> tx-udp_tnl-segmentation: off [fixed]
> tx-udp_tnl-csum-segmentation: off [fixed]
> tx-gso-partial: off [fixed]
> tx-sctp-segmentation: off [fixed]
> tx-esp-segmentation: off [fixed]
> tx-udp-segmentation: off [fixed]
> fcoe-mtu: off [fixed]
> tx-nocache-copy: off
> loopback: off [fixed]
> rx-fcs: off [fixed]
> rx-all: off [fixed]
> tx-vlan-stag-hw-insert: off [fixed]
> rx-vlan-stag-hw-parse: off [fixed]
> rx-vlan-stag-filter: off [fixed]
> l2-fwd-offload: off [fixed]
> hw-tc-offload: off [fixed]
> esp-hw-offload: off [fixed]
> esp-tx-csum-hw-offload: off [fixed]
> rx-udp_tunnel-port-offload: off [fixed]
> tls-hw-tx-offload: off [fixed]
> tls-hw-rx-offload: off [fixed]
> rx-gro-hw: off [fixed]
> tls-hw-record: off [fixed]
>
>
> I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
>
> Cheers,
> Bruno
>
> On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:
> > Hi,
> >
> > I've been chasing down why a python test from the python3 testsuite
> > started failing and it seems to point to this kernel change in the
> > networking stack.
> >
> > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
> >
> > $ python3 -m test test_httplib -v
> > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > == cwd: /var/volatile/tmp/test_python_288
> > == CPU count: 1
> > == encodings: locale=UTF-8, FS=utf-8
> > [...]
> > test_response_fileno (test.test_httplib.BasicTest) ...
> >
> > and it hangs in test_response_fileno.
> >
> > The test in question comes from Lib/test/test_httplib.py in the python
> > source tree and the code is:
> >
> >     def test_response_fileno(self):
> >         # Make sure fd returned by fileno is valid.
> >         serv = socket.socket(
> >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> >         self.addCleanup(serv.close)
> >         serv.bind((HOST, 0))
> >         serv.listen()
> >
> >         result = None
> >         def run_server():
> >             [conn, address] = serv.accept()
> >             with conn, conn.makefile("rb") as reader:
> >                 # Read the request header until a blank line
> >                 while True:
> >                     line = reader.readline()
> >                     if not line.rstrip(b"\r\n"):
> >                         break
> >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> >                 nonlocal result
> >                 result = reader.read()
> >
> >         thread = threading.Thread(target=run_server)
> >         thread.start()
> >         self.addCleanup(thread.join, float(1))
> >         conn = client.HTTPConnection(*serv.getsockname())
> >         conn.request("CONNECT", "dummy:1234")
> >         response = conn.getresponse()
> >         try:
> >             self.assertEqual(response.status, client.OK)
> >             s = socket.socket(fileno=response.fileno())
> >             try:
> >                 s.sendall(b"proxied data\n")
> >             finally:
> >                 s.detach()
> >         finally:
> >             response.close()
> >             conn.close()
> >         thread.join()
> >         self.assertEqual(result, b"proxied data\n")
> >
> > I was hoping someone with more understanding of the networking stack
> > could look at this and tell whether its a bug in the python test, the
> > kernel change or otherwise give a pointer to where the problem might
> > be? I'll freely admit this is not an area I know much about.
> >
> > Cheers,
> >
> > Richard
> >
> >
> >


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-24 15:47   ` Eric Dumazet
@ 2019-04-25  7:55     ` Bruno Prémont
  2019-04-25 13:13       ` Bruno Prémont
  0 siblings, 1 reply; 14+ messages in thread
From: Bruno Prémont @ 2019-04-25  7:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: richard.purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

Hi Eric,

On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:
> On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:
> >
> > Hi Eric,
> >
> > I'm seeing issues with this patch as well, not as regular as for
> > Richard but still (about up to one in 30-50 TCP sessions).
> >
> > In my case I have a virtual machine (on VMWare) with this patch where
> > NGINX as reverse proxy misses part (end) of payload from its upstream
> > and times out on the upstream connection (while according to tcpdump all
> > packets including upstream's FIN were sent and the upstream did get
> > ACKs from the VM).
> >
> > From when browsers get from NGINX it feels as if at some point reading
> > from the socket or waiting for data using select() never returned data
> > that arrived as more than just EOF is missing.
> >
> > The upstream is a hardware machine in the same subnet.
> >
> > My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > as network adapter which lists the following features:
> >  
> 
> Hi Bruno.
> 
> I suspect a EPOLLIN notification being lost by the application.
> 
> Fact that TCP backlog contains 1 instead of 2+ packets should not
> change stack behavior,
> this packet should land into socket receive queue eventually.
> 
> Are you using epoll() in Edge Trigger mode. You mention select() but
> select() is a rather old and inefficient API.

nginx is using epoll (cf. http://nginx.org/en/docs/events.html)

For source, see here
https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70

> Could you watch/report the output of " ss -temoi "  for the frozen TCP flow ?

Here it is, from three distinct reproduction attempts:
State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
         skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156


ESTAB      0       0       2001:a18:1:6::228:33572  2001:a18:1:6::217:webcache              uid:83 ino:16699 sk:e1 <->
         skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161


ESTAB      0       0       158.64.2.228:44578  158.64.2.217:webcache              uid:83 ino:17628 sk:12c <->
         skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187


From the nginx debug logging I don't get a real clue, though it seems that for
working connections the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND |
EPOLLWRNORM | EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT), the
previous ones being 5, while for failing connections the last event seen is 5
(EPOLLIN | EPOLLOUT).
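
For decoding such values, a small stand-alone helper along the lines of the
sketch below can be used (it is only an illustration, not nginx code, and it
assumes the logged number has already been parsed into the raw epoll events
bitmask, whatever base nginx printed it in):

/*
 * Illustration only: print the EPOLL* flag names contained in an event
 * mask.  Parsing "ev" out of the nginx debug log is up to the caller.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>

static void print_epoll_mask(uint32_t ev)
{
    static const struct { uint32_t bit; const char *name; } flags[] = {
        { EPOLLIN,     "EPOLLIN"     }, { EPOLLPRI,    "EPOLLPRI"    },
        { EPOLLOUT,    "EPOLLOUT"    }, { EPOLLERR,    "EPOLLERR"    },
        { EPOLLHUP,    "EPOLLHUP"    }, { EPOLLRDNORM, "EPOLLRDNORM" },
        { EPOLLRDBAND, "EPOLLRDBAND" }, { EPOLLWRNORM, "EPOLLWRNORM" },
        { EPOLLWRBAND, "EPOLLWRBAND" }, { EPOLLMSG,    "EPOLLMSG"    },
        { EPOLLRDHUP,  "EPOLLRDHUP"  },
    };

    printf("0x%x:", ev);
    for (size_t i = 0; i < sizeof(flags) / sizeof(flags[0]); i++)
        if (ev & flags[i].bit)
            printf(" %s", flags[i].name);
    printf("\n");
}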

> This migtht give us a clue about packets being dropped, say the the
> accumulated packet became too big.


The following minor patch (which might be whitespace-mangled) does prevent the
issue from happening for me:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4904250a9aac..c102cd367c79 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
        if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
            TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
            ((TCP_SKB_CB(tail)->tcp_flags |
-             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
+             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
            ((TCP_SKB_CB(tail)->tcp_flags ^
              TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
 #ifdef CONFIG_TLS_DEVICE

Cheers,
Bruno


> > rx-checksumming: on
> > tx-checksumming: on
> >         tx-checksum-ipv4: off [fixed]
> >         tx-checksum-ip-generic: on
> >         tx-checksum-ipv6: off [fixed]
> >         tx-checksum-fcoe-crc: off [fixed]
> >         tx-checksum-sctp: off [fixed]
> > scatter-gather: on
> >         tx-scatter-gather: on
> >         tx-scatter-gather-fraglist: off [fixed]
> > tcp-segmentation-offload: on
> >         tx-tcp-segmentation: on
> >         tx-tcp-ecn-segmentation: off [fixed]
> >         tx-tcp-mangleid-segmentation: off
> >         tx-tcp6-segmentation: on
> > udp-fragmentation-offload: off
> > generic-segmentation-offload: on
> > generic-receive-offload: on
> > large-receive-offload: on
> > rx-vlan-offload: on
> > tx-vlan-offload: on
> > ntuple-filters: off [fixed]
> > receive-hashing: off [fixed]
> > highdma: on
> > rx-vlan-filter: on [fixed]
> > vlan-challenged: off [fixed]
> > tx-lockless: off [fixed]
> > netns-local: off [fixed]
> > tx-gso-robust: off [fixed]
> > tx-fcoe-segmentation: off [fixed]
> > tx-gre-segmentation: off [fixed]
> > tx-gre-csum-segmentation: off [fixed]
> > tx-ipxip4-segmentation: off [fixed]
> > tx-ipxip6-segmentation: off [fixed]
> > tx-udp_tnl-segmentation: off [fixed]
> > tx-udp_tnl-csum-segmentation: off [fixed]
> > tx-gso-partial: off [fixed]
> > tx-sctp-segmentation: off [fixed]
> > tx-esp-segmentation: off [fixed]
> > tx-udp-segmentation: off [fixed]
> > fcoe-mtu: off [fixed]
> > tx-nocache-copy: off
> > loopback: off [fixed]
> > rx-fcs: off [fixed]
> > rx-all: off [fixed]
> > tx-vlan-stag-hw-insert: off [fixed]
> > rx-vlan-stag-hw-parse: off [fixed]
> > rx-vlan-stag-filter: off [fixed]
> > l2-fwd-offload: off [fixed]
> > hw-tc-offload: off [fixed]
> > esp-hw-offload: off [fixed]
> > esp-tx-csum-hw-offload: off [fixed]
> > rx-udp_tunnel-port-offload: off [fixed]
> > tls-hw-tx-offload: off [fixed]
> > tls-hw-rx-offload: off [fixed]
> > rx-gro-hw: off [fixed]
> > tls-hw-record: off [fixed]
> >
> >
> > I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
> >
> > Cheers,
> > Bruno
> >
> > On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:  
> > > Hi,
> > >
> > > I've been chasing down why a python test from the python3 testsuite
> > > started failing and it seems to point to this kernel change in the
> > > networking stack.
> > >
> > > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > > 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
> > >
> > > $ python3 -m test test_httplib -v
> > > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > > == cwd: /var/volatile/tmp/test_python_288
> > > == CPU count: 1
> > > == encodings: locale=UTF-8, FS=utf-8
> > > [...]
> > > test_response_fileno (test.test_httplib.BasicTest) ...
> > >
> > > and it hangs in test_response_fileno.
> > >
> > > The test in question comes from Lib/test/test_httplib.py in the python
> > > source tree and the code is:
> > >
> > >     def test_response_fileno(self):
> > >         # Make sure fd returned by fileno is valid.
> > >         serv = socket.socket(
> > >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> > >         self.addCleanup(serv.close)
> > >         serv.bind((HOST, 0))
> > >         serv.listen()
> > >
> > >         result = None
> > >         def run_server():
> > >             [conn, address] = serv.accept()
> > >             with conn, conn.makefile("rb") as reader:
> > >                 # Read the request header until a blank line
> > >                 while True:
> > >                     line = reader.readline()
> > >                     if not line.rstrip(b"\r\n"):
> > >                         break
> > >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> > >                 nonlocal result
> > >                 result = reader.read()
> > >
> > >         thread = threading.Thread(target=run_server)
> > >         thread.start()
> > >         self.addCleanup(thread.join, float(1))
> > >         conn = client.HTTPConnection(*serv.getsockname())
> > >         conn.request("CONNECT", "dummy:1234")
> > >         response = conn.getresponse()
> > >         try:
> > >             self.assertEqual(response.status, client.OK)
> > >             s = socket.socket(fileno=response.fileno())
> > >             try:
> > >                 s.sendall(b"proxied data\n")
> > >             finally:
> > >                 s.detach()
> > >         finally:
> > >             response.close()
> > >             conn.close()
> > >         thread.join()
> > >         self.assertEqual(result, b"proxied data\n")
> > >
> > > I was hoping someone with more understanding of the networking stack
> > > could look at this and tell whether its a bug in the python test, the
> > > kernel change or otherwise give a pointer to where the problem might
> > > be? I'll freely admit this is not an area I know much about.
> > >
> > > Cheers,
> > >
> > > Richard
> > >
> > >
> > >  


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-25  7:55     ` Bruno Prémont
@ 2019-04-25 13:13       ` Bruno Prémont
  2019-04-25 13:30         ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Bruno Prémont @ 2019-04-25 13:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: richard.purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

Hi Eric,

On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:
> Hi Eric,
> 
> On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:
> > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:  
> > >
> > > Hi Eric,
> > >
> > > I'm seeing issues with this patch as well, not as regular as for
> > > Richard but still (about up to one in 30-50 TCP sessions).
> > >
> > > In my case I have a virtual machine (on VMWare) with this patch where
> > > NGINX as reverse proxy misses part (end) of payload from its upstream
> > > and times out on the upstream connection (while according to tcpdump all
> > > packets including upstream's FIN were sent and the upstream did get
> > > ACKs from the VM).
> > >
> > > From when browsers get from NGINX it feels as if at some point reading
> > > from the socket or waiting for data using select() never returned data
> > > that arrived as more than just EOF is missing.
> > >
> > > The upstream is a hardware machine in the same subnet.
> > >
> > > My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > as network adapter which lists the following features:
> > >    
> > 
> > Hi Bruno.
> > 
> > I suspect a EPOLLIN notification being lost by the application.
> > 
> > Fact that TCP backlog contains 1 instead of 2+ packets should not
> > change stack behavior,
> > this packet should land into socket receive queue eventually.
> > 
> > Are you using epoll() in Edge Trigger mode. You mention select() but
> > select() is a rather old and inefficient API.  
> 
> nginx is using epoll (c.f. http://nginx.org/en/docs/events.html)
> 
> For source, see here
> https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70

I've been trying to reproduce while running nginx under strace (with verbose
logging), but have not been successful. It looks like tracing modifies the
timing in a way that prevents the issue.
Without verbose logging I could hit it (though it needed way more perseverance).

What I do have for a working session under strace is:
  socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
  ioctl(12, FIONBIO, [1]) = 0
  epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
  connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
  epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
  getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
  writev(12, [{""..., 830}, {""..., 115}], 2) = 945
  epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
  recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 3855
  epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
  readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
  close(12)               = 0

Under a failing session I got:
  socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
  ioctl(6, FIONBIO, [1])  = 0
  epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
  connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
  epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
  getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
  writev(6, [{""..., 830}, {""..., 120}], 2) = 950
  epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
  recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
  epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
  readv(6, [{""..., 2356}], 1) = 2356
  readv(6, [{""..., 4096}], 1) = 3362
  epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
  readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
  readv(6, [{""..., 4096}], 1) = 4096
  readv(6, [{""..., 4096}], 1) = 4096
  readv(6, [{""..., 4096}], 1) = 4096
  readv(6, [{""..., 4096}], 1) = 3619
  epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
  ... some other socket with activity and epoll_wait calls for 50 seconds
  epoll_wait(86, [], 512, 3331) = 0
  close(6)                = 0

So the EPOLLRDHUP event never seems to happen here.


As a side note, in the strace output (over lots of sessions) I see almost no
readv() calls returning -1 EAGAIN, except for the first readv() following recvfrom().


I hope this helps in looking for the cause; note as well that not collapsing
FIN packets prevents the issue.
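
In case it is useful, below is a rough sketch of an upstream that produces
this kind of traffic, i.e. the last payload bytes immediately followed by a
FIN. It is only an illustration (the port and payload are made up, it is not
taken from the real setup), and whether the two segments actually end up
coalesced in the receiver's backlog depends on timing and load:

/*
 * Illustration only: a peer that sends the tail of its response and
 * close()s right away, so the final data segment and the FIN leave
 * back-to-back.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    static const char resp[] =
        "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok";
    struct sockaddr_in addr;
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 1) < 0)
        return 1;

    for (;;) {
        int c = accept(srv, NULL, NULL);

        if (c < 0)
            break;
        write(c, resp, sizeof(resp) - 1);  /* last payload bytes...        */
        close(c);                          /* ...immediately followed by FIN */
    }
    close(srv);
    return 0;
}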

Cheers,
Bruno


> > Could you watch/report the output of " ss -temoi "  for the frozen TCP flow ?  
> 
> Here it is, three distinct reproducing attempts:
> State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
> ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
>          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
> 
> 
> ESTAB      0      0                                                                   2001:a18:1:6::228:33572                                                                             2001:a18:1:6::217:webcache              uid:83 ino:16699 sk:e1 <->
>          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
> 
> 
> ESTAB      0      0                                                                        158.64.2.228:44578                                                                                  158.64.2.217:webcache              uid:83 ino:17628 sk:12c <->
>          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
> 
> 
> From nginx debug logging I don't get a real clue though it seems for working connections
> the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND | EPOLLWRNORM |
> EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT) - previous ones are 5
> while for failing connections it looks like last event seen is 5 (EPOLLIN | EPOLLOUT).
> 
> > This migtht give us a clue about packets being dropped, say the the
> > accumulated packet became too big.  
> 
> 
> The following minor patch (might be white-space mangled) does prevent the issue
> happening for me:
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 4904250a9aac..c102cd367c79 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>         if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
>             TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
>             ((TCP_SKB_CB(tail)->tcp_flags |
> -             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> +             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
>             ((TCP_SKB_CB(tail)->tcp_flags ^
>               TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
>  #ifdef CONFIG_TLS_DEVICE
> 
> Cheers,
> Bruno
> 
> 
> > > rx-checksumming: on
> > > tx-checksumming: on
> > >         tx-checksum-ipv4: off [fixed]
> > >         tx-checksum-ip-generic: on
> > >         tx-checksum-ipv6: off [fixed]
> > >         tx-checksum-fcoe-crc: off [fixed]
> > >         tx-checksum-sctp: off [fixed]
> > > scatter-gather: on
> > >         tx-scatter-gather: on
> > >         tx-scatter-gather-fraglist: off [fixed]
> > > tcp-segmentation-offload: on
> > >         tx-tcp-segmentation: on
> > >         tx-tcp-ecn-segmentation: off [fixed]
> > >         tx-tcp-mangleid-segmentation: off
> > >         tx-tcp6-segmentation: on
> > > udp-fragmentation-offload: off
> > > generic-segmentation-offload: on
> > > generic-receive-offload: on
> > > large-receive-offload: on
> > > rx-vlan-offload: on
> > > tx-vlan-offload: on
> > > ntuple-filters: off [fixed]
> > > receive-hashing: off [fixed]
> > > highdma: on
> > > rx-vlan-filter: on [fixed]
> > > vlan-challenged: off [fixed]
> > > tx-lockless: off [fixed]
> > > netns-local: off [fixed]
> > > tx-gso-robust: off [fixed]
> > > tx-fcoe-segmentation: off [fixed]
> > > tx-gre-segmentation: off [fixed]
> > > tx-gre-csum-segmentation: off [fixed]
> > > tx-ipxip4-segmentation: off [fixed]
> > > tx-ipxip6-segmentation: off [fixed]
> > > tx-udp_tnl-segmentation: off [fixed]
> > > tx-udp_tnl-csum-segmentation: off [fixed]
> > > tx-gso-partial: off [fixed]
> > > tx-sctp-segmentation: off [fixed]
> > > tx-esp-segmentation: off [fixed]
> > > tx-udp-segmentation: off [fixed]
> > > fcoe-mtu: off [fixed]
> > > tx-nocache-copy: off
> > > loopback: off [fixed]
> > > rx-fcs: off [fixed]
> > > rx-all: off [fixed]
> > > tx-vlan-stag-hw-insert: off [fixed]
> > > rx-vlan-stag-hw-parse: off [fixed]
> > > rx-vlan-stag-filter: off [fixed]
> > > l2-fwd-offload: off [fixed]
> > > hw-tc-offload: off [fixed]
> > > esp-hw-offload: off [fixed]
> > > esp-tx-csum-hw-offload: off [fixed]
> > > rx-udp_tunnel-port-offload: off [fixed]
> > > tls-hw-tx-offload: off [fixed]
> > > tls-hw-rx-offload: off [fixed]
> > > rx-gro-hw: off [fixed]
> > > tls-hw-record: off [fixed]
> > >
> > >
> > > I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
> > >
> > > Cheers,
> > > Bruno
> > >
> > > On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:    
> > > > Hi,
> > > >
> > > > I've been chasing down why a python test from the python3 testsuite
> > > > started failing and it seems to point to this kernel change in the
> > > > networking stack.
> > > >
> > > > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > > > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > > > 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
> > > >
> > > > $ python3 -m test test_httplib -v
> > > > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > > > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > > > == cwd: /var/volatile/tmp/test_python_288
> > > > == CPU count: 1
> > > > == encodings: locale=UTF-8, FS=utf-8
> > > > [...]
> > > > test_response_fileno (test.test_httplib.BasicTest) ...
> > > >
> > > > and it hangs in test_response_fileno.
> > > >
> > > > The test in question comes from Lib/test/test_httplib.py in the python
> > > > source tree and the code is:
> > > >
> > > >     def test_response_fileno(self):
> > > >         # Make sure fd returned by fileno is valid.
> > > >         serv = socket.socket(
> > > >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> > > >         self.addCleanup(serv.close)
> > > >         serv.bind((HOST, 0))
> > > >         serv.listen()
> > > >
> > > >         result = None
> > > >         def run_server():
> > > >             [conn, address] = serv.accept()
> > > >             with conn, conn.makefile("rb") as reader:
> > > >                 # Read the request header until a blank line
> > > >                 while True:
> > > >                     line = reader.readline()
> > > >                     if not line.rstrip(b"\r\n"):
> > > >                         break
> > > >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> > > >                 nonlocal result
> > > >                 result = reader.read()
> > > >
> > > >         thread = threading.Thread(target=run_server)
> > > >         thread.start()
> > > >         self.addCleanup(thread.join, float(1))
> > > >         conn = client.HTTPConnection(*serv.getsockname())
> > > >         conn.request("CONNECT", "dummy:1234")
> > > >         response = conn.getresponse()
> > > >         try:
> > > >             self.assertEqual(response.status, client.OK)
> > > >             s = socket.socket(fileno=response.fileno())
> > > >             try:
> > > >                 s.sendall(b"proxied data\n")
> > > >             finally:
> > > >                 s.detach()
> > > >         finally:
> > > >             response.close()
> > > >             conn.close()
> > > >         thread.join()
> > > >         self.assertEqual(result, b"proxied data\n")
> > > >
> > > > I was hoping someone with more understanding of the networking stack
> > > > could look at this and tell whether its a bug in the python test, the
> > > > kernel change or otherwise give a pointer to where the problem might
> > > > be? I'll freely admit this is not an area I know much about.
> > > >
> > > > Cheers,
> > > >
> > > > Richard
> > > >
> > > >
> > > >    


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-25 13:13       ` Bruno Prémont
@ 2019-04-25 13:30         ` Eric Dumazet
  2019-04-25 14:16           ` Bruno Prémont
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2019-04-25 13:30 UTC (permalink / raw)
  To: Bruno Prémont
  Cc: Richard Purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

On Thu, Apr 25, 2019 at 6:13 AM Bruno Prémont <bonbons@sysophe.eu> wrote:
>
> Hi Eric,
>
> On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:
> > Hi Eric,
> >
> > On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:
> > > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:
> > > >
> > > > Hi Eric,
> > > >
> > > > I'm seeing issues with this patch as well, not as regular as for
> > > > Richard but still (about up to one in 30-50 TCP sessions).
> > > >
> > > > In my case I have a virtual machine (on VMWare) with this patch where
> > > > NGINX as reverse proxy misses part (end) of payload from its upstream
> > > > and times out on the upstream connection (while according to tcpdump all
> > > > packets including upstream's FIN were sent and the upstream did get
> > > > ACKs from the VM).
> > > >
> > > > From when browsers get from NGINX it feels as if at some point reading
> > > > from the socket or waiting for data using select() never returned data
> > > > that arrived as more than just EOF is missing.
> > > >
> > > > The upstream is a hardware machine in the same subnet.
> > > >
> > > > My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > > as network adapter which lists the following features:
> > > >
> > >
> > > Hi Bruno.
> > >
> > > I suspect a EPOLLIN notification being lost by the application.
> > >
> > > Fact that TCP backlog contains 1 instead of 2+ packets should not
> > > change stack behavior,
> > > this packet should land into socket receive queue eventually.
> > >
> > > Are you using epoll() in Edge Trigger mode. You mention select() but
> > > select() is a rather old and inefficient API.
> >
> > nginx is using epoll (c.f. http://nginx.org/en/docs/events.html)
> >
> > For source, see here
> > https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70
>
> I've been trying to reproduce while running nginx under strace (with verbose logging), but
> have not been successful. Looks like tracing modifies the timing in a
> way to prevent the issue.
> Without verbose logging I could hit it (though needed way more perseverence).
>
> What I do have for a working session under strace is:
>   socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
>   ioctl(12, FIONBIO, [1]) = 0
>   epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
>   connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
>   epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
>   getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>   writev(12, [{""..., 830}, {""..., 115}], 2) = 945
>   epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
>   recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 3855
>   epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
>   readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
>   close(12)               = 0
>
> Under failing session I got:
>   socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
>   ioctl(6, FIONBIO, [1])  = 0
>   epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
>   connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
>   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
>   getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>   writev(6, [{""..., 830}, {""..., 120}], 2) = 950
>   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
>   recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
>   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
>   readv(6, [{""..., 2356}], 1) = 2356
>   readv(6, [{""..., 4096}], 1) = 3362
>   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
>   readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
>   readv(6, [{""..., 4096}], 1) = 4096
>   readv(6, [{""..., 4096}], 1) = 4096
>   readv(6, [{""..., 4096}], 1) = 4096
>   readv(6, [{""..., 4096}], 1) = 3619
>   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
>   ... some other socket with activity and epoll_wait calls for 50 seconds
>   epoll_wait(86, [], 512, 3331) = 0
>   close(6)                = 0
>
>  So the EPOLLRDHUP seems to never happen here.
>
>
> As a side-note, in strace output (over lots of sessions) I see almost no readv() returning -1 EAGAIN
> except for the first readv() following revcfrom().
>
>
> I hope this helps at looking for the cause, as well as that not collapsing FIN packets
> also prevents the issue.

Very interesting, Bruno, thanks a lot for all these details.

I wonder if you have some existing epoll notification bug that would
trigger if the last packet received from the peer contains both data and FIN.

Were the "ss" outputs taken from the nginx side?

>
> Cheers,
> Bruno
>
>
> > > Could you watch/report the output of " ss -temoi "  for the frozen TCP flow ?
> >
> > Here it is, three distinct reproducing attempts:
> > State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
> > ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
> >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
> >
> >
> > ESTAB      0      0                                                                   2001:a18:1:6::228:33572                                                                             2001:a18:1:6::217:webcache              uid:83 ino:16699 sk:e1 <->
> >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
> >
> >
> > ESTAB      0      0                                                                        158.64.2.228:44578                                                                                  158.64.2.217:webcache              uid:83 ino:17628 sk:12c <->
> >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
> >
> >
> > From nginx debug logging I don't get a real clue though it seems for working connections
> > the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND | EPOLLWRNORM |
> > EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT) - previous ones are 5
> > while for failing connections it looks like last event seen is 5 (EPOLLIN | EPOLLOUT).
> >
> > > This migtht give us a clue about packets being dropped, say the the
> > > accumulated packet became too big.
> >
> >
> > The following minor patch (might be white-space mangled) does prevent the issue
> > happening for me:
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 4904250a9aac..c102cd367c79 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> >         if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
> >             TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
> >             ((TCP_SKB_CB(tail)->tcp_flags |
> > -             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> > +             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
> >             ((TCP_SKB_CB(tail)->tcp_flags ^
> >               TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
> >  #ifdef CONFIG_TLS_DEVICE
> >
> > Cheers,
> > Bruno
> >
> >
> > > > rx-checksumming: on
> > > > tx-checksumming: on
> > > >         tx-checksum-ipv4: off [fixed]
> > > >         tx-checksum-ip-generic: on
> > > >         tx-checksum-ipv6: off [fixed]
> > > >         tx-checksum-fcoe-crc: off [fixed]
> > > >         tx-checksum-sctp: off [fixed]
> > > > scatter-gather: on
> > > >         tx-scatter-gather: on
> > > >         tx-scatter-gather-fraglist: off [fixed]
> > > > tcp-segmentation-offload: on
> > > >         tx-tcp-segmentation: on
> > > >         tx-tcp-ecn-segmentation: off [fixed]
> > > >         tx-tcp-mangleid-segmentation: off
> > > >         tx-tcp6-segmentation: on
> > > > udp-fragmentation-offload: off
> > > > generic-segmentation-offload: on
> > > > generic-receive-offload: on
> > > > large-receive-offload: on
> > > > rx-vlan-offload: on
> > > > tx-vlan-offload: on
> > > > ntuple-filters: off [fixed]
> > > > receive-hashing: off [fixed]
> > > > highdma: on
> > > > rx-vlan-filter: on [fixed]
> > > > vlan-challenged: off [fixed]
> > > > tx-lockless: off [fixed]
> > > > netns-local: off [fixed]
> > > > tx-gso-robust: off [fixed]
> > > > tx-fcoe-segmentation: off [fixed]
> > > > tx-gre-segmentation: off [fixed]
> > > > tx-gre-csum-segmentation: off [fixed]
> > > > tx-ipxip4-segmentation: off [fixed]
> > > > tx-ipxip6-segmentation: off [fixed]
> > > > tx-udp_tnl-segmentation: off [fixed]
> > > > tx-udp_tnl-csum-segmentation: off [fixed]
> > > > tx-gso-partial: off [fixed]
> > > > tx-sctp-segmentation: off [fixed]
> > > > tx-esp-segmentation: off [fixed]
> > > > tx-udp-segmentation: off [fixed]
> > > > fcoe-mtu: off [fixed]
> > > > tx-nocache-copy: off
> > > > loopback: off [fixed]
> > > > rx-fcs: off [fixed]
> > > > rx-all: off [fixed]
> > > > tx-vlan-stag-hw-insert: off [fixed]
> > > > rx-vlan-stag-hw-parse: off [fixed]
> > > > rx-vlan-stag-filter: off [fixed]
> > > > l2-fwd-offload: off [fixed]
> > > > hw-tc-offload: off [fixed]
> > > > esp-hw-offload: off [fixed]
> > > > esp-tx-csum-hw-offload: off [fixed]
> > > > rx-udp_tunnel-port-offload: off [fixed]
> > > > tls-hw-tx-offload: off [fixed]
> > > > tls-hw-rx-offload: off [fixed]
> > > > rx-gro-hw: off [fixed]
> > > > tls-hw-record: off [fixed]
> > > >
> > > >
> > > > I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
> > > >
> > > > Cheers,
> > > > Bruno
> > > >
> > > > On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@linuxfoundation.org wrote:
> > > > > Hi,
> > > > >
> > > > > I've been chasing down why a python test from the python3 testsuite
> > > > > started failing and it seems to point to this kernel change in the
> > > > > networking stack.
> > > > >
> > > > > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > > > > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > > > > 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
> > > > >
> > > > > $ python3 -m test test_httplib -v
> > > > > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > > > > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > > > > == cwd: /var/volatile/tmp/test_python_288
> > > > > == CPU count: 1
> > > > > == encodings: locale=UTF-8, FS=utf-8
> > > > > [...]
> > > > > test_response_fileno (test.test_httplib.BasicTest) ...
> > > > >
> > > > > and it hangs in test_response_fileno.
> > > > >
> > > > > The test in question comes from Lib/test/test_httplib.py in the python
> > > > > source tree and the code is:
> > > > >
> > > > >     def test_response_fileno(self):
> > > > >         # Make sure fd returned by fileno is valid.
> > > > >         serv = socket.socket(
> > > > >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> > > > >         self.addCleanup(serv.close)
> > > > >         serv.bind((HOST, 0))
> > > > >         serv.listen()
> > > > >
> > > > >         result = None
> > > > >         def run_server():
> > > > >             [conn, address] = serv.accept()
> > > > >             with conn, conn.makefile("rb") as reader:
> > > > >                 # Read the request header until a blank line
> > > > >                 while True:
> > > > >                     line = reader.readline()
> > > > >                     if not line.rstrip(b"\r\n"):
> > > > >                         break
> > > > >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> > > > >                 nonlocal result
> > > > >                 result = reader.read()
> > > > >
> > > > >         thread = threading.Thread(target=run_server)
> > > > >         thread.start()
> > > > >         self.addCleanup(thread.join, float(1))
> > > > >         conn = client.HTTPConnection(*serv.getsockname())
> > > > >         conn.request("CONNECT", "dummy:1234")
> > > > >         response = conn.getresponse()
> > > > >         try:
> > > > >             self.assertEqual(response.status, client.OK)
> > > > >             s = socket.socket(fileno=response.fileno())
> > > > >             try:
> > > > >                 s.sendall(b"proxied data\n")
> > > > >             finally:
> > > > >                 s.detach()
> > > > >         finally:
> > > > >             response.close()
> > > > >             conn.close()
> > > > >         thread.join()
> > > > >         self.assertEqual(result, b"proxied data\n")
> > > > >
> > > > > I was hoping someone with more understanding of the networking stack
> > > > > could look at this and tell whether its a bug in the python test, the
> > > > > kernel change or otherwise give a pointer to where the problem might
> > > > > be? I'll freely admit this is not an area I know much about.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Richard
> > > > >
> > > > >
> > > > >


* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2019-04-25 13:30         ` Eric Dumazet
@ 2019-04-25 14:16           ` Bruno Prémont
  0 siblings, 0 replies; 14+ messages in thread
From: Bruno Prémont @ 2019-04-25 14:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Richard Purdie, Neal Cardwell, Yuchung Cheng, David S. Miller,
	netdev, Alexander Kanavin, Bruce Ashfield

On Thu, 25 Apr 2019 06:30:24 -0700 Eric Dumazet wrote:
> On Thu, Apr 25, 2019 at 6:13 AM Bruno Prémont wrote:
> > On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:  
> > > On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:  
> > > > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:  
> > > > > I'm seeing issues with this patch as well, not as regular as for
> > > > > Richard but still (about up to one in 30-50 TCP sessions).
> > > > >
> > > > > In my case I have a virtual machine (on VMWare) with this patch where
> > > > > NGINX as reverse proxy misses part (end) of payload from its upstream
> > > > > and times out on the upstream connection (while according to tcpdump all
> > > > > packets including upstream's FIN were sent and the upstream did get
> > > > > ACKs from the VM).
> > > > >
> > > > > From what browsers get back from NGINX it feels as if, at some point,
> > > > > reading from the socket or waiting for data via select() never returned
> > > > > data that had actually arrived, since more than just the EOF is missing.
> > > > >
> > > > > The upstream is a hardware machine in the same subnet.
> > > > >
> > > > > My VM is using a VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > > > as its network adapter, which lists the following features:
> > > > >  
> > > >
> > > > Hi Bruno.
> > > >
> > > > I suspect an EPOLLIN notification being lost by the application.
> > > >
> > > > The fact that the TCP backlog contains 1 packet instead of 2+ should not
> > > > change stack behavior; the packet should eventually land in the socket
> > > > receive queue.
> > > >
> > > > Are you using epoll() in edge-triggered mode? You mention select(), but
> > > > select() is a rather old and inefficient API.
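As background for the edge-trigger question: with EPOLLET, epoll_wait() reports a
readiness transition only once, so the application has to keep reading until read()
returns -1 with EAGAIN, and has to treat a return of 0 as the peer's FIN, before going
back to epoll_wait(). Below is a minimal, hypothetical userspace sketch of that
discipline; it is not nginx code, it simply connects to a given address and drains
whatever the peer sends:

#define _GNU_SOURCE
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical client: connect to <ipv4> <port>, then obey the EPOLLET
 * contract by reading until EAGAIN after every EPOLLIN notification.
 * Error handling is trimmed for brevity.
 */
int main(int argc, char **argv)
{
        struct sockaddr_in addr = { .sin_family = AF_INET };
        struct epoll_event ev = { 0 }, out;
        char buf[4096];
        int fd, ep;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <ipv4> <port>\n", argv[0]);
                return 1;
        }
        addr.sin_port = htons(atoi(argv[2]));
        inet_pton(AF_INET, argv[1], &addr.sin_addr);

        fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        ep = epoll_create1(0);
        ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
        ev.data.fd = fd;
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
        connect(fd, (struct sockaddr *)&addr, sizeof(addr)); /* EINPROGRESS expected */

        for (;;) {
                if (epoll_wait(ep, &out, 1, 60000) <= 0)
                        break;                  /* timeout or error: give up */
                if (!(out.events & (EPOLLIN | EPOLLRDHUP)))
                        continue;               /* e.g. the initial EPOLLOUT */
                for (;;) {                      /* ET: must drain until EAGAIN */
                        ssize_t n = read(fd, buf, sizeof(buf));

                        if (n > 0)
                                continue;       /* payload consumed, keep reading */
                        if (n == 0 || (n < 0 && errno != EAGAIN))
                                goto done;      /* EOF (peer's FIN) or hard error */
                        break;                  /* EAGAIN: wait for the next event */
                }
        }
done:
        close(fd);
        close(ep);
        return 0;
}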
> > >
> > > nginx is using epoll (c.f. http://nginx.org/en/docs/events.html)
> > >
> > > For source, see here
> > > https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70  
> >
> > I've been trying to reproduce while running nginx under strace (with verbose
> > logging), but have not been successful. It looks like tracing modifies the
> > timing in a way that prevents the issue.
> > Without verbose logging I could hit it (though it needed way more perseverance).
> >
> > What I do have for a working session under strace is:
> >   socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
> >   ioctl(12, FIONBIO, [1]) = 0
> >   epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
> >   connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
> >   epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
> >   getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> >   writev(12, [{""..., 830}, {""..., 115}], 2) = 945
> >   epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
> >   recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 3855
> >   epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
> >   readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
> >   close(12)               = 0
> >
> > Under failing session I got:
> >   socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> >   ioctl(6, FIONBIO, [1])  = 0
> >   epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
> >   connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
> >   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
> >   getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> >   writev(6, [{""..., 830}, {""..., 120}], 2) = 950
> >   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
> >   recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
> >   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> >   readv(6, [{""..., 2356}], 1) = 2356
> >   readv(6, [{""..., 4096}], 1) = 3362
> >   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> >   readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
> >   readv(6, [{""..., 4096}], 1) = 4096
> >   readv(6, [{""..., 4096}], 1) = 4096
> >   readv(6, [{""..., 4096}], 1) = 4096
> >   readv(6, [{""..., 4096}], 1) = 3619
> >   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> >   ... some other socket with activity and epoll_wait calls for 50 seconds
> >   epoll_wait(86, [], 512, 3331) = 0
> >   close(6)                = 0
> >
> > So the EPOLLRDHUP event never seems to arrive here.
> >
> >
> > As a side note, in the strace output (over lots of sessions) I see almost no readv() returning -1 EAGAIN
> > except for the first readv() following recvfrom().
> >
> >
> > I hope this helps in looking for the cause; note also that not coalescing
> > FIN-carrying packets prevents the issue.
> 
> Very interesting Bruno, thanks a lot for all these details.
> 
> I wonder if you have some existing epoll notification bug, that would
> trigger if the last packet received from the peer
> contains both data and FIN.

I just took a tcpdump capture on the nginx host and have a TCP session
for which the issue occurred in the dump, plus a few working ones (I
can't share it publicly though, unless I can anonymise the payload).
For almost all connections in the dump, the last packet from the upstream
carries payload as well as the FIN flag, and the same holds for the failing
one. So a final packet carrying both data and FIN is not the only trigger
(and the amount of data probably isn't either, as one connection with 6
bytes passed while the failing one had 7), if it is a trigger at all.

The dump doesn't reveal anything that catches my attention, though; at best
a difference in timing as to when the nginx host sends its ACKs.

> Were the "ss" outputs taken from nginx side ?

Yes, they were.

Cheers,
Bruno

> > > > Could you watch/report the output of "ss -temoi" for the frozen TCP flow?
> > >
> > > Here it is, three distinct reproducing attempts:
> > > State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
> > > ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
> > >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
> > >
> > >
> > > ESTAB      0      0                                                                   2001:a18:1:6::228:33572                                                                             2001:a18:1:6::217:webcache              uid:83 ino:16699 sk:e1 <->
> > >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
> > >
> > >
> > > ESTAB      0      0                                                                        158.64.2.228:44578                                                                                  158.64.2.217:webcache              uid:83 ino:17628 sk:12c <->
> > >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
> > >
> > >
> > > From nginx debug logging I don't get a real clue, though it seems that for
> > > working connections the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND |
> > > EPOLLWRNORM | EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT) -
> > > previous ones are 5 - while for failing connections the last event seen is 5
> > > (EPOLLIN | EPOLLOUT).
> > >  
> > > > This might give us a clue about packets being dropped, say if the
> > > > accumulated packet became too big.
> > >
> > >
> > > The following minor patch (which might be white-space mangled) does prevent
> > > the issue from happening for me:
> > >
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 4904250a9aac..c102cd367c79 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> > >         if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
> > >             TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
> > >             ((TCP_SKB_CB(tail)->tcp_flags |
> > > -             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> > > +             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
> > >             ((TCP_SKB_CB(tail)->tcp_flags ^
> > >               TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
> > >  #ifdef CONFIG_TLS_DEVICE
> > >
> > > Cheers,
> > > Bruno

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-22 18:16     ` Eric Dumazet
@ 2018-11-22 18:21       ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2018-11-22 18:21 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Jean-Louis Dupond, Yuchung Cheng, Eric Dumazet

On Thu, Nov 22, 2018 at 10:16 AM Eric Dumazet <edumazet@google.com> wrote:

> Yes, I was considering properly filtering SACK as a refinement later [1]
> but you raise a valid point for alien stacks that are not yet using SACK :/
>
> [1] This version of the patch will not aggregate sacks since the
> memcmp() on tcp options would fail.
>
> Neal, can you double check that cake_ack_filter() does not have the issue
> you just mentioned?

Note that aggregated pure ACKs will have gso_segs set to the number of
aggregated ACKs; we might simply use this value later in the stack,
instead of keeping X pure ACKs in the backlog and increasing memory
needs and CPU costs.

Then I guess I need this fix:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696..736f7f24cdb4fe61769faaa1644c8bff01c746c4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1669,7 +1669,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
                __skb_pull(skb, hdrlen);
                if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
                        TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
-                       TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+                       if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))
+                               TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
                        TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;

                        if (TCP_SKB_CB(skb)->has_rxtstamp) {
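For readers who do not have the kernel helpers in mind: after(a, b) compares 32-bit
TCP sequence/ack numbers modulo 2^32 via a signed difference, so it still gives the
right answer across wraparound. A small standalone sketch of the same comparison
(an illustration only, not the kernel header):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same idea as the kernel's before()/after() helpers: the signed
 * difference of two u32 sequence numbers is positive when the first
 * argument is "newer", even across the 2^32 wraparound.
 */
static bool seq_after(uint32_t a, uint32_t b)
{
        return (int32_t)(a - b) > 0;
}

int main(void)
{
        assert(seq_after(2000, 1000));          /* plainly newer */
        assert(!seq_after(1000, 2000));         /* plainly older */
        assert(seq_after(5, 0xfffffff0u));      /* newer across wraparound */

        printf("seq_after() behaves across wraparound\n");
        return 0;
}

With that guard in the fix above, the coalesced skb's ack_seq can only move forward,
so an older pure ACK landing later in the backlog cannot rewind it.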

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-22 18:01   ` Neal Cardwell
@ 2018-11-22 18:16     ` Eric Dumazet
  2018-11-22 18:21       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2018-11-22 18:16 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Jean-Louis Dupond, Yuchung Cheng, Eric Dumazet

On Thu, Nov 22, 2018 at 10:01 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > In case GRO is not as efficient as it should be, or is disabled,
> > we might have a user thread trapped in __release_sock() while the
> > softirq handler floods it with packets, up to the point where we have to drop them.
> >
> > This patch balances the work done from the user thread and from softirq,
> > to give __release_sock() more chances to complete its work.
> >
> > This also helps if we receive many ACK packets, since GRO
> > does not aggregate them.
>
> Would this coalesce duplicate incoming ACK packets? Is there a risk
> that this would eliminate incoming dupacks needed for fast recovery in
> non-SACK connections? Perhaps pure ACKs should only be coalesced if
> the ACK field is different?

Yes, I was considering properly filtering SACK as a refinement later [1]
but you raise a valid point for alien stacks that are not yet using SACK :/

[1] This version of the patch will not aggregate sacks since the
memcmp() on tcp options would fail.

Neal, can you double check that cake_ack_filter() does not have the issue
you just mentioned?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
  2018-11-21 22:31   ` Yuchung Cheng
@ 2018-11-22 18:01   ` Neal Cardwell
  2018-11-22 18:16     ` Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: Neal Cardwell @ 2018-11-22 18:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Netdev, jean-louis, Yuchung Cheng, Eric Dumazet

On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet <edumazet@google.com> wrote:
>
> In case GRO is not as efficient as it should be, or is disabled,
> we might have a user thread trapped in __release_sock() while the
> softirq handler floods it with packets, up to the point where we have to drop them.
>
> This patch balances the work done from the user thread and from softirq,
> to give __release_sock() more chances to complete its work.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.

Would this coalesce duplicate incoming ACK packets? Is there a risk
that this would eliminate incoming dupacks needed for fast recovery in
non-SACK connections? Perhaps pure ACKs should only be coalesced if
the ACK field is different?

neal
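
To make the concern concrete: a non-SACK sender enters fast retransmit only after
three duplicate ACKs (RFC 5681), i.e. three pure ACKs carrying the same ack number.
If identical pure ACKs were merged into a single backlog skb and later processed as
one ACK, the observed dupack count could stay below that threshold, unless the stack
also consults the aggregated count carried in gso_segs. A hypothetical illustration
of the counting, not kernel code:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DUPACK_THRESHOLD 3      /* classic fast-retransmit trigger (RFC 5681) */

/* Count how many times in a row the ack value repeats without advancing. */
static int count_dupacks(const uint32_t *acks, size_t n)
{
        int dup = 0;
        size_t i;

        for (i = 1; i < n; i++)
                dup = (acks[i] == acks[i - 1]) ? dup + 1 : 0;
        return dup;
}

int main(void)
{
        /* Four identical pure ACKs delivered individually: 3 dupacks. */
        uint32_t separate[] = { 1000, 1000, 1000, 1000 };
        /* The same ACKs coalesced into one skb before processing: 0 dupacks,
         * unless the aggregated count (e.g. gso_segs) is taken into account.
         */
        uint32_t coalesced[] = { 1000 };

        printf("separate:  %d dupacks -> fast retransmit %s\n",
               count_dupacks(separate, 4),
               count_dupacks(separate, 4) >= DUPACK_THRESHOLD ? "yes" : "no");
        printf("coalesced: %d dupacks -> fast retransmit %s\n",
               count_dupacks(coalesced, 1),
               count_dupacks(coalesced, 1) >= DUPACK_THRESHOLD ? "yes" : "no");
        return 0;
}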

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 22:40     ` Eric Dumazet
@ 2018-11-22 16:34       ` Yuchung Cheng
  0 siblings, 0 replies; 14+ messages in thread
From: Yuchung Cheng @ 2018-11-22 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell

On Wed, Nov 21, 2018 at 2:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 11/21/2018 02:31 PM, Yuchung Cheng wrote:
>> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
>
>>> +
>> Really nice! Would it make sense to re-use (some of) the similar
>> tcp_try_coalesce()?
>>
>
> Maybe, but it is a bit complex, since skbs in receive queues (regular or out of order)
> are accounted differently (they have skb->destructor set)
>
> Also they had the TCP header pulled already, while the backlog coalescing also has
> to make sure TCP options match.
>
> Not sure if we want to add extra parameters and conditional checks...
Makes sense.

Acked-by: Yuchung Cheng <ycheng@google.com>

>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 22:31   ` Yuchung Cheng
@ 2018-11-21 22:40     ` Eric Dumazet
  2018-11-22 16:34       ` Yuchung Cheng
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2018-11-21 22:40 UTC (permalink / raw)
  To: Yuchung Cheng, Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell



On 11/21/2018 02:31 PM, Yuchung Cheng wrote:
> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:

>> +
> Really nice! Would it make sense to re-use (some of) the similar
> tcp_try_coalesce()?
>

Maybe, but it is a bit complex, since skbs in receive queues (regular or out of order)
are accounted differently (they have skb->destructor set)

Also they had the TCP header pulled already, while the backlog coalescing also has
to make sure TCP options match.

Not sure if we want to add extra parameters and conditional checks...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
@ 2018-11-21 22:31   ` Yuchung Cheng
  2018-11-21 22:40     ` Eric Dumazet
  2018-11-22 18:01   ` Neal Cardwell
  1 sibling, 1 reply; 14+ messages in thread
From: Yuchung Cheng @ 2018-11-21 22:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Jean-Louis Dupond, Neal Cardwell, Eric Dumazet

On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet <edumazet@google.com> wrote:
>
> In case GRO is not as efficient as it should be, or is disabled,
> we might have a user thread trapped in __release_sock() while the
> softirq handler floods it with packets, up to the point where we have to drop them.
>
> This patch balances the work done from the user thread and from softirq,
> to give __release_sock() more chances to complete its work.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  include/uapi/linux/snmp.h |  1 +
>  net/ipv4/proc.c           |  1 +
>  net/ipv4/tcp_ipv4.c       | 75 +++++++++++++++++++++++++++++++++++----
>  3 files changed, 71 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index f80135e5feaa886000009db6dff75b2bc2d637b2..86dc24a96c90ab047d5173d625450facd6c6dd79 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -243,6 +243,7 @@ enum
>         LINUX_MIB_TCPREQQFULLDROP,              /* TCPReqQFullDrop */
>         LINUX_MIB_TCPRETRANSFAIL,               /* TCPRetransFail */
>         LINUX_MIB_TCPRCVCOALESCE,               /* TCPRcvCoalesce */
> +       LINUX_MIB_TCPBACKLOGCOALESCE,           /* TCPBacklogCoalesce */
>         LINUX_MIB_TCPOFOQUEUE,                  /* TCPOFOQueue */
>         LINUX_MIB_TCPOFODROP,                   /* TCPOFODrop */
>         LINUX_MIB_TCPOFOMERGE,                  /* TCPOFOMerge */
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index 70289682a6701438aed99a00a9705c39fa4394d3..c3610b37bb4ce665b1976d8cc907b6dd0de42ab9 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -219,6 +219,7 @@ static const struct snmp_mib snmp4_net_list[] = {
>         SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
>         SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
>         SNMP_MIB_ITEM("TCPRcvCollapsed", LINUX_MIB_TCPRCVCOLLAPSED),
> +       SNMP_MIB_ITEM("TCPBacklogCoalesce", LINUX_MIB_TCPBACKLOGCOALESCE),
>         SNMP_MIB_ITEM("TCPDSACKOldSent", LINUX_MIB_TCPDSACKOLDSENT),
>         SNMP_MIB_ITEM("TCPDSACKOfoSent", LINUX_MIB_TCPDSACKOFOSENT),
>         SNMP_MIB_ITEM("TCPDSACKRecv", LINUX_MIB_TCPDSACKRECV),
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 795605a2327504b8a025405826e7e0ca8dc8501d..401e1d1cb904a4c7963d8baa419cfbf178593344 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1619,12 +1619,10 @@ int tcp_v4_early_demux(struct sk_buff *skb)
>  bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>  {
>         u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
> -
> -       /* Only socket owner can try to collapse/prune rx queues
> -        * to reduce memory overhead, so add a little headroom here.
> -        * Few sockets backlog are possibly concurrently non empty.
> -        */
> -       limit += 64*1024;
> +       struct skb_shared_info *shinfo;
> +       const struct tcphdr *th;
> +       struct sk_buff *tail;
> +       unsigned int hdrlen;
>
>         /* In case all data was pulled from skb frags (in __pskb_pull_tail()),
>          * we can fix skb->truesize to its real value to avoid future drops.
> @@ -1636,6 +1634,71 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>
>         skb_dst_drop(skb);
>
> +       if (unlikely(tcp_checksum_complete(skb))) {
> +               bh_unlock_sock(sk);
> +               __TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
> +               __TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> +               return true;
> +       }
> +
> +       /* Attempt coalescing to last skb in backlog, even if we are
> +        * above the limits.
> +        * This is okay because skb capacity is limited to MAX_SKB_FRAGS.
> +        */
> +       th = (const struct tcphdr *)skb->data;
> +       hdrlen = th->doff * 4;
> +       shinfo = skb_shinfo(skb);
> +
> +       if (!shinfo->gso_size)
> +               shinfo->gso_size = skb->len - hdrlen;
> +
> +       if (!shinfo->gso_segs)
> +               shinfo->gso_segs = 1;
> +
> +       tail = sk->sk_backlog.tail;
> +       if (tail &&
> +           TCP_SKB_CB(tail)->end_seq == TCP_SKB_CB(skb)->seq &&
> +#ifdef CONFIG_TLS_DEVICE
> +           tail->decrypted == skb->decrypted &&
> +#endif
> +           !memcmp(tail->data + sizeof(*th), skb->data + sizeof(*th),
> +                   hdrlen - sizeof(*th))) {
> +               bool fragstolen;
> +               int delta;
> +
> +               __skb_pull(skb, hdrlen);
> +               if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
> +                       TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
> +                       TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
> +                       TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
> +
> +                       if (TCP_SKB_CB(skb)->has_rxtstamp) {
> +                               TCP_SKB_CB(tail)->has_rxtstamp = true;
> +                               tail->tstamp = skb->tstamp;
> +                               skb_hwtstamps(tail)->hwtstamp = skb_hwtstamps(skb)->hwtstamp;
> +                       }
> +
Really nice! Would it make sense to re-use (some of) the similar
tcp_try_coalesce()?

> +                       /* Not as strict as GRO. We only need to carry mss max value */
> +                       skb_shinfo(tail)->gso_size = max(shinfo->gso_size,
> +                                                        skb_shinfo(tail)->gso_size);
> +
> +                       skb_shinfo(tail)->gso_segs += shinfo->gso_segs;
> +
> +                       sk->sk_backlog.len += delta;
> +                       __NET_INC_STATS(sock_net(sk),
> +                                       LINUX_MIB_TCPBACKLOGCOALESCE);
> +                       kfree_skb_partial(skb, fragstolen);
> +                       return false;
> +               }
> +               __skb_push(skb, hdrlen);
> +       }
> +
> +       /* Only socket owner can try to collapse/prune rx queues
> +        * to reduce memory overhead, so add a little headroom here.
> +        * Few sockets backlog are possibly concurrently non empty.
> +        */
> +       limit += 64*1024;
> +
>         if (unlikely(sk_add_backlog(sk, skb, limit))) {
>                 bh_unlock_sock(sk);
>                 __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
> --
> 2.19.1.1215.g8438c0b245-goog
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
  2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
@ 2018-11-21 17:52 ` Eric Dumazet
  2018-11-21 22:31   ` Yuchung Cheng
  2018-11-22 18:01   ` Neal Cardwell
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2018-11-21 17:52 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Jean-Louis Dupond, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Eric Dumazet

In case GRO is not as efficient as it should be, or is disabled,
we might have a user thread trapped in __release_sock() while the
softirq handler floods it with packets, up to the point where we have to drop them.

This patch balances the work done from the user thread and from softirq,
to give __release_sock() more chances to complete its work.

This also helps if we receive many ACK packets, since GRO
does not aggregate them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c           |  1 +
 net/ipv4/tcp_ipv4.c       | 75 +++++++++++++++++++++++++++++++++++----
 3 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index f80135e5feaa886000009db6dff75b2bc2d637b2..86dc24a96c90ab047d5173d625450facd6c6dd79 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -243,6 +243,7 @@ enum
 	LINUX_MIB_TCPREQQFULLDROP,		/* TCPReqQFullDrop */
 	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
 	LINUX_MIB_TCPRCVCOALESCE,		/* TCPRcvCoalesce */
+	LINUX_MIB_TCPBACKLOGCOALESCE,		/* TCPBacklogCoalesce */
 	LINUX_MIB_TCPOFOQUEUE,			/* TCPOFOQueue */
 	LINUX_MIB_TCPOFODROP,			/* TCPOFODrop */
 	LINUX_MIB_TCPOFOMERGE,			/* TCPOFOMerge */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 70289682a6701438aed99a00a9705c39fa4394d3..c3610b37bb4ce665b1976d8cc907b6dd0de42ab9 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -219,6 +219,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
 	SNMP_MIB_ITEM("TCPRcvCollapsed", LINUX_MIB_TCPRCVCOLLAPSED),
+	SNMP_MIB_ITEM("TCPBacklogCoalesce", LINUX_MIB_TCPBACKLOGCOALESCE),
 	SNMP_MIB_ITEM("TCPDSACKOldSent", LINUX_MIB_TCPDSACKOLDSENT),
 	SNMP_MIB_ITEM("TCPDSACKOfoSent", LINUX_MIB_TCPDSACKOFOSENT),
 	SNMP_MIB_ITEM("TCPDSACKRecv", LINUX_MIB_TCPDSACKRECV),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 795605a2327504b8a025405826e7e0ca8dc8501d..401e1d1cb904a4c7963d8baa419cfbf178593344 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1619,12 +1619,10 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 {
 	u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
-
-	/* Only socket owner can try to collapse/prune rx queues
-	 * to reduce memory overhead, so add a little headroom here.
-	 * Few sockets backlog are possibly concurrently non empty.
-	 */
-	limit += 64*1024;
+	struct skb_shared_info *shinfo;
+	const struct tcphdr *th;
+	struct sk_buff *tail;
+	unsigned int hdrlen;
 
 	/* In case all data was pulled from skb frags (in __pskb_pull_tail()),
 	 * we can fix skb->truesize to its real value to avoid future drops.
@@ -1636,6 +1634,71 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 
 	skb_dst_drop(skb);
 
+	if (unlikely(tcp_checksum_complete(skb))) {
+		bh_unlock_sock(sk);
+		__TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
+		__TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
+		return true;
+	}
+
+	/* Attempt coalescing to last skb in backlog, even if we are
+	 * above the limits.
+	 * This is okay because skb capacity is limited to MAX_SKB_FRAGS.
+	 */
+	th = (const struct tcphdr *)skb->data;
+	hdrlen = th->doff * 4;
+	shinfo = skb_shinfo(skb);
+
+	if (!shinfo->gso_size)
+		shinfo->gso_size = skb->len - hdrlen;
+
+	if (!shinfo->gso_segs)
+		shinfo->gso_segs = 1;
+
+	tail = sk->sk_backlog.tail;
+	if (tail &&
+	    TCP_SKB_CB(tail)->end_seq == TCP_SKB_CB(skb)->seq &&
+#ifdef CONFIG_TLS_DEVICE
+	    tail->decrypted == skb->decrypted &&
+#endif
+	    !memcmp(tail->data + sizeof(*th), skb->data + sizeof(*th),
+		    hdrlen - sizeof(*th))) {
+		bool fragstolen;
+		int delta;
+
+		__skb_pull(skb, hdrlen);
+		if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
+			TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
+			TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+			TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
+
+			if (TCP_SKB_CB(skb)->has_rxtstamp) {
+				TCP_SKB_CB(tail)->has_rxtstamp = true;
+				tail->tstamp = skb->tstamp;
+				skb_hwtstamps(tail)->hwtstamp = skb_hwtstamps(skb)->hwtstamp;
+			}
+
+			/* Not as strict as GRO. We only need to carry mss max value */
+			skb_shinfo(tail)->gso_size = max(shinfo->gso_size,
+							 skb_shinfo(tail)->gso_size);
+
+			skb_shinfo(tail)->gso_segs += shinfo->gso_segs;
+
+			sk->sk_backlog.len += delta;
+			__NET_INC_STATS(sock_net(sk),
+					LINUX_MIB_TCPBACKLOGCOALESCE);
+			kfree_skb_partial(skb, fragstolen);
+			return false;
+		}
+		__skb_push(skb, hdrlen);
+	}
+
+	/* Only socket owner can try to collapse/prune rx queues
+	 * to reduce memory overhead, so add a little headroom here.
+	 * Few sockets backlog are possibly concurrently non empty.
+	 */
+	limit += 64*1024;
+
 	if (unlikely(sk_add_backlog(sk, skb, limit))) {
 		bh_unlock_sock(sk);
 		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
-- 
2.19.1.1215.g8438c0b245-goog

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-04-25 14:16 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-07 21:28 [PATCH net-next 2/3] tcp: implement coalescing on backlog queue richard.purdie
2019-04-24 14:51 ` Bruno Prémont
2019-04-24 15:47   ` Eric Dumazet
2019-04-25  7:55     ` Bruno Prémont
2019-04-25 13:13       ` Bruno Prémont
2019-04-25 13:30         ` Eric Dumazet
2019-04-25 14:16           ` Bruno Prémont
  -- strict thread matches above, loose matches on Subject: below --
2018-11-21 17:52 [PATCH net-next 0/3] tcp: take a bit more care of backlog stress Eric Dumazet
2018-11-21 17:52 ` [PATCH net-next 2/3] tcp: implement coalescing on backlog queue Eric Dumazet
2018-11-21 22:31   ` Yuchung Cheng
2018-11-21 22:40     ` Eric Dumazet
2018-11-22 16:34       ` Yuchung Cheng
2018-11-22 18:01   ` Neal Cardwell
2018-11-22 18:16     ` Eric Dumazet
2018-11-22 18:21       ` Eric Dumazet
