From: Bruno Prémont
To: Eric Dumazet
Cc: Richard Purdie, Neal Cardwell, Yuchung Cheng, "David S. Miller",
 netdev, Alexander Kanavin, Bruce Ashfield
Subject: Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
Date: Thu, 25 Apr 2019 16:16:43 +0200
Message-ID: <20190425161643.0eef0009@pluto.restena.lu>

On Thu, 25 Apr 2019 06:30:24 -0700 Eric Dumazet wrote:
> On Thu, Apr 25, 2019 at 6:13 AM Bruno Prémont wrote:
> > On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:
> > > On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:
> > > > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:
> > > > > I'm seeing issues with this patch as well, not as regularly as for
> > > > > Richard, but still (up to about one in 30-50 TCP sessions).
> > > > >
> > > > > In my case I have a virtual machine (on VMware) with this patch where
> > > > > NGINX, acting as a reverse proxy, misses part (the end) of the payload
> > > > > from its upstream and times out on the upstream connection (while
> > > > > according to tcpdump all packets, including the upstream's FIN, were
> > > > > sent and the upstream did get ACKs from the VM).
> > > > > From what browsers get from NGINX it feels as if at some point reading
> > > > > from the socket, or waiting for data using select(), never returned the
> > > > > data that had arrived, as more than just the EOF is missing.
> > > > >
> > > > > The upstream is a hardware machine in the same subnet.
> > > > >
> > > > > My VM is using the VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > > > as network adapter, which lists the following features:
> > > > >
> > > >
> > > > Hi Bruno.
> > > >
> > > > I suspect an EPOLLIN notification being lost by the application.
> > > >
> > > > The fact that the TCP backlog contains 1 instead of 2+ packets should not
> > > > change stack behavior; this packet should land in the socket receive queue
> > > > eventually.
> > > >
> > > > Are you using epoll() in edge-trigger mode? You mention select(), but
> > > > select() is a rather old and inefficient API.
> > >
> > > nginx is using epoll (c.f. http://nginx.org/en/docs/events.html)
> > >
> > > For the source, see here
> > > https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70
> >
> > I've been trying to reproduce while running nginx under strace (with verbose
> > logging), but have not been successful. It looks like tracing modifies the
> > timing in a way that prevents the issue.
> > Without verbose logging I could hit it (though it needed far more perseverance).
> >
> > What I do have for a working session under strace is:
> > socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
> > ioctl(12, FIONBIO, [1]) = 0
> > epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
> > connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
> > epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
> > getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> > writev(12, [{""..., 830}, {""..., 115}], 2) = 945
> > epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
> > recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
> > readv(12, [{""..., 4096}], 1) = 4096
> > readv(12, [{""..., 4096}], 1) = 4096
> > readv(12, [{""..., 4096}], 1) = 4096
> > readv(12, [{""..., 4096}], 1) = 4096
> > readv(12, [{""..., 4096}], 1) = 4096
> > readv(12, [{""..., 4096}], 1) = 3855
> > epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
> > readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
> > close(12) = 0
> >
> > Under a failing session I got:
> > socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> > ioctl(6, FIONBIO, [1]) = 0
> > epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
> > connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
> > epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
> > getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> > writev(6, [{""..., 830}, {""..., 120}], 2) = 950
> > epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
> > recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
> > epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> > readv(6, [{""..., 2356}], 1) = 2356
> > readv(6, [{""..., 4096}], 1) = 3362
> > epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> > readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
> > readv(6, [{""..., 4096}], 1) = 4096
> > readv(6, [{""..., 4096}], 1) = 4096
> > readv(6, [{""..., 4096}], 1) = 4096
> > readv(6, [{""..., 4096}], 1) = 3619
> > epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> > ... some other socket with activity and epoll_wait calls for 50 seconds
> > epoll_wait(86, [], 512, 3331) = 0
> > close(6) = 0
> >
> > So the EPOLLRDHUP never seems to happen here.
> >
> > As a side note, in the strace output (over lots of sessions) I see almost no
> > readv() returning -1 EAGAIN, except for the first readv() following recvfrom().
> >
> > I hope this helps in looking for the cause, along with the observation that
> > not collapsing FIN packets also prevents the issue.
>
> Very interesting Bruno, thanks a lot for all these details.
>
> I wonder if you have some existing epoll notification bug that would
> trigger if the last packet received from the peer contains both data
> and FIN.

I just took a tcpdump on the nginx host; the dump contains a TCP session for
which the issue occurred as well as a few working ones (I can't share it
publicly though, unless I could anonymise the payload).

For almost all connections in the dump the last packet from the upstream
carries payload as well as the FIN flag, and the same holds for the failing
one. So a final packet with data plus FIN is not the only trigger (and the
amount of data probably isn't either, as one connection with 6 bytes passed
while the failing one had 7), if it is a trigger at all.

The dump doesn't reveal anything that would catch my attention though, at best
a difference in timing as to when the nginx host sends its ACKs.

> Were the "ss" outputs taken from the nginx side?

Yes, they were.
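In case it helps to make the edge-trigger angle concrete, below is a minimal,
illustrative sketch (not nginx's actual source; names like on_readable() are
made up) of the drain-until-EAGAIN pattern an EPOLLET consumer relies on. If
the readiness edge for the final data+FIN segment is missed or never
generated, nothing ever pulls the remaining bytes or the EOF out of the
receive queue, which would look exactly like the stall above, where no further
EPOLLIN/EPOLLRDHUP shows up after the last successful readv().

/*
 * Illustrative sketch only -- NOT nginx code.  Generic drain-until-EAGAIN
 * handling for a non-blocking socket registered with
 * EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, as seen in the straces above.
 */
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

static void on_readable(int fd)
{
	char buf[4096];

	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));

		if (n > 0)
			continue;	/* hand the payload to the application */
		if (n == 0) {
			close(fd);	/* EOF: the peer's FIN was consumed */
			return;
		}
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return;		/* queue drained: wait for the next edge */
		close(fd);		/* hard error */
		return;
	}
}

static void event_loop(int epfd)
{
	struct epoll_event ev[64];

	for (;;) {
		int nfds = epoll_wait(epfd, ev, 64, -1);

		for (int i = 0; i < nfds; i++)
			if (ev[i].events & (EPOLLIN | EPOLLRDHUP | EPOLLHUP | EPOLLERR))
				on_readable(ev[i].data.fd);
	}
}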
Cheers,
Bruno

> > > > Could you watch/report the output of "ss -temoi" for the frozen TCP flow?
> > >
> > > Here it is, three distinct reproducing attempts:
> > > State  Recv-Q  Send-Q  Local Address:Port        Peer Address:Port
> > > ESTAB  0       0       158.64.2.228:44248        158.64.2.217:webcache      uid:83 ino:13245 sk:87 <->
> > >        skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
> > >
> > > ESTAB  0       0       2001:a18:1:6::228:33572   2001:a18:1:6::217:webcache uid:83 ino:16699 sk:e1 <->
> > >        skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
> > >
> > > ESTAB  0       0       158.64.2.228:44578        158.64.2.217:webcache      uid:83 ino:17628 sk:12c <->
> > >        skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
> > >
> > > From the nginx debug logging I don't get a real clue, though it seems that
> > > for working connections the last event obtained is 2005 (EPOLLMSG |
> > > EPOLLWRBAND | EPOLLWRNORM | EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN |
> > > EPOLLOUT) - the previous ones being 5 - while for failing connections the
> > > last event seen is 5 (EPOLLIN | EPOLLOUT).
> > >
> > > > This might give us a clue about packets being dropped, say if the
> > > > accumulated packet became too big.
> > >
> > > The following minor patch (might be white-space mangled) does prevent the
> > > issue from happening for me:
> > >
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 4904250a9aac..c102cd367c79 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> > >  	if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
> > >  	    TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
> > >  	    ((TCP_SKB_CB(tail)->tcp_flags |
> > > -	      TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> > > +	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
> > >  	    ((TCP_SKB_CB(tail)->tcp_flags ^
> > >  	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
> > >  #ifdef CONFIG_TLS_DEVICE
> > >
> > > Cheers,
> > > Bruno
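For anyone else trying to provoke the data-plus-FIN-in-one-segment case, here
is a small, illustrative upstream stub (my own assumption, not part of the
reported setup; port 8080 simply matches the upstream port in the straces
above). Writing a short response and closing right away usually makes the
kernel emit the final data segment with the FIN set, which is the pattern
under discussion; there is of course no guarantee it reproduces the bug.

/*
 * Minimal illustrative upstream stub: answer any request with a tiny body
 * and close immediately, so the last data segment usually carries the FIN.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	static const char resp[] =
		"HTTP/1.1 200 OK\r\n"
		"Content-Length: 7\r\n"
		"Connection: close\r\n"
		"\r\n"
		"hello!\n";
	int one = 1;
	int srv = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(8080),		/* assumed upstream port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};

	if (srv < 0)
		return 1;
	setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(srv, 16) < 0)
		return 1;

	for (;;) {
		char buf[4096];
		int c = accept(srv, NULL, NULL);

		if (c < 0)
			continue;
		(void)read(c, buf, sizeof(buf));	/* swallow the proxied request */
		(void)write(c, resp, sizeof(resp) - 1);
		close(c);				/* FIN goes out right behind the data */
	}
}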