From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vitaly Davidovich
Subject: Re: TCP connection closed without FIN or RST
Date: Fri, 3 Nov 2017 11:13:57 -0400
References: <1509568471.3828.50.camel@edumazet-glaptop3.roam.corp.google.com>
 <1509569515.3828.53.camel@edumazet-glaptop3.roam.corp.google.com>
 <1509573771.3828.58.camel@edumazet-glaptop3.roam.corp.google.com>
 <1509577617.3828.62.camel@edumazet-glaptop3.roam.corp.google.com>
 <1509714010.2849.41.camel@edumazet-glaptop3.roam.corp.google.com>
 <1509714167.2849.43.camel@edumazet-glaptop3.roam.corp.google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Cc: netdev
To: Eric Dumazet

Ok, an interesting finding. The client was originally running with an
SO_RCVBUF of 75K (someone apparently set that at some point, for reasons
unknown). I tried the test with a 1MB receive buffer and everything
works perfectly! The client advertises a zero window, the server simply
enters the persist state and sends window probes, and the client keeps
answering with a zero window until it wakes up and starts draining its
receive buffer. At that point the window opens up and the server sends
more data. Basically, things look exactly as one would expect in this
situation :).

/proc/sys/net/ipv4/tcp_rmem is 131072 1048576 20971520. The conversation
flows normally, as described above, when I change the client's receive
buffer size to 1048576. I also tried 131072, but that doesn't work -
same retransmits-with-no-ACKs situation.

I think this eliminates (right?) any middlebox from the equation.
Instead, perhaps it's some bad interaction between a small receive
buffer and either some other TCP setting or offload mechanics (LRO
specifically). Still investigating further.
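For completeness, this is roughly what the buffer pinning looks like on
the client side - a minimal sketch, not the actual client code; the
helper name and hard-coded sizes are mine. The relevant wrinkle is that
an explicit setsockopt(SO_RCVBUF) disables the kernel's receive-buffer
autotuning for that socket, so the 20971520 max in tcp_rmem never comes
into play for it:

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: pin the receive buffer the way the client
 * apparently did.  rcvbuf_bytes stands in for whatever value the
 * original code used (75K in the failing runs). */
static int make_client_socket(int rcvbuf_bytes)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return -1;
    }

    /* An explicit SO_RCVBUF turns off receive-buffer autotuning for
     * this socket; the kernel also doubles the value internally to
     * account for bookkeeping overhead. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        close(fd);
        return -1;
    }

    /* getsockopt() reads back the doubled value. */
    int actual = 0;
    socklen_t len = sizeof(actual);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) == 0)
        fprintf(stderr, "effective SO_RCVBUF: %d bytes\n", actual);

    return fd;
}

Without the setsockopt() call, the socket would instead start from
tcp_rmem's default (1048576 here) and autotune from there, which lines
up with the runs that behave correctly.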
On Fri, Nov 3, 2017 at 10:02 AM, Vitaly Davidovich wrote:
> On Fri, Nov 3, 2017 at 9:39 AM, Vitaly Davidovich wrote:
>> On Fri, Nov 3, 2017 at 9:02 AM, Eric Dumazet wrote:
>>> On Fri, 2017-11-03 at 06:00 -0700, Eric Dumazet wrote:
>>>> On Fri, 2017-11-03 at 08:41 -0400, Vitaly Davidovich wrote:
>>>> > Hi Eric,
>>>> >
>>>> > Ran a few more tests yesterday with packet captures, including a
>>>> > capture on the client. It turns out that the client stops ACKing
>>>> > entirely at some point in the conversation - the last advertised
>>>> > client window is not even close to zero (it's actually ~348K). So
>>>> > there's complete radio silence from the client for some reason,
>>>> > even though it does send back ACKs early on in the conversation.
>>>> > So yes, as far as the server is concerned, the client is
>>>> > completely gone, and tcp_retries2 is rightfully breached
>>>> > eventually once the server's retransmissions go unanswered for
>>>> > long enough (and enough times).
>>>> >
>>>> > What's odd, though, is that the packet capture on the client
>>>> > shows the server's retransmitted packets arriving, so it's not as
>>>> > if the segments don't reach the client. I'll keep investigating,
>>>> > but if you (or anyone else reading this) know of circumstances
>>>> > that might cause this, I'd appreciate any tips on where/what to
>>>> > look at.
>>>>
>>>> Might be a middlebox issue? Like a firewall's connection tracking
>>>> having some kind of timeout if nothing is sent in one direction?
>>>>
>>>> What output do you have from the client side with:
>>>>
>>>> ss -temoi dst
>>>
>>> It could also be a wrapping issue on TCP timestamps.
>>>
>>> You could try disabling TCP timestamps, and restart the TCP flow.
>>>
>>> echo 0 >/proc/sys/net/ipv4/tcp_timestamps
>> Ok, I will try to do that. Thanks for the tip.
> Tried with tcp_timestamps disabled on the client (didn't touch the
> server), but that didn't change the outcome - same issue at the end.
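Also, in case it's useful alongside ss -temoi: much of the same
per-connection detail can be sampled from inside the client process via
TCP_INFO. A minimal sketch, assuming fd is the connected socket (the
field selection is just what seems relevant to this thread, not the
full struct):

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* struct tcp_info, TCP_INFO */
#include <sys/socket.h>

/* Dump retransmit activity, the receive-side window accounting, and
 * how long ago this socket last received data. */
static void dump_tcp_info(int fd)
{
    struct tcp_info ti;
    socklen_t len = sizeof(ti);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
        perror("getsockopt(TCP_INFO)");
        return;
    }

    printf("state=%u retransmits=%u total_retrans=%u\n",
           ti.tcpi_state, ti.tcpi_retransmits, ti.tcpi_total_retrans);
    printf("rcv_space=%u last_data_recv=%ums rtt=%uus\n",
           ti.tcpi_rcv_space, ti.tcpi_last_data_recv, ti.tcpi_rtt);
}

ss is the easier tool on a live box, of course; this is mainly handy
when the client wants to log its own view of the connection at the
moment things go quiet.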