From mboxrd@z Thu Jan  1 00:00:00 1970
From: Eric Dumazet <edumazet@google.com>
Subject: Re: Linux ECN Handling
Date: Mon, 20 Nov 2017 07:40:52 -0800
Message-ID: <CANn89iK-hwXjO6y36ORN_JeXSA2aaRsPOc=p3T4qCQREEzg4uw@mail.gmail.com>
References: <CACJspmLFdy9i8K=TkzXHnofyFMNoJf9HkYE7On8uG+PREc2Dqw@mail.gmail.com>
 <20171019124312.GE16796@breakpoint.cc> <CACJspmKjAr+q9cFVssXVxWQMCUWe3TNYO77m0nQwzQK4hTCOzA@mail.gmail.com>
 <CADVnQymmMeo=cemUPQCzHpNvFPKCaqY1+VPchmui+HWuCL7CuA@mail.gmail.com>
 <5A006CF6.1020608@iogearbox.net> <CACJspmL3SKhsAa9_2uEFNGADd0Q3e8S1hrgRAmvXjbnuevasyg@mail.gmail.com>
 <CACJspmLJWE0Ld1ZW_PgKoyeNgbDumL-P3mWAN-8WZRM9fGVvFg@mail.gmail.com> <CADVnQy=q4qBpe0Ymo8dtFTYU_0x0q_XKE+ZvazLQE-ULwu7pQA@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Cc: Steve Ibanez <sibanez@stanford.edu>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Netdev <netdev@vger.kernel.org>, Florian Westphal <fw@strlen.de>,
        Mohammad Alizadeh <alizadeh@csail.mit.edu>,
        Lawrence Brakmo <brakmo@fb.com>,
        Yuchung Cheng <ycheng@google.com>
To: Neal Cardwell <ncardwell@google.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mail-wm0-f46.google.com ([74.125.82.46]:47068 "EHLO
        mail-wm0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751312AbdKTPkz (ORCPT
        <rfc822;netdev@vger.kernel.org>); Mon, 20 Nov 2017 10:40:55 -0500
Received: by mail-wm0-f46.google.com with SMTP id u83so11240214wmb.5
        for <netdev@vger.kernel.org>; Mon, 20 Nov 2017 07:40:54 -0800 (PST)
In-Reply-To: <CADVnQy=q4qBpe0Ymo8dtFTYU_0x0q_XKE+ZvazLQE-ULwu7pQA@mail.gmail.com>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On Mon, Nov 20, 2017 at 7:01 AM, Neal Cardwell <ncardwell@google.com> wrote:
> Going back to one of your Oct 19 trace snapshots (attached), AFAICT at the
> time of the timeout there is actually almost 64KBytes  (352553398 + 1448 -
> 352489686 = 65160) of unacknowledged data. So there really does seem to be a
> significant chunk of packets that were in-flight that were then declared
> lost.
>
> So here is a possibility: perhaps the combination of CWR+PRR plus
> tcp_tso_should_defer() means that PRR can make cwnd so gentle that
> tcp_tso_should_defer() thinks we should wait for another ACK to send, and
> that ACK doesn't come. Breaking it, down, the potential sequence would be:
>
> (1) tcp_write_xmit() does not send, because the CWR behavior, using PRR,
> does not leave enough cwnd for tcp_tso_should_defer() to think we should
> send (PRR was originally designed for recovery, which did not have TSO
> deferral)
>
> (2) TLP does not fire, because we are in state CWR, not Open
>
> (3) The only remaining option is an RTO, which fires.
>
> In other words, the possibility is that, at the time of the stall, the cwnd
> is reasonably high, but tcp_packets_in_flight() is also quite high, so
> either there is (a) literally no unused cwnd left ( tcp_packets_in_flight()
> == cwnd), or (b) some mechanism like tcp_tso_should_defer() is deciding that
> there is not enough available cwnd for it to make sense to chop off a
> fraction of a TSO skb to send now.
>
> One way to test that conjecture would be to disable tcp_tso_should_defer()
> by adding a:
>
>    goto send_now;
>
> at the top of tcp_tso_should_defer().
>
> If that doesn't prevent the freezes then I would recommend adding printks or
> other instrumentation to  tcp_write_xmit() to log:
>
> - time
> - ca_state
> - cwnd
> - ssthresh
> - tcp_packets_in_flight()
> - the reason for breaking out of the tcp_write_xmit() loop (tso deferral, no
> packets left, tcp_snd_wnd_test, tcp_nagle_test, etc)
>
> cheers,
> neal
>
>
>
> On Mon, Nov 20, 2017 at 2:31 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
>>
>> Hi Folks,
>>
>> I wanted to check back in on this for another update and to solicit
>> some more suggestions. I did a bit more digging to try an isolate the
>> problem.
>>
>> As I explained earlier, the log generated by tcp_probe indicates that
>> the snd_cwnd is set to 1 just before the end host receives an ECN
>> marked ACK and unexpectedly enters a timeout (
>> https://drive.google.com/open?id=1iyt8PvBxQga2jpRpBJ8KdQw3Q_mPTzZF ).
>> I was trying to track down where this is happening, but the only place
>> I could find that might be setting the snd_cwnd to 1 is in the
>> tcp_enter_loss() function. I inserted a printk() call in this function
>> to see when it is being invoked and it looks like it is only called by
>> the tcp_retransmit_timer() function after the timer expires.
>>
>> I decided to try recording the snd_cwnd, ss-thresh, and icsk_ca_state
>> inside the tcp_fastretrans_alert() function whenever it processes an
>> ECN marked ACK (
>> https://drive.google.com/open?id=17GD77lb9lkCSu0_s9p40GZ5r4EU8B4VB )
>> This plot also shows when the tcp_retransmit_timer() and
>> tcp_enter_loss() functions are invoked (red and purple dots
>> respectively). And I see that the ACK state machine is always either
>> in the TCP_CA_Open or TCP_CA_CWR state whenever the
>> tcp_fastretrans_alert() function processes ECN marked ACKs (
>> https://drive.google.com/open?id=1xwuPxjgwriT9DSblFx2uILfQ95Fy-Eqq ).
>> So I'm not sure where the snd_cwnd is being set to 1 (or possibly 0 as
>> Neal suggested) just before entering a timeout. Any suggestions here?
>>
>> In order to do a bit of profiling of the tcp_dctcp code I added
>> support into tcp_probe for recording the dctcp alpha parameter. I see
>> that alpha oscillates around about 0.1 when the flow rates have
>> converged, it goes to zero when the other host enters a timeout, and I
>> don't see any unexpected behavior just before the timeout (
>> https://drive.google.com/open?id=1zPdyS57TrUYZIekbid9p1UNyraLYrdw7 ).
>>
>> So I haven't had much luck yet trying to track down where the problem
>> is. If you have any suggestions that would help me to focus my search
>> efforts, I would appreciate the comments.
>>
>> Thanks!
>> -Steve

Steve, what HZ value your kernel is compiled with ?