From: Apollon Oikonomopoulos <apoikos@dmesg.gr>
To: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>,
	Netdev <netdev@vger.kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Soheil Hassas Yeganeh <soheil@google.com>
Subject: Re: TCP sender stuck in persist despite peer advertising non-zero window
Date: Thu, 22 Oct 2020 15:47:53 +0300
Message-ID: <87eelqs9za.fsf@marvin.dmesg.gr>
In-Reply-To: <878sc63y8j.fsf@marvin.dmesg.gr>

Apollon Oikonomopoulos <apoikos@dmesg.gr> writes:
> We are now running the patched kernel on the machines involved. I want
> to give it some time just to be sure, so I'll get back to you by
> Thursday if everything goes well.

It has been almost a week with zero hangs in 60 rsync runs, so I guess
we can call it fixed. We also haven't noticed any ill side-effects. In
the unlikely event it hangs again, I will let you know.

I spent quite some time pondering this issue, and to be honest it
troubles me that it appears to have existed for so long without anyone
else noticing. The only reasonable explanation I can come up with is
the following (please comment/correct me if I'm wrong):

 1. It will not be triggered by most L7 protocols. In "synchronous"
    request-response protocols such as HTTP, each side usually consumes
    all available data before sending. In that case, even if snd_wl1
    falls far enough behind to wrap, the bulk receiver is left with a
    non-zero window and can still send data, so the next acknowledgment
    updates the window and moves snd_wl1 forward. Also, I cannot think
    of any asynchronous protocol apart from rsync where the server
    sends out multi-GB responses without checking for incoming data in
    the process.

 2. Regardless of the application protocol, the receiver must stay in
    the fast path with a zero send window for long enough (at least
    2 GB of received data) to cause a wraparound, yet not so long that
    after(ack_seq, snd_wl1) becomes true again (see the sketch below).
    In practice this means that header prediction must not fail (not
    even once!) and we must never run out of receive space, since
    either condition would send us to the slow path and call tcp_ack().
    I'd argue this is likely to happen only with stable, long-running,
    low- or moderately-paced TCP connections on local networks where
    packet loss is minimal (although most of the time things move
    around as fast as they can on a local network). At this point I
    wonder whether the userspace rate-limiting we enabled on rsync
    actually did more harm…
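
To make the sequence arithmetic in point 2 concrete, here is a small
standalone sketch (a toy model, not the kernel code; the real check is
the after(ack_seq, tp->snd_wl1) test in tcp_may_update_window()) of why
a snd_wl1 that has fallen between 2 GB and 4 GB behind the incoming
sequence numbers makes the window update be silently ignored:

  /* seqwrap.c -- toy model of snd_wl1 going stale; not kernel code,
   * variable names and segment sizes are illustrative only. */
  #include <stdint.h>
  #include <stdio.h>

  /* The same 32-bit comparison the kernel's after(seq2, seq1) does. */
  static int seq_after(uint32_t seq2, uint32_t seq1)
  {
          return (int32_t)(seq1 - seq2) < 0;
  }

  int main(void)
  {
          uint32_t snd_wl1 = 1000;     /* last seq that updated the window */
          uint32_t ack_seq = snd_wl1;  /* latest incoming seq */
          uint64_t rcvd = 0;

          /* Bulk receiver fast path: data keeps arriving, tcp_ack() is
           * never called, so snd_wl1 is never moved forward. */
          while (rcvd < 3ULL * 1024 * 1024 * 1024) {  /* ~3 GB received */
                  ack_seq += 64 * 1024;
                  rcvd += 64 * 1024;
          }

          /* Peer finally advertises a non-zero window: is it accepted? */
          printf("after 3 GB: after(ack_seq, snd_wl1) = %d\n",
                 seq_after(ack_seq, snd_wl1));  /* 0 -> update ignored */

          /* Another ~2 GB and the comparison flips back to true, which
           * is why the window of vulnerability is "only" 2 GB wide. */
          ack_seq += 2U * 1024 * 1024 * 1024;
          printf("after 5 GB: after(ack_seq, snd_wl1) = %d\n",
                 seq_after(ack_seq, snd_wl1));  /* 1 -> update accepted */
          return 0;
  }

The patch we have been testing sidesteps this by also refreshing
snd_wl1 in the bulk receiver fast path, so the comparison can never
become 2 GB stale in the first place.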

Finally, even if someone hits this, any application that cares about
network timeouts will either fail or reconnect, making it look like a
"random network glitch" and leaving no traces behind to debug. And in
the unlikely event that your application lingers forever in the persist
state, it takes a fair amount of annoyance to stop assuming the fault
is on your side, decide that this might indeed be a kernel bug, and go
after it :)

Thanks again for the fast response!

Best,
Apollon

P.S: I wonder if it would make sense to expose snd_una and snd_wl1
     in struct tcp_info to ease debugging.
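
     For illustration, this is roughly how such fields would be
     consumed from userspace once exposed. The sketch below only reads
     fields that struct tcp_info already has today (snd_una and
     snd_wl1 are not among them), such as the persist probe and
     backoff counters:

  /* tcpinfo-peek.c -- sketch of reading TCP_INFO from userspace.
   * snd_una/snd_wl1 are NOT exposed today; this prints a few
   * long-standing fields that at least hint at persist-state trouble. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>

  static void dump_tcp_info(int fd)
  {
          struct tcp_info ti;
          socklen_t len = sizeof(ti);

          memset(&ti, 0, sizeof(ti));
          if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
                  perror("getsockopt(TCP_INFO)");
                  return;
          }
          printf("state=%d probes=%d backoff=%d rto=%u snd_cwnd=%u\n",
                 ti.tcpi_state, ti.tcpi_probes, ti.tcpi_backoff,
                 ti.tcpi_rto, ti.tcpi_snd_cwnd);
  }

  int main(void)
  {
          /* In real debugging you would call this on the stuck rsync
           * socket (ss -ti shows the same structure via inet_diag);
           * a fresh socket here just demonstrates the call. */
          int fd = socket(AF_INET, SOCK_STREAM, 0);

          if (fd < 0) {
                  perror("socket");
                  return 1;
          }
          dump_tcp_info(fd);
          return 0;
  }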

Thread overview: 11+ messages
2020-10-15 18:23 TCP sender stuck in persist despite peer advertising non-zero window Apollon Oikonomopoulos
2020-10-15 20:22 ` Neal Cardwell
2020-10-15 20:27   ` Soheil Hassas Yeganeh
2020-10-15 21:39   ` Yuchung Cheng
2020-10-15 22:12     ` Apollon Oikonomopoulos
2020-10-15 22:37       ` Neal Cardwell
2020-10-16  7:35         ` Eric Dumazet
2020-10-16 16:57         ` Apollon Oikonomopoulos
2020-10-16 17:54           ` Neal Cardwell
2020-10-22 12:47           ` Apollon Oikonomopoulos [this message]
2020-10-22 14:39             ` Neal Cardwell
