TCP stall issue

From: Gil Pedersen @ 2021-02-23 10:09 UTC (permalink / raw)
  To: davem, yoshfuji, dsahern; +Cc: netdev

Hi,

I am investigating a TCP stall that can occur when sending to an Android device (kernel 4.9.148) from an Ubuntu server running kernel 5.11.0.

The issue seems to be that RACK is not applied when a D-SACK (with SACK) is received on the server after an RTO re-transmission (CA_Loss state). Here the re-transmitted segment is considered to be already delivered and loss undo logic is applied. Then nothing is re-transmitted until the next RTO, where the next segment is sent and the same thing happens again. This causes the retransmitted segments to be delivered at a rate of ~1 per second, so a burst loss of e.g. 20 segments causes a 20+ second stall. I would expect RACK to kick in long before this happens.
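
As a rough illustration of the arithmetic above, the stall length can be modelled as one recovered segment per RTO. This is only a sketch: the function name and parameters are hypothetical, and it assumes the RTO stays at a constant ~1 second rather than backing off.

```python
def estimated_stall_seconds(lost_segments: int, rto_seconds: float = 1.0) -> float:
    """Estimate the stall when each RTO recovers exactly one lost segment.

    Assumption (not measured): the RTO is constant at rto_seconds.
    """
    if lost_segments < 0 or rto_seconds < 0:
        raise ValueError("lost_segments and rto_seconds must be non-negative")
    return lost_segments * rto_seconds
```

For example, `estimated_stall_seconds(20)` gives the 20 second figure quoted above.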

Note the D-SACK should not be considered spurious, as the TSecr value matches the re-transmission TSval.
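
The timestamp reasoning can be sketched as follows. This is a simplified Eifel-style check, not the kernel's actual undo logic: an ACK whose TSecr is older than the TSval of the retransmission must have been triggered by the original transmission, while a TSecr equal to that TSval means the receiver echoed the retransmission itself. The function names are hypothetical.

```python
TS_MASK = 0xFFFFFFFF  # TCP timestamps are 32-bit values that wrap around


def ts_before(a: int, b: int) -> bool:
    """Return True if 32-bit timestamp a is earlier than b, allowing for wraparound."""
    return ((a - b) & TS_MASK) > 0x7FFFFFFF


def ack_predates_retransmit(tsecr: int, retrans_tsval: int) -> bool:
    """Return True if the ACK echoes a timestamp older than the retransmission.

    True suggests the ACK was triggered by the original transmission, so the
    retransmission may have been spurious. False (for example when tsecr equals
    retrans_tsval) means the ACK was triggered by the retransmission or later.
    """
    return ts_before(tsecr, retrans_tsval)
```

In the case described here, TSecr equals the retransmission's TSval, so `ack_predates_retransmit` returns False.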

Also, the Android receiver is definitely sending strange D-SACKs that do not properly advance the ACK number to include received segments. However, I can't control it and need to fix it on the server by quickly re-transmitting the segments. The connection itself is functional. If the client makes a request to the server in this state, it can respond and the client will receive any segments sent in reply.

I can see from counters that TcpExtTCPLossUndo & TcpExtTCPSackFailures are incremented on the server when this happens.
The issue appears with F-RTO both enabled and disabled. It also appears with both BBR and Reno.
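
One way to watch these two counters while reproducing the stall is to diff two snapshots of /proc/net/netstat. The sketch below is a minimal example; the helper names are hypothetical. It relies on the file's layout of paired lines (a header line of names followed by a line of values, both starting with the same prefix such as "TcpExt:").

```python
WATCHED = ("TcpExtTCPLossUndo", "TcpExtTCPSackFailures")


def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat-style text into {"TcpExtTCPLossUndo": 3, ...}."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: "Prefix: name name ..." then "Prefix: value value ..."
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            continue  # skip pairs that do not line up
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix + name] = int(number)
    return counters


def counter_deltas(before: dict, after: dict, names=WATCHED) -> dict:
    """Return how much each named counter grew between two snapshots."""
    return {n: after.get(n, 0) - before.get(n, 0) for n in names}


def read_netstat(path: str = "/proc/net/netstat") -> dict:
    """Take one snapshot of the counters (Linux only)."""
    with open(path) as f:
        return parse_netstat(f.read())
```

Calling `read_netstat()` before and after a stall and passing both results to `counter_deltas` shows how many times each counter was incremented in between.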

Any idea of why this happens, or suggestions on how to debug the issue further?

/Gil
