Hi,

I'm trying to debug a (possible) TCP issue we have been encountering
sporadically during the past couple of years. Currently we're running
4.9.144, but we've been observing this since at least 3.16.

Tl;DR: I believe we are seeing a case where snd_wl1 fails to be properly
updated, leading to inability to recover from a TCP persist state and
would appreciate some help debugging this.

The long version:

The issue manifests with the client → server direction of an rsync
pipeline being stuck in TCP persist even though the server actually
advertises a non-zero window. The stall is not readily reproducible,
although it happens quite often (with a ~10% probability I'd say) when a
cluster of 4 machines tries to rsync an 800GB dataset from a single
server at the same time.

For those not familiar with the way rsync works, it essentially creates
a self-throttling, blocking pipeline using both directions of a single
TCP stream to connect the stages:

                 C               S              C
              generator -----> sender -----> receiver
                          A             A'

      [C: Client, S: Server, A & A': TCP stream directions]

Direction A carries file checksum data for the sender to decide what to
send, and A' carries file data for the receiver to write to disk. It's
always A that ends up in persist mode, while A' works normally. When the
zero-window condition hits, eventually the whole transfer stalls because
the generator does not send out metadata and the server has nothing more
to process and send to the receiver. When this happens, the socket on C
looks like this:

 $ ss -mito dst :873

 State      Recv-Q Send-Q                Local Address:Port                                 Peer Address:Port                
 ESTAB      0      392827               2001:db8:2a::3:38022                             2001:db8:2a::18:rsync                 timer:(persist,1min56sec,0)
	 skmem:(r0,rb4194304,t0,tb530944,f3733,w401771,o0,bl0,d757) ts sack cubic wscale:7,7 rto:204 backoff:15 rtt:2.06/0.541 ato:40 mss:1428 cwnd:10 ssthresh:46 bytes_acked:22924107 bytes_received:100439119971 segs_out:7191833 segs_in:70503044 data_segs_out:16161 data_segs_in:70502223 send 55.5Mbps lastsnd:16281856 lastrcv:14261988 lastack:3164 pacing_rate 133.1Mbps retrans:0/11 rcv_rtt:20 rcv_space:2107888 notsent:392827 minrtt:0.189


while the socket on S looks like this:

 $ ss -mito src :873

 State      Recv-Q Send-Q                Local Address:Port                                 Peer Address:Port                
 ESTAB      0      0                   2001:db8:2a::18:rsync                              2001:db8:2a::3:38022                 timer:(keepalive,3min7sec,0)
  	 skmem:(r0,rb3540548,t0,tb4194304,f0,w0,o0,bl0,d292) ts sack cubic wscale:7,7 rto:204 rtt:1.234/1.809 ato:40 mss:1428 cwnd:1453 ssthresh:1431 bytes_acked:100439119971 bytes_received:22924106 segs_out:70503089 segs_in:7191833 data_segs_out:70502269 data_segs_in:16161 send 13451.4Mbps lastsnd:14277708 lastrcv:16297572 lastack:7012576 pacing_rate 16140.1Mbps retrans:0/794 rcv_rtt:7.5 rcv_space:589824 minrtt:0.026

There's a non-empty send queue on C, while S obviously has enough space
to accept new data. Also note the difference between lastsnd and lastrcv
on C. tcpdump reveals the ZWP exchange between C and S:

  […]
  09:34:34.165148 0c:c4:7a:f9:68:e4 > 0c:c4:7a:f9:69:78, ethertype IPv6 (0x86dd), length 86: (flowlabel 0xcbf6f, hlim 64, next-header TCP (6) payload length: 32) 2001:db8:2a::3.38022 > 2001:db8:2a::18.873: Flags [.], cksum 0x711b (incorrect -> 0x4d39), seq 4212361595, ack 1253278418, win 16384, options [nop,nop,TS val 2864739840 ecr 2885730760], length 0
  09:34:34.165354 0c:c4:7a:f9:69:78 > 0c:c4:7a:f9:68:e4, ethertype IPv6 (0x86dd), length 86: (flowlabel 0x25712, hlim 64, next-header TCP (6) payload length: 32) 2001:db8:2a::18.873 > 2001:db8:2a::3.38022: Flags [.], cksum 0x1914 (correct), seq 1253278418, ack 4212361596, win 13831, options [nop,nop,TS val 2885760967 ecr 2863021624], length 0
  [… repeats every 2 mins]

S responds with a non-zero window (13831 << 7), but C seems to never
pick it up. I dumped the internal connection state by hooking at the
bottom of tcp_ack using the attached systemtap script, which reveals the
following:

 ack: 4212361596, ack_seq: 1253278418, prior_snd_una: 4212361596
 sk_send_head seq:4212361596, end_seq: 4212425472
 snd_wnd: 0, tcp_wnd_end: 4212361596, snd_wl1: 1708927328
 flag: 4100, may update window: 0
 rcv_tsval: 2950255047, ts_recent: 2950255047

Everything seems to check out, apart from the (strange ?) fact that
ack_seq < snd_wl1 by some 450MB, which AFAICT leads
tcp_may_update_window() to reject the update:

  static inline bool tcp_may_update_window(const struct tcp_sock *tp,
                                          const u32 ack, const u32 ack_seq,
                                          const u32 nwin)
  {
          return  after(ack, tp->snd_una) ||
                  after(ack_seq, tp->snd_wl1) ||
                  (ack_seq == tp->snd_wl1 && nwin > tp->snd_wnd);
  }

If I understand correctly, the only ways to recover from a zero window
in a case like this would be for ack_seq to equal snd_wl1, or
after(ack_seq, tp->snd_wl1) to be true, none of which holds in our case.

Overall it looks like snd_wl1 stopped advancing at some point and the
peer sequence numbers wrapped in the meantime as traffic in the opposite
direction continued. Every single of the 5 hung connections I've seen in
the past week is in this state, with ack_seq < snd_wl1. The problem is
that - at least to my eyes - there's no way snd_wl1 could *not* advance
when processing a valid ACK, so I'm really stuck here (or I'm completely
misled and talking nonsense :) Any ideas?

Some potentially useful details about the setup and the issue:

 - All machines currently run Debian Stretch with kernel 4.9.144; we
   have been seeing this since at least Linux 3.16.

 - We've witnessed the issue with different hardware (servers &
   NICs). Currently all NICs are igb, but we've had tg3 on both sides at
   some point and still experienced hangs. We tweaked TSO settings in
   the past and it didn't seem to make a difference.

 - It correlates with network congestion. We can never reproduce this
   with a single client, but it happens when all 4 machines try to rsync
   at the same time. Also limiting the bandwidth of the transfers from
   userspace makes the issue less frequent.

Regards,
Apollon

P.S: Please Cc me on any replies as I'm not subscribed to the list.