From: Ben Greear <greearb@candelatech.com>
To: Neal Cardwell <ncardwell@google.com>
Cc: netdev <netdev@vger.kernel.org>
Subject: Re: Debugging stuck tcp connection across localhost
Date: Thu, 6 Jan 2022 07:39:02 -0800
Message-ID: <b3e53863-e80e-704f-81a2-905f80f3171d@candelatech.com>
In-Reply-To: <CADVnQyn97m5ybVZ3FdWAw85gOMLAvPSHiR8_NC_nGFyBdRySqQ@mail.gmail.com>

On 1/6/22 7:20 AM, Neal Cardwell wrote:
> On Thu, Jan 6, 2022 at 10:06 AM Ben Greear <greearb@candelatech.com> wrote:
>>
>> Hello,
>>
>> I'm working on a strange problem, and could use some help if anyone has ideas.
>>
>> On a heavily loaded system (500+ wifi station devices, VRF device per 'real' netdev,
>> traffic generation on the netdevs, etc), I see cases where two processes trying
>> to communicate across localhost with TCP seem to get a stuck network
>> connection:
>>
>> [greearb@bendt7 ben_debug]$ grep 4004 netstat.txt |grep 127.0.0.1
>> tcp        0 7988926 127.0.0.1:4004          127.0.0.1:23184         ESTABLISHED
>> tcp        0  59805 127.0.0.1:23184         127.0.0.1:4004          ESTABLISHED
>>
>> Both processes in question continue to execute, and as far as I can tell, they are properly
>> attempting to read/write the socket, but they are reading/writing 0 bytes (these sockets
>> are non blocking).  If one was stuck not reading, I would expect netstat
>> to show bytes in the rcv buffer, but it is zero as you can see above.
>>
>> Kernel is 5.15.7+ local hacks.  I can only reproduce this in a big messy complicated
>> test case, with my local ath10k-ct and other patches that enable virtual wifi stations,
>> but my code can grab logs at time it sees the problem.  Is there anything
>> more I can do to figure out why the TCP connection appears to be stuck?
> 
> It could be very useful to get more information about the state of all
> the stuck connections (sender and receiver side) with something like:
> 
>    ss -tinmo 'sport = :4004 or sport = :4004'
> 
> I would recommend downloading and building a recent version of the
> 'ss' tool to maximize the information. Here is a recipe for doing
> that:
> 
>   https://github.com/google/bbr/blob/master/Documentation/bbr-faq.md#how-can-i-monitor-linux-tcp-bbr-connections

Thanks for the suggestions!

Here is output from a working system running the same OS. The hand-compiled ss seems to give
similar output; do you think it is still worth building ss manually on the system that shows the bug?

[root@ct523c-3b29 iproute2]# ss -tinmo 'sport = :4004 or sport = :4004'
State             Recv-Q             Send-Q                         Local Address:Port                         Peer Address:Port
ESTAB             0                  0                                  127.0.0.1:4004                            127.0.0.1:40902
	 skmem:(r0,rb87380,t0,tb2626560,f12288,w0,o0,bl0,d0) ts sack reno wscale:4,10 rto:201 rtt:0.009/0.004 ato:40 mss:65483 pmtu:65535 rcvmss:1196 advmss:65483 
cwnd:10 bytes_sent:654589126 bytes_acked:654589126 bytes_received:1687846 segs_out:61416 segs_in:72611 data_segs_out:61406 data_segs_in:11890 send 
582071111111bps lastsnd:163 lastrcv:62910122 lastack:163 pacing_rate 1088548571424bps delivery_rate 261932000000bps delivered:61407 app_limited busy:42494ms 
rcv_rtt:1 rcv_space:43690 rcv_ssthresh:43690 minrtt:0.002
[root@ct523c-3b29 iproute2]# ./misc/ss -tinmo 'sport = :4004 or sport = :4004'
State          Recv-Q          Send-Q                    Local Address:Port                     Peer Address:Port           Process
ESTAB          0               0                             127.0.0.1:4004                        127.0.0.1:40902
	 skmem:(r0,rb87380,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:4,10 rto:201 rtt:0.009/0.003 ato:40 mss:65483 pmtu:65535 rcvmss:1196 advmss:65483 cwnd:10 
bytes_sent:654597556 bytes_acked:654597556 bytes_received:1687846 segs_out:61418 segs_in:72613 data_segs_out:61408 data_segs_in:11890 send 582071111111bps 
lastsnd:219 lastrcv:62916882 lastack:218 pacing_rate 1088548571424bps delivery_rate 261932000000bps delivered:61409 app_limited busy:42495ms rcv_rtt:1 
rcv_space:43690 rcv_ssthresh:43690 minrtt:0.002
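
To compare a healthy socket like the one above against a stuck one, it may help to pull
the numeric fields out of the ss detail text. What follows is only a minimal sketch in
Python: the function name parse_ss_detail is hypothetical, the sample string is an
abbreviated copy of the output above, and the parser assumes the field layout shown
here (other ss versions may print different fields).

```python
import re


def parse_ss_detail(text):
    """Parse the per-socket detail text printed by `ss -tinmo` (hypothetical helper)."""
    info = {}

    # skmem:(r0,rb87380,t0,...) -> {"r": 0, "rb": 87380, "t": 0, ...}
    m = re.search(r"skmem:\(([^)]*)\)", text)
    if m:
        info["skmem"] = {
            key: int(value)
            for key, value in re.findall(r"([a-z]+)(\d+)", m.group(1))
        }

    # Plain integer key:value tokens such as lastsnd:163 or bytes_acked:654589126.
    # Tokens with non-integer values (rtt:0.009/0.004, wscale:4,10, busy:42494ms)
    # are deliberately skipped by the lookahead.
    for key, value in re.findall(r"\b([a-z_]+):(\d+)(?=\s|$)", text):
        info[key] = int(value)

    return info


# Abbreviated sample taken from the working-system output above.
sample = (
    "skmem:(r0,rb87380,t0,tb2626560,f12288,w0,o0,bl0,d0) ts sack reno "
    "wscale:4,10 rto:201 rtt:0.009/0.004 ato:40 mss:65483 cwnd:10 "
    "bytes_sent:654589126 bytes_acked:654589126 "
    "lastsnd:163 lastrcv:62910122 lastack:163"
)

info = parse_ss_detail(sample)
unacked = info["bytes_sent"] - info["bytes_acked"]
print("unacked bytes:", unacked)
print("ms since last send/recv/ack:",
      info["lastsnd"], info["lastrcv"], info["lastack"])
print("send buffer limit (tb):", info["skmem"]["tb"])
```

On a stuck connection one would presumably look for a growing gap between bytes_sent
and bytes_acked together with large lastsnd/lastack values, but that interpretation is
an assumption to check against the real output.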

> 
> It could also be very useful to collect and share packet traces, as
> long as taking traces does not consume an infeasible amount of space,
> or perturb timing in a way that makes the buggy behavior disappear.
> For example, as root:
> 
>    tcpdump -w /tmp/trace.pcap -s 120 -c 100000000 -i any port 4004 &

I guess this could be -i lo?

I sometimes see what is likely a similar problem on connections to an external process, but the
easiest thing to reproduce is the stuck localhost connection, and my assumption is that it would
also be the easiest to debug.

I should have enough space for captures, I'll give that a try.
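
If disk space did become a concern, one option might be to let tcpdump rotate through a
fixed set of files rather than write one large trace. A rough sketch of building such a
command line in Python follows. The helper name tcpdump_ring_cmd and the default sizes
are assumptions for illustration; -C (file size in millions of bytes) and -W (number of
files to rotate through) are standard tcpdump options, and the interface, snap length,
port and output path are taken from the commands discussed above.

```python
import shlex


def tcpdump_ring_cmd(iface="lo", port=4004, snaplen=120,
                     file_mb=100, file_count=10,
                     out="/tmp/trace.pcap"):
    """Build a tcpdump argument list that writes a bounded ring of capture files.

    file_mb and file_count are illustrative defaults, not values from the thread.
    """
    return [
        "tcpdump",
        "-i", iface,             # capture on loopback only
        "-s", str(snaplen),      # truncate each packet to snaplen bytes
        "-C", str(file_mb),      # start a new file after about file_mb million bytes
        "-W", str(file_count),   # keep at most file_count files, overwriting the oldest
        "-w", out,               # base name of the capture files
        "port", str(port),
    ]


cmd = tcpdump_ring_cmd()
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.Popen as root; the total space used
should stay near file_mb * file_count million bytes.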

Thanks,
Ben

> 
> If space is an issue, you might start taking traces once things get
> stuck to see what the retry behavior, if any, looks like.
> 
> thanks,
> neal
> 


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
