All of lore.kernel.org
* Re: Linux ECN Handling
       [not found] <CACJspmLFdy9i8K=TkzXHnofyFMNoJf9HkYE7On8uG+PREc2Dqw@mail.gmail.com>
@ 2017-10-19 12:43 ` Florian Westphal
  2017-10-23 22:15   ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Florian Westphal @ 2017-10-19 12:43 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: ncardwell, daniel, Mohammad Alizadeh, Nick McKeown, Lavanya Jose, netdev

[ full-quoting due to Cc fixups, adding netdev ]

Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi Florian, Neal, and Daniel,
> 
> I hope this email finds you well. My name is Stephen Ibanez and I'm a PhD
> Student at Stanford currently working on a project with Mohammad Alizadeh,
> Nick McKeown, and Lavanya Jose. We have been doing some experiments using
> the linux DCTCP implementation and are trying to understand some strange
> behavior that we are encountering. I'm contacting you three because I have
> seen your names on some of the source files and recent commits in the linux
> source tree. Hopefully you can help us out or put us in contact with the
> right people?
> 
> Here are some details about our servers:
> 
>    - Distribution: Ubuntu 14.04 LTS
>    - Kernel release: 4.4.0-75-generic

Can you re-test with a more recent kernel such as 4.13.8?

> *The experiment:*
> 
> We use iperf3 to generate two DCTCP flows from different servers to a
> common server, as shown in the diagram below. We measure the sending rate
> of each flow, record the tcp_probe output, and run tcpdump on the
> source host interfaces.
> 
> [image: Inline image 6]
> 
> *The problem:*
> 
> Our rate measurements look like the one shown below; the flows often enter
> timeouts. In this case, both flows hit a timeout at t=0.3.
> [image: Inline image 2]
> 
> When looking at the sequence of packets seen at the source host interfaces
> around this timeout event this is what we see:
> 
> *10.0.0.1 timeout event:*
> [image: Inline image 3]
> 
> *10.0.0.3 timeout event:*
> [image: Inline image 4]
> 
> In both cases, the source:
> (1) receives an ACK for byte XYZ with the ECN flag set
> (2) stops sending anything for RTO_min=300ms
> (3) sends a retransmission for byte XYZ
> 
> I have verified that this behavior is consistent across multiple experiment
> runs. Here are the CWND samples for the 10.0.0.1 flow provided by tcp_probe
> at the time of the timeout event:
> 
> [image: Inline image 5]
> 
> From what I can tell, tcp_probe logs a sample whenever a packet is
> received. If this is true, then that means when the source receives the
> final ECN marked ACK just before the timeout the CWND=1 MSS.
> 
> *The conclusion:*
> 
> We believe that there may be an issue with how the linux kernel is handling
> the ECN echoes. For DCTCP, if the CWND is 1 MSS and the end host is still
> receiving ECN marks then the CWND should remain at 1 MSS and should *not*
> enter a timeout. This is because the switch can perform ECN marking very
> aggressively causing the source end host to receive many redundant ECN
> echoes over a short period of time.
> 
> Another potential issue is that from the CWND plot above it looks like the
> end host may be reacting to congestion signals more than once per window,
> which should not happen (Section 5 of RFC 3168
> <https://tools.ietf.org/html/rfc3168>). tcp_probe reports SRTT measurements
> of about 400-500 us and in the plot above the CWND is reduced 6 times
> within this amount of time.
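The once-per-window rule cited here can be sketched with a toy model (an illustration of the RFC 3168 gating, not the kernel's actual code; all names and numbers below are made up):

```python
# Toy model of RFC 3168's once-per-window ECN reaction: after reducing
# cwnd in response to one ECE-marked ACK, the sender ignores further ECE
# marks until all data outstanding at reduction time has been acked.

def react_to_ecn(acks, init_cwnd, snd_nxt):
    """acks: list of (ack_seq, ece_flag). Returns (final_cwnd, reductions)."""
    cwnd = init_cwnd
    high_seq = None          # highest seq outstanding at the last reduction
    reductions = 0
    for ack_seq, ece in acks:
        if high_seq is not None and ack_seq > high_seq:
            high_seq = None  # that window is fully acked; may react again
        if ece and high_seq is None:
            cwnd = max(cwnd // 2, 1)  # classic halving; DCTCP scales by alpha
            high_seq = snd_nxt        # gate further reactions this window
            reductions += 1
    return cwnd, reductions

# Six ECE-marked ACKs inside one window should yield a single reduction:
acks = [(seq, True) for seq in range(1000, 7000, 1000)]
cwnd, reductions = react_to_ecn(acks, init_cwnd=64, snd_nxt=10000)
assert (cwnd, reductions) == (32, 1)
```

Six cwnd reductions within one SRTT, as in the plot described, would mean this gating is not taking effect.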
> 
> We have not yet tracked down the code path in the kernel code that is
> causing the behavior described above. Perhaps this is something that you
> can help us with? We would love to hear your thoughts on this matter and
> are happy to try other experiments that you suggest.
> 
> Here is a link
> <https://drive.google.com/file/d/0Bw-GEX7h5ufiYmpCV2VpOGEtQWs/view?usp=sharing>
> to download the packet traces if you would like to take a look.
> han-1_host.pcap is the trace from 10.0.0.1 and han-3_host.pcap is the trace
> from 10.0.0.3.
> 
> Looking forward to hearing from you!
> 
> Best,
> -Steve

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-10-19 12:43 ` Linux ECN Handling Florian Westphal
@ 2017-10-23 22:15   ` Steve Ibanez
  2017-10-24  1:11     ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2017-10-23 22:15 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, Mohammad Alizadeh

Hi All,

I upgraded the kernel on all of our machines to Linux
4.13.8-041308-lowlatency. However, I'm still observing the same
behavior where the source enters a timeout when the CWND=1MSS and it
receives ECN marks.

Here are the measured flow rates:
<https://drive.google.com/file/d/0B-bt9QS-C3ONT0VXMUt6WHhKREE/view?usp=sharing>

Here are snapshots of the packet traces at the sources when they both
enter a timeout at t=1.6sec:

10.0.0.1 timeout event:
<https://drive.google.com/file/d/0B-bt9QS-C3ONcl9WRnRPazg2ems/view?usp=sharing>

10.0.0.3 timeout event:
<https://drive.google.com/file/d/0B-bt9QS-C3ONeDlxRjNXa0VzWm8/view?usp=sharing>

Both still essentially follow the same sequence of events that I
mentioned earlier:
(1) receives an ACK for byte XYZ with the ECN flag set
(2) stops sending for RTO_min=300ms
(3) sends a retransmission for byte XYZ

The cwnd samples reported by tcp_probe still indicate that the sources
are reacting to the ECN marks more than once per window. Here are the
cwnd samples at the same timeout event mentioned above:
<https://drive.google.com/file/d/0B-bt9QS-C3ONdEZQdktpaW5JUm8/view?usp=sharing>
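One way to quantify the more-than-once-per-window claim from tcp_probe samples is a sliding-window count of cwnd decreases per SRTT. This is only a sketch with invented sample data, not the real tcp_probe log format:

```python
# Count how many cwnd decreases fall within a single SRTT; more than one
# suggests the sender is reacting to ECN marks more than once per window.
# Times are in microseconds; the sample values below are invented.

def max_reductions_per_srtt(samples, srtt_us):
    """samples: list of (time_us, cwnd_segments), sorted by time."""
    drops = [t2 for (t1, c1), (t2, c2) in zip(samples, samples[1:]) if c2 < c1]
    best = 0
    for i, start in enumerate(drops):
        best = max(best, sum(1 for t in drops[i:] if t - start <= srtt_us))
    return best

samples = [(0, 60), (100, 50), (200, 42), (300, 35),
           (400, 29), (500, 24), (600, 20), (1000, 20)]
# Six decreases within a 500us SRTT, like the behavior described above:
assert max_reductions_per_srtt(samples, srtt_us=500) == 6
```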

Let me know if there is anything else you think I should try.

Thanks,
-Steve

On Thu, Oct 19, 2017 at 5:43 AM, Florian Westphal <fw@strlen.de> wrote:
>
> [ full-quoting due to Cc fixups, adding netdev ]
>
> Steve Ibanez <sibanez@stanford.edu> wrote:
> > Hi Florian, Neal, and Daniel,
> >
> > I hope this email finds you well. My name is Stephen Ibanez and I'm a PhD
> > Student at Stanford currently working on a project with Mohammad Alizadeh,
> > Nick McKeown, and Lavanya Jose. We have been doing some experiments using
> > the linux DCTCP implementation and are trying to understand some strange
> > behavior that we are encountering. I'm contacting you three because I have
> > seen your names on some of the source files and recent commits in the linux
> > source tree. Hopefully you can help us out or put us in contact with the
> > right people?
> >
> > Here are some details about our servers:
> >
> >    - Distribution: Ubuntu 14.04 LTS
> >    - Kernel release: 4.4.0-75-generic
>
> Can you re-test with a more recent kernel such as 4.13.8?
>
> > *The experiment:*
> >
> > We use iperf3 to generate two DCTCP flows from different servers to a
> > common server, as shown in the diagram below. We measure the sending rate
> > of each flow, record the tcp_probe output, and run tcpdump on the
> > source host interfaces.
> >
> > [image: Inline image 6]
> >
> > *The problem:*
> >
> > Our rate measurements look like the one shown below; the flows often enter
> > timeouts. In this case, both flows hit a timeout at t=0.3.
> > [image: Inline image 2]
> >
> > When looking at the sequence of packets seen at the source host interfaces
> > around this timeout event this is what we see:
> >
> > *10.0.0.1 timeout event:*
> > [image: Inline image 3]
> >
> > *10.0.0.3 timeout event:*
> > [image: Inline image 4]
> >
> > In both cases, the source:
> > (1) receives an ACK for byte XYZ with the ECN flag set
> > (2) stops sending anything for RTO_min=300ms
> > (3) sends a retransmission for byte XYZ
> >
> > I have verified that this behavior is consistent across multiple experiment
> > runs. Here are the CWND samples for the 10.0.0.1 flow provided by tcp_probe
> > at the time of the timeout event:
> >
> > [image: Inline image 5]
> >
> > From what I can tell, tcp_probe logs a sample whenever a packet is
> > received. If this is true, then that means when the source receives the
> > final ECN marked ACK just before the timeout the CWND=1 MSS.
> >
> > *The conclusion:*
> >
> > We believe that there may be an issue with how the linux kernel is handling
> > the ECN echoes. For DCTCP, if the CWND is 1 MSS and the end host is still
> > receiving ECN marks then the CWND should remain at 1 MSS and should *not*
> > enter a timeout. This is because the switch can perform ECN marking very
> > aggressively causing the source end host to receive many redundant ECN
> > echoes over a short period of time.
> >
> > Another potential issue is that from the CWND plot above it looks like the
> > end host may be reacting to congestion signals more than once per window,
> > which should not happen (Section 5 of RFC 3168
> > <https://tools.ietf.org/html/rfc3168>). tcp_probe reports SRTT measurements
> > of about 400-500 us and in the plot above the CWND is reduced 6 times
> > within this amount of time.
> >
> > We have not yet tracked down the code path in the kernel code that is
> > causing the behavior described above. Perhaps this is something that you
> > can help us with? We would love to hear your thoughts on this matter and
> > are happy to try other experiments that you suggest.
> >
> > Here is a link
> > <https://drive.google.com/file/d/0Bw-GEX7h5ufiYmpCV2VpOGEtQWs/view?usp=sharing>
> > to download the packet traces if you would like to take a look.
> > han-1_host.pcap is the trace from 10.0.0.1 and han-3_host.pcap is the trace
> > from 10.0.0.3.
> >
> > Looking forward to hearing from you!
> >
> > Best,
> > -Steve


* Re: Linux ECN Handling
  2017-10-23 22:15   ` Steve Ibanez
@ 2017-10-24  1:11     ` Neal Cardwell
  2017-11-06 14:08       ` Daniel Borkmann
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2017-10-24  1:11 UTC (permalink / raw)
  To: Steve Ibanez; +Cc: Netdev, Florian Westphal, Mohammad Alizadeh

On Mon, Oct 23, 2017 at 6:15 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi All,
>
> I upgraded the kernel on all of our machines to Linux
> 4.13.8-041308-lowlatency. However, I'm still observing the same
> behavior where the source enters a timeout when the CWND=1MSS and it
> receives ECN marks.
>
> Here are the measured flow rates:
> <https://drive.google.com/file/d/0B-bt9QS-C3ONT0VXMUt6WHhKREE/view?usp=sharing>
>
> Here are snapshots of the packet traces at the sources when they both
> enter a timeout at t=1.6sec:
>
> 10.0.0.1 timeout event:
> <https://drive.google.com/file/d/0B-bt9QS-C3ONcl9WRnRPazg2ems/view?usp=sharing>
>
> 10.0.0.3 timeout event:
> <https://drive.google.com/file/d/0B-bt9QS-C3ONeDlxRjNXa0VzWm8/view?usp=sharing>
>
> Both still essentially follow the same sequence of events that I
> mentioned earlier:
> (1) receives an ACK for byte XYZ with the ECN flag set
> (2) stops sending for RTO_min=300ms
> (3) sends a retransmission for byte XYZ
>
> The cwnd samples reported by tcp_probe still indicate that the sources
> are reacting to the ECN marks more than once per window. Here are the
> cwnd samples at the same timeout event mentioned above:
> <https://drive.google.com/file/d/0B-bt9QS-C3ONdEZQdktpaW5JUm8/view?usp=sharing>
>
> Let me know if there is anything else you think I should try.

Sounds like perhaps cwnd is being set to 0 somewhere in this DCTCP
scenario. Would you be able to add printk statements in
tcp_init_cwnd_reduction(), tcp_cwnd_reduction(), and
tcp_end_cwnd_reduction(), printing the IP:port, tp->snd_cwnd, and
tp->snd_ssthresh?

Based on the output you may be able to figure out where cwnd is being
set to zero. If not, could you please post the printk output and
tcpdump traces (.pcap, headers-only is fine) from your tests?
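A small helper like the one below can sift such printk output for the minimum cwnd seen per flow; the log-line format here is purely hypothetical (whatever format the added printk actually uses would need a matching regex):

```python
import re

# Hypothetical: assumes printk lines of the form
#   "tcp_cwnd_reduction: 10.0.0.3:46274 cwnd=4 ssthresh=2"
# which is NOT a real kernel log format, just an illustration.
LINE = re.compile(r'(\w+): ([\d.]+:\d+) cwnd=(\d+) ssthresh=(\d+)')

def min_cwnd_per_flow(log_lines):
    """Return {flow: (min_cwnd, function_where_seen)} to spot cwnd -> 0."""
    out = {}
    for line in log_lines:
        m = LINE.search(line)
        if not m:
            continue
        func, flow, cwnd = m.group(1), m.group(2), int(m.group(3))
        if flow not in out or cwnd < out[flow][0]:
            out[flow] = (cwnd, func)
    return out

log = ["tcp_cwnd_reduction: 10.0.0.3:46274 cwnd=4 ssthresh=2",
       "tcp_cwnd_reduction: 10.0.0.3:46274 cwnd=0 ssthresh=2",
       "tcp_end_cwnd_reduction: 10.0.0.3:46274 cwnd=2 ssthresh=2"]
assert min_cwnd_per_flow(log)["10.0.0.3:46274"] == (0, "tcp_cwnd_reduction")
```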

thanks,
neal


* Re: Linux ECN Handling
  2017-10-24  1:11     ` Neal Cardwell
@ 2017-11-06 14:08       ` Daniel Borkmann
  2017-11-06 23:31         ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Borkmann @ 2017-11-06 14:08 UTC (permalink / raw)
  To: Neal Cardwell, Steve Ibanez
  Cc: Netdev, Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

On 10/24/2017 03:11 AM, Neal Cardwell wrote:
> On Mon, Oct 23, 2017 at 6:15 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
>> Hi All,
>>
>> I upgraded the kernel on all of our machines to Linux
>> 4.13.8-041308-lowlatency. However, I'm still observing the same
>> behavior where the source enters a timeout when the CWND=1MSS and it
>> receives ECN marks.
>>
>> Here are the measured flow rates:
>> <https://drive.google.com/file/d/0B-bt9QS-C3ONT0VXMUt6WHhKREE/view?usp=sharing>
>>
>> Here are snapshots of the packet traces at the sources when they both
>> enter a timeout at t=1.6sec:
>>
>> 10.0.0.1 timeout event:
>> <https://drive.google.com/file/d/0B-bt9QS-C3ONcl9WRnRPazg2ems/view?usp=sharing>
>>
>> 10.0.0.3 timeout event:
>> <https://drive.google.com/file/d/0B-bt9QS-C3ONeDlxRjNXa0VzWm8/view?usp=sharing>
>>
>> Both still essentially follow the same sequence of events that I
>> mentioned earlier:
>> (1) receives an ACK for byte XYZ with the ECN flag set
>> (2) stops sending for RTO_min=300ms
>> (3) sends a retransmission for byte XYZ
>>
>> The cwnd samples reported by tcp_probe still indicate that the sources
>> are reacting to the ECN marks more than once per window. Here are the
>> cwnd samples at the same timeout event mentioned above:
>> <https://drive.google.com/file/d/0B-bt9QS-C3ONdEZQdktpaW5JUm8/view?usp=sharing>
>>
>> Let me know if there is anything else you think I should try.
>
> Sounds like perhaps cwnd is being set to 0 somewhere in this DCTCP
> scenario. Would you be able to add printk statements in
> tcp_init_cwnd_reduction(), tcp_cwnd_reduction(), and
> tcp_end_cwnd_reduction(), printing the IP:port, tp->snd_cwnd, and
> tp->snd_ssthresh?
>
> Based on the output you may be able to figure out where cwnd is being
> set to zero. If not, could you please post the printk output and
> tcpdump traces (.pcap, headers-only is fine) from your tests?

Hi Steve, do you have any updates on your debugging?

> thanks,
> neal
>


* Re: Linux ECN Handling
  2017-11-06 14:08       ` Daniel Borkmann
@ 2017-11-06 23:31         ` Steve Ibanez
  2017-11-20  7:31           ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2017-11-06 23:31 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Neal Cardwell, Netdev, Florian Westphal, Mohammad Alizadeh,
	Lawrence Brakmo

Hi Daniel,

Apologies for the delay. I tried out Neal's suggestion to printk the
cwnd and ss-thresh in the tcp_init_cwnd_reduction(),
tcp_cwnd_reduction(), and tcp_end_cwnd_reduction() functions in
tcp_input.c. From what I can tell, none of these functions are setting
the cwnd to 0.

Here is the kernel log with the cwnd and ss-thresh print statements:
https://drive.google.com/open?id=1LEWIkz64NuZN3yuDpBOAXbUfJfiju55O
And here is the corresponding packet trace at this end host:
https://drive.google.com/open?id=1qf4cSW3wzsiwPngcYpZY-AoBspuqONLH
(The kernel log buffer was not large enough to capture the full
3-second experiment, so there's only about a second of data for the
log buffer and about 3 seconds for the packet trace.)

Here are plots of the cwnd and ss-thresh from each of the three functions:
    - tcp_init_cwnd_reduction:
https://drive.google.com/open?id=1KOEXG2ISJQMi9c6KyPOQ6rpVUVsQwtWU
    - tcp_cwnd_reduction:
https://drive.google.com/open?id=1awoPWC3hi4CGZt7HyuI4aAaLG1LPLwJE
    - tcp_end_cwnd_reduction:
https://drive.google.com/open?id=1G7XUSnkX8tP7Z5XdY2O97OWj6jguQHO5

Here is a plot of the measured flow rates:
https://drive.google.com/open?id=1XwmGve10J4qa1nPE3LustK8NbvhZscac

The kernel log and packet trace data was collected on the 10.0.0.3
host. The cwnd and ss-thresh plots are from the final second or so of
the experiment and they show two timeout events. In the first event,
the 10.0.0.1 host times out allowing 10.0.0.3 to increase its cwnd.
And in the second event, the 10.0.0.3 host times out causing the cwnd
to decrease from ~100 to ~10. The cwnd samples from tcp_probe (
https://drive.google.com/open?id=1QCuPspLqbGoA68MKTaAh7rx2wCv3Cr_e )
indicate that the cwnd is 1 MSS just before the timeout event, but I
don't see that in the data collected from the tcp_*_cwnd_reduction
functions.

Here is a diff of the changes that I applied to the tcp_input.c file:
https://drive.google.com/open?id=1k5x3AkfTr3tJhohSIcmQp-3g2yTVNMWm

Are there other places in the code that you would suggest I check for
how the cwnd and ss-thresh are changing?

Thanks,
-Steve


On Mon, Nov 6, 2017 at 6:08 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 10/24/2017 03:11 AM, Neal Cardwell wrote:
>>
>> On Mon, Oct 23, 2017 at 6:15 PM, Steve Ibanez <sibanez@stanford.edu>
>> wrote:
>>>
>>> Hi All,
>>>
>>> I upgraded the kernel on all of our machines to Linux
>>> 4.13.8-041308-lowlatency. However, I'm still observing the same
>>> behavior where the source enters a timeout when the CWND=1MSS and it
>>> receives ECN marks.
>>>
>>> Here are the measured flow rates:
>>>
>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONT0VXMUt6WHhKREE/view?usp=sharing>
>>>
>>> Here are snapshots of the packet traces at the sources when they both
>>> enter a timeout at t=1.6sec:
>>>
>>> 10.0.0.1 timeout event:
>>>
>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONcl9WRnRPazg2ems/view?usp=sharing>
>>>
>>> 10.0.0.3 timeout event:
>>>
>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONeDlxRjNXa0VzWm8/view?usp=sharing>
>>>
>>> Both still essentially follow the same sequence of events that I
>>> mentioned earlier:
>>> (1) receives an ACK for byte XYZ with the ECN flag set
>>> (2) stops sending for RTO_min=300ms
>>> (3) sends a retransmission for byte XYZ
>>>
>>> The cwnd samples reported by tcp_probe still indicate that the sources
>>> are reacting to the ECN marks more than once per window. Here are the
>>> cwnd samples at the same timeout event mentioned above:
>>>
>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONdEZQdktpaW5JUm8/view?usp=sharing>
>>>
>>> Let me know if there is anything else you think I should try.
>>
>>
>> Sounds like perhaps cwnd is being set to 0 somewhere in this DCTCP
>> scenario. Would you be able to add printk statements in
>> tcp_init_cwnd_reduction(), tcp_cwnd_reduction(), and
>> tcp_end_cwnd_reduction(), printing the IP:port, tp->snd_cwnd, and
>> tp->snd_ssthresh?
>>
>> Based on the output you may be able to figure out where cwnd is being
>> set to zero. If not, could you please post the printk output and
>> tcpdump traces (.pcap, headers-only is fine) from your tests?
>
>
> Hi Steve, do you have any updates on your debugging?
>
>> thanks,
>> neal
>>
>


* Re: Linux ECN Handling
  2017-11-06 23:31         ` Steve Ibanez
@ 2017-11-20  7:31           ` Steve Ibanez
  2017-11-20 15:05             ` Neal Cardwell
       [not found]             ` <CADVnQy=q4qBpe0Ymo8dtFTYU_0x0q_XKE+ZvazLQE-ULwu7pQA@mail.gmail.com>
  0 siblings, 2 replies; 32+ messages in thread
From: Steve Ibanez @ 2017-11-20  7:31 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Neal Cardwell, Netdev, Florian Westphal, Mohammad Alizadeh,
	Lawrence Brakmo

Hi Folks,

I wanted to check back in on this for another update and to solicit
some more suggestions. I did a bit more digging to try to isolate the
problem.

As I explained earlier, the log generated by tcp_probe indicates that
the snd_cwnd is set to 1 just before the end host receives an ECN
marked ACK and unexpectedly enters a timeout (
https://drive.google.com/open?id=1iyt8PvBxQga2jpRpBJ8KdQw3Q_mPTzZF ).
I was trying to track down where this is happening, but the only place
I could find that might be setting the snd_cwnd to 1 is in the
tcp_enter_loss() function. I inserted a printk() call in this function
to see when it is being invoked and it looks like it is only called by
the tcp_retransmit_timer() function after the timer expires.

I decided to try recording the snd_cwnd, ss-thresh, and icsk_ca_state
inside the tcp_fastretrans_alert() function whenever it processes an
ECN marked ACK (
https://drive.google.com/open?id=17GD77lb9lkCSu0_s9p40GZ5r4EU8B4VB )
This plot also shows when the tcp_retransmit_timer() and
tcp_enter_loss() functions are invoked (red and purple dots
respectively). And I see that the ACK state machine is always either
in the TCP_CA_Open or TCP_CA_CWR state whenever the
tcp_fastretrans_alert() function processes ECN marked ACKs (
https://drive.google.com/open?id=1xwuPxjgwriT9DSblFx2uILfQ95Fy-Eqq ).
So I'm not sure where the snd_cwnd is being set to 1 (or possibly 0 as
Neal suggested) just before entering a timeout. Any suggestions here?

In order to do a bit of profiling of the tcp_dctcp code I added
support into tcp_probe for recording the dctcp alpha parameter. I see
that alpha oscillates around about 0.1 when the flow rates have
converged, it goes to zero when the other host enters a timeout, and I
don't see any unexpected behavior just before the timeout (
https://drive.google.com/open?id=1zPdyS57TrUYZIekbid9p1UNyraLYrdw7 ).
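For reference, the alpha being logged here is DCTCP's moving average of the fraction of CE-marked bytes per window; Linux's tcp_dctcp.c implements it in fixed point (alpha scaled by 1024, g = 1/16 by default). A minimal floating-point model of the update, with made-up inputs, looks like:

```python
# Minimal model of DCTCP's per-window alpha update and cwnd cut
# (per the DCTCP paper / RFC 8257); this is an illustration, not
# the kernel's fixed-point implementation in tcp_dctcp.c.

G = 1.0 / 16  # EWMA gain, matching the Linux default dctcp_shift_g = 4

def dctcp_update(alpha, marked_bytes, acked_bytes):
    """EWMA of the fraction F of CE-marked bytes over the last window."""
    f = marked_bytes / acked_bytes
    return (1 - G) * alpha + G * f

def dctcp_ssthresh(cwnd, alpha):
    """Once per window, cwnd is cut in proportion to alpha, not halved."""
    return max(int(cwnd * (1 - alpha / 2)), 2)

# With ~10% of bytes marked every window, alpha converges near 0.1,
# matching the steady-state value observed above:
alpha = 0.0
for _ in range(200):
    alpha = dctcp_update(alpha, marked_bytes=10, acked_bytes=100)
assert abs(alpha - 0.1) < 0.001
assert dctcp_ssthresh(100, alpha) == 95  # a gentle ~5% cut, not 50%
```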

So I haven't had much luck yet trying to track down where the problem
is. If you have any suggestions that would help me to focus my search
efforts, I would appreciate the comments.

Thanks!
-Steve


On Mon, Nov 6, 2017 at 3:31 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi Daniel,
>
> Apologies for the delay. I tried out Neal's suggestion to printk the
> cwnd and ss-thresh in the tcp_init_cwnd_reduction(),
> tcp_cwnd_reduction(), and tcp_end_cwnd_reduction() functions in
> tcp_input.c. From what I can tell, none of these functions are setting
> the cwnd to 0.
>
> Here is the kernel log with the cwnd and ss-thresh print statements:
> https://drive.google.com/open?id=1LEWIkz64NuZN3yuDpBOAXbUfJfiju55O
> And here is the corresponding packet trace at this end host:
> https://drive.google.com/open?id=1qf4cSW3wzsiwPngcYpZY-AoBspuqONLH
> (The kernel log buffer was not large enough to capture the full
> 3-second experiment, so there's only about a second of data for the
> log buffer and about 3 seconds for the packet trace.)
>
> Here are plots of the cwnd and ss-thresh from each of the three functions:
>     - tcp_init_cwnd_reduction:
> https://drive.google.com/open?id=1KOEXG2ISJQMi9c6KyPOQ6rpVUVsQwtWU
>     - tcp_cwnd_reduction:
> https://drive.google.com/open?id=1awoPWC3hi4CGZt7HyuI4aAaLG1LPLwJE
>     - tcp_end_cwnd_reduction:
> https://drive.google.com/open?id=1G7XUSnkX8tP7Z5XdY2O97OWj6jguQHO5
>
> Here is a plot of the measured flow rates:
> https://drive.google.com/open?id=1XwmGve10J4qa1nPE3LustK8NbvhZscac
>
> The kernel log and packet trace data was collected on the 10.0.0.3
> host. The cwnd and ss-thresh plots are from the final second or so of
> the experiment and they show two timeout events. In the first event,
> the 10.0.0.1 host times out allowing 10.0.0.3 to increase its cwnd.
> And in the second event, the 10.0.0.3 host times out causing the cwnd
> to decrease from ~100 to ~10. The cwnd samples from tcp_probe (
> https://drive.google.com/open?id=1QCuPspLqbGoA68MKTaAh7rx2wCv3Cr_e )
> indicate that the cwnd is 1 MSS just before the timeout event, but I
> don't see that in the data collected from the tcp_*_cwnd_reduction
> functions.
>
> Here is a diff of the changes that I applied to the tcp_input.c file:
> https://drive.google.com/open?id=1k5x3AkfTr3tJhohSIcmQp-3g2yTVNMWm
>
> Are there other places in the code that you would suggest I check for
> how the cwnd and ss-thresh are changing?
>
> Thanks,
> -Steve
>
>
> On Mon, Nov 6, 2017 at 6:08 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 10/24/2017 03:11 AM, Neal Cardwell wrote:
>>>
>>> On Mon, Oct 23, 2017 at 6:15 PM, Steve Ibanez <sibanez@stanford.edu>
>>> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I upgraded the kernel on all of our machines to Linux
>>>> 4.13.8-041308-lowlatency. However, I'm still observing the same
>>>> behavior where the source enters a timeout when the CWND=1MSS and it
>>>> receives ECN marks.
>>>>
>>>> Here are the measured flow rates:
>>>>
>>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONT0VXMUt6WHhKREE/view?usp=sharing>
>>>>
>>>> Here are snapshots of the packet traces at the sources when they both
>>>> enter a timeout at t=1.6sec:
>>>>
>>>> 10.0.0.1 timeout event:
>>>>
>>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONcl9WRnRPazg2ems/view?usp=sharing>
>>>>
>>>> 10.0.0.3 timeout event:
>>>>
>>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONeDlxRjNXa0VzWm8/view?usp=sharing>
>>>>
>>>> Both still essentially follow the same sequence of events that I
>>>> mentioned earlier:
>>>> (1) receives an ACK for byte XYZ with the ECN flag set
>>>> (2) stops sending for RTO_min=300ms
>>>> (3) sends a retransmission for byte XYZ
>>>>
>>>> The cwnd samples reported by tcp_probe still indicate that the sources
>>>> are reacting to the ECN marks more than once per window. Here are the
>>>> cwnd samples at the same timeout event mentioned above:
>>>>
>>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONdEZQdktpaW5JUm8/view?usp=sharing>
>>>>
>>>> Let me know if there is anything else you think I should try.
>>>
>>>
>>> Sounds like perhaps cwnd is being set to 0 somewhere in this DCTCP
>>> scenario. Would you be able to add printk statements in
>>> tcp_init_cwnd_reduction(), tcp_cwnd_reduction(), and
>>> tcp_end_cwnd_reduction(), printing the IP:port, tp->snd_cwnd, and
>>> tp->snd_ssthresh?
>>>
>>> Based on the output you may be able to figure out where cwnd is being
>>> set to zero. If not, could you please post the printk output and
>>> tcpdump traces (.pcap, headers-only is fine) from your tests?
>>
>>
>> Hi Steve, do you have any updates on your debugging?
>>
>>> thanks,
>>> neal
>>>
>>


* Re: Linux ECN Handling
  2017-11-20  7:31           ` Steve Ibanez
@ 2017-11-20 15:05             ` Neal Cardwell
       [not found]             ` <CADVnQy=q4qBpe0Ymo8dtFTYU_0x0q_XKE+ZvazLQE-ULwu7pQA@mail.gmail.com>
  1 sibling, 0 replies; 32+ messages in thread
From: Neal Cardwell @ 2017-11-20 15:05 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Daniel Borkmann, Netdev, Florian Westphal, Mohammad Alizadeh,
	Lawrence Brakmo


On Mon, Nov 20, 2017 at 2:31 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi Folks,
>
> I wanted to check back in on this for another update and to solicit
> some more suggestions. I did a bit more digging to try to isolate the
> problem.

Going back to one of your Oct 19 trace snapshots (attached), AFAICT at
the time of the timeout there is actually almost 64 KBytes (352553398
+ 1448 - 352489686 = 65160) of unacknowledged data. So there really
does seem to be a significant chunk of packets that were in-flight
that were then declared lost.
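The arithmetic can be checked directly from the sequence numbers in the trace:

```python
# Bytes outstanding = (highest seq sent + its payload) - highest ACK seen,
# using the sequence numbers quoted from the Oct 19 trace snapshot.
snd_nxt_seq, last_payload, snd_una = 352553398, 1448, 352489686
unacked = snd_nxt_seq + last_payload - snd_una
assert unacked == 65160  # i.e. 45 full-size 1448-byte segments in flight
```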

So here is a possibility: perhaps the combination of CWR+PRR plus
tcp_tso_should_defer() means that PRR can make cwnd so gentle that
tcp_tso_should_defer() thinks we should wait for another ACK to send,
and that ACK doesn't come. Breaking it, down, the potential sequence
would be:

(1) tcp_write_xmit() does not send, because the CWR behavior, using
PRR, does not leave enough cwnd for tcp_tso_should_defer() to think we
should send (PRR was originally designed for recovery, which did not
have TSO deferral)

(2) TLP does not fire, because we are in state CWR, not Open

(3) The only remaining option is an RTO, which fires.

In other words, the possibility is that, at the time of the stall, the
cwnd is reasonably high, but tcp_packets_in_flight() is also quite
high, so either there is (a) literally no unused cwnd left (
tcp_packets_in_flight() == cwnd), or (b) some mechanism like
tcp_tso_should_defer() is deciding that there is not enough available
cwnd for it to make sense to chop off a fraction of a TSO skb to send
now.

One way to test that conjecture would be to disable
tcp_tso_should_defer() by adding a:

   goto send_now;

at the top of tcp_tso_should_defer().

If that doesn't prevent the freezes then I would recommend adding
printks or other instrumentation to  tcp_write_xmit() to log:

- time
- ca_state
- cwnd
- ssthresh
- tcp_packets_in_flight()
- the reason for breaking out of the tcp_write_xmit() loop (tso
deferral, no packets left, tcp_snd_wnd_test, tcp_nagle_test, etc)

cheers,
neal

[-- Attachment #2: han-3_timeout-event.png --]
[-- Type: image/png, Size: 91597 bytes --]


* Re: Linux ECN Handling
       [not found]             ` <CADVnQy=q4qBpe0Ymo8dtFTYU_0x0q_XKE+ZvazLQE-ULwu7pQA@mail.gmail.com>
@ 2017-11-20 15:40               ` Eric Dumazet
  2017-11-21  5:58               ` Steve Ibanez
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2017-11-20 15:40 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Steve Ibanez, Daniel Borkmann, Netdev, Florian Westphal,
	Mohammad Alizadeh, Lawrence Brakmo, Yuchung Cheng

On Mon, Nov 20, 2017 at 7:01 AM, Neal Cardwell <ncardwell@google.com> wrote:
> Going back to one of your Oct 19 trace snapshots (attached), AFAICT at the
> time of the timeout there is actually almost 64KBytes  (352553398 + 1448 -
> 352489686 = 65160) of unacknowledged data. So there really does seem to be a
> significant chunk of packets that were in-flight that were then declared
> lost.
>
> So here is a possibility: perhaps the combination of CWR+PRR plus
> tcp_tso_should_defer() means that PRR can make cwnd so gentle that
> tcp_tso_should_defer() thinks we should wait for another ACK to send, and
> that ACK doesn't come. Breaking it down, the potential sequence would be:
>
> (1) tcp_write_xmit() does not send, because the CWR behavior, using PRR,
> does not leave enough cwnd for tcp_tso_should_defer() to think we should
> send (PRR was originally designed for recovery, which did not have TSO
> deferral)
>
> (2) TLP does not fire, because we are in state CWR, not Open
>
> (3) The only remaining option is an RTO, which fires.
>
> In other words, the possibility is that, at the time of the stall, the cwnd
> is reasonably high, but tcp_packets_in_flight() is also quite high, so
> either there is (a) literally no unused cwnd left ( tcp_packets_in_flight()
> == cwnd), or (b) some mechanism like tcp_tso_should_defer() is deciding that
> there is not enough available cwnd for it to make sense to chop off a
> fraction of a TSO skb to send now.
>
> One way to test that conjecture would be to disable tcp_tso_should_defer()
> by adding a:
>
>    goto send_now;
>
> at the top of tcp_tso_should_defer().
>
> If that doesn't prevent the freezes then I would recommend adding printks or
> other instrumentation to  tcp_write_xmit() to log:
>
> - time
> - ca_state
> - cwnd
> - ssthresh
> - tcp_packets_in_flight()
> - the reason for breaking out of the tcp_write_xmit() loop (tso deferral, no
> packets left, tcp_snd_wnd_test, tcp_nagle_test, etc)
>
> cheers,
> neal
>
>
>
> On Mon, Nov 20, 2017 at 2:31 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
>>
>> Hi Folks,
>>
>> I wanted to check back in on this for another update and to solicit
>> some more suggestions. I did a bit more digging to try to isolate the
>> problem.
>>
>> As I explained earlier, the log generated by tcp_probe indicates that
>> the snd_cwnd is set to 1 just before the end host receives an ECN
>> marked ACK and unexpectedly enters a timeout (
>> https://drive.google.com/open?id=1iyt8PvBxQga2jpRpBJ8KdQw3Q_mPTzZF ).
>> I was trying to track down where this is happening, but the only place
>> I could find that might be setting the snd_cwnd to 1 is in the
>> tcp_enter_loss() function. I inserted a printk() call in this function
>> to see when it is being invoked and it looks like it is only called by
>> the tcp_retransmit_timer() function after the timer expires.
>>
>> I decided to try recording the snd_cwnd, ss-thresh, and icsk_ca_state
>> inside the tcp_fastretrans_alert() function whenever it processes an
>> ECN marked ACK (
>> https://drive.google.com/open?id=17GD77lb9lkCSu0_s9p40GZ5r4EU8B4VB )
>> This plot also shows when the tcp_retransmit_timer() and
>> tcp_enter_loss() functions are invoked (red and purple dots
>> respectively). And I see that the ACK state machine is always either
>> in the TCP_CA_Open or TCP_CA_CWR state whenever the
>> tcp_fastretrans_alert() function processes ECN marked ACKs (
>> https://drive.google.com/open?id=1xwuPxjgwriT9DSblFx2uILfQ95Fy-Eqq ).
>> So I'm not sure where the snd_cwnd is being set to 1 (or possibly 0 as
>> Neal suggested) just before entering a timeout. Any suggestions here?
>>
>> To do a bit of profiling of the tcp_dctcp code I added
>> support to tcp_probe for recording the dctcp alpha parameter. I see
>> that alpha oscillates around 0.1 when the flow rates have
>> converged, it goes to zero when the other host enters a timeout, and I
>> don't see any unexpected behavior just before the timeout (
>> https://drive.google.com/open?id=1zPdyS57TrUYZIekbid9p1UNyraLYrdw7 ).
>>
>> So I haven't had much luck yet trying to track down where the problem
>> is. If you have any suggestions that would help me to focus my search
>> efforts, I would appreciate the comments.
>>
>> Thanks!
>> -Steve

Steve, what HZ value is your kernel compiled with?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
       [not found]             ` <CADVnQy=q4qBpe0Ymo8dtFTYU_0x0q_XKE+ZvazLQE-ULwu7pQA@mail.gmail.com>
  2017-11-20 15:40               ` Eric Dumazet
@ 2017-11-21  5:58               ` Steve Ibanez
  2017-11-21 15:01                 ` Neal Cardwell
  1 sibling, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2017-11-21  5:58 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Daniel Borkmann, Netdev, Florian Westphal, Mohammad Alizadeh,
	Lawrence Brakmo, Yuchung Cheng, Eric Dumazet

Hi Neal,

I tried your suggestion to disable tcp_tso_should_defer() and it does
indeed look like it is preventing the host from entering timeouts.
I'll have to do a bit more digging to try and find where the packets
are being dropped. I've verified that the bottleneck link queue
occupancy is at about the configured marking threshold when the timeout
occurs, so the drops may be happening at the NIC interfaces or perhaps
somewhere unexpected in the switch.

I wonder if you can explain why the TLP doesn't fire when in the CWR
state? It seems like that might be worth having for cases like this.

Btw, thank you very much for all the help! It is greatly appreciated :)

Best,
-Steve


On Mon, Nov 20, 2017 at 7:01 AM, Neal Cardwell <ncardwell@google.com> wrote:
> Going back to one of your Oct 19 trace snapshots (attached), AFAICT at the
> time of the timeout there is actually almost 64KBytes  (352553398 + 1448 -
> 352489686 = 65160) of unacknowledged data. So there really does seem to be a
> significant chunk of packets that were in-flight that were then declared
> lost.
>
> So here is a possibility: perhaps the combination of CWR+PRR plus
> tcp_tso_should_defer() means that PRR can make cwnd so gentle that
> tcp_tso_should_defer() thinks we should wait for another ACK to send, and
> that ACK doesn't come. Breaking it down, the potential sequence would be:
>
> (1) tcp_write_xmit() does not send, because the CWR behavior, using PRR,
> does not leave enough cwnd for tcp_tso_should_defer() to think we should
> send (PRR was originally designed for recovery, which did not have TSO
> deferral)
>
> (2) TLP does not fire, because we are in state CWR, not Open
>
> (3) The only remaining option is an RTO, which fires.
>
> In other words, the possibility is that, at the time of the stall, the cwnd
> is reasonably high, but tcp_packets_in_flight() is also quite high, so
> either there is (a) literally no unused cwnd left ( tcp_packets_in_flight()
> == cwnd), or (b) some mechanism like tcp_tso_should_defer() is deciding that
> there is not enough available cwnd for it to make sense to chop off a
> fraction of a TSO skb to send now.
>
> One way to test that conjecture would be to disable tcp_tso_should_defer()
> by adding a:
>
>    goto send_now;
>
> at the top of tcp_tso_should_defer().
>
> If that doesn't prevent the freezes then I would recommend adding printks or
> other instrumentation to  tcp_write_xmit() to log:
>
> - time
> - ca_state
> - cwnd
> - ssthresh
> - tcp_packets_in_flight()
> - the reason for breaking out of the tcp_write_xmit() loop (tso deferral, no
> packets left, tcp_snd_wnd_test, tcp_nagle_test, etc)
>
> cheers,
> neal
>
>
>
> On Mon, Nov 20, 2017 at 2:31 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
>>
>> Hi Folks,
>>
>> I wanted to check back in on this for another update and to solicit
>> some more suggestions. I did a bit more digging to try and isolate the
>> problem.
>>
>> As I explained earlier, the log generated by tcp_probe indicates that
>> the snd_cwnd is set to 1 just before the end host receives an ECN
>> marked ACK and unexpectedly enters a timeout (
>> https://drive.google.com/open?id=1iyt8PvBxQga2jpRpBJ8KdQw3Q_mPTzZF ).
>> I was trying to track down where this is happening, but the only place
>> I could find that might be setting the snd_cwnd to 1 is in the
>> tcp_enter_loss() function. I inserted a printk() call in this function
>> to see when it is being invoked and it looks like it is only called by
>> the tcp_retransmit_timer() function after the timer expires.
>>
>> I decided to try recording the snd_cwnd, ss-thresh, and icsk_ca_state
>> inside the tcp_fastretrans_alert() function whenever it processes an
>> ECN marked ACK (
>> https://drive.google.com/open?id=17GD77lb9lkCSu0_s9p40GZ5r4EU8B4VB )
>> This plot also shows when the tcp_retransmit_timer() and
>> tcp_enter_loss() functions are invoked (red and purple dots
>> respectively). And I see that the ACK state machine is always either
>> in the TCP_CA_Open or TCP_CA_CWR state whenever the
>> tcp_fastretrans_alert() function processes ECN marked ACKs (
>> https://drive.google.com/open?id=1xwuPxjgwriT9DSblFx2uILfQ95Fy-Eqq ).
>> So I'm not sure where the snd_cwnd is being set to 1 (or possibly 0 as
>> Neal suggested) just before entering a timeout. Any suggestions here?
>>
>> To do a bit of profiling of the tcp_dctcp code I added
>> support to tcp_probe for recording the dctcp alpha parameter. I see
>> that alpha oscillates around 0.1 when the flow rates have
>> converged, it goes to zero when the other host enters a timeout, and I
>> don't see any unexpected behavior just before the timeout (
>> https://drive.google.com/open?id=1zPdyS57TrUYZIekbid9p1UNyraLYrdw7 ).
>>
>> So I haven't had much luck yet trying to track down where the problem
>> is. If you have any suggestions that would help me to focus my search
>> efforts, I would appreciate the comments.
>>
>> Thanks!
>> -Steve
>>
>>
>> On Mon, Nov 6, 2017 at 3:31 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
>> > Hi Daniel,
>> >
>> > Apologies for the delay. I tried out Neal's suggestion to printk the
>> > cwnd and ss-thresh in the tcp_init_cwnd_reduction(),
>> > tcp_cwnd_reduction(), and tcp_end_cwnd_reduction() functions in
>> > tcp_input.c. From what I can tell, none of these functions are setting
>> > the cwnd to 0.
>> >
>> > Here is the kernel log with the cwnd and ss-thresh print statements:
>> > https://drive.google.com/open?id=1LEWIkz64NuZN3yuDpBOAXbUfJfiju55O
>> > And here is the corresponding packet trace at this end host:
>> > https://drive.google.com/open?id=1qf4cSW3wzsiwPngcYpZY-AoBspuqONLH
>> > (The kernel log buffer was not large enough to capture the full
>> > 3-second experiment, so there's only about a second of data for the
>> > log buffer and about 3 seconds for the packet trace.)
>> >
>> > Here is a plot of the cwnd and ss-thresh from each of the three
>> > functions:
>> >     - tcp_init_cwnd_reduction:
>> > https://drive.google.com/open?id=1KOEXG2ISJQMi9c6KyPOQ6rpVUVsQwtWU
>> >     - tcp_cwnd_reduction:
>> > https://drive.google.com/open?id=1awoPWC3hi4CGZt7HyuI4aAaLG1LPLwJE
>> >     - tcp_end_cwnd_reduction:
>> > https://drive.google.com/open?id=1G7XUSnkX8tP7Z5XdY2O97OWj6jguQHO5
>> >
>> > Here is a plot of the measured flow rates:
>> > https://drive.google.com/open?id=1XwmGve10J4qa1nPE3LustK8NbvhZscac
>> >
>> > The kernel log and packet trace data was collected on the 10.0.0.3
>> > host. The cwnd and ss-thresh plots are from the final second or so of
>> > the experiment and they show two timeout events. In the first event,
>> > the 10.0.0.1 host times out, allowing 10.0.0.3 to increase its cwnd.
>> > And in the second event, the 10.0.0.3 host times out causing the cwnd
>> > to decrease from ~100 to about ~10. The cwnd samples from tcp_probe (
>> > https://drive.google.com/open?id=1QCuPspLqbGoA68MKTaAh7rx2wCv3Cr_e )
>> > indicate that the cwnd is 1 MSS just before the timeout event, but I
>> > don't see that in the data collected from the tcp_*_cwnd_reduction
>> > functions.
>> >
>> > Here is a diff of the changes that I applied to the tcp_input.c file:
>> > https://drive.google.com/open?id=1k5x3AkfTr3tJhohSIcmQp-3g2yTVNMWm
>> >
>> > Are there other places in the code that you would suggest I check for
>> > how the cwnd and ss-thresh are changing?
>> >
>> > Thanks,
>> > -Steve
>> >
>> >
>> > On Mon, Nov 6, 2017 at 6:08 AM, Daniel Borkmann <daniel@iogearbox.net>
>> > wrote:
>> >> On 10/24/2017 03:11 AM, Neal Cardwell wrote:
>> >>>
>> >>> On Mon, Oct 23, 2017 at 6:15 PM, Steve Ibanez <sibanez@stanford.edu>
>> >>> wrote:
>> >>>>
>> >>>> Hi All,
>> >>>>
>> >>>> I upgraded the kernel on all of our machines to Linux
>> >>>> 4.13.8-041308-lowlatency. However, I'm still observing the same
>> >>>> behavior where the source enters a timeout when the CWND=1MSS and it
>> >>>> receives ECN marks.
>> >>>>
>> >>>> Here are the measured flow rates:
>> >>>>
>> >>>>
>> >>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONT0VXMUt6WHhKREE/view?usp=sharing>
>> >>>>
>> >>>> Here are snapshots of the packet traces at the sources when they both
>> >>>> enter a timeout at t=1.6sec:
>> >>>>
>> >>>> 10.0.0.1 timeout event:
>> >>>>
>> >>>>
>> >>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONcl9WRnRPazg2ems/view?usp=sharing>
>> >>>>
>> >>>> 10.0.0.3 timeout event:
>> >>>>
>> >>>>
>> >>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONeDlxRjNXa0VzWm8/view?usp=sharing>
>> >>>>
>> >>>> Both still essentially follow the same sequence of events that I
>> >>>> mentioned earlier:
>> >>>> (1) receives an ACK for byte XYZ with the ECN flag set
>> >>>> (2) stops sending for RTO_min=300ms
>> >>>> (3) sends a retransmission for byte XYZ
>> >>>>
>> >>>> The cwnd samples reported by tcp_probe still indicate that the
>> >>>> sources
>> >>>> are reacting to the ECN marks more than once per window. Here are the
>> >>>> cwnd samples at the same timeout event mentioned above:
>> >>>>
>> >>>>
>> >>>> <https://drive.google.com/file/d/0B-bt9QS-C3ONdEZQdktpaW5JUm8/view?usp=sharing>
>> >>>>
>> >>>> Let me know if there is anything else you think I should try.
>> >>>
>> >>>
>> >>> Sounds like perhaps cwnd is being set to 0 somewhere in this DCTCP
>> >>> scenario. Would you be able to add printk statements in
>> >>> tcp_init_cwnd_reduction(), tcp_cwnd_reduction(), and
>> >>> tcp_end_cwnd_reduction(), printing the IP:port, tp->snd_cwnd, and
>> >>> tp->snd_ssthresh?
>> >>>
>> >>> Based on the output you may be able to figure out where cwnd is being
>> >>> set to zero. If not, could you please post the printk output and
>> >>> tcpdump traces (.pcap, headers-only is fine) from your tests?
>> >>
>> >>
>> >> Hi Steve, do you have any updates on your debugging?
>> >>
>> >>> thanks,
>> >>> neal
>> >>>
>> >>
>>
>


* Re: Linux ECN Handling
  2017-11-21  5:58               ` Steve Ibanez
@ 2017-11-21 15:01                 ` Neal Cardwell
  2017-11-21 15:51                   ` Yuchung Cheng
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2017-11-21 15:01 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Daniel Borkmann, Netdev, Florian Westphal, Mohammad Alizadeh,
	Lawrence Brakmo, Yuchung Cheng, Eric Dumazet

On Tue, Nov 21, 2017 at 12:58 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi Neal,
>
> I tried your suggestion to disable tcp_tso_should_defer() and it does
> indeed look like it is preventing the host from entering timeouts.
> I'll have to do a bit more digging to try and find where the packets
> are being dropped. I've verified that the bottleneck link queue
> occupancy is at about the configured marking threshold when the timeout
> occurs, so the drops may be happening at the NIC interfaces or perhaps
> somewhere unexpected in the switch.

Great! Thanks for running that test.

> I wonder if you can explain why the TLP doesn't fire when in the CWR
> state? It seems like that might be worth having for cases like this.

The original motivation for only allowing TLP in the CA_Open state was
to be conservative and avoid having the TLP impose extra load on the
bottleneck when it may be congested. Plus if there are any SACKed
packets in the SACK scoreboard then there are other existing
mechanisms to do speedy loss recovery.

But at various times we have talked about expanding the set of
scenarios where TLP is used. And I think this example demonstrates
that there is a class of real-world cases where it probably makes
sense to allow TLP in the CWR state.

If you have time, would you be able to check if leaving
tcp_tso_should_defer() as-is but enabling TLP probes in the CWR state
also fixes your performance issue? Perhaps something like
(uncompiled/untested):

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4ea79b2ad82e..deccf8070f84 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2536,11 +2536,11 @@ bool tcp_schedule_loss_probe(struct sock *sk,
bool advancing_rto)

        early_retrans = sock_net(sk)->ipv4.sysctl_tcp_early_retrans;
        /* Schedule a loss probe in 2*RTT for SACK capable connections
-        * in Open state, that are either limited by cwnd or application.
+        * not in loss recovery, that are either limited by cwnd or application.
         */
        if ((early_retrans != 3 && early_retrans != 4) ||
            !tp->packets_out || !tcp_is_sack(tp) ||
-           icsk->icsk_ca_state != TCP_CA_Open)
+           icsk->icsk_ca_state >= TCP_CA_Recovery)
                return false;

        if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&

> Btw, thank you very much for all the help! It is greatly appreciated :)

You are very welcome! :-)

cheers,
neal


* Re: Linux ECN Handling
  2017-11-21 15:01                 ` Neal Cardwell
@ 2017-11-21 15:51                   ` Yuchung Cheng
  2017-11-21 16:20                     ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Yuchung Cheng @ 2017-11-21 15:51 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Steve Ibanez, Daniel Borkmann, Netdev, Florian Westphal,
	Mohammad Alizadeh, Lawrence Brakmo, Eric Dumazet

On Tue, Nov 21, 2017 at 7:01 AM, Neal Cardwell <ncardwell@google.com> wrote:
>
> On Tue, Nov 21, 2017 at 12:58 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
> > Hi Neal,
> >
> > I tried your suggestion to disable tcp_tso_should_defer() and it does
> > indeed look like it is preventing the host from entering timeouts.
> > I'll have to do a bit more digging to try and find where the packets
> > are being dropped. I've verified that the bottleneck link queue
> > occupancy is at about the configured marking threshold when the timeout
> > occurs, so the drops may be happening at the NIC interfaces or perhaps
> > somewhere unexpected in the switch.
>
> Great! Thanks for running that test.
>
> > I wonder if you can explain why the TLP doesn't fire when in the CWR
> > state? It seems like that might be worth having for cases like this.
>
> The original motivation for only allowing TLP in the CA_Open state was
> to be conservative and avoid having the TLP impose extra load on the
> bottleneck when it may be congested. Plus if there are any SACKed
> packets in the SACK scoreboard then there are other existing
> mechanisms to do speedy loss recovery.
Neal, I like your idea of covering more states in TLP. But shouldn't we
also fix the TSO deferral logic to work better with PRR in the CWR state,
because it's a general transmission issue?


>
> But at various times we have talked about expanding the set of
> scenarios where TLP is used. And I think this example demonstrates
> that there is a class of real-world cases where it probably makes
> sense to allow TLP in the CWR state.
>
> If you have time, would you be able to check if leaving
> tcp_tso_should_defer() as-is but enabling TLP probes in the CWR state
> also fixes your performance issue? Perhaps something like
> (uncompiled/untested):
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 4ea79b2ad82e..deccf8070f84 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2536,11 +2536,11 @@ bool tcp_schedule_loss_probe(struct sock *sk,
> bool advancing_rto)
>
>         early_retrans = sock_net(sk)->ipv4.sysctl_tcp_early_retrans;
>         /* Schedule a loss probe in 2*RTT for SACK capable connections
> -        * in Open state, that are either limited by cwnd or application.
> +        * not in loss recovery, that are either limited by cwnd or application.
>          */
>         if ((early_retrans != 3 && early_retrans != 4) ||
>             !tp->packets_out || !tcp_is_sack(tp) ||
> -           icsk->icsk_ca_state != TCP_CA_Open)
> +           icsk->icsk_ca_state >= TCP_CA_Recovery)
>                 return false;
>
>         if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
>
> > Btw, thank you very much for all the help! It is greatly appreciated :)
>
> You are very welcome! :-)
>
> cheers,
> neal


* Re: Linux ECN Handling
  2017-11-21 15:51                   ` Yuchung Cheng
@ 2017-11-21 16:20                     ` Neal Cardwell
  2017-11-21 16:52                       ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2017-11-21 16:20 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Steve Ibanez, Daniel Borkmann, Netdev, Florian Westphal,
	Mohammad Alizadeh, Lawrence Brakmo, Eric Dumazet

On Tue, Nov 21, 2017 at 10:51 AM, Yuchung Cheng <ycheng@google.com> wrote:
> On Tue, Nov 21, 2017 at 7:01 AM, Neal Cardwell <ncardwell@google.com> wrote:
>>
>> The original motivation for only allowing TLP in the CA_Open state was
>> to be conservative and avoid having the TLP impose extra load on the
>> bottleneck when it may be congested. Plus if there are any SACKed
>> packets in the SACK scoreboard then there are other existing
>> mechanisms to do speedy loss recovery.
> Neal, I like your idea of covering more states in TLP. But shouldn't we
> also fix the TSO deferral logic to work better with PRR in the CWR state,
> because it's a general transmission issue?

Yes, I agree it's also worthwhile to see if we can make PRR and TSO
deferral play well together. Sorry, I should have been more clear
about that.

neal


* Re: Linux ECN Handling
  2017-11-21 16:20                     ` Neal Cardwell
@ 2017-11-21 16:52                       ` Eric Dumazet
  2017-11-22  3:02                         ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2017-11-21 16:52 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Yuchung Cheng, Steve Ibanez, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

On Tue, Nov 21, 2017 at 8:20 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Tue, Nov 21, 2017 at 10:51 AM, Yuchung Cheng <ycheng@google.com> wrote:
>> On Tue, Nov 21, 2017 at 7:01 AM, Neal Cardwell <ncardwell@google.com> wrote:
>>>
>>> The original motivation for only allowing TLP in the CA_Open state was
>>> to be conservative and avoid having the TLP impose extra load on the
>>> bottleneck when it may be congested. Plus if there are any SACKed
>>> packets in the SACK scoreboard then there are other existing
>>> mechanisms to do speedy loss recovery.
>> Neal I like your idea of covering more states in TLP. but shouldn't we
>> also fix the tso_deferral_logic to work better w/ PRR in CWR state, b/c
>> it's a general transmission issue.
>
> Yes, I agree it's also worthwhile to see if we can make PRR and TSO
> deferral play well together. Sorry, I should have been more clear
> about that.

Yes, but TSO auto-defer is a heuristic, and since we do not have a
timer to 'send the partial packet' once we realize that the ACK we
were waiting for is not going to arrive in time, we know that the
heuristic is not perfect.

Adding a timer (and its overhead) for maybe a fraction of cases might
be overkill.

'Fixing' TSO autodefer has been on our plates forever; we played some
games that proved to be too expensive.

Although I have not played with re-using the new hrtimer we added for TCP pacing.


* Re: Linux ECN Handling
  2017-11-21 16:52                       ` Eric Dumazet
@ 2017-11-22  3:02                         ` Steve Ibanez
  2017-11-22  3:46                           ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2017-11-22  3:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

Hi Neal,

I just tried out your fix for enabling TLPs in the CWR state (while
leaving tcp_tso_should_defer() unchanged), but I'm still seeing the
host enter long timeouts. Feel free to let me know if there is
something else you'd like me to try.

Best,
-Steve

On Tue, Nov 21, 2017 at 8:52 AM, Eric Dumazet <edumazet@google.com> wrote:
> On Tue, Nov 21, 2017 at 8:20 AM, Neal Cardwell <ncardwell@google.com> wrote:
>> On Tue, Nov 21, 2017 at 10:51 AM, Yuchung Cheng <ycheng@google.com> wrote:
>>> On Tue, Nov 21, 2017 at 7:01 AM, Neal Cardwell <ncardwell@google.com> wrote:
>>>>
>>>> The original motivation for only allowing TLP in the CA_Open state was
>>>> to be conservative and avoid having the TLP impose extra load on the
>>>> bottleneck when it may be congested. Plus if there are any SACKed
>>>> packets in the SACK scoreboard then there are other existing
>>>> mechanisms to do speedy loss recovery.
>>> Neal, I like your idea of covering more states in TLP. But shouldn't we
>>> also fix the TSO deferral logic to work better with PRR in the CWR state,
>>> because it's a general transmission issue?
>>
>> Yes, I agree it's also worthwhile to see if we can make PRR and TSO
>> deferral play well together. Sorry, I should have been more clear
>> about that.
>
> Yes, but TSO auto-defer is a heuristic, and since we do not have a
> timer to 'send the partial packet' once we realize that the ACK we
> were waiting for is not going to arrive in time, we know that the
> heuristic is not perfect.
>
> Adding a timer (and its overhead) for maybe a fraction of cases might
> be overkill.
>
> 'Fixing' TSO autodefer has been on our plates forever; we played some
> games that proved to be too expensive.
>
> Although I have not played with re-using the new hrtimer we added for TCP pacing.


* Re: Linux ECN Handling
  2017-11-22  3:02                         ` Steve Ibanez
@ 2017-11-22  3:46                           ` Neal Cardwell
  2017-11-27 18:49                             ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2017-11-22  3:46 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

On Tue, Nov 21, 2017 at 10:02 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi Neal,
>
> I just tried out your fix for enabling TLPs in the CWR state (while
> leaving tcp_tso_should_defer() unchanged), but I'm still seeing the
> host enter long timeouts. Feel free to let me know if there is
> something else you'd like me to try.

Oh, interesting. That was surprising to me, until I re-read the TLP
code. I think the TLP code is accidentally preventing the TLP timer
from being set in cases where TSO deferral is leaving cwnd unused,
because of this part of the logic:

  if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
      !tcp_write_queue_empty(sk))
        return false;

AFAICT it would be great for the TLP timer to be set if TSO deferral
decides not to send. That way the TLP timer firing can unwedge the
connection (in only a few milliseconds in LAN cases) if TSO deferral
decides to defer sending and ACKs stop arriving. Removing those 3 lines
might allow TLP to give us much of the benefit of having a timer to
unwedge things after TSO deferral, without adding any new timers or
code.

If you have time, would you be able to try the following two patches
together in your test set-up?

(1)
commit 1ade85cd788cfed0433a83da03e299f396769c73
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Nov 21 22:33:30 2017 -0500

    tcp: allow TLP in CWR

    (Also allows TLP in disorder, though this is somewhat academic, since
    in disorder RACK will almost always override the TLP timer with a
    reorder timeout.)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4ea79b2ad82e..deccf8070f84 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2536,11 +2536,11 @@ bool tcp_schedule_loss_probe(struct sock *sk,
bool advancing_rto)

        early_retrans = sock_net(sk)->ipv4.sysctl_tcp_early_retrans;
        /* Schedule a loss probe in 2*RTT for SACK capable connections
-        * in Open state, that are either limited by cwnd or application.
+        * not in loss recovery, that are either limited by cwnd or application.
         */
        if ((early_retrans != 3 && early_retrans != 4) ||
            !tp->packets_out || !tcp_is_sack(tp) ||
-           icsk->icsk_ca_state != TCP_CA_Open)
+           icsk->icsk_ca_state >= TCP_CA_Recovery)
                return false;

        if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&

(2)
commit ccd377a601c14dc82826720d93afb573a388022e (HEAD ->
gnetnext8xx-tlp-recalc-rto-on-ack-fix-v4)
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Nov 21 22:34:42 2017 -0500

    tcp: allow scheduling TLP timer if TSO deferral leaves some cwnd unused

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index deccf8070f84..1724cc2bbf1a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2543,10 +2543,6 @@ bool tcp_schedule_loss_probe(struct sock *sk,
bool advancing_rto)
            icsk->icsk_ca_state >= TCP_CA_Recovery)
                return false;

-       if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
-            !tcp_write_queue_empty(sk))
-               return false;
-
        /* Probe timeout is 2*rtt. Add minimum RTO to account
         * for delayed ack when there's one outstanding packet. If no RTT
         * sample is available then probe after TCP_TIMEOUT_INIT.

Thanks!

neal


* Re: Linux ECN Handling
  2017-11-22  3:46                           ` Neal Cardwell
@ 2017-11-27 18:49                             ` Steve Ibanez
  2017-12-01 16:35                               ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2017-11-27 18:49 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

Hi Neal,

I tried out your new suggested patches and indeed it looks like it is
working. The duration of the freezes has dropped from an
RTO to 10ms (tcp_probe reports SRTT measurements of about 200us just
before the freeze). So the PTO value seems to be correctly set to
max(2*SRTT, 10ms).

Best,
-Steve

On Tue, Nov 21, 2017 at 7:46 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Tue, Nov 21, 2017 at 10:02 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
>> Hi Neal,
>>
>> I just tried out your fix for enabling TLPs in the CWR state (while
>> leaving tcp_tso_should_defer() unchanged), but I'm still seeing the
>> host enter long timeouts. Feel free to let me know if there is
>> something else you'd like me to try.
>
> Oh, interesting. That was surprising to me, until I re-read the TLP
> code. I think the TLP code is accidentally preventing the TLP timer
> from being set in cases where TSO deferral is leaving cwnd unused,
> because of this part of the logic:
>
>   if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
>       !tcp_write_queue_empty(sk))
>         return false;
>
> AFAICT it would be great for the TLP timer to be set if TSO deferral
> decides not to send. That way the TLP timer firing can unwedge the
> connection (in only a few milliseconds in LAN cases) if TSO deferral
> decides to defer sending and ACKs stop arriving. Removing those 3 lines
> might allow TLP to give us much of the benefit of having a timer to
> unwedge things after TSO deferral, without adding any new timers or
> code.
>
> If you have time, would you be able to try the following two patches
> together in your test set-up?
>
> (1)
> commit 1ade85cd788cfed0433a83da03e299f396769c73
> Author: Neal Cardwell <ncardwell@google.com>
> Date:   Tue Nov 21 22:33:30 2017 -0500
>
>     tcp: allow TLP in CWR
>
>     (Also allows TLP in disorder, though this is somewhat academic, since
>     in disorder RACK will almost always override the TLP timer with a
>     reorder timeout.)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 4ea79b2ad82e..deccf8070f84 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2536,11 +2536,11 @@ bool tcp_schedule_loss_probe(struct sock *sk,
> bool advancing_rto)
>
>         early_retrans = sock_net(sk)->ipv4.sysctl_tcp_early_retrans;
>         /* Schedule a loss probe in 2*RTT for SACK capable connections
> -        * in Open state, that are either limited by cwnd or application.
> +        * not in loss recovery, that are either limited by cwnd or application.
>          */
>         if ((early_retrans != 3 && early_retrans != 4) ||
>             !tp->packets_out || !tcp_is_sack(tp) ||
> -           icsk->icsk_ca_state != TCP_CA_Open)
> +           icsk->icsk_ca_state >= TCP_CA_Recovery)
>                 return false;
>
>         if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
>
> (2)
> commit ccd377a601c14dc82826720d93afb573a388022e (HEAD ->
> gnetnext8xx-tlp-recalc-rto-on-ack-fix-v4)
> Author: Neal Cardwell <ncardwell@google.com>
> Date:   Tue Nov 21 22:34:42 2017 -0500
>
>     tcp: allow scheduling TLP timer if TSO deferral leaves some cwnd unused
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index deccf8070f84..1724cc2bbf1a 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2543,10 +2543,6 @@ bool tcp_schedule_loss_probe(struct sock *sk,
> bool advancing_rto)
>             icsk->icsk_ca_state >= TCP_CA_Recovery)
>                 return false;
>
> -       if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
> -            !tcp_write_queue_empty(sk))
> -               return false;
> -
>         /* Probe timeout is 2*rtt. Add minimum RTO to account
>          * for delayed ack when there's one outstanding packet. If no RTT
>          * sample is available then probe after TCP_TIMEOUT_INIT.
>
> Thanks!
>
> neal


* Re: Linux ECN Handling
  2017-11-27 18:49                             ` Steve Ibanez
@ 2017-12-01 16:35                               ` Neal Cardwell
  2017-12-05  5:22                                 ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2017-12-01 16:35 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

On Mon, Nov 27, 2017 at 1:49 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
>
> Hi Neal,
>
> I tried out your new suggested patches and indeed it looks like it is
> working. The duration of the freezes looks like it has reduced from an
> RTO to 10ms (tcp probe reports SRTT measurements of about 200us just
> before the freeze). So the PTO value seems to be correctly set to
> max(2*SRTT, 10ms).

Great. Thank you for testing this!

Our team will look into testing these patches more and sending some
version of them upstream if things look good.

Also, BTW, in newer kernels, with bb4d991a28cc ("tcp: adjust tail loss
probe timeout") from July, the TLP timeout should be closer to 2*SRTT
+ 2 jiffies, so if your kernel has 1ms jiffies this should further
improve things.

Thanks,
neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-01 16:35                               ` Neal Cardwell
@ 2017-12-05  5:22                                 ` Steve Ibanez
  2017-12-05 15:23                                   ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2017-12-05  5:22 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

[-- Attachment #1: Type: text/plain, Size: 2228 bytes --]

Hi Neal,

Happy to help out :) And thanks for the tip!

I was able to track down where the missing bytes that you pointed out
are being lost. It turns out the destination host seems to be
misbehaving. I performed a packet capture at the destination host
interface (a snapshot of the trace is attached). I see the following
sequence of events when a timeout occurs (note that I have NIC
offloading enabled so wireshark captures packets larger than the MTU):

1. The destination receives a data packet of length X with seqNo = Y
from the src with the CWR bit set and does not send back a
corresponding ACK.
2. The source times out and sends a retransmission packet of length Z
(where Z < X) with seqNo = Y
3. The destination sends back an ACK with AckNo = Y + X

So in other words, the packet which the destination host does not
initially ACK (causing the timeout) does not actually get lost because
after receiving the retransmission the AckNo moves forward all the way
past the bytes in the initial unACKed CWR packet. In the attached
screenshot, I've marked the unACKed CWR packet with a red box.

Have you seen this behavior before? And do you know what might be
causing the destination host not to ACK the CWR packet? In most cases
the CWR marked packets are ACKed properly, it's just occasionally they
are not.

Thanks!
-Steve



On Fri, Dec 1, 2017 at 8:35 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Mon, Nov 27, 2017 at 1:49 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
>>
>> Hi Neal,
>>
>> I tried out your new suggested patches and indeed it looks like it is
>> working. The duration of the freezes looks like it has reduced from an
>> RTO to 10ms (tcp probe reports SRTT measurements of about 200us just
>> before the freeze). So the PTO value seems to be correctly set to
>> max(2*SRTT, 10ms).
>
> Great. Thank you for testing this!
>
> Our team will look into testing these patches more and sending some
> version of them upstream if things look good.
>
> Also, BTW, in newer kernels, with bb4d991a28cc ("tcp: adjust tail loss
> probe timeout") from July, the TLP timeout should be closer to 2*SRTT
> + 2 jiffies, so if your kernel has 1ms jiffies this should further
> improve things.
>
> Thanks,
> neal

[-- Attachment #2: han-5_timeout.png --]
[-- Type: image/png, Size: 151773 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-05  5:22                                 ` Steve Ibanez
@ 2017-12-05 15:23                                   ` Neal Cardwell
  2017-12-05 19:36                                     ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2017-12-05 15:23 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

On Tue, Dec 5, 2017 at 12:22 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi Neal,
>
> Happy to help out :) And thanks for the tip!
>
> I was able to track down where the missing bytes that you pointed out
> are being lost. It turns out the destination host seems to be
> misbehaving. I performed a packet capture at the destination host
> interface (a snapshot of the trace is attached). I see the following
> sequence of events when a timeout occurs (note that I have NIC
> offloading enabled so wireshark captures packets larger than the MTU):
>
> 1. The destination receives a data packet of length X with seqNo = Y
> from the src with the CWR bit set and does not send back a
> corresponding ACK.
> 2. The source times out and sends a retransmission packet of length Z
> (where Z < X) with seqNo = Y
> 3. The destination sends back an ACK with AckNo = Y + X
>
> So in other words, the packet which the destination host does not
> initially ACK (causing the timeout) does not actually get lost because
> after receiving the retransmission the AckNo moves forward all the way
> past the bytes in the initial unACKed CWR packet. In the attached
> screenshot, I've marked the unACKed CWR packet with a red box.
>
> Have you seen this behavior before? And do you know what might be
> causing the destination host not to ACK the CWR packet? In most cases
> the CWR marked packets are ACKed properly, it's just occasionally they
> are not.

Thanks for the detailed report!

I have not heard of an incoming CWR causing the receiver to fail to
ACK. And in re-reading the code, I don't see an obvious way in which a
CWR bit should cause the receiver to fail to ACK.

That screen shot is a bit hard to parse. Would you be able to post a
tcpdump .pcap of that particular section, or post a screen shot of a
time-sequence plot of that section?

To extract that segment and take screen shot, you could use something like:

  editcap -A "2017-12-04 11:22:27"  -B "2017-12-04 11:22:30"  all.pcap
slice.pcap
  tcptrace -S -xy -zy slice.pcap
  xplot.org a2b_tsg.xpl &
  # take screenshot

Or, alternatively, would you be able to post the slice.pcap on a web
server or public drive?

thanks,
neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-05 15:23                                   ` Neal Cardwell
@ 2017-12-05 19:36                                     ` Steve Ibanez
  2017-12-05 20:04                                       ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2017-12-05 19:36 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

Hi Neal,

I've included a link to a small trace of 13 packets, which is different
from the screenshot I attached in my last email, but shows the same
sequence of events. It's a bit hard to read the tcptrace due to the
300ms timeout, so I figured this was the best approach.

slice.pcap: https://drive.google.com/open?id=1hYXbUClHGbQv1hWG1HZWDO2WYf30N6G8

Thanks for the help!
-Steve

On Tue, Dec 5, 2017 at 7:23 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Tue, Dec 5, 2017 at 12:22 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
>> Hi Neal,
>>
>> Happy to help out :) And thanks for the tip!
>>
>> I was able to track down where the missing bytes that you pointed out
>> are being lost. It turns out the destination host seems to be
>> misbehaving. I performed a packet capture at the destination host
>> interface (a snapshot of the trace is attached). I see the following
>> sequence of events when a timeout occurs (note that I have NIC
>> offloading enabled so wireshark captures packets larger than the MTU):
>>
>> 1. The destination receives a data packet of length X with seqNo = Y
>> from the src with the CWR bit set and does not send back a
>> corresponding ACK.
>> 2. The source times out and sends a retransmission packet of length Z
>> (where Z < X) with seqNo = Y
>> 3. The destination sends back an ACK with AckNo = Y + X
>>
>> So in other words, the packet which the destination host does not
>> initially ACK (causing the timeout) does not actually get lost because
>> after receiving the retransmission the AckNo moves forward all the way
>> past the bytes in the initial unACKed CWR packet. In the attached
>> screenshot, I've marked the unACKed CWR packet with a red box.
>>
>> Have you seen this behavior before? And do you know what might be
>> causing the destination host not to ACK the CWR packet? In most cases
>> the CWR marked packets are ACKed properly, it's just occasionally they
>> are not.
>
> Thanks for the detailed report!
>
> I have not heard of an incoming CWR causing the receiver to fail to
> ACK. And in re-reading the code, I don't see an obvious way in which a
> CWR bit should cause the receiver to fail to ACK.
>
> That screen shot is a bit hard to parse. Would you be able to post a
> tcpdump .pcap of that particular section, or post a screen shot of a
> time-sequence plot of that section?
>
> To extract that segment and take screen shot, you could use something like:
>
>   editcap -A "2017-12-04 11:22:27"  -B "2017-12-04 11:22:30"  all.pcap
> slice.pcap
>   tcptrace -S -xy -zy slice.pcap
>   xplot.org a2b_tsg.xpl &
>   # take screenshot
>
> Or, alternatively, would you be able to post the slice.pcap on a web
> server or public drive?
>
> thanks,
> neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-05 19:36                                     ` Steve Ibanez
@ 2017-12-05 20:04                                       ` Neal Cardwell
  2017-12-19  5:16                                         ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2017-12-05 20:04 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

[-- Attachment #1: Type: text/plain, Size: 1080 bytes --]

On Tue, Dec 5, 2017 at 2:36 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi Neal,
>
> I've included a link to small trace of 13 packets which is different
> from the screenshot I attached in my last email, but shows the same
> sequence of events. It's a bit hard to read the tcptrace due to the
> 300ms timeout, so I figured this was the best approach.
>
> slice.pcap: https://drive.google.com/open?id=1hYXbUClHGbQv1hWG1HZWDO2WYf30N6G8

Thanks for the trace! Attached is a screen shot (first screen shot is
for the arriving packets with CWR; second is after the RTO). The
sender behavior looks reasonable. I don't see why the receiver is not
ACKing. As you say, it does look like a receiver bug. You could try
adding instrumentation to try to isolate why the receiver is not
sending an ACK immediately. You might instrument __tcp_ack_snd_check()
and tcp_send_delayed_ack() so that when the most recent incoming
packet had cwr set they printk to log what they are deciding in this
case. Perhaps the tcp_send_delayed_ack() code is hitting the max_ato
= HZ / 2 code path?

neal

[-- Attachment #2: cwr-no-ack-1.png --]
[-- Type: image/png, Size: 9725 bytes --]

[-- Attachment #3: cwr-no-ack-2.png --]
[-- Type: image/png, Size: 11703 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-05 20:04                                       ` Neal Cardwell
@ 2017-12-19  5:16                                         ` Steve Ibanez
  2017-12-19 15:28                                           ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2017-12-19  5:16 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

Hi Neal,

I started looking into this receiver ACKing issue today. Strangely,
when I tried adding printk statements at the top of the
tcp_v4_do_rcv(), tcp_rcv_established(), __tcp_ack_snd_check() and
tcp_send_delayed_ack() functions they were never executed on the
machine running the iperf3 server (i.e. the destination of the flows).
Maybe the iperf3 server is using its own TCP stack?

In any case, the ACKing problem is reproducible using just normal
iperf for which I do see my printk statements being executed. I can
now confirm that when the CWR marked packet (for which no ACK is sent)
arrives at the receiver, the __tcp_ack_snd_check() function is never
invoked; and hence neither is the tcp_send_delayed_ack() function.
Hopefully this helps narrow down where the issue might be? I started
adding some printk statements into the tcp_rcv_established() function,
but I'm not sure where the best places to look would be so I wanted to
ask your advice on this.

In case you're interested, I instrumented the __tcp_ack_snd_check()
function with the following printk statements:

@@ -5057,9 +5117,15 @@ static inline void tcp_data_snd_check(struct sock *sk)
 /*
  * Check if sending an ack is needed.
  */
-static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
+static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible, const struct tcphdr *th)
 {
        struct tcp_sock *tp = tcp_sk(sk);
+       struct inet_sock *inet = inet_sk(sk);

        /* More than one full frame received... */
        if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
@@ -5071,21 +5137,31 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
            tcp_in_quickack_mode(sk) ||
            /* We have out of order data. */
            (ofo_possible && !RB_EMPTY_ROOT(&tp->out_of_order_queue))) {
+               // SI: Debugging TCP ECN handling
+               if (sk->sk_family == AF_INET && th->cwr) {
+                       printk("tcp_debug: __tcp_ack_snd_check: %pI4/%u CWR set and sending ACK now - rcv_nxt=%u\n",
+                              &inet->inet_daddr, ntohs(inet->inet_sport), tp->rcv_nxt);
+               }
                /* Then ack it now */
                tcp_send_ack(sk);
        } else {
+               // SI: Debugging TCP ECN handling
+               if (sk->sk_family == AF_INET && th->cwr) {
+                       printk("tcp_debug: __tcp_ack_snd_check: %pI4/%u CWR set and sending delayed ACK - rcv_nxt=%u\n",
+                              &inet->inet_daddr, ntohs(inet->inet_sport), tp->rcv_nxt);
+               }
                /* Else, send delayed ack. */
-               tcp_send_delayed_ack(sk);
+               tcp_send_delayed_ack(sk, th);
        }
 }

In the kernel log on the receiver, I see the following sequence of
events at a timeout:

[ 2730.145023] tcp_debug: __tcp_ack_snd_check: 10.0.0.5/916 CWR set
and sending ACK now - rcv_nxt=2317949387
[ 2730.145543] tcp_debug: __tcp_ack_snd_check: 10.0.0.5/916 CWR set
and sending ACK now - rcv_nxt=2318243331 <-- last log statement before
timeout
[ 2730.452540] tcp_debug: __tcp_ack_snd_check: 10.0.0.5/916 CWR set
and sending ACK now - rcv_nxt=2318593747
[ 2730.453137] tcp_debug: __tcp_ack_snd_check: 10.0.0.5/916 CWR set
and sending ACK now - rcv_nxt=2318813843

From the tcpdump trace at the receiver's interface I see that the last
log statement before the timeout corresponds exactly to one CWR packet
before the unACKed CWR packet. For example, in this case, the CWR
packet to which the indicated log statement corresponds has seqNo =
2318196995 and length 46336 ==> 2318196995 + 46336 = 2318243331, which
is exactly the rcv_nxt value of the indicated log statement. And then
unACKed CWR packet arrives and it is completely missing from the log
file, there is no indication of sending a delayed ACK either. Hence my
conclusion that the __tcp_ack_snd_check() function is never invoked by
the receiver upon receiving the unACKed CWR packet.

Sorry if that was long and verbose, I just wanted to be clear on what
I had done. Please do let me know if you have any questions though.

Thanks,
-Steve


On Tue, Dec 5, 2017 at 12:04 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Tue, Dec 5, 2017 at 2:36 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
>> Hi Neal,
>>
>> I've included a link to small trace of 13 packets which is different
>> from the screenshot I attached in my last email, but shows the same
>> sequence of events. It's a bit hard to read the tcptrace due to the
>> 300ms timeout, so I figured this was the best approach.
>>
>> slice.pcap: https://drive.google.com/open?id=1hYXbUClHGbQv1hWG1HZWDO2WYf30N6G8
>
> Thanks for the trace! Attached is a screen shot (first screen shot is
> for the arriving packets with CWR; second is after the RTO). The
> sender behavior looks reasonable. I don't see why the receiver is not
> ACKing. As you say, it does look like a receiver bug. You could try
> adding instrumentation to try to isolate why the receiver is not
> sending an ACK immediately. You might instrument __tcp_ack_snd_check()
> and tcp_send_delayed_ack() so that when the most recent incoming
> packet had cwr set they printk to log what they are deciding in this
> case. Perhaps the tcp_send_delayed_ack()  code is hitting the max_ato
> = HZ / 2 code path?
>
> neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-19  5:16                                         ` Steve Ibanez
@ 2017-12-19 15:28                                           ` Neal Cardwell
  2017-12-19 22:00                                             ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2017-12-19 15:28 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

On Tue, Dec 19, 2017 at 12:16 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
>
> Hi Neal,
>
> I started looking into this receiver ACKing issue today. Strangely,
> when I tried adding printk statements at the top of the
> tcp_v4_do_rcv(), tcp_rcv_established(), __tcp_ack_snd_check() and
> tcp_send_delayed_ack() functions they were never executed on the
> machine running the iperf3 server (i.e. the destination of the flows).
> Maybe the iperf3 server is using its own TCP stack?
>
> In any case, the ACKing problem is reproducible using just normal
> iperf for which I do see my printk statements being executed. I can
> now confirm that when the CWR marked packet (for which no ACK is sent)
> arrives at the receiver, the __tcp_ack_snd_check() function is never
> invoked; and hence neither is the tcp_send_delayed_ack() function.
> Hopefully this helps narrow down where the issue might be? I started
> adding some printk statements into the tcp_rcv_established() function,
> but I'm not sure where the best places to look would be so I wanted to
> ask your advice on this.

Thanks for the detailed report!

As a next step to narrow down why the CWR-marked packet is not acked,
I would suggest adding printk statements at the bottom of
tcp_rcv_established() in all the spots where we have a goto or return
that would cause us to short-circuit and not reach the
tcp_ack_snd_check() at the bottom of the function. This could be much
like your existing nice debugging printks, that log any time we get to
that spot and if (sk->sk_family == AF_INET && th->cwr). And these
could be in the following spots (marked "here"):

slow_path:
        if (len < (th->doff << 2) || tcp_checksum_complete(skb))
                goto csum_error;                /* <=== here */

        if (!th->ack && !th->rst && !th->syn)
                goto discard;                /* <=== here */

        /*
         *      Standard slow path.
         */

        if (!tcp_validate_incoming(sk, skb, th, 1))
                return;                /* <=== here */

step5:
        if (tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT) < 0)
                goto discard;                /* <=== here */


thanks,
neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-19 15:28                                           ` Neal Cardwell
@ 2017-12-19 22:00                                             ` Steve Ibanez
  2017-12-20  0:08                                               ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2017-12-19 22:00 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

Hi Neal,

I managed to track down the code path that the unACKed CWR packet is
taking. The tcp_rcv_established() function calls tcp_ack_snd_check()
at the end of step5 and then the return statement indicated below is
invoked, which prevents the __tcp_ack_snd_check() function from
running.

static inline void tcp_ack_snd_check(struct sock *sk)
{
        if (!inet_csk_ack_scheduled(sk)) {
                /* We sent a data segment already. */
                return;   /* <=== here */
        }
        __tcp_ack_snd_check(sk, 1);
}

So somehow tcp_ack_snd_check() thinks that a data segment was already
sent when in fact it wasn't. Do you see a way around this issue?

Thanks,
-Steve

On Tue, Dec 19, 2017 at 7:28 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Tue, Dec 19, 2017 at 12:16 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
>>
>> Hi Neal,
>>
>> I started looking into this receiver ACKing issue today. Strangely,
>> when I tried adding printk statements at the top of the
>> tcp_v4_do_rcv(), tcp_rcv_established(), __tcp_ack_snd_check() and
>> tcp_send_delayed_ack() functions they were never executed on the
>> machine running the iperf3 server (i.e. the destination of the flows).
>> Maybe the iperf3 server is using its own TCP stack?
>>
>> In any case, the ACKing problem is reproducible using just normal
>> iperf for which I do see my printk statements being executed. I can
>> now confirm that when the CWR marked packet (for which no ACK is sent)
>> arrives at the receiver, the __tcp_ack_snd_check() function is never
>> invoked; and hence neither is the tcp_send_delayed_ack() function.
>> Hopefully this helps narrow down where the issue might be? I started
>> adding some printk statements into the tcp_rcv_established() function,
>> but I'm not sure where the best places to look would be so I wanted to
>> ask your advice on this.
>
> Thanks for the detailed report!
>
> As a next step to narrow down why the CWR-marked packet is not acked,
> I would suggest adding printk statements at the bottom of
> tcp_rcv_established() in all the spots where we have a goto or return
> that would cause us to short-circuit and not reach the
> tcp_ack_snd_check() at the bottom of the function. This could be much
> like your existing nice debugging printks, that log any time we get to
> that spot and if (sk->sk_family == AF_INET && th->cwr). And these
> could be in the following spots (marked "here"):
>
> slow_path:
>         if (len < (th->doff << 2) || tcp_checksum_complete(skb))
>                 goto csum_error;                /* <=== here */
>
>         if (!th->ack && !th->rst && !th->syn)
>                 goto discard;                /* <=== here */
>
>         /*
>          *      Standard slow path.
>          */
>
>         if (!tcp_validate_incoming(sk, skb, th, 1))
>                 return;                /* <=== here */
>
> step5:
>         if (tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT) < 0)
>                 goto discard;                /* <=== here */
>
>
> thanks,
> neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-19 22:00                                             ` Steve Ibanez
@ 2017-12-20  0:08                                               ` Neal Cardwell
  2017-12-20 19:20                                                 ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2017-12-20  0:08 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

On Tue, Dec 19, 2017 at 5:00 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi Neal,
>
> I managed to track down the code path that the unACKed CWR packet is
> taking. The tcp_rcv_established() function calls tcp_ack_snd_check()
> at the end of step5 and then the return statement indicated below is
> invoked, which prevents the __tcp_ack_snd_check() function from
> running.
>
> static inline void tcp_ack_snd_check(struct sock *sk)
> {
>         if (!inet_csk_ack_scheduled(sk)) {
>                 /* We sent a data segment already. */
>                 return;   /* <=== here */
>         }
>         __tcp_ack_snd_check(sk, 1);
> }
>
> So somehow tcp_ack_snd_check() thinks that a data segment was already
> sent when in fact it wasn't. Do you see a way around this issue?

Thanks for tracking that down! AFAICT in this case the call chain we
are trying to achieve is as follows:

tcp_rcv_established()
 -> tcp_data_queue()
 -> tcp_event_data_recv()
 -> inet_csk_schedule_ack()

The only thing I can think of would be to add printks that fire for
CWR packets, to isolate why the code bails out before it reaches those
calls...

thanks,
neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-20  0:08                                               ` Neal Cardwell
@ 2017-12-20 19:20                                                 ` Steve Ibanez
  2017-12-20 20:17                                                   ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2017-12-20 19:20 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

Hi Neal,

I added in some more printk statements and it does indeed look like
all of these calls you listed are being invoked successfully. I guess
this isn't too surprising given what the inet_csk_schedule_ack() and
inet_csk_ack_scheduled() functions are doing:

static inline void inet_csk_schedule_ack(struct sock *sk)
{
        inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_SCHED;
}

static inline int inet_csk_ack_scheduled(const struct sock *sk)
{
        return inet_csk(sk)->icsk_ack.pending & ICSK_ACK_SCHED;
}

So through the code path that you listed, the inet_csk_schedule_ack()
function sets the ICSK_ACK_SCHED bit and then the tcp_ack_snd_check()
function just checks that the ICSK_ACK_SCHED bit is indeed set.

Do you know how I can verify that setting the ICSK_ACK_SCHED bit
actually results in an ACK being sent?

Thanks,
-Steve

On Tue, Dec 19, 2017 at 4:08 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Tue, Dec 19, 2017 at 5:00 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
>> Hi Neal,
>>
>> I managed to track down the code path that the unACKed CWR packet is
>> taking. The tcp_rcv_established() function calls tcp_ack_snd_check()
>> at the end of step5 and then the return statement indicated below is
>> invoked, which prevents the __tcp_ack_snd_check() function from
>> running.
>>
>> static inline void tcp_ack_snd_check(struct sock *sk)
>> {
>>         if (!inet_csk_ack_scheduled(sk)) {
>>                 /* We sent a data segment already. */
>>                 return;   /* <=== here */
>>         }
>>         __tcp_ack_snd_check(sk, 1);
>> }
>>
>> So somehow tcp_ack_snd_check() thinks that a data segment was already
>> sent when in fact it wasn't. Do you see a way around this issue?
>
> Thanks for tracking that down! AFAICT in this case the call chain we
> are trying to achieve is as follows:
>
> tcp_rcv_established()
>  -> tcp_data_queue()
>  -> tcp_event_data_recv()
>  -> inet_csk_schedule_ack()
>
> The only think I can think of would be to add printks that fire for
> CWR packets, to isolate why the code bails out before it reaches those
> calls...
>
> thanks,
> neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-20 19:20                                                 ` Steve Ibanez
@ 2017-12-20 20:17                                                   ` Neal Cardwell
  2018-01-02  7:43                                                     ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2017-12-20 20:17 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

On Wed, Dec 20, 2017 at 2:20 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
>
> Hi Neal,
>
> I added in some more printk statements and it does indeed look like
> all of these calls you listed are being invoked successfully. I guess
> this isn't too surprising given what the inet_csk_schedule_ack() and
> inet_csk_ack_scheduled() functions are doing:
>
> static inline void inet_csk_schedule_ack(struct sock *sk)
> {
>         inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_SCHED;
> }
>
> static inline int inet_csk_ack_scheduled(const struct sock *sk)
> {
>         return inet_csk(sk)->icsk_ack.pending & ICSK_ACK_SCHED;
> }
>
> So through the code path that you listed, the inet_csk_schedule_ack()
> function sets the ICSK_ACK_SCHED bit and then the tcp_ack_snd_check()
> function just checks that the ICSK_ACK_SCHED bit is indeed set.
>
> Do you know how I can verify that setting the ICSK_ACK_SCHED bit
> actually results in an ACK being sent?

Hmm. I don't think in this case we can verify that setting the
ICSK_ACK_SCHED bit actually results in an ACK being sent. Because
AFAICT in this case it seems like an ACK is not sent. :-) This is
based on both the tcpdumps on Dec 5 and your detective work yesterday
("The tcp_rcv_established() function calls tcp_ack_snd_check() at the
end of step5 and then the return statement indicated below is invoked,
which prevents the __tcp_ack_snd_check() function from running.")

So AFAICT the puzzle is: how is the icsk_ack.pending  ICSK_ACK_SCHED
bit being cleared between the inet_csk_schedule_ack() call and the
tcp_ack_snd_check() call, without (apparently) an actual ACK being
sent on the wire?

AFAICT the ICSK_ACK_SCHED bit is not supposed to be cleared unless we
get to this sequence:

tcp_transmit_skb()
  if (likely(tcb->tcp_flags & TCPHDR_ACK))
    tcp_event_ack_sent(sk, tcp_skb_pcount(skb));
     -> inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
            icsk->icsk_ack.blocked = icsk->icsk_ack.pending = 0;

I don't have a theory that fits all of those data points, unless this
is a bi-directional transfer (is it?) and between the
inet_csk_schedule_ack() call and the tcp_ack_snd_check() call the TCP
connection sends a data packet (in tcp_data_snd_check()) and then it
is dropped for some reason before the packet makes it to the tcpdump
sniffing point. Perhaps because of a qdisc limit or something?

I guess a possible next step would be, while processing an incoming
skb with the cwr bit set, the code could set a new debugging field in
the tcp_sock (tp->processing_cwr), and then you could check this field
in tcp_transmit_skb() and printk if (1) there is an attempted
queue_xmit() call and (2) if the queue_xmit() fails (the err > 0 case).

That's a long shot, but the only idea I have at this point.

thanks,
neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2017-12-20 20:17                                                   ` Neal Cardwell
@ 2018-01-02  7:43                                                     ` Steve Ibanez
  2018-01-02 16:27                                                       ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2018-01-02  7:43 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

Hi Neal,

Apologies for the delay, and happy new year!

To answer your question, data is only transferred in one direction
(from the client to the server). The SeqNo in the pkts from the server
to the client is not incremented, so I don't think a data pkt sent in
the tcp_data_snd_check() call is clearing the ICSK_ACK_SCHED bit.
Still, I think it would be helpful to include your new debugging field
in the tcp_sock (tp->processing_cwr) so that I can check it in the
tcp_transmit_skb() and tcp_send_ack() functions. I added the new field
and tried to set it at the top of tcp_rcv_established(), but when I
try to check the field in the tcp_send_ack() function it never appears
to be set. Below I'm showing
how I set the tp->processing_cwr field in the tcp_rcv_established
function and how I check it in the tcp_send_ack function. Is this how
you were imagining the processing_cwr field to be used?

void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
                         const struct tcphdr *th, unsigned int len)
 {
        struct tcp_sock *tp = tcp_sk(sk);
+
+       /* SI: Debugging TCP ECN handling */
+       if (tcp_hdr(skb)->cwr)
+               tp->processing_cwr = 1;
+       else
+               tp->processing_cwr = 0;

        tcp_mstamp_refresh(tp);
        if (unlikely(!sk->sk_rx_dst))


 void tcp_send_ack(struct sock *sk)
 {
        struct sk_buff *buff;
+       /* SI: Debugging TCP ECN handling */
+       const struct tcp_sock *tp = tcp_sk(sk);
+       struct inet_sock *inet = inet_sk(sk);
+
+       if ((sk->sk_family == AF_INET) && tp->processing_cwr) {
+               printk("tcp_debug: tcp_send_ack: %pI4/%u - CWR set and rcv_nxt=%u, snd_una=%u\n",
+                      &inet->inet_daddr, ntohs(inet->inet_sport),
+                      tp->rcv_nxt, tp->snd_una);
+       }

        /* If we have been reset, we may not send again. */
        if (sk->sk_state == TCP_CLOSE)

Thanks!
-Steve


On Wed, Dec 20, 2017 at 12:17 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, Dec 20, 2017 at 2:20 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
>>
>> Hi Neal,
>>
>> I added in some more printk statements and it does indeed look like
>> all of these calls you listed are being invoked successfully. I guess
>> this isn't too surprising given what the inet_csk_schedule_ack() and
>> inet_csk_ack_scheduled() functions are doing:
>>
>> static inline void inet_csk_schedule_ack(struct sock *sk)
>> {
>>         inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_SCHED;
>> }
>>
>> static inline int inet_csk_ack_scheduled(const struct sock *sk)
>> {
>>         return inet_csk(sk)->icsk_ack.pending & ICSK_ACK_SCHED;
>> }
>>
>> So through the code path that you listed, the inet_csk_schedule_ack()
>> function sets the ICSK_ACK_SCHED bit and then the tcp_ack_snd_check()
>> function just checks that the ICSK_ACK_SCHED bit is indeed set.
>> Do you know how I can verify that setting the ICSK_ACK_SCHED bit
>> actually results in an ACK being sent?
>
> Hmm. I don't think in this case we can verify that setting the
> ICSK_ACK_SCHED bit actually results in an ACK being sent. Because
> AFAICT in this case it seems like an ACK is not sent. :-) This is
> based on both the tcpdumps on Dec 5 and your detective work yesterday
> ("The tcp_rcv_established() function calls tcp_ack_snd_check() at the
> end of step5 and then the return statement indicated below is invoked,
> which prevents the __tcp_ack_snd_check() function from running.")
>
> So AFAICT the puzzle is: how is the icsk_ack.pending  ICSK_ACK_SCHED
> bit being cleared between the inet_csk_schedule_ack() call and the
> tcp_ack_snd_check() call, without (apparently) an actual ACK being
> sent on the wire?
>
> AFAICT the ICSK_ACK_SCHED bit is not supposed to be cleared unless we
> get to this sequence:
>
> tcp_transmit_skb()
>   if (likely(tcb->tcp_flags & TCPHDR_ACK))
>     tcp_event_ack_sent(sk, tcp_skb_pcount(skb));
>      -> inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
>             icsk->icsk_ack.blocked = icsk->icsk_ack.pending = 0;
>
> I don't have a theory that fits all of those data points, unless this
> is a bi-directional transfer (is it?) and between the
> inet_csk_schedule_ack() call and the tcp_ack_snd_check() call the TCP
> connection sends a data packet (in tcp_data_snd_check()) and then it
> is dropped for some reason before the packet makes it to the tcpdump
> sniffing point. Perhaps because of a qdisc limit or something?
>
> I guess a possible next step would be, while processing an incoming
> skb with the cwr bit set, the code could set a new debugging field in
> the tcp_sock (tp->processing_cwr), and then you could check this field
> in tcp_transmit_skb() and printk if (1) there is an attempted
> queue_xmit() call and (2) if the queue_xmit() fails (the err > 0 case).
>
> That's a long shot, but the only idea I have at this point.
>
> thanks,
> neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2018-01-02  7:43                                                     ` Steve Ibanez
@ 2018-01-02 16:27                                                       ` Neal Cardwell
  2018-01-02 23:57                                                         ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2018-01-02 16:27 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

On Tue, Jan 2, 2018 at 2:43 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi Neal,
>
> Apologies for the delay, and happy new year!
>
> To answer your question, data is only transferred in one direction
> (from the client to the server). The SeqNo in the pkts from the server
> to the client is not incremented. So I don't think that a data pkt is
> attempted to be sent in the tcp_data_snd_check() call clearing the
> ICSK_ACK_SCHED bit. Although I think it would be helpful to include
> your new debugging field in the tcp_sock (tp->processing_cwr) so that
> I can check this field in the tcp_transmit_skb() and tcp_send_ack()
> functions. I added the new field and tried to set it at the top of the
> tcp_rcv_established(), but then when I try to check the field in the
> tcp_send_ack() function it never appears to be set. Below I'm showing
> how I set the tp->processing_cwr field in the tcp_rcv_established
> function and how I check it in the tcp_send_ack function. Is this how
> you were imagining the processing_cwr field to be used?

Happy new year to you as well, and thank you, Steve, for running this
experiment! Yes, this is basically the kind of thing I had in mind.

The connection will run the "fast path" tcp_rcv_established() code if
the connection is in the ESTABLISHED state.  From the symptoms it
sounds like what's happening is that in this test the connection is
not in the ESTABLISHED state when the CWR arrives, so it's probably
running the more general tcp_rcv_state_process() function. I would
suggest adding your tp->processing_cwr instrumentation at the top of
tcp_rcv_state_process() as well, and then re-running the test. (In
tcp_v4_do_rcv() and tcp_v6_do_rcv(), for each incoming skb one of
those two functions is called).
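
In case it helps, the extra hook could mirror the one you already have,
something like the sketch below (untested, against your tree; the exact
context lines at the top of tcp_rcv_state_process() in
net/ipv4/tcp_input.c will differ):

```diff
 int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+
+	/* SI: Debugging TCP ECN handling (mirror of the
+	 * tcp_rcv_established() hook) */
+	if (tcp_hdr(skb)->cwr)
+		tp->processing_cwr = 1;
+	else
+		tp->processing_cwr = 0;
```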

It is interesting that the connection does not seem to be in the
ESTABLISHED state. Maybe that is an ingredient of the unexpected
behavior in this case...

Thanks!
neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2018-01-02 16:27                                                       ` Neal Cardwell
@ 2018-01-02 23:57                                                         ` Steve Ibanez
  2018-01-03 19:39                                                           ` Neal Cardwell
  0 siblings, 1 reply; 32+ messages in thread
From: Steve Ibanez @ 2018-01-02 23:57 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

Hi Neal,

Sorry, my last email was incorrect. It turns out the default tcp
congestion control alg that was being used on my client machines was
cubic instead of dctcp. That is why the tp->processing_cwr field was never
set in the tcp_rcv_established function. I've changed the default back
to dctcp on all of my machines.

I am now logging the value of tp->rcv_nxt at the top of the
tcp_transmit_skb() function for all CWR segments. I see that during
normal operation, the value of tp->rcv_nxt is equal to the SeqNo in
the CWR segment  + length of the CWR segment. However, for the unACKed
CWR segment, the value of tp->rcv_nxt is just equal to the SeqNo in
the CWR segment (i.e. not incremented by the length). And I see that
by the time the tcp_ack_snd_check() function is executed, tp->rcv_nxt
has been incremented by the length of the unACKed CWR segment.

The tcp_transmit_skb() function sets the outgoing segment's ack_seq to
be tp->rcv_nxt:

th->ack_seq             = htonl(tp->rcv_nxt);

So I think the rcv_nxt field is supposed to be incremented before
reaching tcp_transmit_skb(). Can you see any reason as to why this
field would not be incremented for CWR segments sometimes?

Thanks,
-Steve


On Tue, Jan 2, 2018 at 8:27 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Tue, Jan 2, 2018 at 2:43 AM, Steve Ibanez <sibanez@stanford.edu> wrote:
>> Hi Neal,
>>
>> Apologies for the delay, and happy new year!
>>
>> To answer your question, data is only transferred in one direction
>> (from the client to the server). The SeqNo in the pkts from the server
>> to the client is not incremented. So I don't think that a data pkt is
>> attempted to be sent in the tcp_data_snd_check() call clearing the
>> ICSK_ACK_SCHED bit. Although I think it would be helpful to include
>> your new debugging field in the tcp_sock (tp->processing_cwr) so that
>> I can check this field in the tcp_transmit_skb() and tcp_send_ack()
>> functions. I added the new field and tried to set it at the top of the
>> tcp_rcv_established(), but then when I try to check the field in the
>> tcp_send_ack() function it never appears to be set. Below I'm showing
>> how I set the tp->processing_cwr field in the tcp_rcv_established
>> function and how I check it in the tcp_send_ack function. Is this how
>> you were imagining the processing_cwr field to be used?
>
> Happy new year to you as well, and thank you, Steve, for running this
> experiment! Yes, this is basically the kind of thing I had in mind.
>
> The connection will run the "fast path" tcp_rcv_established() code if
> the connection is in the ESTABLISHED state.  From the symptoms it
> sounds like what's happening is that in this test the connection is
> not in the ESTABLISHED state when the CWR arrives, so it's probably
> running the more general tcp_rcv_state_process() function. I would
> suggest adding your tp->processing_cwr instrumentation at the top of
> tcp_rcv_state_process() as well, and then re-running the test. (In
> tcp_v4_do_rcv() and tcp_v6_do_rcv(), for each incoming skb one of
> those two functions is called).
>
> It is interesting that the connection does not seem to be in the
> ESTABLISHED state. Maybe that is an ingredient of the unexpected
> behavior in this case...
>
> Thanks!
> neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2018-01-02 23:57                                                         ` Steve Ibanez
@ 2018-01-03 19:39                                                           ` Neal Cardwell
  2018-01-03 22:21                                                             ` Steve Ibanez
  0 siblings, 1 reply; 32+ messages in thread
From: Neal Cardwell @ 2018-01-03 19:39 UTC (permalink / raw)
  To: Steve Ibanez
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

On Tue, Jan 2, 2018 at 6:57 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
> Hi Neal,
>
> Sorry, my last email was incorrect. It turns out the default tcp
> congestion control alg that was being used on my client machines was
> cubic instead of dctcp. That is why tp->processing_cwr field was never
> set in the tcp_rcv_established function. I've changed the default back
> to dctcp on all of my machines.
>
> I am now logging the value of tp->rcv_nxt at the top of the
> tcp_transmit_skb() function for all CWR segments. I see that during
> normal operation, the value of tp->rcv_nxt is equal to the SeqNo in
> the CWR segment  + length of the CWR segment.

OK, thanks. That makes sense.

This part I didn't understand:

> However, for the unACKed
> CWR segment, the value of tp->rcv_nxt is just equal to the SeqNo in
> the CWR segment (i.e. not incremented by the length). And I see that
> by the time the tcp_ack_snd_check() function is executed, tp->rcv_nxt
> has been incremented by the length of the unACKed CWR segment.

I would have thought that for the processing of the skb that has the
CWR, the sequence would be:

(1)  "...the tcp_ack_snd_check() function is executed, tp->rcv_nxt has
been incremented by the length of the unACKed CWR segment"

(2) then we send the ACK, and the instrumentation at the top of the
tcp_transmit_skb() function logs that rcv_nxt value (which "has been
incremented by the length of the unACKed CWR segment").

But you are saying "for the unACKed CWR segment, the value of
tp->rcv_nxt is just equal to the SeqNo in the CWR segment (i.e. not
incremented by the length)", which does not seem to match my
prediction in (2). Apparently I am mis-understanding the sequence.
Perhaps you can help clear it up for me? :-)

Is it possible that the case where you see "tp->rcv_nxt is just equal
to the SeqNo in the CWR segment" is a log line that was logged while
processing the skb that precedes the skb with the CWR?

> The tcp_transmit_skb() function sets the outgoing segment's ack_seq to
> be tp->rcv_nxt:
>
> th->ack_seq             = htonl(tp->rcv_nxt);
>
> So I think the rcv_nxt field is supposed to be incremented before
> reaching tcp_transmit_skb(). Can you see any reason as to why this
> field would not be incremented for CWR segments sometimes?

No, so far I haven't been able to think of a reason why rcv_nxt would
not be incremented for in-order CWR-marked segments...

cheers,
neal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux ECN Handling
  2018-01-03 19:39                                                           ` Neal Cardwell
@ 2018-01-03 22:21                                                             ` Steve Ibanez
  0 siblings, 0 replies; 32+ messages in thread
From: Steve Ibanez @ 2018-01-03 22:21 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, Yuchung Cheng, Daniel Borkmann, Netdev,
	Florian Westphal, Mohammad Alizadeh, Lawrence Brakmo

[-- Attachment #1: Type: text/plain, Size: 3169 bytes --]

Hi Neal,

I've attached a pdf of a slide that I made which shows some data from
the kernel log at the receiver as well as the timestamp, SeqNo, and
length of the corresponding segments from the tcpdump trace at the
receiver interface. Hopefully this helps clarify why I think
tcp_transmit_skb() is called at some point before tcp_ack_snd_check()
while processing the CWR segment for which no ACK is sent. Please let
me know if anything is unclear or if you know where that initial call
to tcp_transmit_skb() might be coming from and why it's happening
prematurely.

Best,
-Steve

On Wed, Jan 3, 2018 at 11:39 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Tue, Jan 2, 2018 at 6:57 PM, Steve Ibanez <sibanez@stanford.edu> wrote:
>> Hi Neal,
>>
>> Sorry, my last email was incorrect. It turns out the default tcp
>> congestion control alg that was being used on my client machines was
>> cubic instead of dctcp. That is why tp->processing_cwr field was never
>> set in the tcp_rcv_established function. I've changed the default back
>> to dctcp on all of my machines.
>>
>> I am now logging the value of tp->rcv_nxt at the top of the
>> tcp_transmit_skb() function for all CWR segments. I see that during
>> normal operation, the value of tp->rcv_nxt is equal to the SeqNo in
>> the CWR segment  + length of the CWR segment.
>
> OK, thanks. That makes sense.
>
> This part I didn't understand:
>
>> However, for the unACKed
>> CWR segment, the value of tp->rcv_nxt is just equal to the SeqNo in
>> the CWR segment (i.e. not incremented by the length). And I see that
>> by the time the tcp_ack_snd_check() function is executed, tp->rcv_nxt
>> has been incremented by the length of the unACKed CWR segment.
>
> I would have thought that for the processing of the skb that has the
> CWR, the sequence would be:
>
> (1)  "...the tcp_ack_snd_check() function is executed, tp->rcv_nxt has
> been incremented by the length of the unACKed CWR segment"
>
> (2) then we send the ACK, and the instrumentation at the top of the
> tcp_transmit_skb() function logs that rcv_nxt value (which "has been
> incremented by the length of the unACKed CWR segment").
>
> But you are saying "for the unACKed CWR segment, the value of
> tp->rcv_nxt is just equal to the SeqNo in the CWR segment (i.e. not
> incremented by the length)", which does not seem to match my
> prediction in (2). Apparently I am mis-understanding the sequence.
> Perhaps you can help clear it up for me? :-)
>
> Is it possible that the case where you see "tp->rcv_nxt is just equal
> to the SeqNo in the CWR segment" is a log line that was logged while
> processing the skb that precedes the skb with the CWR?
>
>> The tcp_transmit_skb() function sets the outgoing segment's ack_seq to
>> be tp->rcv_nxt:
>>
>> th->ack_seq             = htonl(tp->rcv_nxt);
>>
>> So I think the rcv_nxt field is supposed to be incremented before
>> reaching tcp_transmit_skb(). Can you see any reason as to why this
>> field would not be incremented for CWR segments sometimes?
>
> No, so far I haven't been able to think of a reason why rcv_nxt would
> not be incremented for in-order CWR-marked segments...
>
> cheers,
> neal

[-- Attachment #2: CWR_log_and_trace.pdf --]
[-- Type: application/pdf, Size: 34236 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-01-03 22:22 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CACJspmLFdy9i8K=TkzXHnofyFMNoJf9HkYE7On8uG+PREc2Dqw@mail.gmail.com>
2017-10-19 12:43 ` Linux ECN Handling Florian Westphal
2017-10-23 22:15   ` Steve Ibanez
2017-10-24  1:11     ` Neal Cardwell
2017-11-06 14:08       ` Daniel Borkmann
2017-11-06 23:31         ` Steve Ibanez
2017-11-20  7:31           ` Steve Ibanez
2017-11-20 15:05             ` Neal Cardwell
     [not found]             ` <CADVnQy=q4qBpe0Ymo8dtFTYU_0x0q_XKE+ZvazLQE-ULwu7pQA@mail.gmail.com>
2017-11-20 15:40               ` Eric Dumazet
2017-11-21  5:58               ` Steve Ibanez
2017-11-21 15:01                 ` Neal Cardwell
2017-11-21 15:51                   ` Yuchung Cheng
2017-11-21 16:20                     ` Neal Cardwell
2017-11-21 16:52                       ` Eric Dumazet
2017-11-22  3:02                         ` Steve Ibanez
2017-11-22  3:46                           ` Neal Cardwell
2017-11-27 18:49                             ` Steve Ibanez
2017-12-01 16:35                               ` Neal Cardwell
2017-12-05  5:22                                 ` Steve Ibanez
2017-12-05 15:23                                   ` Neal Cardwell
2017-12-05 19:36                                     ` Steve Ibanez
2017-12-05 20:04                                       ` Neal Cardwell
2017-12-19  5:16                                         ` Steve Ibanez
2017-12-19 15:28                                           ` Neal Cardwell
2017-12-19 22:00                                             ` Steve Ibanez
2017-12-20  0:08                                               ` Neal Cardwell
2017-12-20 19:20                                                 ` Steve Ibanez
2017-12-20 20:17                                                   ` Neal Cardwell
2018-01-02  7:43                                                     ` Steve Ibanez
2018-01-02 16:27                                                       ` Neal Cardwell
2018-01-02 23:57                                                         ` Steve Ibanez
2018-01-03 19:39                                                           ` Neal Cardwell
2018-01-03 22:21                                                             ` Steve Ibanez

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.