Re: [RFC] tcp: Consider mtu probing for tcp_xmit_size_goal

From: Leonard Crestez <lcrestez@drivenets.com>
To: Neal Cardwell <ncardwell@google.com>,
	Matt Mathis <mattmathis@google.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Willem de Bruijn <willemb@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
	David Ahern <dsahern@kernel.org>,
	John Heffner <johnwheffner@gmail.com>,
	Soheil Hassas Yeganeh <soheil@google.com>,
	Roopa Prabhu <roopa@cumulusnetworks.com>,
	Leonard Crestez <cdleonard@gmail.com>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC] tcp: Consider mtu probing for tcp_xmit_size_goal
Date: Mon, 26 Apr 2021 19:58:50 +0300	[thread overview]
Message-ID: <ec77b414-32f5-b512-7f1a-fb939430cea6@drivenets.com> (raw)
In-Reply-To: <CADVnQykCbF9jcFS6xE9fmaVnimSvY+0yOgGf=THA0-tnnyYPDQ@mail.gmail.com>

On 26.04.2021 18:40, Neal Cardwell wrote:
> Thanks for the patch. Some thoughts, in-line below:
> 
> On Sun, Apr 25, 2021 at 10:22 PM Leonard Crestez <lcrestez@drivenets.com> wrote:
>>
>> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
>> in order to accumulate enough data" but linux almost never does that.
>>
>> Linux checks for probe_size + (1 + retries) * mss_cache to be available
> 
> I think this means to say "reordering" rather than "retries"?
> 
>> in the send buffer and if that condition is not met it will send anyway
>> using the current MSS. The feature can be made to work by sending very
>> large chunks of data from userspace (for example 128k) but for small
>> writes on fast links tcp mtu probes almost never happen.
>>
>> This patch tries to take mtu probe into account in tcp_xmit_size_goal, a
>> function which otherwise attempts to accumulate a packet suitable for
>> TSO. No delays are introduced beyond existing autocork heuristics.
> 
> I would suggest phrasing this paragraph as something like:
> 
>    This patch generalizes tcp_xmit_size_goal(), a
>    function which is used in autocorking in order to attempt to
>    accumulate suitable transmit sizes, so that it takes into account
>    not only TSO and receive window, but now also transmit
>    sizes needed for efficient PMTU discovery (when a flow is
>    eligible for PMTU discovery).
> 
>>
>> Suggested-by: Matt Mathis <mattmathis@google.com>
>> Signed-off-by: Leonard Crestez <lcrestez@drivenets.com>
>> ---
> ...
>> +
>> +               /* Timer for wait_data */
>> +               struct    timer_list    wait_data_timer;
> 
> This timer_list seems left over from a previous patch version, and no
> longer used?

Yes, I'll most a cleaned-up version.

>> @@ -1375,10 +1376,12 @@ static inline void tcp_slow_start_after_idle_check(struct sock *sk)
>>          s32 delta;
>>
>>          if (!sock_net(sk)->ipv4.sysctl_tcp_slow_start_after_idle || tp->packets_out ||
>>              ca_ops->cong_control)
>>                  return;
>> +       if (inet_csk(sk)->icsk_mtup.wait_data)
>> +               return;
>>          delta = tcp_jiffies32 - tp->lsndtime;
>>          if (delta > inet_csk(sk)->icsk_rto)
>>                  tcp_cwnd_restart(sk, delta);
>>   }
> 
> I would argue that the MTU probing heuristics should not override
> congestion control. If congestion control thinks it's proper to reset
> the cwnd to a lower value due to the connection being idle, then this
> should happen irrespective of MTU probing activity.

This is a hack from the previous version, removed.

>>   int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
>>   {
>>          int mss_now;
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index bde781f46b41..c15ed548a48a 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -2311,10 +2311,23 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)
>>          }
>>
>>          return true;
>>   }
>>
>> +int tcp_mtu_probe_size_needed(struct sock *sk)
>> +{
>> +       struct inet_connection_sock *icsk = inet_csk(sk);
>> +       struct tcp_sock *tp = tcp_sk(sk);
>> +       int probe_size;
>> +       int size_needed;
>> +
>> +       probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high + icsk->icsk_mtup.search_low) >> 1);
>> +       size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
> 
> Adding this helper without refactoring tcp_mtu_probe() to use this
> helper would leave some significant unnecessary code duplication. To
> avoid duplication, AFAICT your tcp_mtu_probe() should call your new
> helper.

Yes. This function should also check if RACK is enabled and try to send 
smaller probes in that case.

> You may also want to make this little helper static inline, since
> AFAICT it is called in some hot paths.

It's only called when tp->icsk_mtup.wait_data is marked true so only for 
a brief while every ten minutes or so. It is possible that this "wait" 
stretches for a long time though.

>>   static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>>                             int push_one, gfp_t gfp)
>>   {
>> +       struct inet_connection_sock *icsk = inet_csk(sk);
>>          struct tcp_sock *tp = tcp_sk(sk);
>> +       struct net *net = sock_net(sk);
>>          struct sk_buff *skb;
>>          unsigned int tso_segs, sent_pkts;
>>          int cwnd_quota;
>>          int result;
>>          bool is_cwnd_limited = false, is_rwnd_limited = false;
>>          u32 max_segs;
>>
>>          sent_pkts = 0;
>>
>>          tcp_mstamp_refresh(tp);
>> +       /*
>> +        * Waiting for tcp probe data also applies when push_one=1
>> +        * If user does many small writes we hold them until we have have enough
>> +        * for a probe.
>> +        */
> 
> This comment seems incomplete; with this patch AFAICT small writes are
> not simply always held unconditionally until there is enough data for
> a full-sized MTU probe. Rather the autocorking mechanism is used, so
> that if the flow if a flow is eligible for MTU probes, autocorking
> attempts to wait for a full amount of data with which to probe (and as
> is the case with autocorking, if this does not become available then
> when the local send queues are empty then the flow sends out whatever
> it has).

Comment is from a previous version, should be deleted.

--
Regards,
Leonard