From: Leonard Crestez <lcrestez@drivenets.com>
To: Neal Cardwell <ncardwell@google.com>,
Matt Mathis <mattmathis@google.com>
Cc: "David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Willem de Bruijn <willemb@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
David Ahern <dsahern@kernel.org>,
John Heffner <johnwheffner@gmail.com>,
Soheil Hassas Yeganeh <soheil@google.com>,
Roopa Prabhu <roopa@cumulusnetworks.com>,
Leonard Crestez <cdleonard@gmail.com>,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC] tcp: Consider mtu probing for tcp_xmit_size_goal
Date: Mon, 26 Apr 2021 19:58:50 +0300 [thread overview]
Message-ID: <ec77b414-32f5-b512-7f1a-fb939430cea6@drivenets.com> (raw)
In-Reply-To: <CADVnQykCbF9jcFS6xE9fmaVnimSvY+0yOgGf=THA0-tnnyYPDQ@mail.gmail.com>
On 26.04.2021 18:40, Neal Cardwell wrote:
> Thanks for the patch. Some thoughts, in-line below:
>
> On Sun, Apr 25, 2021 at 10:22 PM Leonard Crestez <lcrestez@drivenets.com> wrote:
>>
>> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
>> in order to accumulate enough data" but linux almost never does that.
>>
>> Linux checks for probe_size + (1 + retries) * mss_cache to be available
>
> I think this means to say "reordering" rather than "retries"?
>
>> in the send buffer and if that condition is not met it will send anyway
>> using the current MSS. The feature can be made to work by sending very
>> large chunks of data from userspace (for example 128k) but for small
>> writes on fast links tcp mtu probes almost never happen.
>>
>> This patch tries to take mtu probe into account in tcp_xmit_size_goal, a
>> function which otherwise attempts to accumulate a packet suitable for
>> TSO. No delays are introduced beyond existing autocork heuristics.
>
> I would suggest phrasing this paragraph as something like:
>
> This patch generalizes tcp_xmit_size_goal(), a
> function which is used in autocorking in order to attempt to
> accumulate suitable transmit sizes, so that it takes into account
> not only TSO and receive window, but now also transmit
> sizes needed for efficient PMTU discovery (when a flow is
> eligible for PMTU discovery).
>
>>
>> Suggested-by: Matt Mathis <mattmathis@google.com>
>> Signed-off-by: Leonard Crestez <lcrestez@drivenets.com>
>> ---
> ...
>> +
>> + /* Timer for wait_data */
>> + struct timer_list wait_data_timer;
>
> This timer_list seems left over from a previous patch version, and no
> longer used?
Yes, I'll most a cleaned-up version.
>> @@ -1375,10 +1376,12 @@ static inline void tcp_slow_start_after_idle_check(struct sock *sk)
>> s32 delta;
>>
>> if (!sock_net(sk)->ipv4.sysctl_tcp_slow_start_after_idle || tp->packets_out ||
>> ca_ops->cong_control)
>> return;
>> + if (inet_csk(sk)->icsk_mtup.wait_data)
>> + return;
>> delta = tcp_jiffies32 - tp->lsndtime;
>> if (delta > inet_csk(sk)->icsk_rto)
>> tcp_cwnd_restart(sk, delta);
>> }
>
> I would argue that the MTU probing heuristics should not override
> congestion control. If congestion control thinks it's proper to reset
> the cwnd to a lower value due to the connection being idle, then this
> should happen irrespective of MTU probing activity.
This is a hack from the previous version, removed.
>> int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
>> {
>> int mss_now;
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index bde781f46b41..c15ed548a48a 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -2311,10 +2311,23 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)
>> }
>>
>> return true;
>> }
>>
>> +int tcp_mtu_probe_size_needed(struct sock *sk)
>> +{
>> + struct inet_connection_sock *icsk = inet_csk(sk);
>> + struct tcp_sock *tp = tcp_sk(sk);
>> + int probe_size;
>> + int size_needed;
>> +
>> + probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high + icsk->icsk_mtup.search_low) >> 1);
>> + size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>
> Adding this helper without refactoring tcp_mtu_probe() to use this
> helper would leave some significant unnecessary code duplication. To
> avoid duplication, AFAICT your tcp_mtu_probe() should call your new
> helper.
Yes. This function should also check if RACK is enabled and try to send
smaller probes in that case.
> You may also want to make this little helper static inline, since
> AFAICT it is called in some hot paths.
It's only called when tp->icsk_mtup.wait_data is marked true so only for
a brief while every ten minutes or so. It is possible that this "wait"
stretches for a long time though.
>> static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>> int push_one, gfp_t gfp)
>> {
>> + struct inet_connection_sock *icsk = inet_csk(sk);
>> struct tcp_sock *tp = tcp_sk(sk);
>> + struct net *net = sock_net(sk);
>> struct sk_buff *skb;
>> unsigned int tso_segs, sent_pkts;
>> int cwnd_quota;
>> int result;
>> bool is_cwnd_limited = false, is_rwnd_limited = false;
>> u32 max_segs;
>>
>> sent_pkts = 0;
>>
>> tcp_mstamp_refresh(tp);
>> + /*
>> + * Waiting for tcp probe data also applies when push_one=1
>> + * If user does many small writes we hold them until we have have enough
>> + * for a probe.
>> + */
>
> This comment seems incomplete; with this patch AFAICT small writes are
> not simply always held unconditionally until there is enough data for
> a full-sized MTU probe. Rather the autocorking mechanism is used, so
> that if the flow if a flow is eligible for MTU probes, autocorking
> attempts to wait for a full amount of data with which to probe (and as
> is the case with autocorking, if this does not become available then
> when the local send queues are empty then the flow sends out whatever
> it has).
Comment is from a previous version, should be deleted.
--
Regards,
Leonard
prev parent reply other threads:[~2021-04-26 17:06 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-26 2:22 [RFC] tcp: Consider mtu probing for tcp_xmit_size_goal Leonard Crestez
2021-04-26 15:40 ` Neal Cardwell
2021-04-26 16:58 ` Leonard Crestez [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ec77b414-32f5-b512-7f1a-fb939430cea6@drivenets.com \
--to=lcrestez@drivenets.com \
--cc=cdleonard@gmail.com \
--cc=davem@davemloft.net \
--cc=dsahern@kernel.org \
--cc=edumazet@google.com \
--cc=johnwheffner@gmail.com \
--cc=kuba@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mattmathis@google.com \
--cc=ncardwell@google.com \
--cc=netdev@vger.kernel.org \
--cc=roopa@cumulusnetworks.com \
--cc=soheil@google.com \
--cc=willemb@google.com \
--cc=yoshfuji@linux-ipv6.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).