From: Matt Mathis
Date: Wed, 21 Apr 2021 09:45:42 -0700
Subject: Fwd: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
To: Leonard Crestez
Cc: Willem de Bruijn, Neal Cardwell, Ilya Lesokhin, "David S. Miller",
    Eric Dumazet, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
    Wei Wang, Soheil Hassas Yeganeh, Roopa Prabhu, netdev,
    linux-kernel@vger.kernel.org, Yuchung Cheng
X-Mailing-List: netdev@vger.kernel.org

(Resending in plain text mode)

Surely there is a way to adapt tcp_tso_should_defer(); it is trying to
solve a similar problem.

If I were to implement PLPMTUD today, I would more deeply entwine it
into TCP's support for TSO. For example, successfully deferring segments
sometimes enables TSO and sometimes enables PLPMTUD.

But there is a deeper question: John Heffner and I invested a huge
amount of energy in trying to make PLPMTUD work for opportunistic Jumbo
discovery, only to discover that we had moved the problem down to the
device driver/NIC, where it isn't so readily solvable.

The driver needs to carve NIC buffer memory before it can communicate
with a switch (to either ask or measure the MTU), and once it has done
that it needs to either re-carve the memory or run with suboptimal
carving. Both of these are problematic.

There is also a problem that many link technologies will
non-deterministically deliver jumbo frames at greatly increased error
rates. This issue requires a long conversation on its own.

Thanks,
--MM--
The best way to predict the future is to create it. - Alan Kay

We must not tolerate intolerance;
however our response must be carefully measured:
too strong would be hypocritical and risks spiraling out of control;
too weak risks being mistaken for tacit approval.

On Wed, Apr 21, 2021 at 5:48 AM Neal Cardwell wrote:
>
> On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez wrote:
> >
> > According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
> > in order to accumulate enough data" but linux almost never does that.
> >
> > Linux waits for probe_size + (1 + retries) * mss_cache to be available
> > in the send buffer and if that condition is not met it will send anyway
> > using the current MSS. The feature can be made to work by sending very
> > large chunks of data from userspace (for example 128k) but for small writes
> > on fast links probes almost never happen.
> >
> > This patch tries to implement the "MAY" by adding an extra flag
> > "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
> > insufficient data is available. Then data is held back in
> > tcp_write_xmit until a probe is sent, probing conditions are no longer
> > met, or 500ms pass.
> >
> > Signed-off-by: Leonard Crestez
> >
> > ---
> >  Documentation/networking/ip-sysctl.rst |  4 ++
> >  include/net/inet_connection_sock.h     |  7 +++-
> >  include/net/netns/ipv4.h               |  1 +
> >  include/net/tcp.h                      |  2 +
> >  net/ipv4/sysctl_net_ipv4.c             |  7 ++++
> >  net/ipv4/tcp_ipv4.c                    |  1 +
> >  net/ipv4/tcp_output.c                  | 54 ++++++++++++++++++++++++--
> >  7 files changed, 71 insertions(+), 5 deletions(-)
> >
> > My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
> >
> > This patch makes the test pass quite reliably with
> > ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
> > before it only worked with much higher IPERF_LEN=256k
> >
> > In my loopback tests I also observed another issue when tcp_retries
> > increases because of SACKReorder. This makes the original problem worse
> > (since the retries amount factors in buffer requirement) and seems to be
> > unrelated issue. Maybe when loss happens due to MTU shrinkage the sender
> > sack logic is confused somehow?
> >
> > I know it's towards the end of the cycle but this is mostly just intended for
> > discussion.
>
> Thanks for raising the question of how to trigger PMTU probes more often!
>
> AFAICT this approach would cause unacceptable performance impacts by
> often injecting unnecessary 500ms delays when there is no need to do
> so.
>
> If the goal is to increase the frequency of PMTU probes, which seems
> like a valid goal, I would suggest that we rethink the Linux heuristic
> for triggering PMTU probes in the light of the fact that the loss
> detection mechanism is now RACK-TLP, which provides quick recovery in
> a much wider variety of scenarios.
>
> After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:
>
>    In addition, the timely loss detection algorithms in most protocols
>    have pre-conditions that SHOULD be satisfied before sending a probe.
>
> And we know that the "timely loss detection algorithms" have advanced
> since this RFC was written in 2007.
>
> You mention:
> > Linux waits for probe_size + (1 + retries) * mss_cache to be available
>
> The code in question seems to be:
>
>   size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>
> How about just changing this to:
>
>   size_needed = probe_size + tp->mss_cache;
>
> The rationale would be that if that amount of data is available, then
> the sender can send one probe and one following current-mss-size
> packet. If the path MTU has not increased to allow the probe of size
> probe_size to pass through the network, then the following
> current-mss-size packet will likely pass through the network, generate
> a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
> RACK reorder timer fires.
>
> A secondary rationale for this heuristic would be: if the flow never
> accumulates roughly two packets worth of data, then does the flow
> really need a bigger packet size?
>
> IMHO, just reducing the size_needed seems far preferable to needlessly
> injecting 500ms delays.
>
> best,
> neal