From: Eric Dumazet <eric.dumazet@gmail.com>
To: Christoph Paasch <christoph.paasch@gmail.com>,
Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Prout, Andrew - LLSC - MITLL" <aprout@ll.mit.edu>,
David Miller <davem@davemloft.net>,
netdev <netdev@vger.kernel.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Jonathan Looney <jtl@netflix.com>,
Neal Cardwell <ncardwell@google.com>,
Tyler Hicks <tyhicks@canonical.com>,
Yuchung Cheng <ycheng@google.com>,
Bruce Curtis <brucec@netflix.com>,
Jonathan Lemon <jonathan.lemon@gmail.com>,
Dustin Marquess <dmarquess@apple.com>
Subject: Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
Date: Thu, 11 Jul 2019 11:19:45 +0200
Message-ID: <eb6121ea-b02d-672e-25c9-2ad054d49fc7@gmail.com>
In-Reply-To: <B600B3AB-559E-44C1-869C-7309DB28850E@gmail.com>
On 7/11/19 9:28 AM, Christoph Paasch wrote:
>
>
>> On Jul 10, 2019, at 9:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
>>>
>>> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well: reliably with the samba test, and once (not reliably) with a synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
>>>
>>> The synthetic test was a pair of simple send/recv test programs under the following conditions:
>>> -The send socket was non-blocking
>>> -SO_SNDBUF set to 128KiB
>>> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
>>> -Load was on both systems: a while(1) program spinning on each CPU core
>>> -The receiver was on an older unaffected kernel
>>>
>>
>> Setting SO_SNDBUF to 128KB does not leave room to recover from heavy losses,
>> since skbs need to be allocated for retransmits.
>
> Would it make sense to always allow the alloc in tcp_fragment when coming from __tcp_retransmit_skb() through the retransmit timer?
4.15+ kernels have:
	if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
		     tcp_queue != TCP_FRAG_IN_WRITE_QUEUE)) {
Meaning that things like TLP will succeed.
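To make the condition above easier to reason about, here is a minimal user-space sketch of it. This is not kernel code: the enum and the function name fragment_refused() are hypothetical stand-ins, and only the comparison itself is taken from the snippet. It shows that the test never refuses a split on the write queue, and refuses a split elsewhere only once half of the queued memory exceeds the send buffer.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's tcp_queue values. */
enum frag_queue {
	FRAG_IN_WRITE_QUEUE,
	FRAG_IN_RTX_QUEUE,
};

/*
 * User-space model of the 4.15+ condition quoted above:
 * the split is refused when half of the queued memory exceeds
 * the send buffer, unless the skb sits in the write queue.
 */
static bool fragment_refused(int wmem_queued, int sndbuf, enum frag_queue q)
{
	return (wmem_queued >> 1) > sndbuf && q != FRAG_IN_WRITE_QUEUE;
}
```

With a 128 KiB send buffer (131072 bytes), a split on the retransmit queue is refused in this model once more than 256 KiB is queued, while the same split on the write queue is still allowed.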
Anything we add to the TCP stack that lets a socket exceed twice the SO_SNDBUF limit _will_ be exploited at scale.
I am not sure we want to continue to support small SO_SNDBUF values, as they make no sense today.
We use a 64 MB autotuning limit, and set /proc/sys/net/ipv4/tcp_notsent_lowat to 1 MB.
I would rather work (when net-next reopens) on better collapsing at rtx, to reduce the overhead.
Something like:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f6a9c95a48edb234e4d4e21bf585744fbaf9a0a7..d5c85986209cd162cf39edb787b1385cb2c8b630 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2860,7 +2860,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_early_retrans = 3;
 	net->ipv4.sysctl_tcp_recovery = TCP_RACK_LOSS_DETECTION;
 	net->ipv4.sysctl_tcp_slow_start_after_idle = 1; /* By default, RFC2861 behavior. */
-	net->ipv4.sysctl_tcp_retrans_collapse = 1;
+	net->ipv4.sysctl_tcp_retrans_collapse = 3;
 	net->ipv4.sysctl_tcp_max_reordering = 300;
 	net->ipv4.sysctl_tcp_dsack = 1;
 	net->ipv4.sysctl_tcp_app_win = 31;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d61264cf89ef66b229ecf797c1abfb7fcdab009f..05cd264f98b084f62eaf2ef9d6e14a392670d02c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3015,8 +3015,6 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	next_skb_size = next_skb->len;
 
-	BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
-
 	if (next_skb_size) {
 		if (next_skb_size <= skb_availroom(skb))
 			skb_copy_bits(next_skb, 0, skb_put(skb, next_skb_size),
@@ -3054,8 +3052,6 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 /* Check if coalescing SKBs is legal. */
 static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
 {
-	if (tcp_skb_pcount(skb) > 1)
-		return false;
 	if (skb_cloned(skb))
 		return false;
 	/* Some heuristics for collapsing over SACK'd could be invented */
@@ -3114,7 +3110,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned int cur_mss;
-	int diff, len, err;
+	int diff, len, maxlen, err;
 
 
 	/* Inconclusive MTU probe */
@@ -3165,12 +3161,13 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 			return -ENOMEM;
 
 		diff = tcp_skb_pcount(skb);
+		maxlen = (sock_net(sk)->ipv4.sysctl_tcp_retrans_collapse & 2) ? len : cur_mss;
+		if (skb->len < maxlen)
+			tcp_retrans_try_collapse(sk, skb, maxlen);
 		tcp_set_skb_tso_segs(skb, cur_mss);
 		diff -= tcp_skb_pcount(skb);
 		if (diff)
 			tcp_adjust_pcount(sk, skb, diff);
-		if (skb->len < cur_mss)
-			tcp_retrans_try_collapse(sk, skb, cur_mss);
 	}
 
 	/* RFC3168, section 6.1.1.1. ECN fallback */
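As a rough illustration of the maxlen selection in the last hunk, the following user-space sketch isolates that one expression. The helper name collapse_maxlen() is hypothetical, and the example values assume that len covers several segments (for instance cur_mss multiplied by the number of segments to retransmit), which is an assumption about the surrounding function rather than something the diff shows.

```c
/*
 * Hypothetical helper mirroring the maxlen line of the diff above:
 * when bit 1 (value 2) of the tcp_retrans_collapse setting is set,
 * collapsing may fill the skb up to len; otherwise it stops at one MSS.
 */
static int collapse_maxlen(int retrans_collapse, int len, int cur_mss)
{
	return (retrans_collapse & 2) ? len : cur_mss;
}
```

With the current default of 1, collapse_maxlen(1, 14480, 1448) yields 1448, so collapsing stays limited to a single MSS. With the proposed default of 3, collapse_maxlen(3, 14480, 1448) yields 14480.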