* [PATCH net 0/4] tcp: make sack processing more robust
@ 2019-06-17 17:03 Eric Dumazet
  2019-06-17 17:03 ` [PATCH net 1/4] tcp: limit payload size of sacked skbs Eric Dumazet
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Eric Dumazet @ 2019-06-17 17:03 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis, Jonathan Lemon

Jonathan Looney brought to our attention multiple problems
in the TCP stack on the sender side.

SACK processing can be abused by malicious peers to either
cause overflows or inflate memory usage.

The first two patches fix the immediate problems.

Since malicious peers abuse senders by advertising a very
small MSS in their SYN or SYNACK packet, the last two
patches add a new sysctl so that admins can choose a higher
limit for MSS clamping.
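
For reference, a minimal sketch (not part of this series) of how an
admin could raise the new limit once the series is applied. The proc
path follows from the sysctl added in patch 3, which restricts the
accepted range to [48, 65535]; the value 536 (the classic IPv4 default
MSS) is only an example, not a recommendation:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/tcp_min_snd_mss", "w");

	if (!f) {
		perror("tcp_min_snd_mss");
		return 1;
	}
	fprintf(f, "536\n");	/* example value only */
	return fclose(f) ? 1 : 0;
}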

Eric Dumazet (4):
  tcp: limit payload size of sacked skbs
  tcp: tcp_fragment() should apply sane memory limits
  tcp: add tcp_min_snd_mss sysctl
  tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()

 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/linux/tcp.h                    |  4 ++++
 include/net/netns/ipv4.h               |  1 +
 include/net/tcp.h                      |  2 ++
 include/uapi/linux/snmp.h              |  1 +
 net/ipv4/proc.c                        |  1 +
 net/ipv4/sysctl_net_ipv4.c             | 11 +++++++++++
 net/ipv4/tcp.c                         |  1 +
 net/ipv4/tcp_input.c                   | 26 ++++++++++++++++++++------
 net/ipv4/tcp_ipv4.c                    |  1 +
 net/ipv4/tcp_output.c                  | 10 +++++++---
 net/ipv4/tcp_timer.c                   |  1 +
 12 files changed, 58 insertions(+), 9 deletions(-)

-- 
2.22.0.410.gd8fdbe21b5-goog



* [PATCH net 1/4] tcp: limit payload size of sacked skbs
  2019-06-17 17:03 [PATCH net 0/4] tcp: make sack processing more robust Eric Dumazet
@ 2019-06-17 17:03 ` Eric Dumazet
  2019-06-17 17:14   ` Jonathan Lemon
  2019-06-17 17:03 ` [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits Eric Dumazet
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2019-06-17 17:03 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis, Jonathan Lemon

Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :

	BUG_ON(tcp_skb_pcount(skb) < pcount);

This can happen if the remote peer has advertised the smallest
MSS that Linux TCP accepts: 48.

An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.

This means that the 16-bit width of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.

Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.
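
As a back-of-the-envelope check of that overflow (a standalone sketch
using the x86 numbers above, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int payload  = 17 * 32 * 1024;	/* max payload in one skb */
	unsigned int gso_size = 48 - 40;	/* 8 bytes left after options */
	unsigned int segs     = payload / gso_size;

	printf("segments needed : %u (u16 max is 65535)\n", segs);
	printf("stored in a u16 : %u\n", (unsigned short)segs);
	return 0;
}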

CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs

Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/tcp.h   |  4 ++++
 include/net/tcp.h     |  2 ++
 net/ipv4/tcp.c        |  1 +
 net/ipv4/tcp_input.c  | 26 ++++++++++++++++++++------
 net/ipv4/tcp_output.c |  6 +++---
 5 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 711361af9ce019f08c8b6accc33220b673b34d56..9a478a0cd3a20b40ed344f178e35228a0b8ee203 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -484,4 +484,8 @@ static inline u16 tcp_mss_clamp(const struct tcp_sock *tp, u16 mss)
 
 	return (user_mss && user_mss < mss) ? user_mss : mss;
 }
+
+int tcp_skb_shift(struct sk_buff *to, struct sk_buff *from, int pcount,
+		  int shiftlen);
+
 #endif	/* _LINUX_TCP_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index ac2f53fbfa6b4cbf1fc615c952a5e1cac1124300..582c0caa98116740b5bde8c5dbb5d94fc69d1caa 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -51,6 +51,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 #define MAX_TCP_HEADER	(128 + MAX_HEADER)
 #define MAX_TCP_OPTION_SPACE 40
+#define TCP_MIN_SND_MSS		48
+#define TCP_MIN_GSO_SIZE	(TCP_MIN_SND_MSS - MAX_TCP_OPTION_SPACE)
 
 /*
  * Never offer a window over 32767 without using window scaling. Some
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f448a288d158c4baa6d8d5ed82c4b129404233a1..7dc9ab84bb69aa90953e98f9763287fcee3a1659 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3873,6 +3873,7 @@ void __init tcp_init(void)
 	unsigned long limit;
 	unsigned int i;
 
+	BUILD_BUG_ON(TCP_MIN_SND_MSS <= MAX_TCP_OPTION_SPACE);
 	BUILD_BUG_ON(sizeof(struct tcp_skb_cb) >
 		     FIELD_SIZEOF(struct sk_buff, cb));
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 38dfc308c0fb9832facadb0aeec8f3e4931901f4..d95ee40df6c2b020d590018bc41833b8a6aefa4a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1302,7 +1302,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *prev,
 	TCP_SKB_CB(skb)->seq += shifted;
 
 	tcp_skb_pcount_add(prev, pcount);
-	BUG_ON(tcp_skb_pcount(skb) < pcount);
+	WARN_ON_ONCE(tcp_skb_pcount(skb) < pcount);
 	tcp_skb_pcount_add(skb, -pcount);
 
 	/* When we're adding to gso_segs == 1, gso_size will be zero,
@@ -1368,6 +1368,21 @@ static int skb_can_shift(const struct sk_buff *skb)
 	return !skb_headlen(skb) && skb_is_nonlinear(skb);
 }
 
+int tcp_skb_shift(struct sk_buff *to, struct sk_buff *from,
+		  int pcount, int shiftlen)
+{
+	/* TCP min gso_size is 8 bytes (TCP_MIN_GSO_SIZE)
+	 * Since TCP_SKB_CB(skb)->tcp_gso_segs is 16 bits, we need
+	 * to make sure not storing more than 65535 * 8 bytes per skb,
+	 * even if current MSS is bigger.
+	 */
+	if (unlikely(to->len + shiftlen >= 65535 * TCP_MIN_GSO_SIZE))
+		return 0;
+	if (unlikely(tcp_skb_pcount(to) + pcount > 65535))
+		return 0;
+	return skb_shift(to, from, shiftlen);
+}
+
 /* Try collapsing SACK blocks spanning across multiple skbs to a single
  * skb.
  */
@@ -1473,7 +1488,7 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
 	if (!after(TCP_SKB_CB(skb)->seq + len, tp->snd_una))
 		goto fallback;
 
-	if (!skb_shift(prev, skb, len))
+	if (!tcp_skb_shift(prev, skb, pcount, len))
 		goto fallback;
 	if (!tcp_shifted_skb(sk, prev, skb, state, pcount, len, mss, dup_sack))
 		goto out;
@@ -1491,11 +1506,10 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
 		goto out;
 
 	len = skb->len;
-	if (skb_shift(prev, skb, len)) {
-		pcount += tcp_skb_pcount(skb);
-		tcp_shifted_skb(sk, prev, skb, state, tcp_skb_pcount(skb),
+	pcount = tcp_skb_pcount(skb);
+	if (tcp_skb_shift(prev, skb, pcount, len))
+		tcp_shifted_skb(sk, prev, skb, state, pcount,
 				len, mss, 0);
-	}
 
 out:
 	return prev;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f429e856e2631a9e6de1d2e060406742f97e538e..b8e3bbb852117459d131fbb41d69ae63bd251a3e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1454,8 +1454,8 @@ static inline int __tcp_mtu_to_mss(struct sock *sk, int pmtu)
 	mss_now -= icsk->icsk_ext_hdr_len;
 
 	/* Then reserve room for full set of TCP options and 8 bytes of data */
-	if (mss_now < 48)
-		mss_now = 48;
+	if (mss_now < TCP_MIN_SND_MSS)
+		mss_now = TCP_MIN_SND_MSS;
 	return mss_now;
 }
 
@@ -2747,7 +2747,7 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 		if (next_skb_size <= skb_availroom(skb))
 			skb_copy_bits(next_skb, 0, skb_put(skb, next_skb_size),
 				      next_skb_size);
-		else if (!skb_shift(skb, next_skb, next_skb_size))
+		else if (!tcp_skb_shift(skb, next_skb, 1, next_skb_size))
 			return false;
 	}
 	tcp_highest_sack_replace(sk, next_skb, skb);
-- 
2.22.0.410.gd8fdbe21b5-goog



* [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-06-17 17:03 [PATCH net 0/4] tcp: make sack processing more robust Eric Dumazet
  2019-06-17 17:03 ` [PATCH net 1/4] tcp: limit payload size of sacked skbs Eric Dumazet
@ 2019-06-17 17:03 ` Eric Dumazet
  2019-06-17 17:14   ` Jonathan Lemon
  2019-06-18  0:18   ` Christoph Paasch
  2019-06-17 17:03 ` [PATCH net 3/4] tcp: add tcp_min_snd_mss sysctl Eric Dumazet
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 36+ messages in thread
From: Eric Dumazet @ 2019-06-17 17:03 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis, Jonathan Lemon

Jonathan Looney reported that a malicious peer can force a sender
to fragment its retransmit queue into tiny skbs, inflating memory
usage and/or overflowing 32-bit counters.

TCP allows an application to queue up to sk_sndbuf bytes,
so we need to give some allowance for non-malicious splitting
of the retransmit queue.

A new SNMP counter is added to monitor how many times TCP
refused to split an skb because the allowance was exceeded.

Note that this counter might increase when applications
use the SO_SNDBUF socket option to lower sk_sndbuf.
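
To give a rough sense of the amplification being prevented, a
standalone sketch; the ~700 bytes of per-skb truesize overhead is an
assumed ballpark figure, not a measured kernel value:

#include <stdio.h>

int main(void)
{
	unsigned int payload = 64 * 1024;	/* one large skb            */
	unsigned int seg     = 8;		/* tiny forced payload/skb  */
	unsigned int per_skb = 700;		/* assumed overhead per skb */
	unsigned int nskbs   = payload / seg;

	printf("%u skbs, ~%u bytes charged to sk_wmem_queued\n",
	       nskbs, nskbs * (seg + per_skb));
	return 0;
}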

CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
	socket is already using more than half the allowed space

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/uapi/linux/snmp.h | 1 +
 net/ipv4/proc.c           | 1 +
 net/ipv4/tcp_output.c     | 5 +++++
 3 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 86dc24a96c90ab047d5173d625450facd6c6dd79..fd42c1316d3d112ecd8a00d2b499d6f6901c5e81 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -283,6 +283,7 @@ enum
 	LINUX_MIB_TCPACKCOMPRESSED,		/* TCPAckCompressed */
 	LINUX_MIB_TCPZEROWINDOWDROP,		/* TCPZeroWindowDrop */
 	LINUX_MIB_TCPRCVQDROP,			/* TCPRcvQDrop */
+	LINUX_MIB_TCPWQUEUETOOBIG,		/* TCPWqueueTooBig */
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 4370f4246e86dfe06a9e07cace848baeaf6cc4da..073273b751f8fcda1c9c79cd1ab566f2939b2517 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -287,6 +287,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPAckCompressed", LINUX_MIB_TCPACKCOMPRESSED),
 	SNMP_MIB_ITEM("TCPZeroWindowDrop", LINUX_MIB_TCPZEROWINDOWDROP),
 	SNMP_MIB_ITEM("TCPRcvQDrop", LINUX_MIB_TCPRCVQDROP),
+	SNMP_MIB_ITEM("TCPWqueueTooBig", LINUX_MIB_TCPWQUEUETOOBIG),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b8e3bbb852117459d131fbb41d69ae63bd251a3e..1bb1c46b4abad100622d3f101a0a3ca0a6c8e881 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1296,6 +1296,11 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 	if (nsize < 0)
 		nsize = 0;
 
+	if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
+		return -ENOMEM;
+	}
+
 	if (skb_unclone(skb, gfp))
 		return -ENOMEM;
 
-- 
2.22.0.410.gd8fdbe21b5-goog



* [PATCH net 3/4] tcp: add tcp_min_snd_mss sysctl
  2019-06-17 17:03 [PATCH net 0/4] tcp: make sack processing more robust Eric Dumazet
  2019-06-17 17:03 ` [PATCH net 1/4] tcp: limit payload size of sacked skbs Eric Dumazet
  2019-06-17 17:03 ` [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits Eric Dumazet
@ 2019-06-17 17:03 ` Eric Dumazet
  2019-06-17 17:15   ` Jonathan Lemon
  2019-06-17 17:18   ` Tyler Hicks
  2019-06-17 17:03 ` [PATCH net 4/4] tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() Eric Dumazet
  2019-06-17 17:41 ` [PATCH net 0/4] tcp: make sack processing more robust David Miller
  4 siblings, 2 replies; 36+ messages in thread
From: Eric Dumazet @ 2019-06-17 17:03 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis, Jonathan Lemon

Some TCP peers announce a very small MSS option in their SYN and/or
SYN/ACK messages.

This forces the stack to send packets with a very high network/cpu
overhead.

Linux has enforced a minimal value of 48. Since this value includes
the size of TCP options, and the options can consume up to 40
bytes, each segment can carry only 8 bytes of payload.

In some cases, it can be useful to increase the minimal value
to a saner one.

We keep the default at 48 (TCP_MIN_SND_MSS) for compatibility
reasons.

Note that the TCP_MAXSEG socket option enforces a minimal value
of TCP_MIN_MSS. David Miller increased this minimal value
from 64 to 88 in commit c39508d6f118
("tcp: Make TCP_MAXSEG minimum more correct.").

We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
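
For a concrete sense of the overhead a 48-byte MSS implies, a small
standalone sketch; header sizes are the usual IPv4/TCP values, and it
assumes all 40 reserved option bytes are present in every segment,
matching the reservation described above:

#include <stdio.h>

int main(void)
{
	unsigned int payload_per_seg = 48 - 40;		/* 8 bytes     */
	unsigned int hdrs = 20 /* IP */ + 20 /* TCP */ + 40 /* options */;
	unsigned int to_send = 1024 * 1024;		/* 1MB of data */
	unsigned int pkts = to_send / payload_per_seg;

	printf("%u packets, %u bytes on the wire for 1MB of payload\n",
	       pkts, pkts * (hdrs + payload_per_seg));
	return 0;
}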

CVE-2019-11479 -- tcp mss hardcoded to 48

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |  8 ++++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             | 11 +++++++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 net/ipv4/tcp_output.c                  |  3 +--
 5 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 288aa264ac26d98637a5bb1babc334bfc699bef1..22f6b8b1110ad20c36e7ceea6d67fd2cc938eb7b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -255,6 +255,14 @@ tcp_base_mss - INTEGER
 	Path MTU discovery (MTU probing).  If MTU probing is enabled,
 	this is the initial MSS used by the connection.
 
+tcp_min_snd_mss - INTEGER
+	TCP SYN and SYNACK messages usually advertise an ADVMSS option,
+	as described in RFC 1122 and RFC 6691.
+	If this ADVMSS option is smaller than tcp_min_snd_mss,
+	it is silently capped to tcp_min_snd_mss.
+
+	Default : 48 (at least 8 bytes of payload per segment)
+
 tcp_congestion_control - STRING
 	Set the congestion control algorithm to be used for new
 	connections. The algorithm "reno" is always available, but
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 7698460a3dd1e5070e12d406b3ee58834688cdc9..623cfbb7b8dcbb2a6d8325ec010aff78bbdf8839 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -117,6 +117,7 @@ struct netns_ipv4 {
 #endif
 	int sysctl_tcp_mtu_probing;
 	int sysctl_tcp_base_mss;
+	int sysctl_tcp_min_snd_mss;
 	int sysctl_tcp_probe_threshold;
 	u32 sysctl_tcp_probe_interval;
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index fa213bd8e233b577114815ca2227f08264e7df06..b6f14af926faf80f1686549bee7154c584dc63e6 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -39,6 +39,8 @@ static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
 static int tcp_adv_win_scale_min = -31;
 static int tcp_adv_win_scale_max = 31;
+static int tcp_min_snd_mss_min = TCP_MIN_SND_MSS;
+static int tcp_min_snd_mss_max = 65535;
 static int ip_privileged_port_min;
 static int ip_privileged_port_max = 65535;
 static int ip_ttl_min = 1;
@@ -769,6 +771,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "tcp_min_snd_mss",
+		.data		= &init_net.ipv4.sysctl_tcp_min_snd_mss,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_min_snd_mss_min,
+		.extra2		= &tcp_min_snd_mss_max,
+	},
 	{
 		.procname	= "tcp_probe_threshold",
 		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index bc86f9735f4577d50d94f42b10edb6ba95bb7a05..cfa81190a1b1af30d05f4f6cd84c05b025a6afeb 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2628,6 +2628,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_ecn_fallback = 1;
 
 	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
+	net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1bb1c46b4abad100622d3f101a0a3ca0a6c8e881..00c01a01b547ec67c971dc25a74c9258563cf871 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1459,8 +1459,7 @@ static inline int __tcp_mtu_to_mss(struct sock *sk, int pmtu)
 	mss_now -= icsk->icsk_ext_hdr_len;
 
 	/* Then reserve room for full set of TCP options and 8 bytes of data */
-	if (mss_now < TCP_MIN_SND_MSS)
-		mss_now = TCP_MIN_SND_MSS;
+	mss_now = max(mss_now, sock_net(sk)->ipv4.sysctl_tcp_min_snd_mss);
 	return mss_now;
 }
 
-- 
2.22.0.410.gd8fdbe21b5-goog



* [PATCH net 4/4] tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  2019-06-17 17:03 [PATCH net 0/4] tcp: make sack processing more robust Eric Dumazet
                   ` (2 preceding siblings ...)
  2019-06-17 17:03 ` [PATCH net 3/4] tcp: add tcp_min_snd_mss sysctl Eric Dumazet
@ 2019-06-17 17:03 ` Eric Dumazet
  2019-06-17 17:16   ` Jonathan Lemon
  2019-06-17 17:18   ` Tyler Hicks
  2019-06-17 17:41 ` [PATCH net 0/4] tcp: make sack processing more robust David Miller
  4 siblings, 2 replies; 36+ messages in thread
From: Eric Dumazet @ 2019-06-17 17:03 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis, Jonathan Lemon

If MTU probing is enabled, tcp_mtu_probing() could very well end up
with a too-small MSS.

Use the new sysctl tcp_min_snd_mss to make sure the MSS search
is performed in an acceptable range.

CVE-2019-11479 -- tcp mss hardcoded to 48

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
---
 net/ipv4/tcp_timer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 5bad937ce779ef8dca42a26dcbb5f1d60a571c73..c801cd37cc2a9c11f2dd4b9681137755e501a538 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -155,6 +155,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 		mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
 		mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
 		mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
+		mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
 		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
 	}
 	tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
-- 
2.22.0.410.gd8fdbe21b5-goog



* Re: [PATCH net 1/4] tcp: limit payload size of sacked skbs
  2019-06-17 17:03 ` [PATCH net 1/4] tcp: limit payload size of sacked skbs Eric Dumazet
@ 2019-06-17 17:14   ` Jonathan Lemon
  0 siblings, 0 replies; 36+ messages in thread
From: Jonathan Lemon @ 2019-06-17 17:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis



On 17 Jun 2019, at 10:03, Eric Dumazet wrote:

> Jonathan Looney reported that TCP can trigger the following crash
> in tcp_shifted_skb() :
>
> 	BUG_ON(tcp_skb_pcount(skb) < pcount);
>
> This can happen if the remote peer has advertised the smallest
> MSS that Linux TCP accepts: 48.
>
> An skb can hold 17 fragments, and each fragment can hold 32KB
> on x86, or 64KB on PowerPC.
>
> This means that the 16-bit width of TCP_SKB_CB(skb)->tcp_gso_segs
> can overflow.
>
> Note that tcp_sendmsg() builds skbs with less than 64KB
> of payload, so this problem needs SACK to be enabled.
> SACK blocks allow TCP to coalesce multiple skbs in the retransmit
> queue, thus filling the 17 fragments to maximal capacity.
>
> CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
>
> Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Jonathan Looney <jtl@netflix.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>


* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-06-17 17:03 ` [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits Eric Dumazet
@ 2019-06-17 17:14   ` Jonathan Lemon
  2019-06-18  0:18   ` Christoph Paasch
  1 sibling, 0 replies; 36+ messages in thread
From: Jonathan Lemon @ 2019-06-17 17:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis



On 17 Jun 2019, at 10:03, Eric Dumazet wrote:

> Jonathan Looney reported that a malicious peer can force a sender
> to fragment its retransmit queue into tiny skbs, inflating memory
> usage and/or overflowing 32-bit counters.
>
> TCP allows an application to queue up to sk_sndbuf bytes,
> so we need to give some allowance for non-malicious splitting
> of the retransmit queue.
>
> A new SNMP counter is added to monitor how many times TCP
> refused to split an skb because the allowance was exceeded.
>
> Note that this counter might increase when applications
> use the SO_SNDBUF socket option to lower sk_sndbuf.
>
> CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
> 	socket is already using more than half the allowed space
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Jonathan Looney <jtl@netflix.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
> Cc: Bruce Curtis <brucec@netflix.com>
> Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>


* Re: [PATCH net 3/4] tcp: add tcp_min_snd_mss sysctl
  2019-06-17 17:03 ` [PATCH net 3/4] tcp: add tcp_min_snd_mss sysctl Eric Dumazet
@ 2019-06-17 17:15   ` Jonathan Lemon
  2019-06-17 17:18   ` Tyler Hicks
  1 sibling, 0 replies; 36+ messages in thread
From: Jonathan Lemon @ 2019-06-17 17:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis



On 17 Jun 2019, at 10:03, Eric Dumazet wrote:

> Some TCP peers announce a very small MSS option in their SYN and/or
> SYN/ACK messages.
>
> This forces the stack to send packets with a very high network/cpu
> overhead.
>
> Linux has enforced a minimal value of 48. Since this value includes
> the size of TCP options, and the options can consume up to 40
> bytes, each segment can carry only 8 bytes of payload.
>
> In some cases, it can be useful to increase the minimal value
> to a saner one.
>
> We keep the default at 48 (TCP_MIN_SND_MSS) for compatibility
> reasons.
>
> Note that the TCP_MAXSEG socket option enforces a minimal value
> of TCP_MIN_MSS. David Miller increased this minimal value
> from 64 to 88 in commit c39508d6f118
> ("tcp: Make TCP_MAXSEG minimum more correct.").
>
> We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
>
> CVE-2019-11479 -- tcp mss hardcoded to 48
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Suggested-by: Jonathan Looney <jtl@netflix.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Tyler Hicks <tyhicks@canonical.com>
> Cc: Bruce Curtis <brucec@netflix.com>
> Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>


* Re: [PATCH net 4/4] tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  2019-06-17 17:03 ` [PATCH net 4/4] tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() Eric Dumazet
@ 2019-06-17 17:16   ` Jonathan Lemon
  2019-06-17 17:18   ` Tyler Hicks
  1 sibling, 0 replies; 36+ messages in thread
From: Jonathan Lemon @ 2019-06-17 17:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis



On 17 Jun 2019, at 10:03, Eric Dumazet wrote:

> If MTU probing is enabled, tcp_mtu_probing() could very well end up
> with a too-small MSS.
>
> Use the new sysctl tcp_min_snd_mss to make sure the MSS search
> is performed in an acceptable range.
>
> CVE-2019-11479 -- tcp mss hardcoded to 48
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> Cc: Jonathan Looney <jtl@netflix.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Tyler Hicks <tyhicks@canonical.com>
> Cc: Bruce Curtis <brucec@netflix.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>


* Re: [PATCH net 3/4] tcp: add tcp_min_snd_mss sysctl
  2019-06-17 17:03 ` [PATCH net 3/4] tcp: add tcp_min_snd_mss sysctl Eric Dumazet
  2019-06-17 17:15   ` Jonathan Lemon
@ 2019-06-17 17:18   ` Tyler Hicks
  1 sibling, 0 replies; 36+ messages in thread
From: Tyler Hicks @ 2019-06-17 17:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon

On 2019-06-17 10:03:53, Eric Dumazet wrote:
> Some TCP peers announce a very small MSS option in their SYN and/or
> SYN/ACK messages.
> 
> This forces the stack to send packets with a very high network/cpu
> overhead.
> 
> Linux has enforced a minimal value of 48. Since this value includes
> the size of TCP options, and the options can consume up to 40
> bytes, each segment can carry only 8 bytes of payload.
> 
> In some cases, it can be useful to increase the minimal value
> to a saner one.
> 
> We keep the default at 48 (TCP_MIN_SND_MSS) for compatibility
> reasons.
> 
> Note that the TCP_MAXSEG socket option enforces a minimal value
> of TCP_MIN_MSS. David Miller increased this minimal value
> from 64 to 88 in commit c39508d6f118
> ("tcp: Make TCP_MAXSEG minimum more correct.").
> 
> We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
> 
> CVE-2019-11479 -- tcp mss hardcoded to 48
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Suggested-by: Jonathan Looney <jtl@netflix.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Tyler Hicks <tyhicks@canonical.com>

I've given the two sysctl patches a close review and some testing.

Acked-by: Tyler Hicks <tyhicks@canonical.com>

Tyler

> Cc: Bruce Curtis <brucec@netflix.com>
> Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  Documentation/networking/ip-sysctl.txt |  8 ++++++++
>  include/net/netns/ipv4.h               |  1 +
>  net/ipv4/sysctl_net_ipv4.c             | 11 +++++++++++
>  net/ipv4/tcp_ipv4.c                    |  1 +
>  net/ipv4/tcp_output.c                  |  3 +--
>  5 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 288aa264ac26d98637a5bb1babc334bfc699bef1..22f6b8b1110ad20c36e7ceea6d67fd2cc938eb7b 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -255,6 +255,14 @@ tcp_base_mss - INTEGER
>  	Path MTU discovery (MTU probing).  If MTU probing is enabled,
>  	this is the initial MSS used by the connection.
>  
> +tcp_min_snd_mss - INTEGER
> +	TCP SYN and SYNACK messages usually advertise an ADVMSS option,
> +	as described in RFC 1122 and RFC 6691.
> +	If this ADVMSS option is smaller than tcp_min_snd_mss,
> +	it is silently capped to tcp_min_snd_mss.
> +
> +	Default : 48 (at least 8 bytes of payload per segment)
> +
>  tcp_congestion_control - STRING
>  	Set the congestion control algorithm to be used for new
>  	connections. The algorithm "reno" is always available, but
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 7698460a3dd1e5070e12d406b3ee58834688cdc9..623cfbb7b8dcbb2a6d8325ec010aff78bbdf8839 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -117,6 +117,7 @@ struct netns_ipv4 {
>  #endif
>  	int sysctl_tcp_mtu_probing;
>  	int sysctl_tcp_base_mss;
> +	int sysctl_tcp_min_snd_mss;
>  	int sysctl_tcp_probe_threshold;
>  	u32 sysctl_tcp_probe_interval;
>  
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index fa213bd8e233b577114815ca2227f08264e7df06..b6f14af926faf80f1686549bee7154c584dc63e6 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -39,6 +39,8 @@ static int ip_local_port_range_min[] = { 1, 1 };
>  static int ip_local_port_range_max[] = { 65535, 65535 };
>  static int tcp_adv_win_scale_min = -31;
>  static int tcp_adv_win_scale_max = 31;
> +static int tcp_min_snd_mss_min = TCP_MIN_SND_MSS;
> +static int tcp_min_snd_mss_max = 65535;
>  static int ip_privileged_port_min;
>  static int ip_privileged_port_max = 65535;
>  static int ip_ttl_min = 1;
> @@ -769,6 +771,15 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +	{
> +		.procname	= "tcp_min_snd_mss",
> +		.data		= &init_net.ipv4.sysctl_tcp_min_snd_mss,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &tcp_min_snd_mss_min,
> +		.extra2		= &tcp_min_snd_mss_max,
> +	},
>  	{
>  		.procname	= "tcp_probe_threshold",
>  		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index bc86f9735f4577d50d94f42b10edb6ba95bb7a05..cfa81190a1b1af30d05f4f6cd84c05b025a6afeb 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2628,6 +2628,7 @@ static int __net_init tcp_sk_init(struct net *net)
>  	net->ipv4.sysctl_tcp_ecn_fallback = 1;
>  
>  	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
> +	net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
>  	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
>  	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
>  
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 1bb1c46b4abad100622d3f101a0a3ca0a6c8e881..00c01a01b547ec67c971dc25a74c9258563cf871 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1459,8 +1459,7 @@ static inline int __tcp_mtu_to_mss(struct sock *sk, int pmtu)
>  	mss_now -= icsk->icsk_ext_hdr_len;
>  
>  	/* Then reserve room for full set of TCP options and 8 bytes of data */
> -	if (mss_now < TCP_MIN_SND_MSS)
> -		mss_now = TCP_MIN_SND_MSS;
> +	mss_now = max(mss_now, sock_net(sk)->ipv4.sysctl_tcp_min_snd_mss);
>  	return mss_now;
>  }
>  
> -- 
> 2.22.0.410.gd8fdbe21b5-goog
> 


* Re: [PATCH net 4/4] tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  2019-06-17 17:03 ` [PATCH net 4/4] tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() Eric Dumazet
  2019-06-17 17:16   ` Jonathan Lemon
@ 2019-06-17 17:18   ` Tyler Hicks
  1 sibling, 0 replies; 36+ messages in thread
From: Tyler Hicks @ 2019-06-17 17:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon

On 2019-06-17 10:03:54, Eric Dumazet wrote:
> If MTU probing is enabled, tcp_mtu_probing() could very well end up
> with a too-small MSS.
> 
> Use the new sysctl tcp_min_snd_mss to make sure the MSS search
> is performed in an acceptable range.
> 
> CVE-2019-11479 -- tcp mss hardcoded to 48
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> Cc: Jonathan Looney <jtl@netflix.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Tyler Hicks <tyhicks@canonical.com>

As mentioned for the other sysctl patch, I've given the two sysctl
patches a close review and some testing.

Acked-by: Tyler Hicks <tyhicks@canonical.com>

Tyler

> Cc: Bruce Curtis <brucec@netflix.com>
> ---
>  net/ipv4/tcp_timer.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 5bad937ce779ef8dca42a26dcbb5f1d60a571c73..c801cd37cc2a9c11f2dd4b9681137755e501a538 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -155,6 +155,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
>  		mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
>  		mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>  		mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
> +		mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
>  		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>  	}
>  	tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
> -- 
> 2.22.0.410.gd8fdbe21b5-goog
> 


* Re: [PATCH net 0/4] tcp: make sack processing more robust
  2019-06-17 17:03 [PATCH net 0/4] tcp: make sack processing more robust Eric Dumazet
                   ` (3 preceding siblings ...)
  2019-06-17 17:03 ` [PATCH net 4/4] tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() Eric Dumazet
@ 2019-06-17 17:41 ` David Miller
  4 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2019-06-17 17:41 UTC (permalink / raw)
  To: edumazet
  Cc: netdev, eric.dumazet, gregkh, jtl, ncardwell, tyhicks, ycheng,
	brucec, jonathan.lemon

From: Eric Dumazet <edumazet@google.com>
Date: Mon, 17 Jun 2019 10:03:50 -0700

> Jonathan Looney brought to our attention multiple problems
> in the TCP stack on the sender side.
> 
> SACK processing can be abused by malicious peers to either
> cause overflows or inflate memory usage.
> 
> The first two patches fix the immediate problems.
> 
> Since malicious peers abuse senders by advertising a very
> small MSS in their SYN or SYNACK packet, the last two
> patches add a new sysctl so that admins can choose a higher
> limit for MSS clamping.

Series applied, thanks Eric.


* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-06-17 17:03 ` [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits Eric Dumazet
  2019-06-17 17:14   ` Jonathan Lemon
@ 2019-06-18  0:18   ` Christoph Paasch
  2019-06-18  2:28     ` Eric Dumazet
  1 sibling, 1 reply; 36+ messages in thread
From: Christoph Paasch @ 2019-06-18  0:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis, Jonathan Lemon, Dustin Marquess

On Mon, Jun 17, 2019 at 10:05 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Jonathan Looney reported that a malicious peer can force a sender
> to fragment its retransmit queue into tiny skbs, inflating memory
> usage and/or overflowing 32-bit counters.
>
> TCP allows an application to queue up to sk_sndbuf bytes,
> so we need to give some allowance for non-malicious splitting
> of the retransmit queue.
>
> A new SNMP counter is added to monitor how many times TCP
> refused to split an skb because the allowance was exceeded.
>
> Note that this counter might increase when applications
> use the SO_SNDBUF socket option to lower sk_sndbuf.
>
> CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
>         socket is already using more than half the allowed space
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Jonathan Looney <jtl@netflix.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
> Cc: Bruce Curtis <brucec@netflix.com>
> Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  include/uapi/linux/snmp.h | 1 +
>  net/ipv4/proc.c           | 1 +
>  net/ipv4/tcp_output.c     | 5 +++++
>  3 files changed, 7 insertions(+)
>
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index 86dc24a96c90ab047d5173d625450facd6c6dd79..fd42c1316d3d112ecd8a00d2b499d6f6901c5e81 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -283,6 +283,7 @@ enum
>         LINUX_MIB_TCPACKCOMPRESSED,             /* TCPAckCompressed */
>         LINUX_MIB_TCPZEROWINDOWDROP,            /* TCPZeroWindowDrop */
>         LINUX_MIB_TCPRCVQDROP,                  /* TCPRcvQDrop */
> +       LINUX_MIB_TCPWQUEUETOOBIG,              /* TCPWqueueTooBig */
>         __LINUX_MIB_MAX
>  };
>
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index 4370f4246e86dfe06a9e07cace848baeaf6cc4da..073273b751f8fcda1c9c79cd1ab566f2939b2517 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -287,6 +287,7 @@ static const struct snmp_mib snmp4_net_list[] = {
>         SNMP_MIB_ITEM("TCPAckCompressed", LINUX_MIB_TCPACKCOMPRESSED),
>         SNMP_MIB_ITEM("TCPZeroWindowDrop", LINUX_MIB_TCPZEROWINDOWDROP),
>         SNMP_MIB_ITEM("TCPRcvQDrop", LINUX_MIB_TCPRCVQDROP),
> +       SNMP_MIB_ITEM("TCPWqueueTooBig", LINUX_MIB_TCPWQUEUETOOBIG),
>         SNMP_MIB_SENTINEL
>  };
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b8e3bbb852117459d131fbb41d69ae63bd251a3e..1bb1c46b4abad100622d3f101a0a3ca0a6c8e881 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1296,6 +1296,11 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
>         if (nsize < 0)
>                 nsize = 0;
>
> +       if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
> +               NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
> +               return -ENOMEM;
> +       }
> +

Hi Eric, I now have a packetdrill test that started failing (see
below). Admittedly, a bit weird test with the SO_SNDBUF forced so low.

Nevertheless, this test previously passed; now it stalls after the
write() because tcp_fragment() returns -ENOMEM. Your commit message
mentions that this could trigger when one sets SO_SNDBUF low. But
here we have a complete stall of the connection and we never recover.

I don't know if we care about this, but there it is :-)


Christoph

----
--tolerance_usecs=10000

+0.0 `ifconfig tun0 mtu 5060`
+0.0 `ethtool -K tun0 tso off` // don't send big segments > MSS

+0.015 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.0 bind(3, ..., ...) = 0

+0.0 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [2000], 4) = 0

+0.0 listen(3, 1) = 0

+0 < S 0:0(0) win 24900 <mss 5020,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 5020,sackOK,nop,nop,nop,wscale 7>
+0 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4

+0 setsockopt(4, SOL_TCP, TCP_CORK, [1], 4) = 0

+0 write(4, ..., 9999) = 9999
+0.0 > . 1:5021(5020) ack 1
+0.01 <  . 1:1(0) ack 5021 win 257

+0.206 > P. 5021:10000(4979) ack 1
+0.01 <  . 1:1(0) ack 10000 win 257


* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-06-18  0:18   ` Christoph Paasch
@ 2019-06-18  2:28     ` Eric Dumazet
  2019-06-18  3:19       ` Christoph Paasch
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2019-06-18  2:28 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess



On 6/17/19 5:18 PM, Christoph Paasch wrote:
> 
> Hi Eric, I now have a packetdrill test that started failing (see
> below). Admittedly, a bit weird test with the SO_SNDBUF forced so low.
> 
> Nevertheless, this test previously passed; now it stalls after the
> write() because tcp_fragment() returns -ENOMEM. Your commit message
> mentions that this could trigger when one sets SO_SNDBUF low. But
> here we have a complete stall of the connection and we never recover.
> 
> I don't know if we care about this, but there it is :-)

I guess it is WAI :)

Honestly I am not sure we want to add code just to allow these degenerate use cases.

Upstream kernels could check if the rtx queue is empty or not, but this check will not be trivial to backport.


Can you test :

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 00c01a01b547ec67c971dc25a74c9258563cf871..06576540133806222f43d4a9532c5a929a2965b0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1296,7 +1296,8 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
        if (nsize < 0)
                nsize = 0;
 
-       if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
+       if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
+                    !tcp_rtx_queue_empty(sk))) {
                NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
                return -ENOMEM;
        }


* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-06-18  2:28     ` Eric Dumazet
@ 2019-06-18  3:19       ` Christoph Paasch
  2019-06-18  3:44         ` Eric Dumazet
  2019-07-10 18:23         ` Prout, Andrew - LLSC - MITLL
  0 siblings, 2 replies; 36+ messages in thread
From: Christoph Paasch @ 2019-06-18  3:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis, Jonathan Lemon, Dustin Marquess

On Mon, Jun 17, 2019 at 7:28 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 6/17/19 5:18 PM, Christoph Paasch wrote:
> >
> > Hi Eric, I now have a packetdrill test that started failing (see
> > below). Admittedly, a bit weird test with the SO_SNDBUF forced so low.
> >
> > Nevertheless, this test previously passed; now it stalls after the
> > write() because tcp_fragment() returns -ENOMEM. Your commit message
> > mentions that this could trigger when one sets SO_SNDBUF low. But
> > here we have a complete stall of the connection and we never recover.
> >
> > I don't know if we care about this, but there it is :-)
>
> I guess it is WAI :)
>
> Honestly I am not sure we want to add code just to allow these degenerate use cases.
>
> Upstream kernels could check if the rtx queue is empty or not, but this check will not be trivial to backport.
>
>
> Can you test :

Yes, this does the trick for my packetdrill-test.

I wonder, is there a way we could end up in a situation where we can't
retransmit anymore?
For example, sk_wmem_queued has grown so much that the new test fails.
Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
won't be able to do so. So we will never retransmit. And if no ACK
comes back in to make some room we are stuck, no?


Christoph



>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 00c01a01b547ec67c971dc25a74c9258563cf871..06576540133806222f43d4a9532c5a929a2965b0 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1296,7 +1296,8 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
>         if (nsize < 0)
>                 nsize = 0;
>
> -       if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
> +       if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
> +                    !tcp_rtx_queue_empty(sk))) {
>                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
>                 return -ENOMEM;
>         }


* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-06-18  3:19       ` Christoph Paasch
@ 2019-06-18  3:44         ` Eric Dumazet
  2019-06-18  3:53           ` Christoph Paasch
  2019-07-10 18:23         ` Prout, Andrew - LLSC - MITLL
  1 sibling, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2019-06-18  3:44 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis, Jonathan Lemon, Dustin Marquess



On 6/17/19 8:19 PM, Christoph Paasch wrote:
> 
> Yes, this does the trick for my packetdrill-test.
> 
> I wonder, is there a way we could end up in a situation where we can't
> retransmit anymore?
> For example, sk_wmem_queued has grown so much that the new test fails.
> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
> won't be able to do so. So we will never retransmit. And if no ACK
> comes back in to make some room we are stuck, no?

Well, RTO will eventually fire.

Really TCP can not work well with tiny sndbuf limits.

There is really no point trying to be nice.

There is precedent in TCP stack where we always allow one packet in RX or TX queue
even with tiny rcv/sndbuf limits (or global memory pressure)

We only need to make sure to allow having at least one packet in rtx queue as well.


* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-06-18  3:44         ` Eric Dumazet
@ 2019-06-18  3:53           ` Christoph Paasch
  2019-06-18  4:08             ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Paasch @ 2019-06-18  3:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis, Jonathan Lemon, Dustin Marquess

On Mon, Jun 17, 2019 at 8:44 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 6/17/19 8:19 PM, Christoph Paasch wrote:
> >
> > Yes, this does the trick for my packetdrill-test.
> >
> > I wonder, is there a way we could end up in a situation where we can't
> > retransmit anymore?
> > For example, sk_wmem_queued has grown so much that the new test fails.
> > Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
> > won't be able to do so. So we will never retransmit. And if no ACK
> > comes back in to make some room we are stuck, no?
>
> Well, RTO will eventually fire.

But even the RTO would have to go through __tcp_retransmit_skb(), and
let's say the MTU of the interface changed and thus we need to
fragment. tcp_fragment() would keep on failing then, no? Sure,
eventually we will ETIMEOUT but that's a long way to go.

> Really TCP can not work well with tiny sndbuf limits.
>
> There is really no point trying to be nice.

Sure, fair enough :-)


Christoph

>
> There is precedent in TCP stack where we always allow one packet in RX or TX queue
> even with tiny rcv/sndbuf limits (or global memory pressure)
>
> We only need to make sure to allow having at least one packet in rtx queue as well.


* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-06-18  3:53           ` Christoph Paasch
@ 2019-06-18  4:08             ` Eric Dumazet
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2019-06-18  4:08 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Greg Kroah-Hartman,
	Jonathan Looney, Neal Cardwell, Tyler Hicks, Yuchung Cheng,
	Bruce Curtis, Jonathan Lemon, Dustin Marquess



On 6/17/19 8:53 PM, Christoph Paasch wrote:
> On Mon, Jun 17, 2019 at 8:44 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 6/17/19 8:19 PM, Christoph Paasch wrote:
>>>
>>> Yes, this does the trick for my packetdrill-test.
>>>
>>> I wonder, is there a way we could end up in a situation where we can't
>>> retransmit anymore?
>>> For example, sk_wmem_queued has grown so much that the new test fails.
>>> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
>>> won't be able to do so. So we will never retransmit. And if no ACK
>>> comes back in to make some room we are stuck, no?
>>
>> Well, RTO will eventually fire.
> 
> But even the RTO would have to go through __tcp_retransmit_skb(), and
> let's say the MTU of the interface changed and thus we need to
> fragment. tcp_fragment() would keep on failing then, no? Sure,
> eventually we will ETIMEOUT but that's a long way to go.

Also I want to point out that a normal skb split for not-yet-transmitted
skbs does not use tcp_fragment(), with one exception (the one you hit).

Only the first skb in the write queue can possibly have payload in skb->head
and might go through tcp_fragment().

Other splits will use tso_fragment() which does not enforce sk_wmem_queued limits (yet)

So things like TLP should work.


* RE: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-06-18  3:19       ` Christoph Paasch
  2019-06-18  3:44         ` Eric Dumazet
@ 2019-07-10 18:23         ` Prout, Andrew - LLSC - MITLL
  2019-07-10 18:28           ` Eric Dumazet
  1 sibling, 1 reply; 36+ messages in thread
From: Prout, Andrew - LLSC - MITLL @ 2019-07-10 18:23 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess

On 6/17/19 8:19 PM, Christoph Paasch wrote:
> 
> Yes, this does the trick for my packetdrill-test.
> 
> I wonder, is there a way we could end up in a situation where we can't
> retransmit anymore?
> For example, sk_wmem_queued has grown so much that the new test fails.
> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
> won't be able to do so. So we will never retransmit. And if no ACK
> comes back in to make some room we are stuck, no?

We seem to be having exactly this problem. We’re running on the 4.14 branch. After recently updating our kernel, we’ve been having a problem with TCP connections stalling / dying off without disconnecting. They're stuck and never recover.

I bisected the problem to 4.14.127 commit 9daf226ff92679d09aeca1b5c1240e3607153336 (commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment() should apply sane memory limits. That led me to this thread.

Our environment is a supercomputing center: lots of servers interconnected with a non-blocking 10Gbit ethernet network. We’ve zeroed in on the problem in two situations: remote users on VPN accessing large files via samba, and compute jobs using Intel MPI over TCP/IP/ethernet. It certainly affects other situations (many of our workloads have been unstable since this patch went into production), but those are the two we clearly identified, as they fail reliably every time. We had to take the system down for unscheduled maintenance to roll back to an older kernel.

The TCPWqueueTooBig count is incrementing when the problem occurs.

Using ftrace/trace-cmd on an affected process, it appears the call stack is:
run_timer_softirq
expire_timers
call_timer_fn
tcp_write_timer
tcp_write_timer_handler
tcp_retransmit_timer
tcp_retransmit_skb
__tcp_retransmit_skb
tcp_fragment

Andrew Prout
MIT Lincoln Laboratory Supercomputing Center



* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-10 18:23         ` Prout, Andrew - LLSC - MITLL
@ 2019-07-10 18:28           ` Eric Dumazet
  2019-07-10 18:53             ` Prout, Andrew - LLSC - MITLL
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2019-07-10 18:28 UTC (permalink / raw)
  To: Prout, Andrew - LLSC - MITLL, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess



On 7/10/19 8:23 PM, Prout, Andrew - LLSC - MITLL wrote:
> On 6/17/19 8:19 PM, Christoph Paasch wrote:
>>
>> Yes, this does the trick for my packetdrill-test.
>>
>> I wonder, is there a way we could end up in a situation where we can't
>> retransmit anymore?
>> For example, sk_wmem_queued has grown so much that the new test fails.
>> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
>> won't be able to do so. So we will never retransmit. And if no ACK
>> comes back in to make some room we are stuck, no?
> 
> We seem to be having exactly this problem. We’re running on the 4.14 branch. After recently updating our kernel, we’ve been having a problem with TCP connections stalling / dying off without disconnecting. They're stuck and never recover.
> 
> I bisected the problem to 4.14.127 commit 9daf226ff92679d09aeca1b5c1240e3607153336 (commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment() should apply sane memory limits. That led me to this thread.
> 
> Our environment is a supercomputing center: lots of servers interconnected with a non-blocking 10Gbit ethernet network. We’ve zeroed in on the problem in two situations: remote users on VPN accessing large files via samba, and compute jobs using Intel MPI over TCP/IP/ethernet. It certainly affects other situations (many of our workloads have been unstable since this patch went into production), but those are the two we clearly identified, as they fail reliably every time. We had to take the system down for unscheduled maintenance to roll back to an older kernel.
> 
> The TCPWqueueTooBig count is incrementing when the problem occurs.
> 
> Using ftrace/trace-cmd on an affected process, it appears the call stack is:
> run_timer_softirq
> expire_timers
> call_timer_fn
> tcp_write_timer
> tcp_write_timer_handler
> tcp_retransmit_timer
> tcp_retransmit_skb
> __tcp_retransmit_skb
> tcp_fragment
> 
> Andrew Prout
> MIT Lincoln Laboratory Supercomputing Center
> 

What was the kernel version you used exactly ?

This problem is supposed to be fixed in v4.14.131



* RE: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-10 18:28           ` Eric Dumazet
@ 2019-07-10 18:53             ` Prout, Andrew - LLSC - MITLL
  2019-07-10 19:26               ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Prout, Andrew - LLSC - MITLL @ 2019-07-10 18:53 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess

On 7/10/19 2:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 7/10/19 8:23 PM, Prout, Andrew - LLSC - MITLL wrote:
>> On 6/17/19 8:19 PM, Christoph Paasch wrote:
>>>
>>> Yes, this does the trick for my packetdrill-test.
>>>
>>> I wonder, is there a way we could end up in a situation where we can't
>>> retransmit anymore?
>>> For example, sk_wmem_queued has grown so much that the new test fails.
>>> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
>>> won't be able to do so. So we will never retransmit. And if no ACK
>>> comes back in to make some room we are stuck, no?
>> 
>> We seem to be having exactly this problem. We’re running on the 4.14 branch. After recently updating our kernel, we’ve been having a problem with TCP connections stalling / dying off without disconnecting. They're stuck and never recover.
>> 
>> I bisected the problem to 4.14.127 commit 9daf226ff92679d09aeca1b5c1240e3607153336 (commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment() should apply sane memory limits. That led me to this thread.
>> 
>> Our environment is a supercomputing center: lots of servers interconnected with a non-blocking 10Gbit Ethernet network. We’ve zeroed in on the problem in two situations: remote users on VPN accessing large files via samba, and compute jobs using Intel MPI over TCP/IP/Ethernet. It certainly affects other situations; many of our workloads have been unstable since this patch went into production, but those are the two we clearly identified, as they fail reliably every time. We had to take the system down for unscheduled maintenance to roll back to an older kernel.
>> 
>> The TCPWqueueTooBig count is incrementing when the problem occurs.
>> 
>> Using ftrace/trace-cmd on an affected process, it appears the call stack is:
>> run_timer_softirq
>> expire_timers
>> call_timer_fn
>> tcp_write_timer
>> tcp_write_timer_handler
>> tcp_retransmit_timer
>> tcp_retransmit_skb
>> __tcp_retransmit_skb
>> tcp_fragment
>> 
>> Andrew Prout
>> MIT Lincoln Laboratory Supercomputing Center
>> 
>
> What was the kernel version you used, exactly?
>
> This problem is supposed to be fixed in v4.14.131

Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with the synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
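
For reference, the four lines in question (the check the patch adds near the top of tcp_fragment()) read roughly as follows; this is a paraphrase of the upstream commit, not a verbatim copy:

	if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
		return -ENOMEM;
	}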

The synthetic test was a pair of simple send/recv test programs under the following conditions (a sketch of the sender follows the list):
-The send socket was non-blocking
-SO_SNDBUF set to 128KiB
-The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
-Load was on both systems: a while(1) program spinning on each CPU core
-The receiver was on an older unaffected kernel
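
Concretely, a minimal sketch of the sender under the conditions above (illustrative names, not the actual test program):

/* Non-blocking sender with SO_SNDBUF pinned to 128KiB that keeps
 * pushing data, spinning on EAGAIN; fd is an already-connected
 * TCP socket. */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>

static void flood(int fd)
{
	char buf[16384];
	int sndbuf = 128 * 1024;

	memset(buf, 'x', sizeof(buf));
	setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

	for (;;) {
		ssize_t n = send(fd, buf, sizeof(buf), 0);

		if (n < 0 && errno == EAGAIN)
			continue;	/* send buffer full, retry */
		if (n < 0)
			break;		/* error, or a stalled connection */
	}
}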


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-10 18:53             ` Prout, Andrew - LLSC - MITLL
@ 2019-07-10 19:26               ` Eric Dumazet
  2019-07-11  7:28                 ` Christoph Paasch
  2019-07-11 17:14                 ` Prout, Andrew - LLSC - MITLL
  0 siblings, 2 replies; 36+ messages in thread
From: Eric Dumazet @ 2019-07-10 19:26 UTC (permalink / raw)
  To: Prout, Andrew - LLSC - MITLL, Eric Dumazet, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess



On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
> 
> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with the synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
> 
> The synthetic test was a pair of simple send/recv test programs under the following conditions:
> -The send socket was non-blocking
> -SO_SNDBUF set to 128KiB
> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
> -Load was on both systems: a while(1) program spinning on each CPU core
> -The receiver was on an older unaffected kernel
> 

Setting SO_SNDBUF to 128KB does not permit recovery from heavy losses,
since skbs need to be allocated for retransmits.

The bug we fixed allowed remote attackers to crash all Linux hosts.

I am afraid we have to enforce the real SO_SNDBUF limit, finally.

Even a cushion of 128KB per socket is dangerous, for servers with millions of TCP sockets.

You will either have to set SO_SNDBUF to higher values, or leave autotuning in place.
Or revert the patches and allow attackers to hit you badly.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-10 19:26               ` Eric Dumazet
@ 2019-07-11  7:28                 ` Christoph Paasch
  2019-07-11  9:19                   ` Eric Dumazet
  2019-07-11 10:18                   ` Eric Dumazet
  2019-07-11 17:14                 ` Prout, Andrew - LLSC - MITLL
  1 sibling, 2 replies; 36+ messages in thread
From: Christoph Paasch @ 2019-07-11  7:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Prout, Andrew - LLSC - MITLL, David Miller, netdev,
	Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell, Tyler Hicks,
	Yuchung Cheng, Bruce Curtis, Jonathan Lemon, Dustin Marquess



> On Jul 10, 2019, at 9:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> 
> On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
>> 
>> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with the synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
>> 
>> The synthetic test was a pair of simple send/recv test programs under the following conditions:
>> -The send socket was non-blocking
>> -SO_SNDBUF set to 128KiB
>> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
>> -Load was on both systems: a while(1) program spinning on each CPU core
>> -The receiver was on an older unaffected kernel
>> 
> 
> Setting SO_SNDBUF to 128KB does not permit recovery from heavy losses,
> since skbs need to be allocated for retransmits.

Would it make sense to always allow the allocation in tcp_fragment() when coming from __tcp_retransmit_skb() through the retransmit timer?

AFAICS, the crasher was triggered by an attacker sending "fake" SACK blocks. Thus, we would still be protected from too much fragmentation, but would at least always allow the retransmission to go out.


Christoph

> 
> The bug we fixed allowed remote attackers to crash all Linux hosts.
> 
> I am afraid we have to enforce the real SO_SNDBUF limit, finally.
> 
> Even a cushion of 128KB per socket is dangerous, for servers with millions of TCP sockets.
> 
> You will either have to set SO_SNDBUF to higher values, or leave autotuning in place.
> Or revert the patches and allow attackers to hit you badly.
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-11  7:28                 ` Christoph Paasch
@ 2019-07-11  9:19                   ` Eric Dumazet
  2019-07-11 18:26                     ` Michal Kubecek
  2019-07-11 10:18                   ` Eric Dumazet
  1 sibling, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2019-07-11  9:19 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet
  Cc: Prout, Andrew - LLSC - MITLL, David Miller, netdev,
	Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell, Tyler Hicks,
	Yuchung Cheng, Bruce Curtis, Jonathan Lemon, Dustin Marquess



On 7/11/19 9:28 AM, Christoph Paasch wrote:
> 
> 
>> On Jul 10, 2019, at 9:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
>>>
>>> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with the synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
>>>
>>> The synthetic test was a pair of simple send/recv test programs under the following conditions:
>>> -The send socket was non-blocking
>>> -SO_SNDBUF set to 128KiB
>>> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
>>> -Load was on both systems: a while(1) program spinning on each CPU core
>>> -The receiver was on an older unaffected kernel
>>>
>>
>> Setting SO_SNDBUF to 128KB does not permit recovery from heavy losses,
>> since skbs need to be allocated for retransmits.
> 
> Would it make sense to always allow the allocation in tcp_fragment() when coming from __tcp_retransmit_skb() through the retransmit timer?

4.15+ kernels have:

if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
	     tcp_queue != TCP_FRAG_IN_WRITE_QUEUE)) {
	NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
	return -ENOMEM;
}


Meaning that things like TLP will succeed.

Anything we add to the TCP stack that allows exceeding SO_SNDBUF by more than twice the limit _will_ be exploited at scale.

I am not sure we want to continue to support small SO_SNDBUF values, as this makes no sense today.

We use a 64 MB autotuning limit, and set /proc/sys/net/ipv4/tcp_notsent_lowat to 1 MB.

I would rather work (when net-next reopens) on better collapsing at retransmit time to reduce the overhead.


Something like:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f6a9c95a48edb234e4d4e21bf585744fbaf9a0a7..d5c85986209cd162cf39edb787b1385cb2c8b630 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2860,7 +2860,7 @@ static int __net_init tcp_sk_init(struct net *net)
        net->ipv4.sysctl_tcp_early_retrans = 3;
        net->ipv4.sysctl_tcp_recovery = TCP_RACK_LOSS_DETECTION;
        net->ipv4.sysctl_tcp_slow_start_after_idle = 1; /* By default, RFC2861 behavior.  */
-       net->ipv4.sysctl_tcp_retrans_collapse = 1;
+       net->ipv4.sysctl_tcp_retrans_collapse = 3;
        net->ipv4.sysctl_tcp_max_reordering = 300;
        net->ipv4.sysctl_tcp_dsack = 1;
        net->ipv4.sysctl_tcp_app_win = 31;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d61264cf89ef66b229ecf797c1abfb7fcdab009f..05cd264f98b084f62eaf2ef9d6e14a392670d02c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3015,8 +3015,6 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
        next_skb_size = next_skb->len;
 
-       BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
-
        if (next_skb_size) {
                if (next_skb_size <= skb_availroom(skb))
                        skb_copy_bits(next_skb, 0, skb_put(skb, next_skb_size),
@@ -3054,8 +3052,6 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 /* Check if coalescing SKBs is legal. */
 static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
 {
-       if (tcp_skb_pcount(skb) > 1)
-               return false;
        if (skb_cloned(skb))
                return false;
        /* Some heuristics for collapsing over SACK'd could be invented */
@@ -3114,7 +3110,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        unsigned int cur_mss;
-       int diff, len, err;
+       int diff, len, maxlen, err;
 
 
        /* Inconclusive MTU probe */
@@ -3165,12 +3161,13 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
                        return -ENOMEM;
 
                diff = tcp_skb_pcount(skb);
+               maxlen = (sock_net(sk)->ipv4.sysctl_tcp_retrans_collapse & 2) ? len : cur_mss;
+               if (skb->len < maxlen)
+                       tcp_retrans_try_collapse(sk, skb, maxlen);
                tcp_set_skb_tso_segs(skb, cur_mss);
                diff -= tcp_skb_pcount(skb);
                if (diff)
                        tcp_adjust_pcount(sk, skb, diff);
-               if (skb->len < cur_mss)
-                       tcp_retrans_try_collapse(sk, skb, cur_mss);
        }
 
        /* RFC3168, section 6.1.1.1. ECN fallback */

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-11  7:28                 ` Christoph Paasch
  2019-07-11  9:19                   ` Eric Dumazet
@ 2019-07-11 10:18                   ` Eric Dumazet
  1 sibling, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2019-07-11 10:18 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet
  Cc: Prout, Andrew - LLSC - MITLL, David Miller, netdev,
	Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell, Tyler Hicks,
	Yuchung Cheng, Bruce Curtis, Jonathan Lemon, Dustin Marquess



On 7/11/19 9:28 AM, Christoph Paasch wrote:
> 

> Would it make sense to always allow the allocation in tcp_fragment() when coming from __tcp_retransmit_skb() through the retransmit timer?
> 
> AFAICS, the crasher was triggered by an attacker sending "fake" SACK blocks. Thus, we would still be protected from too much fragmentation, but would at least always allow the retransmission to go out.
> 
> 
I guess this could be done, and hopefully it is something that can be backported without too much pain,
but I am sure some people will complain, because reverting to timeouts might still
show regressions :/


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-10 19:26               ` Eric Dumazet
  2019-07-11  7:28                 ` Christoph Paasch
@ 2019-07-11 17:14                 ` Prout, Andrew - LLSC - MITLL
  2019-07-11 18:28                   ` Eric Dumazet
  2019-07-16 15:13                   ` Prout, Andrew - LLSC - MITLL
  1 sibling, 2 replies; 36+ messages in thread
From: Prout, Andrew - LLSC - MITLL @ 2019-07-11 17:14 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess

On 7/10/19 3:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
>> 
>> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with the synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
>> 
>> The synthetic test was a pair of simple send/recv test programs under the following conditions:
>> -The send socket was non-blocking
>> -SO_SNDBUF set to 128KiB
>> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
>> -Load was on both systems: a while(1) program spinning on each CPU core
>> -The receiver was on an older unaffected kernel
>> 
>
> Setting SO_SNDBUF to 128KB does not permit recovery from heavy losses,
> since skbs need to be allocated for retransmits.
>
> The bug we fixed allowed remote attackers to crash all Linux hosts.
>
> I am afraid we have to enforce the real SO_SNDBUF limit, finally.
>
> Even a cushion of 128KB per socket is dangerous, for servers with millions of TCP sockets.
>
> You will either have to set SO_SNDBUF to higher values, or leave autotuning in place.
> Or revert the patches and allow attackers to hit you badly.

I in no way intended to imply that I had confirmed the small SO_SNDBUF was a prerequisite to our problem. With my synthetic test, I was looking for a concise reproducer and trying things to stress the system.

Unfortunately we're often stuck being forced to support very old code, right alongside the latest and greatest. We still run a lot of FORTRAN. Telling users en masse to search and revise their code is not an option for us.

In my opinion, if SO_SNDBUF values below a certain size are no longer supported, then SOCK_MIN_SNDBUF should be adjusted to reflect this. The RCVBUF/SNDBUF sizes are supposed to be hints; no error is returned if they are not honored. The kernel should continue to function regardless of what userspace requests for their values.
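
For context, the kernel already derives sk_sndbuf from the requested value; paraphrasing the SO_SNDBUF case of sock_setsockopt() in net/core/sock.c (not a verbatim copy):

	val = min_t(u32, val, sysctl_wmem_max);	/* cap the request */
	sk->sk_userlocks |= SOCK_SNDBUF_LOCK;	/* disable autotuning */
	sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);

So SOCK_MIN_SNDBUF is already the floor; raising it would be the natural place to encode any new minimum.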

Alternatively, a config option could be added. I am not concerned about DoS attacks: our system is not connected to the internet, and we shouldn't have to maintain an out-of-tree patch for basic functionality.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-11  9:19                   ` Eric Dumazet
@ 2019-07-11 18:26                     ` Michal Kubecek
  2019-07-11 18:50                       ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Kubecek @ 2019-07-11 18:26 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Christoph Paasch, Prout, Andrew - LLSC - MITLL,
	David Miller, Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell,
	Tyler Hicks, Yuchung Cheng, Bruce Curtis, Jonathan Lemon,
	Dustin Marquess

On Thu, Jul 11, 2019 at 11:19:45AM +0200, Eric Dumazet wrote:
> 
> 
> On 7/11/19 9:28 AM, Christoph Paasch wrote:
> > 
> > 
> >> On Jul 10, 2019, at 9:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>
> >>
> >>
> >> On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
> >>>
> >>> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with the synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
> >>>
> >>> The synthetic test was a pair of simple send/recv test programs under the following conditions:
> >>> -The send socket was non-blocking
> >>> -SO_SNDBUF set to 128KiB
> >>> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
> >>> -Load was on both systems: a while(1) program spinning on each CPU core
> >>> -The receiver was on an older unaffected kernel
> >>>
> >>
> >> Setting SO_SNDBUF to 128KB does not permit recovery from heavy losses,
> >> since skbs need to be allocated for retransmits.
> > 
> > Would it make sense to always allow the allocation in tcp_fragment() when coming from __tcp_retransmit_skb() through the retransmit timer?
> 
> 4.15+ kernels have :
> 
> if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
>     tcp_queue != TCP_FRAG_IN_WRITE_QUEUE)) {
> 
> 
> Meaning that things like TLP will succeed.

I get the following (ret=-12 is -ENOMEM: with sndbuf=30000 and
wmemq=65600, the (sk_wmem_queued >> 1) > sk_sndbuf test fires):

          <idle>-0     [010] ..s. 301696.143296: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
          <idle>-0     [010] d.s. 301696.143301: r_tcp_fragment_0: (tcp_send_loss_probe+0x13d/0x1f0 <- tcp_fragment) ret=-12
          <idle>-0     [010] ..s. 301696.267644: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
          <idle>-0     [010] d.s. 301696.267650: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
          <idle>-0     [010] ..s. 301696.875289: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
          <idle>-0     [010] d.s. 301696.875293: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
          <idle>-0     [010] ..s. 301698.059267: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
          <idle>-0     [010] d.s. 301698.059271: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
          <idle>-0     [010] ..s. 301700.427225: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
          <idle>-0     [010] d.s. 301700.427230: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
          <idle>-0     [010] ..s. 301705.291144: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
          <idle>-0     [010] d.s. 301705.291151: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
          <idle>-0     [010] ..s. 301714.762961: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
          <idle>-0     [010] d.s. 301714.762966: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12

on 5.2 kernel with this packetdrill script:

------------------------------------------------------------------------
--tolerance_usecs=10000

// flush cached TCP metrics
0.000  `ip tcp_metrics flush all`

// establish a connection
+0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.000 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [15000], 4) = 0
+0.000 bind(3, ..., ...) = 0
+0.000 listen(3, 1) = 0

+0.100 < S 0:0(0) win 60000 <mss 1000,nop,nop,sackOK,nop,wscale 7>
+0.000 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
+0.100 < . 1:1(0) ack 1 win 2000
+0.000 accept(3, ..., ...) = 4
+0.100 write(4, ..., 30000) = 30000

+0.000 > . 1:2001(2000) ack 1
+0.000 > . 2001:4001(2000) ack 1
+0.000 > . 4001:6001(2000) ack 1
+0.000 > . 6001:8001(2000) ack 1
+0.000 > . 8001:10001(2000) ack 1
+0.010 < . 1:1(0) ack 10001 win 2000
+0.000 > . 10001:12001(2000) ack 1
+0.000 > . 12001:14001(2000) ack 1
+0.000 > . 14001:16001(2000) ack 1
+0.000 > . 16001:18001(2000) ack 1
+0.000 > . 18001:20001(2000) ack 1
+0.000 > . 20001:22001(2000) ack 1
+0.000 > . 22001:24001(2000) ack 1
+0.000 > . 24001:26001(2000) ack 1
+0.000 > . 26001:28001(2000) ack 1
+0.000 > P. 28001:30001(2000) ack 1
+0.010 < . 1:1(0) ack 30001 win 2000
+0.000 write(4, ..., 40000) = 40000
+0.000 > . 30001:32001(2000) ack 1
+0.000 > . 32001:34001(2000) ack 1
+0.000 > . 34001:36001(2000) ack 1
+0.000 > . 36001:38001(2000) ack 1
+0.000 > . 38001:40001(2000) ack 1
+0.000 > . 40001:42001(2000) ack 1
+0.000 > . 42001:44001(2000) ack 1
+0.000 > . 44001:46001(2000) ack 1
+0.000 > . 46001:48001(2000) ack 1
+0.000 > . 48001:50001(2000) ack 1
+0.000 > . 50001:52001(2000) ack 1
+0.000 > . 52001:54001(2000) ack 1
+0.000 > . 54001:56001(2000) ack 1
+0.000 > . 56001:58001(2000) ack 1
+0.000 > . 58001:60001(2000) ack 1
+0.000 > . 60001:62001(2000) ack 1
+0.000 > . 62001:64001(2000) ack 1
+0.000 > . 64001:66001(2000) ack 1
+0.000 > . 66001:68001(2000) ack 1
+0.000 > P. 68001:70001(2000) ack 1

+0.000 `ss -nteim state established sport == :8080`

+0.120~+0.200 > P. 69001:70001(1000) ack 1
------------------------------------------------------------------------

I'm aware it's not a realistic test. It was written as a quick and
simple check of the pre-4.19 patch, but it shows that even TLP may not
get through: with no new data left to send, tcp_send_loss_probe() has
to fragment the tail skb for the probe, and that call is refused.

Michal

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-11 17:14                 ` Prout, Andrew - LLSC - MITLL
@ 2019-07-11 18:28                   ` Eric Dumazet
  2019-07-11 19:04                     ` Jonathan Lemon
  2019-07-16 15:13                   ` Prout, Andrew - LLSC - MITLL
  1 sibling, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2019-07-11 18:28 UTC (permalink / raw)
  To: Prout, Andrew - LLSC - MITLL, Eric Dumazet, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess



On 7/11/19 7:14 PM, Prout, Andrew - LLSC - MITLL wrote:
> 
> In my opinion, if SO_SNDBUF values below a certain size are no longer supported, then SOCK_MIN_SNDBUF should be adjusted to reflect this. The RCVBUF/SNDBUF sizes are supposed to be hints; no error is returned if they are not honored. The kernel should continue to function regardless of what userspace requests for their values.
> 

It is supported to set whatever SO_SNDBUF value you like and get terrible performance.

It always has been.

The only difference is that we no longer allow an attacker to fool the TCP stack
and consume up to 2 GB per socket while SO_SNDBUF was set to 128 KB.

The side effect is that in some cases, the workload can appear to have the signature of the attack.

The solution is to increase your SO_SNDBUF, or even better let the TCP stack autotune it.
Nobody forced you to set very small values for it.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-11 18:26                     ` Michal Kubecek
@ 2019-07-11 18:50                       ` Eric Dumazet
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2019-07-11 18:50 UTC (permalink / raw)
  To: Michal Kubecek, netdev
  Cc: Eric Dumazet, Christoph Paasch, Prout, Andrew - LLSC - MITLL,
	David Miller, Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell,
	Tyler Hicks, Yuchung Cheng, Bruce Curtis, Jonathan Lemon,
	Dustin Marquess



On 7/11/19 8:26 PM, Michal Kubecek wrote:

> 
> I'm aware it's not a realistic test. It was written as quick and simple
> check of the pre-4.19 patch, but it shows that even TLP may not get
> through.


Most TLP probes send new data, not retransmits.

But yes, I get your point.

SO_SNDBUF=15000 in your case is seriously wrong.

Let's code a safety feature over SO_SNDBUF that does not allow pathologically small values,
because I do not want to support a constrained TCP stack in 2019.
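
A sketch of what such a safety feature might look like (the floor value and its placement are assumptions on my side, not a decided design):

	/* Hypothetical floor applied where SO_SNDBUF is set: keep enough
	 * room for retransmit skbs even when userspace asks for less.
	 * TCP_SNDBUF_FLOOR is a placeholder, not from any actual patch. */
	#define TCP_SNDBUF_FLOOR	(64 * 1024)

	sk->sk_sndbuf = max_t(int, val * 2, TCP_SNDBUF_FLOOR);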


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-11 18:28                   ` Eric Dumazet
@ 2019-07-11 19:04                     ` Jonathan Lemon
  2019-07-12  7:05                       ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Jonathan Lemon @ 2019-07-11 19:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Prout, Andrew - LLSC - MITLL, Christoph Paasch, David S . Miller,
	netdev, Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell,
	Tyler Hicks, Yuchung Cheng, Bruce Curtis, Dustin Marquess



On 11 Jul 2019, at 11:28, Eric Dumazet wrote:

> On 7/11/19 7:14 PM, Prout, Andrew - LLSC - MITLL wrote:
>>
>> In my opinion, if SO_SNDBUF values below a certain size are no longer
>> supported, then SOCK_MIN_SNDBUF should be adjusted to reflect this.
>> The RCVBUF/SNDBUF sizes are supposed to be hints; no error is
>> returned if they are not honored. The kernel should continue to
>> function regardless of what userspace requests for their values.
>>
>
> It is supported to set whatever SO_SNDBUF value you like and get
> terrible performance.
>
> It always has been.
>
> The only difference is that we no longer allow an attacker to fool the
> TCP stack and consume up to 2 GB per socket while SO_SNDBUF was set
> to 128 KB.
>
> The side effect is that in some cases, the workload can appear to have 
> the signature of the attack.
>
> The solution is to increase your SO_SNDBUF, or even better let the
> TCP stack autotune it.
> Nobody forced you to set very small values for it.

I discovered we have some production services that set SO_SNDBUF to
very small values (~4k), as they are essentially doing interactive
communications, not bulk transfers.  But there's a difference between
"terrible performance" and "TCP stops working".
-- 
Jonathan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-11 19:04                     ` Jonathan Lemon
@ 2019-07-12  7:05                       ` Eric Dumazet
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2019-07-12  7:05 UTC (permalink / raw)
  To: Jonathan Lemon, Eric Dumazet
  Cc: Prout, Andrew - LLSC - MITLL, Christoph Paasch, David S . Miller,
	netdev, Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell,
	Tyler Hicks, Yuchung Cheng, Bruce Curtis, Dustin Marquess



On 7/11/19 9:04 PM, Jonathan Lemon wrote:
 
> I discovered we have some production services that set SO_SNDBUF to
> very small values (~4k), as they are essentially doing interactive
> communications, not bulk transfers.  But there's a difference between
> "terrible performance" and "TCP stops working".

You had a copy of these patches months ago, yet you discovered this issue today?

I already said I was going to work on the issue;
no need to add pressure on me, I have had enough of it already.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-07-11 17:14                 ` Prout, Andrew - LLSC - MITLL
  2019-07-11 18:28                   ` Eric Dumazet
@ 2019-07-16 15:13                   ` Prout, Andrew - LLSC - MITLL
  1 sibling, 0 replies; 36+ messages in thread
From: Prout, Andrew - LLSC - MITLL @ 2019-07-16 15:13 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess

On 7/11/2019 11:15 PM, Prout, Andrew - LLSC - MITLL wrote: 
> I in no way intended to imply that I had confirmed the small SO_SNDBUF was a prerequisite to our problem. With my synthetic test, I was looking for a concise reproducer and trying things to stress the system.

I've looked into this some more: I am now able to reproduce it consistently (on 4.14.132) with my synthetic test, seemingly regardless of the SO_SNDBUF size (tested 128KiB, 512KiB, 1MiB and 10MiB). The key change to make it reproducible was to use sendfile() instead of send(); this is what samba was doing in our real failure case. Looking at the code, it appears that sendfile() calls into the zerocopy functions, which only check SNDBUF for the memory they allocate (the skb structure); they'll happily "overfill" the send buffer with zerocopy'd pages.

In my tests I can see the send buffer usage reported by netstat reaching much larger values than the SO_SNDBUF value I set (with a 10MiB SO_SNDBUF, it got up to ~52MiB on a stalled connection). This of course easily causes the CVE-2019-11478 fix to trigger a false alarm and the TCP connection to stall.
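
A sketch of the sendfile()-based sender that reproduces this (illustrative names, not the actual samba code path):

/* The zerocopy path charges only the skb metadata against SO_SNDBUF,
 * so the file pages can drive sk_wmem_queued far past the limit. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

static void push_file(int sock, const char *path)
{
	struct stat st;
	off_t off = 0;
	int sndbuf = 10 * 1024 * 1024;	/* stalls even at 10MiB */
	int fd = open(path, O_RDONLY);

	setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
	fstat(fd, &st);
	while (off < st.st_size)
		if (sendfile(sock, fd, &off, st.st_size - off) < 0)
			break;	/* a stalled connection stops making progress */
	close(fd);
}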

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-08-02 19:58   ` Bernd
@ 2019-08-14 14:41     ` Marcelo Ricardo Leitner
  0 siblings, 0 replies; 36+ messages in thread
From: Marcelo Ricardo Leitner @ 2019-08-14 14:41 UTC (permalink / raw)
  To: Bernd; +Cc: Neal Cardwell, netdev, edumazet

On Fri, Aug 02, 2019 at 09:58:12PM +0200, Bernd wrote:
> 2019-08-02 21:14 GMT+02:00, Neal Cardwell <ncardwell@google.com>:
> > What's the exact kernel version you are using?
> 
> It is the RHEL errata kernel 3.10.0-957.21.3.el7 (RHSA-2019:1481); I
> need to check if there is a newer one.

FWIW, this one doesn't have the patch below.

  Marcelo

> 
> > Eric submitted a patch recently that may address your issue:
> >    tcp: be more careful in tcp_fragment()
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=b617158dc096709d8600c53b6052144d12b89fab
> >
> > Would you be able to test your workload with that commit
> > cherry-picked, and see if the issue still occurs?
> 
> So far it has only happened on a customer system in production, so
> most likely not.
> 
> > That commit was targeted to many stable releases, so you may be able
> > to pick up that fix from a stable branch.
> 
> The only thing which is a bit strange: this is a Java client, and I am
> pretty sure we don’t set a small SO_SNDBUF; if anything it is
> increased (I need to verify that).
> 
> Not to worry, I guess with your helpful pointer I can now sort that
> out with Red Hat.
> 
> Regards,
> Bernd
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-08-02 19:14 ` Neal Cardwell
@ 2019-08-02 19:58   ` Bernd
  2019-08-14 14:41     ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 36+ messages in thread
From: Bernd @ 2019-08-02 19:58 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: netdev, edumazet

2019-08-02 21:14 GMT+02:00, Neal Cardwell <ncardwell@google.com>:
> What's the exact kernel version you are using?

It is the RHEL errata kernel 3.10.0-957.21.3.el7 (RHSA-2019:1481); I
need to check if there is a newer one.

> Eric submitted a patch recently that may address your issue:
>    tcp: be more careful in tcp_fragment()
>
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=b617158dc096709d8600c53b6052144d12b89fab
>
> Would you be able to test your workload with that commit
> cherry-picked, and see if the issue still occurs?

So far it has only happened on a customer system in production, so
most likely not.

> That commit was targeted to many stable releases, so you may be able
> to pick up that fix from a stable branch.

The only thing which is a bit strange: this is a Java client, and I am
pretty sure we don’t set a small SO_SNDBUF; if anything it is
increased (I need to verify that).

Not to worry, I guess with your helpful pointer I can now sort that
out with Red Hat.

Regards,
Bernd

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
  2019-08-02 19:02 [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits Bernd
@ 2019-08-02 19:14 ` Neal Cardwell
  2019-08-02 19:58   ` Bernd
  0 siblings, 1 reply; 36+ messages in thread
From: Neal Cardwell @ 2019-08-02 19:14 UTC (permalink / raw)
  To: Bernd; +Cc: netdev, Eric Dumazet

On Fri, Aug 2, 2019 at 3:03 PM Bernd <ecki@zusammenkunft.net> wrote:
>
> Hello,
>
> While analyzing an aborted upload packet capture I came across an odd
> trace where a sender was not responding to a duplicate SACK but kept
> sending further segments until it stalled.
>
> It took me some time to remember this fix, and indeed the
> problems started when the security fix was applied.
>
> I see a high counter for TCPWqueueTooBig - and I don’t think that’s an
> actual attack.
>
> Is there a chance of triggering the limit with connections that have
> big windows, large send buffers and dropped segments? If so, what
> would be the plan? It does not look like it is configurable. The
> trace seems to show 100 (filled) in-flight segments.
>
> Regards,
> Bernd
> --
> http://bernd.eckenfels.net

What's the exact kernel version you are using?

Eric submitted a patch recently that may address your issue:
   tcp: be more careful in tcp_fragment()
  https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=b617158dc096709d8600c53b6052144d12b89fab
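
From my reading of that commit, it relaxes the limit roughly like this
(paraphrased, not a verbatim copy of the patch):

	/* Allow some headroom over sk_sndbuf, and always let the first
	 * and last skb of the rtx queue be split, so retransmits and
	 * probes can make progress. */
	limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
	if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
		     tcp_queue != TCP_FRAG_IN_WRITE_QUEUE &&
		     skb != tcp_rtx_queue_head(sk) &&
		     skb != tcp_rtx_queue_tail(sk))) {
		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
		return -ENOMEM;
	}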

Would you be able to test your workload with that commit
cherry-picked, and see if the issue still occurs?

That commit was targeted to many stable releases, so you may be able
to pick up that fix from a stable branch.

cheers,
neal

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
@ 2019-08-02 19:02 Bernd
  2019-08-02 19:14 ` Neal Cardwell
  0 siblings, 1 reply; 36+ messages in thread
From: Bernd @ 2019-08-02 19:02 UTC (permalink / raw)
  To: netdev

Hello,

While analyzing an aborted upload packet capture I came across an odd
trace where a sender was not responding to a duplicate SACK but kept
sending further segments until it stalled.

It took me some time to remember this fix, and indeed the
problems started when the security fix was applied.

I see a high counter for TCPWqueueTooBig - and I don’t think that’s an
actual attack.

Is there a chance of triggering the limit with connections that have
big windows, large send buffers and dropped segments? If so, what
would be the plan? It does not look like it is configurable. The
trace seems to show 100 (filled) in-flight segments.

Regards,
Bernd
-- 
http://bernd.eckenfels.net

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2019-08-14 14:41 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-17 17:03 [PATCH net 0/4] tcp: make sack processing more robust Eric Dumazet
2019-06-17 17:03 ` [PATCH net 1/4] tcp: limit payload size of sacked skbs Eric Dumazet
2019-06-17 17:14   ` Jonathan Lemon
2019-06-17 17:03 ` [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits Eric Dumazet
2019-06-17 17:14   ` Jonathan Lemon
2019-06-18  0:18   ` Christoph Paasch
2019-06-18  2:28     ` Eric Dumazet
2019-06-18  3:19       ` Christoph Paasch
2019-06-18  3:44         ` Eric Dumazet
2019-06-18  3:53           ` Christoph Paasch
2019-06-18  4:08             ` Eric Dumazet
2019-07-10 18:23         ` Prout, Andrew - LLSC - MITLL
2019-07-10 18:28           ` Eric Dumazet
2019-07-10 18:53             ` Prout, Andrew - LLSC - MITLL
2019-07-10 19:26               ` Eric Dumazet
2019-07-11  7:28                 ` Christoph Paasch
2019-07-11  9:19                   ` Eric Dumazet
2019-07-11 18:26                     ` Michal Kubecek
2019-07-11 18:50                       ` Eric Dumazet
2019-07-11 10:18                   ` Eric Dumazet
2019-07-11 17:14                 ` Prout, Andrew - LLSC - MITLL
2019-07-11 18:28                   ` Eric Dumazet
2019-07-11 19:04                     ` Jonathan Lemon
2019-07-12  7:05                       ` Eric Dumazet
2019-07-16 15:13                   ` Prout, Andrew - LLSC - MITLL
2019-06-17 17:03 ` [PATCH net 3/4] tcp: add tcp_min_snd_mss sysctl Eric Dumazet
2019-06-17 17:15   ` Jonathan Lemon
2019-06-17 17:18   ` Tyler Hicks
2019-06-17 17:03 ` [PATCH net 4/4] tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() Eric Dumazet
2019-06-17 17:16   ` Jonathan Lemon
2019-06-17 17:18   ` Tyler Hicks
2019-06-17 17:41 ` [PATCH net 0/4] tcp: make sack processing more robust David Miller
2019-08-02 19:02 [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits Bernd
2019-08-02 19:14 ` Neal Cardwell
2019-08-02 19:58   ` Bernd
2019-08-14 14:41     ` Marcelo Ricardo Leitner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).