* [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
@ 2018-08-16  2:50 Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()" Mao Wenan
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

Five patches fix CVE-2018-5390 in the latest mainline branch, but only
two of them are present in stable 4.4 and 3.18:
dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
I tested a stable 4.4 kernel with these two patches and found that CPU
usage was still very high, so they alone do not fix CVE-2018-5390.
Test results:
with fix patches:     78.2%   ksoftirqd
without fix patches:  90%     ksoftirqd

I then tried to imitate 72cd43ba ("tcp: free batches of packets in
tcp_prune_ofo_queue()") and drop at least 12.5% of sk_rcvbuf to avoid
malicious attacks while keeping the simple queue instead of an RB tree.
The result was not good.
 
After analysing the stable 4.4 code and profiling the system, it turned
out that searching the ofo queue (TCP ofo still uses a simple linked
list there) costs most of the cycles.
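
To illustrate the cost (a stand-alone user-space sketch, not kernel code;
struct seg and ofo_insert are made-up names, and the kernel actually walks
the list from the tail, but a hostile out-of-order pattern forces the same
full-length walks):

/*
 * Sorted singly linked list, one node per out-of-order segment.
 * Every insert that cannot go at the front walks the list, so n
 * segments cost O(n^2) in total -- the time shows up as ksoftirqd.
 * An rbtree keyed by seq bounds each insert at O(log n) instead.
 */
#include <stdio.h>
#include <stdlib.h>

struct seg {
	unsigned int seq;
	struct seg *next;
};

static void ofo_insert(struct seg **head, struct seg *s)
{
	struct seg **pp = head;

	while (*pp && (*pp)->seq < s->seq)	/* O(n) walk per segment */
		pp = &(*pp)->next;
	s->next = *pp;
	*pp = s;
}

int main(void)
{
	struct seg *head = NULL;
	unsigned int i, n = 30000;

	for (i = 0; i < n; i++) {	/* ascending seq: worst case for this walk */
		struct seg *s = malloc(sizeof(*s));

		s->seq = i;
		ofo_insert(&head, s);
	}
	printf("inserted %u segments\n", n);
	return 0;	/* nodes are deliberately not freed, it is only a sketch */
}

This is why the RB tree conversion matters more here than the two patches
already backported.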

So I backported "tcp: use an RB tree for ooo receive queue" to replace
the simple queue with an RB tree, and then backported Eric Dumazet's
five fix patches from mainline. The good news is that ksoftirqd drops
to about 20%, which matches mainline.

Stable 4.4 has already backported two patches:
f4a3313d ("tcp: avoid collapses in tcp_prune_queue() if possible")
3d4bf93a ("tcp: detect malicious patterns in tcp_collapse_ofo_queue()")
To switch the simple queue to an RB tree and resolve this for good, we
must first apply 9f5afeae ("tcp: use an RB tree for ooo receive queue"),
but that patch conflicts heavily with 3d4bf93a and f4a3313d, which are
part of Eric's mainline series fixing CVE-2018-5390. Therefore this
series first reverts those two patches in stable 4.4, then applies
9f5afeae, and finally reapplies the five patches from Eric.

Eric Dumazet (6):
  tcp: increment sk_drops for dropped rx packets
  tcp: free batches of packets in tcp_prune_ofo_queue()
  tcp: avoid collapses in tcp_prune_queue() if possible
  tcp: detect malicious patterns in tcp_collapse_ofo_queue()
  tcp: call tcp_drop() from tcp_data_queue_ofo()
  tcp: add tcp_ooo_try_coalesce() helper

Mao Wenan (2):
  Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
  Revert "tcp: avoid collapses in tcp_prune_queue() if possible"

Yaogong Wang (1):
  tcp: use an RB tree for ooo receive queue

 include/linux/skbuff.h   |   8 +
 include/linux/tcp.h      |   7 +-
 include/net/sock.h       |   7 +
 include/net/tcp.h        |   2 +-
 net/core/skbuff.c        |  19 +++
 net/ipv4/tcp.c           |   4 +-
 net/ipv4/tcp_input.c     | 412 +++++++++++++++++++++++++++++------------------
 net/ipv4/tcp_ipv4.c      |   3 +-
 net/ipv4/tcp_minisocks.c |   1 -
 net/ipv6/tcp_ipv6.c      |   1 +
 10 files changed, 294 insertions(+), 170 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
  2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
@ 2018-08-16  2:50 ` Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible" Mao Wenan
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.

We need to switch the simple queue to an RB tree to finally fix
CVE-2018-5390, so revert this patch first: it conflicts heavily with the
earlier patch 9f5afeae ("tcp: use an RB tree for ooo receive queue") that
we want to apply. After that, Eric's patch series will be reapplied.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a261e0..995b2bc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4791,7 +4791,6 @@ restart:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
 	struct sk_buff *head;
 	u32 start, end;
@@ -4801,7 +4800,6 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
-	range_truesize = skb->truesize;
 	head = skb;
 
 	for (;;) {
@@ -4816,24 +4814,14 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			/* Do not attempt collapsing tiny skbs */
-			if (range_truesize != head->truesize ||
-			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
-				tcp_collapse(sk, &tp->out_of_order_queue,
-					     head, skb, start, end);
-			} else {
-				sum_tiny += range_truesize;
-				if (sum_tiny > sk->sk_rcvbuf >> 3)
-					return;
-			}
-
+			tcp_collapse(sk, &tp->out_of_order_queue,
+				     head, skb, start, end);
 			head = skb;
 			if (!skb)
 				break;
 			/* Start new segment */
 			start = TCP_SKB_CB(skb)->seq;
 			end = TCP_SKB_CB(skb)->end_seq;
-			range_truesize = skb->truesize;
 		} else {
 			if (before(TCP_SKB_CB(skb)->seq, start))
 				start = TCP_SKB_CB(skb)->seq;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible"
  2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()" Mao Wenan
@ 2018-08-16  2:50 ` Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets Mao Wenan
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

This reverts commit 5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5.

We need to switch the simple queue to an RB tree to finally fix
CVE-2018-5390, so revert this patch first: it conflicts heavily with the
earlier patch 9f5afeae ("tcp: use an RB tree for ooo receive queue") that
we want to apply. After that, Eric's patch series will be reapplied.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 995b2bc..df2f342 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4877,9 +4877,6 @@ static int tcp_prune_queue(struct sock *sk)
 	else if (tcp_under_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
-		return 0;
-
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		tcp_collapse(sk, &sk->sk_receive_queue,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets
  2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()" Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible" Mao Wenan
@ 2018-08-16  2:50 ` Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue Mao Wenan
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 532182cd610782db8c18230c2747626562032205 ]

Now that ss can report sk_drops, we can instruct TCP to increment
this per-socket counter when it drops an incoming frame, to refine
monitoring and debugging.

The following patch takes care of listener drops.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/net/sock.h   |  7 +++++++
 net/ipv4/tcp_input.c | 33 ++++++++++++++++++++-------------
 net/ipv4/tcp_ipv4.c  |  1 +
 net/ipv6/tcp_ipv6.c  |  1 +
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3d5ff74..5770757 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2139,6 +2139,13 @@ sock_skb_set_dropcount(const struct sock *sk, struct sk_buff *skb)
 	SOCK_SKB_CB(skb)->dropcount = atomic_read(&sk->sk_drops);
 }
 
+static inline void sk_drops_add(struct sock *sk, const struct sk_buff *skb)
+{
+	int segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+
+	atomic_add(segs, &sk->sk_drops);
+}
+
 void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 			   struct sk_buff *skb);
 void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df2f342..5fb4e80 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,12 @@ static bool tcp_try_coalesce(struct sock *sk,
 	return true;
 }
 
+static void tcp_drop(struct sock *sk, struct sk_buff *skb)
+{
+	sk_drops_add(sk, skb);
+	__kfree_skb(skb);
+}
+
 /* This one checks to see if we can put data from the
  * out_of_order queue into the receive_queue.
  */
@@ -4320,7 +4326,7 @@ static void tcp_ofo_queue(struct sock *sk)
 		__skb_unlink(skb, &tp->out_of_order_queue);
 		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
 			SOCK_DEBUG(sk, "ofo packet was already received\n");
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			continue;
 		}
 		SOCK_DEBUG(sk, "ofo requeuing : rcv_next %X seq %X - %X\n",
@@ -4372,7 +4378,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 
 	if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFODROP);
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 		return;
 	}
 
@@ -4436,7 +4442,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
 			/* All the bits are present. Drop. */
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			skb = NULL;
 			tcp_dsack_set(sk, seq, end_seq);
 			goto add_sack;
@@ -4475,7 +4481,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
 				 TCP_SKB_CB(skb1)->end_seq);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-		__kfree_skb(skb1);
+		tcp_drop(sk, skb1);
 	}
 
 add_sack:
@@ -4558,12 +4564,13 @@ err:
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int eaten = -1;
 	bool fragstolen = false;
+	int eaten = -1;
 
-	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-		goto drop;
-
+	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
+		__kfree_skb(skb);
+		return;
+	}
 	skb_dst_drop(skb);
 	__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
@@ -4645,7 +4652,7 @@ out_of_window:
 		tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
 		inet_csk_schedule_ack(sk);
 drop:
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 		return;
 	}
 
@@ -5220,7 +5227,7 @@ syn_challenge:
 	return true;
 
 discard:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 	return false;
 }
 
@@ -5438,7 +5445,7 @@ csum_error:
 	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
 
 discard:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
@@ -5668,7 +5675,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 						  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			return 0;
 		} else {
 			tcp_send_ack(sk);
@@ -6025,7 +6032,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 
 	if (!queued) {
 discard:
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 	}
 	return 0;
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index eeda67c..01715fc 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1716,6 +1716,7 @@ discard_it:
 	return 0;
 
 discard_and_relse:
+	sk_drops_add(sk, skb);
 	sock_put(sk);
 	goto discard_it;
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 90abe88..d6c1911 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1505,6 +1505,7 @@ discard_it:
 	return 0;
 
 discard_and_relse:
+	sk_drops_add(sk, skb);
 	sock_put(sk);
 	goto discard_it;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue
  2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
                   ` (2 preceding siblings ...)
  2018-08-16  2:50 ` [PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets Mao Wenan
@ 2018-08-16  2:50 ` Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue() Mao Wenan
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

From: Yaogong Wang <wygivan@google.com>

[ Upstream commit 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ]

Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
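(Roughly: a 16-bit window of 65535 shifted left by 14 is about 1.07 GB,
and 1.07e9 / 1448 is about 740,000 segments, assuming a typical 1448-byte
MSS.)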

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/linux/skbuff.h   |   8 ++
 include/linux/tcp.h      |   7 +-
 include/net/tcp.h        |   2 +-
 net/core/skbuff.c        |  19 +++
 net/ipv4/tcp.c           |   4 +-
 net/ipv4/tcp_input.c     | 354 +++++++++++++++++++++++++++--------------------
 net/ipv4/tcp_ipv4.c      |   2 +-
 net/ipv4/tcp_minisocks.c |   1 -
 8 files changed, 241 insertions(+), 156 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c28bd8b..a490dd7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2273,6 +2273,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
+void skb_rbtree_purge(struct rb_root *root);
+
 void *netdev_alloc_frag(unsigned int fragsz);
 
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length,
@@ -2807,6 +2809,12 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, unsigned int len)
 	return __pskb_trim(skb, len);
 }
 
+#define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
 #define skb_queue_walk(queue, skb) \
 		for (skb = (queue)->next;					\
 		     skb != (struct sk_buff *)(queue);				\
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5b6df1a..747404d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -279,10 +279,9 @@ struct tcp_sock {
 	struct sk_buff* lost_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 
-	/* OOO segments go in this list. Note that socket lock must be held,
-	 * as we do not use sk_buff_head lock.
-	 */
-	struct sk_buff_head	out_of_order_queue;
+	/* OOO segments go in this rbtree. Socket lock must be held. */
+	struct rb_root	out_of_order_queue;
+	struct sk_buff	*ooo_last_skb; /* cache rb_last(out_of_order_queue) */
 
 	/* SACKs data, these 2 need to be together (see tcp_options_write) */
 	struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cac4a6ad..8bc259d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -649,7 +649,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (skb_queue_empty(&tp->out_of_order_queue) &&
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
 	    tp->rcv_wnd &&
 	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
 	    !tp->urg_data)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 55be076..9703924 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2378,6 +2378,25 @@ void skb_queue_purge(struct sk_buff_head *list)
 EXPORT_SYMBOL(skb_queue_purge);
 
 /**
+ *	skb_rbtree_purge - empty a skb rbtree
+ *	@root: root of the rbtree to empty
+ *
+ *	Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
+ *	the list and one reference dropped. This function does not take
+ *	any lock. Synchronization should be handled by the caller (e.g., TCP
+ *	out-of-order queue is protected by the socket lock).
+ */
+void skb_rbtree_purge(struct rb_root *root)
+{
+	struct sk_buff *skb, *next;
+
+	rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
+		kfree_skb(skb);
+
+	*root = RB_ROOT;
+}
+
+/**
  *	skb_queue_head - queue a buffer at the list head
  *	@list: list to use
  *	@newsk: buffer to queue
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a0f0a7d..8bd2874 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -382,7 +382,7 @@ void tcp_init_sock(struct sock *sk)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	__skb_queue_head_init(&tp->out_of_order_queue);
+	tp->out_of_order_queue = RB_ROOT;
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
 	INIT_LIST_HEAD(&tp->tsq_node);
@@ -2240,7 +2240,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tcp_clear_xmit_timers(sk);
 	__skb_queue_purge(&sk->sk_receive_queue);
 	tcp_write_queue_purge(sk);
-	__skb_queue_purge(&tp->out_of_order_queue);
+	skb_rbtree_purge(&tp->out_of_order_queue);
 
 	inet->inet_dport = 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5fb4e80..12edc4f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4073,7 +4073,7 @@ static void tcp_fin(struct sock *sk)
 	/* It _is_ possible, that we have something out-of-order _after_ FIN.
 	 * Probably, we should reset in this case. For now drop them.
 	 */
-	__skb_queue_purge(&tp->out_of_order_queue);
+	skb_rbtree_purge(&tp->out_of_order_queue);
 	if (tcp_is_sack(tp))
 		tcp_sack_reset(&tp->rx_opt);
 	sk_mem_reclaim(sk);
@@ -4233,7 +4233,7 @@ static void tcp_sack_remove(struct tcp_sock *tp)
 	int this_sack;
 
 	/* Empty ofo queue, hence, all the SACKs are eaten. Clear. */
-	if (skb_queue_empty(&tp->out_of_order_queue)) {
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
 		tp->rx_opt.num_sacks = 0;
 		return;
 	}
@@ -4309,10 +4309,13 @@ static void tcp_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 dsack_high = tp->rcv_nxt;
+	bool fin, fragstolen, eaten;
 	struct sk_buff *skb, *tail;
-	bool fragstolen, eaten;
+	struct rb_node *p;
 
-	while ((skb = skb_peek(&tp->out_of_order_queue)) != NULL) {
+	p = rb_first(&tp->out_of_order_queue);
+	while (p) {
+		skb = rb_entry(p, struct sk_buff, rbnode);
 		if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
 			break;
 
@@ -4322,9 +4325,10 @@ static void tcp_ofo_queue(struct sock *sk)
 				dsack_high = TCP_SKB_CB(skb)->end_seq;
 			tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
 		}
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, &tp->out_of_order_queue);
 
-		__skb_unlink(skb, &tp->out_of_order_queue);
-		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
+		if (unlikely(!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))) {
 			SOCK_DEBUG(sk, "ofo packet was already received\n");
 			tcp_drop(sk, skb);
 			continue;
@@ -4336,12 +4340,19 @@ static void tcp_ofo_queue(struct sock *sk)
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
 		tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
+		fin = TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN;
 		if (!eaten)
 			__skb_queue_tail(&sk->sk_receive_queue, skb);
-		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
-			tcp_fin(sk);
-		if (eaten)
+		else
 			kfree_skb_partial(skb, fragstolen);
+
+		if (unlikely(fin)) {
+			tcp_fin(sk);
+			/* tcp_fin() purges tp->out_of_order_queue,
+			 * so we must end this loop right now.
+			 */
+			break;
+		}
 	}
 }
 
@@ -4371,8 +4382,10 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct rb_node **p, *q, *parent;
 	struct sk_buff *skb1;
 	u32 seq, end_seq;
+	bool fragstolen;
 
 	tcp_ecn_check_ce(sk, skb);
 
@@ -4387,88 +4400,86 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	inet_csk_schedule_ack(sk);
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOQUEUE);
+	seq = TCP_SKB_CB(skb)->seq;
+	end_seq = TCP_SKB_CB(skb)->end_seq;
 	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
-		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
+		   tp->rcv_nxt, seq, end_seq);
 
-	skb1 = skb_peek_tail(&tp->out_of_order_queue);
-	if (!skb1) {
+	p = &tp->out_of_order_queue.rb_node;
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
 		/* Initial out of order segment, build 1 SACK. */
 		if (tcp_is_sack(tp)) {
 			tp->rx_opt.num_sacks = 1;
-			tp->selective_acks[0].start_seq = TCP_SKB_CB(skb)->seq;
-			tp->selective_acks[0].end_seq =
-						TCP_SKB_CB(skb)->end_seq;
+			tp->selective_acks[0].start_seq = seq;
+			tp->selective_acks[0].end_seq = end_seq;
 		}
-		__skb_queue_head(&tp->out_of_order_queue, skb);
+		rb_link_node(&skb->rbnode, NULL, p);
+		rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);
+		tp->ooo_last_skb = skb;
 		goto end;
 	}
 
-	seq = TCP_SKB_CB(skb)->seq;
-	end_seq = TCP_SKB_CB(skb)->end_seq;
-
-	if (seq == TCP_SKB_CB(skb1)->end_seq) {
-		bool fragstolen;
-
-		if (!tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
-			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
-		} else {
-			tcp_grow_window(sk, skb);
-			kfree_skb_partial(skb, fragstolen);
-			skb = NULL;
+	/* In the typical case, we are adding an skb to the end of the list.
+	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
+	 */
+	if (tcp_try_coalesce(sk, tp->ooo_last_skb, skb, &fragstolen)) {
+coalesce_done:
+		tcp_grow_window(sk, skb);
+		kfree_skb_partial(skb, fragstolen);
+		skb = NULL;
+		goto add_sack;
+	}
+
+	/* Find place to insert this segment. Handle overlaps on the way. */
+	parent = NULL;
+	while (*p) {
+		parent = *p;
+		skb1 = rb_entry(parent, struct sk_buff, rbnode);
+		if (before(seq, TCP_SKB_CB(skb1)->seq)) {
+			p = &parent->rb_left;
+			continue;
 		}
 
-		if (!tp->rx_opt.num_sacks ||
-		    tp->selective_acks[0].end_seq != seq)
-			goto add_sack;
-
-		/* Common case: data arrive in order after hole. */
-		tp->selective_acks[0].end_seq = end_seq;
-		goto end;
-	}
-
-	/* Find place to insert this segment. */
-	while (1) {
-		if (!after(TCP_SKB_CB(skb1)->seq, seq))
-			break;
-		if (skb_queue_is_first(&tp->out_of_order_queue, skb1)) {
-			skb1 = NULL;
-			break;
+		if (before(seq, TCP_SKB_CB(skb1)->end_seq)) {
+			if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
+				/* All the bits are present. Drop. */
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPOFOMERGE);
+				__kfree_skb(skb);
+				skb = NULL;
+				tcp_dsack_set(sk, seq, end_seq);
+				goto add_sack;
+			}
+			if (after(seq, TCP_SKB_CB(skb1)->seq)) {
+				/* Partial overlap. */
+				tcp_dsack_set(sk, seq, TCP_SKB_CB(skb1)->end_seq);
+			} else {
+				/* skb's seq == skb1's seq and skb covers skb1.
+				 * Replace skb1 with skb.
+				 */
+				rb_replace_node(&skb1->rbnode, &skb->rbnode,
+						&tp->out_of_order_queue);
+				tcp_dsack_extend(sk,
+						 TCP_SKB_CB(skb1)->seq,
+						 TCP_SKB_CB(skb1)->end_seq);
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPOFOMERGE);
+				__kfree_skb(skb1);
+				goto add_sack;
+			}
+		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
+			goto coalesce_done;
 		}
-		skb1 = skb_queue_prev(&tp->out_of_order_queue, skb1);
+		p = &parent->rb_right;
 	}
 
-	/* Do skb overlap to previous one? */
-	if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-		if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-			/* All the bits are present. Drop. */
-			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-			tcp_drop(sk, skb);
-			skb = NULL;
-			tcp_dsack_set(sk, seq, end_seq);
-			goto add_sack;
-		}
-		if (after(seq, TCP_SKB_CB(skb1)->seq)) {
-			/* Partial overlap. */
-			tcp_dsack_set(sk, seq,
-				      TCP_SKB_CB(skb1)->end_seq);
-		} else {
-			if (skb_queue_is_first(&tp->out_of_order_queue,
-					       skb1))
-				skb1 = NULL;
-			else
-				skb1 = skb_queue_prev(
-					&tp->out_of_order_queue,
-					skb1);
-		}
-	}
-	if (!skb1)
-		__skb_queue_head(&tp->out_of_order_queue, skb);
-	else
-		__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+	/* Insert segment into RB tree. */
+	rb_link_node(&skb->rbnode, parent, p);
+	rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);
 
-	/* And clean segments covered by new one as whole. */
-	while (!skb_queue_is_last(&tp->out_of_order_queue, skb)) {
-		skb1 = skb_queue_next(&tp->out_of_order_queue, skb);
+	/* Remove other segments covered by skb. */
+	while ((q = rb_next(&skb->rbnode)) != NULL) {
+		skb1 = rb_entry(q, struct sk_buff, rbnode);
 
 		if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
 			break;
@@ -4477,12 +4488,15 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 					 end_seq);
 			break;
 		}
-		__skb_unlink(skb1, &tp->out_of_order_queue);
+		rb_erase(&skb1->rbnode, &tp->out_of_order_queue);
 		tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
 				 TCP_SKB_CB(skb1)->end_seq);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
 		tcp_drop(sk, skb1);
 	}
+	/* If there is no skb after us, we are the last_skb ! */
+	if (!q)
+		tp->ooo_last_skb = skb;
 
 add_sack:
 	if (tcp_is_sack(tp))
@@ -4621,13 +4635,13 @@ queue_and_out:
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
 			tcp_fin(sk);
 
-		if (!skb_queue_empty(&tp->out_of_order_queue)) {
+		if (!RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
 			tcp_ofo_queue(sk);
 
 			/* RFC2581. 4.2. SHOULD send immediate ACK, when
 			 * gap in queue is filled.
 			 */
-			if (skb_queue_empty(&tp->out_of_order_queue))
+			if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
 				inet_csk(sk)->icsk_ack.pingpong = 0;
 		}
 
@@ -4679,48 +4693,76 @@ drop:
 	tcp_data_queue_ofo(sk, skb);
 }
 
+static struct sk_buff *tcp_skb_next(struct sk_buff *skb, struct sk_buff_head *list)
+{
+	if (list)
+		return !skb_queue_is_last(list, skb) ? skb->next : NULL;
+
+	return rb_entry_safe(rb_next(&skb->rbnode), struct sk_buff, rbnode);
+}
+
 static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
-					struct sk_buff_head *list)
+					struct sk_buff_head *list,
+					struct rb_root *root)
 {
-	struct sk_buff *next = NULL;
+	struct sk_buff *next = tcp_skb_next(skb, list);
 
-	if (!skb_queue_is_last(list, skb))
-		next = skb_queue_next(list, skb);
+	if (list)
+		__skb_unlink(skb, list);
+	else
+		rb_erase(&skb->rbnode, root);
 
-	__skb_unlink(skb, list);
 	__kfree_skb(skb);
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
 
 	return next;
 }
 
+/* Insert skb into rb tree, ordered by TCP_SKB_CB(skb)->seq */
+static void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct sk_buff *skb1;
+
+	while (*p) {
+		parent = *p;
+		skb1 = rb_entry(parent, struct sk_buff, rbnode);
+		if (before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb1)->seq))
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+	}
+	rb_link_node(&skb->rbnode, parent, p);
+	rb_insert_color(&skb->rbnode, root);
+}
+
 /* Collapse contiguous sequence of skbs head..tail with
  * sequence numbers start..end.
  *
- * If tail is NULL, this means until the end of the list.
+ * If tail is NULL, this means until the end of the queue.
  *
  * Segments with FIN/SYN are not collapsed (only because this
  * simplifies code)
  */
 static void
-tcp_collapse(struct sock *sk, struct sk_buff_head *list,
-	     struct sk_buff *head, struct sk_buff *tail,
-	     u32 start, u32 end)
+tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
+	     struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end)
 {
-	struct sk_buff *skb, *n;
+	struct sk_buff *skb = head, *n;
+	struct sk_buff_head tmp;
 	bool end_of_skbs;
 
 	/* First, check that queue is collapsible and find
-	 * the point where collapsing can be useful. */
-	skb = head;
+	 * the point where collapsing can be useful.
+	 */
 restart:
-	end_of_skbs = true;
-	skb_queue_walk_from_safe(list, skb, n) {
-		if (skb == tail)
-			break;
+	for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) {
+		n = tcp_skb_next(skb, list);
+
 		/* No new bits? It is possible on ofo queue. */
 		if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
-			skb = tcp_collapse_one(sk, skb, list);
+			skb = tcp_collapse_one(sk, skb, list, root);
 			if (!skb)
 				break;
 			goto restart;
@@ -4738,13 +4780,10 @@ restart:
 			break;
 		}
 
-		if (!skb_queue_is_last(list, skb)) {
-			struct sk_buff *next = skb_queue_next(list, skb);
-			if (next != tail &&
-			    TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(next)->seq) {
-				end_of_skbs = false;
-				break;
-			}
+		if (n && n != tail &&
+		    TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) {
+			end_of_skbs = false;
+			break;
 		}
 
 		/* Decided to skip this, advance start seq. */
@@ -4754,17 +4793,22 @@ restart:
 	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
 		return;
 
+	__skb_queue_head_init(&tmp);
+
 	while (before(start, end)) {
 		int copy = min_t(int, SKB_MAX_ORDER(0, 0), end - start);
 		struct sk_buff *nskb;
 
 		nskb = alloc_skb(copy, GFP_ATOMIC);
 		if (!nskb)
-			return;
+			break;
 
 		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
 		TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
-		__skb_queue_before(list, skb, nskb);
+		if (list)
+			__skb_queue_before(list, skb, nskb);
+		else
+			__skb_queue_tail(&tmp, nskb); /* defer rbtree insertion */
 		skb_set_owner_r(nskb, sk);
 
 		/* Copy data, releasing collapsed skbs. */
@@ -4782,14 +4826,17 @@ restart:
 				start += size;
 			}
 			if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
-				skb = tcp_collapse_one(sk, skb, list);
+				skb = tcp_collapse_one(sk, skb, list, root);
 				if (!skb ||
 				    skb == tail ||
 				    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
-					return;
+					goto end;
 			}
 		}
 	}
+end:
+	skb_queue_walk_safe(&tmp, skb, n)
+		tcp_rbtree_insert(root, skb);
 }
 
 /* Collapse ofo queue. Algorithm: select contiguous sequence of skbs
@@ -4798,43 +4845,43 @@ restart:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
-	struct sk_buff *head;
+	struct sk_buff *skb, *head;
+	struct rb_node *p;
 	u32 start, end;
 
-	if (!skb)
+	p = rb_first(&tp->out_of_order_queue);
+	skb = rb_entry_safe(p, struct sk_buff, rbnode);
+new_range:
+	if (!skb) {
+		p = rb_last(&tp->out_of_order_queue);
+		/* Note: This is possible p is NULL here. We do not
+		 * use rb_entry_safe(), as ooo_last_skb is valid only
+		 * if rbtree is not empty.
+		 */
+		tp->ooo_last_skb = rb_entry(p, struct sk_buff, rbnode);
 		return;
-
+	}
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
-	head = skb;
-
-	for (;;) {
-		struct sk_buff *next = NULL;
 
-		if (!skb_queue_is_last(&tp->out_of_order_queue, skb))
-			next = skb_queue_next(&tp->out_of_order_queue, skb);
-		skb = next;
+	for (head = skb;;) {
+		skb = tcp_skb_next(skb, NULL);
 
-		/* Segment is terminated when we see gap or when
-		 * we are at the end of all the queue. */
+		/* Range is terminated when we see a gap or when
+		 * we are at the queue end.
+		 */
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			tcp_collapse(sk, &tp->out_of_order_queue,
+			tcp_collapse(sk, NULL, &tp->out_of_order_queue,
 				     head, skb, start, end);
-			head = skb;
-			if (!skb)
-				break;
-			/* Start new segment */
+			goto new_range;
+		}
+
+		if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
 			start = TCP_SKB_CB(skb)->seq;
+		if (after(TCP_SKB_CB(skb)->end_seq, end))
 			end = TCP_SKB_CB(skb)->end_seq;
-		} else {
-			if (before(TCP_SKB_CB(skb)->seq, start))
-				start = TCP_SKB_CB(skb)->seq;
-			if (after(TCP_SKB_CB(skb)->end_seq, end))
-				end = TCP_SKB_CB(skb)->end_seq;
-		}
 	}
 }
 
@@ -4845,23 +4892,36 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 static bool tcp_prune_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	bool res = false;
+	struct rb_node *node, *prev;
 
-	if (!skb_queue_empty(&tp->out_of_order_queue)) {
-		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-		__skb_queue_purge(&tp->out_of_order_queue);
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
+		return false;
 
-		/* Reset SACK state.  A conforming SACK implementation will
-		 * do the same at a timeout based retransmit.  When a connection
-		 * is in a sad state like this, we care only about integrity
-		 * of the connection not performance.
-		 */
-		if (tp->rx_opt.sack_ok)
-			tcp_sack_reset(&tp->rx_opt);
+	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
+
+	node = &tp->ooo_last_skb->rbnode;
+	do {
+		prev = rb_prev(node);
+		rb_erase(node, &tp->out_of_order_queue);
+		__kfree_skb(rb_to_skb(node));
 		sk_mem_reclaim(sk);
-		res = true;
-	}
-	return res;
+                if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+                    !tcp_under_memory_pressure(sk))
+                        break;
+
+		node = prev;
+	} while (node);
+	tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
+
+	/* Reset SACK state.  A conforming SACK implementation will
+	 * do the same at a timeout based retransmit.  When a connection
+	 * is in a sad state like this, we care only about integrity
+	 * of the connection not performance.
+	 */
+	if (tp->rx_opt.sack_ok)
+		tcp_sack_reset(&tp->rx_opt);
+
+	return true;
 }
 
 /* Reduce allocated memory if we can, trying to get
@@ -4886,7 +4946,7 @@ static int tcp_prune_queue(struct sock *sk)
 
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
-		tcp_collapse(sk, &sk->sk_receive_queue,
+		tcp_collapse(sk, &sk->sk_receive_queue, NULL,
 			     skb_peek(&sk->sk_receive_queue),
 			     NULL,
 			     tp->copied_seq, tp->rcv_nxt);
@@ -4991,7 +5051,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 	    /* We ACK each frame or... */
 	    tcp_in_quickack_mode(sk) ||
 	    /* We have out of order data. */
-	    (ofo_possible && skb_peek(&tp->out_of_order_queue))) {
+	    (ofo_possible && !RB_EMPTY_ROOT(&tp->out_of_order_queue))) {
 		/* Then ack it now */
 		tcp_send_ack(sk);
 	} else {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 01715fc..ee8399f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1830,7 +1830,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 	tcp_write_queue_purge(sk);
 
 	/* Cleans up our, hopefully empty, out_of_order_queue. */
-	__skb_queue_purge(&tp->out_of_order_queue);
+	skb_rbtree_purge(&tp->out_of_order_queue);
 
 #ifdef CONFIG_TCP_MD5SIG
 	/* Clean up the MD5 key list, if any */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 4c1c94f..81c633d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -495,7 +495,6 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 		newtp->snd_cwnd_cnt = 0;
 
 		tcp_init_xmit_timers(newsk);
-		__skb_queue_head_init(&newtp->out_of_order_queue);
 		newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
 
 		newtp->rx_opt.saw_tstamp = 0;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue()
  2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
                   ` (3 preceding siblings ...)
  2018-08-16  2:50 ` [PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue Mao Wenan
@ 2018-08-16  2:50 ` Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible Mao Wenan
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 72cd43ba64fc172a443410ce01645895850844c8 ]

Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.

Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.

Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, we cannot revert to the old behavior without great pain.

Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
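(That is the sk->sk_rcvbuf >> 3 goal in the code below; with the 6 MB
tcp_rmem[2] default mentioned above this frees roughly 768 kB of truesize
per pruning pass.)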

Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 12edc4f..32225dc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4893,22 +4893,26 @@ static bool tcp_prune_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct rb_node *node, *prev;
+	int goal;
 
 	if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
 		return false;
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-
+	goal = sk->sk_rcvbuf >> 3;
 	node = &tp->ooo_last_skb->rbnode;
 	do {
 		prev = rb_prev(node);
 		rb_erase(node, &tp->out_of_order_queue);
+		goal -= rb_to_skb(node)->truesize;
 		__kfree_skb(rb_to_skb(node));
-		sk_mem_reclaim(sk);
-                if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-                    !tcp_under_memory_pressure(sk))
-                        break;
-
+		if (!prev || goal <= 0) {
+			sk_mem_reclaim(sk);
+			if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+			    !tcp_under_memory_pressure(sk))
+				break;
+			goal = sk->sk_rcvbuf >> 3;
+		}
 		node = prev;
 	} while (node);
 	tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible
  2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
                   ` (4 preceding siblings ...)
  2018-08-16  2:50 ` [PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue() Mao Wenan
@ 2018-08-16  2:50 ` Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue() Mao Wenan
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

Right after a TCP flow is created, receiving tiny out of order
packets always hit the condition:

if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
	tcp_clamp_window(sk);

tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])

Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.

Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 32225dc..77130ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4948,6 +4948,9 @@ static int tcp_prune_queue(struct sock *sk)
 	else if (tcp_under_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
+	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+		return 0;
+
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		tcp_collapse(sk, &sk->sk_receive_queue, NULL,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue()
  2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
                   ` (5 preceding siblings ...)
  2018-08-16  2:50 ` [PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible Mao Wenan
@ 2018-08-16  2:50 ` Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo() Mao Wenan
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.

1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.

We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.

In the future, we might add the possibility of terminating flows
that are proven to be malicious.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 77130ae..c48924f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4845,6 +4845,7 @@ end:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb, *head;
 	struct rb_node *p;
 	u32 start, end;
@@ -4863,6 +4864,7 @@ new_range:
 	}
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
+	range_truesize = skb->truesize;
 
 	for (head = skb;;) {
 		skb = tcp_skb_next(skb, NULL);
@@ -4873,11 +4875,21 @@ new_range:
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			tcp_collapse(sk, NULL, &tp->out_of_order_queue,
-				     head, skb, start, end);
+			/* Do not attempt collapsing tiny skbs */
+			if (range_truesize != head->truesize ||
+			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
+				tcp_collapse(sk, NULL, &tp->out_of_order_queue,
+					     head, skb, start, end);
+			} else {
+				sum_tiny += range_truesize;
+				if (sum_tiny > sk->sk_rcvbuf >> 3)
+					return;
+			}
+
 			goto new_range;
 		}
 
+		range_truesize += skb->truesize;
 		if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
 			start = TCP_SKB_CB(skb)->seq;
 		if (after(TCP_SKB_CB(skb)->end_seq, end))
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo()
  2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
                   ` (6 preceding siblings ...)
  2018-08-16  2:50 ` [PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue() Mao Wenan
@ 2018-08-16  2:50 ` Mao Wenan
  2018-08-16  2:50 ` [PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper Mao Wenan
  2018-08-16  6:16 ` [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Michal Kubecek
  9 siblings, 0 replies; 28+ messages in thread
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 8541b21e781a22dce52a74fef0b9bed00404a1cd ] 

In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.

Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c48924f..96a1e0d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4445,7 +4445,7 @@ coalesce_done:
 				/* All the bits are present. Drop. */
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPOFOMERGE);
-				__kfree_skb(skb);
+				tcp_drop(sk, skb);
 				skb = NULL;
 				tcp_dsack_set(sk, seq, end_seq);
 				goto add_sack;
@@ -4464,7 +4464,7 @@ coalesce_done:
 						 TCP_SKB_CB(skb1)->end_seq);
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPOFOMERGE);
-				__kfree_skb(skb1);
+				tcp_drop(sk, skb1);
 				goto add_sack;
 			}
 		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper
  2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
                   ` (7 preceding siblings ...)
  2018-08-16  2:50 ` [PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo() Mao Wenan
@ 2018-08-16  2:50 ` Mao Wenan
  2018-08-16  6:16 ` [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Michal Kubecek
  9 siblings, 0 replies; 28+ messages in thread
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 58152ecbbcc6a0ce7fddd5bf5f6ee535834ece0c ]

In case an skb in the out_of_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.

I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a1e0d..fdb5509 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,23 @@ static bool tcp_try_coalesce(struct sock *sk,
 	return true;
 }
 
+static bool tcp_ooo_try_coalesce(struct sock *sk,
+			     struct sk_buff *to,
+			     struct sk_buff *from,
+			     bool *fragstolen)
+{
+	bool res = tcp_try_coalesce(sk, to, from, fragstolen);
+
+	/* In case tcp_drop() is called later, update to->gso_segs */
+	if (res) {
+		u32 gso_segs = max_t(u16, 1, skb_shinfo(to)->gso_segs) +
+			       max_t(u16, 1, skb_shinfo(from)->gso_segs);
+
+		skb_shinfo(to)->gso_segs = min_t(u32, gso_segs, 0xFFFF);
+	}
+	return res;
+}
+
 static void tcp_drop(struct sock *sk, struct sk_buff *skb)
 {
 	sk_drops_add(sk, skb);
@@ -4422,7 +4439,8 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	/* In the typical case, we are adding an skb to the end of the list.
 	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
 	 */
-	if (tcp_try_coalesce(sk, tp->ooo_last_skb, skb, &fragstolen)) {
+	if (tcp_ooo_try_coalesce(sk, tp->ooo_last_skb,
+				 skb, &fragstolen)) {
 coalesce_done:
 		tcp_grow_window(sk, skb);
 		kfree_skb_partial(skb, fragstolen);
@@ -4467,7 +4485,8 @@ coalesce_done:
 				tcp_drop(sk, skb1);
 				goto add_sack;
 			}
-		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
+		} else if (tcp_ooo_try_coalesce(sk, skb1,
+						skb, &fragstolen)) {
 			goto coalesce_done;
 		}
 		p = &parent->rb_right;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
                   ` (8 preceding siblings ...)
  2018-08-16  2:50 ` [PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper Mao Wenan
@ 2018-08-16  6:16 ` Michal Kubecek
  2018-08-16  6:42   ` maowenan
  9 siblings, 1 reply; 28+ messages in thread
From: Michal Kubecek @ 2018-08-16  6:16 UTC (permalink / raw)
  To: Mao Wenan
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai

On Thu, Aug 16, 2018 at 10:50:01AM +0800, Mao Wenan wrote:
> Five patches fix CVE-2018-5390 in the latest mainline branch, but only
> two of them are present in stable 4.4 and 3.18:
> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
> I tested a stable 4.4 kernel with these two patches and found that CPU
> usage was still very high, so they alone do not fix CVE-2018-5390.
> Test results:
> with fix patches:     78.2%   ksoftirqd
> without fix patches:  90%     ksoftirqd
> 
> I then tried to imitate 72cd43ba ("tcp: free batches of packets in
> tcp_prune_ofo_queue()") and drop at least 12.5% of sk_rcvbuf to avoid
> malicious attacks while keeping the simple queue instead of an RB tree.
> The result was not good.
> 
> After analysing the stable 4.4 code and profiling the system, it turned
> out that searching the ofo queue (TCP ofo still uses a simple linked
> list there) costs most of the cycles.
> 
> So I backported "tcp: use an RB tree for ooo receive queue" to replace
> the simple queue with an RB tree, and then backported Eric Dumazet's
> five fix patches from mainline. The good news is that ksoftirqd drops
> to about 20%, which matches mainline.
> 
> Stable 4.4 has already backported two patches:
> f4a3313d ("tcp: avoid collapses in tcp_prune_queue() if possible")
> 3d4bf93a ("tcp: detect malicious patterns in tcp_collapse_ofo_queue()")
> To switch the simple queue to an RB tree and resolve this for good, we
> must first apply 9f5afeae ("tcp: use an RB tree for ooo receive queue"),
> but that patch conflicts heavily with 3d4bf93a and f4a3313d, which are
> part of Eric's mainline series fixing CVE-2018-5390. Therefore this
> series first reverts those two patches in stable 4.4, then applies
> 9f5afeae, and finally reapplies the five patches from Eric.

There seems to be an obvious mistake in one of the backports. Could you
check the results with Takashi's follow-up fix submitted at

  http://lkml.kernel.org/r/20180815095846.7734-1-tiwai@suse.de

(I would try myself but you don't mention what test you ran.)

Michal Kubecek

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16  6:16 ` [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Michal Kubecek
@ 2018-08-16  6:42   ` maowenan
  2018-08-16  6:52     ` Michal Kubecek
  0 siblings, 1 reply; 28+ messages in thread
From: maowenan @ 2018-08-16  6:42 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai



On 2018/8/16 14:16, Michal Kubecek wrote:
> On Thu, Aug 16, 2018 at 10:50:01AM +0800, Mao Wenan wrote:
>> Five patches fix CVE-2018-5390 in the latest mainline branch, but only
>> two of them are present in stable 4.4 and 3.18:
>> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
>> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
>> I tested a stable 4.4 kernel with these two patches and found that CPU
>> usage was still very high, so they alone do not fix CVE-2018-5390.
>> Test results:
>> with fix patches:     78.2%   ksoftirqd
>> without fix patches:  90%     ksoftirqd
>>
>> I then tried to imitate 72cd43ba ("tcp: free batches of packets in
>> tcp_prune_ofo_queue()") and drop at least 12.5% of sk_rcvbuf to avoid
>> malicious attacks while keeping the simple queue instead of an RB tree.
>> The result was not good.
>>
>> After analysing the stable 4.4 code and profiling the system, it turned
>> out that searching the ofo queue (TCP ofo still uses a simple linked
>> list there) costs most of the cycles.
>>
>> So I backported "tcp: use an RB tree for ooo receive queue" to replace
>> the simple queue with an RB tree, and then backported Eric Dumazet's
>> five fix patches from mainline. The good news is that ksoftirqd drops
>> to about 20%, which matches mainline.
>>
>> Stable 4.4 has already backported two patches:
>> f4a3313d ("tcp: avoid collapses in tcp_prune_queue() if possible")
>> 3d4bf93a ("tcp: detect malicious patterns in tcp_collapse_ofo_queue()")
>> To switch the simple queue to an RB tree and resolve this for good, we
>> must first apply 9f5afeae ("tcp: use an RB tree for ooo receive queue"),
>> but that patch conflicts heavily with 3d4bf93a and f4a3313d, which are
>> part of Eric's mainline series fixing CVE-2018-5390. Therefore this
>> series first reverts those two patches in stable 4.4, then applies
>> 9f5afeae, and finally reapplies the five patches from Eric.
> 
> There seems to be an obvious mistake in one of the backports. Could you
> check the results with Takashi's follow-up fix submitted at
> 
>   http://lkml.kernel.org/r/20180815095846.7734-1-tiwai@suse.de
> 
> (I would try myself but you don't mention what test you ran.)

I have backported the RB tree to stable 4.4; tcp_collapse_ofo_queue() has been reworked there and now matches
mainline, so there should be no problem applying Eric's patch 3d4bf93a (tcp: detect malicious patterns in tcp_collapse_ofo_queue()) on top of it.

I also noticed that range_truesize != head->truesize will always be false, as mentioned at your URL, but that only holds for the current stable 4.4 code.
Once the RB tree patch 9f5afeae (tcp: use an RB tree for ooo receive queue) is applied and 3d4bf93a is applied after it, the code does
range_truesize += skb->truesize, and range_truesize != head->truesize can become true.

The POC program (named smack-for-ficora) sends a large number of out-of-order packets to an existing tcp connection; the system's cpu usage is then checked with the top command.

> 
> Michal Kubecek
> 
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16  6:42   ` maowenan
@ 2018-08-16  6:52     ` Michal Kubecek
  2018-08-16  7:19       ` maowenan
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Kubecek @ 2018-08-16  6:52 UTC (permalink / raw)
  To: maowenan
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai

On Thu, Aug 16, 2018 at 02:42:32PM +0800, maowenan wrote:
> On 2018/8/16 14:16, Michal Kubecek wrote:
> > On Thu, Aug 16, 2018 at 10:50:01AM +0800, Mao Wenan wrote:
> >> There are five patches to fix CVE-2018-5390 in latest mainline 
> >> branch, but only two patches exist in stable 4.4 and 3.18: 
> >> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
> >> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
> >> I have tested with stable 4.4 kernel, and found the cpu usage was very high.
> >> So I think only two patches can't fix the CVE-2018-5390.
> >> test results:
> >> with fix patch:     78.2%   ksoftirqd
> >> withoutfix patch:   90%     ksoftirqd
> >>
> >> Then I try to imitate 72cd43ba(tcp: free batches of packets in tcp_prune_ofo_queue())
> >> to drop at least 12.5 % of sk_rcvbuf to avoid malicious attacks with simple queue 
> >> instead of RB tree. The result is not very well.
> >>  
> >> After analysing the codes of stable 4.4, and debuging the 
> >> system, shows that search of ofo_queue(tcp ofo using a simple queue) cost more cycles.
> >>
> >> So I try to backport "tcp: use an RB tree for ooo receive queue" using RB tree 
> >> instead of simple queue, then backport Eric Dumazet 5 fixed patches in mainline,
> >> good news is that ksoftirqd is turn to about 20%, which is the same with mainline now.
> >>
> >> Stable 4.4 have already back port two patches, 
> >> f4a3313d(tcp: avoid collapses in tcp_prune_queue() if possible)
> >> 3d4bf93a(tcp: detect malicious patterns in tcp_collapse_ofo_queue())
> >> If we want to change simple queue to RB tree to finally resolve, we should apply previous 
> >> patch 9f5afeae(tcp: use an RB tree for ooo receive queue.) firstly, but 9f5afeae have many 
> >> conflicts with 3d4bf93a and f4a3313d, which are part of patch series from Eric in 
> >> mainline to fix CVE-2018-5390, so I need revert part of patches in stable 4.4 firstly, 
> >> then apply 9f5afeae, and reapply five patches from Eric.
> > 
> > There seems to be an obvious mistake in one of the backports. Could you
> > check the results with Takashi's follow-up fix submitted at
> > 
> >   http://lkml.kernel.org/r/20180815095846.7734-1-tiwai@suse.de
> > 
> > (I would try myself but you don't mention what test you ran.)
> 
> I have backport RB tree in stable 4.4, function
> tcp_collapse_ofo_queue() has been refined, which keep the same with
> mainline, so it seems no problem when apply Eric's patch 3d4bf93a(tcp:
> detect malicious patterns in tcp_collapse_ofo_queue()).
> 
> I also noticed that range_truesize != head->truesize will be always
> false which mentioned in your URL, but this only based on stable 4.4's
> codes, If I applied RB tree's patch 9f5afeae(tcp: use an RB tree for
> ooo receive queue), and after apply 3d4bf93a,the codes should be
> range_truesize += skb->truesize, and range_truesize != head->truesize
> can be true.

My point is that backporting all this into stable 4.4 is quite intrusive,
so if we can achieve similar results with a simple fix of an
obvious omission, that would be preferable.

Michal Kubecek

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16  6:52     ` Michal Kubecek
@ 2018-08-16  7:19       ` maowenan
  2018-08-16  7:23         ` Michal Kubecek
  0 siblings, 1 reply; 28+ messages in thread
From: maowenan @ 2018-08-16  7:19 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai



On 2018/8/16 14:52, Michal Kubecek wrote:
> On Thu, Aug 16, 2018 at 02:42:32PM +0800, maowenan wrote:
>> On 2018/8/16 14:16, Michal Kubecek wrote:
>>> On Thu, Aug 16, 2018 at 10:50:01AM +0800, Mao Wenan wrote:
>>>> There are five patches to fix CVE-2018-5390 in latest mainline 
>>>> branch, but only two patches exist in stable 4.4 and 3.18: 
>>>> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
>>>> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
>>>> I have tested with stable 4.4 kernel, and found the cpu usage was very high.
>>>> So I think only two patches can't fix the CVE-2018-5390.
>>>> test results:
>>>> with fix patch:     78.2%   ksoftirqd
>>>> withoutfix patch:   90%     ksoftirqd
>>>>
>>>> Then I try to imitate 72cd43ba(tcp: free batches of packets in tcp_prune_ofo_queue())
>>>> to drop at least 12.5 % of sk_rcvbuf to avoid malicious attacks with simple queue 
>>>> instead of RB tree. The result is not very well.
>>>>  
>>>> After analysing the codes of stable 4.4, and debuging the 
>>>> system, shows that search of ofo_queue(tcp ofo using a simple queue) cost more cycles.
>>>>
>>>> So I try to backport "tcp: use an RB tree for ooo receive queue" using RB tree 
>>>> instead of simple queue, then backport Eric Dumazet 5 fixed patches in mainline,
>>>> good news is that ksoftirqd is turn to about 20%, which is the same with mainline now.
>>>>
>>>> Stable 4.4 have already back port two patches, 
>>>> f4a3313d(tcp: avoid collapses in tcp_prune_queue() if possible)
>>>> 3d4bf93a(tcp: detect malicious patterns in tcp_collapse_ofo_queue())
>>>> If we want to change simple queue to RB tree to finally resolve, we should apply previous 
>>>> patch 9f5afeae(tcp: use an RB tree for ooo receive queue.) firstly, but 9f5afeae have many 
>>>> conflicts with 3d4bf93a and f4a3313d, which are part of patch series from Eric in 
>>>> mainline to fix CVE-2018-5390, so I need revert part of patches in stable 4.4 firstly, 
>>>> then apply 9f5afeae, and reapply five patches from Eric.
>>>
>>> There seems to be an obvious mistake in one of the backports. Could you
>>> check the results with Takashi's follow-up fix submitted at
>>>
>>>   http://lkml.kernel.org/r/20180815095846.7734-1-tiwai@suse.de
>>>
>>> (I would try myself but you don't mention what test you ran.)
>>
>> I have backport RB tree in stable 4.4, function
>> tcp_collapse_ofo_queue() has been refined, which keep the same with
>> mainline, so it seems no problem when apply Eric's patch 3d4bf93a(tcp:
>> detect malicious patterns in tcp_collapse_ofo_queue()).
>>
>> I also noticed that range_truesize != head->truesize will be always
>> false which mentioned in your URL, but this only based on stable 4.4's
>> codes, If I applied RB tree's patch 9f5afeae(tcp: use an RB tree for
>> ooo receive queue), and after apply 3d4bf93a,the codes should be
>> range_truesize += skb->truesize, and range_truesize != head->truesize
>> can be true.
> 
> My point is that backporting all this into stable 4.4 is quite intrusive
> so that if we can achieve similar results with a simple fix of an
> obvious omission, it would be preferrable.

There are five patches in mainline to fix this CVE, and the two of them already in stable 4.4 do not fix it on their own;
the important reason is that 4.4 uses a simple queue while mainline uses an RB tree.

I have tried hard to fix this the easy way, by dropping 12.5% (or some other fraction) of the packets on top of the simple queue,
but the result is not good, so the RB tree is needed; the results I measured with it are what I was hoping for.

If we only backport two patches and they don't fix the issue, I think they don't make much sense.

> 
> Michal Kubecek
> 
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16  7:19       ` maowenan
@ 2018-08-16  7:23         ` Michal Kubecek
  2018-08-16  7:39           ` maowenan
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Kubecek @ 2018-08-16  7:23 UTC (permalink / raw)
  To: maowenan
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai

On Thu, Aug 16, 2018 at 03:19:12PM +0800, maowenan wrote:
> On 2018/8/16 14:52, Michal Kubecek wrote:
> > 
> > My point is that backporting all this into stable 4.4 is quite intrusive
> > so that if we can achieve similar results with a simple fix of an
> > obvious omission, it would be preferrable.
> 
> There are five patches in mainline to fix this CVE, only two patches
> have no effect on stable 4.4, the important reason is 4.4 use simple
> queue but mainline use RB tree.
> 
> I have tried my best to use easy way to fix this with dropping packets
> 12.5%(or other value) based on simple queue, but the result is not
> very well, so the RB tree is needed and tested result is my desire.
> 
> If we only back port two patches but they don't fix the issue, I think
> they don't make any sense.

There is an obvious omission in one of the two patches and Takashi's
patch fixes it. If his follow-up fix (applied on top of what is in
stable 4.4 now) addresses the problem, I would certainly prefer using it
over backporting the whole series.

Michal Kubecek

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16  7:23         ` Michal Kubecek
@ 2018-08-16  7:39           ` maowenan
  2018-08-16  7:44             ` Michal Kubecek
  0 siblings, 1 reply; 28+ messages in thread
From: maowenan @ 2018-08-16  7:39 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai



On 2018/8/16 15:23, Michal Kubecek wrote:
> On Thu, Aug 16, 2018 at 03:19:12PM +0800, maowenan wrote:
>> On 2018/8/16 14:52, Michal Kubecek wrote:
>>>
>>> My point is that backporting all this into stable 4.4 is quite intrusive
>>> so that if we can achieve similar results with a simple fix of an
>>> obvious omission, it would be preferrable.
>>
>> There are five patches in mainline to fix this CVE, only two patches
>> have no effect on stable 4.4, the important reason is 4.4 use simple
>> queue but mainline use RB tree.
>>
>> I have tried my best to use easy way to fix this with dropping packets
>> 12.5%(or other value) based on simple queue, but the result is not
>> very well, so the RB tree is needed and tested result is my desire.
>>
>> If we only back port two patches but they don't fix the issue, I think
>> they don't make any sense.
> 
> There is an obvious omission in one of the two patches and Takashi's
> patch fixes it. If his follow-up fix (applied on top of what is in
> stable 4.4 now) addresses the problem, I would certainly prefer using it
> over backporting the whole series.

Do you mean the code below from Takashi can fix this CVE?
I already tested exactly this two days ago, and the effect was not good.

Could you try testing with the POC program mentioned in my previous mail, in case I made a mistake?

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a261e078082..9c4c6cd0316e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4835,6 +4835,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 			end = TCP_SKB_CB(skb)->end_seq;
 			range_truesize = skb->truesize;
 		} else {
+			range_truesize += skb->truesize;
 			if (before(TCP_SKB_CB(skb)->seq, start))
 				start = TCP_SKB_CB(skb)->seq;
 			if (after(TCP_SKB_CB(skb)->end_seq, end))
-- 


> 
> Michal Kubecek
> 
> 
> .
> 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16  7:39           ` maowenan
@ 2018-08-16  7:44             ` Michal Kubecek
  2018-08-16  7:55               ` maowenan
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Kubecek @ 2018-08-16  7:44 UTC (permalink / raw)
  To: maowenan
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai

On Thu, Aug 16, 2018 at 03:39:14PM +0800, maowenan wrote:
> 
> 
> On 2018/8/16 15:23, Michal Kubecek wrote:
> > On Thu, Aug 16, 2018 at 03:19:12PM +0800, maowenan wrote:
> >> On 2018/8/16 14:52, Michal Kubecek wrote:
> >>>
> >>> My point is that backporting all this into stable 4.4 is quite intrusive
> >>> so that if we can achieve similar results with a simple fix of an
> >>> obvious omission, it would be preferrable.
> >>
> >> There are five patches in mainline to fix this CVE, only two patches
> >> have no effect on stable 4.4, the important reason is 4.4 use simple
> >> queue but mainline use RB tree.
> >>
> >> I have tried my best to use easy way to fix this with dropping packets
> >> 12.5%(or other value) based on simple queue, but the result is not
> >> very well, so the RB tree is needed and tested result is my desire.
> >>
> >> If we only back port two patches but they don't fix the issue, I think
> >> they don't make any sense.
> > 
> > There is an obvious omission in one of the two patches and Takashi's
> > patch fixes it. If his follow-up fix (applied on top of what is in
> > stable 4.4 now) addresses the problem, I would certainly prefer using it
> > over backporting the whole series.
> 
> Do you mean below codes from Takashi can fix this CVE?
> But I have already tested like this two days ago, it is not good effect.

IIRC what you proposed was different: you proposed to replace the "=" in
the other branch with "+=".

Michal Kubecek


> 
> Could you try to test with POC programme mentioned previous mail in case I made mistake?
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 4a261e078082..9c4c6cd0316e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4835,6 +4835,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
>  			end = TCP_SKB_CB(skb)->end_seq;
>  			range_truesize = skb->truesize;
>  		} else {
> +			range_truesize += skb->truesize;
>  			if (before(TCP_SKB_CB(skb)->seq, start))
>  				start = TCP_SKB_CB(skb)->seq;
>  			if (after(TCP_SKB_CB(skb)->end_seq, end))
> -- 
> 
> 
> > 
> > Michal Kubecek
> > 
> > 
> > .
> > 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16  7:44             ` Michal Kubecek
@ 2018-08-16  7:55               ` maowenan
  2018-08-16 11:39                 ` Michal Kubecek
  0 siblings, 1 reply; 28+ messages in thread
From: maowenan @ 2018-08-16  7:55 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai



On 2018/8/16 15:44, Michal Kubecek wrote:
> On Thu, Aug 16, 2018 at 03:39:14PM +0800, maowenan wrote:
>>
>>
>> On 2018/8/16 15:23, Michal Kubecek wrote:
>>> On Thu, Aug 16, 2018 at 03:19:12PM +0800, maowenan wrote:
>>>> On 2018/8/16 14:52, Michal Kubecek wrote:
>>>>>
>>>>> My point is that backporting all this into stable 4.4 is quite intrusive
>>>>> so that if we can achieve similar results with a simple fix of an
>>>>> obvious omission, it would be preferrable.
>>>>
>>>> There are five patches in mainline to fix this CVE, only two patches
>>>> have no effect on stable 4.4, the important reason is 4.4 use simple
>>>> queue but mainline use RB tree.
>>>>
>>>> I have tried my best to use easy way to fix this with dropping packets
>>>> 12.5%(or other value) based on simple queue, but the result is not
>>>> very well, so the RB tree is needed and tested result is my desire.
>>>>
>>>> If we only back port two patches but they don't fix the issue, I think
>>>> they don't make any sense.
>>>
>>> There is an obvious omission in one of the two patches and Takashi's
>>> patch fixes it. If his follow-up fix (applied on top of what is in
>>> stable 4.4 now) addresses the problem, I would certainly prefer using it
>>> over backporting the whole series.
>>
>> Do you mean below codes from Takashi can fix this CVE?
>> But I have already tested like this two days ago, it is not good effect.
> 
> IIRC what you proposed was different, you proposed to replace the "=" in
> the other branch by "+=".

No, I think you misunderstood me. I have already tested stable 4.4,
based on commit dc6ae4d, with a change like Takashi's applied; that tree did not
contain any of the code I sent in this patch series.

I suggest someone test again on top of Takashi's fix.

dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
255924e tcp: do not delay ACK in DCTCP upon CE status change
0b1d40e tcp: do not cancel delay-AcK on DCTCP special ACK
17fea38e7 tcp: helpers to send special DCTCP ack
500e03f tcp: fix dctcp delayed ACK schedule
b04c9a0 rtnetlink: add rtnl_link_state check in rtnl_configure_link
73dad08 net/mlx4_core: Save the qpn from the input modifier in RST2INIT wrapper
48f41c0 ip: hash fragments consistently
54a634c MIPS: ath79: fix register address in ath79_ddr_wb_flush()
762b585 Linux 4.4.144


> 
> Michal Kubecek
> 
> 
>>
>> Could you try to test with POC programme mentioned previous mail in case I made mistake?
>>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 4a261e078082..9c4c6cd0316e 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -4835,6 +4835,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
>>  			end = TCP_SKB_CB(skb)->end_seq;
>>  			range_truesize = skb->truesize;
>>  		} else {
>> +			range_truesize += skb->truesize;
>>  			if (before(TCP_SKB_CB(skb)->seq, start))
>>  				start = TCP_SKB_CB(skb)->seq;
>>  			if (after(TCP_SKB_CB(skb)->end_seq, end))
>> -- 
>>
>>
>>>
>>> Michal Kubecek
>>>
>>>
>>> .
>>>
>>
> 
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16  7:55               ` maowenan
@ 2018-08-16 11:39                 ` Michal Kubecek
  2018-08-16 12:05                   ` maowenan
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Kubecek @ 2018-08-16 11:39 UTC (permalink / raw)
  To: maowenan
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai

On Thu, Aug 16, 2018 at 03:55:16PM +0800, maowenan wrote:
> On 2018/8/16 15:44, Michal Kubecek wrote:
> > On Thu, Aug 16, 2018 at 03:39:14PM +0800, maowenan wrote:
> >>
> >>
> >> On 2018/8/16 15:23, Michal Kubecek wrote:
> >>> On Thu, Aug 16, 2018 at 03:19:12PM +0800, maowenan wrote:
> >>>> On 2018/8/16 14:52, Michal Kubecek wrote:
> >>>>>
> >>>>> My point is that backporting all this into stable 4.4 is quite intrusive
> >>>>> so that if we can achieve similar results with a simple fix of an
> >>>>> obvious omission, it would be preferrable.
> >>>>
> >>>> There are five patches in mainline to fix this CVE, only two patches
> >>>> have no effect on stable 4.4, the important reason is 4.4 use simple
> >>>> queue but mainline use RB tree.
> >>>>
> >>>> I have tried my best to use easy way to fix this with dropping packets
> >>>> 12.5%(or other value) based on simple queue, but the result is not
> >>>> very well, so the RB tree is needed and tested result is my desire.
> >>>>
> >>>> If we only back port two patches but they don't fix the issue, I think
> >>>> they don't make any sense.
> >>>
> >>> There is an obvious omission in one of the two patches and Takashi's
> >>> patch fixes it. If his follow-up fix (applied on top of what is in
> >>> stable 4.4 now) addresses the problem, I would certainly prefer using it
> >>> over backporting the whole series.
> >>
> >> Do you mean below codes from Takashi can fix this CVE?
> >> But I have already tested like this two days ago, it is not good effect.
> > 
> > IIRC what you proposed was different, you proposed to replace the "=" in
> > the other branch by "+=".
> 
> No, I think you don't get what I mean, I have already tested stable 4.4,
> based on commit dc6ae4d, and change the codes like Takashi, which didn't
> contain any codes I have sent in this patch series.

I suspect you may be doing something wrong with your tests. I checked
the segmentsmack testcase and the CPU utilization on receiving side
(with sending 10 times as many packets as default) went down from ~100%
to ~3% even when comparing what is in stable 4.4 now against older 4.4
kernel.

This is actually not surprising. The testcase only sends 1-byte segments
starting at even offsets so that receiver never gets two adjacent
segments and the "range_truesize != head->truesize" condition would
never be satisfied, whether you fix the backport or not.

Where the missing "range_truesize += skb->truesize" makes a difference
is in the case when there are some adjacent out of order segments, e.g.
if you omitted 1:10 and then sent 10:11, 11:12, 12:13, 13:14, ...
Then the missing update of range_truesize would prevent collapsing the
queue until the total length of the range would exceed the value of
SKB_WITH_OVERHEAD(SK_MEM_QUANTUM) (i.e. a bit less than 4 KB).
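
For illustration, here is a minimal user-space model of the check added by
3d4bf93a (not the kernel code; SEG_TRUESIZE and RANGE_BYTES_LIMIT are made-up
stand-ins for the skb truesize and SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)),
showing what the missing "+=" changes for a run of adjacent 1-byte segments:

#include <stdio.h>
#include <stdbool.h>

#define SEG_TRUESIZE       768   /* assumed truesize of one tiny skb */
#define RANGE_BYTES_LIMIT 3968   /* stand-in for SKB_WITH_OVERHEAD(SK_MEM_QUANTUM) */

/* mirror of the added check: skip collapsing only if the range looks like
 * a single tiny skb and still spans less than ~4 KB */
static bool should_collapse(unsigned int range_truesize,
                            unsigned int head_truesize,
                            unsigned int start, unsigned int end)
{
        return range_truesize != head_truesize ||
               end - start >= RANGE_BYTES_LIMIT;
}

int main(void)
{
        int nsegs = 50;                     /* 10:11, 11:12, 12:13, ... */
        unsigned int start = 10, end = 11;  /* head covers 10:11 */
        unsigned int head_truesize = SEG_TRUESIZE;
        unsigned int buggy = SEG_TRUESIZE;  /* backport: never accumulated */
        unsigned int fixed = SEG_TRUESIZE;  /* follow-up fix: += per skb */

        for (int i = 1; i < nsegs; i++) {
                end++;                      /* each segment adds one byte */
                fixed += SEG_TRUESIZE;      /* the line the backport lost */
        }

        printf("range %u:%u (%u bytes)\n", start, end, end - start);
        printf("without +=: collapse=%d\n",
               should_collapse(buggy, head_truesize, start, end));
        printf("with    +=: collapse=%d\n",
               should_collapse(fixed, head_truesize, start, end));
        return 0;
}

With the accumulation in place the range is treated as fat as soon as a
second skb joins it; without it, nothing is collapsed until the range
itself spans ~4 KB, which is the behaviour described above.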

Michal Kubecek

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16 11:39                 ` Michal Kubecek
@ 2018-08-16 12:05                   ` maowenan
  2018-08-16 12:33                     ` Michal Kubecek
  0 siblings, 1 reply; 28+ messages in thread
From: maowenan @ 2018-08-16 12:05 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai



On 2018/8/16 19:39, Michal Kubecek wrote:
> On Thu, Aug 16, 2018 at 03:55:16PM +0800, maowenan wrote:
>> On 2018/8/16 15:44, Michal Kubecek wrote:
>>> On Thu, Aug 16, 2018 at 03:39:14PM +0800, maowenan wrote:
>>>>
>>>>
>>>> On 2018/8/16 15:23, Michal Kubecek wrote:
>>>>> On Thu, Aug 16, 2018 at 03:19:12PM +0800, maowenan wrote:
>>>>>> On 2018/8/16 14:52, Michal Kubecek wrote:
>>>>>>>
>>>>>>> My point is that backporting all this into stable 4.4 is quite intrusive
>>>>>>> so that if we can achieve similar results with a simple fix of an
>>>>>>> obvious omission, it would be preferrable.
>>>>>>
>>>>>> There are five patches in mainline to fix this CVE, only two patches
>>>>>> have no effect on stable 4.4, the important reason is 4.4 use simple
>>>>>> queue but mainline use RB tree.
>>>>>>
>>>>>> I have tried my best to use easy way to fix this with dropping packets
>>>>>> 12.5%(or other value) based on simple queue, but the result is not
>>>>>> very well, so the RB tree is needed and tested result is my desire.
>>>>>>
>>>>>> If we only back port two patches but they don't fix the issue, I think
>>>>>> they don't make any sense.
>>>>>
>>>>> There is an obvious omission in one of the two patches and Takashi's
>>>>> patch fixes it. If his follow-up fix (applied on top of what is in
>>>>> stable 4.4 now) addresses the problem, I would certainly prefer using it
>>>>> over backporting the whole series.
>>>>
>>>> Do you mean below codes from Takashi can fix this CVE?
>>>> But I have already tested like this two days ago, it is not good effect.
>>>
>>> IIRC what you proposed was different, you proposed to replace the "=" in
>>> the other branch by "+=".
>>
>> No, I think you don't get what I mean, I have already tested stable 4.4,
>> based on commit dc6ae4d, and change the codes like Takashi, which didn't
>> contain any codes I have sent in this patch series.
> 
> I suspect you may be doing something wrong with your tests. I checked
> the segmentsmack testcase and the CPU utilization on receiving side
> (with sending 10 times as many packets as default) went down from ~100%
> to ~3% even when comparing what is in stable 4.4 now against older 4.4
> kernel.

There is no obvious problem when you send packets with the default parameters of the SegmentSmack POC,
which also depends a lot on your server's hardware configuration. Please try the parameters below
to form OFO packets; then you will see that the cpu usage is high, and perf top shows tcp_data_queue costing
about 55.6% cpu.
If the dst port is 22, then you will see sshd at about 95%.

int main(int argc, char **argv)
{
  // Adjust dst_mac, src_mac and dst_ip to match source and target!
  // Adjust dst_port to match the target, needs to be an open port!
  char dst_mac[6] = {0xb8,0x27,0xeb,0x54,0x23,0x4a};
  char src_mac[6] = {0x08,0x00,0x27,0xbc,0x91,0x93};
  uint32_t dst_ip = (192<<24)|(168<<16)|(1<<8)|225;
  uint32_t src_ip = 0;
  uint16_t dst_port = 22;   // attack an existing ssh connection
  uint16_t src_port = 0;

  ......

    for (j = 0; j < 102400*10; j++)    //10240->102400
    {
      for (i = 0; i < 1024; i++)      // 128->1024
      {
        tcp_set_ack_on(only_tcp[i]);
        tcp_set_src_port(only_tcp[i], src_port);
        tcp_set_dst_port(only_tcp[i], dst_port);
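        // random even offsets: the 1-byte segments never end up adjacent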
        tcp_set_seq_number(only_tcp[i], isn+2+2*(rand()%16384));
        //tcp_set_seq_number(only_tcp[i], isn+2);
        tcp_set_ack_number(only_tcp[i], other_isn+1);
        tcp_set_data_offset(only_tcp[i], 20);
        tcp_set_window(only_tcp[i], 65535);
        tcp_set_cksum_calc(ip, 20, only_tcp[i], sizeof(only_tcp[i]));
      }
      ret = ldp_out_inject_chunk(outq, pkt_tbl_chunk, 1024);  //128->1024
      printf("sent %d packets\n", ret);
      ldp_out_txsync(outq);
      usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms
    }
  ......

> 
> This is actually not surprising. the testcase only sends 1-byte segments
> starting at even offsets so that receiver never gets two adjacent
> segments and the "range_truesize != head->truesize" condition would
> never be satisfied, whether you fix the backport or not.
> 
> Where the missing "range_truesize += skb->truesize" makes a difference
> is in the case when there are some adjacent out of order segments, e.g.
> if you omitted 1:10 and then sent 10:11, 11:12, 12:13, 13:14, ...
> Then the missing update of range_truesize would prevent collapsing the
> queue until the total length of the range would exceed the value of
> SKB_WITH_OVERHEAD(SK_MEM_QUANTUM) (i.e. a bit less than 4 KB).
> 
> Michal Kubecek
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16 12:05                   ` maowenan
@ 2018-08-16 12:33                     ` Michal Kubecek
  2018-08-16 15:24                       ` Greg KH
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Kubecek @ 2018-08-16 12:33 UTC (permalink / raw)
  To: maowenan
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai

On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
> On 2018/8/16 19:39, Michal Kubecek wrote:
> > 
> > I suspect you may be doing something wrong with your tests. I checked
> > the segmentsmack testcase and the CPU utilization on receiving side
> > (with sending 10 times as many packets as default) went down from ~100%
> > to ~3% even when comparing what is in stable 4.4 now against older 4.4
> > kernel.
> 
> There seems no obvious problem when you send packets with default
> parameter in Segmentsmack POC, Which is also very related with your
> server's hardware configuration. Please try with below parameter to
> form OFO packets

I did and even with these (questionable, see below) changes, I did not
get more than 10% (of one core) by receiving ksoftirqd.

>       for (i = 0; i < 1024; i++)      // 128->1024
...
>       usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms

The comment in the testcase source suggests to do _one_ of these two
changes so that you generate 10 times as many packets as the original
testcase. You did both so that you end up sending 102400 packets per
second. With 55 byte long packets, this kind of attack requires at least
5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
DoS", I'm afraid.

Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).

What I can see, though, is that with current stable 4.4 code, modified
testcase which sends something like

  2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...

I quickly eat 6 MB of memory for receive queue of one socket while
earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
Takashi's follow-up yet but I'm pretty sure it will help while
preserving nice performance when using the original segmentsmack
testcase (with increased packet ratio).

Michal Kubecek

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16 12:33                     ` Michal Kubecek
@ 2018-08-16 15:24                       ` Greg KH
  2018-08-16 16:06                         ` Michal Kubecek
  2018-09-13 12:32                         ` Greg KH
  0 siblings, 2 replies; 28+ messages in thread
From: Greg KH @ 2018-08-16 15:24 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: maowenan, dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai

On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
> On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
> > On 2018/8/16 19:39, Michal Kubecek wrote:
> > > 
> > > I suspect you may be doing something wrong with your tests. I checked
> > > the segmentsmack testcase and the CPU utilization on receiving side
> > > (with sending 10 times as many packets as default) went down from ~100%
> > > to ~3% even when comparing what is in stable 4.4 now against older 4.4
> > > kernel.
> > 
> > There seems no obvious problem when you send packets with default
> > parameter in Segmentsmack POC, Which is also very related with your
> > server's hardware configuration. Please try with below parameter to
> > form OFO packets
> 
> I did and even with these (questionable, see below) changes, I did not
> get more than 10% (of one core) by receiving ksoftirqd.
> 
> >       for (i = 0; i < 1024; i++)      // 128->1024
> ...
> >       usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms
> 
> The comment in the testcase source suggests to do _one_ of these two
> changes so that you generate 10 times as many packets as the original
> testcase. You did both so that you end up sending 102400 packets per
> second. With 55 byte long packets, this kind of attack requires at least
> 5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
> DoS", I'm afraid.
> 
> Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
> 
> What I can see, though, is that with current stable 4.4 code, modified
> testcase which sends something like
> 
>   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
> 
> I quickly eat 6 MB of memory for receive queue of one socket while
> earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
> Takashi's follow-up yet but I'm pretty sure it will help while
> preserving nice performance when using the original segmentsmack
> testcase (with increased packet ratio).

Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
push out a new 4.4-rc later tonight.  Can everyone standardize on that
and test and let me know if it does, or does not, fix the reported
issues?

If not, we can go from there and evaluate this much larger patch series.
But let's try the simple thing first.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16 15:24                       ` Greg KH
@ 2018-08-16 16:06                         ` Michal Kubecek
  2018-08-16 16:20                           ` Greg KH
  2018-08-17  2:48                           ` maowenan
  2018-09-13 12:32                         ` Greg KH
  1 sibling, 2 replies; 28+ messages in thread
From: Michal Kubecek @ 2018-08-16 16:06 UTC (permalink / raw)
  To: Greg KH
  Cc: maowenan, dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai

On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
> > 
> > Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
> > 
> > What I can see, though, is that with current stable 4.4 code, modified
> > testcase which sends something like
> > 
> >   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
> > 
> > I quickly eat 6 MB of memory for receive queue of one socket while
> > earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
> > Takashi's follow-up yet but I'm pretty sure it will help while
> > preserving nice performance when using the original segmentsmack
> > testcase (with increased packet ratio).
> 
> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
> push out a new 4.4-rc later tonight.  Can everyone standardize on that
> and test and let me know if it does, or does not, fix the reported
> issues?

I did repeat the tests with Takashi's fix and the CPU utilization is
similar to what we have now, i.e. 3-5% with 10K pkt/s. I could still
saturate one CPU somewhere around 50K pkt/s but that already requires
2.75 MB/s (22 Mb/s) of throughput. (My previous tests with Mao Wenan's
changes in fact used lower speeds as the change from 128 to 1024 would
need to be done in two places.)

Where Takashi's patch does help is that it does not prevent collapsing
of ranges of adjacent segments with total length shorter than ~4KB. It
took more time to verify: it cannot be checked by watching the socket
memory consumption with ss as tcp_collapse_ofo_queue isn't called until
we reach the limits. So I needed to trace when and how tcp_collapse() is
called with both the current stable 4.4 code and the one with Takashi's fix.
  
> If not, we can go from there and evaluate this much larger patch
> series.  But let's try the simple thing first.

At high packet rates (say 30K pkt/s and more), we can still saturate the
CPU. This is also mentioned in the announcement, with the claim that a switch
to an rbtree-based queue would be necessary to fully address that. My tests
seem to confirm that, but I'm still not sure it is worth backporting
something this intrusive into stable 4.4.

Michal Kubecek

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16 16:06                         ` Michal Kubecek
@ 2018-08-16 16:20                           ` Greg KH
  2018-08-17  2:48                           ` maowenan
  1 sibling, 0 replies; 28+ messages in thread
From: Greg KH @ 2018-08-16 16:20 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: maowenan, dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai

On Thu, Aug 16, 2018 at 06:06:16PM +0200, Michal Kubecek wrote:
> > If not, we can go from there and evaluate this much larger patch
> > series.  But let's try the simple thing first.
> 
> At high packet rates (say 30K pkt/s and more), we can still saturate the
> CPU. This is also mentioned in the announcement with claim that switch
> to rbtree based queue would be necessary to fully address that. My tests
> seem to confirm that but I'm still not sure it is worth backporting
> something as intrusive into stable 4.4.

No, it is not.  If you worry about those things, you should not be
running a 4.4 kernel; use 4.14 or newer, please.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16 16:06                         ` Michal Kubecek
  2018-08-16 16:20                           ` Greg KH
@ 2018-08-17  2:48                           ` maowenan
  1 sibling, 0 replies; 28+ messages in thread
From: maowenan @ 2018-08-17  2:48 UTC (permalink / raw)
  To: Michal Kubecek, Greg KH
  Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw,
	stable, Takashi Iwai



On 2018/8/17 0:06, Michal Kubecek wrote:
> On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
>> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
>>>
>>> Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
>>>
>>> What I can see, though, is that with current stable 4.4 code, modified
>>> testcase which sends something like
>>>
>>>   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
>>>
>>> I quickly eat 6 MB of memory for receive queue of one socket while
>>> earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
>>> Takashi's follow-up yet but I'm pretty sure it will help while
>>> preserving nice performance when using the original segmentsmack
>>> testcase (with increased packet ratio).
>>
>> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
>> push out a new 4.4-rc later tonight.  Can everyone standardize on that
>> and test and let me know if it does, or does not, fix the reported
>> issues?
> 
> I did repeat the tests with Takashi's fix and the CPU utilization is
> similar to what we have now, i.e. 3-5% with 10K pkt/s. I could still
> saturate one CPU somewhere around 50K pkt/s but that already requires
> 2.75 MB/s (22 Mb/s) of throughput. (My previous tests with Mao Wenan's
> changes in fact used lower speeds as the change from 128 to 1024 would
> need to be done in two places.)
> 
> Where Takashi's patch does help is that it does not prevent collapsing
> of ranges of adjacent segments with total length shorter than ~4KB. It
> took more time to verify: it cannot be checked by watching the socket
> memory consumption with ss as tcp_collapse_ofo_queue isn't called until
> we reach the limits. So I needed to trace when and how tcp_collpse() is
> called with both current stable 4.4 code and one with Takashi's fix.

The POC by default attacks a Raspberry Pi system, whose cpu performance is low,
so the default parameters are not aggressive. We enlarge the parameters to test
on our intel skylake system (with much higher performance); without doing that, the cpu usage
is not obviously different with and without the fix patch, so you can't distinguish
whether the patch really fixes it or not.

I have run a series of tests here, including a low-rate attack (bursts of 128 packets, 100ms interval)
and a high-rate attack (bursts of 1024 packets, 10ms interval), against the original 4.4 kernel, only Takashi's patch,
and only Mao Wenan's patches. Below is the cpu usage of ksoftirqd.

              original   Takashi   Mao Wenan
low rate        3%         2%        2%
high rate      50%        49%      ~10%

So I can't see that Takashi's patch really fixes the fundamental issue; I think
the root cause lies in the simple queue, and Eric's patch
72cd43ba (tcp: free batches of packets in tcp_prune_ofo_queue()) can fix it completely,
and it is already included in my patch series. That patch requires changing the simple queue to
an RB tree, which makes searching and dropping packets efficient and avoids heavy tcp retransmission,
so the cpu usage falls.

>   
>> If not, we can go from there and evaluate this much larger patch
>> series.  But let's try the simple thing first.
> 
> At high packet rates (say 30K pkt/s and more), we can still saturate the
> CPU. This is also mentioned in the announcement with claim that switch
> to rbtree based queue would be necessary to fully address that. My tests
> seem to confirm that but I'm still not sure it is worth backporting
> something as intrusive into stable 4.4.
> 
> Michal Kubecek
> 
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-08-16 15:24                       ` Greg KH
  2018-08-16 16:06                         ` Michal Kubecek
@ 2018-09-13 12:32                         ` Greg KH
  2018-09-13 12:44                           ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread
From: Greg KH @ 2018-09-13 12:32 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: maowenan, dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng,
	jdw, stable, Takashi Iwai

On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
> > On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
> > > On 2018/8/16 19:39, Michal Kubecek wrote:
> > > > 
> > > > I suspect you may be doing something wrong with your tests. I checked
> > > > the segmentsmack testcase and the CPU utilization on receiving side
> > > > (with sending 10 times as many packets as default) went down from ~100%
> > > > to ~3% even when comparing what is in stable 4.4 now against older 4.4
> > > > kernel.
> > > 
> > > There seems no obvious problem when you send packets with default
> > > parameter in Segmentsmack POC, Which is also very related with your
> > > server's hardware configuration. Please try with below parameter to
> > > form OFO packets
> > 
> > I did and even with these (questionable, see below) changes, I did not
> > get more than 10% (of one core) by receiving ksoftirqd.
> > 
> > >       for (i = 0; i < 1024; i++)      // 128->1024
> > ...
> > >       usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms
> > 
> > The comment in the testcase source suggests to do _one_ of these two
> > changes so that you generate 10 times as many packets as the original
> > testcase. You did both so that you end up sending 102400 packets per
> > second. With 55 byte long packets, this kind of attack requires at least
> > 5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
> > DoS", I'm afraid.
> > 
> > Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
> > 
> > What I can see, though, is that with current stable 4.4 code, modified
> > testcase which sends something like
> > 
> >   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
> > 
> > I quickly eat 6 MB of memory for receive queue of one socket while
> > earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
> > Takashi's follow-up yet but I'm pretty sure it will help while
> > preserving nice performance when using the original segmentsmack
> > testcase (with increased packet ratio).
> 
> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
> push out a new 4.4-rc later tonight.  Can everyone standardize on that
> and test and let me know if it does, or does not, fix the reported
> issues?
> 
> If not, we can go from there and evaluate this much larger patch series.
> But let's try the simple thing first.

So, is the issue still present on the latest 4.4 release?  Has anyone
tested it?  If not, I'm more than willing to look at backported patches,
but I want to ensure that they really are needed here.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-09-13 12:32                         ` Greg KH
@ 2018-09-13 12:44                           ` Eric Dumazet
  2018-09-14  2:24                             ` maowenan
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2018-09-13 12:44 UTC (permalink / raw)
  To: gregkh
  Cc: mkubecek, maowenan, David Woodhouse, netdev, Eric Dumazet,
	David Miller, Yuchung Cheng, jdw, stable, tiwai

On Thu, Sep 13, 2018 at 5:32 AM Greg KH <gregkh@linux-foundation.org> wrote:
>
> On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
> > On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
> > > On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
> > > > On 2018/8/16 19:39, Michal Kubecek wrote:
> > > > >
> > > > > I suspect you may be doing something wrong with your tests. I checked
> > > > > the segmentsmack testcase and the CPU utilization on receiving side
> > > > > (with sending 10 times as many packets as default) went down from ~100%
> > > > > to ~3% even when comparing what is in stable 4.4 now against older 4.4
> > > > > kernel.
> > > >
> > > > There seems no obvious problem when you send packets with default
> > > > parameter in Segmentsmack POC, Which is also very related with your
> > > > server's hardware configuration. Please try with below parameter to
> > > > form OFO packets
> > >
> > > I did and even with these (questionable, see below) changes, I did not
> > > get more than 10% (of one core) by receiving ksoftirqd.
> > >
> > > >       for (i = 0; i < 1024; i++)      // 128->1024
> > > ...
> > > >       usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms
> > >
> > > The comment in the testcase source suggests to do _one_ of these two
> > > changes so that you generate 10 times as many packets as the original
> > > testcase. You did both so that you end up sending 102400 packets per
> > > second. With 55 byte long packets, this kind of attack requires at least
> > > 5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
> > > DoS", I'm afraid.
> > >
> > > Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
> > >
> > > What I can see, though, is that with current stable 4.4 code, modified
> > > testcase which sends something like
> > >
> > >   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
> > >
> > > I quickly eat 6 MB of memory for receive queue of one socket while
> > > earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
> > > Takashi's follow-up yet but I'm pretty sure it will help while
> > > preserving nice performance when using the original segmentsmack
> > > testcase (with increased packet ratio).
> >
> > Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
> > push out a new 4.4-rc later tonight.  Can everyone standardize on that
> > and test and let me know if it does, or does not, fix the reported
> > issues?
> >
> > If not, we can go from there and evaluate this much larger patch series.
> > But let's try the simple thing first.
>
> So, is the issue still present on the latest 4.4 release?  Has anyone
> tested it?  If not, I'm more than willing to look at backported patches,
> but I want to ensure that they really are needed here.
>
> thanks,

Honestly, a TCP stack without an rb-tree for the OOO queue is vulnerable
even with a non-malicious sender,
given a big enough TCP receive window and an unfavorable network.

So a malicious peer can definitely send the packets needed to make the TCP
stack behave in O(N), which is pretty bad if N is big...
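
A toy user-space illustration of that O(N) cost (a sorted singly linked
list walked from the head; the real ooo queue and its walk order differ,
so the exact numbers are synthetic, only the scaling matters):

#include <stdio.h>
#include <stdlib.h>

struct seg {
        unsigned int seq;
        struct seg *next;
};

/* sorted insert into a plain list: walk nodes until the right spot */
static unsigned long list_insert(struct seg **head, struct seg *s)
{
        unsigned long steps = 0;
        struct seg **p = head;

        while (*p && (*p)->seq < s->seq) {
                p = &(*p)->next;
                steps++;
        }
        s->next = *p;
        *p = s;
        return steps;
}

int main(void)
{
        int n = 10000;
        struct seg *head = NULL;
        unsigned long total = 0;

        srand(1);
        for (int i = 0; i < n; i++) {
                struct seg *s = malloc(sizeof(*s));

                s->seq = (unsigned int)rand();   /* out-of-order arrivals */
                total += list_insert(&head, s);
        }
        /* grows roughly with n*n/4: ~25 million visits for 10000 inserts */
        printf("%d inserts, %lu node visits, %.1f per insert\n",
               n, total, (double)total / n);
        return 0;
}

An rb-tree bounds the same insert at O(log N) node visits, which is why
the conversion matters beyond the collapse heuristics.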

9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ("tcp: use an RB tree for ooo
receive queue")
was proven to be almost bug free [1], and should be backported if possible.

[1] bug fixed :
76f0dcbb5ae1a7c3dbeec13dd98233b8e6b0b32a tcp: fix a stale ooo_last_skb
after a replace

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
  2018-09-13 12:44                           ` Eric Dumazet
@ 2018-09-14  2:24                             ` maowenan
  0 siblings, 0 replies; 28+ messages in thread
From: maowenan @ 2018-09-14  2:24 UTC (permalink / raw)
  To: Eric Dumazet, gregkh
  Cc: mkubecek, David Woodhouse, netdev, Eric Dumazet, David Miller,
	Yuchung Cheng, jdw, stable, tiwai



On 2018/9/13 20:44, Eric Dumazet wrote:
> On Thu, Sep 13, 2018 at 5:32 AM Greg KH <gregkh@linux-foundation.org> wrote:
>>
>> On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
>>> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
>>>> On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
>>>>> On 2018/8/16 19:39, Michal Kubecek wrote:
>>>>>>
>>>>>> I suspect you may be doing something wrong with your tests. I checked
>>>>>> the segmentsmack testcase and the CPU utilization on receiving side
>>>>>> (with sending 10 times as many packets as default) went down from ~100%
>>>>>> to ~3% even when comparing what is in stable 4.4 now against older 4.4
>>>>>> kernel.
>>>>>
>>>>> There seems no obvious problem when you send packets with default
>>>>> parameter in Segmentsmack POC, Which is also very related with your
>>>>> server's hardware configuration. Please try with below parameter to
>>>>> form OFO packets
>>>>
>>>> I did and even with these (questionable, see below) changes, I did not
>>>> get more than 10% (of one core) by receiving ksoftirqd.
>>>>
>>>>>       for (i = 0; i < 1024; i++)      // 128->1024
>>>> ...
>>>>>       usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms
>>>>
>>>> The comment in the testcase source suggests to do _one_ of these two
>>>> changes so that you generate 10 times as many packets as the original
>>>> testcase. You did both so that you end up sending 102400 packets per
>>>> second. With 55 byte long packets, this kind of attack requires at least
>>>> 5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
>>>> DoS", I'm afraid.
>>>>
>>>> Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
>>>>
>>>> What I can see, though, is that with current stable 4.4 code, modified
>>>> testcase which sends something like
>>>>
>>>>   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
>>>>
>>>> I quickly eat 6 MB of memory for receive queue of one socket while
>>>> earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
>>>> Takashi's follow-up yet but I'm pretty sure it will help while
>>>> preserving nice performance when using the original segmentsmack
>>>> testcase (with increased packet ratio).
>>>
>>> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
>>> push out a new 4.4-rc later tonight.  Can everyone standardize on that
>>> and test and let me know if it does, or does not, fix the reported
>>> issues?
>>>
>>> If not, we can go from there and evaluate this much larger patch series.
>>> But let's try the simple thing first.
>>
>> So, is the issue still present on the latest 4.4 release?  Has anyone
>> tested it?  If not, I'm more than willing to look at backported patches,
>> but I want to ensure that they really are needed here.
>>
>> thanks,
> 
> Honestly, TCP stack without rb-tree for the OOO queue is vulnerable,
> even with non malicious sender,
> but with big enough TCP receive window and a not favorable network.
> 
> So a malicious peer can definitely send packets needed to make TCP
> stack behave in O(N), which is pretty bad if N is big...
> 
> 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ("tcp: use an RB tree for ooo
> receive queue")
> was proven to be almost bug free [1], and should be backported if possible.
> 
> [1] bug fixed :
> 76f0dcbb5ae1a7c3dbeec13dd98233b8e6b0b32a tcp: fix a stale ooo_last_skb
> after a replace

Thank you for the suggestion, Eric; I will work on backporting them.
> 
> .
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2018-09-14  2:24 UTC | newest]

Thread overview: 28+ messages
2018-08-16  2:50 [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Mao Wenan
2018-08-16  2:50 ` [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()" Mao Wenan
2018-08-16  2:50 ` [PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible" Mao Wenan
2018-08-16  2:50 ` [PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets Mao Wenan
2018-08-16  2:50 ` [PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue Mao Wenan
2018-08-16  2:50 ` [PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue() Mao Wenan
2018-08-16  2:50 ` [PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible Mao Wenan
2018-08-16  2:50 ` [PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue() Mao Wenan
2018-08-16  2:50 ` [PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo() Mao Wenan
2018-08-16  2:50 ` [PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper Mao Wenan
2018-08-16  6:16 ` [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390) Michal Kubecek
2018-08-16  6:42   ` maowenan
2018-08-16  6:52     ` Michal Kubecek
2018-08-16  7:19       ` maowenan
2018-08-16  7:23         ` Michal Kubecek
2018-08-16  7:39           ` maowenan
2018-08-16  7:44             ` Michal Kubecek
2018-08-16  7:55               ` maowenan
2018-08-16 11:39                 ` Michal Kubecek
2018-08-16 12:05                   ` maowenan
2018-08-16 12:33                     ` Michal Kubecek
2018-08-16 15:24                       ` Greg KH
2018-08-16 16:06                         ` Michal Kubecek
2018-08-16 16:20                           ` Greg KH
2018-08-17  2:48                           ` maowenan
2018-09-13 12:32                         ` Greg KH
2018-09-13 12:44                           ` Eric Dumazet
2018-09-14  2:24                             ` maowenan
