[PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391)

stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391)
@ 2019-01-23  2:19 Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 01/11] net: speed up skb_rbtree_purge() Mao Wenan
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

There is one CVE: CVE-2018-5391 kernel: IP fragments with random offsets allow a 
remote denial of service (FragmentSmack), 
A fix is a merge commit in the Linux kernel tree:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c30f1fc041b74ecdb072dd44f858750414b8b19f

consisting of the following commits:
7969e5c40dfd04799d4341f1b7cd266b6e47f227 ip: discard IPv4 datagrams with overlapping segments.
385114dec8a49b5e5945e77ba7de6356106713f4 net: modify skb_rbtree_purge to return the truesize of all purged skbs.
fa0f527358bd900ef92f925878ed6bfbd51305cc ip: use rb trees for IP frag queue.

All above patches are with rb tree to fix this CVE, which is very similar the CVE-2018-5390, that I have backport
to stable 4.4 branch in last year.

In these patchset, I will backport some patches to fix CVE-2018-5391 with rb tree.  

Dan Carpenter (1):
  ipv4: frags: precedence bug in ip_expire()

Eric Dumazet (2):
  net: speed up skb_rbtree_purge()
  inet: frags: get rif of inet_frag_evicting()

Florian Westphal (1):
  ipv6: defrag: drop non-last frags smaller than min mtu

Michal Kubecek (1):
  net: ipv4: do not handle duplicate fragments as overlapping

Peter Oskolkov (5):
  ip: discard IPv4 datagrams with overlapping segments.
  net: modify skb_rbtree_purge to return the truesize of all purged
    skbs.
  ip: use rb trees for IP frag queue.
  ip: add helpers to process in-order fragments faster.
  ip: process in-order fragments efficiently

Taehee Yoo (1):
  ip: frags: fix crash in ip_do_fragment()

 include/linux/skbuff.h                  |   4 +-
 include/net/inet_frag.h                 |  12 +-
 include/uapi/linux/snmp.h               |   1 +
 net/core/skbuff.c                       |  17 +-
 net/ipv4/inet_fragment.c                |  16 +-
 net/ipv4/ip_fragment.c                  | 410 +++++++++++++++++++-------------
 net/ipv4/proc.c                         |   1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |   6 +
 net/ipv6/reassembly.c                   |   9 +-
 9 files changed, 292 insertions(+), 184 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 01/11] net: speed up skb_rbtree_purge()
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 02/11] ip: discard IPv4 datagrams with overlapping segments Mao Wenan
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 7c90584c66cc4b033a3b684b0e0950f79e7b7166 ]

As measured in my prior patch ("sch_netem: faster rb tree removal"),
rbtree_postorder_for_each_entry_safe() is nice looking but much slower
than using rb_next() directly, except when tree is small enough
to fit in CPU caches (then the cost is the same)

Also note that there is not even an increase of text size :
$ size net/core/skbuff.o.before net/core/skbuff.o
   text	   data	    bss	    dec	    hex	filename
  40711	   1298	      0	  42009	   a419	net/core/skbuff.o.before
  40711	   1298	      0	  42009	   a419	net/core/skbuff.o

From: Eric Dumazet <edumazet@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/core/skbuff.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9703924..8a57bba 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2388,12 +2388,15 @@ EXPORT_SYMBOL(skb_queue_purge);
  */
 void skb_rbtree_purge(struct rb_root *root)
 {
-	struct sk_buff *skb, *next;
+	struct rb_node *p = rb_first(root);
 
-	rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
-		kfree_skb(skb);
+	while (p) {
+		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
-	*root = RB_ROOT;
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, root);
+		kfree_skb(skb);
+	}
 }
 
 /**
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 02/11] ip: discard IPv4 datagrams with overlapping segments.
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 01/11] net: speed up skb_rbtree_purge() Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 03/11] net: modify skb_rbtree_purge to return the truesize of all purged skbs Mao Wenan
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Peter Oskolkov <posk@google.com>

[ Upstream commit 7969e5c40dfd04799d4341f1b7cd266b6e47f227 ]

This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran ip_defrag selftest (not yet available uptream).

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c    | 71 +++++++++++++----------------------------------
 net/ipv4/proc.c           |  1 +
 3 files changed, 21 insertions(+), 52 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 25a9ad8..9de808e 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -55,6 +55,7 @@ enum
 	IPSTATS_MIB_ECT1PKTS,			/* InECT1Pkts */
 	IPSTATS_MIB_ECT0PKTS,			/* InECT0Pkts */
 	IPSTATS_MIB_CEPKTS,			/* InCEPkts */
+	IPSTATS_MIB_REASM_OVERLAPS,		/* ReasmOverlaps */
 	__IPSTATS_MIB_MAX
 };
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7291565..4e64879 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -342,6 +342,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct sk_buff *prev, *next;
 	struct net_device *dev;
 	unsigned int fragsize;
@@ -422,60 +423,22 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	}
 
 found:
-	/* We found where to put this one.  Check for overlap with
-	 * preceding fragment, and, if needed, align things so that
-	 * any overlaps are eliminated.
+	/* RFC5722, Section 4, amended by Errata ID : 3089
+	 *                          When reassembling an IPv6 datagram, if
+	 *   one or more its constituent fragments is determined to be an
+	 *   overlapping fragment, the entire datagram (and any constituent
+	 *   fragments) MUST be silently discarded.
+	 *
+	 * We do the same here for IPv4.
 	 */
-	if (prev) {
-		int i = (FRAG_CB(prev)->offset + prev->len) - offset;
-
-		if (i > 0) {
-			offset += i;
-			err = -EINVAL;
-			if (end <= offset)
-				goto err;
-			err = -ENOMEM;
-			if (!pskb_pull(skb, i))
-				goto err;
-			if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-				skb->ip_summed = CHECKSUM_NONE;
-		}
-	}
-
-	err = -ENOMEM;
+	/* Is there an overlap with the previous fragment? */
+	if (prev &&
+	    (FRAG_CB(prev)->offset + prev->len) > offset)
+		goto discard_qp;
 
-	while (next && FRAG_CB(next)->offset < end) {
-		int i = end - FRAG_CB(next)->offset; /* overlap is 'i' bytes */
-
-		if (i < next->len) {
-			/* Eat head of the next overlapped fragment
-			 * and leave the loop. The next ones cannot overlap.
-			 */
-			if (!pskb_pull(next, i))
-				goto err;
-			FRAG_CB(next)->offset += i;
-			qp->q.meat -= i;
-			if (next->ip_summed != CHECKSUM_UNNECESSARY)
-				next->ip_summed = CHECKSUM_NONE;
-			break;
-		} else {
-			struct sk_buff *free_it = next;
-
-			/* Old fragment is completely overridden with
-			 * new one drop it.
-			 */
-			next = next->next;
-
-			if (prev)
-				prev->next = next;
-			else
-				qp->q.fragments = next;
-
-			qp->q.meat -= free_it->len;
-			sub_frag_mem_limit(qp->q.net, free_it->truesize);
-			kfree_skb(free_it);
-		}
-	}
+	/* Is there an overlap with the next fragment? */
+	if (next && FRAG_CB(next)->offset < end)
+		goto discard_qp;
 
 	FRAG_CB(skb)->offset = offset;
 
@@ -522,6 +485,10 @@ found:
 	skb_dst_drop(skb);
 	return -EINPROGRESS;
 
+discard_qp:
+	ipq_kill(qp);
+	err = -EINVAL;
+	IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
 err:
 	kfree_skb(skb);
 	return err;
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 3abd9d7..55545d0 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -132,6 +132,7 @@ static const struct snmp_mib snmp4_ipextstats_list[] = {
 	SNMP_MIB_ITEM("InECT1Pkts", IPSTATS_MIB_ECT1PKTS),
 	SNMP_MIB_ITEM("InECT0Pkts", IPSTATS_MIB_ECT0PKTS),
 	SNMP_MIB_ITEM("InCEPkts", IPSTATS_MIB_CEPKTS),
+	SNMP_MIB_ITEM("ReasmOverlaps", IPSTATS_MIB_REASM_OVERLAPS),
 	SNMP_MIB_SENTINEL
 };
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 03/11] net: modify skb_rbtree_purge to return the truesize of all purged skbs.
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 01/11] net: speed up skb_rbtree_purge() Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 02/11] ip: discard IPv4 datagrams with overlapping segments Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 04/11] inet: frags: get rif of inet_frag_evicting() Mao Wenan
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Peter Oskolkov <posk@google.com>

[ Upstream commit 385114dec8a49b5e5945e77ba7de6356106713f4 ]

Tested: see the next patch is the series.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c      | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a490dd7..8bfefdd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2273,7 +2273,7 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8a57bba..49f73fb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2380,23 +2380,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  *	skb_rbtree_purge - empty a skb rbtree
  *	@root: root of the rbtree to empty
+ *	Return value: the sum of truesizes of all purged skbs.
  *
  *	Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  *	the list and one reference dropped. This function does not take
  *	any lock. Synchronization should be handled by the caller (e.g., TCP
  *	out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
 	struct rb_node *p = rb_first(root);
+	unsigned int sum = 0;
 
 	while (p) {
 		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
 		p = rb_next(p);
 		rb_erase(&skb->rbnode, root);
+		sum += skb->truesize;
 		kfree_skb(skb);
 	}
+	return sum;
 }
 
 /**
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 04/11] inet: frags: get rif of inet_frag_evicting()
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
                   ` (2 preceding siblings ...)
  2019-01-23  2:19 ` [PATCH stable 4.4 03/11] net: modify skb_rbtree_purge to return the truesize of all purged skbs Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 05/11] ip: use rb trees for IP frag queue Mao Wenan
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 399d1404be660d355192ff4df5ccc3f4159ec1e4 ]

This refactors ip_expire() since one indentation level is removed.

Note: in the future, we should try hard to avoid the skb_clone()
since this is a serious performance cost.
Under DDOS, the ICMP message wont be sent because of rate limits.

Fact that ip6_expire_frag_queue() does not use skb_clone() is
disturbing too. Presumably IPv6 should have the same
issue than the one we fixed in commit ec4fbd64751d
("inet: frag: release spinlock before calling icmp_send()")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/net/inet_frag.h |  5 ----
 net/ipv4/ip_fragment.c  | 66 ++++++++++++++++++++++++-------------------------
 net/ipv6/reassembly.c   |  4 ---
 3 files changed, 32 insertions(+), 43 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index c26a6e4..09472b8 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -123,11 +123,6 @@ static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f
 		inet_frag_destroy(q, f);
 }
 
-static inline bool inet_frag_evicting(struct inet_frag_queue *q)
-{
-	return !hlist_unhashed(&q->list_evictor);
-}
-
 /* Memory Tracking Functions. */
 
 static inline int frag_mem_limit(struct netns_frags *nf)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 4e64879..264f382 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -194,8 +194,11 @@ static bool frag_expire_skip_icmp(u32 user)
  */
 static void ip_expire(unsigned long arg)
 {
-	struct ipq *qp;
+	struct sk_buff *clone, *head;
+	const struct iphdr *iph;
 	struct net *net;
+	struct ipq *qp;
+	int err;
 
 	qp = container_of((struct inet_frag_queue *) arg, struct ipq, q);
 	net = container_of(qp->q.net, struct net, ipv4.frags);
@@ -209,45 +212,40 @@ static void ip_expire(unsigned long arg)
 	ipq_kill(qp);
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMFAILS);
 
-	if (!inet_frag_evicting(&qp->q)) {
-		struct sk_buff *clone, *head = qp->q.fragments;
-		const struct iphdr *iph;
-		int err;
-
-		IP_INC_STATS_BH(net, IPSTATS_MIB_REASMTIMEOUT);
+	head = qp->q.fragments;
 
-		if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !qp->q.fragments)
-			goto out;
+	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMTIMEOUT);
 
-		head->dev = dev_get_by_index_rcu(net, qp->iif);
-		if (!head->dev)
-			goto out;
+	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+		goto out;
 
+	head->dev = dev_get_by_index_rcu(net, qp->iif);
+	if (!head->dev)
+		goto out;
 
-		/* skb has no dst, perform route lookup again */
-		iph = ip_hdr(head);
-		err = ip_route_input_noref(head, iph->daddr, iph->saddr,
+	/* skb has no dst, perform route lookup again */
+	iph = ip_hdr(head);
+	err = ip_route_input_noref(head, iph->daddr, iph->saddr,
 					   iph->tos, head->dev);
-		if (err)
-			goto out;
+	if (err)
+		goto out;
 
-		/* Only an end host needs to send an ICMP
-		 * "Fragment Reassembly Timeout" message, per RFC792.
-		 */
-		if (frag_expire_skip_icmp(qp->user) &&
-		    (skb_rtable(head)->rt_type != RTN_LOCAL))
-			goto out;
-
-		clone = skb_clone(head, GFP_ATOMIC);
-
-		/* Send an ICMP "Fragment Reassembly Timeout" message. */
-		if (clone) {
-			spin_unlock(&qp->q.lock);
-			icmp_send(clone, ICMP_TIME_EXCEEDED,
-				  ICMP_EXC_FRAGTIME, 0);
-			consume_skb(clone);
-			goto out_rcu_unlock;
-		}
+	/* Only an end host needs to send an ICMP
+	 * "Fragment Reassembly Timeout" message, per RFC792.
+	 */
+	if (frag_expire_skip_icmp(qp->user) &&
+	    (skb_rtable(head)->rt_type != RTN_LOCAL))
+		goto out;
+
+	clone = skb_clone(head, GFP_ATOMIC);
+
+	/* Send an ICMP "Fragment Reassembly Timeout" message. */
+	if (clone) {
+		spin_unlock(&qp->q.lock);
+		icmp_send(clone, ICMP_TIME_EXCEEDED,
+			  ICMP_EXC_FRAGTIME, 0);
+		consume_skb(clone);
+		goto out_rcu_unlock;
 	}
 out:
 	spin_unlock(&qp->q.lock);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 58f2139..ee4789b 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -146,10 +146,6 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq,
 		goto out_rcu_unlock;
 
 	IP6_INC_STATS_BH(net, __in6_dev_get(dev), IPSTATS_MIB_REASMFAILS);
-
-	if (inet_frag_evicting(&fq->q))
-		goto out_rcu_unlock;
-
 	IP6_INC_STATS_BH(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
 	/* Don't send error if the first segment did not arrive. */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 05/11] ip: use rb trees for IP frag queue.
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
                   ` (3 preceding siblings ...)
  2019-01-23  2:19 ` [PATCH stable 4.4 04/11] inet: frags: get rif of inet_frag_evicting() Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  2019-01-24 17:58   ` Greg KH
  2019-01-23  2:19 ` [PATCH stable 4.4 06/11] ipv6: defrag: drop non-last frags smaller than min mtu Mao Wenan
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Peter Oskolkov <posk@google.com>

[ Upstream commit fa0f527358bd900ef92f925878ed6bfbd51305cc ]

Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
    netstat --statistics
    Ip:
        282078937 total packets received
        0 forwarded
        0 incoming packets discarded
        946760 incoming packets delivered
        18743456 requests sent out
        101 fragments dropped after timeout
        282077129 reassemblies required
        944952 packets reassembled ok
        262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
    reassemblies vs a kernel without this patchset. More
    comprehensive performance testing TBD).

Reported-by: Jann Horn <jannh@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/linux/skbuff.h                  |   2 +-
 include/net/inet_frag.h                 |   3 +-
 net/ipv4/inet_fragment.c                |  16 ++-
 net/ipv4/ip_fragment.c                  | 190 ++++++++++++++++++--------------
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c                   |   1 +
 6 files changed, 121 insertions(+), 92 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bfefdd..5c73c79 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -556,7 +556,7 @@ struct sk_buff {
 				struct skb_mstamp skb_mstamp;
 			};
 		};
-		struct rb_node	rbnode; /* used in netem & tcp stack */
+		struct rb_node	rbnode; /* used in netem, ip4 defrag, and tcp stack */
 	};
 	struct sock		*sk;
 	struct net_device	*dev;
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 09472b8..861d24c 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -45,7 +45,8 @@ struct inet_frag_queue {
 	struct timer_list	timer;
 	struct hlist_node	list;
 	atomic_t		refcnt;
-	struct sk_buff		*fragments;
+	struct sk_buff		*fragments;  /* Used in IPv6. */
+	struct rb_root		rb_fragments; /* Used in IPv4. */
 	struct sk_buff		*fragments_tail;
 	ktime_t			stamp;
 	int			len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index b2001b2..2b3a926 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -306,12 +306,16 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f)
 	/* Release all fragment data. */
 	fp = q->fragments;
 	nf = q->net;
-	while (fp) {
-		struct sk_buff *xp = fp->next;
-
-		sum_truesize += fp->truesize;
-		frag_kfree_skb(nf, f, fp);
-		fp = xp;
+	if (fp) {
+		do {
+			struct sk_buff *xp = fp->next;
+
+			sum_truesize += fp->truesize;
+			kfree_skb(fp);
+			fp = xp;
+		} while (fp);
+	} else {
+		sum_truesize = skb_rbtree_purge(&q->rb_fragments);
 	}
 	sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 264f382..e820eb9 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -194,7 +194,7 @@ static bool frag_expire_skip_icmp(u32 user)
  */
 static void ip_expire(unsigned long arg)
 {
-	struct sk_buff *clone, *head;
+	struct sk_buff *head = NULL;
 	const struct iphdr *iph;
 	struct net *net;
 	struct ipq *qp;
@@ -211,14 +211,31 @@ static void ip_expire(unsigned long arg)
 
 	ipq_kill(qp);
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMFAILS);
-
-	head = qp->q.fragments;
-
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMTIMEOUT);
 
-	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+	if (!qp->q.flags & INET_FRAG_FIRST_IN)
 		goto out;
 
+	/* sk_buff::dev and sk_buff::rbnode are unionized. So we
+	 * pull the head out of the tree in order to be able to
+	 * deal with head->dev.
+	 */
+	if (qp->q.fragments) {
+		head = qp->q.fragments;
+		qp->q.fragments = head->next;
+	} else {
+		head = skb_rb_first(&qp->q.rb_fragments);
+		if (!head)
+			goto out;
+		rb_erase(&head->rbnode, &qp->q.rb_fragments);
+		memset(&head->rbnode, 0, sizeof(head->rbnode));
+		barrier();
+	}
+	if (head == qp->q.fragments_tail)
+		qp->q.fragments_tail = NULL;
+
+	sub_frag_mem_limit(qp->q.net, head->truesize);
+
 	head->dev = dev_get_by_index_rcu(net, qp->iif);
 	if (!head->dev)
 		goto out;
@@ -237,20 +254,17 @@ static void ip_expire(unsigned long arg)
 	    (skb_rtable(head)->rt_type != RTN_LOCAL))
 		goto out;
 
-	clone = skb_clone(head, GFP_ATOMIC);
-
 	/* Send an ICMP "Fragment Reassembly Timeout" message. */
-	if (clone) {
-		spin_unlock(&qp->q.lock);
-		icmp_send(clone, ICMP_TIME_EXCEEDED,
-			  ICMP_EXC_FRAGTIME, 0);
-		consume_skb(clone);
-		goto out_rcu_unlock;
-	}
+	spin_unlock(&qp->q.lock);
+	icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
+	goto out_rcu_unlock;
+
 out:
 	spin_unlock(&qp->q.lock);
 out_rcu_unlock:
 	rcu_read_unlock();
+	if (head)
+		kfree_skb(head);
 	ipq_put(qp);
 }
 
@@ -294,7 +308,7 @@ static int ip_frag_too_far(struct ipq *qp)
 	end = atomic_inc_return(&peer->rid);
 	qp->rid = end;
 
-	rc = qp->q.fragments && (end - start) > max;
+	rc = qp->q.fragments_tail && (end - start) > max;
 
 	if (rc) {
 		struct net *net;
@@ -308,7 +322,6 @@ static int ip_frag_too_far(struct ipq *qp)
 
 static int ip_frag_reinit(struct ipq *qp)
 {
-	struct sk_buff *fp;
 	unsigned int sum_truesize = 0;
 
 	if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
@@ -316,20 +329,14 @@ static int ip_frag_reinit(struct ipq *qp)
 		return -ETIMEDOUT;
 	}
 
-	fp = qp->q.fragments;
-	do {
-		struct sk_buff *xp = fp->next;
-
-		sum_truesize += fp->truesize;
-		kfree_skb(fp);
-		fp = xp;
-	} while (fp);
+	sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
 	sub_frag_mem_limit(qp->q.net, sum_truesize);
 
 	qp->q.flags = 0;
 	qp->q.len = 0;
 	qp->q.meat = 0;
 	qp->q.fragments = NULL;
+	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
 	qp->iif = 0;
 	qp->ecn = 0;
@@ -341,7 +348,8 @@ static int ip_frag_reinit(struct ipq *qp)
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
-	struct sk_buff *prev, *next;
+	struct rb_node **rbn, *parent;
+	struct sk_buff *skb1;
 	struct net_device *dev;
 	unsigned int fragsize;
 	int flags, offset;
@@ -404,56 +412,60 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	if (err)
 		goto err;
 
-	/* Find out which fragments are in front and at the back of us
-	 * in the chain of fragments so far.  We must know where to put
-	 * this fragment, right?
-	 */
-	prev = qp->q.fragments_tail;
-	if (!prev || FRAG_CB(prev)->offset < offset) {
-		next = NULL;
-		goto found;
-	}
-	prev = NULL;
-	for (next = qp->q.fragments; next != NULL; next = next->next) {
-		if (FRAG_CB(next)->offset >= offset)
-			break;	/* bingo! */
-		prev = next;
-	}
+	/* Note : skb->rbnode and skb->dev share the same location. */
+	dev = skb->dev;
+	/* Makes sure compiler wont do silly aliasing games */
+	barrier();
 
-found:
 	/* RFC5722, Section 4, amended by Errata ID : 3089
 	 *                          When reassembling an IPv6 datagram, if
 	 *   one or more its constituent fragments is determined to be an
 	 *   overlapping fragment, the entire datagram (and any constituent
 	 *   fragments) MUST be silently discarded.
 	 *
-	 * We do the same here for IPv4.
+	 * We do the same here for IPv4 (and increment an snmp counter).
 	 */
-	/* Is there an overlap with the previous fragment? */
-	if (prev &&
-	    (FRAG_CB(prev)->offset + prev->len) > offset)
-		goto discard_qp;
-
-	/* Is there an overlap with the next fragment? */
-	if (next && FRAG_CB(next)->offset < end)
-		goto discard_qp;
-
-	FRAG_CB(skb)->offset = offset;
 
-	/* Insert this fragment in the chain of fragments. */
-	skb->next = next;
-	if (!next)
+	/* Find out where to put this fragment.  */
+	skb1 = qp->q.fragments_tail;
+	if (!skb1) {
+		/* This is the first fragment we've received. */
+		rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
 		qp->q.fragments_tail = skb;
-	if (prev)
-		prev->next = skb;
-	else
-		qp->q.fragments = skb;
+	} else if ((FRAG_CB(skb1)->offset + skb1->len) < end) {
+		/* This is the common/special case: skb goes to the end. */
+		/* Detect and discard overlaps. */
+		if (offset < (FRAG_CB(skb1)->offset + skb1->len))
+			goto discard_qp;
+		/* Insert after skb1. */
+		rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
+		qp->q.fragments_tail = skb;
+	} else {
+		/* Binary search. Note that skb can become the first fragment, but
+		 * not the last (covered above). */
+		rbn = &qp->q.rb_fragments.rb_node;
+		do {
+			parent = *rbn;
+			skb1 = rb_to_skb(parent);
+			if (end <= FRAG_CB(skb1)->offset)
+				rbn = &parent->rb_left;
+			else if (offset >= FRAG_CB(skb1)->offset + skb1->len)
+				rbn = &parent->rb_right;
+			else /* Found an overlap with skb1. */
+				goto discard_qp;
+		} while (*rbn);
+		/* Here we have parent properly set, and rbn pointing to
+		 * one of its NULL left/right children. Insert skb. */
+		rb_link_node(&skb->rbnode, parent, rbn);
+	}
+	rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 
-	dev = skb->dev;
 	if (dev) {
 		qp->iif = dev->ifindex;
 		skb->dev = NULL;
 	}
+	FRAG_CB(skb)->offset = offset;
+
 	qp->q.stamp = skb->tstamp;
 	qp->q.meat += skb->len;
 	qp->ecn |= ecn;
@@ -475,7 +487,7 @@ found:
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, prev, dev);
+		err = ip_frag_reasm(qp, skb, dev);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -492,15 +504,15 @@ err:
 	return err;
 }
 
-
 /* Build a new IP datagram from all its fragments. */
-
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			 struct net_device *dev)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
-	struct sk_buff *fp, *head = qp->q.fragments;
+	struct sk_buff *fp, *head = skb_rb_first(&qp->q.rb_fragments);
+	struct sk_buff **nextp; /* To build frag_list. */
+	struct rb_node *rbn;
 	int len;
 	int ihlen;
 	int err;
@@ -514,25 +526,21 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		goto out_fail;
 	}
 	/* Make the one we just received the head. */
-	if (prev) {
-		head = prev->next;
-		fp = skb_clone(head, GFP_ATOMIC);
+	if (head != skb) {
+		fp = skb_clone(skb, GFP_ATOMIC);
 		if (!fp)
 			goto out_nomem;
 
-		fp->next = head->next;
-		if (!fp->next)
+		rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
+		if (qp->q.fragments_tail == skb)
 			qp->q.fragments_tail = fp;
-		prev->next = fp;
-
-		skb_morph(head, qp->q.fragments);
-		head->next = qp->q.fragments->next;
-
-		consume_skb(qp->q.fragments);
-		qp->q.fragments = head;
+		skb_morph(skb, head);
+		rb_replace_node(&head->rbnode, &skb->rbnode,
+				&qp->q.rb_fragments);
+		consume_skb(head);
+		head = skb;
 	}
 
-	WARN_ON(!head);
 	WARN_ON(FRAG_CB(head)->offset != 0);
 
 	/* Allocate a new buffer for the datagram. */
@@ -557,24 +565,35 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		clone = alloc_skb(0, GFP_ATOMIC);
 		if (!clone)
 			goto out_nomem;
-		clone->next = head->next;
-		head->next = clone;
 		skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
 		skb_frag_list_init(head);
 		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
 			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
 		clone->len = clone->data_len = head->data_len - plen;
-		head->data_len -= clone->len;
-		head->len -= clone->len;
+		skb->truesize += clone->truesize;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
 		add_frag_mem_limit(qp->q.net, clone->truesize);
+		skb_shinfo(head)->frag_list = clone;
+		nextp = &clone->next;
+	} else {
+		nextp = &skb_shinfo(head)->frag_list;
 	}
 
-	skb_shinfo(head)->frag_list = head->next;
 	skb_push(head, head->data - skb_network_header(head));
 
-	for (fp=head->next; fp; fp = fp->next) {
+	/* Traverse the tree in order, to build frag_list. */
+	rbn = rb_next(&head->rbnode);
+	rb_erase(&head->rbnode, &qp->q.rb_fragments);
+	while (rbn) {
+		struct rb_node *rbnext = rb_next(rbn);
+		fp = rb_to_skb(rbn);
+		rb_erase(rbn, &qp->q.rb_fragments);
+		rbn = rbnext;
+		*nextp = fp;
+		nextp = &fp->next;
+		fp->prev = NULL;
+		memset(&fp->rbnode, 0, sizeof(fp->rbnode));
 		head->data_len += fp->len;
 		head->len += fp->len;
 		if (head->ip_summed != fp->ip_summed)
@@ -585,7 +604,9 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	}
 	sub_frag_mem_limit(qp->q.net, head->truesize);
 
+	*nextp = NULL;
 	head->next = NULL;
+	head->prev = NULL;
 	head->dev = dev;
 	head->tstamp = qp->q.stamp;
 	IPCB(head)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
@@ -613,6 +634,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
+	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
 	return 0;
 
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 5a9ae56..9cd8863 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -472,6 +472,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 					  head->csum);
 
 	fq->q.fragments = NULL;
+	fq->q.rb_fragments = RB_ROOT;
 	fq->q.fragments_tail = NULL;
 
 	/* all original skbs are linked into the NFCT_FRAG6_CB(head).orig */
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index ee4789b..adc7512 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -499,6 +499,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 	IP6_INC_STATS_BH(net, __in6_dev_get(dev), IPSTATS_MIB_REASMOKS);
 	rcu_read_unlock();
 	fq->q.fragments = NULL;
+	fq->q.rb_fragments = RB_ROOT;
 	fq->q.fragments_tail = NULL;
 	return 1;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 06/11] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
                   ` (4 preceding siblings ...)
  2019-01-23  2:19 ` [PATCH stable 4.4 05/11] ip: use rb trees for IP frag queue Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  2019-01-24 18:31   ` Greg KH
  2019-01-23  2:19 ` [PATCH stable 4.4 07/11] ip: add helpers to process in-order fragments faster Mao Wenan
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Florian Westphal <fw@strlen.de>

[ Upstream commit 0ed4229b08c13c84a3c301a08defdc9e7f4467e6 ]

don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
    There were concerns that there could be even smaller frags
    generated by intermediate nodes, e.g. on radio networks.

Cc: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 ++++
 net/ipv6/reassembly.c                   | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 9cd8863..c5033a2 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -602,6 +602,10 @@ struct sk_buff *nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 use
 	hdr = ipv6_hdr(clone);
 	fhdr = (struct frag_hdr *)skb_transport_header(clone);
 
+	if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+	    fhdr->frag_off & htons(IP6_MF))
+		return -EINVAL;
+
 	skb_orphan(skb);
 	fq = fq_find(net, fhdr->identification, user, &hdr->saddr, &hdr->daddr,
 		     skb->dev ? skb->dev->ifindex : 0, ip6_frag_ecn(hdr));
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index adc7512..44c7f4c 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -549,6 +549,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
 		return 1;
 	}
 
+	if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+	    fhdr->frag_off & htons(IP6_MF))
+		goto fail_hdr;
+
 	fq = fq_find(net, fhdr->identification, &hdr->saddr, &hdr->daddr,
 		     skb->dev ? skb->dev->ifindex : 0, ip6_frag_ecn(hdr));
 	if (fq) {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 07/11] ip: add helpers to process in-order fragments faster.
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
                   ` (5 preceding siblings ...)
  2019-01-23  2:19 ` [PATCH stable 4.4 06/11] ipv6: defrag: drop non-last frags smaller than min mtu Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 08/11] ip: process in-order fragments efficiently Mao Wenan
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Peter Oskolkov <posk@google.com>

[ Upstream commit 353c9cb360874e737fb000545f783df756c06f9a ]

This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, as is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/net/inet_frag.h |  4 +++
 net/ipv4/ip_fragment.c  | 74 ++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 861d24c..e7d9577 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -48,6 +48,7 @@ struct inet_frag_queue {
 	struct sk_buff		*fragments;  /* Used in IPv6. */
 	struct rb_root		rb_fragments; /* Used in IPv4. */
 	struct sk_buff		*fragments_tail;
+	struct sk_buff		*last_run_head;
 	ktime_t			stamp;
 	int			len;
 	int			meat;
@@ -118,6 +119,9 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
 				   const char *prefix);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f)
 {
 	if (atomic_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index e820eb9..73ec3a9 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -58,13 +58,57 @@
 static int sysctl_ipfrag_max_dist __read_mostly = 64;
 static const char ip_frag_cache_name[] = "ip4-frags";
 
-struct ipfrag_skb_cb
-{
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
 	struct inet_skb_parm	h;
-	int			offset;
+	int                     offset;
+	struct sk_buff		*next_frag;
+	int			frag_run_len;
 };
 
-#define FRAG_CB(skb)	((struct ipfrag_skb_cb *)((skb)->cb))
+#define FRAG_CB(skb)		((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+	FRAG_CB(skb)->next_frag = NULL;
+	FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+					struct sk_buff *skb)
+{
+	RB_CLEAR_NODE(&skb->rbnode);
+	FRAG_CB(skb)->next_frag = NULL;
+
+	FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+	FRAG_CB(q->fragments_tail)->next_frag = skb;
+	q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+	if (q->last_run_head)
+		rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+			     &q->last_run_head->rbnode.rb_right);
+	else
+		rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+	rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+	ip4_frag_init_run(skb);
+	q->fragments_tail = skb;
+	q->last_run_head = skb;
+}
 
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
@@ -721,6 +765,28 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_check_defrag);
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+	struct rb_node *p = rb_first(root);
+	unsigned int sum = 0;
+
+	while (p) {
+		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, root);
+		while (skb) {
+			struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+			sum += skb->truesize;
+			kfree_skb(skb);
+			skb = next;
+		}
+	}
+	return sum;
+}
+EXPORT_SYMBOL(inet_frag_rbtree_purge);
+
 #ifdef CONFIG_SYSCTL
 static int zero;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 08/11] ip: process in-order fragments efficiently
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
                   ` (6 preceding siblings ...)
  2019-01-23  2:19 ` [PATCH stable 4.4 07/11] ip: add helpers to process in-order fragments faster Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 09/11] net: ipv4: do not handle duplicate fragments as overlapping Mao Wenan
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Peter Oskolkov <posk@google.com>

[ Upstream commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c ]

This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 +++++++++++++++++++++++++++++------------------
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 2b3a926..046c6c3 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -315,7 +315,7 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f)
 			fp = xp;
 		} while (fp);
 	} else {
-		sum_truesize = skb_rbtree_purge(&q->rb_fragments);
+		sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
 	}
 	sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 73ec3a9..61f4216 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -139,8 +139,8 @@ int ip_frag_mem(struct net *net)
 	return sum_frag_mem_limit(&net->ipv4.frags);
 }
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-			 struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+			struct sk_buff *prev_tail, struct net_device *dev);
 
 struct ip4_create_arg {
 	struct iphdr *iph;
@@ -271,7 +271,12 @@ static void ip_expire(unsigned long arg)
 		head = skb_rb_first(&qp->q.rb_fragments);
 		if (!head)
 			goto out;
-		rb_erase(&head->rbnode, &qp->q.rb_fragments);
+		if (FRAG_CB(head)->next_frag)
+			rb_replace_node(&head->rbnode,
+					&FRAG_CB(head)->next_frag->rbnode,
+					&qp->q.rb_fragments);
+		else
+			rb_erase(&head->rbnode, &qp->q.rb_fragments);
 		memset(&head->rbnode, 0, sizeof(head->rbnode));
 		barrier();
 	}
@@ -373,7 +378,7 @@ static int ip_frag_reinit(struct ipq *qp)
 		return -ETIMEDOUT;
 	}
 
-	sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
+	sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
 	sub_frag_mem_limit(qp->q.net, sum_truesize);
 
 	qp->q.flags = 0;
@@ -382,6 +387,7 @@ static int ip_frag_reinit(struct ipq *qp)
 	qp->q.fragments = NULL;
 	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
+	qp->q.last_run_head = NULL;
 	qp->iif = 0;
 	qp->ecn = 0;
 
@@ -393,7 +399,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct rb_node **rbn, *parent;
-	struct sk_buff *skb1;
+	struct sk_buff *skb1, *prev_tail;
 	struct net_device *dev;
 	unsigned int fragsize;
 	int flags, offset;
@@ -471,38 +477,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	 */
 
 	/* Find out where to put this fragment.  */
-	skb1 = qp->q.fragments_tail;
-	if (!skb1) {
-		/* This is the first fragment we've received. */
-		rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
-		qp->q.fragments_tail = skb;
-	} else if ((FRAG_CB(skb1)->offset + skb1->len) < end) {
-		/* This is the common/special case: skb goes to the end. */
+	prev_tail = qp->q.fragments_tail;
+	if (!prev_tail)
+		ip4_frag_create_run(&qp->q, skb);  /* First fragment. */
+	else if (FRAG_CB(prev_tail)->offset + prev_tail->len < end) {
+		/* This is the common case: skb goes to the end. */
 		/* Detect and discard overlaps. */
-		if (offset < (FRAG_CB(skb1)->offset + skb1->len))
+		if (offset < FRAG_CB(prev_tail)->offset + prev_tail->len)
 			goto discard_qp;
-		/* Insert after skb1. */
-		rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
-		qp->q.fragments_tail = skb;
+		if (offset == FRAG_CB(prev_tail)->offset + prev_tail->len)
+			ip4_frag_append_to_last_run(&qp->q, skb);
+		else
+			ip4_frag_create_run(&qp->q, skb);
 	} else {
-		/* Binary search. Note that skb can become the first fragment, but
-		 * not the last (covered above). */
+		/* Binary search. Note that skb can become the first fragment,
+		 * but not the last (covered above).
+		 */
 		rbn = &qp->q.rb_fragments.rb_node;
 		do {
 			parent = *rbn;
 			skb1 = rb_to_skb(parent);
 			if (end <= FRAG_CB(skb1)->offset)
 				rbn = &parent->rb_left;
-			else if (offset >= FRAG_CB(skb1)->offset + skb1->len)
+			else if (offset >= FRAG_CB(skb1)->offset +
+						FRAG_CB(skb1)->frag_run_len)
 				rbn = &parent->rb_right;
 			else /* Found an overlap with skb1. */
 				goto discard_qp;
 		} while (*rbn);
 		/* Here we have parent properly set, and rbn pointing to
-		 * one of its NULL left/right children. Insert skb. */
+		 * one of its NULL left/right children. Insert skb.
+		 */
+		ip4_frag_init_run(skb);
 		rb_link_node(&skb->rbnode, parent, rbn);
+		rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 	}
-	rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 
 	if (dev) {
 		qp->iif = dev->ifindex;
@@ -531,7 +540,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, skb, dev);
+		err = ip_frag_reasm(qp, skb, prev_tail, dev);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -550,7 +559,7 @@ err:
 
 /* Build a new IP datagram from all its fragments. */
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
-			 struct net_device *dev)
+			 struct sk_buff *prev_tail, struct net_device *dev)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
@@ -575,10 +584,16 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 		if (!fp)
 			goto out_nomem;
 
-		rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
+		FRAG_CB(fp)->next_frag = FRAG_CB(skb)->next_frag;
+		if (RB_EMPTY_NODE(&skb->rbnode))
+			FRAG_CB(prev_tail)->next_frag = fp;
+		else
+			rb_replace_node(&skb->rbnode, &fp->rbnode,
+					&qp->q.rb_fragments);
 		if (qp->q.fragments_tail == skb)
 			qp->q.fragments_tail = fp;
 		skb_morph(skb, head);
+		FRAG_CB(skb)->next_frag = FRAG_CB(head)->next_frag;
 		rb_replace_node(&head->rbnode, &skb->rbnode,
 				&qp->q.rb_fragments);
 		consume_skb(head);
@@ -614,7 +629,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
 			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
 		clone->len = clone->data_len = head->data_len - plen;
-		skb->truesize += clone->truesize;
+		head->truesize += clone->truesize;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
 		add_frag_mem_limit(qp->q.net, clone->truesize);
@@ -627,24 +642,36 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	skb_push(head, head->data - skb_network_header(head));
 
 	/* Traverse the tree in order, to build frag_list. */
+	fp = FRAG_CB(head)->next_frag;
 	rbn = rb_next(&head->rbnode);
 	rb_erase(&head->rbnode, &qp->q.rb_fragments);
-	while (rbn) {
-		struct rb_node *rbnext = rb_next(rbn);
-		fp = rb_to_skb(rbn);
-		rb_erase(rbn, &qp->q.rb_fragments);
-		rbn = rbnext;
-		*nextp = fp;
-		nextp = &fp->next;
-		fp->prev = NULL;
-		memset(&fp->rbnode, 0, sizeof(fp->rbnode));
-		head->data_len += fp->len;
-		head->len += fp->len;
-		if (head->ip_summed != fp->ip_summed)
-			head->ip_summed = CHECKSUM_NONE;
-		else if (head->ip_summed == CHECKSUM_COMPLETE)
-			head->csum = csum_add(head->csum, fp->csum);
-		head->truesize += fp->truesize;
+	while (rbn || fp) {
+		/* fp points to the next sk_buff in the current run;
+		 * rbn points to the next run.
+		 */
+		/* Go through the current run. */
+		while (fp) {
+			*nextp = fp;
+			nextp = &fp->next;
+			fp->prev = NULL;
+			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+			head->data_len += fp->len;
+			head->len += fp->len;
+			if (head->ip_summed != fp->ip_summed)
+				head->ip_summed = CHECKSUM_NONE;
+			else if (head->ip_summed == CHECKSUM_COMPLETE)
+				head->csum = csum_add(head->csum, fp->csum);
+			head->truesize += fp->truesize;
+			fp = FRAG_CB(fp)->next_frag;
+		}
+		/* Move to the next run. */
+		if (rbn) {
+			struct rb_node *rbnext = rb_next(rbn);
+
+			fp = rb_to_skb(rbn);
+			rb_erase(rbn, &qp->q.rb_fragments);
+			rbn = rbnext;
+		}
 	}
 	sub_frag_mem_limit(qp->q.net, head->truesize);
 
@@ -680,6 +707,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	qp->q.fragments = NULL;
 	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
+	qp->q.last_run_head = NULL;
 	return 0;
 
 out_nomem:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 09/11] net: ipv4: do not handle duplicate fragments as overlapping
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
                   ` (7 preceding siblings ...)
  2019-01-23  2:19 ` [PATCH stable 4.4 08/11] ip: process in-order fragments efficiently Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 10/11] ip: frags: fix crash in ip_do_fragment() Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 11/11] ipv4: frags: precedence bug in ip_expire() Mao Wenan
  10 siblings, 0 replies; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Michal Kubecek <mkubecek@suse.cz>

[ Upstream commit ade446403bfb79d3528d56071a84b15351a139ad ]

Since commit 7969e5c40dfd ("ip: discard IPv4 datagrams with overlapping
segments.") IPv4 reassembly code drops the whole queue whenever an
overlapping fragment is received. However, the test is written in a way
which detects duplicate fragments as overlapping so that in environments
with many duplicate packets, fragmented packets may be undeliverable.

Add an extra test and for (potentially) duplicate fragment, only drop the
new fragment rather than the whole queue. Only starting offset and length
are checked, not the contents of the fragments as that would be too
expensive. For similar reason, linear list ("run") of a rbtree node is not
iterated, we only check if the new fragment is a subset of the interval
covered by existing consecutive fragments.

v2: instead of an exact check iterating through linear list of an rbtree
node, only check if the new fragment is subset of the "run" (suggested
by Eric Dumazet)

Fixes: 7969e5c40dfd ("ip: discard IPv4 datagrams with overlapping segments.")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/ip_fragment.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 61f4216..500dfdd 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -400,10 +400,10 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct rb_node **rbn, *parent;
 	struct sk_buff *skb1, *prev_tail;
+	int ihl, end, skb1_run_end;
 	struct net_device *dev;
 	unsigned int fragsize;
 	int flags, offset;
-	int ihl, end;
 	int err = -ENOENT;
 	u8 ecn;
 
@@ -473,7 +473,9 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	 *   overlapping fragment, the entire datagram (and any constituent
 	 *   fragments) MUST be silently discarded.
 	 *
-	 * We do the same here for IPv4 (and increment an snmp counter).
+	 * We do the same here for IPv4 (and increment an snmp counter) but
+	 * we do not want to drop the whole queue in response to a duplicate
+	 * fragment.
 	 */
 
 	/* Find out where to put this fragment.  */
@@ -497,13 +499,17 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		do {
 			parent = *rbn;
 			skb1 = rb_to_skb(parent);
+			skb1_run_end = FRAG_CB(skb1)->offset +
+				       FRAG_CB(skb1)->frag_run_len;
 			if (end <= FRAG_CB(skb1)->offset)
 				rbn = &parent->rb_left;
-			else if (offset >= FRAG_CB(skb1)->offset +
-						FRAG_CB(skb1)->frag_run_len)
+			else if (offset >= skb1_run_end)
 				rbn = &parent->rb_right;
-			else /* Found an overlap with skb1. */
-				goto discard_qp;
+			else if (offset >= FRAG_CB(skb1)->offset &&
+				 end <= skb1_run_end)
+				goto err; /* No new data, potential duplicate */
+			else
+				goto discard_qp; /* Found an overlap */
 		} while (*rbn);
 		/* Here we have parent properly set, and rbn pointing to
 		 * one of its NULL left/right children. Insert skb.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 10/11] ip: frags: fix crash in ip_do_fragment()
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
                   ` (8 preceding siblings ...)
  2019-01-23  2:19 ` [PATCH stable 4.4 09/11] net: ipv4: do not handle duplicate fragments as overlapping Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  2019-01-23  2:19 ` [PATCH stable 4.4 11/11] ipv4: frags: precedence bug in ip_expire() Mao Wenan
  10 siblings, 0 replies; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Taehee Yoo <ap420073@gmail.com>

[ Upstream commit 5d407b071dc369c26a38398326ee2be53651cfe4 ]

A kernel crash occurrs when defragmented packet is fragmented
in ip_do_fragment().
In defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set. but skb->sk and
skb->ip_defrag_offset are same union member. so that
frag->sk is not NULL.
Hence crash occurrs in skb->sk check routine in ip_do_fragment() when
defragmented packet is fragmented.

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
[  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
[  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
[  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
[  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
[  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk at reassembly routine.(Eric Dumarzet)

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/ip_fragment.c                  | 1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 500dfdd..0e75881 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -661,6 +661,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			nextp = &fp->next;
 			fp->prev = NULL;
 			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+			fp->sk = NULL;
 			head->data_len += fp->len;
 			head->len += fp->len;
 			if (head->ip_summed != fp->ip_summed)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index c5033a2..17f80dd 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -454,6 +454,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 		else if (head->ip_summed == CHECKSUM_COMPLETE)
 			head->csum = csum_add(head->csum, fp->csum);
 		head->truesize += fp->truesize;
+		fp->sk = NULL;
 	}
 	sub_frag_mem_limit(fq->q.net, head->truesize);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH stable 4.4 11/11] ipv4: frags: precedence bug in ip_expire()
  2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
                   ` (9 preceding siblings ...)
  2019-01-23  2:19 ` [PATCH stable 4.4 10/11] ip: frags: fix crash in ip_do_fragment() Mao Wenan
@ 2019-01-23  2:19 ` Mao Wenan
  10 siblings, 0 replies; 18+ messages in thread
From: Mao Wenan @ 2019-01-23  2:19 UTC (permalink / raw)
  To: netdev, gregkh, eric.dumazet, davem, stable, edumazet

From: Dan Carpenter <dan.carpenter@oracle.com>

[ Upstream commit 70837ffe3085c9a91488b52ca13ac84424da1042 ]

We accidentally removed the parentheses here, but they are required
because '!' has higher precedence than '&'.

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/ip_fragment.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 0e75881..eb8955c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -257,7 +257,7 @@ static void ip_expire(unsigned long arg)
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMFAILS);
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMTIMEOUT);
 
-	if (!qp->q.flags & INET_FRAG_FIRST_IN)
+	if (!(qp->q.flags & INET_FRAG_FIRST_IN))
 		goto out;
 
 	/* sk_buff::dev and sk_buff::rbnode are unionized. So we
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH stable 4.4 05/11] ip: use rb trees for IP frag queue.
  2019-01-23  2:19 ` [PATCH stable 4.4 05/11] ip: use rb trees for IP frag queue Mao Wenan
@ 2019-01-24 17:58   ` Greg KH
  2019-01-25  1:50     ` maowenan
  0 siblings, 1 reply; 18+ messages in thread
From: Greg KH @ 2019-01-24 17:58 UTC (permalink / raw)
  To: Mao Wenan; +Cc: netdev, eric.dumazet, davem, stable, edumazet

On Wed, Jan 23, 2019 at 10:19:40AM +0800, Mao Wenan wrote:
> From: Peter Oskolkov <posk@google.com>
> 
> [ Upstream commit fa0f527358bd900ef92f925878ed6bfbd51305cc ]

This commit is not in the 4.14.y tree, any specific reason why not?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH stable 4.4 06/11] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-23  2:19 ` [PATCH stable 4.4 06/11] ipv6: defrag: drop non-last frags smaller than min mtu Mao Wenan
@ 2019-01-24 18:31   ` Greg KH
  2019-01-25  2:24     ` maowenan
  0 siblings, 1 reply; 18+ messages in thread
From: Greg KH @ 2019-01-24 18:31 UTC (permalink / raw)
  To: Mao Wenan; +Cc: netdev, eric.dumazet, davem, stable, edumazet

On Wed, Jan 23, 2019 at 10:19:41AM +0800, Mao Wenan wrote:
> From: Florian Westphal <fw@strlen.de>
> 
> [ Upstream commit 0ed4229b08c13c84a3c301a08defdc9e7f4467e6 ]
> 
> don't bother with pathological cases, they only waste cycles.
> IPv6 requires a minimum MTU of 1280 so we should never see fragments
> smaller than this (except last frag).
> 
> v3: don't use awkward "-offset + len"
> v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
>     There were concerns that there could be even smaller frags
>     generated by intermediate nodes, e.g. on radio networks.
> 
> Cc: Peter Oskolkov <posk@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Mao Wenan <maowenan@huawei.com>
> ---
>  net/ipv6/netfilter/nf_conntrack_reasm.c | 4 ++++
>  net/ipv6/reassembly.c                   | 4 ++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
> index 9cd8863..c5033a2 100644
> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -602,6 +602,10 @@ struct sk_buff *nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 use
>  	hdr = ipv6_hdr(clone);
>  	fhdr = (struct frag_hdr *)skb_transport_header(clone);
>  
> +	if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
> +	    fhdr->frag_off & htons(IP6_MF))
> +		return -EINVAL;

This backport is incorrect, you should be returning a pointer, right?

How did you test this?  This should have blown up under test :(

I'm going to drop this whole series.  Please fix it up and test it
properly and then resend.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH stable 4.4 05/11] ip: use rb trees for IP frag queue.
  2019-01-24 17:58   ` Greg KH
@ 2019-01-25  1:50     ` maowenan
  2019-01-25  7:07       ` Greg KH
  0 siblings, 1 reply; 18+ messages in thread
From: maowenan @ 2019-01-25  1:50 UTC (permalink / raw)
  To: Greg KH; +Cc: netdev, eric.dumazet, davem, stable, edumazet



On 2019/1/25 1:58, Greg KH wrote:
> On Wed, Jan 23, 2019 at 10:19:40AM +0800, Mao Wenan wrote:
>> From: Peter Oskolkov <posk@google.com>
>>
>> [ Upstream commit fa0f527358bd900ef92f925878ed6bfbd51305cc ]
> 
> This commit is not in the 4.14.y tree, any specific reason why not?

I found the commit 6b921536f1707a240e6f53843f1f26231016fda5 net: sk_buff rbnode reorg  in v4.14.y
including the fixes.

> 
> thanks,
> 
> greg k-h
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH stable 4.4 06/11] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-24 18:31   ` Greg KH
@ 2019-01-25  2:24     ` maowenan
  0 siblings, 0 replies; 18+ messages in thread
From: maowenan @ 2019-01-25  2:24 UTC (permalink / raw)
  To: Greg KH; +Cc: netdev, eric.dumazet, davem, stable, edumazet



On 2019/1/25 2:31, Greg KH wrote:
> On Wed, Jan 23, 2019 at 10:19:41AM +0800, Mao Wenan wrote:
>> From: Florian Westphal <fw@strlen.de>
>>
>> [ Upstream commit 0ed4229b08c13c84a3c301a08defdc9e7f4467e6 ]
>>
>> don't bother with pathological cases, they only waste cycles.
>> IPv6 requires a minimum MTU of 1280 so we should never see fragments
>> smaller than this (except last frag).
>>
>> v3: don't use awkward "-offset + len"
>> v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
>>     There were concerns that there could be even smaller frags
>>     generated by intermediate nodes, e.g. on radio networks.
>>
>> Cc: Peter Oskolkov <posk@google.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Florian Westphal <fw@strlen.de>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
>> ---
>>  net/ipv6/netfilter/nf_conntrack_reasm.c | 4 ++++
>>  net/ipv6/reassembly.c                   | 4 ++++
>>  2 files changed, 8 insertions(+)
>>
>> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
>> index 9cd8863..c5033a2 100644
>> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
>> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
>> @@ -602,6 +602,10 @@ struct sk_buff *nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 use
>>  	hdr = ipv6_hdr(clone);
>>  	fhdr = (struct frag_hdr *)skb_transport_header(clone);
>>  
>> +	if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
>> +	    fhdr->frag_off & htons(IP6_MF))
>> +		return -EINVAL;
> 
> This backport is incorrect, you should be returning a pointer, right?

Thank you for correcting me, the return value should be a pointer,
I will fix it and test all of patches again, then resend v2.
sorry for my mistake.
> 
> How did you test this?  This should have blown up under test :(
> 
> I'm going to drop this whole series.  Please fix it up and test it
> properly and then resend.
> 
> thanks,
> 
> greg k-h
> 
> .
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH stable 4.4 05/11] ip: use rb trees for IP frag queue.
  2019-01-25  1:50     ` maowenan
@ 2019-01-25  7:07       ` Greg KH
  2019-01-25  8:12         ` maowenan
  0 siblings, 1 reply; 18+ messages in thread
From: Greg KH @ 2019-01-25  7:07 UTC (permalink / raw)
  To: maowenan; +Cc: netdev, eric.dumazet, davem, stable, edumazet

On Fri, Jan 25, 2019 at 09:50:35AM +0800, maowenan wrote:
> 
> 
> On 2019/1/25 1:58, Greg KH wrote:
> > On Wed, Jan 23, 2019 at 10:19:40AM +0800, Mao Wenan wrote:
> >> From: Peter Oskolkov <posk@google.com>
> >>
> >> [ Upstream commit fa0f527358bd900ef92f925878ed6bfbd51305cc ]
> > 
> > This commit is not in the 4.14.y tree, any specific reason why not?
> 
> I found the commit 6b921536f1707a240e6f53843f1f26231016fda5 net: sk_buff rbnode reorg  in v4.14.y
> including the fixes.

Yes, that commit is really bffa72cf7f9d ("net: sk_buff rbnode reorg"),
which is upstream in 4.14.15 and 4.15.  But fa0f527358bd ("ip: use rb
trees for IP frag queue.") is not in 4.14 at all, it showed up in
4.9.134 and 4.19.  Why did the 4.14 tree not need it and 4.9 and 4.4
does?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH stable 4.4 05/11] ip: use rb trees for IP frag queue.
  2019-01-25  7:07       ` Greg KH
@ 2019-01-25  8:12         ` maowenan
  0 siblings, 0 replies; 18+ messages in thread
From: maowenan @ 2019-01-25  8:12 UTC (permalink / raw)
  To: Greg KH; +Cc: netdev, eric.dumazet, davem, stable, edumazet



On 2019/1/25 15:07, Greg KH wrote:
> On Fri, Jan 25, 2019 at 09:50:35AM +0800, maowenan wrote:
>>
>>
>> On 2019/1/25 1:58, Greg KH wrote:
>>> On Wed, Jan 23, 2019 at 10:19:40AM +0800, Mao Wenan wrote:
>>>> From: Peter Oskolkov <posk@google.com>
>>>>
>>>> [ Upstream commit fa0f527358bd900ef92f925878ed6bfbd51305cc ]
>>>
>>> This commit is not in the 4.14.y tree, any specific reason why not?
>>
>> I found the commit 6b921536f1707a240e6f53843f1f26231016fda5 net: sk_buff rbnode reorg  in v4.14.y
>> including the fixes.
> 
> Yes, that commit is really bffa72cf7f9d ("net: sk_buff rbnode reorg"),
> which is upstream in 4.14.15 and 4.15.  But fa0f527358bd ("ip: use rb
> trees for IP frag queue.") is not in 4.14 at all, it showed up in
> 4.9.134 and 4.19.  Why did the 4.14 tree not need it and 4.9 and 4.4
> does?

The commit 6b921536f170(net: sk_buff rbnode reorg) in 4.14 combined two commits in
mainline(bffa72cf7f9d net: sk_buff rbnode reorg. and fa0f527358bd ip: use rb trees for IP frag queue.).
The main fix patch for CVE-2018-5391 is fa0f527358bd(ip: use rb trees for IP frag queue), I don't think
it is necessary to backport bffa72cf7f9d to 4.14, but the fa0f527358bd is really needed.


mainline patches
commit bffa72cf7f9df842f0016ba03586039296b4caaf
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Sep 19 05:14:24 2017 -0700

    net: sk_buff rbnode reorg

commit fa0f527358bd900ef92f925878ed6bfbd51305cc
Author: Peter Oskolkov <posk@google.com>
Date:   Thu Aug 2 23:34:39 2018 +0000

    ip: use rb trees for IP frag queue.

linux-4.14.y
commit 6b921536f1707a240e6f53843f1f26231016fda5
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Sep 13 07:58:58 2018 -0700

    net: sk_buff rbnode reorg


> 
> thanks,
> 
> greg k-h
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-01-25  8:12 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-23  2:19 [PATCH stable 4.4 00/11] fix FragmentSmack in stable branch (CVE-2018-5391) Mao Wenan
2019-01-23  2:19 ` [PATCH stable 4.4 01/11] net: speed up skb_rbtree_purge() Mao Wenan
2019-01-23  2:19 ` [PATCH stable 4.4 02/11] ip: discard IPv4 datagrams with overlapping segments Mao Wenan
2019-01-23  2:19 ` [PATCH stable 4.4 03/11] net: modify skb_rbtree_purge to return the truesize of all purged skbs Mao Wenan
2019-01-23  2:19 ` [PATCH stable 4.4 04/11] inet: frags: get rif of inet_frag_evicting() Mao Wenan
2019-01-23  2:19 ` [PATCH stable 4.4 05/11] ip: use rb trees for IP frag queue Mao Wenan
2019-01-24 17:58   ` Greg KH
2019-01-25  1:50     ` maowenan
2019-01-25  7:07       ` Greg KH
2019-01-25  8:12         ` maowenan
2019-01-23  2:19 ` [PATCH stable 4.4 06/11] ipv6: defrag: drop non-last frags smaller than min mtu Mao Wenan
2019-01-24 18:31   ` Greg KH
2019-01-25  2:24     ` maowenan
2019-01-23  2:19 ` [PATCH stable 4.4 07/11] ip: add helpers to process in-order fragments faster Mao Wenan
2019-01-23  2:19 ` [PATCH stable 4.4 08/11] ip: process in-order fragments efficiently Mao Wenan
2019-01-23  2:19 ` [PATCH stable 4.4 09/11] net: ipv4: do not handle duplicate fragments as overlapping Mao Wenan
2019-01-23  2:19 ` [PATCH stable 4.4 10/11] ip: frags: fix crash in ip_do_fragment() Mao Wenan
2019-01-23  2:19 ` [PATCH stable 4.4 11/11] ipv4: frags: precedence bug in ip_expire() Mao Wenan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).