* [PATCH v3 00/30] backport of IP fragmentation fixes
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Stephen Hemminger

Took the set of patches from 4.19 that handle the IP fragmentation DoS
and applied them against 4.14.69.  Most of these are from Eric.
In a couple of cases, manual merge conflict resolution was required.

Tested normal IP fragmentation with iperf3 and malicious IP fragments
with FragmentSmack. Under a fragmentation attack (700 Kpps) the original
4.14.69 consumes 97% CPU; with this series it drops to 5%.

v3 - send to wider audience
v2 - added patch from 4.19 linux-next to fix an IP fragmentation crash

Dan Carpenter (1):
  ipv4: frags: precedence bug in ip_expire()

Eric Dumazet (22):
  inet: frags: change inet_frags_init_net() return value
  inet: frags: add a pointer to struct netns_frags
  inet: frags: refactor ipfrag_init()
  inet: frags: refactor ipv6_frag_init()
  inet: frags: refactor lowpan_net_frag_init()
  ipv6: export ip6 fragments sysctl to unprivileged users
  rhashtable: add schedule points
  inet: frags: use rhashtables for reassembly units
  inet: frags: remove some helpers
  inet: frags: get rid of inet_frag_evicting()
  inet: frags: remove inet_frag_maybe_warn_overflow()
  inet: frags: break the 2GB limit for frags storage
  inet: frags: do not clone skb in ip_expire()
  ipv6: frags: rewrite ip6_expire_frag_queue()
  rhashtable: reorganize struct rhashtable layout
  inet: frags: reorganize struct netns_frags
  inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
  inet: frags: fix ip6frag_low_thresh boundary
  net: speed up skb_rbtree_purge()
  net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
  net: add rb_to_skb() and other rb tree helpers
  net: sk_buff rbnode reorg

Florian Westphal (1):
  ipv6: defrag: drop non-last frags smaller than min mtu

Kees Cook (1):
  inet: frags: Convert timers to use timer_setup()

Peter Oskolkov (4):
  ip: discard IPv4 datagrams with overlapping segments.
  net: modify skb_rbtree_purge to return the truesize of all purged
    skbs.
  ip: add helpers to process in-order fragments faster.
  ip: process in-order fragments efficiently

Taehee Yoo (1):
  ip: frags: fix crash in ip_do_fragment()

 Documentation/networking/ip-sysctl.txt  |  13 +-
 include/linux/rhashtable.h              |   8 +-
 include/linux/skbuff.h                  |  50 +-
 include/net/inet_frag.h                 | 135 +++---
 include/net/ip.h                        |   1 -
 include/net/ipv6.h                      |  26 +-
 include/uapi/linux/snmp.h               |   1 +
 lib/rhashtable.c                        |   2 +
 net/core/skbuff.c                       |  31 +-
 net/ieee802154/6lowpan/6lowpan_i.h      |  26 +-
 net/ieee802154/6lowpan/reassembly.c     | 153 ++++---
 net/ipv4/inet_fragment.c                | 378 ++++------------
 net/ipv4/ip_fragment.c                  | 578 +++++++++++++-----------
 net/ipv4/proc.c                         |   7 +-
 net/ipv4/tcp_fastopen.c                 |   8 +-
 net/ipv4/tcp_input.c                    |  33 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c | 105 ++---
 net/ipv6/proc.c                         |   5 +-
 net/ipv6/reassembly.c                   | 217 ++++-----
 net/sched/sch_netem.c                   |  14 +-
 20 files changed, 802 insertions(+), 989 deletions(-)

-- 
2.18.0

* [PATCH v3 01/30] inet: frags: change inet_frags_init_net() return value
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

We will soon initialize one rhashtable per struct netns_frags
in inet_frags_init_net().

This patch changes the return value to eventually propagate an
error.
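
For illustration, a minimal sketch of the init/unwind pattern each
per-netns init function adopts below (the struct members and helper
names here are hypothetical, not code from the tree):

	/* Hypothetical pernet init: propagate the (future) failure of
	 * inet_frags_init_net(), and unwind it if a later step fails.
	 */
	static int __net_init example_frags_init_net(struct net *net)
	{
		int res;

		res = inet_frags_init_net(&net->example.frags);
		if (res < 0)
			return res;

		res = example_sysctl_register(net);
		if (res < 0)
			inet_frags_exit_net(&net->example.frags,
					    &example_frags);
		return res;
	}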

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 787bea7748a76130566f881c2342a0be4127d182)
---
 include/net/inet_frag.h                 |  3 ++-
 net/ieee802154/6lowpan/reassembly.c     | 11 ++++++++---
 net/ipv4/ip_fragment.c                  | 12 +++++++++---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 12 +++++++++---
 net/ipv6/reassembly.c                   | 11 +++++++++--
 5 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index a6e4edd8d4a2..2ad894e446ac 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -104,9 +104,10 @@ struct inet_frags {
 int inet_frags_init(struct inet_frags *);
 void inet_frags_fini(struct inet_frags *);
 
-static inline void inet_frags_init_net(struct netns_frags *nf)
+static inline int inet_frags_init_net(struct netns_frags *nf)
 {
 	atomic_set(&nf->mem, 0);
+	return 0;
 }
 void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
 
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index f85b08baff16..9757ce6c077a 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -580,14 +580,19 @@ static int __net_init lowpan_frags_init_net(struct net *net)
 {
 	struct netns_ieee802154_lowpan *ieee802154_lowpan =
 		net_ieee802154_lowpan(net);
+	int res;
 
 	ieee802154_lowpan->frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
 	ieee802154_lowpan->frags.low_thresh = IPV6_FRAG_LOW_THRESH;
 	ieee802154_lowpan->frags.timeout = IPV6_FRAG_TIMEOUT;
 
-	inet_frags_init_net(&ieee802154_lowpan->frags);
-
-	return lowpan_frags_ns_sysctl_register(net);
+	res = inet_frags_init_net(&ieee802154_lowpan->frags);
+	if (res < 0)
+		return res;
+	res = lowpan_frags_ns_sysctl_register(net);
+	if (res < 0)
+		inet_frags_exit_net(&ieee802154_lowpan->frags, &lowpan_frags);
+	return res;
 }
 
 static void __net_exit lowpan_frags_exit_net(struct net *net)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 4cb1befc3949..916def2e2afe 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -850,6 +850,8 @@ static void __init ip4_frags_ctl_register(void)
 
 static int __net_init ipv4_frags_init_net(struct net *net)
 {
+	int res;
+
 	/* Fragment cache limits.
 	 *
 	 * The fragment memory accounting code, (tries to) account for
@@ -875,9 +877,13 @@ static int __net_init ipv4_frags_init_net(struct net *net)
 
 	net->ipv4.frags.max_dist = 64;
 
-	inet_frags_init_net(&net->ipv4.frags);
-
-	return ip4_frags_ns_ctl_register(net);
+	res = inet_frags_init_net(&net->ipv4.frags);
+	if (res < 0)
+		return res;
+	res = ip4_frags_ns_ctl_register(net);
+	if (res < 0)
+		inet_frags_exit_net(&net->ipv4.frags, &ip4_frags);
+	return res;
 }
 
 static void __net_exit ipv4_frags_exit_net(struct net *net)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index ee33a6743f3b..afa9ea76155d 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -630,12 +630,18 @@ EXPORT_SYMBOL_GPL(nf_ct_frag6_gather);
 
 static int nf_ct_net_init(struct net *net)
 {
+	int res;
+
 	net->nf_frag.frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
 	net->nf_frag.frags.low_thresh = IPV6_FRAG_LOW_THRESH;
 	net->nf_frag.frags.timeout = IPV6_FRAG_TIMEOUT;
-	inet_frags_init_net(&net->nf_frag.frags);
-
-	return nf_ct_frag6_sysctl_register(net);
+	res = inet_frags_init_net(&net->nf_frag.frags);
+	if (res < 0)
+		return res;
+	res = nf_ct_frag6_sysctl_register(net);
+	if (res < 0)
+		inet_frags_exit_net(&net->nf_frag.frags, &nf_frags);
+	return res;
 }
 
 static void nf_ct_net_exit(struct net *net)
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 846012eae526..38bf38a9717f 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -714,13 +714,20 @@ static void ip6_frags_sysctl_unregister(void)
 
 static int __net_init ipv6_frags_init_net(struct net *net)
 {
+	int res;
+
 	net->ipv6.frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
 	net->ipv6.frags.low_thresh = IPV6_FRAG_LOW_THRESH;
 	net->ipv6.frags.timeout = IPV6_FRAG_TIMEOUT;
 
-	inet_frags_init_net(&net->ipv6.frags);
+	res = inet_frags_init_net(&net->ipv6.frags);
+	if (res < 0)
+		return res;
 
-	return ip6_frags_ns_sysctl_register(net);
+	res = ip6_frags_ns_sysctl_register(net);
+	if (res < 0)
+		inet_frags_exit_net(&net->ipv6.frags, &ip6_frags);
+	return res;
 }
 
 static void __net_exit ipv6_frags_exit_net(struct net *net)
-- 
2.18.0

* [PATCH v3 02/30] inet: frags: add a pointer to struct netns_frags
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

In order to simplify the API, add a pointer to struct inet_frags.
This will allow us to make things less complex.

These functions no longer have a struct inet_frags parameter:

inet_frag_destroy(struct inet_frag_queue *q  /*, struct inet_frags *f */)
inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
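
For illustration, a minimal sketch (a hypothetical helper, not code
from this patch) of how a function that only holds the queue can now
recover the per-protocol ops through the new back-pointer:

	/* Hypothetical sketch: netns_frags->f lets helpers reach the
	 * per-protocol inet_frags ops without an extra parameter.
	 */
	static void example_frag_free(struct inet_frag_queue *q)
	{
		struct netns_frags *nf = q->net;
		struct inet_frags *f = nf->f;	/* back-pointer added here */

		if (f->destructor)
			f->destructor(q);
		kmem_cache_free(f->frags_cachep, q);
	}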

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 093ba72914b696521e4885756a68a3332782c8de)
---
 include/net/inet_frag.h                 | 11 ++++++-----
 include/net/ipv6.h                      |  3 +--
 net/ieee802154/6lowpan/reassembly.c     | 13 +++++++------
 net/ipv4/inet_fragment.c                | 17 ++++++++++-------
 net/ipv4/ip_fragment.c                  |  9 +++++----
 net/ipv6/netfilter/nf_conntrack_reasm.c | 16 +++++++++-------
 net/ipv6/reassembly.c                   | 20 ++++++++++----------
 7 files changed, 48 insertions(+), 41 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 2ad894e446ac..fd338293a095 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -10,6 +10,7 @@ struct netns_frags {
 	int			high_thresh;
 	int			low_thresh;
 	int			max_dist;
+	struct inet_frags	*f;
 };
 
 /**
@@ -109,20 +110,20 @@ static inline int inet_frags_init_net(struct netns_frags *nf)
 	atomic_set(&nf->mem, 0);
 	return 0;
 }
-void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
+void inet_frags_exit_net(struct netns_frags *nf);
 
-void inet_frag_kill(struct inet_frag_queue *q, struct inet_frags *f);
-void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f);
+void inet_frag_kill(struct inet_frag_queue *q);
+void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash);
 
 void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
 				   const char *prefix);
 
-static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f)
+static inline void inet_frag_put(struct inet_frag_queue *q)
 {
 	if (refcount_dec_and_test(&q->refcnt))
-		inet_frag_destroy(q, f);
+		inet_frag_destroy(q);
 }
 
 static inline bool inet_frag_evicting(struct inet_frag_queue *q)
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index f280c61e019a..ff8407b19d05 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -560,8 +560,7 @@ struct frag_queue {
 	u8			ecn;
 };
 
-void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq,
-			   struct inet_frags *frags);
+void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq);
 
 static inline bool ipv6_addr_any(const struct in6_addr *a)
 {
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 9757ce6c077a..9ccb8458b5c3 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -93,10 +93,10 @@ static void lowpan_frag_expire(unsigned long data)
 	if (fq->q.flags & INET_FRAG_COMPLETE)
 		goto out;
 
-	inet_frag_kill(&fq->q, &lowpan_frags);
+	inet_frag_kill(&fq->q);
 out:
 	spin_unlock(&fq->q.lock);
-	inet_frag_put(&fq->q, &lowpan_frags);
+	inet_frag_put(&fq->q);
 }
 
 static inline struct lowpan_frag_queue *
@@ -229,7 +229,7 @@ static int lowpan_frag_reasm(struct lowpan_frag_queue *fq, struct sk_buff *prev,
 	struct sk_buff *fp, *head = fq->q.fragments;
 	int sum_truesize;
 
-	inet_frag_kill(&fq->q, &lowpan_frags);
+	inet_frag_kill(&fq->q);
 
 	/* Make the one we just received the head. */
 	if (prev) {
@@ -437,7 +437,7 @@ int lowpan_frag_rcv(struct sk_buff *skb, u8 frag_type)
 		ret = lowpan_frag_queue(fq, skb, frag_type);
 		spin_unlock(&fq->q.lock);
 
-		inet_frag_put(&fq->q, &lowpan_frags);
+		inet_frag_put(&fq->q);
 		return ret;
 	}
 
@@ -585,13 +585,14 @@ static int __net_init lowpan_frags_init_net(struct net *net)
 	ieee802154_lowpan->frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
 	ieee802154_lowpan->frags.low_thresh = IPV6_FRAG_LOW_THRESH;
 	ieee802154_lowpan->frags.timeout = IPV6_FRAG_TIMEOUT;
+	ieee802154_lowpan->frags.f = &lowpan_frags;
 
 	res = inet_frags_init_net(&ieee802154_lowpan->frags);
 	if (res < 0)
 		return res;
 	res = lowpan_frags_ns_sysctl_register(net);
 	if (res < 0)
-		inet_frags_exit_net(&ieee802154_lowpan->frags, &lowpan_frags);
+		inet_frags_exit_net(&ieee802154_lowpan->frags);
 	return res;
 }
 
@@ -601,7 +602,7 @@ static void __net_exit lowpan_frags_exit_net(struct net *net)
 		net_ieee802154_lowpan(net);
 
 	lowpan_frags_ns_sysctl_unregister(net);
-	inet_frags_exit_net(&ieee802154_lowpan->frags, &lowpan_frags);
+	inet_frags_exit_net(&ieee802154_lowpan->frags);
 }
 
 static struct pernet_operations lowpan_frags_ops = {
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index ba4454ecdf0f..4b44f973c37f 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -219,8 +219,9 @@ void inet_frags_fini(struct inet_frags *f)
 }
 EXPORT_SYMBOL(inet_frags_fini);
 
-void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f)
+void inet_frags_exit_net(struct netns_frags *nf)
 {
+	struct inet_frags *f =nf->f;
 	unsigned int seq;
 	int i;
 
@@ -264,33 +265,34 @@ __acquires(hb->chain_lock)
 	return hb;
 }
 
-static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
+static inline void fq_unlink(struct inet_frag_queue *fq)
 {
 	struct inet_frag_bucket *hb;
 
-	hb = get_frag_bucket_locked(fq, f);
+	hb = get_frag_bucket_locked(fq, fq->net->f);
 	hlist_del(&fq->list);
 	fq->flags |= INET_FRAG_COMPLETE;
 	spin_unlock(&hb->chain_lock);
 }
 
-void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
+void inet_frag_kill(struct inet_frag_queue *fq)
 {
 	if (del_timer(&fq->timer))
 		refcount_dec(&fq->refcnt);
 
 	if (!(fq->flags & INET_FRAG_COMPLETE)) {
-		fq_unlink(fq, f);
+		fq_unlink(fq);
 		refcount_dec(&fq->refcnt);
 	}
 }
 EXPORT_SYMBOL(inet_frag_kill);
 
-void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f)
+void inet_frag_destroy(struct inet_frag_queue *q)
 {
 	struct sk_buff *fp;
 	struct netns_frags *nf;
 	unsigned int sum, sum_truesize = 0;
+	struct inet_frags *f;
 
 	WARN_ON(!(q->flags & INET_FRAG_COMPLETE));
 	WARN_ON(del_timer(&q->timer) != 0);
@@ -298,6 +300,7 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f)
 	/* Release all fragment data. */
 	fp = q->fragments;
 	nf = q->net;
+	f = nf->f;
 	while (fp) {
 		struct sk_buff *xp = fp->next;
 
@@ -333,7 +336,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 			refcount_inc(&qp->refcnt);
 			spin_unlock(&hb->chain_lock);
 			qp_in->flags |= INET_FRAG_COMPLETE;
-			inet_frag_put(qp_in, f);
+			inet_frag_put(qp_in);
 			return qp;
 		}
 	}
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 916def2e2afe..c32718b00761 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -168,7 +168,7 @@ static void ip4_frag_free(struct inet_frag_queue *q)
 
 static void ipq_put(struct ipq *ipq)
 {
-	inet_frag_put(&ipq->q, &ip4_frags);
+	inet_frag_put(&ipq->q);
 }
 
 /* Kill ipq entry. It is not destroyed immediately,
@@ -176,7 +176,7 @@ static void ipq_put(struct ipq *ipq)
  */
 static void ipq_kill(struct ipq *ipq)
 {
-	inet_frag_kill(&ipq->q, &ip4_frags);
+	inet_frag_kill(&ipq->q);
 }
 
 static bool frag_expire_skip_icmp(u32 user)
@@ -876,20 +876,21 @@ static int __net_init ipv4_frags_init_net(struct net *net)
 	net->ipv4.frags.timeout = IP_FRAG_TIME;
 
 	net->ipv4.frags.max_dist = 64;
+	net->ipv4.frags.f = &ip4_frags;
 
 	res = inet_frags_init_net(&net->ipv4.frags);
 	if (res < 0)
 		return res;
 	res = ip4_frags_ns_ctl_register(net);
 	if (res < 0)
-		inet_frags_exit_net(&net->ipv4.frags, &ip4_frags);
+		inet_frags_exit_net(&net->ipv4.frags);
 	return res;
 }
 
 static void __net_exit ipv4_frags_exit_net(struct net *net)
 {
 	ip4_frags_ns_ctl_unregister(net);
-	inet_frags_exit_net(&net->ipv4.frags, &ip4_frags);
+	inet_frags_exit_net(&net->ipv4.frags);
 }
 
 static struct pernet_operations ip4_frags_ops = {
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index afa9ea76155d..7ea2b4490672 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -177,7 +177,7 @@ static void nf_ct_frag6_expire(unsigned long data)
 	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, nf_frag.frags);
 
-	ip6_expire_frag_queue(net, fq, &nf_frags);
+	ip6_expire_frag_queue(net, fq);
 }
 
 /* Creation primitives. */
@@ -263,7 +263,7 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
 			 * this case. -DaveM
 			 */
 			pr_debug("end of fragment not rounded to 8 bytes.\n");
-			inet_frag_kill(&fq->q, &nf_frags);
+			inet_frag_kill(&fq->q);
 			return -EPROTO;
 		}
 		if (end > fq->q.len) {
@@ -356,7 +356,7 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
 	return 0;
 
 discard_fq:
-	inet_frag_kill(&fq->q, &nf_frags);
+	inet_frag_kill(&fq->q);
 err:
 	return -EINVAL;
 }
@@ -378,7 +378,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
 	int    payload_len;
 	u8 ecn;
 
-	inet_frag_kill(&fq->q, &nf_frags);
+	inet_frag_kill(&fq->q);
 
 	WARN_ON(head == NULL);
 	WARN_ON(NFCT_FRAG6_CB(head)->offset != 0);
@@ -623,7 +623,7 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
 
 out_unlock:
 	spin_unlock_bh(&fq->q.lock);
-	inet_frag_put(&fq->q, &nf_frags);
+	inet_frag_put(&fq->q);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(nf_ct_frag6_gather);
@@ -635,19 +635,21 @@ static int nf_ct_net_init(struct net *net)
 	net->nf_frag.frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
 	net->nf_frag.frags.low_thresh = IPV6_FRAG_LOW_THRESH;
 	net->nf_frag.frags.timeout = IPV6_FRAG_TIMEOUT;
+	net->nf_frag.frags.f = &nf_frags;
+
 	res = inet_frags_init_net(&net->nf_frag.frags);
 	if (res < 0)
 		return res;
 	res = nf_ct_frag6_sysctl_register(net);
 	if (res < 0)
-		inet_frags_exit_net(&net->nf_frag.frags, &nf_frags);
+		inet_frags_exit_net(&net->nf_frag.frags);
 	return res;
 }
 
 static void nf_ct_net_exit(struct net *net)
 {
 	nf_ct_frags6_sysctl_unregister(net);
-	inet_frags_exit_net(&net->nf_frag.frags, &nf_frags);
+	inet_frags_exit_net(&net->nf_frag.frags);
 }
 
 static struct pernet_operations nf_ct_net_ops = {
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 38bf38a9717f..26f737c3fc7b 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -128,8 +128,7 @@ void ip6_frag_init(struct inet_frag_queue *q, const void *a)
 }
 EXPORT_SYMBOL(ip6_frag_init);
 
-void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq,
-			   struct inet_frags *frags)
+void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 {
 	struct net_device *dev = NULL;
 
@@ -138,7 +137,7 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq,
 	if (fq->q.flags & INET_FRAG_COMPLETE)
 		goto out;
 
-	inet_frag_kill(&fq->q, frags);
+	inet_frag_kill(&fq->q);
 
 	rcu_read_lock();
 	dev = dev_get_by_index_rcu(net, fq->iif);
@@ -166,7 +165,7 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq,
 	rcu_read_unlock();
 out:
 	spin_unlock(&fq->q.lock);
-	inet_frag_put(&fq->q, frags);
+	inet_frag_put(&fq->q);
 }
 EXPORT_SYMBOL(ip6_expire_frag_queue);
 
@@ -178,7 +177,7 @@ static void ip6_frag_expire(unsigned long data)
 	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, ipv6.frags);
 
-	ip6_expire_frag_queue(net, fq, &ip6_frags);
+	ip6_expire_frag_queue(net, fq);
 }
 
 static struct frag_queue *
@@ -363,7 +362,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 	return -1;
 
 discard_fq:
-	inet_frag_kill(&fq->q, &ip6_frags);
+	inet_frag_kill(&fq->q);
 err:
 	__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
 			IPSTATS_MIB_REASMFAILS);
@@ -390,7 +389,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 	int sum_truesize;
 	u8 ecn;
 
-	inet_frag_kill(&fq->q, &ip6_frags);
+	inet_frag_kill(&fq->q);
 
 	ecn = ip_frag_ecn_table[fq->ecn];
 	if (unlikely(ecn == 0xff))
@@ -568,7 +567,7 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
 		ret = ip6_frag_queue(fq, skb, fhdr, IP6CB(skb)->nhoff);
 
 		spin_unlock(&fq->q.lock);
-		inet_frag_put(&fq->q, &ip6_frags);
+		inet_frag_put(&fq->q);
 		return ret;
 	}
 
@@ -719,6 +718,7 @@ static int __net_init ipv6_frags_init_net(struct net *net)
 	net->ipv6.frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
 	net->ipv6.frags.low_thresh = IPV6_FRAG_LOW_THRESH;
 	net->ipv6.frags.timeout = IPV6_FRAG_TIMEOUT;
+	net->ipv6.frags.f = &ip6_frags;
 
 	res = inet_frags_init_net(&net->ipv6.frags);
 	if (res < 0)
@@ -726,14 +726,14 @@ static int __net_init ipv6_frags_init_net(struct net *net)
 
 	res = ip6_frags_ns_sysctl_register(net);
 	if (res < 0)
-		inet_frags_exit_net(&net->ipv6.frags, &ip6_frags);
+		inet_frags_exit_net(&net->ipv6.frags);
 	return res;
 }
 
 static void __net_exit ipv6_frags_exit_net(struct net *net)
 {
 	ip6_frags_ns_sysctl_unregister(net);
-	inet_frags_exit_net(&net->ipv6.frags, &ip6_frags);
+	inet_frags_exit_net(&net->ipv6.frags);
 }
 
 static struct pernet_operations ip6_frags_ops = {
-- 
2.18.0

* [PATCH v3 03/30] inet: frags: refactor ipfrag_init()
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

We need to call inet_frags_init() before register_pernet_subsys(),
as a prereq for the following patch ("inet: frags: use rhashtables for
reassembly units"): registering the pernet ops runs the per-namespace
init immediately, and that init will soon depend on state set up by
inet_frags_init().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 483a6e4fa055123142d8956866fe2aa9c98d546d)
---
 net/ipv4/ip_fragment.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index c32718b00761..9d0b08c8ee00 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -900,8 +900,6 @@ static struct pernet_operations ip4_frags_ops = {
 
 void __init ipfrag_init(void)
 {
-	ip4_frags_ctl_register();
-	register_pernet_subsys(&ip4_frags_ops);
 	ip4_frags.hashfn = ip4_hashfn;
 	ip4_frags.constructor = ip4_frag_init;
 	ip4_frags.destructor = ip4_frag_free;
@@ -911,4 +909,6 @@ void __init ipfrag_init(void)
 	ip4_frags.frags_cache_name = ip_frag_cache_name;
 	if (inet_frags_init(&ip4_frags))
 		panic("IP: failed to allocate ip4_frags cache\n");
+	ip4_frags_ctl_register();
+	register_pernet_subsys(&ip4_frags_ops);
 }
-- 
2.18.0

* [PATCH v3 04/30] inet: frags: Convert timers to use timer_setup()
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh
  Cc: netdev, stable, edumazet, Kees Cook, Alexander Aring,
	Stefan Schmidt, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	linux-wpan, netfilter-devel, coreteam

From: Kees Cook <keescook@chromium.org>

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
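
For reference, the conversion idiom in miniature (the struct here is
illustrative, not from this patch):

	/* Illustrative timer_setup()/from_timer() usage: the callback
	 * receives the timer_list pointer and uses from_timer(), a
	 * container_of() wrapper, to recover its enclosing object.
	 */
	struct example_queue {
		struct timer_list timer;
		int id;
	};

	static void example_expire(struct timer_list *t)
	{
		struct example_queue *q = from_timer(q, t, timer);

		pr_debug("queue %d expired\n", q->id);
	}

	static void example_start(struct example_queue *q)
	{
		timer_setup(&q->timer, example_expire, 0);
		mod_timer(&q->timer, jiffies + HZ);
	}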

Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Cc: linux-wpan@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Stefan Schmidt <stefan@osg.samsung.com> # for ieee802154
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 78802011fbe34331bdef6f2dfb1634011f0e4c32)
---
 include/net/inet_frag.h                 | 2 +-
 net/ieee802154/6lowpan/reassembly.c     | 5 +++--
 net/ipv4/inet_fragment.c                | 4 ++--
 net/ipv4/ip_fragment.c                  | 5 +++--
 net/ipv6/netfilter/nf_conntrack_reasm.c | 5 +++--
 net/ipv6/reassembly.c                   | 5 +++--
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index fd338293a095..69e531ed8189 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -97,7 +97,7 @@ struct inet_frags {
 	void			(*constructor)(struct inet_frag_queue *q,
 					       const void *arg);
 	void			(*destructor)(struct inet_frag_queue *);
-	void			(*frag_expire)(unsigned long data);
+	void			(*frag_expire)(struct timer_list *t);
 	struct kmem_cache	*frags_cachep;
 	const char		*frags_cache_name;
 };
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 9ccb8458b5c3..6badc055555b 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -80,12 +80,13 @@ static void lowpan_frag_init(struct inet_frag_queue *q, const void *a)
 	fq->daddr = *arg->dst;
 }
 
-static void lowpan_frag_expire(unsigned long data)
+static void lowpan_frag_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct frag_queue *fq;
 	struct net *net;
 
-	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+	fq = container_of(frag, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, ieee802154_lowpan.frags);
 
 	spin_lock(&fq->q.lock);
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 4b44f973c37f..97e747b1e9a0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -150,7 +150,7 @@ inet_evict_bucket(struct inet_frags *f, struct inet_frag_bucket *hb)
 	spin_unlock(&hb->chain_lock);
 
 	hlist_for_each_entry_safe(fq, n, &expired, list_evictor)
-		f->frag_expire((unsigned long) fq);
+		f->frag_expire(&fq->timer);
 
 	return evicted;
 }
@@ -367,7 +367,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 	f->constructor(q, arg);
 	add_frag_mem_limit(nf, f->qsize);
 
-	setup_timer(&q->timer, f->frag_expire, (unsigned long)q);
+	timer_setup(&q->timer, f->frag_expire, 0);
 	spin_lock_init(&q->lock);
 	refcount_set(&q->refcnt, 1);
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9d0b08c8ee00..5171c8cc0eb6 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -191,12 +191,13 @@ static bool frag_expire_skip_icmp(u32 user)
 /*
  * Oops, a fragment queue timed out.  Kill it and send an ICMP reply.
  */
-static void ip_expire(unsigned long arg)
+static void ip_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct ipq *qp;
 	struct net *net;
 
-	qp = container_of((struct inet_frag_queue *) arg, struct ipq, q);
+	qp = container_of(frag, struct ipq, q);
 	net = container_of(qp->q.net, struct net, ipv4.frags);
 
 	rcu_read_lock();
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 7ea2b4490672..bc776ef392ea 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -169,12 +169,13 @@ static unsigned int nf_hashfn(const struct inet_frag_queue *q)
 	return nf_hash_frag(nq->id, &nq->saddr, &nq->daddr);
 }
 
-static void nf_ct_frag6_expire(unsigned long data)
+static void nf_ct_frag6_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct frag_queue *fq;
 	struct net *net;
 
-	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+	fq = container_of(frag, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, nf_frag.frags);
 
 	ip6_expire_frag_queue(net, fq);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 26f737c3fc7b..b85ef051b75c 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -169,12 +169,13 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 }
 EXPORT_SYMBOL(ip6_expire_frag_queue);
 
-static void ip6_frag_expire(unsigned long data)
+static void ip6_frag_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct frag_queue *fq;
 	struct net *net;
 
-	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+	fq = container_of(frag, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, ipv6.frags);
 
 	ip6_expire_frag_queue(net, fq);
-- 
2.18.0

* [PATCH v3 05/30] inet: frags: refactor ipv6_frag_init()
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

We want to call inet_frags_init() earlier.

This is a prereq to "inet: frags: use rhashtables for reassembly units".

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5b975bab23615cd0fdf67af6c9298eb01c4b9f61)
---
 net/ipv6/reassembly.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index b85ef051b75c..42b6b2ba447a 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -746,10 +746,21 @@ int __init ipv6_frag_init(void)
 {
 	int ret;
 
-	ret = inet6_add_protocol(&frag_protocol, IPPROTO_FRAGMENT);
+	ip6_frags.hashfn = ip6_hashfn;
+	ip6_frags.constructor = ip6_frag_init;
+	ip6_frags.destructor = NULL;
+	ip6_frags.qsize = sizeof(struct frag_queue);
+	ip6_frags.match = ip6_frag_match;
+	ip6_frags.frag_expire = ip6_frag_expire;
+	ip6_frags.frags_cache_name = ip6_frag_cache_name;
+	ret = inet_frags_init(&ip6_frags);
 	if (ret)
 		goto out;
 
+	ret = inet6_add_protocol(&frag_protocol, IPPROTO_FRAGMENT);
+	if (ret)
+		goto err_protocol;
+
 	ret = ip6_frags_sysctl_register();
 	if (ret)
 		goto err_sysctl;
@@ -758,16 +769,6 @@ int __init ipv6_frag_init(void)
 	if (ret)
 		goto err_pernet;
 
-	ip6_frags.hashfn = ip6_hashfn;
-	ip6_frags.constructor = ip6_frag_init;
-	ip6_frags.destructor = NULL;
-	ip6_frags.qsize = sizeof(struct frag_queue);
-	ip6_frags.match = ip6_frag_match;
-	ip6_frags.frag_expire = ip6_frag_expire;
-	ip6_frags.frags_cache_name = ip6_frag_cache_name;
-	ret = inet_frags_init(&ip6_frags);
-	if (ret)
-		goto err_pernet;
 out:
 	return ret;
 
@@ -775,6 +776,8 @@ int __init ipv6_frag_init(void)
 	ip6_frags_sysctl_unregister();
 err_sysctl:
 	inet6_del_protocol(&frag_protocol, IPPROTO_FRAGMENT);
+err_protocol:
+	inet_frags_fini(&ip6_frags);
 	goto out;
 }
 
-- 
2.18.0

* [PATCH v3 06/30] inet: frags: refactor lowpan_net_frag_init()
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

We want to call lowpan_net_frag_init() earlier, similar to commit
"inet: frags: refactor ipv6_frag_init()".

This is a prereq to "inet: frags: use rhashtables for reassembly units".

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 807f1844df4ac23594268fa9f41902d0549e92aa)
---
 net/ieee802154/6lowpan/reassembly.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 6badc055555b..ddada12a044d 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -615,14 +615,6 @@ int __init lowpan_net_frag_init(void)
 {
 	int ret;
 
-	ret = lowpan_frags_sysctl_register();
-	if (ret)
-		return ret;
-
-	ret = register_pernet_subsys(&lowpan_frags_ops);
-	if (ret)
-		goto err_pernet;
-
 	lowpan_frags.hashfn = lowpan_hashfn;
 	lowpan_frags.constructor = lowpan_frag_init;
 	lowpan_frags.destructor = NULL;
@@ -632,11 +624,21 @@ int __init lowpan_net_frag_init(void)
 	lowpan_frags.frags_cache_name = lowpan_frags_cache_name;
 	ret = inet_frags_init(&lowpan_frags);
 	if (ret)
-		goto err_pernet;
+		goto out;
 
+	ret = lowpan_frags_sysctl_register();
+	if (ret)
+		goto err_sysctl;
+
+	ret = register_pernet_subsys(&lowpan_frags_ops);
+	if (ret)
+		goto err_pernet;
+out:
 	return ret;
 err_pernet:
 	lowpan_frags_sysctl_unregister();
+err_sysctl:
+	inet_frags_fini(&lowpan_frags);
 	return ret;
 }
 
-- 
2.18.0

* [PATCH v3 07/30] ipv6: export ip6 fragments sysctl to unprivileged users
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Nikolay Borisov

From: Eric Dumazet <edumazet@google.com>

IPv4 was changed in commit 52a773d645e9 ("net: Export ip fragment
sysctl to unprivileged users").

The only sysctl that is not per-netns, ip6frag_secret_interval, is
not used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 18dcbe12fe9fca0ab825f7eff993060525ac2503)
---
 net/ipv6/reassembly.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 42b6b2ba447a..f0071b113a92 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -649,10 +649,6 @@ static int __net_init ip6_frags_ns_sysctl_register(struct net *net)
 		table[1].data = &net->ipv6.frags.low_thresh;
 		table[1].extra2 = &net->ipv6.frags.high_thresh;
 		table[2].data = &net->ipv6.frags.timeout;
-
-		/* Don't export sysctls to unprivileged users */
-		if (net->user_ns != &init_user_ns)
-			table[0].procname = NULL;
 	}
 
 	hdr = register_net_sysctl(net, "net/ipv6", table);
-- 
2.18.0

* [PATCH v3 08/30] rhashtable: add schedule points
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

Rehashing and destroying a large hash table takes a lot of time and
happens in process context, so it is safe to add cond_resched() to
rhashtable_rehash_table() and rhashtable_free_and_destroy().
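
The idiom, sketched on a hypothetical table walk (not code from this
patch):

	/* Hypothetical long process-context loop: cond_resched() yields
	 * the CPU periodically, avoiding soft-lockup reports when the
	 * table is large.
	 */
	static void example_walk_table(struct example_table *tbl)
	{
		unsigned int i;

		for (i = 0; i < tbl->nbuckets; i++) {
			example_walk_bucket(tbl, i);
			cond_resched();
		}
	}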

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ae6da1f503abb5a5081f9f6c4a6881de97830f3e)
---
 lib/rhashtable.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 39215c724fc7..cebbcec877d7 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -364,6 +364,7 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
 		err = rhashtable_rehash_chain(ht, old_hash);
 		if (err)
 			return err;
+		cond_resched();
 	}
 
 	/* Publish the new table pointer. */
@@ -1073,6 +1074,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
 		for (i = 0; i < tbl->size; i++) {
 			struct rhash_head *pos, *next;
 
+			cond_resched();
 			for (pos = rht_dereference(*rht_bucket(tbl, i), ht),
 			     next = !rht_is_a_nulls(pos) ?
 					rht_dereference(pos->next, ht) : NULL;
-- 
2.18.0

* [PATCH v3 09/30] inet: frags: use rhashtables for reassembly units
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh
  Cc: netdev, stable, edumazet, Kirill Tkhai, Herbert Xu,
	Florian Westphal, Jesper Dangaard Brouer, Alexander Aring,
	Stefan Schmidt

From: Eric Dumazet <edumazet@google.com>

Some applications still rely on IP fragmentation, and to be fair the
Linux reassembly unit does not hold up under any serious load.

It uses static hash tables of 1024 buckets, with up to 128 items per
bucket (!!!).

A work queue is supposed to garbage-collect items when the host is
under memory pressure, and to do a hash rebuild, changing the seed used
in hash computations.

This work queue blocks softirqs for up to 25 ms when doing a hash
rebuild, occurring every 5 seconds if the host is under fire.

Then there is the problem of sharing this hash table across all netns.

It is time to switch to rhashtables, and allocate one of them per netns
to speed up netns dismantle, since this is a critical metric these days.

Lookup is now done under RCU. A followup patch will even remove the
refcount hold/release left over from the prior implementation and save
a couple of atomic operations.

Before this patch, 16 CPUs (a 16-RX-queue NIC) could not handle more
than a 1 Mpps frags DDoS.

After the patch, I reach 9 Mpps without any tuning, and can use up to
2GB of storage for the fragments (the exact number depends on frags
being evicted after timeout).

$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608

A followup patch will change the limits for 64bit arches.
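
As an illustration of the new lookup path, a simplified sketch of the
RCU lookup against the per-netns rhashtable, keyed as in this patch
(queue creation on a miss and full error handling are omitted; see
inet_frag_find() in the diff for the real code):

	/* Simplified RCU lookup: find a queue by key, taking a reference
	 * only if the queue is not already being torn down.
	 */
	static struct inet_frag_queue *example_find(struct netns_frags *nf,
						    const void *key)
	{
		struct inet_frag_queue *fq;

		rcu_read_lock();
		fq = rhashtable_lookup(&nf->rhashtable, key,
				       nf->f->rhash_params);
		if (fq && !refcount_inc_not_zero(&fq->refcnt))
			fq = NULL;
		rcu_read_unlock();
		return fq;
	}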

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
---
 Documentation/networking/ip-sysctl.txt  |   7 +-
 include/net/inet_frag.h                 |  81 +++---
 include/net/ipv6.h                      |  16 +-
 net/ieee802154/6lowpan/6lowpan_i.h      |  26 +-
 net/ieee802154/6lowpan/reassembly.c     |  91 +++----
 net/ipv4/inet_fragment.c                | 346 +++++-------------------
 net/ipv4/ip_fragment.c                  | 112 ++++----
 net/ipv6/netfilter/nf_conntrack_reasm.c |  51 +---
 net/ipv6/reassembly.c                   | 110 ++++----
 9 files changed, 266 insertions(+), 574 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d499676890d8..f23582a3c661 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -134,13 +134,10 @@ min_adv_mss - INTEGER
 IP Fragmentation:
 
 ipfrag_high_thresh - INTEGER
-	Maximum memory used to reassemble IP fragments. When
-	ipfrag_high_thresh bytes of memory is allocated for this purpose,
-	the fragment handler will toss packets until ipfrag_low_thresh
-	is reached. This also serves as a maximum limit to namespaces
-	different from the initial one.
+	Maximum memory used to reassemble IP fragments.
 
 ipfrag_low_thresh - INTEGER
+	(Obsolete since linux-4.17)
 	Maximum memory used to reassemble IP fragments before the kernel
 	begins to remove incomplete fragment queues to free up resources.
 	The kernel still accepts new fragments for defragmentation.
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 69e531ed8189..3fec0d3a0d01 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -2,7 +2,11 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
+#include <linux/rhashtable.h>
+
 struct netns_frags {
+	struct rhashtable       rhashtable ____cacheline_aligned_in_smp;
+
 	/* Keep atomic mem on separate cachelines in structs that include it */
 	atomic_t		mem ____cacheline_aligned_in_smp;
 	/* sysctls */
@@ -26,12 +30,30 @@ enum {
 	INET_FRAG_COMPLETE	= BIT(2),
 };
 
+struct frag_v4_compare_key {
+	__be32		saddr;
+	__be32		daddr;
+	u32		user;
+	u32		vif;
+	__be16		id;
+	u16		protocol;
+};
+
+struct frag_v6_compare_key {
+	struct in6_addr	saddr;
+	struct in6_addr	daddr;
+	u32		user;
+	__be32		id;
+	u32		iif;
+};
+
 /**
  * struct inet_frag_queue - fragment queue
  *
- * @lock: spinlock protecting the queue
+ * @node: rhash node
+ * @key: keys identifying this frag.
  * @timer: queue expiration timer
- * @list: hash bucket list
+ * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
  * @fragments_tail: received fragments tail
@@ -41,12 +63,16 @@ enum {
  * @flags: fragment queue flags
  * @max_size: maximum received fragment size
  * @net: namespace that this frag belongs to
- * @list_evictor: list of queues to forcefully evict (e.g. due to low memory)
+ * @rcu: rcu head for freeing deferall
  */
 struct inet_frag_queue {
-	spinlock_t		lock;
+	struct rhash_head	node;
+	union {
+		struct frag_v4_compare_key v4;
+		struct frag_v6_compare_key v6;
+	} key;
 	struct timer_list	timer;
-	struct hlist_node	list;
+	spinlock_t		lock;
 	refcount_t		refcnt;
 	struct sk_buff		*fragments;
 	struct sk_buff		*fragments_tail;
@@ -55,51 +81,20 @@ struct inet_frag_queue {
 	int			meat;
 	__u8			flags;
 	u16			max_size;
-	struct netns_frags	*net;
-	struct hlist_node	list_evictor;
-};
-
-#define INETFRAGS_HASHSZ	1024
-
-/* averaged:
- * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
- *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
- *	       struct frag_queue))
- */
-#define INETFRAGS_MAXDEPTH	128
-
-struct inet_frag_bucket {
-	struct hlist_head	chain;
-	spinlock_t		chain_lock;
+	struct netns_frags      *net;
+	struct rcu_head		rcu;
 };
 
 struct inet_frags {
-	struct inet_frag_bucket	hash[INETFRAGS_HASHSZ];
-
-	struct work_struct	frags_work;
-	unsigned int next_bucket;
-	unsigned long last_rebuild_jiffies;
-	bool rebuild;
-
-	/* The first call to hashfn is responsible to initialize
-	 * rnd. This is best done with net_get_random_once.
-	 *
-	 * rnd_seqlock is used to let hash insertion detect
-	 * when it needs to re-lookup the hash chain to use.
-	 */
-	u32			rnd;
-	seqlock_t		rnd_seqlock;
 	unsigned int		qsize;
 
-	unsigned int		(*hashfn)(const struct inet_frag_queue *);
-	bool			(*match)(const struct inet_frag_queue *q,
-					 const void *arg);
 	void			(*constructor)(struct inet_frag_queue *q,
 					       const void *arg);
 	void			(*destructor)(struct inet_frag_queue *);
 	void			(*frag_expire)(struct timer_list *t);
 	struct kmem_cache	*frags_cachep;
 	const char		*frags_cache_name;
+	struct rhashtable_params rhash_params;
 };
 
 int inet_frags_init(struct inet_frags *);
@@ -108,15 +103,13 @@ void inet_frags_fini(struct inet_frags *);
 static inline int inet_frags_init_net(struct netns_frags *nf)
 {
 	atomic_set(&nf->mem, 0);
-	return 0;
+	return rhashtable_init(&nf->rhashtable, &nf->f->rhash_params);
 }
 void inet_frags_exit_net(struct netns_frags *nf);
 
 void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
-struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
-		struct inet_frags *f, void *key, unsigned int hash);
-
+struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
 				   const char *prefix);
 
@@ -128,7 +121,7 @@ static inline void inet_frag_put(struct inet_frag_queue *q)
 
 static inline bool inet_frag_evicting(struct inet_frag_queue *q)
 {
-	return !hlist_unhashed(&q->list_evictor);
+	return false;
 }
 
 /* Memory Tracking Functions. */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ff8407b19d05..a271611d341d 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -531,17 +531,8 @@ enum ip6_defrag_users {
 	__IP6_DEFRAG_CONNTRACK_BRIDGE_IN = IP6_DEFRAG_CONNTRACK_BRIDGE_IN + USHRT_MAX,
 };
 
-struct ip6_create_arg {
-	__be32 id;
-	u32 user;
-	const struct in6_addr *src;
-	const struct in6_addr *dst;
-	int iif;
-	u8 ecn;
-};
-
 void ip6_frag_init(struct inet_frag_queue *q, const void *a);
-bool ip6_frag_match(const struct inet_frag_queue *q, const void *a);
+extern const struct rhashtable_params ip6_rhash_params;
 
 /*
  *	Equivalent of ipv4 struct ip
@@ -549,11 +540,6 @@ bool ip6_frag_match(const struct inet_frag_queue *q, const void *a);
 struct frag_queue {
 	struct inet_frag_queue	q;
 
-	__be32			id;		/* fragment id		*/
-	u32			user;
-	struct in6_addr		saddr;
-	struct in6_addr		daddr;
-
 	int			iif;
 	unsigned int		csum;
 	__u16			nhoffset;
diff --git a/net/ieee802154/6lowpan/6lowpan_i.h b/net/ieee802154/6lowpan/6lowpan_i.h
index d8de3bcfb103..b8d95cb71c25 100644
--- a/net/ieee802154/6lowpan/6lowpan_i.h
+++ b/net/ieee802154/6lowpan/6lowpan_i.h
@@ -17,37 +17,19 @@ typedef unsigned __bitwise lowpan_rx_result;
 #define LOWPAN_DISPATCH_FRAG1           0xc0
 #define LOWPAN_DISPATCH_FRAGN           0xe0
 
-struct lowpan_create_arg {
+struct frag_lowpan_compare_key {
 	u16 tag;
 	u16 d_size;
-	const struct ieee802154_addr *src;
-	const struct ieee802154_addr *dst;
+	const struct ieee802154_addr src;
+	const struct ieee802154_addr dst;
 };
 
-/* Equivalent of ipv4 struct ip
+/* Equivalent of ipv4 struct ipq
  */
 struct lowpan_frag_queue {
 	struct inet_frag_queue	q;
-
-	u16			tag;
-	u16			d_size;
-	struct ieee802154_addr	saddr;
-	struct ieee802154_addr	daddr;
 };
 
-static inline u32 ieee802154_addr_hash(const struct ieee802154_addr *a)
-{
-	switch (a->mode) {
-	case IEEE802154_ADDR_LONG:
-		return (((__force u64)a->extended_addr) >> 32) ^
-			(((__force u64)a->extended_addr) & 0xffffffff);
-	case IEEE802154_ADDR_SHORT:
-		return (__force u32)(a->short_addr + (a->pan_id << 16));
-	default:
-		return 0;
-	}
-}
-
 int lowpan_frag_rcv(struct sk_buff *skb, const u8 frag_type);
 void lowpan_net_frag_exit(void);
 int lowpan_net_frag_init(void);
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index ddada12a044d..0fa0121f85d4 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -37,47 +37,15 @@ static struct inet_frags lowpan_frags;
 static int lowpan_frag_reasm(struct lowpan_frag_queue *fq,
 			     struct sk_buff *prev, struct net_device *ldev);
 
-static unsigned int lowpan_hash_frag(u16 tag, u16 d_size,
-				     const struct ieee802154_addr *saddr,
-				     const struct ieee802154_addr *daddr)
-{
-	net_get_random_once(&lowpan_frags.rnd, sizeof(lowpan_frags.rnd));
-	return jhash_3words(ieee802154_addr_hash(saddr),
-			    ieee802154_addr_hash(daddr),
-			    (__force u32)(tag + (d_size << 16)),
-			    lowpan_frags.rnd);
-}
-
-static unsigned int lowpan_hashfn(const struct inet_frag_queue *q)
-{
-	const struct lowpan_frag_queue *fq;
-
-	fq = container_of(q, struct lowpan_frag_queue, q);
-	return lowpan_hash_frag(fq->tag, fq->d_size, &fq->saddr, &fq->daddr);
-}
-
-static bool lowpan_frag_match(const struct inet_frag_queue *q, const void *a)
-{
-	const struct lowpan_frag_queue *fq;
-	const struct lowpan_create_arg *arg = a;
-
-	fq = container_of(q, struct lowpan_frag_queue, q);
-	return	fq->tag == arg->tag && fq->d_size == arg->d_size &&
-		ieee802154_addr_equal(&fq->saddr, arg->src) &&
-		ieee802154_addr_equal(&fq->daddr, arg->dst);
-}
-
 static void lowpan_frag_init(struct inet_frag_queue *q, const void *a)
 {
-	const struct lowpan_create_arg *arg = a;
+	const struct frag_lowpan_compare_key *key = a;
 	struct lowpan_frag_queue *fq;
 
 	fq = container_of(q, struct lowpan_frag_queue, q);
 
-	fq->tag = arg->tag;
-	fq->d_size = arg->d_size;
-	fq->saddr = *arg->src;
-	fq->daddr = *arg->dst;
+	BUILD_BUG_ON(sizeof(*key) > sizeof(q->key));
+	memcpy(&q->key, key, sizeof(*key));
 }
 
 static void lowpan_frag_expire(struct timer_list *t)
@@ -105,21 +73,17 @@ fq_find(struct net *net, const struct lowpan_802154_cb *cb,
 	const struct ieee802154_addr *src,
 	const struct ieee802154_addr *dst)
 {
-	struct inet_frag_queue *q;
-	struct lowpan_create_arg arg;
-	unsigned int hash;
 	struct netns_ieee802154_lowpan *ieee802154_lowpan =
 		net_ieee802154_lowpan(net);
+	struct frag_lowpan_compare_key key = {
+		.tag = cb->d_tag,
+		.d_size = cb->d_size,
+		.src = *src,
+		.dst = *dst,
+	};
+	struct inet_frag_queue *q;
 
-	arg.tag = cb->d_tag;
-	arg.d_size = cb->d_size;
-	arg.src = src;
-	arg.dst = dst;
-
-	hash = lowpan_hash_frag(cb->d_tag, cb->d_size, src, dst);
-
-	q = inet_frag_find(&ieee802154_lowpan->frags,
-			   &lowpan_frags, &arg, hash);
+	q = inet_frag_find(&ieee802154_lowpan->frags, &key);
 	if (IS_ERR_OR_NULL(q)) {
 		inet_frag_maybe_warn_overflow(q, pr_fmt());
 		return NULL;
@@ -611,17 +575,46 @@ static struct pernet_operations lowpan_frags_ops = {
 	.exit = lowpan_frags_exit_net,
 };
 
+static u32 lowpan_key_hashfn(const void *data, u32 len, u32 seed)
+{
+	return jhash2(data,
+		      sizeof(struct frag_lowpan_compare_key) / sizeof(u32), seed);
+}
+
+static u32 lowpan_obj_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct inet_frag_queue *fq = data;
+
+	return jhash2((const u32 *)&fq->key,
+		      sizeof(struct frag_lowpan_compare_key) / sizeof(u32), seed);
+}
+
+static int lowpan_obj_cmpfn(struct rhashtable_compare_arg *arg, const void *ptr)
+{
+	const struct frag_lowpan_compare_key *key = arg->key;
+	const struct inet_frag_queue *fq = ptr;
+
+	return !!memcmp(&fq->key, key, sizeof(*key));
+}
+
+static const struct rhashtable_params lowpan_rhash_params = {
+	.head_offset		= offsetof(struct inet_frag_queue, node),
+	.hashfn			= lowpan_key_hashfn,
+	.obj_hashfn		= lowpan_obj_hashfn,
+	.obj_cmpfn		= lowpan_obj_cmpfn,
+	.automatic_shrinking	= true,
+};
+
 int __init lowpan_net_frag_init(void)
 {
 	int ret;
 
-	lowpan_frags.hashfn = lowpan_hashfn;
 	lowpan_frags.constructor = lowpan_frag_init;
 	lowpan_frags.destructor = NULL;
 	lowpan_frags.qsize = sizeof(struct frag_queue);
-	lowpan_frags.match = lowpan_frag_match;
 	lowpan_frags.frag_expire = lowpan_frag_expire;
 	lowpan_frags.frags_cache_name = lowpan_frags_cache_name;
+	lowpan_frags.rhash_params = lowpan_rhash_params;
 	ret = inet_frags_init(&lowpan_frags);
 	if (ret)
 		goto out;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 97e747b1e9a0..ebb8f411e0db 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -25,12 +25,6 @@
 #include <net/inet_frag.h>
 #include <net/inet_ecn.h>
 
-#define INETFRAGS_EVICT_BUCKETS   128
-#define INETFRAGS_EVICT_MAX	  512
-
-/* don't rebuild inetfrag table with new secret more often than this */
-#define INETFRAGS_MIN_REBUILD_INTERVAL (5 * HZ)
-
 /* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
  * Value : 0xff if frame should be dropped.
  *         0 or INET_ECN_CE value, to be ORed in to final iph->tos field
@@ -52,157 +46,8 @@ const u8 ip_frag_ecn_table[16] = {
 };
 EXPORT_SYMBOL(ip_frag_ecn_table);
 
-static unsigned int
-inet_frag_hashfn(const struct inet_frags *f, const struct inet_frag_queue *q)
-{
-	return f->hashfn(q) & (INETFRAGS_HASHSZ - 1);
-}
-
-static bool inet_frag_may_rebuild(struct inet_frags *f)
-{
-	return time_after(jiffies,
-	       f->last_rebuild_jiffies + INETFRAGS_MIN_REBUILD_INTERVAL);
-}
-
-static void inet_frag_secret_rebuild(struct inet_frags *f)
-{
-	int i;
-
-	write_seqlock_bh(&f->rnd_seqlock);
-
-	if (!inet_frag_may_rebuild(f))
-		goto out;
-
-	get_random_bytes(&f->rnd, sizeof(u32));
-
-	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
-		struct inet_frag_bucket *hb;
-		struct inet_frag_queue *q;
-		struct hlist_node *n;
-
-		hb = &f->hash[i];
-		spin_lock(&hb->chain_lock);
-
-		hlist_for_each_entry_safe(q, n, &hb->chain, list) {
-			unsigned int hval = inet_frag_hashfn(f, q);
-
-			if (hval != i) {
-				struct inet_frag_bucket *hb_dest;
-
-				hlist_del(&q->list);
-
-				/* Relink to new hash chain. */
-				hb_dest = &f->hash[hval];
-
-				/* This is the only place where we take
-				 * another chain_lock while already holding
-				 * one.  As this will not run concurrently,
-				 * we cannot deadlock on hb_dest lock below, if its
-				 * already locked it will be released soon since
-				 * other caller cannot be waiting for hb lock
-				 * that we've taken above.
-				 */
-				spin_lock_nested(&hb_dest->chain_lock,
-						 SINGLE_DEPTH_NESTING);
-				hlist_add_head(&q->list, &hb_dest->chain);
-				spin_unlock(&hb_dest->chain_lock);
-			}
-		}
-		spin_unlock(&hb->chain_lock);
-	}
-
-	f->rebuild = false;
-	f->last_rebuild_jiffies = jiffies;
-out:
-	write_sequnlock_bh(&f->rnd_seqlock);
-}
-
-static bool inet_fragq_should_evict(const struct inet_frag_queue *q)
-{
-	if (!hlist_unhashed(&q->list_evictor))
-		return false;
-
-	return q->net->low_thresh == 0 ||
-	       frag_mem_limit(q->net) >= q->net->low_thresh;
-}
-
-static unsigned int
-inet_evict_bucket(struct inet_frags *f, struct inet_frag_bucket *hb)
-{
-	struct inet_frag_queue *fq;
-	struct hlist_node *n;
-	unsigned int evicted = 0;
-	HLIST_HEAD(expired);
-
-	spin_lock(&hb->chain_lock);
-
-	hlist_for_each_entry_safe(fq, n, &hb->chain, list) {
-		if (!inet_fragq_should_evict(fq))
-			continue;
-
-		if (!del_timer(&fq->timer))
-			continue;
-
-		hlist_add_head(&fq->list_evictor, &expired);
-		++evicted;
-	}
-
-	spin_unlock(&hb->chain_lock);
-
-	hlist_for_each_entry_safe(fq, n, &expired, list_evictor)
-		f->frag_expire(&fq->timer);
-
-	return evicted;
-}
-
-static void inet_frag_worker(struct work_struct *work)
-{
-	unsigned int budget = INETFRAGS_EVICT_BUCKETS;
-	unsigned int i, evicted = 0;
-	struct inet_frags *f;
-
-	f = container_of(work, struct inet_frags, frags_work);
-
-	BUILD_BUG_ON(INETFRAGS_EVICT_BUCKETS >= INETFRAGS_HASHSZ);
-
-	local_bh_disable();
-
-	for (i = ACCESS_ONCE(f->next_bucket); budget; --budget) {
-		evicted += inet_evict_bucket(f, &f->hash[i]);
-		i = (i + 1) & (INETFRAGS_HASHSZ - 1);
-		if (evicted > INETFRAGS_EVICT_MAX)
-			break;
-	}
-
-	f->next_bucket = i;
-
-	local_bh_enable();
-
-	if (f->rebuild && inet_frag_may_rebuild(f))
-		inet_frag_secret_rebuild(f);
-}
-
-static void inet_frag_schedule_worker(struct inet_frags *f)
-{
-	if (unlikely(!work_pending(&f->frags_work)))
-		schedule_work(&f->frags_work);
-}
-
 int inet_frags_init(struct inet_frags *f)
 {
-	int i;
-
-	INIT_WORK(&f->frags_work, inet_frag_worker);
-
-	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
-		struct inet_frag_bucket *hb = &f->hash[i];
-
-		spin_lock_init(&hb->chain_lock);
-		INIT_HLIST_HEAD(&hb->chain);
-	}
-
-	seqlock_init(&f->rnd_seqlock);
-	f->last_rebuild_jiffies = 0;
 	f->frags_cachep = kmem_cache_create(f->frags_cache_name, f->qsize, 0, 0,
 					    NULL);
 	if (!f->frags_cachep)
@@ -214,66 +59,42 @@ EXPORT_SYMBOL(inet_frags_init);
 
 void inet_frags_fini(struct inet_frags *f)
 {
-	cancel_work_sync(&f->frags_work);
+	/* We must wait that all inet_frag_destroy_rcu() have completed. */
+	rcu_barrier();
+
 	kmem_cache_destroy(f->frags_cachep);
+	f->frags_cachep = NULL;
 }
 EXPORT_SYMBOL(inet_frags_fini);
 
-void inet_frags_exit_net(struct netns_frags *nf)
-{
-	struct inet_frags *f =nf->f;
-	unsigned int seq;
-	int i;
-
-	nf->low_thresh = 0;
-
-evict_again:
-	local_bh_disable();
-	seq = read_seqbegin(&f->rnd_seqlock);
-
-	for (i = 0; i < INETFRAGS_HASHSZ ; i++)
-		inet_evict_bucket(f, &f->hash[i]);
-
-	local_bh_enable();
-	cond_resched();
-
-	if (read_seqretry(&f->rnd_seqlock, seq) ||
-	    sum_frag_mem_limit(nf))
-		goto evict_again;
-}
-EXPORT_SYMBOL(inet_frags_exit_net);
-
-static struct inet_frag_bucket *
-get_frag_bucket_locked(struct inet_frag_queue *fq, struct inet_frags *f)
-__acquires(hb->chain_lock)
+static void inet_frags_free_cb(void *ptr, void *arg)
 {
-	struct inet_frag_bucket *hb;
-	unsigned int seq, hash;
-
- restart:
-	seq = read_seqbegin(&f->rnd_seqlock);
+	struct inet_frag_queue *fq = ptr;
 
-	hash = inet_frag_hashfn(f, fq);
-	hb = &f->hash[hash];
+	/* If we can not cancel the timer, it means this frag_queue
+	 * is already disappearing, we have nothing to do.
+	 * Otherwise, we own a refcount until the end of this function.
+	 */
+	if (!del_timer(&fq->timer))
+		return;
 
-	spin_lock(&hb->chain_lock);
-	if (read_seqretry(&f->rnd_seqlock, seq)) {
-		spin_unlock(&hb->chain_lock);
-		goto restart;
+	spin_lock_bh(&fq->lock);
+	if (!(fq->flags & INET_FRAG_COMPLETE)) {
+		fq->flags |= INET_FRAG_COMPLETE;
+		refcount_dec(&fq->refcnt);
 	}
+	spin_unlock_bh(&fq->lock);
 
-	return hb;
+	inet_frag_put(fq);
 }
 
-static inline void fq_unlink(struct inet_frag_queue *fq)
+void inet_frags_exit_net(struct netns_frags *nf)
 {
-	struct inet_frag_bucket *hb;
+	nf->low_thresh = 0; /* prevent creation of new frags */
 
-	hb = get_frag_bucket_locked(fq, fq->net->f);
-	hlist_del(&fq->list);
-	fq->flags |= INET_FRAG_COMPLETE;
-	spin_unlock(&hb->chain_lock);
+	rhashtable_free_and_destroy(&nf->rhashtable, inet_frags_free_cb, NULL);
 }
+EXPORT_SYMBOL(inet_frags_exit_net);
 
 void inet_frag_kill(struct inet_frag_queue *fq)
 {
@@ -281,12 +102,26 @@ void inet_frag_kill(struct inet_frag_queue *fq)
 		refcount_dec(&fq->refcnt);
 
 	if (!(fq->flags & INET_FRAG_COMPLETE)) {
-		fq_unlink(fq);
+		struct netns_frags *nf = fq->net;
+
+		fq->flags |= INET_FRAG_COMPLETE;
+		rhashtable_remove_fast(&nf->rhashtable, &fq->node, nf->f->rhash_params);
 		refcount_dec(&fq->refcnt);
 	}
 }
 EXPORT_SYMBOL(inet_frag_kill);
 
+static void inet_frag_destroy_rcu(struct rcu_head *head)
+{
+	struct inet_frag_queue *q = container_of(head, struct inet_frag_queue,
+						 rcu);
+	struct inet_frags *f = q->net->f;
+
+	if (f->destructor)
+		f->destructor(q);
+	kmem_cache_free(f->frags_cachep, q);
+}
+
 void inet_frag_destroy(struct inet_frag_queue *q)
 {
 	struct sk_buff *fp;
@@ -310,55 +145,21 @@ void inet_frag_destroy(struct inet_frag_queue *q)
 	}
 	sum = sum_truesize + f->qsize;
 
-	if (f->destructor)
-		f->destructor(q);
-	kmem_cache_free(f->frags_cachep, q);
+	call_rcu(&q->rcu, inet_frag_destroy_rcu);
 
 	sub_frag_mem_limit(nf, sum);
 }
 EXPORT_SYMBOL(inet_frag_destroy);
 
-static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
-						struct inet_frag_queue *qp_in,
-						struct inet_frags *f,
-						void *arg)
-{
-	struct inet_frag_bucket *hb = get_frag_bucket_locked(qp_in, f);
-	struct inet_frag_queue *qp;
-
-#ifdef CONFIG_SMP
-	/* With SMP race we have to recheck hash table, because
-	 * such entry could have been created on other cpu before
-	 * we acquired hash bucket lock.
-	 */
-	hlist_for_each_entry(qp, &hb->chain, list) {
-		if (qp->net == nf && f->match(qp, arg)) {
-			refcount_inc(&qp->refcnt);
-			spin_unlock(&hb->chain_lock);
-			qp_in->flags |= INET_FRAG_COMPLETE;
-			inet_frag_put(qp_in);
-			return qp;
-		}
-	}
-#endif
-	qp = qp_in;
-	if (!mod_timer(&qp->timer, jiffies + nf->timeout))
-		refcount_inc(&qp->refcnt);
-
-	refcount_inc(&qp->refcnt);
-	hlist_add_head(&qp->list, &hb->chain);
-
-	spin_unlock(&hb->chain_lock);
-
-	return qp;
-}
-
 static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 					       struct inet_frags *f,
 					       void *arg)
 {
 	struct inet_frag_queue *q;
 
+	if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh)
+		return NULL;
+
 	q = kmem_cache_zalloc(f->frags_cachep, GFP_ATOMIC);
 	if (!q)
 		return NULL;
@@ -369,64 +170,52 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 
 	timer_setup(&q->timer, f->frag_expire, 0);
 	spin_lock_init(&q->lock);
-	refcount_set(&q->refcnt, 1);
+	refcount_set(&q->refcnt, 3);
 
 	return q;
 }
 
 static struct inet_frag_queue *inet_frag_create(struct netns_frags *nf,
-						struct inet_frags *f,
 						void *arg)
 {
+	struct inet_frags *f = nf->f;
 	struct inet_frag_queue *q;
+	int err;
 
 	q = inet_frag_alloc(nf, f, arg);
 	if (!q)
 		return NULL;
 
-	return inet_frag_intern(nf, q, f, arg);
-}
+	mod_timer(&q->timer, jiffies + nf->timeout);
 
-struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
-				       struct inet_frags *f, void *key,
-				       unsigned int hash)
-{
-	struct inet_frag_bucket *hb;
-	struct inet_frag_queue *q;
-	int depth = 0;
-
-	if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) {
-		inet_frag_schedule_worker(f);
+	err = rhashtable_insert_fast(&nf->rhashtable, &q->node,
+				     f->rhash_params);
+	if (err < 0) {
+		q->flags |= INET_FRAG_COMPLETE;
+		inet_frag_kill(q);
+		inet_frag_destroy(q);
 		return NULL;
 	}
+	return q;
+}
 
-	if (frag_mem_limit(nf) > nf->low_thresh)
-		inet_frag_schedule_worker(f);
-
-	hash &= (INETFRAGS_HASHSZ - 1);
-	hb = &f->hash[hash];
-
-	spin_lock(&hb->chain_lock);
-	hlist_for_each_entry(q, &hb->chain, list) {
-		if (q->net == nf && f->match(q, key)) {
-			refcount_inc(&q->refcnt);
-			spin_unlock(&hb->chain_lock);
-			return q;
-		}
-		depth++;
-	}
-	spin_unlock(&hb->chain_lock);
+/* TODO : call from rcu_read_lock() and no longer use refcount_inc_not_zero() */
+struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key)
+{
+	struct inet_frag_queue *fq;
 
-	if (depth <= INETFRAGS_MAXDEPTH)
-		return inet_frag_create(nf, f, key);
+	rcu_read_lock();
 
-	if (inet_frag_may_rebuild(f)) {
-		if (!f->rebuild)
-			f->rebuild = true;
-		inet_frag_schedule_worker(f);
+	fq = rhashtable_lookup(&nf->rhashtable, key, nf->f->rhash_params);
+	if (fq) {
+		if (!refcount_inc_not_zero(&fq->refcnt))
+			fq = NULL;
+		rcu_read_unlock();
+		return fq;
 	}
+	rcu_read_unlock();
 
-	return ERR_PTR(-ENOBUFS);
+	return inet_frag_create(nf, key);
 }
 EXPORT_SYMBOL(inet_frag_find);
 
@@ -434,8 +223,7 @@ void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
 				   const char *prefix)
 {
 	static const char msg[] = "inet_frag_find: Fragment hash bucket"
-		" list length grew over limit " __stringify(INETFRAGS_MAXDEPTH)
-		". Dropping fragment.\n";
+		" list length grew over limit. Dropping fragment.\n";
 
 	if (PTR_ERR(q) == -ENOBUFS)
 		net_dbg_ratelimited("%s%s", prefix, msg);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 5171c8cc0eb6..1d3b82e96c45 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -69,15 +69,9 @@ struct ipfrag_skb_cb
 struct ipq {
 	struct inet_frag_queue q;
 
-	u32		user;
-	__be32		saddr;
-	__be32		daddr;
-	__be16		id;
-	u8		protocol;
 	u8		ecn; /* RFC3168 support */
 	u16		max_df_size; /* largest frag with DF set seen */
 	int             iif;
-	int             vif;   /* L3 master device index */
 	unsigned int    rid;
 	struct inet_peer *peer;
 };
@@ -97,41 +91,6 @@ int ip_frag_mem(struct net *net)
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 			 struct net_device *dev);
 
-struct ip4_create_arg {
-	struct iphdr *iph;
-	u32 user;
-	int vif;
-};
-
-static unsigned int ipqhashfn(__be16 id, __be32 saddr, __be32 daddr, u8 prot)
-{
-	net_get_random_once(&ip4_frags.rnd, sizeof(ip4_frags.rnd));
-	return jhash_3words((__force u32)id << 16 | prot,
-			    (__force u32)saddr, (__force u32)daddr,
-			    ip4_frags.rnd);
-}
-
-static unsigned int ip4_hashfn(const struct inet_frag_queue *q)
-{
-	const struct ipq *ipq;
-
-	ipq = container_of(q, struct ipq, q);
-	return ipqhashfn(ipq->id, ipq->saddr, ipq->daddr, ipq->protocol);
-}
-
-static bool ip4_frag_match(const struct inet_frag_queue *q, const void *a)
-{
-	const struct ipq *qp;
-	const struct ip4_create_arg *arg = a;
-
-	qp = container_of(q, struct ipq, q);
-	return	qp->id == arg->iph->id &&
-		qp->saddr == arg->iph->saddr &&
-		qp->daddr == arg->iph->daddr &&
-		qp->protocol == arg->iph->protocol &&
-		qp->user == arg->user &&
-		qp->vif == arg->vif;
-}
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
 {
@@ -140,17 +99,12 @@ static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
 					       frags);
 	struct net *net = container_of(ipv4, struct net, ipv4);
 
-	const struct ip4_create_arg *arg = a;
+	const struct frag_v4_compare_key *key = a;
 
-	qp->protocol = arg->iph->protocol;
-	qp->id = arg->iph->id;
-	qp->ecn = ip4_frag_ecn(arg->iph->tos);
-	qp->saddr = arg->iph->saddr;
-	qp->daddr = arg->iph->daddr;
-	qp->vif = arg->vif;
-	qp->user = arg->user;
+	q->key.v4 = *key;
+	qp->ecn = 0;
 	qp->peer = q->net->max_dist ?
-		inet_getpeer_v4(net->ipv4.peers, arg->iph->saddr, arg->vif, 1) :
+		inet_getpeer_v4(net->ipv4.peers, key->saddr, key->vif, 1) :
 		NULL;
 }
 
@@ -234,7 +188,7 @@ static void ip_expire(struct timer_list *t)
 		/* Only an end host needs to send an ICMP
 		 * "Fragment Reassembly Timeout" message, per RFC792.
 		 */
-		if (frag_expire_skip_icmp(qp->user) &&
+		if (frag_expire_skip_icmp(qp->q.key.v4.user) &&
 		    (skb_rtable(head)->rt_type != RTN_LOCAL))
 			goto out;
 
@@ -262,17 +216,17 @@ static void ip_expire(struct timer_list *t)
 static struct ipq *ip_find(struct net *net, struct iphdr *iph,
 			   u32 user, int vif)
 {
+	struct frag_v4_compare_key key = {
+		.saddr = iph->saddr,
+		.daddr = iph->daddr,
+		.user = user,
+		.vif = vif,
+		.id = iph->id,
+		.protocol = iph->protocol,
+	};
 	struct inet_frag_queue *q;
-	struct ip4_create_arg arg;
-	unsigned int hash;
-
-	arg.iph = iph;
-	arg.user = user;
-	arg.vif = vif;
 
-	hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
-
-	q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
+	q = inet_frag_find(&net->ipv4.frags, &key);
 	if (IS_ERR_OR_NULL(q)) {
 		inet_frag_maybe_warn_overflow(q, pr_fmt());
 		return NULL;
@@ -661,7 +615,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	err = -ENOMEM;
 	goto out_fail;
 out_oversize:
-	net_info_ratelimited("Oversized IP packet from %pI4\n", &qp->saddr);
+	net_info_ratelimited("Oversized IP packet from %pI4\n", &qp->q.key.v4.saddr);
 out_fail:
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
 	return err;
@@ -899,15 +853,47 @@ static struct pernet_operations ip4_frags_ops = {
 	.exit = ipv4_frags_exit_net,
 };
 
+
+static u32 ip4_key_hashfn(const void *data, u32 len, u32 seed)
+{
+	return jhash2(data,
+		      sizeof(struct frag_v4_compare_key) / sizeof(u32), seed);
+}
+
+static u32 ip4_obj_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct inet_frag_queue *fq = data;
+
+	return jhash2((const u32 *)&fq->key.v4,
+		      sizeof(struct frag_v4_compare_key) / sizeof(u32), seed);
+}
+
+static int ip4_obj_cmpfn(struct rhashtable_compare_arg *arg, const void *ptr)
+{
+	const struct frag_v4_compare_key *key = arg->key;
+	const struct inet_frag_queue *fq = ptr;
+
+	return !!memcmp(&fq->key, key, sizeof(*key));
+}
+
+static const struct rhashtable_params ip4_rhash_params = {
+	.head_offset		= offsetof(struct inet_frag_queue, node),
+	.key_offset		= offsetof(struct inet_frag_queue, key),
+	.key_len		= sizeof(struct frag_v4_compare_key),
+	.hashfn			= ip4_key_hashfn,
+	.obj_hashfn		= ip4_obj_hashfn,
+	.obj_cmpfn		= ip4_obj_cmpfn,
+	.automatic_shrinking	= true,
+};
+
 void __init ipfrag_init(void)
 {
-	ip4_frags.hashfn = ip4_hashfn;
 	ip4_frags.constructor = ip4_frag_init;
 	ip4_frags.destructor = ip4_frag_free;
 	ip4_frags.qsize = sizeof(struct ipq);
-	ip4_frags.match = ip4_frag_match;
 	ip4_frags.frag_expire = ip_expire;
 	ip4_frags.frags_cache_name = ip_frag_cache_name;
+	ip4_frags.rhash_params = ip4_rhash_params;
 	if (inet_frags_init(&ip4_frags))
 		panic("IP: failed to allocate ip4_frags cache\n");
 	ip4_frags_ctl_register();
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index bc776ef392ea..8b12431ae296 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -152,23 +152,6 @@ static inline u8 ip6_frag_ecn(const struct ipv6hdr *ipv6h)
 	return 1 << (ipv6_get_dsfield(ipv6h) & INET_ECN_MASK);
 }
 
-static unsigned int nf_hash_frag(__be32 id, const struct in6_addr *saddr,
-				 const struct in6_addr *daddr)
-{
-	net_get_random_once(&nf_frags.rnd, sizeof(nf_frags.rnd));
-	return jhash_3words(ipv6_addr_hash(saddr), ipv6_addr_hash(daddr),
-			    (__force u32)id, nf_frags.rnd);
-}
-
-
-static unsigned int nf_hashfn(const struct inet_frag_queue *q)
-{
-	const struct frag_queue *nq;
-
-	nq = container_of(q, struct frag_queue, q);
-	return nf_hash_frag(nq->id, &nq->saddr, &nq->daddr);
-}
-
 static void nf_ct_frag6_expire(struct timer_list *t)
 {
 	struct inet_frag_queue *frag = from_timer(frag, t, timer);
@@ -182,26 +165,19 @@ static void nf_ct_frag6_expire(struct timer_list *t)
 }
 
 /* Creation primitives. */
-static inline struct frag_queue *fq_find(struct net *net, __be32 id,
-					 u32 user, struct in6_addr *src,
-					 struct in6_addr *dst, int iif, u8 ecn)
+static struct frag_queue *fq_find(struct net *net, __be32 id, u32 user,
+				  const struct ipv6hdr *hdr, int iif)
 {
+	struct frag_v6_compare_key key = {
+		.id = id,
+		.saddr = hdr->saddr,
+		.daddr = hdr->daddr,
+		.user = user,
+		.iif = iif,
+	};
 	struct inet_frag_queue *q;
-	struct ip6_create_arg arg;
-	unsigned int hash;
-
-	arg.id = id;
-	arg.user = user;
-	arg.src = src;
-	arg.dst = dst;
-	arg.iif = iif;
-	arg.ecn = ecn;
-
-	local_bh_disable();
-	hash = nf_hash_frag(id, src, dst);
 
-	q = inet_frag_find(&net->nf_frag.frags, &nf_frags, &arg, hash);
-	local_bh_enable();
+	q = inet_frag_find(&net->nf_frag.frags, &key);
 	if (IS_ERR_OR_NULL(q)) {
 		inet_frag_maybe_warn_overflow(q, pr_fmt());
 		return NULL;
@@ -593,8 +569,8 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
 	fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
 	skb_orphan(skb);
-	fq = fq_find(net, fhdr->identification, user, &hdr->saddr, &hdr->daddr,
-		     skb->dev ? skb->dev->ifindex : 0, ip6_frag_ecn(hdr));
+	fq = fq_find(net, fhdr->identification, user, hdr,
+		     skb->dev ? skb->dev->ifindex : 0);
 	if (fq == NULL) {
 		pr_debug("Can't find and can't create new queue\n");
 		return -ENOMEM;
@@ -662,13 +638,12 @@ int nf_ct_frag6_init(void)
 {
 	int ret = 0;
 
-	nf_frags.hashfn = nf_hashfn;
 	nf_frags.constructor = ip6_frag_init;
 	nf_frags.destructor = NULL;
 	nf_frags.qsize = sizeof(struct frag_queue);
-	nf_frags.match = ip6_frag_match;
 	nf_frags.frag_expire = nf_ct_frag6_expire;
 	nf_frags.frags_cache_name = nf_frags_cache_name;
+	nf_frags.rhash_params = ip6_rhash_params;
 	ret = inet_frags_init(&nf_frags);
 	if (ret)
 		goto out;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index f0071b113a92..3fc853e4492a 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -79,52 +79,13 @@ static struct inet_frags ip6_frags;
 static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 			  struct net_device *dev);
 
-/*
- * callers should be careful not to use the hash value outside the ipfrag_lock
- * as doing so could race with ipfrag_hash_rnd being recalculated.
- */
-static unsigned int inet6_hash_frag(__be32 id, const struct in6_addr *saddr,
-				    const struct in6_addr *daddr)
-{
-	net_get_random_once(&ip6_frags.rnd, sizeof(ip6_frags.rnd));
-	return jhash_3words(ipv6_addr_hash(saddr), ipv6_addr_hash(daddr),
-			    (__force u32)id, ip6_frags.rnd);
-}
-
-static unsigned int ip6_hashfn(const struct inet_frag_queue *q)
-{
-	const struct frag_queue *fq;
-
-	fq = container_of(q, struct frag_queue, q);
-	return inet6_hash_frag(fq->id, &fq->saddr, &fq->daddr);
-}
-
-bool ip6_frag_match(const struct inet_frag_queue *q, const void *a)
-{
-	const struct frag_queue *fq;
-	const struct ip6_create_arg *arg = a;
-
-	fq = container_of(q, struct frag_queue, q);
-	return	fq->id == arg->id &&
-		fq->user == arg->user &&
-		ipv6_addr_equal(&fq->saddr, arg->src) &&
-		ipv6_addr_equal(&fq->daddr, arg->dst) &&
-		(arg->iif == fq->iif ||
-		 !(ipv6_addr_type(arg->dst) & (IPV6_ADDR_MULTICAST |
-					       IPV6_ADDR_LINKLOCAL)));
-}
-EXPORT_SYMBOL(ip6_frag_match);
-
 void ip6_frag_init(struct inet_frag_queue *q, const void *a)
 {
 	struct frag_queue *fq = container_of(q, struct frag_queue, q);
-	const struct ip6_create_arg *arg = a;
+	const struct frag_v6_compare_key *key = a;
 
-	fq->id = arg->id;
-	fq->user = arg->user;
-	fq->saddr = *arg->src;
-	fq->daddr = *arg->dst;
-	fq->ecn = arg->ecn;
+	q->key.v6 = *key;
+	fq->ecn = 0;
 }
 EXPORT_SYMBOL(ip6_frag_init);
 
@@ -182,23 +143,22 @@ static void ip6_frag_expire(struct timer_list *t)
 }
 
 static struct frag_queue *
-fq_find(struct net *net, __be32 id, const struct in6_addr *src,
-	const struct in6_addr *dst, int iif, u8 ecn)
+fq_find(struct net *net, __be32 id, const struct ipv6hdr *hdr, int iif)
 {
+	struct frag_v6_compare_key key = {
+		.id = id,
+		.saddr = hdr->saddr,
+		.daddr = hdr->daddr,
+		.user = IP6_DEFRAG_LOCAL_DELIVER,
+		.iif = iif,
+	};
 	struct inet_frag_queue *q;
-	struct ip6_create_arg arg;
-	unsigned int hash;
-
-	arg.id = id;
-	arg.user = IP6_DEFRAG_LOCAL_DELIVER;
-	arg.src = src;
-	arg.dst = dst;
-	arg.iif = iif;
-	arg.ecn = ecn;
 
-	hash = inet6_hash_frag(id, src, dst);
+	if (!(ipv6_addr_type(&hdr->daddr) & (IPV6_ADDR_MULTICAST |
+					    IPV6_ADDR_LINKLOCAL)))
+		key.iif = 0;
 
-	q = inet_frag_find(&net->ipv6.frags, &ip6_frags, &arg, hash);
+	q = inet_frag_find(&net->ipv6.frags, &key);
 	if (IS_ERR_OR_NULL(q)) {
 		inet_frag_maybe_warn_overflow(q, pr_fmt());
 		return NULL;
@@ -530,6 +490,7 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
 	struct frag_queue *fq;
 	const struct ipv6hdr *hdr = ipv6_hdr(skb);
 	struct net *net = dev_net(skb_dst(skb)->dev);
+	int iif;
 
 	if (IP6CB(skb)->flags & IP6SKB_FRAGMENTED)
 		goto fail_hdr;
@@ -558,13 +519,14 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
 		return 1;
 	}
 
-	fq = fq_find(net, fhdr->identification, &hdr->saddr, &hdr->daddr,
-		     skb->dev ? skb->dev->ifindex : 0, ip6_frag_ecn(hdr));
+	iif = skb->dev ? skb->dev->ifindex : 0;
+	fq = fq_find(net, fhdr->identification, hdr, iif);
 	if (fq) {
 		int ret;
 
 		spin_lock(&fq->q.lock);
 
+		fq->iif = iif;
 		ret = ip6_frag_queue(fq, skb, fhdr, IP6CB(skb)->nhoff);
 
 		spin_unlock(&fq->q.lock);
@@ -738,17 +700,47 @@ static struct pernet_operations ip6_frags_ops = {
 	.exit = ipv6_frags_exit_net,
 };
 
+static u32 ip6_key_hashfn(const void *data, u32 len, u32 seed)
+{
+	return jhash2(data,
+		      sizeof(struct frag_v6_compare_key) / sizeof(u32), seed);
+}
+
+static u32 ip6_obj_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct inet_frag_queue *fq = data;
+
+	return jhash2((const u32 *)&fq->key.v6,
+		      sizeof(struct frag_v6_compare_key) / sizeof(u32), seed);
+}
+
+static int ip6_obj_cmpfn(struct rhashtable_compare_arg *arg, const void *ptr)
+{
+	const struct frag_v6_compare_key *key = arg->key;
+	const struct inet_frag_queue *fq = ptr;
+
+	return !!memcmp(&fq->key, key, sizeof(*key));
+}
+
+const struct rhashtable_params ip6_rhash_params = {
+	.head_offset		= offsetof(struct inet_frag_queue, node),
+	.hashfn			= ip6_key_hashfn,
+	.obj_hashfn		= ip6_obj_hashfn,
+	.obj_cmpfn		= ip6_obj_cmpfn,
+	.automatic_shrinking	= true,
+};
+EXPORT_SYMBOL(ip6_rhash_params);
+
 int __init ipv6_frag_init(void)
 {
 	int ret;
 
-	ip6_frags.hashfn = ip6_hashfn;
 	ip6_frags.constructor = ip6_frag_init;
 	ip6_frags.destructor = NULL;
 	ip6_frags.qsize = sizeof(struct frag_queue);
-	ip6_frags.match = ip6_frag_match;
 	ip6_frags.frag_expire = ip6_frag_expire;
 	ip6_frags.frags_cache_name = ip6_frag_cache_name;
+	ip6_frags.rhash_params = ip6_rhash_params;
 	ret = inet_frags_init(&ip6_frags);
 	if (ret)
 		goto out;
-- 
2.18.0


* [PATCH v3 10/30] inet: frags: remove some helpers
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (8 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 09/30] inet: frags: use rhashtables for reassembly units Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 11/30] inet: frags: get rid of inet_frag_evicting() Stephen Hemminger
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()

Also, since we use rhashtable, we can bring back the number of fragments
in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
removed in commit 434d305405ab ("inet: frag: don't account number
of fragment queues").

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 6befe4a78b1553edb6eed3a78b4bcd9748526672)
---
 include/net/inet_frag.h | 5 -----
 include/net/ip.h        | 1 -
 include/net/ipv6.h      | 7 -------
 net/ipv4/ip_fragment.c  | 5 -----
 net/ipv4/proc.c         | 6 +++---
 net/ipv6/proc.c         | 5 +++--
 6 files changed, 6 insertions(+), 23 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 3fec0d3a0d01..4b5449df0aad 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -141,11 +141,6 @@ static inline void add_frag_mem_limit(struct netns_frags *nf, int i)
 	atomic_add(i, &nf->mem);
 }
 
-static inline int sum_frag_mem_limit(struct netns_frags *nf)
-{
-	return atomic_read(&nf->mem);
-}
-
 /* RFC 3168 support :
  * We want to check ECN values of all fragments, do detect invalid combinations.
  * In ipq->ecn, we store the OR value of each ip4_frag_ecn() fragment value.
diff --git a/include/net/ip.h b/include/net/ip.h
index 81da1123fc8e..7c430343176a 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -570,7 +570,6 @@ static inline struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *s
 	return skb;
 }
 #endif
-int ip_frag_mem(struct net *net);
 
 /*
  *	Functions provided by ip_forward.c
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index a271611d341d..fa87a62e9bd3 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -331,13 +331,6 @@ static inline bool ipv6_accept_ra(struct inet6_dev *idev)
 	    idev->cnf.accept_ra;
 }
 
-#if IS_ENABLED(CONFIG_IPV6)
-static inline int ip6_frag_mem(struct net *net)
-{
-	return sum_frag_mem_limit(&net->ipv6.frags);
-}
-#endif
-
 #define IPV6_FRAG_HIGH_THRESH	(4 * 1024*1024)	/* 4194304 */
 #define IPV6_FRAG_LOW_THRESH	(3 * 1024*1024)	/* 3145728 */
 #define IPV6_FRAG_TIMEOUT	(60 * HZ)	/* 60 seconds */
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 1d3b82e96c45..94451fad9994 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -83,11 +83,6 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-int ip_frag_mem(struct net *net)
-{
-	return sum_frag_mem_limit(&net->ipv4.frags);
-}
-
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 			 struct net_device *dev);
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 127153f1ed8a..01a337c3a36b 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -54,7 +54,6 @@
 static int sockstat_seq_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq->private;
-	unsigned int frag_mem;
 	int orphans, sockets;
 
 	orphans = percpu_counter_sum_positive(&tcp_orphan_count);
@@ -72,8 +71,9 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 		   sock_prot_inuse_get(net, &udplite_prot));
 	seq_printf(seq, "RAW: inuse %d\n",
 		   sock_prot_inuse_get(net, &raw_prot));
-	frag_mem = ip_frag_mem(net);
-	seq_printf(seq,  "FRAG: inuse %u memory %u\n", !!frag_mem, frag_mem);
+	seq_printf(seq,  "FRAG: inuse %u memory %u\n",
+		   atomic_read(&net->ipv4.frags.rhashtable.nelems),
+		   frag_mem_limit(&net->ipv4.frags));
 	return 0;
 }
 
diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index e88bcb8ff0fd..5704ec3d3178 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -38,7 +38,6 @@
 static int sockstat6_seq_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq->private;
-	unsigned int frag_mem = ip6_frag_mem(net);
 
 	seq_printf(seq, "TCP6: inuse %d\n",
 		       sock_prot_inuse_get(net, &tcpv6_prot));
@@ -48,7 +47,9 @@ static int sockstat6_seq_show(struct seq_file *seq, void *v)
 			sock_prot_inuse_get(net, &udplitev6_prot));
 	seq_printf(seq, "RAW6: inuse %d\n",
 		       sock_prot_inuse_get(net, &rawv6_prot));
-	seq_printf(seq, "FRAG6: inuse %u memory %u\n", !!frag_mem, frag_mem);
+	seq_printf(seq, "FRAG6: inuse %u memory %u\n",
+		   atomic_read(&net->ipv6.frags.rhashtable.nelems),
+		   frag_mem_limit(&net->ipv6.frags));
 	return 0;
 }
 
-- 
2.18.0


* [PATCH v3 11/30] inet: frags: get rid of inet_frag_evicting()
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (9 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 10/30] inet: frags: remove some helpers Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 12/30] inet: frags: remove inet_frag_maybe_warn_overflow() Stephen Hemminger
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

Getting rid of this helper also lets us refactor ip_expire(), removing
one level of indentation.

Note: in the future, we should try hard to avoid the skb_clone(),
since it is a serious performance cost.
Under DDOS, the ICMP message won't be sent because of rate limits.

The fact that ip6_expire_frag_queue() does not use skb_clone() is
disturbing too. Presumably IPv6 has the same
issue as the one we fixed in commit ec4fbd64751d
("inet: frag: release spinlock before calling icmp_send()").

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 399d1404be660d355192ff4df5ccc3f4159ec1e4)
---
 include/net/inet_frag.h |  5 ----
 net/ipv4/ip_fragment.c  | 65 ++++++++++++++++++++---------------------
 net/ipv6/reassembly.c   |  4 ---
 3 files changed, 32 insertions(+), 42 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 4b5449df0aad..0e8e159d88f7 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -119,11 +119,6 @@ static inline void inet_frag_put(struct inet_frag_queue *q)
 		inet_frag_destroy(q);
 }
 
-static inline bool inet_frag_evicting(struct inet_frag_queue *q)
-{
-	return false;
-}
-
 /* Memory Tracking Functions. */
 
 static inline int frag_mem_limit(struct netns_frags *nf)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 94451fad9994..1222aee3e5ee 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -143,8 +143,11 @@ static bool frag_expire_skip_icmp(u32 user)
 static void ip_expire(struct timer_list *t)
 {
 	struct inet_frag_queue *frag = from_timer(frag, t, timer);
-	struct ipq *qp;
+	struct sk_buff *clone, *head;
+	const struct iphdr *iph;
 	struct net *net;
+	struct ipq *qp;
+	int err;
 
 	qp = container_of(frag, struct ipq, q);
 	net = container_of(qp->q.net, struct net, ipv4.frags);
@@ -158,45 +161,41 @@ static void ip_expire(struct timer_list *t)
 	ipq_kill(qp);
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
 
-	if (!inet_frag_evicting(&qp->q)) {
-		struct sk_buff *clone, *head = qp->q.fragments;
-		const struct iphdr *iph;
-		int err;
+	head = qp->q.fragments;
 
-		__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
+	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-		if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !qp->q.fragments)
-			goto out;
+	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+		goto out;
 
-		head->dev = dev_get_by_index_rcu(net, qp->iif);
-		if (!head->dev)
-			goto out;
+	head->dev = dev_get_by_index_rcu(net, qp->iif);
+	if (!head->dev)
+		goto out;
 
 
-		/* skb has no dst, perform route lookup again */
-		iph = ip_hdr(head);
-		err = ip_route_input_noref(head, iph->daddr, iph->saddr,
+	/* skb has no dst, perform route lookup again */
+	iph = ip_hdr(head);
+	err = ip_route_input_noref(head, iph->daddr, iph->saddr,
 					   iph->tos, head->dev);
-		if (err)
-			goto out;
+	if (err)
+		goto out;
 
-		/* Only an end host needs to send an ICMP
-		 * "Fragment Reassembly Timeout" message, per RFC792.
-		 */
-		if (frag_expire_skip_icmp(qp->q.key.v4.user) &&
-		    (skb_rtable(head)->rt_type != RTN_LOCAL))
-			goto out;
-
-		clone = skb_clone(head, GFP_ATOMIC);
-
-		/* Send an ICMP "Fragment Reassembly Timeout" message. */
-		if (clone) {
-			spin_unlock(&qp->q.lock);
-			icmp_send(clone, ICMP_TIME_EXCEEDED,
-				  ICMP_EXC_FRAGTIME, 0);
-			consume_skb(clone);
-			goto out_rcu_unlock;
-		}
+	/* Only an end host needs to send an ICMP
+	 * "Fragment Reassembly Timeout" message, per RFC792.
+	 */
+	if (frag_expire_skip_icmp(qp->q.key.v4.user) &&
+	    (skb_rtable(head)->rt_type != RTN_LOCAL))
+		goto out;
+
+	clone = skb_clone(head, GFP_ATOMIC);
+
+	/* Send an ICMP "Fragment Reassembly Timeout" message. */
+	if (clone) {
+		spin_unlock(&qp->q.lock);
+		icmp_send(clone, ICMP_TIME_EXCEEDED,
+			  ICMP_EXC_FRAGTIME, 0);
+		consume_skb(clone);
+		goto out_rcu_unlock;
 	}
 out:
 	spin_unlock(&qp->q.lock);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 3fc853e4492a..70acad126d04 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -106,10 +106,6 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 		goto out_rcu_unlock;
 
 	__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMFAILS);
-
-	if (inet_frag_evicting(&fq->q))
-		goto out_rcu_unlock;
-
 	__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
 	/* Don't send error if the first segment did not arrive. */
-- 
2.18.0


* [PATCH v3 12/30] inet: frags: remove inet_frag_maybe_warn_overflow()
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (10 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 11/30] inet: frags: get rid of inet_frag_evicting() Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 13/30] inet: frags: break the 2GB limit for frags storage Stephen Hemminger
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

This function is obsolete after the rhashtable conversion of inet defrag.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2d44ed22e607f9a285b049de2263e3840673a260)
---
 include/net/inet_frag.h                 |  2 --
 net/ieee802154/6lowpan/reassembly.c     |  5 ++---
 net/ipv4/inet_fragment.c                | 11 -----------
 net/ipv4/ip_fragment.c                  |  5 ++---
 net/ipv6/netfilter/nf_conntrack_reasm.c |  5 ++---
 net/ipv6/reassembly.c                   |  5 ++---
 6 files changed, 8 insertions(+), 25 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 0e8e159d88f7..95e353e3305b 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -110,8 +110,6 @@ void inet_frags_exit_net(struct netns_frags *nf);
 void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-				   const char *prefix);
 
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 0fa0121f85d4..1aec71a3f904 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -84,10 +84,9 @@ fq_find(struct net *net, const struct lowpan_802154_cb *cb,
 	struct inet_frag_queue *q;
 
 	q = inet_frag_find(&ieee802154_lowpan->frags, &key);
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
+	if (!q)
 		return NULL;
-	}
+
 	return container_of(q, struct lowpan_frag_queue, q);
 }
 
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index ebb8f411e0db..c9e35b81d093 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -218,14 +218,3 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key)
 	return inet_frag_create(nf, key);
 }
 EXPORT_SYMBOL(inet_frag_find);
-
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-				   const char *prefix)
-{
-	static const char msg[] = "inet_frag_find: Fragment hash bucket"
-		" list length grew over limit. Dropping fragment.\n";
-
-	if (PTR_ERR(q) == -ENOBUFS)
-		net_dbg_ratelimited("%s%s", prefix, msg);
-}
-EXPORT_SYMBOL(inet_frag_maybe_warn_overflow);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 1222aee3e5ee..38cbf56bb48e 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -221,10 +221,9 @@ static struct ipq *ip_find(struct net *net, struct iphdr *iph,
 	struct inet_frag_queue *q;
 
 	q = inet_frag_find(&net->ipv4.frags, &key);
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
+	if (!q)
 		return NULL;
-	}
+
 	return container_of(q, struct ipq, q);
 }
 
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 8b12431ae296..54ce1d2a9a9d 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -178,10 +178,9 @@ static struct frag_queue *fq_find(struct net *net, __be32 id, u32 user,
 	struct inet_frag_queue *q;
 
 	q = inet_frag_find(&net->nf_frag.frags, &key);
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
+	if (!q)
 		return NULL;
-	}
+
 	return container_of(q, struct frag_queue, q);
 }
 
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 70acad126d04..2a77fda5e3bc 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -155,10 +155,9 @@ fq_find(struct net *net, __be32 id, const struct ipv6hdr *hdr, int iif)
 		key.iif = 0;
 
 	q = inet_frag_find(&net->ipv6.frags, &key);
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
+	if (!q)
 		return NULL;
-	}
+
 	return container_of(q, struct frag_queue, q);
 }
 
-- 
2.18.0


* [PATCH v3 13/30] inet: frags: break the 2GB limit for frags storage
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (11 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 12/30] inet: frags: remove inet_frag_maybe_warn_overflow() Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 14/30] inet: frags: do not clone skb in ip_expire() Stephen Hemminger
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

Some users are willing to provision huge amounts of memory to be able
to perform reassembly reasonably well under pressure.

Current memory tracking is using one atomic_t and integers.

Switch to atomic_long_t so that 64bit arches can use more than 2GB,
without any cost for 32bit arches.

Note that this patch also avoids an overflow error when high_thresh is
set to ~2GB, since this test in inet_frag_alloc() was never true:
if (... || frag_mem_limit(nf) > nf->high_thresh)
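
As a hypothetical, userspace-only illustration of that overflow (names
loosely mirror the kernel's; this is not code from the tree), a 32bit
signed counter can never exceed an INT_MAX threshold and simply wraps:

  #include <limits.h>
  #include <stdio.h>

  int main(void)
  {
  	int high_thresh = INT_MAX;	/* sysctl set to ~2GB */
  	int mem = INT_MAX - 32768;	/* old 32bit counter, nearly full */

  	/* charging one more large datagram wraps the signed counter
  	 * (strictly undefined in ISO C, wraps on common ABIs)
  	 */
  	mem += 65536;
  	printf("int:  mem=%d, over limit? %s\n",
  	       mem, mem > high_thresh ? "yes" : "no");	/* never "yes" */

  	/* atomic_long_t analogue: no wrap on 64bit arches */
  	long lmem = (long)(INT_MAX - 32768) + 65536;
  	printf("long: mem=%ld, over limit? %s\n",
  	       lmem, lmem > (long)high_thresh ? "yes" : "no");
  	return 0;
  }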

Tested:

$ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

<frag DDOS>

$ grep FRAG /proc/net/sockstat
FRAG: inuse 14705885 memory 16000002880

$ nstat -n ; sleep 1 ; nstat | grep Reas
IpReasmReqds                    3317150            0.0
IpReasmFails                    3317112            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 3e67f106f619dcfaf6f4e2039599bdb69848c714)
---
 Documentation/networking/ip-sysctl.txt  |  4 ++--
 include/net/inet_frag.h                 | 20 ++++++++++----------
 net/ieee802154/6lowpan/reassembly.c     | 10 +++++-----
 net/ipv4/ip_fragment.c                  | 10 +++++-----
 net/ipv4/proc.c                         |  2 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c | 10 +++++-----
 net/ipv6/proc.c                         |  2 +-
 net/ipv6/reassembly.c                   |  6 +++---
 8 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index f23582a3c661..a054b5ad410a 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -133,10 +133,10 @@ min_adv_mss - INTEGER
 
 IP Fragmentation:
 
-ipfrag_high_thresh - INTEGER
+ipfrag_high_thresh - LONG INTEGER
 	Maximum memory used to reassemble IP fragments.
 
-ipfrag_low_thresh - INTEGER
+ipfrag_low_thresh - LONG INTEGER
 	(Obsolete since linux-4.17)
 	Maximum memory used to reassemble IP fragments before the kernel
 	begins to remove incomplete fragment queues to free up resources.
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 95e353e3305b..a52e7273e7a5 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -8,11 +8,11 @@ struct netns_frags {
 	struct rhashtable       rhashtable ____cacheline_aligned_in_smp;
 
 	/* Keep atomic mem on separate cachelines in structs that include it */
-	atomic_t		mem ____cacheline_aligned_in_smp;
+	atomic_long_t		mem ____cacheline_aligned_in_smp;
 	/* sysctls */
+	long			high_thresh;
+	long			low_thresh;
 	int			timeout;
-	int			high_thresh;
-	int			low_thresh;
 	int			max_dist;
 	struct inet_frags	*f;
 };
@@ -102,7 +102,7 @@ void inet_frags_fini(struct inet_frags *);
 
 static inline int inet_frags_init_net(struct netns_frags *nf)
 {
-	atomic_set(&nf->mem, 0);
+	atomic_long_set(&nf->mem, 0);
 	return rhashtable_init(&nf->rhashtable, &nf->f->rhash_params);
 }
 void inet_frags_exit_net(struct netns_frags *nf);
@@ -119,19 +119,19 @@ static inline void inet_frag_put(struct inet_frag_queue *q)
 
 /* Memory Tracking Functions. */
 
-static inline int frag_mem_limit(struct netns_frags *nf)
+static inline long frag_mem_limit(const struct netns_frags *nf)
 {
-	return atomic_read(&nf->mem);
+	return atomic_long_read(&nf->mem);
 }
 
-static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
+static inline void sub_frag_mem_limit(struct netns_frags *nf, long val)
 {
-	atomic_sub(i, &nf->mem);
+	atomic_long_sub(val, &nf->mem);
 }
 
-static inline void add_frag_mem_limit(struct netns_frags *nf, int i)
+static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
 {
-	atomic_add(i, &nf->mem);
+	atomic_long_add(val, &nf->mem);
 }
 
 /* RFC 3168 support :
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 1aec71a3f904..44f148a6bb57 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -411,23 +411,23 @@ int lowpan_frag_rcv(struct sk_buff *skb, u8 frag_type)
 }
 
 #ifdef CONFIG_SYSCTL
-static int zero;
+static long zero;
 
 static struct ctl_table lowpan_frags_ns_ctl_table[] = {
 	{
 		.procname	= "6lowpanfrag_high_thresh",
 		.data		= &init_net.ieee802154_lowpan.frags.high_thresh,
-		.maxlen		= sizeof(int),
+		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= proc_doulongvec_minmax,
 		.extra1		= &init_net.ieee802154_lowpan.frags.low_thresh
 	},
 	{
 		.procname	= "6lowpanfrag_low_thresh",
 		.data		= &init_net.ieee802154_lowpan.frags.low_thresh,
-		.maxlen		= sizeof(int),
+		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= proc_doulongvec_minmax,
 		.extra1		= &zero,
 		.extra2		= &init_net.ieee802154_lowpan.frags.high_thresh
 	},
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 38cbf56bb48e..dc3ed0ac4c58 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -683,23 +683,23 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
 EXPORT_SYMBOL(ip_check_defrag);
 
 #ifdef CONFIG_SYSCTL
-static int zero;
+static long zero;
 
 static struct ctl_table ip4_frags_ns_ctl_table[] = {
 	{
 		.procname	= "ipfrag_high_thresh",
 		.data		= &init_net.ipv4.frags.high_thresh,
-		.maxlen		= sizeof(int),
+		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= proc_doulongvec_minmax,
 		.extra1		= &init_net.ipv4.frags.low_thresh
 	},
 	{
 		.procname	= "ipfrag_low_thresh",
 		.data		= &init_net.ipv4.frags.low_thresh,
-		.maxlen		= sizeof(int),
+		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= proc_doulongvec_minmax,
 		.extra1		= &zero,
 		.extra2		= &init_net.ipv4.frags.high_thresh
 	},
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 01a337c3a36b..8fbd5633e544 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -71,7 +71,7 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 		   sock_prot_inuse_get(net, &udplite_prot));
 	seq_printf(seq, "RAW: inuse %d\n",
 		   sock_prot_inuse_get(net, &raw_prot));
-	seq_printf(seq,  "FRAG: inuse %u memory %u\n",
+	seq_printf(seq,  "FRAG: inuse %u memory %lu\n",
 		   atomic_read(&net->ipv4.frags.rhashtable.nelems),
 		   frag_mem_limit(&net->ipv4.frags));
 	return 0;
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 54ce1d2a9a9d..6613f81e553a 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -63,7 +63,7 @@ struct nf_ct_frag6_skb_cb
 static struct inet_frags nf_frags;
 
 #ifdef CONFIG_SYSCTL
-static int zero;
+static long zero;
 
 static struct ctl_table nf_ct_frag6_sysctl_table[] = {
 	{
@@ -76,18 +76,18 @@ static struct ctl_table nf_ct_frag6_sysctl_table[] = {
 	{
 		.procname	= "nf_conntrack_frag6_low_thresh",
 		.data		= &init_net.nf_frag.frags.low_thresh,
-		.maxlen		= sizeof(unsigned int),
+		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= proc_doulongvec_minmax,
 		.extra1		= &zero,
 		.extra2		= &init_net.nf_frag.frags.high_thresh
 	},
 	{
 		.procname	= "nf_conntrack_frag6_high_thresh",
 		.data		= &init_net.nf_frag.frags.high_thresh,
-		.maxlen		= sizeof(unsigned int),
+		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= proc_doulongvec_minmax,
 		.extra1		= &init_net.nf_frag.frags.low_thresh
 	},
 	{ }
diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 5704ec3d3178..dc04c024986c 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -47,7 +47,7 @@ static int sockstat6_seq_show(struct seq_file *seq, void *v)
 			sock_prot_inuse_get(net, &udplitev6_prot));
 	seq_printf(seq, "RAW6: inuse %d\n",
 		       sock_prot_inuse_get(net, &rawv6_prot));
-	seq_printf(seq, "FRAG6: inuse %u memory %u\n",
+	seq_printf(seq, "FRAG6: inuse %u memory %lu\n",
 		   atomic_read(&net->ipv6.frags.rhashtable.nelems),
 		   frag_mem_limit(&net->ipv6.frags));
 	return 0;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 2a77fda5e3bc..905a8aee2671 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -552,15 +552,15 @@ static struct ctl_table ip6_frags_ns_ctl_table[] = {
 	{
 		.procname	= "ip6frag_high_thresh",
 		.data		= &init_net.ipv6.frags.high_thresh,
-		.maxlen		= sizeof(int),
+		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= proc_doulongvec_minmax,
 		.extra1		= &init_net.ipv6.frags.low_thresh
 	},
 	{
 		.procname	= "ip6frag_low_thresh",
 		.data		= &init_net.ipv6.frags.low_thresh,
-		.maxlen		= sizeof(int),
+		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
-- 
2.18.0


* [PATCH v3 14/30] inet: frags: do not clone skb in ip_expire()
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (12 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 13/30] inet: frags: break the 2GB limit for frags storage Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 15/30] ipv6: frags: rewrite ip6_expire_frag_queue() Stephen Hemminger
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release
spinlock before calling icmp_send()")

While that commit fixed the bug, it also added a very high cost
for DDOS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, a copy, and freeing).

We can use skb_get(head) here; all we want is to make sure the skb
won't be freed by another CPU.
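
To make the cost difference concrete, here is a rough userspace sketch
(struct pkt and both helpers are invented stand-ins, not kernel APIs):
the skb_get() analogue is a single counter increment, while the
skb_clone() analogue pays an allocation and a copy that are thrown
away right after the rate-limited ICMP path ran:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct pkt {
  	int refcnt;
  	unsigned char data[1500];
  };

  /* skb_get() analogue: one increment, no allocation */
  static struct pkt *pkt_get(struct pkt *p)
  {
  	p->refcnt++;
  	return p;
  }

  /* skb_clone() analogue: allocate + copy, freed right after use */
  static struct pkt *pkt_clone(const struct pkt *p)
  {
  	struct pkt *c = malloc(sizeof(*c));

  	if (c) {
  		memcpy(c, p, sizeof(*c));
  		c->refcnt = 1;
  	}
  	return c;
  }

  int main(void)
  {
  	struct pkt p = { .refcnt = 1 };
  	struct pkt *held = pkt_get(&p);		/* cheap pin */
  	struct pkt *copy = pkt_clone(&p);	/* expensive duplicate */

  	printf("refcnt after get: %d, clone at %p\n",
  	       held->refcnt, (void *)copy);
  	free(copy);
  	return 0;
  }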

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1eec5d5670084ee644597bd26c25e22c69b9f748)
---
 net/ipv4/ip_fragment.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index dc3ed0ac4c58..88fa8ffc5558 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -143,8 +143,8 @@ static bool frag_expire_skip_icmp(u32 user)
 static void ip_expire(struct timer_list *t)
 {
 	struct inet_frag_queue *frag = from_timer(frag, t, timer);
-	struct sk_buff *clone, *head;
 	const struct iphdr *iph;
+	struct sk_buff *head;
 	struct net *net;
 	struct ipq *qp;
 	int err;
@@ -187,16 +187,12 @@ static void ip_expire(struct timer_list *t)
 	    (skb_rtable(head)->rt_type != RTN_LOCAL))
 		goto out;
 
-	clone = skb_clone(head, GFP_ATOMIC);
+	skb_get(head);
+	spin_unlock(&qp->q.lock);
+	icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
+	kfree_skb(head);
+	goto out_rcu_unlock;
 
-	/* Send an ICMP "Fragment Reassembly Timeout" message. */
-	if (clone) {
-		spin_unlock(&qp->q.lock);
-		icmp_send(clone, ICMP_TIME_EXCEEDED,
-			  ICMP_EXC_FRAGTIME, 0);
-		consume_skb(clone);
-		goto out_rcu_unlock;
-	}
 out:
 	spin_unlock(&qp->q.lock);
 out_rcu_unlock:
-- 
2.18.0


* [PATCH v3 15/30] ipv6: frags: rewrite ip6_expire_frag_queue()
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (13 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 14/30] inet: frags: do not clone skb in ip_expire() Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 16/30] rhashtable: reorganize struct rhashtable layout Stephen Hemminger
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

Make it similar to IPv4 ip_expire(), and release the lock
before calling icmp functions.
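
The pattern, reduced to a hedged userspace sketch (a pthread mutex
stands in for the queue spinlock; all names here are invented): pin
what you need under the lock, drop the lock, then run the slow ICMP
path unlocked:

  #include <pthread.h>
  #include <stdio.h>

  struct fq {
  	pthread_mutex_t lock;	/* stand-in for fq->q.lock */
  	int refcnt;
  };

  static void icmpv6_send_slow(void)
  {
  	/* may allocate, rate-limit and format a packet:
  	 * never call it with the queue lock held
  	 */
  }

  static void expire(struct fq *fq)
  {
  	pthread_mutex_lock(&fq->lock);
  	fq->refcnt++;			/* like skb_get(): pin the head skb */
  	pthread_mutex_unlock(&fq->lock);

  	icmpv6_send_slow();		/* slow path runs unlocked */

  	pthread_mutex_lock(&fq->lock);
  	fq->refcnt--;			/* kernel does kfree_skb()/inet_frag_put() */
  	pthread_mutex_unlock(&fq->lock);
  }

  int main(void)
  {
  	struct fq fq = { PTHREAD_MUTEX_INITIALIZER, 1 };

  	expire(&fq);
  	printf("refcnt=%d\n", fq.refcnt);
  	return 0;
  }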

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 05c0b86b9696802fd0ce5676a92a63f1b455bdf3)
---
 net/ipv6/reassembly.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 905a8aee2671..2127da130dc2 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -92,7 +92,9 @@ EXPORT_SYMBOL(ip6_frag_init);
 void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 {
 	struct net_device *dev = NULL;
+	struct sk_buff *head;
 
+	rcu_read_lock();
 	spin_lock(&fq->q.lock);
 
 	if (fq->q.flags & INET_FRAG_COMPLETE)
@@ -100,28 +102,34 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 
 	inet_frag_kill(&fq->q);
 
-	rcu_read_lock();
 	dev = dev_get_by_index_rcu(net, fq->iif);
 	if (!dev)
-		goto out_rcu_unlock;
+		goto out;
 
 	__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMFAILS);
 	__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
 	/* Don't send error if the first segment did not arrive. */
-	if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !fq->q.fragments)
-		goto out_rcu_unlock;
+	head = fq->q.fragments;
+	if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+		goto out;
 
 	/* But use as source device on which LAST ARRIVED
 	 * segment was received. And do not use fq->dev
 	 * pointer directly, device might already disappeared.
 	 */
-	fq->q.fragments->dev = dev;
-	icmpv6_send(fq->q.fragments, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
-out_rcu_unlock:
-	rcu_read_unlock();
+	head->dev = dev;
+	skb_get(head);
+	spin_unlock(&fq->q.lock);
+
+	icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
+	kfree_skb(head);
+	goto out_rcu_unlock;
+
 out:
 	spin_unlock(&fq->q.lock);
+out_rcu_unlock:
+	rcu_read_unlock();
 	inet_frag_put(&fq->q);
 }
 EXPORT_SYMBOL(ip6_expire_frag_queue);
-- 
2.18.0


* [PATCH v3 16/30] rhashtable: reorganize struct rhashtable layout
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (14 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 15/30] ipv6: frags: rewrite ip6_expire_frag_queue() Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 17/30] inet: frags: reorganize struct netns_frags Stephen Hemminger
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

While under frags DDOS, I noticed unfortunate false sharing between
@nelems and @params.automatic_shrinking.

Move @nelems to the end of struct rhashtable so that the first cache
line is shared clean between all CPUs, because it is almost never dirtied.
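
A small, self-contained sketch of the layout effect (the structs below
are illustrative stand-ins, not the real rhashtable): offsetof() shows
the hot counter and the read-mostly params sharing cache line 0 before
the move, and landing on different lines after it:

  #include <stddef.h>
  #include <stdio.h>

  #define CACHELINE 64

  struct ht_before {	/* hot counter next to read-mostly config */
  	int  nelems;
  	char params[48];	/* e.g. automatic_shrinking lives here */
  };

  struct ht_after {	/* hot counter moved past the first line */
  	char params[48];
  	char other[24];	/* run_work, mutex, lock, ... */
  	int  nelems;
  };

  int main(void)
  {
  	printf("before: params line %zu, nelems line %zu\n",
  	       offsetof(struct ht_before, params) / CACHELINE,
  	       offsetof(struct ht_before, nelems) / CACHELINE);
  	printf("after:  params line %zu, nelems line %zu\n",
  	       offsetof(struct ht_after, params) / CACHELINE,
  	       offsetof(struct ht_after, nelems) / CACHELINE);
  	return 0;
  }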

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit e5d672a0780d9e7118caad4c171ec88b8299398d)
---
 include/linux/rhashtable.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 7fd514f36e74..a4be6388a980 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -152,25 +152,25 @@ struct rhashtable_params {
 /**
  * struct rhashtable - Hash table handle
  * @tbl: Bucket table
- * @nelems: Number of elements in table
  * @key_len: Key length for hashfn
- * @p: Configuration parameters
  * @max_elems: Maximum number of elements in table
+ * @p: Configuration parameters
  * @rhlist: True if this is an rhltable
  * @run_work: Deferred worker to expand/shrink asynchronously
  * @mutex: Mutex to protect current/future table swapping
  * @lock: Spin lock to protect walker list
+ * @nelems: Number of elements in table
  */
 struct rhashtable {
 	struct bucket_table __rcu	*tbl;
-	atomic_t			nelems;
 	unsigned int			key_len;
-	struct rhashtable_params	p;
 	unsigned int			max_elems;
+	struct rhashtable_params	p;
 	bool				rhlist;
 	struct work_struct		run_work;
 	struct mutex                    mutex;
 	spinlock_t			lock;
+	atomic_t			nelems;
 };
 
 /**
-- 
2.18.0


* [PATCH v3 17/30] inet: frags: reorganize struct netns_frags
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (15 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 16/30] rhashtable: reorganize struct rhashtable layout Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 18/30] inet: frags: get rid of ipfrag_skb_cb/FRAG_CB Stephen Hemminger
                   ` (13 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

Put the read-mostly fields in a separate cache line
at the beginning of struct netns_frags, to reduce the
false sharing noticed in inet_frag_kill().
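
For illustration, ____cacheline_aligned_in_smp boils down to an
alignment attribute; this hypothetical userspace sketch (field names
follow the patch, the struct body is abridged) shows the dirtied
counter being forced onto its own 64-byte line:

  #include <stddef.h>
  #include <stdio.h>

  #define cacheline_aligned __attribute__((aligned(64)))

  struct netns_frags_sketch {
  	/* read-mostly sysctls: stay clean in every CPU's cache */
  	long high_thresh;
  	long low_thresh;
  	int  timeout;
  	int  max_dist;
  	void *f;

  	/* dirtied on every frag alloc/free: gets its own line */
  	long mem cacheline_aligned;
  };

  int main(void)
  {
  	printf("high_thresh at %zu, mem at %zu\n",
  	       offsetof(struct netns_frags_sketch, high_thresh),
  	       offsetof(struct netns_frags_sketch, mem));
  	return 0;
  }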

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c2615cf5a761b32bf74e85bddc223dfff3d9b9f0)
---
 include/net/inet_frag.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index a52e7273e7a5..ed07e3786d98 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -5,16 +5,17 @@
 #include <linux/rhashtable.h>
 
 struct netns_frags {
-	struct rhashtable       rhashtable ____cacheline_aligned_in_smp;
-
-	/* Keep atomic mem on separate cachelines in structs that include it */
-	atomic_long_t		mem ____cacheline_aligned_in_smp;
 	/* sysctls */
 	long			high_thresh;
 	long			low_thresh;
 	int			timeout;
 	int			max_dist;
 	struct inet_frags	*f;
+
+	struct rhashtable       rhashtable ____cacheline_aligned_in_smp;
+
+	/* Keep atomic mem on separate cachelines in structs that include it */
+	atomic_long_t		mem ____cacheline_aligned_in_smp;
 };
 
 /**
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 18/30] inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (16 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 17/30] inet: frags: reorganize struct netns_frags Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 19/30] inet: frags: fix ip6frag_low_thresh boundary Stephen Hemminger
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
this integer is currently in a different cache line than skb->next,
meaning that we use two cache lines per skb when finding the insertion point.

By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
in a single cache line and save precious memory bandwidth.

Note that after the fast path added by Changli Gao in commit
d6bebca92c66 ("fragment: add fast path for in-order fragments")
this change won't help the fast path, since we still need
to access prev->len (2nd cache line), but it will show great
benefits when the slow path is entered, since we perform
a linear scan of a potentially long list.

Also note that this potentially long list is an attack vector;
we might consider using an rb-tree there eventually.
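
The aliasing pattern, pulled out of the ip_frag_queue() hunk below for
readability:

	dev = skb->dev;			/* consume the device pointer... */
	if (dev)
		qp->iif = dev->ifindex;
	barrier();			/* ...before the union is reused */
	skb->ip_defrag_offset = offset;	/* clobbers skb->dev             */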

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit bf66337140c64c27fa37222b7abca7e49d63fb57)
---
 include/linux/skbuff.h |  1 +
 net/ipv4/ip_fragment.c | 35 ++++++++++++++---------------------
 2 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6dd77767fd5b..f4749678b7ee 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -678,6 +678,7 @@ struct sk_buff {
 		 * UDP receive path is one user.
 		 */
 		unsigned long		dev_scratch;
+		int			ip_defrag_offset;
 	};
 	/*
 	 * This is the control buffer. It is free to use for every
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 88fa8ffc5558..5331a0d68374 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -57,14 +57,6 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
-struct ipfrag_skb_cb
-{
-	struct inet_skb_parm	h;
-	int			offset;
-};
-
-#define FRAG_CB(skb)	((struct ipfrag_skb_cb *)((skb)->cb))
-
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
 	struct inet_frag_queue q;
@@ -353,13 +345,13 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	 * this fragment, right?
 	 */
 	prev = qp->q.fragments_tail;
-	if (!prev || FRAG_CB(prev)->offset < offset) {
+	if (!prev || prev->ip_defrag_offset < offset) {
 		next = NULL;
 		goto found;
 	}
 	prev = NULL;
 	for (next = qp->q.fragments; next != NULL; next = next->next) {
-		if (FRAG_CB(next)->offset >= offset)
+		if (next->ip_defrag_offset >= offset)
 			break;	/* bingo! */
 		prev = next;
 	}
@@ -370,7 +362,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	 * any overlaps are eliminated.
 	 */
 	if (prev) {
-		int i = (FRAG_CB(prev)->offset + prev->len) - offset;
+		int i = (prev->ip_defrag_offset + prev->len) - offset;
 
 		if (i > 0) {
 			offset += i;
@@ -387,8 +379,8 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 
 	err = -ENOMEM;
 
-	while (next && FRAG_CB(next)->offset < end) {
-		int i = end - FRAG_CB(next)->offset; /* overlap is 'i' bytes */
+	while (next && next->ip_defrag_offset < end) {
+		int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
 
 		if (i < next->len) {
 			int delta = -next->truesize;
@@ -401,7 +393,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 			delta += next->truesize;
 			if (delta)
 				add_frag_mem_limit(qp->q.net, delta);
-			FRAG_CB(next)->offset += i;
+			next->ip_defrag_offset += i;
 			qp->q.meat -= i;
 			if (next->ip_summed != CHECKSUM_UNNECESSARY)
 				next->ip_summed = CHECKSUM_NONE;
@@ -425,7 +417,13 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		}
 	}
 
-	FRAG_CB(skb)->offset = offset;
+	/* Note : skb->ip_defrag_offset and skb->dev share the same location */
+	dev = skb->dev;
+	if (dev)
+		qp->iif = dev->ifindex;
+	/* Makes sure compiler wont do silly aliasing games */
+	barrier();
+	skb->ip_defrag_offset = offset;
 
 	/* Insert this fragment in the chain of fragments. */
 	skb->next = next;
@@ -436,11 +434,6 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	else
 		qp->q.fragments = skb;
 
-	dev = skb->dev;
-	if (dev) {
-		qp->iif = dev->ifindex;
-		skb->dev = NULL;
-	}
 	qp->q.stamp = skb->tstamp;
 	qp->q.meat += skb->len;
 	qp->ecn |= ecn;
@@ -516,7 +509,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	}
 
 	WARN_ON(!head);
-	WARN_ON(FRAG_CB(head)->offset != 0);
+	WARN_ON(head->ip_defrag_offset != 0);
 
 	/* Allocate a new buffer for the datagram. */
 	ihlen = ip_hdrlen(head);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 19/30] inet: frags: fix ip6frag_low_thresh boundary
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (17 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 18/30] inet: frags: get rid of ipfrag_skb_cb/FRAG_CB Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 20/30] ip: discard IPv4 datagrams with overlapping segments Stephen Hemminger
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

Giving an integer to proc_doulongvec_minmax() is dangerous on 64-bit
arches, since the linker might place a non-zero value next to it,
preventing a change to ip6frag_low_thresh.

ip6frag_low_thresh is not used anymore in the kernel, but we do not
want to prematurely break user scripts wanting to change it.

Since specifying a minimal value of 0 for proc_doulongvec_minmax()
is moot, let's remove these zero values in all defrag units.
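
A user-space analogue of the hazard (a sketch, assuming a little-endian
64-bit target; whether the two statics end up adjacent is up to the
linker, which is exactly the problem):

	#include <stdio.h>

	static int low_thresh = 196608;	/* 4-byte sysctl-style value */
	static int neighbour  = 1;	/* placed by the linker      */

	int main(void)
	{
		/* proc_doulongvec_minmax() would read 8 bytes here;
		 * the cast is undefined behavior, deliberately so
		 */
		unsigned long v = *(unsigned long *)&low_thresh;

		printf("int %d, neighbour %d, read as long %#lx\n",
		       low_thresh, neighbour, v);
		return 0;
	}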

Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 3d23401283e80ceb03f765842787e0e79ff598b7)
---
 net/ieee802154/6lowpan/reassembly.c     | 2 --
 net/ipv4/ip_fragment.c                  | 5 ++---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 2 --
 net/ipv6/reassembly.c                   | 4 +---
 4 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 44f148a6bb57..1790b65944b3 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -411,7 +411,6 @@ int lowpan_frag_rcv(struct sk_buff *skb, u8 frag_type)
 }
 
 #ifdef CONFIG_SYSCTL
-static long zero;
 
 static struct ctl_table lowpan_frags_ns_ctl_table[] = {
 	{
@@ -428,7 +427,6 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
 		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
-		.extra1		= &zero,
 		.extra2		= &init_net.ieee802154_lowpan.frags.high_thresh
 	},
 	{
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 5331a0d68374..d14d741fb05e 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -672,7 +672,7 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
 EXPORT_SYMBOL(ip_check_defrag);
 
 #ifdef CONFIG_SYSCTL
-static long zero;
+static int dist_min;
 
 static struct ctl_table ip4_frags_ns_ctl_table[] = {
 	{
@@ -689,7 +689,6 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = {
 		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
-		.extra1		= &zero,
 		.extra2		= &init_net.ipv4.frags.high_thresh
 	},
 	{
@@ -705,7 +704,7 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero
+		.extra1		= &dist_min,
 	},
 	{ }
 };
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 6613f81e553a..a1dc0d6a5949 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -63,7 +63,6 @@ struct nf_ct_frag6_skb_cb
 static struct inet_frags nf_frags;
 
 #ifdef CONFIG_SYSCTL
-static long zero;
 
 static struct ctl_table nf_ct_frag6_sysctl_table[] = {
 	{
@@ -79,7 +78,6 @@ static struct ctl_table nf_ct_frag6_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
-		.extra1		= &zero,
 		.extra2		= &init_net.nf_frag.frags.high_thresh
 	},
 	{
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 2127da130dc2..e1c5fa5e3873 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -554,7 +554,6 @@ static const struct inet6_protocol frag_protocol = {
 };
 
 #ifdef CONFIG_SYSCTL
-static int zero;
 
 static struct ctl_table ip6_frags_ns_ctl_table[] = {
 	{
@@ -570,8 +569,7 @@ static struct ctl_table ip6_frags_ns_ctl_table[] = {
 		.data		= &init_net.ipv6.frags.low_thresh,
 		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero,
+		.proc_handler	= proc_doulongvec_minmax,
 		.extra2		= &init_net.ipv6.frags.high_thresh
 	},
 	{
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 20/30] ip: discard IPv4 datagrams with overlapping segments.
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (18 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 19/30] inet: frags: fix ip6frag_low_thresh boundary Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 21/30] net: speed up skb_rbtree_purge() Stephen Hemminger
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Peter Oskolkov, Florian Westphal

From: Peter Oskolkov <posk@google.com>

This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.
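
Worked through on the new checks, with invented offsets: if prev
covers [0, 1400) and a new fragment claims [1200, 2600), then

	prev->ip_defrag_offset + prev->len == 1400 > offset == 1200

so ip_frag_queue() now jumps to discard_qp, kills the whole queue and
bumps the new ReasmOverlaps counter, instead of trimming the overlap
as the old code did.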

Tested: ran ip_defrag selftest (not yet available upstream).

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7969e5c40dfd04799d4341f1b7cd266b6e47f227)
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c    | 75 ++++++++++-----------------------------
 net/ipv4/proc.c           |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 0d941cdd8e8c..f5d753e60836 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -56,6 +56,7 @@ enum
 	IPSTATS_MIB_ECT1PKTS,			/* InECT1Pkts */
 	IPSTATS_MIB_ECT0PKTS,			/* InECT0Pkts */
 	IPSTATS_MIB_CEPKTS,			/* InCEPkts */
+	IPSTATS_MIB_REASM_OVERLAPS,		/* ReasmOverlaps */
 	__IPSTATS_MIB_MAX
 };
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d14d741fb05e..960bf5eab59f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -277,6 +277,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct sk_buff *prev, *next;
 	struct net_device *dev;
 	unsigned int fragsize;
@@ -357,65 +358,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	}
 
 found:
-	/* We found where to put this one.  Check for overlap with
-	 * preceding fragment, and, if needed, align things so that
-	 * any overlaps are eliminated.
+	/* RFC5722, Section 4, amended by Errata ID : 3089
+	 *                          When reassembling an IPv6 datagram, if
+	 *   one or more its constituent fragments is determined to be an
+	 *   overlapping fragment, the entire datagram (and any constituent
+	 *   fragments) MUST be silently discarded.
+	 *
+	 * We do the same here for IPv4.
 	 */
-	if (prev) {
-		int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-		if (i > 0) {
-			offset += i;
-			err = -EINVAL;
-			if (end <= offset)
-				goto err;
-			err = -ENOMEM;
-			if (!pskb_pull(skb, i))
-				goto err;
-			if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-				skb->ip_summed = CHECKSUM_NONE;
-		}
-	}
+	/* Is there an overlap with the previous fragment? */
+	if (prev &&
+	    (prev->ip_defrag_offset + prev->len) > offset)
+		goto discard_qp;
 
-	err = -ENOMEM;
-
-	while (next && next->ip_defrag_offset < end) {
-		int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-		if (i < next->len) {
-			int delta = -next->truesize;
-
-			/* Eat head of the next overlapped fragment
-			 * and leave the loop. The next ones cannot overlap.
-			 */
-			if (!pskb_pull(next, i))
-				goto err;
-			delta += next->truesize;
-			if (delta)
-				add_frag_mem_limit(qp->q.net, delta);
-			next->ip_defrag_offset += i;
-			qp->q.meat -= i;
-			if (next->ip_summed != CHECKSUM_UNNECESSARY)
-				next->ip_summed = CHECKSUM_NONE;
-			break;
-		} else {
-			struct sk_buff *free_it = next;
-
-			/* Old fragment is completely overridden with
-			 * new one drop it.
-			 */
-			next = next->next;
-
-			if (prev)
-				prev->next = next;
-			else
-				qp->q.fragments = next;
-
-			qp->q.meat -= free_it->len;
-			sub_frag_mem_limit(qp->q.net, free_it->truesize);
-			kfree_skb(free_it);
-		}
-	}
+	/* Is there an overlap with the next fragment? */
+	if (next && next->ip_defrag_offset < end)
+		goto discard_qp;
 
 	/* Note : skb->ip_defrag_offset and skb->dev share the same location */
 	dev = skb->dev;
@@ -463,6 +422,10 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	skb_dst_drop(skb);
 	return -EINPROGRESS;
 
+discard_qp:
+	inet_frag_kill(&qp->q);
+	err = -EINVAL;
+	__IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
 err:
 	kfree_skb(skb);
 	return err;
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 8fbd5633e544..3fbf688a1943 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -132,6 +132,7 @@ static const struct snmp_mib snmp4_ipextstats_list[] = {
 	SNMP_MIB_ITEM("InECT1Pkts", IPSTATS_MIB_ECT1PKTS),
 	SNMP_MIB_ITEM("InECT0Pkts", IPSTATS_MIB_ECT0PKTS),
 	SNMP_MIB_ITEM("InCEPkts", IPSTATS_MIB_CEPKTS),
+	SNMP_MIB_ITEM("ReasmOverlaps", IPSTATS_MIB_REASM_OVERLAPS),
 	SNMP_MIB_SENTINEL
 };
 
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 21/30] net: speed up skb_rbtree_purge()
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (19 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 20/30] ip: discard IPv4 datagrams with overlapping segments Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 22/30] net: modify skb_rbtree_purge to return the truesize of all purged skbs Stephen Hemminger
                   ` (9 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

As measured in my prior patch ("sch_netem: faster rb tree removal"),
rbtree_postorder_for_each_entry_safe() is nice looking but much slower
than using rb_next() directly, except when the tree is small enough
to fit in CPU caches (then the cost is the same).

Also note that there is no increase in text size:
$ size net/core/skbuff.o.before net/core/skbuff.o
   text	   data	    bss	    dec	    hex	filename
  40711	   1298	      0	  42009	   a419	net/core/skbuff.o.before
  40711	   1298	      0	  42009	   a419	net/core/skbuff.o

Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7c90584c66cc4b033a3b684b0e0950f79e7b7166)
---
 net/core/skbuff.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2e5eeba97de9..c7c5f05f2af1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2850,12 +2850,15 @@ EXPORT_SYMBOL(skb_queue_purge);
  */
 void skb_rbtree_purge(struct rb_root *root)
 {
-	struct sk_buff *skb, *next;
+	struct rb_node *p = rb_first(root);
 
-	rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
-		kfree_skb(skb);
+	while (p) {
+		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
-	*root = RB_ROOT;
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, root);
+		kfree_skb(skb);
+	}
 }
 
 /**
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 22/30] net: modify skb_rbtree_purge to return the truesize of all purged skbs.
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (20 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 21/30] net: speed up skb_rbtree_purge() Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu Stephen Hemminger
                   ` (8 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Peter Oskolkov, Florian Westphal

From: Peter Oskolkov <posk@google.com>

Tested: see the next patch in the series.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 385114dec8a49b5e5945e77ba7de6356106713f4)
---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c      | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f4749678b7ee..9c8457375aee 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2581,7 +2581,7 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c7c5f05f2af1..8fd690def5c1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2842,23 +2842,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  *	skb_rbtree_purge - empty a skb rbtree
  *	@root: root of the rbtree to empty
+ *	Return value: the sum of truesizes of all purged skbs.
  *
  *	Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  *	the list and one reference dropped. This function does not take
  *	any lock. Synchronization should be handled by the caller (e.g., TCP
  *	out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
 	struct rb_node *p = rb_first(root);
+	unsigned int sum = 0;
 
 	while (p) {
 		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
 		p = rb_next(p);
 		rb_erase(&skb->rbnode, root);
+		sum += skb->truesize;
 		kfree_skb(skb);
 	}
+	return sum;
 }
 
 /**
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (21 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 22/30] net: modify skb_rbtree_purge to return the truesize of all purged skbs Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2019-01-10 19:30   ` Tom Herbert
  2018-09-13 14:58 ` [PATCH v3 24/30] net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends Stephen Hemminger
                   ` (7 subsequent siblings)
  30 siblings, 1 reply; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Florian Westphal, Peter Oskolkov

From: Florian Westphal <fw@strlen.de>

Don't bother with pathological cases; they only waste cycles.
IPv6 requires a minimum MTU of 1280, so we should never see fragments
smaller than this (except the last fragment of a datagram).

v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
    There were concerns that there could be even smaller frags
    generated by intermediate nodes, e.g. on radio networks.
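
The check, worked through with an invented size: a 400-byte IPv6
fragment that still has the more-fragments bit set satisfies

	skb->len - skb_network_offset(skb) == 400 < IPV6_MIN_MTU (1280)
	fhdr->frag_off & htons(IP6_MF)     != 0

and is now dropped before any reassembly queue is looked up; only the
last fragment of a datagram may legitimately be this small.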

Cc: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0ed4229b08c13c84a3c301a08defdc9e7f4467e6)
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 ++++
 net/ipv6/reassembly.c                   | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index a1dc0d6a5949..1d2f07cde01a 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -565,6 +565,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
 	hdr = ipv6_hdr(skb);
 	fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+	if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+	    fhdr->frag_off & htons(IP6_MF))
+		return -EINVAL;
+
 	skb_orphan(skb);
 	fq = fq_find(net, fhdr->identification, user, hdr,
 		     skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e1c5fa5e3873..afaad60dc2ac 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -522,6 +522,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
 		return 1;
 	}
 
+	if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+	    fhdr->frag_off & htons(IP6_MF))
+		goto fail_hdr;
+
 	iif = skb->dev ? skb->dev->ifindex : 0;
 	fq = fq_find(net, fhdr->identification, hdr, iif);
 	if (fq) {
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 24/30] net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (22 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 25/30] net: add rb_to_skb() and other rb tree helpers Stephen Hemminger
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

After working on IP defragmentation lately, I found that some large
packets defeat the CHECKSUM_COMPLETE optimization because the NIC adds
zero padding on the last (small) fragment.

While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.

We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.
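
The identity behind this (a sketch): the Internet checksum is a ones'
complement sum, so the value over the kept bytes falls out of the full
value by subtraction:

	/* trimming skb from skb->len down to len bytes */
	int delta = skb->len - len;

	/* csum(0, len) == csum_sub(csum(0, skb->len), csum(len, delta)) */
	skb->csum = csum_sub(skb->csum, skb_checksum(skb, len, delta, 0));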

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 88078d98d1bb085d72af8437707279e203524fa5)
---
 include/linux/skbuff.h |  5 ++---
 net/core/skbuff.c      | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9c8457375aee..758084b434c8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3135,6 +3135,7 @@ static inline void *skb_push_rcsum(struct sk_buff *skb, unsigned int len)
 	return skb->data;
 }
 
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len);
 /**
  *	pskb_trim_rcsum - trim received skb and update checksum
  *	@skb: buffer to trim
@@ -3148,9 +3149,7 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, unsigned int len)
 {
 	if (likely(len >= skb->len))
 		return 0;
-	if (skb->ip_summed == CHECKSUM_COMPLETE)
-		skb->ip_summed = CHECKSUM_NONE;
-	return __pskb_trim(skb, len);
+	return pskb_trim_rcsum_slow(skb, len);
 }
 
 static inline int __skb_trim_rcsum(struct sk_buff *skb, unsigned int len)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8fd690def5c1..168a3e8883d4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1839,6 +1839,20 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 }
 EXPORT_SYMBOL(___pskb_trim);
 
+/* Note : use pskb_trim_rcsum() instead of calling this directly
+ */
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
+{
+	if (skb->ip_summed == CHECKSUM_COMPLETE) {
+		int delta = skb->len - len;
+
+		skb->csum = csum_sub(skb->csum,
+				     skb_checksum(skb, len, delta, 0));
+	}
+	return __pskb_trim(skb, len);
+}
+EXPORT_SYMBOL(pskb_trim_rcsum_slow);
+
 /**
  *	__pskb_pull_tail - advance tail of skb header
  *	@skb: buffer to reallocate
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 25/30] net: add rb_to_skb() and other rb tree helpers
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (23 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 24/30] net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:58 ` [PATCH v3 26/30] net: sk_buff rbnode reorg Stephen Hemminger
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet

From: Eric Dumazet <edumazet@google.com>

Generalize the private netem_rb_to_skb() helper.

The TCP rtx queue will soon be converted to an rb-tree,
so we will need the skb_rbtree_walk() helpers.
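
How the helpers read at a call site (a hypothetical function, not part
of the patch):

	#include <linux/skbuff.h>

	static unsigned int ooo_queue_len(const struct rb_root *root)
	{
		struct sk_buff *skb;
		unsigned int n = 0;

		/* expands to rb_first()/rb_next() via rb_entry_safe() */
		skb_rbtree_walk(skb, root)
			n++;
		return n;
	}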

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 18a4c0eab2623cc95be98a1e6af1ad18e7695977)
---
 include/linux/skbuff.h  | 18 ++++++++++++++++++
 net/ipv4/tcp_fastopen.c |  8 +++-----
 net/ipv4/tcp_input.c    | 33 ++++++++++++---------------------
 net/sched/sch_netem.c   | 14 ++++----------
 4 files changed, 37 insertions(+), 36 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 758084b434c8..2837e55df03e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3169,6 +3169,12 @@ static inline int __skb_grow_rcsum(struct sk_buff *skb, unsigned int len)
 
 #define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
 
+#define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
 #define skb_queue_walk(queue, skb) \
 		for (skb = (queue)->next;					\
 		     skb != (struct sk_buff *)(queue);				\
@@ -3183,6 +3189,18 @@ static inline int __skb_grow_rcsum(struct sk_buff *skb, unsigned int len)
 		for (; skb != (struct sk_buff *)(queue);			\
 		     skb = skb->next)
 
+#define skb_rbtree_walk(skb, root)						\
+		for (skb = skb_rb_first(root); skb != NULL;			\
+		     skb = skb_rb_next(skb))
+
+#define skb_rbtree_walk_from(skb)						\
+		for (; skb != NULL;						\
+		     skb = skb_rb_next(skb))
+
+#define skb_rbtree_walk_from_safe(skb, tmp)					\
+		for (; tmp = skb ? skb_rb_next(skb) : NULL, (skb != NULL);	\
+		     skb = tmp)
+
 #define skb_queue_walk_from_safe(queue, skb, tmp)				\
 		for (tmp = skb->next;						\
 		     skb != (struct sk_buff *)(queue);				\
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index fbbeda647774..0567edb76522 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -458,17 +458,15 @@ bool tcp_fastopen_active_should_disable(struct sock *sk)
 void tcp_fastopen_active_disable_ofo_check(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct rb_node *p;
-	struct sk_buff *skb;
 	struct dst_entry *dst;
+	struct sk_buff *skb;
 
 	if (!tp->syn_fastopen)
 		return;
 
 	if (!tp->data_segs_in) {
-		p = rb_first(&tp->out_of_order_queue);
-		if (p && !rb_next(p)) {
-			skb = rb_entry(p, struct sk_buff, rbnode);
+		skb = skb_rb_first(&tp->out_of_order_queue);
+		if (skb && !skb_rb_next(skb)) {
 			if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
 				tcp_fastopen_active_disable(sk);
 				return;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bdabd748f4bc..991f382afc1b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4372,7 +4372,7 @@ static void tcp_ofo_queue(struct sock *sk)
 
 	p = rb_first(&tp->out_of_order_queue);
 	while (p) {
-		skb = rb_entry(p, struct sk_buff, rbnode);
+		skb = rb_to_skb(p);
 		if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
 			break;
 
@@ -4440,7 +4440,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct rb_node **p, *q, *parent;
+	struct rb_node **p, *parent;
 	struct sk_buff *skb1;
 	u32 seq, end_seq;
 	bool fragstolen;
@@ -4503,7 +4503,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	parent = NULL;
 	while (*p) {
 		parent = *p;
-		skb1 = rb_entry(parent, struct sk_buff, rbnode);
+		skb1 = rb_to_skb(parent);
 		if (before(seq, TCP_SKB_CB(skb1)->seq)) {
 			p = &parent->rb_left;
 			continue;
@@ -4548,9 +4548,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 
 merge_right:
 	/* Remove other segments covered by skb. */
-	while ((q = rb_next(&skb->rbnode)) != NULL) {
-		skb1 = rb_entry(q, struct sk_buff, rbnode);
-
+	while ((skb1 = skb_rb_next(skb)) != NULL) {
 		if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
 			break;
 		if (before(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
@@ -4565,7 +4563,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		tcp_drop(sk, skb1);
 	}
 	/* If there is no skb after us, we are the last_skb ! */
-	if (!q)
+	if (!skb1)
 		tp->ooo_last_skb = skb;
 
 add_sack:
@@ -4749,7 +4747,7 @@ static struct sk_buff *tcp_skb_next(struct sk_buff *skb, struct sk_buff_head *li
 	if (list)
 		return !skb_queue_is_last(list, skb) ? skb->next : NULL;
 
-	return rb_entry_safe(rb_next(&skb->rbnode), struct sk_buff, rbnode);
+	return skb_rb_next(skb);
 }
 
 static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
@@ -4778,7 +4776,7 @@ static void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb)
 
 	while (*p) {
 		parent = *p;
-		skb1 = rb_entry(parent, struct sk_buff, rbnode);
+		skb1 = rb_to_skb(parent);
 		if (before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb1)->seq))
 			p = &parent->rb_left;
 		else
@@ -4898,19 +4896,12 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb, *head;
-	struct rb_node *p;
 	u32 start, end;
 
-	p = rb_first(&tp->out_of_order_queue);
-	skb = rb_entry_safe(p, struct sk_buff, rbnode);
+	skb = skb_rb_first(&tp->out_of_order_queue);
 new_range:
 	if (!skb) {
-		p = rb_last(&tp->out_of_order_queue);
-		/* Note: This is possible p is NULL here. We do not
-		 * use rb_entry_safe(), as ooo_last_skb is valid only
-		 * if rbtree is not empty.
-		 */
-		tp->ooo_last_skb = rb_entry(p, struct sk_buff, rbnode);
+		tp->ooo_last_skb = skb_rb_last(&tp->out_of_order_queue);
 		return;
 	}
 	start = TCP_SKB_CB(skb)->seq;
@@ -4918,7 +4909,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 	range_truesize = skb->truesize;
 
 	for (head = skb;;) {
-		skb = tcp_skb_next(skb, NULL);
+		skb = skb_rb_next(skb);
 
 		/* Range is terminated when we see a gap or when
 		 * we are at the queue end.
@@ -4974,7 +4965,7 @@ static bool tcp_prune_ofo_queue(struct sock *sk)
 		prev = rb_prev(node);
 		rb_erase(node, &tp->out_of_order_queue);
 		goal -= rb_to_skb(node)->truesize;
-		tcp_drop(sk, rb_entry(node, struct sk_buff, rbnode));
+		tcp_drop(sk, rb_to_skb(node));
 		if (!prev || goal <= 0) {
 			sk_mem_reclaim(sk);
 			if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
@@ -4984,7 +4975,7 @@ static bool tcp_prune_ofo_queue(struct sock *sk)
 		}
 		node = prev;
 	} while (node);
-	tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
+	tp->ooo_last_skb = rb_to_skb(prev);
 
 	/* Reset SACK state.  A conforming SACK implementation will
 	 * do the same at a timeout based retransmit.  When a connection
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 8c8df75dbead..2a2ab6bfe5d8 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -149,12 +149,6 @@ struct netem_skb_cb {
 	ktime_t		tstamp_save;
 };
 
-
-static struct sk_buff *netem_rb_to_skb(struct rb_node *rb)
-{
-	return rb_entry(rb, struct sk_buff, rbnode);
-}
-
 static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
 {
 	/* we assume we can use skb next/prev/tstamp as storage for rb_node */
@@ -365,7 +359,7 @@ static void tfifo_reset(struct Qdisc *sch)
 	struct rb_node *p;
 
 	while ((p = rb_first(&q->t_root))) {
-		struct sk_buff *skb = netem_rb_to_skb(p);
+		struct sk_buff *skb = rb_to_skb(p);
 
 		rb_erase(p, &q->t_root);
 		rtnl_kfree_skbs(skb, skb);
@@ -382,7 +376,7 @@ static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 		struct sk_buff *skb;
 
 		parent = *p;
-		skb = netem_rb_to_skb(parent);
+		skb = rb_to_skb(parent);
 		if (tnext >= netem_skb_cb(skb)->time_to_send)
 			p = &parent->rb_right;
 		else
@@ -538,7 +532,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 				struct sk_buff *t_skb;
 				struct netem_skb_cb *t_last;
 
-				t_skb = netem_rb_to_skb(rb_last(&q->t_root));
+				t_skb = skb_rb_last(&q->t_root);
 				t_last = netem_skb_cb(t_skb);
 				if (!last ||
 				    t_last->time_to_send > last->time_to_send) {
@@ -618,7 +612,7 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 	if (p) {
 		psched_time_t time_to_send;
 
-		skb = netem_rb_to_skb(p);
+		skb = rb_to_skb(p);
 
 		/* if more time remaining? */
 		time_to_send = netem_skb_cb(skb)->time_to_send;
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 26/30] net: sk_buff rbnode reorg
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (24 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 25/30] net: add rb_to_skb() and other rb tree helpers Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-10-18 16:01   ` Christoph Paasch
  2018-09-13 14:58 ` [PATCH v3 27/30] ipv4: frags: precedence bug in ip_expire() Stephen Hemminger
                   ` (4 subsequent siblings)
  30 siblings, 1 reply; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh
  Cc: netdev, stable, edumazet, Soheil Hassas Yeganeh, Wei Wang,
	Willem de Bruijn

From: Eric Dumazet <edumazet@google.com>

commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream

skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).

Since we plan to use an RB tree for the TCP retransmit queue to speed
up SACK processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.

This saves some overhead in both TCP and netem.

v2: removes the swtstamp field from struct tcp_skb_cb
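
Condensed sketch of the resulting layout (see the skbuff.h hunk below
for the real thing):

	union {
		struct {
			struct sk_buff		*next, *prev;
			union {
				struct net_device *dev;
				unsigned long	   dev_scratch;
			};
		};
		struct rb_node	rbnode;	/* overlays next/prev/dev */
		struct list_head list;
	};
	...
	union {
		ktime_t	tstamp;		/* now outside the rbnode union, */
		u64	skb_mstamp;	/* so rb-tree users keep it      */
	};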

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/skbuff.h                  |  24 ++--
 include/net/inet_frag.h                 |   3 +-
 net/ipv4/inet_fragment.c                |  16 ++-
 net/ipv4/ip_fragment.c                  | 182 +++++++++++++-----------
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c                   |   1 +
 6 files changed, 129 insertions(+), 98 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2837e55df03e..f64e88444082 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -663,23 +663,27 @@ struct sk_buff {
 			struct sk_buff		*prev;
 
 			union {
-				ktime_t		tstamp;
-				u64		skb_mstamp;
+				struct net_device	*dev;
+				/* Some protocols might use this space to store information,
+				 * while device pointer would be NULL.
+				 * UDP receive path is one user.
+				 */
+				unsigned long		dev_scratch;
 			};
 		};
-		struct rb_node	rbnode; /* used in netem & tcp stack */
+		struct rb_node		rbnode; /* used in netem, ip4 defrag, and tcp stack */
+		struct list_head	list;
 	};
-	struct sock		*sk;
 
 	union {
-		struct net_device	*dev;
-		/* Some protocols might use this space to store information,
-		 * while device pointer would be NULL.
-		 * UDP receive path is one user.
-		 */
-		unsigned long		dev_scratch;
+		struct sock		*sk;
 		int			ip_defrag_offset;
 	};
+
+	union {
+		ktime_t		tstamp;
+		u64		skb_mstamp;
+	};
 	/*
 	 * This is the control buffer. It is free to use for every
 	 * layer. Please put your private variables there. If you
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index ed07e3786d98..e4c71a7644be 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,7 +75,8 @@ struct inet_frag_queue {
 	struct timer_list	timer;
 	spinlock_t		lock;
 	refcount_t		refcnt;
-	struct sk_buff		*fragments;
+	struct sk_buff		*fragments;  /* Used in IPv6. */
+	struct rb_root		rb_fragments; /* Used in IPv4. */
 	struct sk_buff		*fragments_tail;
 	ktime_t			stamp;
 	int			len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index c9e35b81d093..6904cbb7de1a 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -136,12 +136,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
 	fp = q->fragments;
 	nf = q->net;
 	f = nf->f;
-	while (fp) {
-		struct sk_buff *xp = fp->next;
-
-		sum_truesize += fp->truesize;
-		kfree_skb(fp);
-		fp = xp;
+	if (fp) {
+		do {
+			struct sk_buff *xp = fp->next;
+
+			sum_truesize += fp->truesize;
+			kfree_skb(fp);
+			fp = xp;
+		} while (fp);
+	} else {
+		sum_truesize = skb_rbtree_purge(&q->rb_fragments);
 	}
 	sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 960bf5eab59f..0e8f8de77e71 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
 	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	const struct iphdr *iph;
-	struct sk_buff *head;
+	struct sk_buff *head = NULL;
 	struct net *net;
 	struct ipq *qp;
 	int err;
@@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
 
 	ipq_kill(qp);
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-	head = qp->q.fragments;
-
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+	if (!qp->q.flags & INET_FRAG_FIRST_IN)
 		goto out;
 
+	/* sk_buff::dev and sk_buff::rbnode are unionized. So we
+	 * pull the head out of the tree in order to be able to
+	 * deal with head->dev.
+	 */
+	if (qp->q.fragments) {
+		head = qp->q.fragments;
+		qp->q.fragments = head->next;
+	} else {
+		head = skb_rb_first(&qp->q.rb_fragments);
+		if (!head)
+			goto out;
+		rb_erase(&head->rbnode, &qp->q.rb_fragments);
+		memset(&head->rbnode, 0, sizeof(head->rbnode));
+		barrier();
+	}
+	if (head == qp->q.fragments_tail)
+		qp->q.fragments_tail = NULL;
+
+	sub_frag_mem_limit(qp->q.net, head->truesize);
+
 	head->dev = dev_get_by_index_rcu(net, qp->iif);
 	if (!head->dev)
 		goto out;
@@ -179,16 +196,16 @@ static void ip_expire(struct timer_list *t)
 	    (skb_rtable(head)->rt_type != RTN_LOCAL))
 		goto out;
 
-	skb_get(head);
 	spin_unlock(&qp->q.lock);
 	icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
-	kfree_skb(head);
 	goto out_rcu_unlock;
 
 out:
 	spin_unlock(&qp->q.lock);
 out_rcu_unlock:
 	rcu_read_unlock();
+	if (head)
+		kfree_skb(head);
 	ipq_put(qp);
 }
 
@@ -231,7 +248,7 @@ static int ip_frag_too_far(struct ipq *qp)
 	end = atomic_inc_return(&peer->rid);
 	qp->rid = end;
 
-	rc = qp->q.fragments && (end - start) > max;
+	rc = qp->q.fragments_tail && (end - start) > max;
 
 	if (rc) {
 		struct net *net;
@@ -245,7 +262,6 @@ static int ip_frag_too_far(struct ipq *qp)
 
 static int ip_frag_reinit(struct ipq *qp)
 {
-	struct sk_buff *fp;
 	unsigned int sum_truesize = 0;
 
 	if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
@@ -253,20 +269,14 @@ static int ip_frag_reinit(struct ipq *qp)
 		return -ETIMEDOUT;
 	}
 
-	fp = qp->q.fragments;
-	do {
-		struct sk_buff *xp = fp->next;
-
-		sum_truesize += fp->truesize;
-		kfree_skb(fp);
-		fp = xp;
-	} while (fp);
+	sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
 	sub_frag_mem_limit(qp->q.net, sum_truesize);
 
 	qp->q.flags = 0;
 	qp->q.len = 0;
 	qp->q.meat = 0;
 	qp->q.fragments = NULL;
+	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
 	qp->iif = 0;
 	qp->ecn = 0;
@@ -278,7 +288,8 @@ static int ip_frag_reinit(struct ipq *qp)
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
-	struct sk_buff *prev, *next;
+	struct rb_node **rbn, *parent;
+	struct sk_buff *skb1;
 	struct net_device *dev;
 	unsigned int fragsize;
 	int flags, offset;
@@ -341,58 +352,58 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	if (err)
 		goto err;
 
-	/* Find out which fragments are in front and at the back of us
-	 * in the chain of fragments so far.  We must know where to put
-	 * this fragment, right?
-	 */
-	prev = qp->q.fragments_tail;
-	if (!prev || prev->ip_defrag_offset < offset) {
-		next = NULL;
-		goto found;
-	}
-	prev = NULL;
-	for (next = qp->q.fragments; next != NULL; next = next->next) {
-		if (next->ip_defrag_offset >= offset)
-			break;	/* bingo! */
-		prev = next;
-	}
+	/* Note : skb->rbnode and skb->dev share the same location. */
+	dev = skb->dev;
+	/* Makes sure compiler wont do silly aliasing games */
+	barrier();
 
-found:
 	/* RFC5722, Section 4, amended by Errata ID : 3089
 	 *                          When reassembling an IPv6 datagram, if
 	 *   one or more its constituent fragments is determined to be an
 	 *   overlapping fragment, the entire datagram (and any constituent
 	 *   fragments) MUST be silently discarded.
 	 *
-	 * We do the same here for IPv4.
+	 * We do the same here for IPv4 (and increment an snmp counter).
 	 */
 
-	/* Is there an overlap with the previous fragment? */
-	if (prev &&
-	    (prev->ip_defrag_offset + prev->len) > offset)
-		goto discard_qp;
-
-	/* Is there an overlap with the next fragment? */
-	if (next && next->ip_defrag_offset < end)
-		goto discard_qp;
+	/* Find out where to put this fragment.  */
+	skb1 = qp->q.fragments_tail;
+	if (!skb1) {
+		/* This is the first fragment we've received. */
+		rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
+		qp->q.fragments_tail = skb;
+	} else if ((skb1->ip_defrag_offset + skb1->len) < end) {
+		/* This is the common/special case: skb goes to the end. */
+		/* Detect and discard overlaps. */
+		if (offset < (skb1->ip_defrag_offset + skb1->len))
+			goto discard_qp;
+		/* Insert after skb1. */
+		rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
+		qp->q.fragments_tail = skb;
+	} else {
+		/* Binary search. Note that skb can become the first fragment, but
+		 * not the last (covered above). */
+		rbn = &qp->q.rb_fragments.rb_node;
+		do {
+			parent = *rbn;
+			skb1 = rb_to_skb(parent);
+			if (end <= skb1->ip_defrag_offset)
+				rbn = &parent->rb_left;
+			else if (offset >= skb1->ip_defrag_offset + skb1->len)
+				rbn = &parent->rb_right;
+			else /* Found an overlap with skb1. */
+				goto discard_qp;
+		} while (*rbn);
+		/* Here we have parent properly set, and rbn pointing to
+		 * one of its NULL left/right children. Insert skb. */
+		rb_link_node(&skb->rbnode, parent, rbn);
+	}
+	rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 
-	/* Note : skb->ip_defrag_offset and skb->dev share the same location */
-	dev = skb->dev;
 	if (dev)
 		qp->iif = dev->ifindex;
-	/* Makes sure compiler wont do silly aliasing games */
-	barrier();
 	skb->ip_defrag_offset = offset;
 
-	/* Insert this fragment in the chain of fragments. */
-	skb->next = next;
-	if (!next)
-		qp->q.fragments_tail = skb;
-	if (prev)
-		prev->next = skb;
-	else
-		qp->q.fragments = skb;
-
 	qp->q.stamp = skb->tstamp;
 	qp->q.meat += skb->len;
 	qp->ecn |= ecn;
@@ -414,7 +425,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, prev, dev);
+		err = ip_frag_reasm(qp, skb, dev);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -431,15 +442,15 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	return err;
 }
 
-
 /* Build a new IP datagram from all its fragments. */
-
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			 struct net_device *dev)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
-	struct sk_buff *fp, *head = qp->q.fragments;
+	struct sk_buff *fp, *head = skb_rb_first(&qp->q.rb_fragments);
+	struct sk_buff **nextp; /* To build frag_list. */
+	struct rb_node *rbn;
 	int len;
 	int ihlen;
 	int err;
@@ -453,25 +464,20 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		goto out_fail;
 	}
 	/* Make the one we just received the head. */
-	if (prev) {
-		head = prev->next;
-		fp = skb_clone(head, GFP_ATOMIC);
+	if (head != skb) {
+		fp = skb_clone(skb, GFP_ATOMIC);
 		if (!fp)
 			goto out_nomem;
-
-		fp->next = head->next;
-		if (!fp->next)
+		rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
+		if (qp->q.fragments_tail == skb)
 			qp->q.fragments_tail = fp;
-		prev->next = fp;
-
-		skb_morph(head, qp->q.fragments);
-		head->next = qp->q.fragments->next;
-
-		consume_skb(qp->q.fragments);
-		qp->q.fragments = head;
+		skb_morph(skb, head);
+		rb_replace_node(&head->rbnode, &skb->rbnode,
+				&qp->q.rb_fragments);
+		consume_skb(head);
+		head = skb;
 	}
 
-	WARN_ON(!head);
 	WARN_ON(head->ip_defrag_offset != 0);
 
 	/* Allocate a new buffer for the datagram. */
@@ -496,24 +502,35 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		clone = alloc_skb(0, GFP_ATOMIC);
 		if (!clone)
 			goto out_nomem;
-		clone->next = head->next;
-		head->next = clone;
 		skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
 		skb_frag_list_init(head);
 		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
 			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
 		clone->len = clone->data_len = head->data_len - plen;
-		head->data_len -= clone->len;
-		head->len -= clone->len;
+		skb->truesize += clone->truesize;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
 		add_frag_mem_limit(qp->q.net, clone->truesize);
+		skb_shinfo(head)->frag_list = clone;
+		nextp = &clone->next;
+	} else {
+		nextp = &skb_shinfo(head)->frag_list;
 	}
 
-	skb_shinfo(head)->frag_list = head->next;
 	skb_push(head, head->data - skb_network_header(head));
 
-	for (fp=head->next; fp; fp = fp->next) {
+	/* Traverse the tree in order, to build frag_list. */
+	rbn = rb_next(&head->rbnode);
+	rb_erase(&head->rbnode, &qp->q.rb_fragments);
+	while (rbn) {
+		struct rb_node *rbnext = rb_next(rbn);
+		fp = rb_to_skb(rbn);
+		rb_erase(rbn, &qp->q.rb_fragments);
+		rbn = rbnext;
+		*nextp = fp;
+		nextp = &fp->next;
+		fp->prev = NULL;
+		memset(&fp->rbnode, 0, sizeof(fp->rbnode));
 		head->data_len += fp->len;
 		head->len += fp->len;
 		if (head->ip_summed != fp->ip_summed)
@@ -524,7 +541,9 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	}
 	sub_frag_mem_limit(qp->q.net, head->truesize);
 
+	*nextp = NULL;
 	head->next = NULL;
+	head->prev = NULL;
 	head->dev = dev;
 	head->tstamp = qp->q.stamp;
 	IPCB(head)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
@@ -552,6 +571,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
+	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
 	return 0;
 
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 1d2f07cde01a..82ce0d0f54bf 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -471,6 +471,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
 					  head->csum);
 
 	fq->q.fragments = NULL;
+	fq->q.rb_fragments = RB_ROOT;
 	fq->q.fragments_tail = NULL;
 
 	return true;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index afaad60dc2ac..ede0061b6f5d 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -472,6 +472,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 	__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMOKS);
 	rcu_read_unlock();
 	fq->q.fragments = NULL;
+	fq->q.rb_fragments = RB_ROOT;
 	fq->q.fragments_tail = NULL;
 	return 1;
 
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 27/30] ipv4: frags: precedence bug in ip_expire()
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (25 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 26/30] net: sk_buff rbnode reorg Stephen Hemminger
@ 2018-09-13 14:58 ` Stephen Hemminger
  2018-09-13 14:59 ` [PATCH v3 28/30] ip: add helpers to process in-order fragments faster Stephen Hemminger
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:58 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Dan Carpenter

From: Dan Carpenter <dan.carpenter@oracle.com>

We accidentally removed the parentheses here, but they are required
because '!' has higher precedence than '&'.
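
Worked through with invented flag values: take qp->q.flags == 0x1 and
INET_FRAG_FIRST_IN == 0x4, then

	!qp->q.flags & INET_FRAG_FIRST_IN	/* buggy           */
		== (!0x1) & 0x4 == 0 & 0x4 == 0	/* never taken     */

	!(qp->q.flags & INET_FRAG_FIRST_IN)	/* intended        */
		== !(0x1 & 0x4) == !0 == 1	/* branch is taken */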

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 70837ffe3085c9a91488b52ca13ac84424da1042)
---
 net/ipv4/ip_fragment.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 0e8f8de77e71..7cb7ed761d8c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -154,7 +154,7 @@ static void ip_expire(struct timer_list *t)
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-	if (!qp->q.flags & INET_FRAG_FIRST_IN)
+	if (!(qp->q.flags & INET_FRAG_FIRST_IN))
 		goto out;
 
 	/* sk_buff::dev and sk_buff::rbnode are unionized. So we
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 28/30] ip: add helpers to process in-order fragments faster.
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (26 preceding siblings ...)
  2018-09-13 14:58 ` [PATCH v3 27/30] ipv4: frags: precedence bug in ip_expire() Stephen Hemminger
@ 2018-09-13 14:59 ` Stephen Hemminger
  2018-09-13 14:59 ` [PATCH v3 29/30] ip: process in-order fragments efficiently Stephen Hemminger
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:59 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Peter Oskolkov, Florian Westphal

From: Peter Oskolkov <posk@google.com>

This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run and is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).
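
The invariants, sketched against the new helpers: for a tail run of
three adjacent fragments A -> B -> C,

	FRAG_CB(A)->frag_run_len == A->len + B->len + C->len
	FRAG_CB(C)->next_frag    == NULL

An in-order skb extends the run without rebalancing the tree via
ip4_frag_append_to_last_run(q, skb), while a gap (or the very first
fragment) starts a new run via ip4_frag_create_run(q, skb), i.e. a
real rb-tree insertion.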

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
---
 include/net/inet_frag.h |  6 ++++
 net/ipv4/ip_fragment.c  | 73 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index e4c71a7644be..335cf7851f12 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -57,7 +57,9 @@ struct frag_v6_compare_key {
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
+ * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
+ * @last_run_head: the head of the last "run". see ip_fragment.c
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
@@ -78,6 +80,7 @@ struct inet_frag_queue {
 	struct sk_buff		*fragments;  /* Used in IPv6. */
 	struct rb_root		rb_fragments; /* Used in IPv4. */
 	struct sk_buff		*fragments_tail;
+	struct sk_buff		*last_run_head;
 	ktime_t			stamp;
 	int			len;
 	int			meat;
@@ -113,6 +116,9 @@ void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
 	if (refcount_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7cb7ed761d8c..26ace9d2d976 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -57,6 +57,57 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+	struct inet_skb_parm	h;
+	struct sk_buff		*next_frag;
+	int			frag_run_len;
+};
+
+#define FRAG_CB(skb)		((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+	FRAG_CB(skb)->next_frag = NULL;
+	FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+					struct sk_buff *skb)
+{
+	RB_CLEAR_NODE(&skb->rbnode);
+	FRAG_CB(skb)->next_frag = NULL;
+
+	FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+	FRAG_CB(q->fragments_tail)->next_frag = skb;
+	q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+	if (q->last_run_head)
+		rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+			     &q->last_run_head->rbnode.rb_right);
+	else
+		rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+	rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+	ip4_frag_init_run(skb);
+	q->fragments_tail = skb;
+	q->last_run_head = skb;
+}
+
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
 	struct inet_frag_queue q;
@@ -654,6 +705,28 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_check_defrag);
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+	struct rb_node *p = rb_first(root);
+	unsigned int sum = 0;
+
+	while (p) {
+		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, root);
+		while (skb) {
+			struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+			sum += skb->truesize;
+			kfree_skb(skb);
+			skb = next;
+		}
+	}
+	return sum;
+}
+EXPORT_SYMBOL(inet_frag_rbtree_purge);
+
 #ifdef CONFIG_SYSCTL
 static int dist_min;
 
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 29/30] ip: process in-order fragments efficiently
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (27 preceding siblings ...)
  2018-09-13 14:59 ` [PATCH v3 28/30] ip: add helpers to process in-order fragments faster Stephen Hemminger
@ 2018-09-13 14:59 ` Stephen Hemminger
  2018-09-13 14:59 ` [PATCH v3 30/30] ip: frags: fix crash in ip_do_fragment() Stephen Hemminger
  2018-09-17 12:47 ` [PATCH v3 00/30] backport of IP fragmentation fixes Greg KH
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:59 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Peter Oskolkov, Florian Westphal

From: Peter Oskolkov <posk@google.com>

This patch changes the runtime behavior of the IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s
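
(i.e., roughly a 2.1x improvement: 9524.18 / 4608.93 ≈ 2.07)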

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c)
---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 ++++++++++++++++++++++++---------------
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 6904cbb7de1a..f6764537148c 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -145,7 +145,7 @@ void inet_frag_destroy(struct inet_frag_queue *q)
 			fp = xp;
 		} while (fp);
 	} else {
-		sum_truesize = skb_rbtree_purge(&q->rb_fragments);
+		sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
 	}
 	sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 26ace9d2d976..88281fbce88c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -126,8 +126,8 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-			 struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+			 struct sk_buff *prev_tail, struct net_device *dev);
 
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -219,7 +219,12 @@ static void ip_expire(struct timer_list *t)
 		head = skb_rb_first(&qp->q.rb_fragments);
 		if (!head)
 			goto out;
-		rb_erase(&head->rbnode, &qp->q.rb_fragments);
+		if (FRAG_CB(head)->next_frag)
+			rb_replace_node(&head->rbnode,
+					&FRAG_CB(head)->next_frag->rbnode,
+					&qp->q.rb_fragments);
+		else
+			rb_erase(&head->rbnode, &qp->q.rb_fragments);
 		memset(&head->rbnode, 0, sizeof(head->rbnode));
 		barrier();
 	}
@@ -320,7 +325,7 @@ static int ip_frag_reinit(struct ipq *qp)
 		return -ETIMEDOUT;
 	}
 
-	sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
+	sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
 	sub_frag_mem_limit(qp->q.net, sum_truesize);
 
 	qp->q.flags = 0;
@@ -329,6 +334,7 @@ static int ip_frag_reinit(struct ipq *qp)
 	qp->q.fragments = NULL;
 	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
+	qp->q.last_run_head = NULL;
 	qp->iif = 0;
 	qp->ecn = 0;
 
@@ -340,7 +346,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct rb_node **rbn, *parent;
-	struct sk_buff *skb1;
+	struct sk_buff *skb1, *prev_tail;
 	struct net_device *dev;
 	unsigned int fragsize;
 	int flags, offset;
@@ -418,38 +424,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	 */
 
 	/* Find out where to put this fragment.  */
-	skb1 = qp->q.fragments_tail;
-	if (!skb1) {
-		/* This is the first fragment we've received. */
-		rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
-		qp->q.fragments_tail = skb;
-	} else if ((skb1->ip_defrag_offset + skb1->len) < end) {
-		/* This is the common/special case: skb goes to the end. */
+	prev_tail = qp->q.fragments_tail;
+	if (!prev_tail)
+		ip4_frag_create_run(&qp->q, skb);  /* First fragment. */
+	else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
+		/* This is the common case: skb goes to the end. */
 		/* Detect and discard overlaps. */
-		if (offset < (skb1->ip_defrag_offset + skb1->len))
+		if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
 			goto discard_qp;
-		/* Insert after skb1. */
-		rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
-		qp->q.fragments_tail = skb;
+		if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
+			ip4_frag_append_to_last_run(&qp->q, skb);
+		else
+			ip4_frag_create_run(&qp->q, skb);
 	} else {
-		/* Binary search. Note that skb can become the first fragment, but
-		 * not the last (covered above). */
+		/* Binary search. Note that skb can become the first fragment,
+		 * but not the last (covered above).
+		 */
 		rbn = &qp->q.rb_fragments.rb_node;
 		do {
 			parent = *rbn;
 			skb1 = rb_to_skb(parent);
 			if (end <= skb1->ip_defrag_offset)
 				rbn = &parent->rb_left;
-			else if (offset >= skb1->ip_defrag_offset + skb1->len)
+			else if (offset >= skb1->ip_defrag_offset +
+						FRAG_CB(skb1)->frag_run_len)
 				rbn = &parent->rb_right;
 			else /* Found an overlap with skb1. */
 				goto discard_qp;
 		} while (*rbn);
 		/* Here we have parent properly set, and rbn pointing to
-		 * one of its NULL left/right children. Insert skb. */
+		 * one of its NULL left/right children. Insert skb.
+		 */
+		ip4_frag_init_run(skb);
 		rb_link_node(&skb->rbnode, parent, rbn);
+		rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 	}
-	rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 
 	if (dev)
 		qp->iif = dev->ifindex;
@@ -476,7 +485,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, skb, dev);
+		err = ip_frag_reasm(qp, skb, prev_tail, dev);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -495,7 +504,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 
 /* Build a new IP datagram from all its fragments. */
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
-			 struct net_device *dev)
+			 struct sk_buff *prev_tail, struct net_device *dev)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
@@ -519,10 +528,16 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 		fp = skb_clone(skb, GFP_ATOMIC);
 		if (!fp)
 			goto out_nomem;
-		rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
+		FRAG_CB(fp)->next_frag = FRAG_CB(skb)->next_frag;
+		if (RB_EMPTY_NODE(&skb->rbnode))
+			FRAG_CB(prev_tail)->next_frag = fp;
+		else
+			rb_replace_node(&skb->rbnode, &fp->rbnode,
+					&qp->q.rb_fragments);
 		if (qp->q.fragments_tail == skb)
 			qp->q.fragments_tail = fp;
 		skb_morph(skb, head);
+		FRAG_CB(skb)->next_frag = FRAG_CB(head)->next_frag;
 		rb_replace_node(&head->rbnode, &skb->rbnode,
 				&qp->q.rb_fragments);
 		consume_skb(head);
@@ -558,7 +573,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
 			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
 		clone->len = clone->data_len = head->data_len - plen;
-		skb->truesize += clone->truesize;
+		head->truesize += clone->truesize;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
 		add_frag_mem_limit(qp->q.net, clone->truesize);
@@ -571,24 +586,36 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	skb_push(head, head->data - skb_network_header(head));
 
 	/* Traverse the tree in order, to build frag_list. */
+	fp = FRAG_CB(head)->next_frag;
 	rbn = rb_next(&head->rbnode);
 	rb_erase(&head->rbnode, &qp->q.rb_fragments);
-	while (rbn) {
-		struct rb_node *rbnext = rb_next(rbn);
-		fp = rb_to_skb(rbn);
-		rb_erase(rbn, &qp->q.rb_fragments);
-		rbn = rbnext;
-		*nextp = fp;
-		nextp = &fp->next;
-		fp->prev = NULL;
-		memset(&fp->rbnode, 0, sizeof(fp->rbnode));
-		head->data_len += fp->len;
-		head->len += fp->len;
-		if (head->ip_summed != fp->ip_summed)
-			head->ip_summed = CHECKSUM_NONE;
-		else if (head->ip_summed == CHECKSUM_COMPLETE)
-			head->csum = csum_add(head->csum, fp->csum);
-		head->truesize += fp->truesize;
+	while (rbn || fp) {
+		/* fp points to the next sk_buff in the current run;
+		 * rbn points to the next run.
+		 */
+		/* Go through the current run. */
+		while (fp) {
+			*nextp = fp;
+			nextp = &fp->next;
+			fp->prev = NULL;
+			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+			head->data_len += fp->len;
+			head->len += fp->len;
+			if (head->ip_summed != fp->ip_summed)
+				head->ip_summed = CHECKSUM_NONE;
+			else if (head->ip_summed == CHECKSUM_COMPLETE)
+				head->csum = csum_add(head->csum, fp->csum);
+			head->truesize += fp->truesize;
+			fp = FRAG_CB(fp)->next_frag;
+		}
+		/* Move to the next run. */
+		if (rbn) {
+			struct rb_node *rbnext = rb_next(rbn);
+
+			fp = rb_to_skb(rbn);
+			rb_erase(rbn, &qp->q.rb_fragments);
+			rbn = rbnext;
+		}
 	}
 	sub_frag_mem_limit(qp->q.net, head->truesize);
 
@@ -624,6 +651,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	qp->q.fragments = NULL;
 	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
+	qp->q.last_run_head = NULL;
 	return 0;
 
 out_nomem:
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 30/30] ip: frags: fix crash in ip_do_fragment()
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (28 preceding siblings ...)
  2018-09-13 14:59 ` [PATCH v3 29/30] ip: process in-order fragments efficiently Stephen Hemminger
@ 2018-09-13 14:59 ` Stephen Hemminger
  2018-09-17 12:47 ` [PATCH v3 00/30] backport of IP fragmentation fixes Greg KH
  30 siblings, 0 replies; 45+ messages in thread
From: Stephen Hemminger @ 2018-09-13 14:59 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev, stable, edumazet, Taehee Yoo

From: Taehee Yoo <ap420073@gmail.com>

commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

A kernel crash occurs when a defragmented packet is fragmented
in ip_do_fragment().
In the defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set. But skb->sk and
skb->ip_defrag_offset are the same union member, so
frag->sk is not NULL.
Hence a crash occurs in the skb->sk check in ip_do_fragment()
when a defragmented packet is fragmented.
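
For reference, this is the union in struct sk_buff introduced by the
rbnode reorg earlier in this series (minimal excerpt from patch 26/30);
writing ip_defrag_offset aliases the sk pointer:

	union {
		struct sock		*sk;
		int			ip_defrag_offset;
	};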

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
[  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
[  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
[  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
[  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
[  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk in the reassembly routine. (Eric Dumazet)

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/ip_fragment.c                  | 1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 88281fbce88c..e7227128df2c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -599,6 +599,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			nextp = &fp->next;
 			fp->prev = NULL;
 			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+			fp->sk = NULL;
 			head->data_len += fp->len;
 			head->len += fp->len;
 			if (head->ip_summed != fp->ip_summed)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 82ce0d0f54bf..2ed8536e10b6 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -453,6 +453,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
 		else if (head->ip_summed == CHECKSUM_COMPLETE)
 			head->csum = csum_add(head->csum, fp->csum);
 		head->truesize += fp->truesize;
+		fp->sk = NULL;
 	}
 	sub_frag_mem_limit(fq->q.net, head->truesize);
 
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 00/30] backport of IP fragmentation fixes
  2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
                   ` (29 preceding siblings ...)
  2018-09-13 14:59 ` [PATCH v3 30/30] ip: frags: fix crash in ip_do_fragment() Stephen Hemminger
@ 2018-09-17 12:47 ` Greg KH
  30 siblings, 0 replies; 45+ messages in thread
From: Greg KH @ 2018-09-17 12:47 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, netdev, stable, edumazet, Stephen Hemminger

On Thu, Sep 13, 2018 at 07:58:32AM -0700, Stephen Hemminger wrote:
> Took the set of patches from 4.19 to handle IP fragmentation DoS
> and applied them against 4.14.69.  Most of these are from Eric.
> In a couple case, it required some manual merge conflict resolution.
> 
> Tested normal IP fragmentation with iperf3 and malicious IP fragments
> with fragmentsmack. Under fragmentation attack (700Kpps) the original
> 4.14.69 consumes 97% CPU; with this patch it drops to 5%.

All now queued up, thanks for doing the backport.

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 26/30] net: sk_buff rbnode reorg
  2018-09-13 14:58 ` [PATCH v3 26/30] net: sk_buff rbnode reorg Stephen Hemminger
@ 2018-10-18 16:01   ` Christoph Paasch
  0 siblings, 0 replies; 45+ messages in thread
From: Christoph Paasch @ 2018-10-18 16:01 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, gregkh, netdev, stable, Eric Dumazet, soheil,
	weiwan, willemb

Hello,

On Thu, Sep 13, 2018 at 8:00 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream
>
> skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
>
> Current uses (TCP receive ofo queue and netem) need to save/restore
> tstamp, while skb->dev is either NULL (TCP) or a constant for a given
> queue (netem).
>
> Since we plan using an RB tree for TCP retransmit queue to speedup SACK
> processing with large BDP, this patch exchanges skb->dev and
> skb->tstamp.
>
> This saves some overhead in both TCP and netem.
>
> v2: removes the swtstamp field from struct tcp_skb_cb
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Wei Wang <weiwan@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
>  include/linux/skbuff.h                  |  24 ++--
>  include/net/inet_frag.h                 |   3 +-
>  net/ipv4/inet_fragment.c                |  16 ++-
>  net/ipv4/ip_fragment.c                  | 182 +++++++++++++-----------
>  net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
>  net/ipv6/reassembly.c                   |   1 +
>  6 files changed, 129 insertions(+), 98 deletions(-)

this change is missing the pieces in sch_netem.c that Eric did as well.

Thus, there is a panic that can be hit à la:
https://github.com/multipath-tcp/mptcp/issues/285#issuecomment-431001559


The following diff fixes it (I can submit a proper patch later today).
Storing of the tstamp can be changed as well (as Eric did in the
original patch), but that's not causing any issues. I see a backport
to v4.9 has been submitted as well...
---
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2a2ab6bfe5d8..3d325b840802 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -624,6 +624,10 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 	skb->next = NULL;
 	skb->prev = NULL;
 	skb->tstamp = netem_skb_cb(skb)->tstamp_save;
+	/* skb->dev shares skb->rbnode area,
+	 * we need to restore its value.
+	 */
+	skb->dev = qdisc_dev(sch);
 
 #ifdef CONFIG_NET_CLS_ACT
 	/*





>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 2837e55df03e..f64e88444082 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -663,23 +663,27 @@ struct sk_buff {
>                         struct sk_buff          *prev;
>
>                         union {
> -                               ktime_t         tstamp;
> -                               u64             skb_mstamp;
> +                               struct net_device       *dev;
> +                               /* Some protocols might use this space to store information,
> +                                * while device pointer would be NULL.
> +                                * UDP receive path is one user.
> +                                */
> +                               unsigned long           dev_scratch;
>                         };
>                 };
> -               struct rb_node  rbnode; /* used in netem & tcp stack */
> +               struct rb_node          rbnode; /* used in netem, ip4 defrag, and tcp stack */
> +               struct list_head        list;
>         };
> -       struct sock             *sk;
>
>         union {
> -               struct net_device       *dev;
> -               /* Some protocols might use this space to store information,
> -                * while device pointer would be NULL.
> -                * UDP receive path is one user.
> -                */
> -               unsigned long           dev_scratch;
> +               struct sock             *sk;
>                 int                     ip_defrag_offset;
>         };
> +
> +       union {
> +               ktime_t         tstamp;
> +               u64             skb_mstamp;
> +       };
>         /*
>          * This is the control buffer. It is free to use for every
>          * layer. Please put your private variables there. If you
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index ed07e3786d98..e4c71a7644be 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -75,7 +75,8 @@ struct inet_frag_queue {
>         struct timer_list       timer;
>         spinlock_t              lock;
>         refcount_t              refcnt;
> -       struct sk_buff          *fragments;
> +       struct sk_buff          *fragments;  /* Used in IPv6. */
> +       struct rb_root          rb_fragments; /* Used in IPv4. */
>         struct sk_buff          *fragments_tail;
>         ktime_t                 stamp;
>         int                     len;
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index c9e35b81d093..6904cbb7de1a 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -136,12 +136,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
>         fp = q->fragments;
>         nf = q->net;
>         f = nf->f;
> -       while (fp) {
> -               struct sk_buff *xp = fp->next;
> -
> -               sum_truesize += fp->truesize;
> -               kfree_skb(fp);
> -               fp = xp;
> +       if (fp) {
> +               do {
> +                       struct sk_buff *xp = fp->next;
> +
> +                       sum_truesize += fp->truesize;
> +                       kfree_skb(fp);
> +                       fp = xp;
> +               } while (fp);
> +       } else {
> +               sum_truesize = skb_rbtree_purge(&q->rb_fragments);
>         }
>         sum = sum_truesize + f->qsize;
>
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index 960bf5eab59f..0e8f8de77e71 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
>  {
>         struct inet_frag_queue *frag = from_timer(frag, t, timer);
>         const struct iphdr *iph;
> -       struct sk_buff *head;
> +       struct sk_buff *head = NULL;
>         struct net *net;
>         struct ipq *qp;
>         int err;
> @@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
>
>         ipq_kill(qp);
>         __IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
> -
> -       head = qp->q.fragments;
> -
>         __IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
>
> -       if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
> +       if (!qp->q.flags & INET_FRAG_FIRST_IN)
>                 goto out;
>
> +       /* sk_buff::dev and sk_buff::rbnode are unionized. So we
> +        * pull the head out of the tree in order to be able to
> +        * deal with head->dev.
> +        */
> +       if (qp->q.fragments) {
> +               head = qp->q.fragments;
> +               qp->q.fragments = head->next;
> +       } else {
> +               head = skb_rb_first(&qp->q.rb_fragments);
> +               if (!head)
> +                       goto out;
> +               rb_erase(&head->rbnode, &qp->q.rb_fragments);
> +               memset(&head->rbnode, 0, sizeof(head->rbnode));
> +               barrier();
> +       }
> +       if (head == qp->q.fragments_tail)
> +               qp->q.fragments_tail = NULL;
> +
> +       sub_frag_mem_limit(qp->q.net, head->truesize);
> +
>         head->dev = dev_get_by_index_rcu(net, qp->iif);
>         if (!head->dev)
>                 goto out;
> @@ -179,16 +196,16 @@ static void ip_expire(struct timer_list *t)
>             (skb_rtable(head)->rt_type != RTN_LOCAL))
>                 goto out;
>
> -       skb_get(head);
>         spin_unlock(&qp->q.lock);
>         icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
> -       kfree_skb(head);
>         goto out_rcu_unlock;
>
>  out:
>         spin_unlock(&qp->q.lock);
>  out_rcu_unlock:
>         rcu_read_unlock();
> +       if (head)
> +               kfree_skb(head);
>         ipq_put(qp);
>  }
>
> @@ -231,7 +248,7 @@ static int ip_frag_too_far(struct ipq *qp)
>         end = atomic_inc_return(&peer->rid);
>         qp->rid = end;
>
> -       rc = qp->q.fragments && (end - start) > max;
> +       rc = qp->q.fragments_tail && (end - start) > max;
>
>         if (rc) {
>                 struct net *net;
> @@ -245,7 +262,6 @@ static int ip_frag_too_far(struct ipq *qp)
>
>  static int ip_frag_reinit(struct ipq *qp)
>  {
> -       struct sk_buff *fp;
>         unsigned int sum_truesize = 0;
>
>         if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
> @@ -253,20 +269,14 @@ static int ip_frag_reinit(struct ipq *qp)
>                 return -ETIMEDOUT;
>         }
>
> -       fp = qp->q.fragments;
> -       do {
> -               struct sk_buff *xp = fp->next;
> -
> -               sum_truesize += fp->truesize;
> -               kfree_skb(fp);
> -               fp = xp;
> -       } while (fp);
> +       sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
>         sub_frag_mem_limit(qp->q.net, sum_truesize);
>
>         qp->q.flags = 0;
>         qp->q.len = 0;
>         qp->q.meat = 0;
>         qp->q.fragments = NULL;
> +       qp->q.rb_fragments = RB_ROOT;
>         qp->q.fragments_tail = NULL;
>         qp->iif = 0;
>         qp->ecn = 0;
> @@ -278,7 +288,8 @@ static int ip_frag_reinit(struct ipq *qp)
>  static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>  {
>         struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
> -       struct sk_buff *prev, *next;
> +       struct rb_node **rbn, *parent;
> +       struct sk_buff *skb1;
>         struct net_device *dev;
>         unsigned int fragsize;
>         int flags, offset;
> @@ -341,58 +352,58 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>         if (err)
>                 goto err;
>
> -       /* Find out which fragments are in front and at the back of us
> -        * in the chain of fragments so far.  We must know where to put
> -        * this fragment, right?
> -        */
> -       prev = qp->q.fragments_tail;
> -       if (!prev || prev->ip_defrag_offset < offset) {
> -               next = NULL;
> -               goto found;
> -       }
> -       prev = NULL;
> -       for (next = qp->q.fragments; next != NULL; next = next->next) {
> -               if (next->ip_defrag_offset >= offset)
> -                       break;  /* bingo! */
> -               prev = next;
> -       }
> +       /* Note : skb->rbnode and skb->dev share the same location. */
> +       dev = skb->dev;
> +       /* Makes sure compiler wont do silly aliasing games */
> +       barrier();
>
> -found:
>         /* RFC5722, Section 4, amended by Errata ID : 3089
>          *                          When reassembling an IPv6 datagram, if
>          *   one or more its constituent fragments is determined to be an
>          *   overlapping fragment, the entire datagram (and any constituent
>          *   fragments) MUST be silently discarded.
>          *
> -        * We do the same here for IPv4.
> +        * We do the same here for IPv4 (and increment an snmp counter).
>          */
>
> -       /* Is there an overlap with the previous fragment? */
> -       if (prev &&
> -           (prev->ip_defrag_offset + prev->len) > offset)
> -               goto discard_qp;
> -
> -       /* Is there an overlap with the next fragment? */
> -       if (next && next->ip_defrag_offset < end)
> -               goto discard_qp;
> +       /* Find out where to put this fragment.  */
> +       skb1 = qp->q.fragments_tail;
> +       if (!skb1) {
> +               /* This is the first fragment we've received. */
> +               rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
> +               qp->q.fragments_tail = skb;
> +       } else if ((skb1->ip_defrag_offset + skb1->len) < end) {
> +               /* This is the common/special case: skb goes to the end. */
> +               /* Detect and discard overlaps. */
> +               if (offset < (skb1->ip_defrag_offset + skb1->len))
> +                       goto discard_qp;
> +               /* Insert after skb1. */
> +               rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
> +               qp->q.fragments_tail = skb;
> +       } else {
> +               /* Binary search. Note that skb can become the first fragment, but
> +                * not the last (covered above). */
> +               rbn = &qp->q.rb_fragments.rb_node;
> +               do {
> +                       parent = *rbn;
> +                       skb1 = rb_to_skb(parent);
> +                       if (end <= skb1->ip_defrag_offset)
> +                               rbn = &parent->rb_left;
> +                       else if (offset >= skb1->ip_defrag_offset + skb1->len)
> +                               rbn = &parent->rb_right;
> +                       else /* Found an overlap with skb1. */
> +                               goto discard_qp;
> +               } while (*rbn);
> +               /* Here we have parent properly set, and rbn pointing to
> +                * one of its NULL left/right children. Insert skb. */
> +               rb_link_node(&skb->rbnode, parent, rbn);
> +       }
> +       rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
>
> -       /* Note : skb->ip_defrag_offset and skb->dev share the same location */
> -       dev = skb->dev;
>         if (dev)
>                 qp->iif = dev->ifindex;
> -       /* Makes sure compiler wont do silly aliasing games */
> -       barrier();
>         skb->ip_defrag_offset = offset;
>
> -       /* Insert this fragment in the chain of fragments. */
> -       skb->next = next;
> -       if (!next)
> -               qp->q.fragments_tail = skb;
> -       if (prev)
> -               prev->next = skb;
> -       else
> -               qp->q.fragments = skb;
> -
>         qp->q.stamp = skb->tstamp;
>         qp->q.meat += skb->len;
>         qp->ecn |= ecn;
> @@ -414,7 +425,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>                 unsigned long orefdst = skb->_skb_refdst;
>
>                 skb->_skb_refdst = 0UL;
> -               err = ip_frag_reasm(qp, prev, dev);
> +               err = ip_frag_reasm(qp, skb, dev);
>                 skb->_skb_refdst = orefdst;
>                 return err;
>         }
> @@ -431,15 +442,15 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>         return err;
>  }
>
> -
>  /* Build a new IP datagram from all its fragments. */
> -
> -static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
> +static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
>                          struct net_device *dev)
>  {
>         struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
>         struct iphdr *iph;
> -       struct sk_buff *fp, *head = qp->q.fragments;
> +       struct sk_buff *fp, *head = skb_rb_first(&qp->q.rb_fragments);
> +       struct sk_buff **nextp; /* To build frag_list. */
> +       struct rb_node *rbn;
>         int len;
>         int ihlen;
>         int err;
> @@ -453,25 +464,20 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
>                 goto out_fail;
>         }
>         /* Make the one we just received the head. */
> -       if (prev) {
> -               head = prev->next;
> -               fp = skb_clone(head, GFP_ATOMIC);
> +       if (head != skb) {
> +               fp = skb_clone(skb, GFP_ATOMIC);
>                 if (!fp)
>                         goto out_nomem;
> -
> -               fp->next = head->next;
> -               if (!fp->next)
> +               rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
> +               if (qp->q.fragments_tail == skb)
>                         qp->q.fragments_tail = fp;
> -               prev->next = fp;
> -
> -               skb_morph(head, qp->q.fragments);
> -               head->next = qp->q.fragments->next;
> -
> -               consume_skb(qp->q.fragments);
> -               qp->q.fragments = head;
> +               skb_morph(skb, head);
> +               rb_replace_node(&head->rbnode, &skb->rbnode,
> +                               &qp->q.rb_fragments);
> +               consume_skb(head);
> +               head = skb;
>         }
>
> -       WARN_ON(!head);
>         WARN_ON(head->ip_defrag_offset != 0);
>
>         /* Allocate a new buffer for the datagram. */
> @@ -496,24 +502,35 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
>                 clone = alloc_skb(0, GFP_ATOMIC);
>                 if (!clone)
>                         goto out_nomem;
> -               clone->next = head->next;
> -               head->next = clone;
>                 skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
>                 skb_frag_list_init(head);
>                 for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
>                         plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
>                 clone->len = clone->data_len = head->data_len - plen;
> -               head->data_len -= clone->len;
> -               head->len -= clone->len;
> +               skb->truesize += clone->truesize;
>                 clone->csum = 0;
>                 clone->ip_summed = head->ip_summed;
>                 add_frag_mem_limit(qp->q.net, clone->truesize);
> +               skb_shinfo(head)->frag_list = clone;
> +               nextp = &clone->next;
> +       } else {
> +               nextp = &skb_shinfo(head)->frag_list;
>         }
>
> -       skb_shinfo(head)->frag_list = head->next;
>         skb_push(head, head->data - skb_network_header(head));
>
> -       for (fp=head->next; fp; fp = fp->next) {
> +       /* Traverse the tree in order, to build frag_list. */
> +       rbn = rb_next(&head->rbnode);
> +       rb_erase(&head->rbnode, &qp->q.rb_fragments);
> +       while (rbn) {
> +               struct rb_node *rbnext = rb_next(rbn);
> +               fp = rb_to_skb(rbn);
> +               rb_erase(rbn, &qp->q.rb_fragments);
> +               rbn = rbnext;
> +               *nextp = fp;
> +               nextp = &fp->next;
> +               fp->prev = NULL;
> +               memset(&fp->rbnode, 0, sizeof(fp->rbnode));
>                 head->data_len += fp->len;
>                 head->len += fp->len;
>                 if (head->ip_summed != fp->ip_summed)
> @@ -524,7 +541,9 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
>         }
>         sub_frag_mem_limit(qp->q.net, head->truesize);
>
> +       *nextp = NULL;
>         head->next = NULL;
> +       head->prev = NULL;
>         head->dev = dev;
>         head->tstamp = qp->q.stamp;
>         IPCB(head)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
> @@ -552,6 +571,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
>
>         __IP_INC_STATS(net, IPSTATS_MIB_REASMOKS);
>         qp->q.fragments = NULL;
> +       qp->q.rb_fragments = RB_ROOT;
>         qp->q.fragments_tail = NULL;
>         return 0;
>
> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
> index 1d2f07cde01a..82ce0d0f54bf 100644
> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -471,6 +471,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
>                                           head->csum);
>
>         fq->q.fragments = NULL;
> +       fq->q.rb_fragments = RB_ROOT;
>         fq->q.fragments_tail = NULL;
>
>         return true;
> diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
> index afaad60dc2ac..ede0061b6f5d 100644
> --- a/net/ipv6/reassembly.c
> +++ b/net/ipv6/reassembly.c
> @@ -472,6 +472,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
>         __IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMOKS);
>         rcu_read_unlock();
>         fq->q.fragments = NULL;
> +       fq->q.rb_fragments = RB_ROOT;
>         fq->q.fragments_tail = NULL;
>         return 1;
>
> --
> 2.18.0
>

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
  2018-09-13 14:58 ` [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu Stephen Hemminger
@ 2019-01-10 19:30   ` Tom Herbert
  2019-01-10 22:22     ` Florian Westphal
  0 siblings, 1 reply; 45+ messages in thread
From: Tom Herbert @ 2019-01-10 19:30 UTC (permalink / raw)
  To: Linux Kernel Network Developers
  Cc: David S. Miller, Eric Dumazet, Florian Westphal, Peter Oskolkov,
	Timothy Winters

Hi,

This patch popped up on the 6man IETF list as possibly causing
interoperability problems. Here's the description from Timothy
Winters:

"We have encountered a potential Interoperability issue at the UNH-IOL
while running some testing.  The issue is around fragments. The Linux
Kernel updated based on the following CVE:
https://nvd.nist.gov/vuln/detail/CVE-2018-5391.

The fix was to reject IPv6 fragments less than 1280 that aren't last
fragment.  Section 4.5 of RFC 8200 allows for sending any fragment for
fragments as long they add up to the original packet.  This means that
an implementation that generates a non-last fragments with a size then
1280, will be dropped by the updated kernel."
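
(For example, a sender may legally split a 1000-byte UDP datagram so
that the first, non-last fragment is a 448-byte IPv6 packet: 40-byte
IPv6 header + 8-byte fragment header + 400 bytes of payload, a multiple
of 8. The patched kernel drops that fragment because it is smaller than
1280 bytes and has the MF bit set.)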

I couldn't find any mention of the advisory in the commit logs or
netdev discussion, and apparently there's no protocol requirement that
intermediate fragments need to be at least the minimal MTU. Maybe this
patch should be reverted?

Tom

On Thu, Sep 13, 2018 at 8:00 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> From: Florian Westphal <fw@strlen.de>
>
> don't bother with pathological cases, they only waste cycles.
> IPv6 requires a minimum MTU of 1280 so we should never see fragments
> smaller than this (except last frag).
>
> v3: don't use awkward "-offset + len"
> v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
>     There were concerns that there could be even smaller frags
>     generated by intermediate nodes, e.g. on radio networks.
>
> Cc: Peter Oskolkov <posk@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> (cherry picked from commit 0ed4229b08c13c84a3c301a08defdc9e7f4467e6)
> ---
>  net/ipv6/netfilter/nf_conntrack_reasm.c | 4 ++++
>  net/ipv6/reassembly.c                   | 4 ++++
>  2 files changed, 8 insertions(+)
>
> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
> index a1dc0d6a5949..1d2f07cde01a 100644
> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -565,6 +565,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
>         hdr = ipv6_hdr(skb);
>         fhdr = (struct frag_hdr *)skb_transport_header(skb);
>
> +       if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
> +           fhdr->frag_off & htons(IP6_MF))
> +               return -EINVAL;
> +
>         skb_orphan(skb);
>         fq = fq_find(net, fhdr->identification, user, hdr,
>                      skb->dev ? skb->dev->ifindex : 0);
> diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
> index e1c5fa5e3873..afaad60dc2ac 100644
> --- a/net/ipv6/reassembly.c
> +++ b/net/ipv6/reassembly.c
> @@ -522,6 +522,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
>                 return 1;
>         }
>
> +       if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
> +           fhdr->frag_off & htons(IP6_MF))
> +               goto fail_hdr;
> +
>         iif = skb->dev ? skb->dev->ifindex : 0;
>         fq = fq_find(net, fhdr->identification, hdr, iif);
>         if (fq) {
> --
> 2.18.0
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-10 19:30   ` Tom Herbert
@ 2019-01-10 22:22     ` Florian Westphal
  2019-01-11 10:57       ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2019-01-10 22:22 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller, Eric Dumazet,
	Florian Westphal, Peter Oskolkov, Timothy Winters

Tom Herbert <tom@herbertland.com> wrote:
> I couldn't find any mention of the advisory in the commit logs or
> netdev discussion, and apparently there's no protocol requirement that
> intermediate fragements need to be at least minimal MTU. Maybe this
> patch should be reverted?

Currently ipv6 reasm doesn't use rbtree infrastructure, so it would
have to be converted first.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-10 22:22     ` Florian Westphal
@ 2019-01-11 10:57       ` Eric Dumazet
  2019-01-11 12:21         ` Michal Kubecek
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2019-01-11 10:57 UTC (permalink / raw)
  To: Florian Westphal, Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller, Eric Dumazet,
	Peter Oskolkov, Timothy Winters



On 01/10/2019 02:22 PM, Florian Westphal wrote:
> Tom Herbert <tom@herbertland.com> wrote:
>> I couldn't find any mention of the advisory in the commit logs or
>> netdev discussion, and apparently there's no protocol requirement that
>> intermediate fragements need to be at least minimal MTU. Maybe this
>> patch should be reverted?
> 
> Currently ipv6 reasm doesn't use rbtree infrastructure, so it would
> have to be converted first.
> 

<quote>
Section 4.5 of RFC 8200 allows for sending any fragment for
fragments as long they add up to the original packet. 
</quote>

I do not believe we need an rbtree to implement this idea.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-11 10:57       ` Eric Dumazet
@ 2019-01-11 12:21         ` Michal Kubecek
  2019-01-11 12:27           ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Kubecek @ 2019-01-11 12:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Florian Westphal, Tom Herbert, Linux Kernel Network Developers,
	David S. Miller, Eric Dumazet, Peter Oskolkov, Timothy Winters

On Fri, Jan 11, 2019 at 02:57:39AM -0800, Eric Dumazet wrote:
> On 01/10/2019 02:22 PM, Florian Westphal wrote:
> > Tom Herbert <tom@herbertland.com> wrote:
> >> I couldn't find any mention of the advisory in the commit logs or
> >> netdev discussion, and apparently there's no protocol requirement that
> >> intermediate fragements need to be at least minimal MTU. Maybe this
> >> patch should be reverted?
> > 
> > Currently ipv6 reasm doesn't use rbtree infrastructure, so it would
> > have to be converted first.
> 
> <quote>
> Section 4.5 of RFC 8200 allows for sending any fragment for
> fragments as long they add up to the original packet. 
> </quote>
> 
> I do not believe we need an rbtree to implement this idea.

IMHO Florian meant that allowing arbitrarily small fragments would harm
resistance against FragmentSmack-type attacks, so we might need
rbtree-based queues to be reasonably safe.

Michal Kubecek

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-11 12:21         ` Michal Kubecek
@ 2019-01-11 12:27           ` Eric Dumazet
  2019-01-11 12:52             ` Michal Kubecek
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2019-01-11 12:27 UTC (permalink / raw)
  To: mkubecek
  Cc: Eric Dumazet, Florian Westphal, Tom Herbert, netdev,
	David Miller, Peter Oskolkov, twinters

On Fri, Jan 11, 2019 at 4:21 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>
> On Fri, Jan 11, 2019 at 02:57:39AM -0800, Eric Dumazet wrote:
> > On 01/10/2019 02:22 PM, Florian Westphal wrote:
> > > Tom Herbert <tom@herbertland.com> wrote:
> > >> I couldn't find any mention of the advisory in the commit logs or
> > >> netdev discussion, and apparently there's no protocol requirement that
> > >> intermediate fragements need to be at least minimal MTU. Maybe this
> > >> patch should be reverted?
> > >
> > > Currently ipv6 reasm doesn't use rbtree infrastructure, so it would
> > > have to be converted first.
> >
> > <quote>
> > Section 4.5 of RFC 8200 allows for sending any fragment for
> > fragments as long they add up to the original packet.
> > </quote>
> >
> > I do not believe we need an rbtree to implement this idea.
>
> IMHO Florian meant that allowing arbitrarily small fragments would harm
> resistance against FragmentSmack type attacks so that we might need
> rbtree based queues to be reasonably safe.
>

I fail to see why.

Adding a fragment to the tail of the list is O(1), whatever its size.

If we ever receive a 'too small for IPv6' fragment not at the tail,
we would immediately discard the whole datagram in O(1) as well.

rbtree was needed because we assumed that we needed to allow IPv4 to
receive arbitrary fragments.
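
A minimal sketch of such a refined check (hypothetical, not code from
this series; fhdr, offset and fq named as in the IPv6 reassembly path):

	/* Drop the whole queue only when a short non-last fragment
	 * arrives out of order, i.e. does not extend the current tail.
	 */
	if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
	    (fhdr->frag_off & htons(IP6_MF)) &&
	    offset != fq->q.len)
		goto discard_fq;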

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-11 12:27           ` Eric Dumazet
@ 2019-01-11 12:52             ` Michal Kubecek
  2019-01-11 13:07               ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Kubecek @ 2019-01-11 12:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Florian Westphal, Tom Herbert, netdev,
	David Miller, Peter Oskolkov, twinters

On Fri, Jan 11, 2019 at 04:27:24AM -0800, Eric Dumazet wrote:
> On Fri, Jan 11, 2019 at 4:21 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> >
> > On Fri, Jan 11, 2019 at 02:57:39AM -0800, Eric Dumazet wrote:
> > > On 01/10/2019 02:22 PM, Florian Westphal wrote:
> > > > Tom Herbert <tom@herbertland.com> wrote:
> > > >> I couldn't find any mention of the advisory in the commit logs or
> > > >> netdev discussion, and apparently there's no protocol requirement that
> > > >> intermediate fragements need to be at least minimal MTU. Maybe this
> > > >> patch should be reverted?
> > > >
> > > > Currently ipv6 reasm doesn't use rbtree infrastructure, so it would
> > > > have to be converted first.
> > >
> > > <quote>
> > > Section 4.5 of RFC 8200 allows for sending any fragment for
> > > fragments as long they add up to the original packet.
> > > </quote>
> > >
> > > I do not believe we need an rbtree to implement this idea.
> >
> > IMHO Florian meant that allowing arbitrarily small fragments would harm
> > resistance against FragmentSmack type attacks so that we might need
> > rbtree based queues to be reasonably safe.
> >
> 
> I fail to see why.
> 
> Adding a fragment to the tail of the list is O(1), whatever of its size.
> 
> If we ever receive a 'too small for IPv6' fragment not at the tail,
> we would immediately discard the whole datagram in O(1) as well.
> 
> rbtree was needed because we assumed that we needed to allow IPv4 to
> receive arbitrary fragments.

My understanding of the original e-mail was that our assumption that
only the last IPv6 fragment is allowed to be shorter than 1280 bytes (which
commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min
mtu") is based on) was wrong and we should probably revert that commit.
How do you understand it?

Michal Kubecek

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-11 12:52             ` Michal Kubecek
@ 2019-01-11 13:07               ` Eric Dumazet
       [not found]                 ` <CAOSSMjUODMbBuW=GgwcEt6avKoyYD5A9CzdBtE6NR6dz4pnD6w@mail.gmail.com>
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2019-01-11 13:07 UTC (permalink / raw)
  To: mkubecek
  Cc: Eric Dumazet, Florian Westphal, Tom Herbert, netdev,
	David Miller, Peter Oskolkov, twinters

On Fri, Jan 11, 2019 at 4:52 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>
> On Fri, Jan 11, 2019 at 04:27:24AM -0800, Eric Dumazet wrote:
> > On Fri, Jan 11, 2019 at 4:21 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> > >
> > > On Fri, Jan 11, 2019 at 02:57:39AM -0800, Eric Dumazet wrote:
> > > > On 01/10/2019 02:22 PM, Florian Westphal wrote:
> > > > > Tom Herbert <tom@herbertland.com> wrote:
> > > > >> I couldn't find any mention of the advisory in the commit logs or
> > > > >> netdev discussion, and apparently there's no protocol requirement that
> > > > >> intermediate fragements need to be at least minimal MTU. Maybe this
> > > > >> patch should be reverted?
> > > > >
> > > > > Currently ipv6 reasm doesn't use rbtree infrastructure, so it would
> > > > > have to be converted first.
> > > >
> > > > <quote>
> > > > Section 4.5 of RFC 8200 allows for sending any fragment for
> > > > fragments as long they add up to the original packet.
> > > > </quote>
> > > >
> > > > I do not believe we need an rbtree to implement this idea.
> > >
> > > IMHO Florian meant that allowing arbitrarily small fragments would harm
> > > resistance against FragmentSmack type attacks so that we might need
> > > rbtree based queues to be reasonably safe.
> > >
> >
> > I fail to see why.
> >
> > Adding a fragment to the tail of the list is O(1), whatever of its size.
> >
> > If we ever receive a 'too small for IPv6' fragment not at the tail,
> > we would immediately discard the whole datagram in O(1) as well.
> >
> > rbtree was needed because we assumed that we needed to allow IPv4 to
> > receive arbitrary fragments.
>
> My understanding of the original e-mail was that our assumption that
> only last IPv6 fragment is allowed to be shorter than 1280 bytes (which
> commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min
> mtu") is based on) was wrong and we should probably revert that commit.
> How do you understand it?
>

Reverting the commit is a no-go; attackers would simply kill the hosts.

I was saying that we can refine the check and allow fragments of any
size, as long as they arrive in order.
(Even if this makes little sense, since fragments can arrive out of order.)

Of course, the rb-tree implementation might be okay (even for stable
trees), since we can copy/paste the IPv4 one.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
       [not found]                 ` <CAOSSMjUODMbBuW=GgwcEt6avKoyYD5A9CzdBtE6NR6dz4pnD6w@mail.gmail.com>
@ 2019-01-11 14:09                   ` Eric Dumazet
  2019-01-11 14:21                   ` Michal Kubecek
  1 sibling, 0 replies; 45+ messages in thread
From: Eric Dumazet @ 2019-01-11 14:09 UTC (permalink / raw)
  To: twinters
  Cc: mkubecek, Eric Dumazet, Florian Westphal, Tom Herbert, netdev,
	David Miller, Peter Oskolkov

On Fri, Jan 11, 2019 at 5:26 AM Timothy Winters <twinters@iol.unh.edu> wrote:
>
> Hi Eric,
>
> So, do I understand correctly that the attack you are trying to prevent is many small fragments from different IPs?

No... Please  look for CVE-2018-5391

This has been one of the most discussed topics in 2018

https://access.redhat.com/articles/3553061

https://tools.cisco.com/security/center/content/CiscoSecurityAdvisory/cisco-sa-20180824-linux-ip-fragment

>
> The 6MAN working group has had some discussion about this topic, if you want to read some IPv6 networking perspectives.
>
> https://mailarchive.ietf.org/arch/browse/ipv6/?gbt=1&index=zsYUwQTt0mAdXdk9MN4dP-uiiag
>
> The issue you will see is that devices that don't send a first fragment of size 1280 will never be able to communicate with a newer kernel.  Another issue is that if the packets are traversing tunneled interfaces (VPNs) they will be smaller than 1280 due to the headers used on tunnels.
>
> What about lowering the minimum accepted fragment size?  To something like 1280/2?
>
> ~Tim
>
> On Fri, Jan 11, 2019 at 8:07 AM Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Fri, Jan 11, 2019 at 4:52 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>> >
>> > On Fri, Jan 11, 2019 at 04:27:24AM -0800, Eric Dumazet wrote:
>> > > On Fri, Jan 11, 2019 at 4:21 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>> > > >
>> > > > On Fri, Jan 11, 2019 at 02:57:39AM -0800, Eric Dumazet wrote:
>> > > > > On 01/10/2019 02:22 PM, Florian Westphal wrote:
>> > > > > > Tom Herbert <tom@herbertland.com> wrote:
>> > > > > >> I couldn't find any mention of the advisory in the commit logs or
>> > > > > >> netdev discussion, and apparently there's no protocol requirement that
>> > > > > >> intermediate fragements need to be at least minimal MTU. Maybe this
>> > > > > >> patch should be reverted?
>> > > > > >
>> > > > > > Currently ipv6 reasm doesn't use rbtree infrastructure, so it would
>> > > > > > have to be converted first.
>> > > > >
>> > > > > <quote>
>> > > > > Section 4.5 of RFC 8200 allows for sending any fragment for
>> > > > > fragments as long they add up to the original packet.
>> > > > > </quote>
>> > > > >
>> > > > > I do not believe we need an rbtree to implement this idea.
>> > > >
>> > > > IMHO Florian meant that allowing arbitrarily small fragments would harm
>> > > > resistance against FragmentSmack type attacks so that we might need
>> > > > rbtree based queues to be reasonably safe.
>> > > >
>> > >
>> > > I fail to see why.
>> > >
>> > > Adding a fragment to the tail of the list is O(1), whatever of its size.
>> > >
>> > > If we ever receive a 'too small for IPv6' fragment not at the tail,
>> > > we would immediately discard the whole datagram in O(1) as well.
>> > >
>> > > rbtree was needed because we assumed that we needed to allow IPv4 to
>> > > receive arbitrary fragments.
>> >
>> > My understanding of the original e-mail was that our assumption that
>> > only last IPv6 fragment is allowed to be shorter than 1280 bytes (which
>> > commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min
>> > mtu") is based on) was wrong and we should probably revert that commit.
>> > How do you understand it?
>> >
>>
>> Reverting the commit is a no-go; hackers will simply kill the hosts.
>>
>> I was saying that we can refine the check, and allow whatever
>> fragments to come in order.
>> (Even if this makes no sense, since fragments could come out of order)
>>
>> Of course, the rb-tree implementation might be okay (even for stable
>> trees), since we can copy/paste the IPv4 one.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
       [not found]                 ` <CAOSSMjUODMbBuW=GgwcEt6avKoyYD5A9CzdBtE6NR6dz4pnD6w@mail.gmail.com>
  2019-01-11 14:09                   ` Eric Dumazet
@ 2019-01-11 14:21                   ` Michal Kubecek
       [not found]                     ` <CAOSSMjVMVWxzkT5M2LHgf0+GPHdaWHV01a6mBqbGRVXOaQ04PQ@mail.gmail.com>
  1 sibling, 1 reply; 45+ messages in thread
From: Michal Kubecek @ 2019-01-11 14:21 UTC (permalink / raw)
  To: netdev
  Cc: Timothy Winters, Eric Dumazet, Eric Dumazet, Florian Westphal,
	Tom Herbert, David Miller, Peter Oskolkov

On Friday, 11 January 2019 14:26 Timothy Winters wrote:
> Hi Eric,
> 
> So if I understand correctly, the attack that you are trying to
> prevent is many small fragments from different IPs?
> 
> The 6MAN working group has had some discussion about this topic, if
> you want to read some IPv6 networking perspectives.
> 
> https://mailarchive.ietf.org/arch/browse/ipv6/?gbt=1&index=zsYUwQTt0mAdXdk9MN4dP-uiiag

I haven't read all mails in that discussion but most seem to be missing 
the point. The problem is not memory consumption, we have a (runtime 
configurable) limit for that. The problem is that by sending many small 
(8-byte) fragments of a large (up to 64 KB) packet but never finishing 
it, an attacker can force the receiving host into using quite a lot of CPU 
time just by looking up the fragment queue of the partially reassembled 
packet.

Currently, IPv6 reassembly uses a simple linear list, which is fine with 
1280-byte fragments (up to ~50 of them) but not with 8-byte ones (there 
could be as many as ~8000). The IPv4 reassembly code (where we cannot 
assume a minimal size of non-last fragments) switched to rbtree 
(with logarithmic lookup time) exactly for this reason.
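
For scale, a standalone back-of-the-envelope program (not kernel code;
build with "cc model.c -lm"), using the worst case described above:

#include <math.h>
#include <stdio.h>

int main(void)
{
	long n = 65536 / 8;	/* ~8192 8-byte fragments of a 64 KB packet */
	long list_ops = n * (n + 1) / 2;	/* linear list: each insert walks the queue */
	double rb_ops = 0.0;
	long i;

	for (i = 1; i <= n; i++)
		rb_ops += log2((double)i);	/* rbtree: ~log2(i) comparisons per insert */

	/* prints roughly 33 million steps vs ~100 thousand */
	printf("linear list: ~%ld steps, rbtree: ~%.0f steps\n",
	       list_ops, rb_ops);
	return 0;
}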

> What about lowering the minimum size of accepted fragments to
> something like 1280/2?

That would probably be sufficient to mitigate the DoS attacks.

Michal Kubecek

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
       [not found]                     ` <CAOSSMjVMVWxzkT5M2LHgf0+GPHdaWHV01a6mBqbGRVXOaQ04PQ@mail.gmail.com>
@ 2019-01-11 17:09                       ` Peter Oskolkov
  2019-01-11 18:10                         ` Michal Kubecek
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Oskolkov @ 2019-01-11 17:09 UTC (permalink / raw)
  To: Timothy Winters
  Cc: Michal Kubecek, netdev, Eric Dumazet, Eric Dumazet,
	Florian Westphal, Tom Herbert, David Miller

On Fri, Jan 11, 2019 at 6:54 AM Timothy Winters <twinters@iol.unh.edu> wrote:
>
> Hi Michal and Eric,
>
> Thanks for the clarification.  I'm thinking about creating a draft to say no fragments less than 640 unless it's the last fragment.  Does that work for your code going forward?

I will prepare a patchset to convert the IPv6 defrag queue to
rbtree+list, similarly to how the IPv4 defrag queue currently works,
just in case it is decided to go this route. I don't think having an
arbitrary/non-standard size cap (640) is a good approach.

Thanks,
Peter
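
For reference, a sketch of the shape such an rbtree+list hybrid takes
(field names here are invented for illustration; in the IPv4 code each
tree node covers a "run" of contiguous fragments, so in-order arrivals
append to the last run in O(1) and only out-of-order inserts pay
O(log n)):

struct frag;				/* one received fragment */

struct frag_run {
	unsigned int start, end;	/* byte range covered by this run */
	struct frag *head, *tail;	/* fragments of the run, oldest first */
	struct frag_run *rb_left, *rb_right;	/* tree links, keyed by start */
};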


>
> ~Tim
>
> On Fri, Jan 11, 2019 at 9:21 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>>
>> On Friday, 11 January 2019 14:26 Timothy Winters wrote:
>> > Hi Eric,
>> >
>> > So if I understand correctly, the attack that you are trying to
>> > prevent is many small fragments from different IPs?
>> >
>> > The 6MAN working group has had some discussion about this topic, if
>> > you want to read some IPv6 networking perspectives.
>> >
>> > https://mailarchive.ietf.org/arch/browse/ipv6/?gbt=1&index=zsYUwQTt0mAdXdk9MN4dP-uiiag
>>
>> I haven't read all mails in that discussion but most seem to be missing
>> the point. The problem is not memory consumption, we have a (runtime
>> configurable) limit for that. The problem is that by sending many small
>> (8-byte) fragments of a large (up to 64 KB) packet but never finishing
>> it, an attacker can force the receiving host into using quite a lot of CPU
>> time just by looking up the fragment queue of the partially reassembled
>> packet.
>>
>> Currently, IPv6 reassembly uses a simple linear list, which is fine with
>> 1280-byte fragments (up to ~50 of them) but not with 8-byte ones (there
>> could be as many as ~8000). The IPv4 reassembly code (where we cannot
>> assume a minimal size of non-last fragments) switched to rbtree
>> (with logarithmic lookup time) exactly for this reason.
>>
>> > What about lowering the minimum size of accepted fragments to
>> > something like 1280/2?
>>
>> That would probably be sufficient to mitigate the DoS attacks.
>>
>> Michal Kubecek
>>
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-11 17:09                       ` Peter Oskolkov
@ 2019-01-11 18:10                         ` Michal Kubecek
  2019-01-12  3:21                           ` Tom Herbert
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Kubecek @ 2019-01-11 18:10 UTC (permalink / raw)
  To: netdev
  Cc: Peter Oskolkov, Timothy Winters, Eric Dumazet, Eric Dumazet,
	Florian Westphal, Tom Herbert, David Miller

On Friday, 11 January 2019 18:09 Peter Oskolkov wrote:
> On Fri, Jan 11, 2019 at 6:54 AM Timothy Winters <twinters@iol.unh.edu> wrote:
> > Thanks for the clarification.  I'm thinking about creating a draft
> > to say no fragments less than 640 unless it's the last fragment.
> > Does that work for your code going forward?
> 
> I will prepare a patchset to convert the IPv6 defrag queue to
> rbtree+list, similarly to how the IPv4 defrag queue currently works,
> just in case it is decided to go this route. I don't think having an
> arbitrary/non-standard size cap (640) is a good approach.

It's not completely arbitrary. The idea is that the two most obvious
fragment sizing strategies are:

  (a) use maximum possible size for all except last, then the rest
  (b) calculate minimum required fragment count and use (almost) the
      same size for all of them

Both strategies create non-last fragments of size at least 1280 / 2.
But I agree that using the same data structure and algorithm as for
IPv4 is more future-proof.

Michal Kubecek
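
To check Michal's 1280/2 bound for strategy (b) numerically, a quick
standalone program (ignoring per-fragment headers and the 8-byte
alignment rule, so purely illustrative):

#include <stdio.h>

int main(void)
{
	const unsigned int mtu = 1280;
	unsigned int total;

	for (total = mtu; total <= 65535; total++) {
		unsigned int n = (total + mtu - 1) / mtu;	/* minimal fragment count */
		unsigned int sz = total / n;	/* (almost) equal-sized fragments */

		if (sz < mtu / 2) {
			printf("counter-example: total=%u n=%u size=%u\n",
			       total, n, sz);
			return 1;
		}
	}
	printf("strategy (b) never yields a fragment below %u bytes\n",
	       mtu / 2);
	return 0;
}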

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu
  2019-01-11 18:10                         ` Michal Kubecek
@ 2019-01-12  3:21                           ` Tom Herbert
  0 siblings, 0 replies; 45+ messages in thread
From: Tom Herbert @ 2019-01-12  3:21 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: netdev, Peter Oskolkov, Timothy Winters, Eric Dumazet,
	Eric Dumazet, Florian Westphal, David Miller

On Fri, Jan 11, 2019 at 10:10 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>
> On Friday, 11 January 2019 18:09 Peter Oskolkov wrote:
> > On Fri, Jan 11, 2019 at 6:54 AM Timothy Winters <twinters@iol.unh.edu> wrote:
> > > Thanks for the clarification.  I'm thinking about creating a draft
> > > to say no fragments less than 640 unless it's the last fragment.
> > > Does that work for your code going forward?
> >
> > I will prepare a patchset to convert the IPv6 defrag queue to
> > rbtree+list, similarly to how the IPv4 defrag queue currently works,
> > just in case it is decided to go this route. I don't think having an
> > arbitrary/non-standard size cap (640) is a good approach.
>
> It's not completely arbitrary. The idea is that the two most obvious
> fragment sizing strategies are:
>
>   (a) use maximum possible size for all except last, then the rest
>   (b) calculate minimum required fragment count and use (almost) the
>       same size for all of them
>
> Both strategies create non-last fragments of size at least 1280 / 2.
> But I agree that using the same data structure and algorithm as for
> IPv4 is more future-proof.
>
Yes, the 640 value is being discussed on 6man. I think the correct
approach might be to turn the limit into a sysctl and make the default
value 640. I'm not sure how, or if, this should be applied to IPv4,
though.

Tom

> Michal Kubecek
>
>
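
If the limit did become a sysctl as Tom suggests, it would presumably
follow the existing ip6frag_* pattern; a hypothetical table entry (the
name ip6frag_min_frag_size and its backing field are invented here,
not taken from any actual patch):

static int min_frag_size_min = 8;	/* fragment data is 8-byte aligned */
static int min_frag_size_max = 1280;	/* IPv6 minimum MTU */

static struct ctl_table ip6_frags_ns_ctl_table[] = {
	{
		.procname	= "ip6frag_min_frag_size",
		.data		= &init_net.ipv6.frags.min_frag_size,	/* hypothetical field */
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &min_frag_size_min,
		.extra2		= &min_frag_size_max,
	},
	{ }
};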

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2019-01-12  3:21 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-13 14:58 [PATCH v3 00/30] backport of IP fragmentation fixes Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 01/30] inet: frags: change inet_frags_init_net() return value Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 02/30] inet: frags: add a pointer to struct netns_frags Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 03/30] inet: frags: refactor ipfrag_init() Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 04/30] inet: frags: Convert timers to use timer_setup() Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 05/30] inet: frags: refactor ipv6_frag_init() Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 06/30] inet: frags: refactor lowpan_net_frag_init() Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 07/30] ipv6: export ip6 fragments sysctl to unprivileged users Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 08/30] rhashtable: add schedule points Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 09/30] inet: frags: use rhashtables for reassembly units Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 10/30] inet: frags: remove some helpers Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 11/30] inet: frags: get rif of inet_frag_evicting() Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 12/30] inet: frags: remove inet_frag_maybe_warn_overflow() Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 13/30] inet: frags: break the 2GB limit for frags storage Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 14/30] inet: frags: do not clone skb in ip_expire() Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 15/30] ipv6: frags: rewrite ip6_expire_frag_queue() Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 16/30] rhashtable: reorganize struct rhashtable layout Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 17/30] inet: frags: reorganize struct netns_frags Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 18/30] inet: frags: get rid of ipfrag_skb_cb/FRAG_CB Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 19/30] inet: frags: fix ip6frag_low_thresh boundary Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 20/30] ip: discard IPv4 datagrams with overlapping segments Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 21/30] net: speed up skb_rbtree_purge() Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 22/30] net: modify skb_rbtree_purge to return the truesize of all purged skbs Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu Stephen Hemminger
2019-01-10 19:30   ` Tom Herbert
2019-01-10 22:22     ` Florian Westphal
2019-01-11 10:57       ` Eric Dumazet
2019-01-11 12:21         ` Michal Kubecek
2019-01-11 12:27           ` Eric Dumazet
2019-01-11 12:52             ` Michal Kubecek
2019-01-11 13:07               ` Eric Dumazet
     [not found]                 ` <CAOSSMjUODMbBuW=GgwcEt6avKoyYD5A9CzdBtE6NR6dz4pnD6w@mail.gmail.com>
2019-01-11 14:09                   ` Eric Dumazet
2019-01-11 14:21                   ` Michal Kubecek
     [not found]                     ` <CAOSSMjVMVWxzkT5M2LHgf0+GPHdaWHV01a6mBqbGRVXOaQ04PQ@mail.gmail.com>
2019-01-11 17:09                       ` Peter Oskolkov
2019-01-11 18:10                         ` Michal Kubecek
2019-01-12  3:21                           ` Tom Herbert
2018-09-13 14:58 ` [PATCH v3 24/30] net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 25/30] net: add rb_to_skb() and other rb tree helpers Stephen Hemminger
2018-09-13 14:58 ` [PATCH v3 26/30] net: sk_buff rbnode reorg Stephen Hemminger
2018-10-18 16:01   ` Christoph Paasch
2018-09-13 14:58 ` [PATCH v3 27/30] ipv4: frags: precedence bug in ip_expire() Stephen Hemminger
2018-09-13 14:59 ` [PATCH v3 28/30] ip: add helpers to process in-order fragments faster Stephen Hemminger
2018-09-13 14:59 ` [PATCH v3 29/30] ip: process in-order fragments efficiently Stephen Hemminger
2018-09-13 14:59 ` [PATCH v3 30/30] ip: frags: fix crash in ip_do_fragment() Stephen Hemminger
2018-09-17 12:47 ` [PATCH v3 00/30] backport of IP fragmentation fixes Greg KH
