netdev.vger.kernel.org archive mirror
* [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue
@ 2013-04-24 15:47 Jesper Dangaard Brouer
  2013-04-24 15:48 ` [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists" Jesper Dangaard Brouer
                   ` (5 more replies)
  0 siblings, 6 replies; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-24 15:47 UTC (permalink / raw)
  To: David S. Miller, Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, netdev, Eric Dumazet

This patchset replaces my previous patchset:
  "net: frag code fixes and RFC for LRU removal"
  http://thread.gmane.org/gmane.linux.network/266323/

I have dropped the idea of doing "direct hash cleaning".  Instead I
have simply chosen to solve the global LRU list problem by making
the LRU list per CPU.


Patch-01: Revert "inet: limit length of fragment queue hash table bucket lists"
- Sorry, this patch is broken and needs to be reverted.
  I'll leave it up to Hannes to come up with a better solution,
  perhaps better IPv6 hashing for frag queues.

Patch-02: net: increase frag hash size
- Simply increase the hash size

Patch-03: net: avoid false perf interpretations in frag code
- The compiler makes us misinterpret performance issues in the frag
  code, because it is auto-inlining functions.  This causes too many
  arguments between developers on the list.

Patch-04: net: frag LRU list per CPU
- Change the global LRU list to be per CPU instead.

Besides my usual 10G tests, I have also tested that a 512 Kbit/s
simulated link (with HTB) still works (while sending 3x UDP fragments)
under the DoS test 20G3F+MQ, which sends approx 1 Mpps on a
10Gbit/s NIC.  (And I have verified that the 512 Kbit/s link gets the
right spacing between packets (23.6 ms).)
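(A quick sanity check on that spacing, assuming full-size ~1514 byte
fragments on the wire: 1514 bytes * 8 = 12112 bits, and 12112 / 512000
bit/s ≈ 23.7 ms per packet, in line with the observed ~23.6 ms.)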

HTB setup:
 tc qdisc del dev eth63 root
 tc qdisc add dev eth63 root handle 1: htb default 1
 tc class add dev eth63 parent 1: classid 1:1 htb rate 512kbit ceil 512kbit burst 1600 cburst 1600

Patchset based on top of net-next commit 953c96e0d8 (v3.9-rc5-1145-g953c96e).

---
p.s. I have now reached 300 compiles on this git tree for my
frag-work; this effort is getting out of hand...

Jesper Dangaard Brouer (4):
      net: frag LRU list per CPU
      net: avoid false perf interpretations in frag code
      net: increase frag hash size
      Revert "inet: limit length of fragment queue hash table bucket lists"


 Documentation/networking/ip-sysctl.txt  |    2 -
 include/net/inet_frag.h                 |  120 ++++++++++++++++++-------------
 include/net/ipv6.h                      |    8 +-
 net/ipv4/inet_fragment.c                |   83 ++++++++++++---------
 net/ipv4/ip_fragment.c                  |   35 +++++----
 net/ipv6/netfilter/nf_conntrack_reasm.c |   14 ++--
 net/ipv6/reassembly.c                   |   10 +--
 7 files changed, 153 insertions(+), 119 deletions(-)


--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists"
  2013-04-24 15:47 [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue Jesper Dangaard Brouer
@ 2013-04-24 15:48 ` Jesper Dangaard Brouer
  2013-04-25  0:00   ` Eric Dumazet
  2013-04-24 15:48 ` [net-next PATCH 2/4] net: increase frag hash size Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-24 15:48 UTC (permalink / raw)
  To: David S. Miller, Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, netdev, Eric Dumazet

This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.

The problem with commit 5a3da1fe ("inet: limit length of fragment queue
hash table bucket lists") is that, once we hit the hash depth limit (of
128), we *keep* the existing frag queues, not allowing new frag
queues to be created.  Thus, an attacker can effectively block handling
of fragments for 30 sec (as each frag queue has a timeout of 30 sec).

For this situation to occur, the mem limit needs to increase (from the
default 4MB per netns). This can happen either by 1) creating more
netns (network namespaces) or 2) manually increasing the mem limits
via proc files:

 /proc/sys/net/ipv4/ipfrag_high_thresh
 /proc/sys/net/ipv4/ipfrag_low_thresh

To be exact, the situation occurs when the thresh is increased to
something allowing 128 elements in each bucket, which is not that high
given the hash array size of 64 (64*128=8192 elements), e.g.
   big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25755648
   small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968

Thus, with small frags we only need to start >=3 netns instances for
the situation to become possible.
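(Worked out: filling all 64 buckets to depth 128 with small frags costs
9043968 bytes, while the default limit is 4*1024*1024 = 4194304 bytes
per netns, and 9043968 / 4194304 ≈ 2.16, hence >=3 netns.)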

The reason this is inevitable is that the attacker's invalid fragments
will never finish (timeout 30 sec), while valid fragments will complete
and "exit" the queue; thus the end result is that the hash bucket is
filled with the attacker's invalid/incomplete fragments.

Fixed conflicts in:
    include/net/inet_frag.h

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h                 |    9 ---------
 net/ipv4/inet_fragment.c                |   20 +-------------------
 net/ipv4/ip_fragment.c                  |   11 +++++++----
 net/ipv6/netfilter/nf_conntrack_reasm.c |   12 ++++++------
 net/ipv6/reassembly.c                   |    8 ++------
 5 files changed, 16 insertions(+), 44 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 6f41b45..eb1d6ee 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -43,13 +43,6 @@ struct inet_frag_queue {
 
 #define INETFRAGS_HASHSZ		64
 
-/* averaged:
- * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
- *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
- *	       struct frag_queue))
- */
-#define INETFRAGS_MAXDEPTH		128
-
 struct inet_frag_bucket {
 	struct hlist_head	chain;
 	spinlock_t		chain_lock;
@@ -89,8 +82,6 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash)
 	__releases(&f->lock);
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-				   const char *prefix);
 
 static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f)
 {
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index e97d66a..cabe3d7 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -21,7 +21,6 @@
 #include <linux/rtnetlink.h>
 #include <linux/slab.h>
 
-#include <net/sock.h>
 #include <net/inet_frag.h>
 #include <net/inet_ecn.h>
 
@@ -327,7 +326,6 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 {
 	struct inet_frag_bucket *hb;
 	struct inet_frag_queue *q;
-	int depth = 0;
 
 	hb = &f->hash[hash];
 
@@ -339,26 +337,10 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 			read_unlock(&f->lock);
 			return q;
 		}
-		depth++;
 	}
 	spin_unlock(&hb->chain_lock);
 	read_unlock(&f->lock);
 
-	if (depth <= INETFRAGS_MAXDEPTH)
-		return inet_frag_create(nf, f, key);
-	else
-		return ERR_PTR(-ENOBUFS);
+	return inet_frag_create(nf, f, key);
 }
 EXPORT_SYMBOL(inet_frag_find);
-
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-				   const char *prefix)
-{
-	static const char msg[] = "inet_frag_find: Fragment hash bucket"
-		" list length grew over limit " __stringify(INETFRAGS_MAXDEPTH)
-		". Dropping fragment.\n";
-
-	if (PTR_ERR(q) == -ENOBUFS)
-		LIMIT_NETDEBUG(KERN_WARNING "%s%s", prefix, msg);
-}
-EXPORT_SYMBOL(inet_frag_maybe_warn_overflow);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9385206..cda5514 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -263,11 +263,14 @@ static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
 	hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
 
 	q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
-		return NULL;
-	}
+	if (q == NULL)
+		goto out_nomem;
+
 	return container_of(q, struct ipq, q);
+
+out_nomem:
+	LIMIT_NETDEBUG(KERN_ERR pr_fmt("ip_frag_create: no memory left !\n"));
+	return NULL;
 }
 
 /* Is the fragment too far ahead to be part of ipq? */
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index dffdc1a..7cfa829 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -14,8 +14,6 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#define pr_fmt(fmt) "IPv6-nf: " fmt
-
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/string.h>
@@ -189,11 +187,13 @@ static inline struct frag_queue *fq_find(struct net *net, __be32 id,
 
 	q = inet_frag_find(&net->nf_frag.frags, &nf_frags, &arg, hash);
 	local_bh_enable();
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
-		return NULL;
-	}
+	if (q == NULL)
+		goto oom;
+
 	return container_of(q, struct frag_queue, q);
+
+oom:
+	return NULL;
 }
 
 
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e6e44ce..74505c5 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -26,9 +26,6 @@
  *	YOSHIFUJI,H. @USAGI	Always remove fragment header to
  *				calculate ICV correctly.
  */
-
-#define pr_fmt(fmt) "IPv6: " fmt
-
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/string.h>
@@ -196,10 +193,9 @@ fq_find(struct net *net, __be32 id, const struct in6_addr *src,
 	hash = inet6_hash_frag(id, src, dst, ip6_frags.rnd);
 
 	q = inet_frag_find(&net->ipv6.frags, &ip6_frags, &arg, hash);
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
+	if (q == NULL)
 		return NULL;
-	}
+
 	return container_of(q, struct frag_queue, q);
 }
 

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [net-next PATCH 2/4] net: increase frag hash size
  2013-04-24 15:47 [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue Jesper Dangaard Brouer
  2013-04-24 15:48 ` [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists" Jesper Dangaard Brouer
@ 2013-04-24 15:48 ` Jesper Dangaard Brouer
  2013-04-24 22:09   ` Sergei Shtylyov
                     ` (3 more replies)
  2013-04-24 15:48 ` [net-next PATCH 3/4] net: avoid false perf interpretations in frag code Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  5 siblings, 4 replies; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-24 15:48 UTC (permalink / raw)
  To: David S. Miller, Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, netdev, Eric Dumazet

Increase fragmentation hash bucket size to 1024 from old 64 elems.

After we increased the frag mem limits (in commit v3.8-rc3-503-gc2a9366)
the hash size of 64 elements is simply too small.  Also considering
the mem limit is per netns and the hash table is shared for all netns.

For the embedded people, note that this increase will change the hash
table/array from using approx 1 Kbytes to 16 Kbytes.
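(Rough arithmetic, assuming a bucket (struct inet_frag_bucket, an
hlist_head plus a spinlock) is about 16 bytes: 64*16 ≈ 1 Kbytes and
1024*16 ≈ 16 Kbytes.)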

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index eb1d6ee..4e15856 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -41,7 +41,7 @@ struct inet_frag_queue {
 	struct netns_frags	*net;
 };
 
-#define INETFRAGS_HASHSZ		64
+#define INETFRAGS_HASHSZ	1024
 
 struct inet_frag_bucket {
 	struct hlist_head	chain;

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [net-next PATCH 3/4] net: avoid false perf interpretations in frag code
  2013-04-24 15:47 [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue Jesper Dangaard Brouer
  2013-04-24 15:48 ` [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists" Jesper Dangaard Brouer
  2013-04-24 15:48 ` [net-next PATCH 2/4] net: increase frag hash size Jesper Dangaard Brouer
@ 2013-04-24 15:48 ` Jesper Dangaard Brouer
  2013-04-24 23:48   ` Eric Dumazet
  2013-04-24 15:48 ` [net-next PATCH 4/4] net: frag LRU list per CPU Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-24 15:48 UTC (permalink / raw)
  To: David S. Miller, Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, netdev, Eric Dumazet

The compiler makes us misinterpret performance issues in the frag code,
because it is auto-inlining functions.  Let's instead do explicit
inlining to make this situation obvious to the programmer.

The function inet_frag_find() gets the perf blame, because of the
auto-inlining of the functions inet_frag_create(), inet_frag_alloc() and
inet_frag_intern().

My solution is to explicitly inline inet_frag_alloc() and
inet_frag_intern(), but explicitly "noinline" inet_frag_create(),
in order to make it explicit to the performance engineer that the
creation phase is a bottleneck.  Then, when reading the code, the
programmer should notice the inline and see that the bottleneck is
really located in inet_frag_intern().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/ipv4/inet_fragment.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index cabe3d7..db30a01 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -240,7 +240,7 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
 }
 EXPORT_SYMBOL(inet_frag_evictor);
 
-static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
+static inline struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 		struct inet_frag_queue *qp_in, struct inet_frags *f,
 		void *arg)
 {
@@ -288,7 +288,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 	return qp;
 }
 
-static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
+static inline struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 		struct inet_frags *f, void *arg)
 {
 	struct inet_frag_queue *q;
@@ -308,7 +308,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 	return q;
 }
 
-static struct inet_frag_queue *inet_frag_create(struct netns_frags *nf,
+static noinline struct inet_frag_queue *inet_frag_create(struct netns_frags *nf,
 		struct inet_frags *f, void *arg)
 {
 	struct inet_frag_queue *q;

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [net-next PATCH 4/4] net: frag LRU list per CPU
  2013-04-24 15:47 [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2013-04-24 15:48 ` [net-next PATCH 3/4] net: avoid false perf interpretations in frag code Jesper Dangaard Brouer
@ 2013-04-24 15:48 ` Jesper Dangaard Brouer
  2013-04-25  0:25   ` Eric Dumazet
  2013-04-24 16:21 ` [net-next PATCH 0/4] net: frag patchset for fixing LRU scalabilityissue David Laight
  2013-04-24 17:27 ` [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue Hannes Frederic Sowa
  5 siblings, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-24 15:48 UTC (permalink / raw)
  To: David S. Miller, Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, netdev, Eric Dumazet

The global LRU list is the major bottleneck in fragmentation handling
(after the recent frag optimization).

Simply change to use an LRU list per CPU, instead of a single shared
LRU list.  This was the simplest approach to removing the global LRU
list that I could come up with.  The previous "direct hash cleaning"
approach was getting too complicated, and interacted badly with netns.

The /proc/sys/net/ipv4/ipfrag_*_thresh values are now per CPU limits,
and have been reduced to 2 Mbytes (from 4 MB).
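(With the constants introduced below, high_thresh = 2*1024*1024 = 2097152
bytes per CPU and low_thresh = high_thresh - 128*1024 = 1966080 bytes,
i.e. pruning makes room for roughly one full 64K fragment, which costs
about 129736 bytes.)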

Performance compared to net-next (953c96e):

Test-type:  20G64K    20G3F    20G64K+DoS  20G3F+DoS  20G64K+MQ 20G3F+MQ
----------  -------   -------  ----------  ---------  --------  -------
(953c96e)
 net-next:  17417.4   11376.5   3853.43     6170.56     174.8    402.9
LRU-pr-CPU: 19047.0   13503.9  10314.10    12363.20    1528.7   2064.9

I have also tested that a 512 Kbit/s simulated link (with HTB) still
works (while sending 3x UDP fragments) under the DoS test 20G3F+MQ,
which sends approx 1 Mpps on a 10Gbit/s NIC.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 Documentation/networking/ip-sysctl.txt  |    2 -
 include/net/inet_frag.h                 |  109 +++++++++++++++++++------------
 include/net/ipv6.h                      |    8 +-
 net/ipv4/inet_fragment.c                |   57 ++++++++++++----
 net/ipv4/ip_fragment.c                  |   24 ++++---
 net/ipv6/netfilter/nf_conntrack_reasm.c |    2 -
 net/ipv6/reassembly.c                   |    2 -
 7 files changed, 133 insertions(+), 71 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index f98ca63..dd972d2 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -67,7 +67,7 @@ ipfrag_high_thresh - INTEGER
 	Maximum memory used to reassemble IP fragments. When
 	ipfrag_high_thresh bytes of memory is allocated for this purpose,
 	the fragment handler will toss packets until ipfrag_low_thresh
-	is reached.
+	is reached. This max memory usage is per CPU.
 
 ipfrag_low_thresh - INTEGER
 	See ipfrag_high_thresh
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 4e15856..ca93056 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -1,17 +1,22 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
-#include <linux/percpu_counter.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/percpu.h>
 
-struct netns_frags {
+/* Maintain these resource limits per CPU, else performance will suffer
+ * due to cache-line bouncing
+ */
+struct frag_cpu_limit {
+	atomic_t                mem;
 	int			nqueues;
-	struct list_head	lru_list;
-	spinlock_t		lru_lock;
+	struct list_head        lru_list;
+	spinlock_t              lru_lock;
+};
 
-	/* The percpu_counter "mem" need to be cacheline aligned.
-	 *  mem.count must not share cacheline with other writers
-	 */
-	struct percpu_counter   mem ____cacheline_aligned_in_smp;
+struct netns_frags {
+	struct frag_cpu_limit __percpu *percpu;
 
 	/* sysctls */
 	int			timeout;
@@ -25,6 +30,7 @@ struct inet_frag_queue {
 	struct list_head	lru_list;   /* lru list member */
 	struct hlist_node	list;
 	atomic_t		refcnt;
+	u32			cpu_alloc;  /* for mem limit track per CPU */
 	struct sk_buff		*fragments; /* list of received fragments */
 	struct sk_buff		*fragments_tail;
 	ktime_t			stamp;
@@ -78,7 +84,8 @@ void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
 void inet_frag_kill(struct inet_frag_queue *q, struct inet_frags *f);
 void inet_frag_destroy(struct inet_frag_queue *q,
 				struct inet_frags *f, int *work);
-int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force);
+int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f,
+		      bool force, int on_cpu);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash)
 	__releases(&f->lock);
@@ -91,66 +98,86 @@ static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f
 
 /* Memory Tracking Functions. */
 
-/* The default percpu_counter batch size is not big enough to scale to
- * fragmentation mem acct sizes.
- * The mem size of a 64K fragment is approx:
- *  (44 fragments * 2944 truesize) + frag_queue struct(200) = 129736 bytes
- */
-static unsigned int frag_percpu_counter_batch = 130000;
-
-static inline int frag_mem_limit(struct netns_frags *nf)
-{
-	return percpu_counter_read(&nf->mem);
-}
-
 static inline void sub_frag_mem_limit(struct inet_frag_queue *q, int i)
 {
-	__percpu_counter_add(&q->net->mem, -i, frag_percpu_counter_batch);
+	int cpu = q->cpu_alloc;
+	struct frag_cpu_limit *percpu = per_cpu_ptr(q->net->percpu, cpu);
+	atomic_sub(i, &percpu->mem);
 }
 
 static inline void add_frag_mem_limit(struct inet_frag_queue *q, int i)
 {
-	__percpu_counter_add(&q->net->mem, i, frag_percpu_counter_batch);
-}
-
-static inline void init_frag_mem_limit(struct netns_frags *nf)
-{
-	percpu_counter_init(&nf->mem, 0);
+	int cpu = q->cpu_alloc;
+	struct frag_cpu_limit *percpu = per_cpu_ptr(q->net->percpu, cpu);
+	atomic_add(i, &percpu->mem);
 }
 
 static inline int sum_frag_mem_limit(struct netns_frags *nf)
 {
-	int res;
+	unsigned int sum = 0;
+	int cpu;
 
 	local_bh_disable();
-	res = percpu_counter_sum_positive(&nf->mem);
+	for_each_possible_cpu(cpu) {
+		struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);
+
+		sum += atomic_read(&percpu->mem);
+	}
 	local_bh_enable();
 
-	return res;
+	return sum;
+}
+
+static inline int sum_frag_nqueues(struct netns_frags *nf)
+{
+	unsigned int sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);
+
+		spin_lock(&percpu->lru_lock);
+		sum += percpu->nqueues;
+		spin_unlock(&percpu->lru_lock);
+	}
+
+	return sum;
 }
 
+
+/* LRU (Least Recently Used) resource functions */
+
 static inline void inet_frag_lru_move(struct inet_frag_queue *q)
 {
-	spin_lock(&q->net->lru_lock);
-	list_move_tail(&q->lru_list, &q->net->lru_list);
-	spin_unlock(&q->net->lru_lock);
+	int cpu = q->cpu_alloc;
+	struct frag_cpu_limit *percpu = per_cpu_ptr(q->net->percpu, cpu);
+
+	spin_lock(&percpu->lru_lock);
+	list_move_tail(&q->lru_list, &percpu->lru_list);
+	spin_unlock(&percpu->lru_lock);
 }
 
 static inline void inet_frag_lru_del(struct inet_frag_queue *q)
 {
-	spin_lock(&q->net->lru_lock);
+	int cpu = q->cpu_alloc;
+	struct frag_cpu_limit *percpu = per_cpu_ptr(q->net->percpu, cpu);
+
+	spin_lock(&percpu->lru_lock);
 	list_del(&q->lru_list);
-	q->net->nqueues--;
-	spin_unlock(&q->net->lru_lock);
+	percpu->nqueues--;
+	spin_unlock(&percpu->lru_lock);
 }
 
 static inline void inet_frag_lru_add(struct netns_frags *nf,
 				     struct inet_frag_queue *q)
 {
-	spin_lock(&nf->lru_lock);
-	list_add_tail(&q->lru_list, &nf->lru_list);
-	q->net->nqueues++;
-	spin_unlock(&nf->lru_lock);
+	int cpu = q->cpu_alloc;
+	struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);
+
+	spin_lock(&percpu->lru_lock);
+	list_add_tail(&q->lru_list, &percpu->lru_list);
+	percpu->nqueues++;
+	spin_unlock(&percpu->lru_lock);
 }
 
 /* RFC 3168 support :
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 0810aa5..f108b80 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -286,7 +286,7 @@ static inline bool ipv6_accept_ra(struct inet6_dev *idev)
 #if IS_ENABLED(CONFIG_IPV6)
 static inline int ip6_frag_nqueues(struct net *net)
 {
-	return net->ipv6.frags.nqueues;
+	return sum_frag_nqueues(&net->ipv6.frags);
 }
 
 static inline int ip6_frag_mem(struct net *net)
@@ -295,8 +295,10 @@ static inline int ip6_frag_mem(struct net *net)
 }
 #endif
 
-#define IPV6_FRAG_HIGH_THRESH	(4 * 1024*1024)	/* 4194304 */
-#define IPV6_FRAG_LOW_THRESH	(3 * 1024*1024)	/* 3145728 */
+/* Frag mem thresholds are per CPU */
+#define IPV6_FRAG_MAXSZ 	(1 * 128 *1024) /*  131072 */
+#define IPV6_FRAG_HIGH_THRESH	(2 * 1024*1024) /* 2097152 */
+#define IPV6_FRAG_LOW_THRESH	IPV6_FRAG_HIGH_THRESH - IPV6_FRAG_MAXSZ
 #define IPV6_FRAG_TIMEOUT	(60 * HZ)	/* 60 seconds */
 
 extern int __ipv6_addr_type(const struct in6_addr *addr);
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index db30a01..94c45c6 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -45,6 +45,18 @@ const u8 ip_frag_ecn_table[16] = {
 };
 EXPORT_SYMBOL(ip_frag_ecn_table);
 
+static inline int frag_mem_limit_on_cpu(struct netns_frags *nf, int on_cpu)
+{
+	struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, on_cpu);
+	return atomic_read(&percpu->mem);
+}
+
+static inline int frag_mem_limit(struct netns_frags *nf)
+{
+	int cpu = smp_processor_id();
+	return frag_mem_limit_on_cpu(nf, cpu);
+}
+
 static void inet_frag_secret_rebuild(unsigned long dummy)
 {
 	struct inet_frags *f = (struct inet_frags *)dummy;
@@ -104,10 +116,20 @@ EXPORT_SYMBOL(inet_frags_init);
 
 void inet_frags_init_net(struct netns_frags *nf)
 {
-	nf->nqueues = 0;
-	init_frag_mem_limit(nf);
-	INIT_LIST_HEAD(&nf->lru_list);
-	spin_lock_init(&nf->lru_lock);
+	int cpu;
+
+	nf->percpu = alloc_percpu(struct frag_cpu_limit);
+	if (!nf->percpu)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);
+
+		INIT_LIST_HEAD(&percpu->lru_list);
+		spin_lock_init(&percpu->lru_lock);
+		atomic_set(&percpu->mem, 0);
+		percpu->nqueues = 0;
+	}
 }
 EXPORT_SYMBOL(inet_frags_init_net);
 
@@ -119,13 +141,16 @@ EXPORT_SYMBOL(inet_frags_fini);
 
 void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f)
 {
+	int cpu;
+
 	nf->low_thresh = 0;
 
 	local_bh_disable();
-	inet_frag_evictor(nf, f, true);
+	for_each_possible_cpu(cpu)
+		inet_frag_evictor(nf, f, true, cpu);
 	local_bh_enable();
 
-	percpu_counter_destroy(&nf->mem);
+	free_percpu(nf->percpu);
 }
 EXPORT_SYMBOL(inet_frags_exit_net);
 
@@ -199,32 +224,35 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
 }
 EXPORT_SYMBOL(inet_frag_destroy);
 
-int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
+int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f,
+		      bool force, int on_cpu)
 {
 	struct inet_frag_queue *q;
 	int work, evicted = 0;
+	int cpu = (likely(on_cpu < 0)) ? smp_processor_id() : on_cpu;
+	struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);
 
 	if (!force) {
-		if (frag_mem_limit(nf) <= nf->high_thresh)
+		if (frag_mem_limit_on_cpu(nf, cpu) <= nf->high_thresh)
 			return 0;
 	}
 
-	work = frag_mem_limit(nf) - nf->low_thresh;
+	work = frag_mem_limit_on_cpu(nf, cpu) - nf->low_thresh;
 	while (work > 0) {
-		spin_lock(&nf->lru_lock);
+		spin_lock(&percpu->lru_lock);
 
-		if (list_empty(&nf->lru_list)) {
-			spin_unlock(&nf->lru_lock);
+		if (list_empty(&percpu->lru_list)) {
+			spin_unlock(&percpu->lru_lock);
 			break;
 		}
 
-		q = list_first_entry(&nf->lru_list,
+		q = list_first_entry(&percpu->lru_list,
 				struct inet_frag_queue, lru_list);
 		atomic_inc(&q->refcnt);
 		/* Remove q from list to avoid several CPUs grabbing it */
 		list_del_init(&q->lru_list);
 
-		spin_unlock(&nf->lru_lock);
+		spin_unlock(&percpu->lru_lock);
 
 		spin_lock(&q->lock);
 		if (!(q->last_in & INET_FRAG_COMPLETE))
@@ -298,6 +326,7 @@ static inline struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 		return NULL;
 
 	q->net = nf;
+	q->cpu_alloc = (u32) smp_processor_id();
 	f->constructor(q, arg);
 	add_frag_mem_limit(q, f->qsize);
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index cda5514..7bbe7cd 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -18,6 +18,7 @@
  *		John McDonald	:	0 length frag bug.
  *		Alexey Kuznetsov:	SMP races, threading, cleanup.
  *		Patrick McHardy :	LRU queue of frag heads for evictor.
+ *		Jesper D. Brouer:	SMP/NUMA scalability
  */
 
 #define pr_fmt(fmt) "IPv4: " fmt
@@ -88,7 +89,7 @@ static struct inet_frags ip4_frags;
 
 int ip_frag_nqueues(struct net *net)
 {
-	return net->ipv4.frags.nqueues;
+	return sum_frag_nqueues(&net->ipv4.frags);
 }
 
 int ip_frag_mem(struct net *net)
@@ -183,7 +184,7 @@ static void ip_evictor(struct net *net)
 {
 	int evicted;
 
-	evicted = inet_frag_evictor(&net->ipv4.frags, &ip4_frags, false);
+	evicted = inet_frag_evictor(&net->ipv4.frags, &ip4_frags, false, -1);
 	if (evicted)
 		IP_ADD_STATS_BH(net, IPSTATS_MIB_REASMFAILS, evicted);
 }
@@ -816,6 +817,12 @@ static inline void ip4_frags_ctl_register(void)
 }
 #endif
 
+/* A 64K fragment consumes 129736 bytes (44*2944)+200
+ * (1500 truesize == 2944, sizeof(struct ipq) == 200)
+ */
+#define IPV4_FRAG_MAXSZ 	(1 * 128 * 1024)  /*  131072 */
+#define IPV4_FRAG_HIGH_THRESH	(2 * 1024 * 1024) /* 2097152 */
+
 static int __net_init ipv4_frags_init_net(struct net *net)
 {
 	/* Fragment cache limits.
@@ -825,15 +832,12 @@ static int __net_init ipv4_frags_init_net(struct net *net)
 	 * queue struct (inet_frag_queue (ipv4:ipq/ipv6:frag_queue))
 	 * and the SKB's truesize.
 	 *
-	 * A 64K fragment consumes 129736 bytes (44*2944)+200
-	 * (1500 truesize == 2944, sizeof(struct ipq) == 200)
-	 *
-	 * We will commit 4MB at one time. Should we cross that limit
-	 * we will prune down to 3MB, making room for approx 8 big 64K
-	 * fragments 8x128k.
+	 * These mem limits are per CPU (scalability reasons), for each CPU
+	 * we will commit 2MB at one time. Should we cross that limit
+	 * we will prune down making room for one big 64K fragment 128k.
 	 */
-	net->ipv4.frags.high_thresh = 4 * 1024 * 1024;
-	net->ipv4.frags.low_thresh  = 3 * 1024 * 1024;
+	net->ipv4.frags.high_thresh = IPV4_FRAG_HIGH_THRESH;
+	net->ipv4.frags.low_thresh  = IPV4_FRAG_HIGH_THRESH - IPV4_FRAG_MAXSZ;
 	/*
 	 * Important NOTE! Fragment queue must be destroyed before MSL expires.
 	 * RFC791 is wrong proposing to prolongate timer each fragment arrival
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 7cfa829..291d1d8 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -586,7 +586,7 @@ struct sk_buff *nf_ct_frag6_gather(struct sk_buff *skb, u32 user)
 	fhdr = (struct frag_hdr *)skb_transport_header(clone);
 
 	local_bh_disable();
-	inet_frag_evictor(&net->nf_frag.frags, &nf_frags, false);
+	inet_frag_evictor(&net->nf_frag.frags, &nf_frags, false, -1);
 	local_bh_enable();
 
 	fq = fq_find(net, fhdr->identification, user, &hdr->saddr, &hdr->daddr,
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 74505c5..399321d 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -535,7 +535,7 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
 		return 1;
 	}
 
-	evicted = inet_frag_evictor(&net->ipv6.frags, &ip6_frags, false);
+	evicted = inet_frag_evictor(&net->ipv6.frags, &ip6_frags, false, -1);
 	if (evicted)
 		IP6_ADD_STATS_BH(net, ip6_dst_idev(skb_dst(skb)),
 				 IPSTATS_MIB_REASMFAILS, evicted);

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* RE: [net-next PATCH 0/4] net: frag patchset for fixing LRU scalabilityissue
  2013-04-24 15:47 [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2013-04-24 15:48 ` [net-next PATCH 4/4] net: frag LRU list per CPU Jesper Dangaard Brouer
@ 2013-04-24 16:21 ` David Laight
  2013-04-25 11:39   ` Jesper Dangaard Brouer
  2013-04-24 17:27 ` [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue Hannes Frederic Sowa
  5 siblings, 1 reply; 36+ messages in thread
From: David Laight @ 2013-04-24 16:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David S. Miller, Hannes Frederic Sowa
  Cc: netdev, Eric Dumazet

> I have dropped the idea of doing "direct hash cleaning".  Instead I
> have simply chosen to solve the global LRU list problem by making
> the LRU list per CPU.

How can a per-cpu LRU list work?
I see two immediate problems:
- Ensuring the normal 'allocate' and 'free' are always done
  on the same cpu (free will need to remove items from any
  LRU list).
- Ensuring that all the items aren't on the LRU lists
  of other cpus - meaning one can't be allocated.

The only way this could work is if the allocation limit
is also per-cpu and you can guarantee that the alloc and
free for any given item will always happen on the same cpu.

(alloc as in 'add to LRU list', free as in 'remove from LRU list').

	David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue
  2013-04-24 15:47 [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2013-04-24 16:21 ` [net-next PATCH 0/4] net: frag patchset for fixing LRU scalabilityissue David Laight
@ 2013-04-24 17:27 ` Hannes Frederic Sowa
  5 siblings, 0 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-24 17:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, netdev, Eric Dumazet

On Wed, Apr 24, 2013 at 05:47:55PM +0200, Jesper Dangaard Brouer wrote:
> This patchset replaces my previous patchset:
>   "net: frag code fixes and RFC for LRU removal"
>   http://thread.gmane.org/gmane.linux.network/266323/
> 
> I have dropped the idea of doing "direct hash cleaning".  Instead I
> have simply chosen to solve the global LRU list problem by making
> the LRU list per CPU.
> 
> 
> Patch-01: Revert "inet: limit length of fragment queue hash table bucket lists"
> - Sorry, this patch is broken and needs to be reverted.
>   I'll leave it up to Hannes to come up with a better solution,
>   perhaps better IPv6 hashing for frag queues.

In general I am fine with this approach. I will have a look at whether
we can mitigate the IPv6 hash collisions without reverting 279e9f2
("ipv6: optimize inet6_hash_frag()") on top of this patchset. I will
send a patch if these changes land in net-next.

> Patch-02: net: increase frag hash size
> - Simply increase the hash size
> 
> Patch-03: net: avoid false perf interpretations in frag code
> - The compiler makes us misinterpret performance issues in the frag
>   code, because it is auto-inlining functions.  This causes too many
>   arguments between developers on the list.
> 
> Patch-04: net: frag LRU list per CPU
> - Change the global LRU list to be per CPU instead.

I will review the other patches later today. Btw. do we have to deal
with online cpu-addition/removal here?

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 2/4] net: increase frag hash size
  2013-04-24 15:48 ` [net-next PATCH 2/4] net: increase frag hash size Jesper Dangaard Brouer
@ 2013-04-24 22:09   ` Sergei Shtylyov
  2013-04-25 10:13     ` Jesper Dangaard Brouer
  2013-04-24 23:48   ` Eric Dumazet
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Sergei Shtylyov @ 2013-04-24 22:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, Hannes Frederic Sowa, netdev, Eric Dumazet

Hello.

On 24-04-2013 19:48, Jesper Dangaard Brouer wrote:

> Increase fragmentation hash bucket size to 1024 from old 64 elems.

> After we increased the frag mem limits (in commit v3.8-rc3-503-gc2a9366)

    This is not a commit ID; a commit ID is the SHA1 key alone. And you should also 
specify that commit's summary line with it, in parens.

> the hash size of 64 elements is simply too small.  Also considering
> the mem limit is per netns and the hash table is shared for all netns.

> For the embedded people, note that this increase will change the hash
> table/array from using approx 1 Kbytes to 16 Kbytes.

> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

WBR, Sergei

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 3/4] net: avoid false perf interpretations in frag code
  2013-04-24 15:48 ` [net-next PATCH 3/4] net: avoid false perf interpretations in frag code Jesper Dangaard Brouer
@ 2013-04-24 23:48   ` Eric Dumazet
  2013-04-24 23:54     ` David Miller
  2013-04-25 10:57     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 36+ messages in thread
From: Eric Dumazet @ 2013-04-24 23:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> The compiler makes us misinterpret performance issues in the frag code,
> because it is auto-inlining functions.  Let's instead do explicit
> inlining to make this situation obvious to the programmer.
> 
> The function inet_frag_find() gets the perf blame, because of the
> auto-inlining of the functions inet_frag_create(), inet_frag_alloc() and
> inet_frag_intern().
> 
> My solution is to explicitly inline inet_frag_alloc() and
> inet_frag_intern(), but explicitly "noinline" inet_frag_create(),
> in order to make it explicit to the performance engineer that the
> creation phase is a bottleneck.  Then, when reading the code, the
> programmer should notice the inline and see that the bottleneck is
> really located in inet_frag_intern().
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

There is no way we add inline/noinline attributes to help developers to
use performance tools.

noinline can make sense when we want to avoid consuming too much stack
space, and in this case we use the explicit noinline_for_stack.

Another case would be when we know a particular function is called on
very rare occasions, and we want to avoid compiler being smart and
inline the function in the caller.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 2/4] net: increase frag hash size
  2013-04-24 15:48 ` [net-next PATCH 2/4] net: increase frag hash size Jesper Dangaard Brouer
  2013-04-24 22:09   ` Sergei Shtylyov
@ 2013-04-24 23:48   ` Eric Dumazet
  2013-04-25  3:26   ` Hannes Frederic Sowa
  2013-04-25 19:52   ` [net-next PATCH V2] " Jesper Dangaard Brouer
  3 siblings, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2013-04-24 23:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> Increase fragmentation hash bucket size to 1024 from old 64 elems.
> 
> After we increased the frag mem limits (in commit v3.8-rc3-503-gc2a9366)
> the hash size of 64 elements is simply too small.  Also considering
> the mem limit is per netns and the hash table is shared for all netns.
> 
> For the embedded people, note that this increase will change the hash
> table/array from using approx 1 Kbytes to 16 Kbytes.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> 
>  include/net/inet_frag.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index eb1d6ee..4e15856 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -41,7 +41,7 @@ struct inet_frag_queue {
>  	struct netns_frags	*net;
>  };
>  
> -#define INETFRAGS_HASHSZ		64
> +#define INETFRAGS_HASHSZ	1024

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 3/4] net: avoid false perf interpretations in frag code
  2013-04-24 23:48   ` Eric Dumazet
@ 2013-04-24 23:54     ` David Miller
  2013-04-25 10:57     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 36+ messages in thread
From: David Miller @ 2013-04-24 23:54 UTC (permalink / raw)
  To: eric.dumazet; +Cc: brouer, hannes, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 24 Apr 2013 16:48:27 -0700

> noinline can make sense when we want to avoid consuming too much stack
> space, and in this case we use the explicit noinline_for_stack.
> 
> Another case would be when we know a particular function is called on
> very rare occasions, and we want to avoid compiler being smart and
> inline the function in the caller.

Yet another case is where we must force the function in a special
section, f.e. __kprobes.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists"
  2013-04-24 15:48 ` [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists" Jesper Dangaard Brouer
@ 2013-04-25  0:00   ` Eric Dumazet
  2013-04-25 13:10     ` Jesper Dangaard Brouer
  2013-05-02  7:59     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 36+ messages in thread
From: Eric Dumazet @ 2013-04-25  0:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.
> 
> The problem with commit 5a3da1fe ("inet: limit length of fragment queue
> hash table bucket lists") is that, once we hit the hash depth limit (of
> 128), we *keep* the existing frag queues, not allowing new frag
> queues to be created.  Thus, an attacker can effectively block handling
> of fragments for 30 sec (as each frag queue has a timeout of 30 sec).
> 

I do not think it's good to revert this patch. It was a step in the
right direction.

Limiting chain length to 128 is good.

An attacker can attack a defrag unit no matter what strategy you use;
that's why fragments are doomed.

The only way to resist an attack is to have enough memory in the defrag
unit to accumulate 30 seconds worth of traffic.

That's 30GB of memory if you receive 1GB per second, not counting the
overhead.

If the attacker knows the IP address of the user of your defrag unit,
you are doomed because IP IDs are 16 bits.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 4/4] net: frag LRU list per CPU
  2013-04-24 15:48 ` [net-next PATCH 4/4] net: frag LRU list per CPU Jesper Dangaard Brouer
@ 2013-04-25  0:25   ` Eric Dumazet
  2013-04-25  2:05     ` Eric Dumazet
  2013-04-25 13:59     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 36+ messages in thread
From: Eric Dumazet @ 2013-04-25  0:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> The global LRU list is the major bottleneck in fragmentation handling
> (after the recent frag optimization).
> 
> Simply change to use an LRU list per CPU, instead of a single shared
> LRU list.  This was the simplest approach to removing the global LRU
> list that I could come up with.  The previous "direct hash cleaning"
> approach was getting too complicated, and interacted badly with netns.
> 
> The /proc/sys/net/ipv4/ipfrag_*_thresh values are now per CPU limits,
> and have been reduced to 2 Mbytes (from 4 MB).
> 
> Performance compared to net-next (953c96e):
> 
> Test-type:  20G64K    20G3F    20G64K+DoS  20G3F+DoS  20G64K+MQ 20G3F+MQ
> ----------  -------   -------  ----------  ---------  --------  -------
> (953c96e)
>  net-next:  17417.4   11376.5   3853.43     6170.56     174.8    402.9
> LRU-pr-CPU: 19047.0   13503.9  10314.10    12363.20    1528.7   2064.9

Having a per-cpu memory limit is going to be a nightmare for machines
with 64+ cpus.

Most machines use a single cpu to receive network packets. In some
situations, every network interrupt is balanced onto all cpus; fragments
for the same reassembled packet can be serviced on different cpus.

So your results are good because your irq affinities were properly
tuned.

Why don't you remove the lru instead ?

Clearly, removing the oldest frag was an implementation choice.

We know that a slow sender has no chance to complete a packet if the
attacker can create new fragments fast enough : frag_evictor() will keep
the attacker fragments in memory and throw away good fragments.

I wish we could make all this code simpler instead of more complex.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 4/4] net: frag LRU list per CPU
  2013-04-25  0:25   ` Eric Dumazet
@ 2013-04-25  2:05     ` Eric Dumazet
  2013-04-25 14:06       ` Jesper Dangaard Brouer
  2013-04-25 13:59     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2013-04-25  2:05 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Wed, 2013-04-24 at 17:25 -0700, Eric Dumazet wrote:

> We know that a slow sender has no chance to complete a packet if the
> attacker can create new fragments fast enough : frag_evictor() will keep
> the attacker fragments in memory and throw away good fragments.
> 

By the way, the frag_evictor() idea of cleaning 20% or 30% of the frags
simply doesn't scale to thousands of fragments.

It adds huge latencies in softirq context.

If we really want to evict old fragments before the expiration timer
fires, then we can introduce a garbage collector in a work queue, and
remove the need for a timer per fragment.
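
A very rough sketch of what that could look like (purely illustrative;
the gc_work member and the frag_expired() helper are hypothetical and
not part of this patchset):

 static void frag_gc_worker(struct work_struct *work)
 {
 	struct netns_frags *nf = container_of(to_delayed_work(work),
 					      struct netns_frags, gc_work);
 	struct inet_frag_queue *q, *tmp;

 	spin_lock(&nf->lru_lock);
 	/* LRU order: oldest queues first, stop at the first live one */
 	list_for_each_entry_safe(q, tmp, &nf->lru_list, lru_list) {
 		if (!frag_expired(q))
 			break;
 		list_del_init(&q->lru_list);
 		/* real code would drop the refcount / free outside the lock */
 	}
 	spin_unlock(&nf->lru_lock);

 	/* periodic rescan replaces the per-fragment timers */
 	schedule_delayed_work(&nf->gc_work, HZ);
 }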

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 2/4] net: increase frag hash size
  2013-04-24 15:48 ` [net-next PATCH 2/4] net: increase frag hash size Jesper Dangaard Brouer
  2013-04-24 22:09   ` Sergei Shtylyov
  2013-04-24 23:48   ` Eric Dumazet
@ 2013-04-25  3:26   ` Hannes Frederic Sowa
  2013-04-25 19:52   ` [net-next PATCH V2] " Jesper Dangaard Brouer
  3 siblings, 0 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-25  3:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, netdev, Eric Dumazet

On Wed, Apr 24, 2013 at 05:48:31PM +0200, Jesper Dangaard Brouer wrote:
> Increase fragmentation hash bucket size to 1024 from old 64 elems.
> 
> After we increased the frag mem limits (in commit v3.8-rc3-503-gc2a9366)
> the hash size of 64 elements is simply too small.  Also considering
> the mem limit is per netns and the hash table is shared for all netns.
> 
> For the embedded people, note that this increase will change the hash
> table/array from using approx 1 Kbytes to 16 Kbytes.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

If this change goes in and patch 1 does not get applied I do think we
can even lower the INETFRAGS_MAXDEPTH a bit (although I don't see a need
for that because of stability issues)? It would just decrease latency
a bit in case someone attacks the hash buckets directly.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 2/4] net: increase frag hash size
  2013-04-24 22:09   ` Sergei Shtylyov
@ 2013-04-25 10:13     ` Jesper Dangaard Brouer
  2013-04-25 12:13       ` Sergei Shtylyov
  2013-04-25 19:11       ` David Miller
  0 siblings, 2 replies; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-25 10:13 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: David S. Miller, Hannes Frederic Sowa, netdev, Eric Dumazet

On Thu, 2013-04-25 at 02:09 +0400, Sergei Shtylyov wrote:
> On 24-04-2013 19:48, Jesper Dangaard Brouer wrote:
> 
> > Increase fragmentation hash bucket size to 1024 from old 64 elems.
> 
> > After we increased the frag mem limits (in commit v3.8-rc3-503-gc2a9366)
> 
>     This is not a commit ID, commit ID is SHA1 key alone. And you should also 
> specify that commit's summary line with it, in parens.

Did you know that you can use the string "v3.8-rc3-503-gc2a9366" like:
 git show v3.8-rc3-503-gc2a9366
That's why I like this kind of commit ID better, as it also tells the
reader approximately which version this patch went into.

But you are right about the title in parentheses...

Dave, do you want me to resubmit this, with nitpicked commit message?

--Jesper

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 3/4] net: avoid false perf interpretations in frag code
  2013-04-24 23:48   ` Eric Dumazet
  2013-04-24 23:54     ` David Miller
@ 2013-04-25 10:57     ` Jesper Dangaard Brouer
  2013-04-25 19:13       ` David Miller
  1 sibling, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-25 10:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Wed, 2013-04-24 at 16:48 -0700, Eric Dumazet wrote:
> On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> > The compiler makes us misinterpret performance issues in the frag code,
> > because it is auto-inlining functions.  Let's instead do explicit
> > inlining to make this situation obvious to the programmer.
> > 
> > The function inet_frag_find() gets the perf blame, because of the
> > auto-inlining of the functions inet_frag_create(), inet_frag_alloc() and
> > inet_frag_intern().
> > 
> > My solution is to explicitly inline inet_frag_alloc() and
> > inet_frag_intern(), but explicitly "noinline" inet_frag_create(),
> > in order to make it explicit to the performance engineer that the
> > creation phase is a bottleneck.  Then, when reading the code, the
> > programmer should notice the inline and see that the bottleneck is
> > really located in inet_frag_intern().
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> 
> There is no way we add inline/noinline attributes to help developers to
> use performance tools.

I take that as a NACK

Would we add the "inlines" only to the code, to make it clear what is
happening in the code?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [net-next PATCH 0/4] net: frag patchset for fixing LRU scalabilityissue
  2013-04-24 16:21 ` [net-next PATCH 0/4] net: frag patchset for fixing LRU scalabilityissue David Laight
@ 2013-04-25 11:39   ` Jesper Dangaard Brouer
  2013-04-25 12:57     ` David Laight
  0 siblings, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-25 11:39 UTC (permalink / raw)
  To: David Laight; +Cc: David S. Miller, Hannes Frederic Sowa, netdev, Eric Dumazet

On Wed, 2013-04-24 at 17:21 +0100, David Laight wrote:
> > I have dropped the idea of doing "direct hash cleaning".  Instead I
> > have simply chosen to solve the global LRU list problem by making
> > the LRU list per CPU.
> 
> How can a per-cpu LRU list work?
> I see two immediate problems:
> - Ensuring the normal 'allocate' and 'free' are always done
>   on the same cpu (free will need to remove items from any
>   LRU list).
> - Ensuring that there all the items aren't on the LRU lists
>   of other cpus - meaning one can't be allocated.
> 
> The only way this could work is if the allocation limit
> is also per-cpu and you can guarantee that the alloc and
> free for any given item will always happen on the same cpu.
> 
> (alloc as in 'add to LRU list, free as in 'remove from LRU list).

Please read patch-04.

I've added "cpu_alloc" to record the CPU were the first fragment got
allocated.  After which all frags gets accounted to that CPU,
add/remove/timeout etc.
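
For reference, the relevant lines from patch 4 (condensed) show the
pattern: the CPU is recorded once at allocation time, and every later
accounting/LRU helper resolves the per-CPU state from it:

 	q->cpu_alloc = (u32) smp_processor_id();   /* in inet_frag_alloc() */

 	/* e.g. in add_frag_mem_limit() / inet_frag_lru_del() */
 	struct frag_cpu_limit *percpu = per_cpu_ptr(q->net->percpu, q->cpu_alloc);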

--Jesper

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 2/4] net: increase frag hash size
  2013-04-25 10:13     ` Jesper Dangaard Brouer
@ 2013-04-25 12:13       ` Sergei Shtylyov
  2013-04-25 19:11       ` David Miller
  1 sibling, 0 replies; 36+ messages in thread
From: Sergei Shtylyov @ 2013-04-25 12:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, Hannes Frederic Sowa, netdev, Eric Dumazet

Hello.

On 25-04-2013 14:13, Jesper Dangaard Brouer wrote:

>>> Increase fragmentation hash bucket size to 1024 from old 64 elems.

>>> After we increased the frag mem limits (in commit v3.8-rc3-503-gc2a9366)

>>      This is not a commit ID, commit ID is SHA1 key alone. And you should also
>> specify that commit's summary line with it, in parens.

> Did you know that you can use the string "v3.8-rc3-503-gc2a9366" like:
>   git show v3.8-rc3-503-gc2a9366

    No, I didn't. I only know such a string is the result of the command 'git 
describe' (if I'm not mistaken).

> That's why I like this kind of commit ID better, as it also tell the
> reader what approx version this patch were in.

   Understood. Everybody else is using SHA1 though, for which gitweb certainly 
could generate a commit link if it encountered a SHA1 in the changelog. Now, 
with the switch of kernel.org to cgit, this ability seems to be gone.
That's why (in part) I don't think using cgit was such a good idea... I want 
the old gitweb back!

> --Jesper

WBR, Sergei

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [net-next PATCH 0/4] net: frag patchset for fixing LRU scalabilityissue
  2013-04-25 11:39   ` Jesper Dangaard Brouer
@ 2013-04-25 12:57     ` David Laight
  0 siblings, 0 replies; 36+ messages in thread
From: David Laight @ 2013-04-25 12:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, Hannes Frederic Sowa, netdev, Eric Dumazet

> On Wed, 2013-04-24 at 17:21 +0100, David Laight wrote:
> > > I have dropped the idea of doing "direct hash cleaning".  Instead I
> > > have simply chosen to solve the global LRU list problem by making
> > > the LRU list per CPU.
> >
> > How can a per-cpu LRU list work?
> > I see two immediate problems:
> > - Ensuring the normal 'allocate' and 'free' are always done
> >   on the same cpu (free will need to remove items from any
> >   LRU list).
> > - Ensuring that all the items aren't on the LRU lists
> >   of other cpus - meaning one can't be allocated.
> >
> > The only way this could work is if the allocation limit
> > is also per-cpu and you can guarantee that the alloc and
> > free for any given item will always happen on the same cpu.
> >
> > (alloc as in 'add to LRU list', free as in 'remove from LRU list').
> 
> Please read patch-04.
> 
> I've added "cpu_alloc" to record the CPU were the first fragment got
> allocated.  After which all frags gets accounted to that CPU,
> add/remove/timeout etc.

Hmmm... the general idea behind making anything 'per-cpu' is so
that locking isn't required.
If you end up doing the 'remove' accounting on a different cpu then
all the actions require locks - which makes per-cpu data more or
less pointless.

In which case you are much better off using a hash to determine
the list.

	David


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists"
  2013-04-25  0:00   ` Eric Dumazet
@ 2013-04-25 13:10     ` Jesper Dangaard Brouer
  2013-04-25 13:58       ` David Laight
  2013-05-02  7:59     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-25 13:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Wed, 2013-04-24 at 17:00 -0700, Eric Dumazet wrote:
> On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> > This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.
> > 
> > The problem with commit 5a3da1fe ("inet: limit length of fragment queue
> > hash table bucket lists") is that, once we hit the hash depth limit (of
> > 128), we *keep* the existing frag queues, not allowing new frag
> > queues to be created.  Thus, an attacker can effectively block handling
> > of fragments for 30 sec (as each frag queue has a timeout of 30 sec).
> > 
> 
> I do not think its good to revert this patch. It was a step in right
> direction.

But in its current state I consider this patch dangerous.

> Limiting chain length to 128 is good.

Yes, it's good to have a limit on the hash depth, but with the current
mem limit per netns this creates an unfortunate side-effect.  There is a
disconnect between the netns mem limits and the global hash table.

Even with a hash size of 1024, this just postpones the problem.
We now need 35 netns instances to reach the point where we block all
fragments for 30 sec.

Given a min frag size of 1108 bytes:
  1024*128*1108 = 145227776 
  145227776/(4*1024*1024) = 34.62500000000000000000

The reason this is inevitable is that the attacker's invalid fragments
will never finish (timeout 30 sec), while valid fragments will complete
and "exit" the queue; thus the end result is that the hash bucket is
filled with the attacker's invalid/incomplete fragments.  IMHO this is
a very dangerous "feature" to support.

--Jesper

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists"
  2013-04-25 13:10     ` Jesper Dangaard Brouer
@ 2013-04-25 13:58       ` David Laight
  0 siblings, 0 replies; 36+ messages in thread
From: David Laight @ 2013-04-25 13:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet
  Cc: David S. Miller, Hannes Frederic Sowa, netdev

> The reason this is inevitable is that the attacker's invalid fragments
> will never finish (timeout 30 sec), while valid fragments will complete
> and "exit" the queue; thus the end result is that the hash bucket is
> filled with the attacker's invalid/incomplete fragments.  IMHO this is
> a very dangerous "feature" to support.

If you are being attacked, the hash buckets will immediately
be filled with invalid packets, and no valid packets will enter.
If you delete old entries to add new ones then any valid
packet will soon be flushed out by invalid ones.
The only time you'll manage to assemble anything is if
the valid fragments arrive 'back to back'.

The only hope is to part-decode initial fragments and decide
whether they are valid.

	David



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 4/4] net: frag LRU list per CPU
  2013-04-25  0:25   ` Eric Dumazet
  2013-04-25  2:05     ` Eric Dumazet
@ 2013-04-25 13:59     ` Jesper Dangaard Brouer
  2013-04-25 14:10       ` Eric Dumazet
  2013-04-25 14:18       ` Eric Dumazet
  1 sibling, 2 replies; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-25 13:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Wed, 2013-04-24 at 17:25 -0700, Eric Dumazet wrote:
> On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> > The global LRU list is the major bottleneck in fragmentation handling
> > (after the recent frag optimization).
> > 
> > Simply change to use an LRU list per CPU, instead of a single shared
> > LRU list.  This was the simplest approach to removing the single LRU
> > list that I could come up with.  The previous "direct hash cleaning"
> > approach was getting too complicated, and interacted badly with netns.
> > 
> > The /proc/sys/net/ipv4/ipfrag_*_thresh values are now per CPU limits,
> > and have been reduced to 2 Mbytes (from 4 MB).
> > 
> > Performance compared to net-next (953c96e):
> > 
> > Test-type:  20G64K    20G3F    20G64K+DoS  20G3F+DoS  20G64K+MQ 20G3F+MQ
> > ----------  -------   -------  ----------  ---------  --------  -------
> > (953c96e)
> >  net-next:  17417.4   11376.5   3853.43     6170.56     174.8    402.9
> > LRU-pr-CPU: 19047.0   13503.9  10314.10    12363.20    1528.7   2064.9
> 
> Having a per-cpu memory limit is going to be a nightmare for machines
> with 64+ cpus.

I do see your concern, but the struct frag_cpu_limit is only 26 bytes
(well, actually 32 bytes due to alignment).  And the limit sort of
scales with the system size.  But yes, I'm not a complete fan of it...
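
[Note: patch-04 itself is not quoted in this sub-thread, so the
following is only a rough sketch of what a per-CPU accounting struct of
roughly that size could look like; the field names are assumptions, not
the actual patch code.]

struct frag_cpu_limit {
	atomic_t		mem;		/* bytes accounted to this CPU        */
	struct list_head	lru_list;	/* per-CPU LRU list of frag queues    */
	spinlock_t		lru_lock;	/* still needed: timeout/eviction may
						 * run on another CPU (David's point) */
};

[An atomic counter, a list head and a lock come to the mid-20s of bytes
on a typical 64-bit build without lock debugging, padded up by
alignment -- roughly in line with the 26/32-byte figures above.]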


> Most machines use a single cpu to receive network packets. In some
> situations, every network interrupt is balanced onto all cpus.
> Fragments for the same reassembled packet can be serviced on different
> cpus.
> 
> So your results are good because your irq affinities were properly
> tuned.

Yes, the irq affinities need to be aligned for this to work.  I do see
your point about the single CPU that distributes via RSS (Receive Side
Scaling).


> Why don't you remove the lru instead ?

That was my attempt in my previous patch set:
  "net: frag code fixes and RFC for LRU removal"
  http://thread.gmane.org/gmane.linux.network/266323/

I thought that you "shot me down" on that approach.  That is why I'm
doing all this work on the LRU-per-CPU stuff now... (this is actually
consuming a lot of my time, not knowing which direction you want me to
run in...)

We/you have to make a choice between:
1) "Remove LRU and do direct cleaning on hash table"
2) "Fix LRU mem acct scalability issue"


> Clearly, removing the oldest frag was an implementation choice.
> 
> We know that a slow sender has no chance to complete a packet if the
> attacker can create new fragments fast enough : frag_evictor() will keep
> the attacker fragments in memory and throw away good fragments.

Let me quote myself (which you cut out):

On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> I have also tested that a 512 Kbit/s simulated link (with HTB) still
> works (with sending 3x UDP fragments) under the DoS test 20G3F+MQ,
> which is sending approx 1Mpps on a 10Gbit/s NIC

My experiments show that removing the oldest frag is actually quite a
good solution.  And the LRU list does have some advantages...

The LRU list is the whole reason that I can make a 512 Kbit/s link work
at the same time as a 10Gbit/s DoS attack (well, my packet generator
was limited to 1 million packets per sec, which is not 10G).

The calculations:
  The 512 Kbit/s link will send packets spaced 23.44 ms apart:
  (1500*8) / 512000 = 0.0234375 s

Thus, I need enough buffering/mem limit to keep a minimum of 24 ms worth
of data for a new fragment to arrive, which will move the frag queue to
the tail of the LRU.

On a 1Gbit/s link this is approx 24 MB:
 (0.0234375 * 1000*10^6 / 10^6 = 23.4375 MB)

On a 10Gbit/s link this is approx 240 MB:
 (0.0234375 * 10000*10^6 / 10^6 = 234.375 MB)

Yes, the frag sizes do get accounted as bigger, but I hope you get the
point, which is:
 We don't need that much memory to "protect" fragments from a slow
sender, with the LRU system.
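
[Note: the "move the frag queue to the tail of the LRU" step above is
the usual LRU touch on fragment arrival.  A minimal sketch of that
operation, assuming the 3.9-era per-netns lru_list/lru_lock fields; the
helper name frag_lru_touch is illustrative, not the actual one.]

static inline void frag_lru_touch(struct inet_frag_queue *q)
{
	/* A new fragment for an existing queue refreshes its position,
	 * so a steadily (even slowly) refreshed queue stays away from
	 * the evicted (head) end of the list. */
	spin_lock(&q->net->lru_lock);
	if (!list_empty(&q->lru_list))
		list_move_tail(&q->lru_list, &q->net->lru_list);
	spin_unlock(&q->net->lru_lock);
}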

--Jesper

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 4/4] net: frag LRU list per CPU
  2013-04-25  2:05     ` Eric Dumazet
@ 2013-04-25 14:06       ` Jesper Dangaard Brouer
  2013-04-25 14:37         ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-25 14:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Wed, 2013-04-24 at 19:05 -0700, Eric Dumazet wrote:
> By the way, the frag_evictor() idea of cleaning 20% or 30% of the
> frags simply doesn't scale to thousands of fragments.

Yes, I know; that's why I changed the mem limit in this patch to "only"
clean 128K (one max frag mem acct size).

> It adds huge latencies in softirq context.

Yes, it sucks.  This is one thing I would really like to get rid of.
Does the "direct-hash-cleaning" solve this?


> If we really want to evict old fragments before the expiration timer
> fires, then we can introduce a garbage collector in a work queue, and
> remove the need for a timer per fragment.

I like this idea (I just don't know how to implement it yet).  My perf
results with "direct-hash-cleaning" show that we can hit a very
unfortunate situation where the add/remove timer spinlock gets
hot/congested.

--Jesper

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 4/4] net: frag LRU list per CPU
  2013-04-25 13:59     ` Jesper Dangaard Brouer
@ 2013-04-25 14:10       ` Eric Dumazet
  2013-04-25 14:18       ` Eric Dumazet
  1 sibling, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2013-04-25 14:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Thu, 2013-04-25 at 15:59 +0200, Jesper Dangaard Brouer wrote:

>  We don't need that much memory to "protect" fragments from a slow
> sender, with the LRU system.

Only on your particular synthetic workload.

A slow sender sends one frame per second.

You need at least 1GB of memory to cope with that. 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 4/4] net: frag LRU list per CPU
  2013-04-25 13:59     ` Jesper Dangaard Brouer
  2013-04-25 14:10       ` Eric Dumazet
@ 2013-04-25 14:18       ` Eric Dumazet
  2013-04-25 19:15         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2013-04-25 14:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Thu, 2013-04-25 at 15:59 +0200, Jesper Dangaard Brouer wrote:

> Let me quote myself (which you cut out):
> 

I cut because most people don't bother to read mails with 100 lines.

Let me be clear:

Q) You want performance, and LRU hurts performance
A) Remove the LRU, problem solved.

Q) You want performance and a big number of frags to cope with attacks
A) Resize the hash table when the admin sets a bigger limit.
   If one million fragments are allowed, then the hash table should have
1 million slots. A single percpu_counter is enough to track memory use.

Q) How to perform eviction if the LRU is removed?
A) In the softirq handler, eviction is bad. If we hit the limit, drop
the incoming packet. There is no other solution.

   Way before hitting the limit, schedule a workqueue.
   This wq is responsible for evicting frags. Each frag has an "unsigned
long last_frag_jiffie", and you automatically evict frags that have aged
more than xxx jiffies.


As a bonus, we can remove the timer per frag and save a lot of memory
and timer overhead.
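
[Note: a minimal, self-contained sketch of the workqueue eviction
described above.  Every name here (frag_entry, frag_gc_*,
FRAG_EVICT_AGE, FRAG_GC_INTERVAL) is an illustrative assumption, not
existing kernel code.]

#include <linux/jiffies.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct frag_entry {                     /* stand-in for a real frag queue */
	struct list_head gc_list;
	unsigned long    last_frag_jiffie;  /* set on every new fragment  */
};

#define FRAG_EVICT_AGE    (30 * HZ)     /* evict anything older than this */
#define FRAG_GC_INTERVAL  (HZ / 10)

static LIST_HEAD(frag_gc_list);         /* kept ordered, oldest first     */
static DEFINE_SPINLOCK(frag_gc_lock);
static void frag_gc_worker(struct work_struct *work);
static DECLARE_DELAYED_WORK(frag_gc_work, frag_gc_worker);

static void frag_gc_worker(struct work_struct *work)
{
	unsigned long cutoff = jiffies - FRAG_EVICT_AGE;
	struct frag_entry *q, *tmp;

	spin_lock_bh(&frag_gc_lock);
	list_for_each_entry_safe(q, tmp, &frag_gc_list, gc_list) {
		if (time_after(q->last_frag_jiffie, cutoff))
			break;          /* rest of the list is younger    */
		list_del_init(&q->gc_list);
		kfree(q);               /* real code would also unhash it */
	}
	spin_unlock_bh(&frag_gc_lock);

	/* Eviction runs here, never in the softirq/packet path; the
	 * packet path only drops when the hard limit has been hit. */
	schedule_delayed_work(&frag_gc_work, FRAG_GC_INTERVAL);
}

[The work would be kicked once at init time, and each arriving fragment
would update last_frag_jiffie and move its entry to the tail of
frag_gc_list, so the list stays ordered oldest-first.]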

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 4/4] net: frag LRU list per CPU
  2013-04-25 14:06       ` Jesper Dangaard Brouer
@ 2013-04-25 14:37         ` Eric Dumazet
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2013-04-25 14:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Thu, 2013-04-25 at 16:06 +0200, Jesper Dangaard Brouer wrote:

> I like this idea (just don't know how to implement it).  As my perf
> results with "direct-hash-cleaning", show that we can hit a very
> unfortunate situation, where the add/remove timer spinlock gets
> hot/congested.

Remove the timer per frag, problem solved ;)

With one million frags around, we do not want one million active timers.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 2/4] net: increase frag hash size
  2013-04-25 10:13     ` Jesper Dangaard Brouer
  2013-04-25 12:13       ` Sergei Shtylyov
@ 2013-04-25 19:11       ` David Miller
  1 sibling, 0 replies; 36+ messages in thread
From: David Miller @ 2013-04-25 19:11 UTC (permalink / raw)
  To: brouer; +Cc: sergei.shtylyov, hannes, netdev, eric.dumazet

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Thu, 25 Apr 2013 12:13:15 +0200

> Did you know that you can use the string "v3.8-rc3-503-gc2a9366" like:
>  git show v3.8-rc3-503-gc2a9366
> That's why I like this kind of commit ID better, as it also tells the
> reader approximately which version this patch went into.

Don't invent your own conventions without discussing it with other
developers first.

Being different in style from what other people are doing hurts all
by itself, because people have to interpret and read your commit
messages differently.

> Dave, do you want me to resubmit this, with nitpicked commit message?

Of course.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 3/4] net: avoid false perf interpretations in frag code
  2013-04-25 10:57     ` Jesper Dangaard Brouer
@ 2013-04-25 19:13       ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2013-04-25 19:13 UTC (permalink / raw)
  To: brouer; +Cc: eric.dumazet, hannes, netdev

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Thu, 25 Apr 2013 12:57:08 +0200

> Would we add the "inlines" only to the code, to make it clear what is
> happening in the code?

Never add inline in *.c code, in *.h headers, let the compiler sort it out.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 4/4] net: frag LRU list per CPU
  2013-04-25 14:18       ` Eric Dumazet
@ 2013-04-25 19:15         ` Jesper Dangaard Brouer
  2013-04-25 19:22           ` Eric Dumazet
  0 siblings, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-25 19:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Thu, 2013-04-25 at 07:18 -0700, Eric Dumazet wrote:

> Let be clear :
[cut Q&A worklist]

Thanks, that is perfect. I'll start working on it Monday (as tomorrow is
a public holiday, and my home gets invaded by family, who will be
sleeping in my office).

One more question:

> Q) You want performance and big number of frags to cope with attacks
> A) resize the hash table when the admin sets a bigger limit

Q) How do we solve the per netns limit vs hash size? (Make limit
global?)

--Jesper

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 4/4] net: frag LRU list per CPU
  2013-04-25 19:15         ` Jesper Dangaard Brouer
@ 2013-04-25 19:22           ` Eric Dumazet
  0 siblings, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2013-04-25 19:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Thu, 2013-04-25 at 21:15 +0200, Jesper Dangaard Brouer wrote:

> > Q) You want performance and big number of frags to cope with attacks
> > A) resize the hash table when the admin sets a bigger limit
> 
> Q) How do we solve the per netns limit vs hash size? (Make limit
> global?)

A global limit would be needed anyway.

If we want a per netns limit, that would be a second (per netns)
percpu_counter, but I feel this can be added later.

Thanks !
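
[Note: a minimal sketch of the percpu_counter accounting referred to
above; the frag_mem / frag_mem_* names are illustrative assumptions,
not existing kernel symbols.]

#include <linux/percpu_counter.h>
#include <linux/types.h>

static struct percpu_counter frag_mem;  /* global accounting; a second,
                                         * per-netns counter could be
                                         * added the same way later     */

/* at subsystem init: percpu_counter_init(&frag_mem, 0)
 * (note: the init helper grew a gfp_t argument in later kernels) */

static void frag_mem_add(unsigned int truesize)
{
	percpu_counter_add(&frag_mem, truesize);
}

static void frag_mem_sub(unsigned int truesize)
{
	percpu_counter_sub(&frag_mem, truesize);
}

static bool frag_mem_over_limit(unsigned long high_thresh)
{
	/* an approximate read is cheap and good enough for a threshold test */
	return percpu_counter_read_positive(&frag_mem) > high_thresh;
}

[Compared with a single shared atomic, the per-CPU batching keeps the
add/sub fast path cache-friendly when many CPUs handle fragments.]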

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [net-next PATCH V2] net: increase frag hash size
  2013-04-24 15:48 ` [net-next PATCH 2/4] net: increase frag hash size Jesper Dangaard Brouer
                     ` (2 preceding siblings ...)
  2013-04-25  3:26   ` Hannes Frederic Sowa
@ 2013-04-25 19:52   ` Jesper Dangaard Brouer
  2013-04-29 17:44     ` David Miller
  3 siblings, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-25 19:52 UTC (permalink / raw)
  To: David S. Miller, Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, netdev, Eric Dumazet, sergei.shtylyov

Increase the fragmentation hash bucket size to 1024 from the old 64 elems.

After we increased the frag mem limits in commit c2a93660 (net: increase
fragment memory usage limits), the hash size of 64 elements is simply
too small, especially considering that the mem limit is per netns while
the hash table is shared across all netns.

For the embedded people, note that this increase will change the hash
table/array from using approx 1 Kbytes to 16 Kbytes.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>

---
V2:
 - Fix commit message, correctly formatted commit IDs
 - Added previous acked-bys to (hopefully) ease DaveM's work

 include/net/inet_frag.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index eb1d6ee..4e15856 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -41,7 +41,7 @@ struct inet_frag_queue {
 	struct netns_frags	*net;
 };
 
-#define INETFRAGS_HASHSZ		64
+#define INETFRAGS_HASHSZ	1024
 
 struct inet_frag_bucket {
 	struct hlist_head	chain;
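
[Note: the 1 KB -> 16 KB figures in the commit message above follow
directly from the array size, assuming roughly 16 bytes per
inet_frag_bucket on a 64-bit build (an hlist_head plus the per-bucket
lock, padded).]

    64 buckets * 16 bytes =  1024 bytes  (~1 KB,  old size)
  1024 buckets * 16 bytes = 16384 bytes  (16 KB, new size)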

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH V2] net: increase frag hash size
  2013-04-25 19:52   ` [net-next PATCH V2] " Jesper Dangaard Brouer
@ 2013-04-29 17:44     ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2013-04-29 17:44 UTC (permalink / raw)
  To: brouer; +Cc: hannes, netdev, eric.dumazet, sergei.shtylyov

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Thu, 25 Apr 2013 21:52:25 +0200

> Increase the fragmentation hash bucket size to 1024 from the old 64 elems.
> 
> After we increased the frag mem limits in commit c2a93660 (net: increase
> fragment memory usage limits), the hash size of 64 elements is simply
> too small, especially considering that the mem limit is per netns while
> the hash table is shared across all netns.
> 
> For the embedded people, note that this increase will change the hash
> table/array from using approx 1 Kbytes to 16 Kbytes.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Acked-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists"
  2013-04-25  0:00   ` Eric Dumazet
  2013-04-25 13:10     ` Jesper Dangaard Brouer
@ 2013-05-02  7:59     ` Jesper Dangaard Brouer
  2013-05-02 15:16       ` Eric Dumazet
  1 sibling, 1 reply; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-05-02  7:59 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller
  Cc: Jesper Dangaard Brouer, Hannes Frederic Sowa, netdev

On Wed, 24 Apr 2013 17:00:30 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> > This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.
> > 
> > The problem with commit 5a3da1fe (inet: limit length of fragment
> > queue hash table bucket lists) is that, once we hit the hash depth
> > limit (of 128), we *keep* the existing frag queues, not
> > allowing new frag queues to be created.  Thus, an attacker can
> > effectively block handling of fragments for 30 sec (as each frag
> > queue has a timeout of 30 sec)
> > 
> 
> I do not think it's good to revert this patch. It was a step in the
> right direction.

We need a revert, because we are too close to the merge window, and
cannot complete the needed "steps" to make this patch safe, sorry.

--Jesper

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists"
  2013-05-02  7:59     ` Jesper Dangaard Brouer
@ 2013-05-02 15:16       ` Eric Dumazet
  2013-05-03  9:15         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Dumazet @ 2013-05-02 15:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, Hannes Frederic Sowa, netdev

On Thu, 2013-05-02 at 09:59 +0200, Jesper Dangaard Brouer wrote:
> On Wed, 24 Apr 2013 17:00:30 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> > > This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.
> > > 
> > > The problem with commit 5a3da1fe (inet: limit length of fragment
> > > queue hash table bucket lists) is that, once we hit the hash depth
> > > limit (of 128), the we *keep* the existing frag queues, not
> > > allowing new frag queues to be created.  Thus, an attacker can
> > > effectivly block handling of fragments for 30 sec (as each frag
> > > queue have a timeout of 30 sec)
> > > 
> > 
> > I do not think its good to revert this patch. It was a step in right
> > direction.
> 
> We need a revert, because we are too close to the merge window, and
> cannot complete the needed "steps" to make this patch safe, sorry.

Again, a limit of 128 is totally OK. It's in fact too big.

128 cache misses consume 5 us.

Allowing a chain to be non-limited is a more severe bug.

Reverting will allow an attacker to consume all your cpu cycles.

We changed INETFRAGS_HASHSZ to 1024, so 128*1024 max frags is already a
very big limit.

No matter what we do, we need to limit both:

- Memory consumption
- CPU consumption

For people willing to allow more memory to be used, the only way is to
resize the hash table, or to use a bigger INETFRAGS_HASHSZ.

I do not think there is a hurry; the current defrag code is already
better than what we had years ago.
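
[Note: the 5 us figure above is consistent with a main-memory cache
miss costing on the order of ~40 ns -- an assumed ballpark, not a
number taken from this thread.]

  128 misses * ~40 ns/miss   = ~5120 ns = ~5 us per worst-case bucket walk
  1024 buckets * 128 entries = 131072 frag queues at the depth limit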

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists"
  2013-05-02 15:16       ` Eric Dumazet
@ 2013-05-03  9:15         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2013-05-03  9:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, David S. Miller, Hannes Frederic Sowa, netdev

On Thu, 02 May 2013 08:16:41 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2013-05-02 at 09:59 +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 24 Apr 2013 17:00:30 -0700 Eric Dumazet
> > <eric.dumazet@gmail.com> wrote:
> > 
> > > On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> > > > This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.
> > > > 
> > > > The problem with commit 5a3da1fe (inet: limit length of fragment
> > > > queue hash table bucket lists) is that, once we hit the hash
> > > > depth limit (of 128), we *keep* the existing frag queues,
> > > > not allowing new frag queues to be created.  Thus, an attacker
> > > > can effectively block handling of fragments for 30 sec (as each
> > > > frag queue has a timeout of 30 sec)
> > > > 
> > > 
> > > I do not think it's good to revert this patch. It was a step in
> > > the right direction.
> > 
> > We need a revert, because we are too close to the merge window, and
> > cannot complete the needed "steps" to make this patch safe, sorry.
> 
[...]
> 
> For people willing to allow more memory to be used, the only way is to
> resize hash table, or using a bigger INETFRAGS_HASHSZ
>
> I do not think there is a hurry, current defrag code is already better
> than what we had years ago.
 
Eric, I think we agree that:
 1) we need resizing of the hash table based on the mem limit
 2) the per-netns mem limit "blocks" the hash resize patch

Without these two patches/changes, the static 128 depth limit
introduces an undocumented limit on the max mem limit
setting (/proc/sys/net/ipv4/ipfrag_high_thresh).

I think we only disagree on the order of the patches.

But let's keep this, because after we have increased the hash
size (INETFRAGS_HASHSZ) to 1024, we have pushed the "undocumented
limit" so far out that it is very unlikely to be hit.  We would have to
start >36 netns instances, all being overloaded with small
incomplete fragments at the same time (30 sec time window).

--Jesper

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread

Thread overview: 36+ messages
2013-04-24 15:47 [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue Jesper Dangaard Brouer
2013-04-24 15:48 ` [net-next PATCH 1/4] Revert "inet: limit length of fragment queue hash table bucket lists" Jesper Dangaard Brouer
2013-04-25  0:00   ` Eric Dumazet
2013-04-25 13:10     ` Jesper Dangaard Brouer
2013-04-25 13:58       ` David Laight
2013-05-02  7:59     ` Jesper Dangaard Brouer
2013-05-02 15:16       ` Eric Dumazet
2013-05-03  9:15         ` Jesper Dangaard Brouer
2013-04-24 15:48 ` [net-next PATCH 2/4] net: increase frag hash size Jesper Dangaard Brouer
2013-04-24 22:09   ` Sergei Shtylyov
2013-04-25 10:13     ` Jesper Dangaard Brouer
2013-04-25 12:13       ` Sergei Shtylyov
2013-04-25 19:11       ` David Miller
2013-04-24 23:48   ` Eric Dumazet
2013-04-25  3:26   ` Hannes Frederic Sowa
2013-04-25 19:52   ` [net-next PATCH V2] " Jesper Dangaard Brouer
2013-04-29 17:44     ` David Miller
2013-04-24 15:48 ` [net-next PATCH 3/4] net: avoid false perf interpretations in frag code Jesper Dangaard Brouer
2013-04-24 23:48   ` Eric Dumazet
2013-04-24 23:54     ` David Miller
2013-04-25 10:57     ` Jesper Dangaard Brouer
2013-04-25 19:13       ` David Miller
2013-04-24 15:48 ` [net-next PATCH 4/4] net: frag LRU list per CPU Jesper Dangaard Brouer
2013-04-25  0:25   ` Eric Dumazet
2013-04-25  2:05     ` Eric Dumazet
2013-04-25 14:06       ` Jesper Dangaard Brouer
2013-04-25 14:37         ` Eric Dumazet
2013-04-25 13:59     ` Jesper Dangaard Brouer
2013-04-25 14:10       ` Eric Dumazet
2013-04-25 14:18       ` Eric Dumazet
2013-04-25 19:15         ` Jesper Dangaard Brouer
2013-04-25 19:22           ` Eric Dumazet
2013-04-24 16:21 ` [net-next PATCH 0/4] net: frag patchset for fixing LRU scalabilityissue David Laight
2013-04-25 11:39   ` Jesper Dangaard Brouer
2013-04-25 12:57     ` David Laight
2013-04-24 17:27 ` [net-next PATCH 0/4] net: frag patchset for fixing LRU scalability issue Hannes Frederic Sowa
