linux-kernel.vger.kernel.org archive mirror
* [RFC 0/4] net/fib: Speed up trie rebalancing for full view
@ 2019-03-26 15:30 Dmitry Safonov
  2019-03-26 15:30 ` [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc() Dmitry Safonov
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Dmitry Safonov @ 2019-03-26 15:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dmitry Safonov, Alexander Duyck, Alexey Kuznetsov, David Ahern,
	David S. Miller, Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
	netdev, linux-doc, Jonathan Corbet

While moving our switches from the 3.18 stable kernel to 4.9, rebasing
local patches and such, we found that BGP full-view benchmarks started
to hit the soft lockup detector by holding the rtnl mutex for a couple
of seconds on route changes.

I've found that the hard-coded MAX_WORK no longer limits the amount of
pending rebalancing work, so any route addition or removal may cause
massive rebalancing of the trie. Making the hit even worse, tnodes are
freed by RCU, with a call to synchronize_rcu() every 512Kb. That
threshold is way too small, and even on 2-core switches the cost is
painfully noticeable.

To address those problems, I've introduced a sysctl knob to limit the
amount of rebalancing work (unlimited by default, as it de facto is
now). I've also moved synchronize_rcu() into a new shrinker so that
memory is actually released under OOM. I believe a shrinker invisible
to userspace is better than a new sysctl knob for the limit (or any
hard-coded value). I'm not sure how sane the result is, though, so I'm
sending this as an RFC, having qualms that it's not yet ready for
inclusion as-is.

I've looked further into the origin of the problems here and was
wondering whether it makes sense to do the following (rough plan):
1. Introduce a new flag RTNL_FLAG_DUMP_UNLOCKED to dump fib trie without
   rtnl lock. I'm a bit out of context, so I'm probably missing some
   obvious reasons why the lock needs to be held at this point.
2. Add a new fib_lock mutex for updating a trie. I'm not really sure
   that we can always release rtnl for updates, so probably there should
   be a strict locking order: rtnl_lock (when needed) => fib_lock.
3. Correct the current documentation, which mentions fib_lock as an
   rwsem.
4. ??

I did some experiments on the plan above, but decided to send this RFC
first to get the opinions of people who understand more. Maybe my plan
is nonsense and not worth the amount of time it would require.

I've also looked into the changes between v3.18...v4.9 and found the
following patch set: https://www.spinics.net/lists/netdev/msg309586.html
While it has impressive results on lookups, it seems to be the reason
for the regression at large scale: one of the patches imposes a 5x
penalty on removing a route from a big trie, and another adds a 10x
penalty by calling synchronize_rcu() more frequently (I've marked them
with Fixes tags in the patches).

[I was very lavish with the Cc list; please ping me privately if you
don't want to be copied on the next version]

Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: David Ahern <dsahern@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Ido Schimmel <idosch@mellanox.com>
Cc: netdev@vger.kernel.org

Thanks,
        Dmitry

Dmitry Safonov (4):
  net/ipv4/fib: Remove run-time check in tnode_alloc()
  net/fib: Provide fib_balance_budget sysctl
  net/fib: Check budget before should_{inflate,halve}()
  net/ipv4/fib: Don't synchronise_rcu() every 512Kb

 Documentation/networking/ip-sysctl.txt |   6 ++
 include/net/ip.h                       |   1 +
 net/ipv4/fib_trie.c                    | 121 +++++++++++++++----------
 net/ipv4/sysctl_net_ipv4.c             |   7 ++
 4 files changed, 87 insertions(+), 48 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc()
  2019-03-26 15:30 [RFC 0/4] net/fib: Speed up trie rebalancing for full view Dmitry Safonov
@ 2019-03-26 15:30 ` Dmitry Safonov
  2019-04-01 15:40   ` Alexander Duyck
  2019-03-26 15:30 ` [RFC 2/4] net/fib: Provide fib_balance_budget sysctl Dmitry Safonov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Dmitry Safonov @ 2019-03-26 15:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dmitry Safonov, Alexander Duyck, Alexey Kuznetsov, David Ahern,
	David S. Miller, Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
	netdev

TNODE_KMALLOC_MAX is not used anywhere, while the TNODE_VMALLOC_MAX
check in tnode_alloc() only adds extra cmp/jmp instructions to tnode
allocation. During rebalancing of the trie the function can be called
thousands of times, and the run-time check costs a cache line and a
branch-predictor entry. Furthermore, the check is always false on
64-bit platforms: ipv4 has only a 4-byte address range and bits is
limited by KEYLENGTH (32).

Move the check under unlikely() and change the comparison to
BITS_PER_LONG, optimizing tnode allocation during rebalancing (and
removing the check completely on platforms with
BITS_PER_LONG > KEYLENGTH).
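As an illustration, the reworked bounds check folds away at compile
time on 64-bit because both operands of the first comparison are
constants. This is a userspace sketch with made-up names
(tnode_alloc_sketch() and MODEL_BITS_PER_LONG are stand-ins, not the
kernel code):

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical userspace model of the patched check; KEYLENGTH mirrors
 * the IPv4 key length, MODEL_BITS_PER_LONG stands in for BITS_PER_LONG. */
#define KEYLENGTH 32
#define MODEL_BITS_PER_LONG ((int)(sizeof(long) * CHAR_BIT))

static void *tnode_alloc_sketch(int bits)
{
	/* On 64-bit targets MODEL_BITS_PER_LONG <= KEYLENGTH is a
	 * compile-time false constant, so the compiler can drop the
	 * whole branch (no cmp/jmp left on the allocation path). */
	if (MODEL_BITS_PER_LONG <= KEYLENGTH && bits >= MODEL_BITS_PER_LONG)
		return NULL;

	/* node header omitted: just 2^bits child pointers, as in the trie */
	return malloc(sizeof(void *) << bits);
}
```

On 32-bit (BITS_PER_LONG == 32 == KEYLENGTH) the check remains, and it
is still needed there: bits >= 32 would make the shift undefined.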

Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 net/ipv4/fib_trie.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index a573e37e0615..ad7d56c421cb 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -312,11 +312,6 @@ static inline void alias_free_mem_rcu(struct fib_alias *fa)
 	call_rcu(&fa->rcu, __alias_free_mem);
 }
 
-#define TNODE_KMALLOC_MAX \
-	ilog2((PAGE_SIZE - TNODE_SIZE(0)) / sizeof(struct key_vector *))
-#define TNODE_VMALLOC_MAX \
-	ilog2((SIZE_MAX - TNODE_SIZE(0)) / sizeof(struct key_vector *))
-
 static void __node_free_rcu(struct rcu_head *head)
 {
 	struct tnode *n = container_of(head, struct tnode, rcu);
@@ -333,8 +328,7 @@ static struct tnode *tnode_alloc(int bits)
 {
 	size_t size;
 
-	/* verify bits is within bounds */
-	if (bits > TNODE_VMALLOC_MAX)
+	if ((BITS_PER_LONG <= KEYLENGTH) && unlikely(bits >= BITS_PER_LONG))
 		return NULL;
 
 	/* determine size and verify it is non-zero and didn't overflow */
-- 
2.21.0



* [RFC 2/4] net/fib: Provide fib_balance_budget sysctl
  2019-03-26 15:30 [RFC 0/4] net/fib: Speed up trie rebalancing for full view Dmitry Safonov
  2019-03-26 15:30 ` [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc() Dmitry Safonov
@ 2019-03-26 15:30 ` Dmitry Safonov
  2019-04-01 18:09   ` Alexander Duyck
  2019-03-26 15:30 ` [RFC 3/4] net/fib: Check budget before should_{inflate,halve}() Dmitry Safonov
  2019-03-26 15:30 ` [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb Dmitry Safonov
  3 siblings, 1 reply; 18+ messages in thread
From: Dmitry Safonov @ 2019-03-26 15:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dmitry Safonov, Alexander Duyck, Alexey Kuznetsov, David Ahern,
	David S. Miller, Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
	netdev, linux-doc, Jonathan Corbet

Unfortunately, MAX_WORK is currently broken: as trie_rebalance() climbs
upwards, resize() is called for each tnode, which makes the limit less
effective the bigger the trie gets: at each level there are up to 10
attempts to balance a tnode and its children, resulting in
O(10^n + 10^{n-1} + ... + 10^{n-k}) complexity to balance the trie,
where k is the level at which the trie was originally changed
(adding/removing an alias/route).

This results in the following price for removing one route from a big
trie (similar for some other single routes that result in reallocating
tnodes by hitting the halve/inflate resize thresholds):

Before:
   Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
   Main:
           Aver depth:     1.99
           Max depth:      2
           Leaves:         77825
           Prefixes:       77825
           Internal nodes: 77
             9: 1  10: 76
           Pointers: 78336
   Null ptrs: 435
   Total size: 7912  kB
   Local:
   [omitted]

After:
   Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
   Main:
           Aver depth:     3.00
           Max depth:      3
           Leaves:         77824
           Prefixes:       77824
           Internal nodes: 20491
             1: 2048  2: 18432  6: 1  11: 10
           Pointers: 98368
   Null ptrs: 54
   Total size: 8865  kB
   Local:
   [omitted]

Provide a sysctl to control the amount of pending balancing work
(unlimited by default, as it was).
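The climbing behaviour described above can be reduced to a toy model
(plain userspace C with invented names - resize_model() and the fixed
depth below are illustrative, not the kernel code): each visited node
makes up to 10 balance attempts, each attempt re-resizes the children,
and a shared budget caps the total:

```c
#include <assert.h>

/* Toy model: attempts counts inflate/halve attempts; the 10-attempt
 * loop mirrors the old MAX_WORK, the shared *budget mirrors the new
 * fib_balance_budget semantics. */
static unsigned long long attempts;

static void resize_model(int level, unsigned int *budget)
{
	int work;

	for (work = 0; work < 10 && *budget; work++) {
		attempts++;			/* one inflate/halve attempt */
		(*budget)--;
		if (level > 0)			/* replace() resizes children */
			resize_model(level - 1, budget);
	}
}
```

With an effectively unlimited budget, a depth-5 climb makes 1,111,110
attempts (the 10^n + 10^{n-1} + ... sum above), while budget = 100
stops after exactly 100 attempts - which is the point of the sysctl.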

Cc: linux-doc@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down into inflate/halve")
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 Documentation/networking/ip-sysctl.txt |  6 +++
 include/net/ip.h                       |  1 +
 net/ipv4/fib_trie.c                    | 60 +++++++++++++++-----------
 net/ipv4/sysctl_net_ipv4.c             |  7 +++
 4 files changed, 49 insertions(+), 25 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index acdfb5d2bcaa..fb71dacff4dd 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER
 	0 - Layer 3
 	1 - Layer 4
 
+fib_balance_budget - UNSIGNED INTEGER
+	Limits the number of resize attempts while rebalancing the fib
+	trie on route addition/removal.
+	Possible values: 0 - UINT_MAX
+	Default: UINT_MAX (0xFFFFFFFF)
+
 ip_forward_update_priority - INTEGER
 	Whether to update SKB priority from "TOS" field in IPv4 header after it
 	is forwarded. The new SKB priority is mapped from TOS field value
diff --git a/include/net/ip.h b/include/net/ip.h
index be3cad9c2e4c..305d0e43088b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -421,6 +421,7 @@ static inline unsigned int ip_skb_dst_mtu(struct sock *sk,
 	return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU);
 }
 
+extern unsigned int fib_balance_budget;
 struct dst_metrics *ip_fib_metrics_init(struct net *net, struct nlattr *fc_mx,
 					int fc_mx_len,
 					struct netlink_ext_ack *extack);
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index ad7d56c421cb..d90cf9dfd443 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -182,8 +182,10 @@ struct trie {
 #endif
 };
 
-static struct key_vector *resize(struct trie *t, struct key_vector *tn);
+static struct key_vector *resize(struct trie *t, struct key_vector *tn,
+				 unsigned int *budget);
 static size_t tnode_free_size;
+unsigned int fib_balance_budget = UINT_MAX;
 
 /*
  * synchronize_rcu after call_rcu for that many pages; it should be especially
@@ -506,7 +508,8 @@ static void tnode_free(struct key_vector *tn)
 
 static struct key_vector *replace(struct trie *t,
 				  struct key_vector *oldtnode,
-				  struct key_vector *tn)
+				  struct key_vector *tn,
+				  unsigned int *budget)
 {
 	struct key_vector *tp = node_parent(oldtnode);
 	unsigned long i;
@@ -522,19 +525,19 @@ static struct key_vector *replace(struct trie *t,
 	tnode_free(oldtnode);
 
 	/* resize children now that oldtnode is freed */
-	for (i = child_length(tn); i;) {
+	for (i = child_length(tn); i && *budget;) {
 		struct key_vector *inode = get_child(tn, --i);
 
 		/* resize child node */
 		if (tnode_full(tn, inode))
-			tn = resize(t, inode);
+			tn = resize(t, inode, budget);
 	}
 
 	return tp;
 }
 
-static struct key_vector *inflate(struct trie *t,
-				  struct key_vector *oldtnode)
+static struct key_vector *inflate(struct trie *t, struct key_vector *oldtnode,
+				  unsigned int *budget)
 {
 	struct key_vector *tn;
 	unsigned long i;
@@ -621,7 +624,7 @@ static struct key_vector *inflate(struct trie *t,
 	}
 
 	/* setup the parent pointers into and out of this node */
-	return replace(t, oldtnode, tn);
+	return replace(t, oldtnode, tn, budget);
 nomem:
 	/* all pointers should be clean so we are done */
 	tnode_free(tn);
@@ -629,8 +632,8 @@ static struct key_vector *inflate(struct trie *t,
 	return NULL;
 }
 
-static struct key_vector *halve(struct trie *t,
-				struct key_vector *oldtnode)
+static struct key_vector *halve(struct trie *t, struct key_vector *oldtnode,
+				unsigned int *budget)
 {
 	struct key_vector *tn;
 	unsigned long i;
@@ -676,7 +679,7 @@ static struct key_vector *halve(struct trie *t,
 	}
 
 	/* setup the parent pointers into and out of this node */
-	return replace(t, oldtnode, tn);
+	return replace(t, oldtnode, tn, budget);
 nomem:
 	/* all pointers should be clean so we are done */
 	tnode_free(tn);
@@ -843,15 +846,15 @@ static inline bool should_collapse(struct key_vector *tn)
 	return used < 2;
 }
 
-#define MAX_WORK 10
-static struct key_vector *resize(struct trie *t, struct key_vector *tn)
+static struct key_vector *resize(struct trie *t, struct key_vector *tn,
+				 unsigned int *budget)
 {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 	struct trie_use_stats __percpu *stats = t->stats;
 #endif
 	struct key_vector *tp = node_parent(tn);
 	unsigned long cindex = get_index(tn->key, tp);
-	int max_work = MAX_WORK;
+	bool inflated = false;
 
 	pr_debug("In tnode_resize %p inflate_threshold=%d threshold=%d\n",
 		 tn, inflate_threshold, halve_threshold);
@@ -865,8 +868,8 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 	/* Double as long as the resulting node has a number of
 	 * nonempty nodes that are above the threshold.
 	 */
-	while (should_inflate(tp, tn) && max_work) {
-		tp = inflate(t, tn);
+	while (should_inflate(tp, tn) && *budget) {
+		tp = inflate(t, tn, budget);
 		if (!tp) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 			this_cpu_inc(stats->resize_node_skipped);
@@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 			break;
 		}
 
-		max_work--;
+		(*budget)--;
+		inflated = true;
 		tn = get_child(tp, cindex);
 	}
 
 	/* update parent in case inflate failed */
 	tp = node_parent(tn);
 
-	/* Return if at least one inflate is run */
-	if (max_work != MAX_WORK)
+	/* Return if at least one inflate is run:
+	 * microoptimization to not recalculate thresholds
+	 */
+	if (inflated)
 		return tp;
 
 	/* Halve as long as the number of empty children in this
 	 * node is above threshold.
 	 */
-	while (should_halve(tp, tn) && max_work) {
-		tp = halve(t, tn);
+	while (should_halve(tp, tn) && *budget) {
+		tp = halve(t, tn, budget);
 		if (!tp) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 			this_cpu_inc(stats->resize_node_skipped);
@@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 			break;
 		}
 
-		max_work--;
+		(*budget)--;
 		tn = get_child(tp, cindex);
 	}
 
@@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
 
 static void trie_rebalance(struct trie *t, struct key_vector *tn)
 {
-	while (!IS_TRIE(tn))
-		tn = resize(t, tn);
+	unsigned int budget = fib_balance_budget;
+
+	while (budget && !IS_TRIE(tn))
+		tn = resize(t, tn, &budget);
 }
 
 static int fib_insert_node(struct trie *t, struct key_vector *tp,
@@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb)
 void fib_table_flush_external(struct fib_table *tb)
 {
 	struct trie *t = (struct trie *)tb->tb_data;
+	unsigned int budget = fib_balance_budget;
 	struct key_vector *pn = t->kv;
 	unsigned long cindex = 1;
 	struct hlist_node *tmp;
@@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb)
 				update_suffix(pn);
 
 			/* resize completed node */
-			pn = resize(t, pn);
+			pn = resize(t, pn, &budget);
 			cindex = get_index(pkey, pn);
 
 			continue;
@@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb)
 int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 {
 	struct trie *t = (struct trie *)tb->tb_data;
+	unsigned int budget = fib_balance_budget;
 	struct key_vector *pn = t->kv;
 	unsigned long cindex = 1;
 	struct hlist_node *tmp;
@@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 				update_suffix(pn);
 
 			/* resize completed node */
-			pn = resize(t, pn);
+			pn = resize(t, pn, &budget);
 			cindex = get_index(pkey, pn);
 
 			continue;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ba0fc4b18465..d7274cc442af 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -443,6 +443,13 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "fib_balance_budget",
+		.data		= &fib_balance_budget,
+		.maxlen		= sizeof(fib_balance_budget),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec,
+	},
 	{
 		.procname	= "inet_peer_threshold",
 		.data		= &inet_peer_threshold,
-- 
2.21.0



* [RFC 3/4] net/fib: Check budget before should_{inflate,halve}()
  2019-03-26 15:30 [RFC 0/4] net/fib: Speed up trie rebalancing for full view Dmitry Safonov
  2019-03-26 15:30 ` [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc() Dmitry Safonov
  2019-03-26 15:30 ` [RFC 2/4] net/fib: Provide fib_balance_budget sysctl Dmitry Safonov
@ 2019-03-26 15:30 ` Dmitry Safonov
  2019-04-01 18:20   ` Alexander Duyck
  2019-03-26 15:30 ` [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb Dmitry Safonov
  3 siblings, 1 reply; 18+ messages in thread
From: Dmitry Safonov @ 2019-03-26 15:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dmitry Safonov, Alexander Duyck, Alexey Kuznetsov, David Ahern,
	David S. Miller, Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
	netdev

Those functions are computationally costly; if we're out of budget,
better to omit the additional computations.

Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 net/ipv4/fib_trie.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index d90cf9dfd443..2ce2739e7693 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -868,7 +868,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn,
 	/* Double as long as the resulting node has a number of
 	 * nonempty nodes that are above the threshold.
 	 */
-	while (should_inflate(tp, tn) && *budget) {
+	while (*budget && should_inflate(tp, tn)) {
 		tp = inflate(t, tn, budget);
 		if (!tp) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
@@ -894,7 +894,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn,
 	/* Halve as long as the number of empty children in this
 	 * node is above threshold.
 	 */
-	while (should_halve(tp, tn) && *budget) {
+	while (*budget && should_halve(tp, tn)) {
 		tp = halve(t, tn, budget);
 		if (!tp) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
-- 
2.21.0



* [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
  2019-03-26 15:30 [RFC 0/4] net/fib: Speed up trie rebalancing for full view Dmitry Safonov
                   ` (2 preceding siblings ...)
  2019-03-26 15:30 ` [RFC 3/4] net/fib: Check budget before should_{inflate,halve}() Dmitry Safonov
@ 2019-03-26 15:30 ` Dmitry Safonov
  2019-03-26 15:39   ` David Ahern
  3 siblings, 1 reply; 18+ messages in thread
From: Dmitry Safonov @ 2019-03-26 15:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dmitry Safonov, Alexander Duyck, Alexey Kuznetsov, David Ahern,
	David S. Miller, Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
	netdev

Fib trie has a hard-coded sync_pages limit to call synchronize_rcu().
The limit is 128 pages, or 512Kb (in the common case of 4Kb pages).

Unfortunately, at Arista we have use-scenarios with full-view software
forwarding. At the scale of 100K and more routes, even on 2-core boxes
the hard-coded limit starts actively shooting us in the leg: the lockup
detector notices that rtnl_lock is held for seconds.
The first reason is the previously broken MAX_WORK, which didn't limit
pending balancing work. While fixing that, I've noticed that the
bottleneck is actually the number of synchronize_rcu() calls.

I've tried to fix it with a patch decrementing the number of tnodes in
an rcu callback, but it didn't affect performance much.

One possible way to "fix" it is to provide another sysctl to control
sync_pages, but in my view that's nasty - exposing another
implementation detail to user-space.

To be completely honest, I'm not sure whether calling
synchronize_rcu() from a shrinker is a sane idea: during OOM we're
slowed down enough already, and adding synchronization there will
probably noticeably slow down the shrinker.
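For reference, the hard-coded batching being removed can be modelled in
isolation (a userspace sketch with an assumed 4Kb page size and an
assumed tnode size - not the kernel code); it shows how many blocking
grace periods a burst of frees costs under the old scheme:

```c
#include <assert.h>
#include <stddef.h>

/* Toy accounting of the old tnode_free() behaviour: accumulate freed
 * bytes and trigger a (simulated) synchronize_rcu() every
 * sync_pages * PAGE_SIZE bytes. */
#define MODEL_PAGE_SIZE 4096
static const int sync_pages = 128;	/* the hard-coded limit */

static size_t tnode_free_size;
static unsigned long sync_calls;

static void tnode_free_model(size_t tnode_bytes)
{
	tnode_free_size += tnode_bytes;
	if (tnode_free_size >= (size_t)MODEL_PAGE_SIZE * sync_pages) {
		tnode_free_size = 0;
		sync_calls++;		/* stands in for synchronize_rcu() */
	}
}
```

Freeing 100,000 tnodes of 256 bytes each (an assumed size) costs 48
grace periods under the 512Kb threshold; since a synchronize_rcu() can
take many milliseconds, that alone can account for seconds spent under
rtnl_lock.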

Anyway, I've got the following results on a very naive benchmark that
adds routes one by one and then removes them (with unlimited
fib_balance_budget), measuring the time spent removing each route:

*Before* on a 4-core switch (AMD GX-420CA SOC):
v4 Create of 4194304 routes: 76806ms
0(2097152): 000. 32.  0.  0	3353ms
1(3145729): 000. 48.  0.  1	1311ms
2(1048577): 000. 16.  0.  1	1286ms
3(524289): 000.  8.  0.  1	865ms
4(4098744): 000. 62.138.184	858ms
5(3145728): 000. 48.  0.  0	832ms
6(1048576): 000. 16.  0.  0	794ms
7(2621441): 000. 40.  0.  1	663ms
8(2621440): 000. 40.  0.  0	525ms
9(524288): 000.  8.  0.  0	508ms

v4 Delete of 4194304 routes: 111129ms
0(1589247): 000. 24. 63.255	3033ms
1(3702783): 000. 56.127.255	2833ms
2(3686399): 000. 56. 63.255	2630ms
3(1605631): 000. 24.127.255	2574ms
4(1581055): 000. 24. 31.255	2395ms
5(3671039): 000. 56.  3.255	2289ms
6(1573887): 000. 24.  3.255	2234ms
7(3678207): 000. 56. 31.255	2143ms
8(3670527): 000. 56.  1.255	2109ms
9(1573375): 000. 24.  1.255	2070ms

*After* on a 4-core switch:
v4 Create of 4194304 routes: 65305ms
0(2097153): 000. 32.  0.  1	1871ms
1(1048577): 000. 16.  0.  1	1064ms
2(2097152): 000. 32.  0.  0	905ms
3(524289): 000.  8.  0.  1	507ms
4(1048576): 000. 16.  0.  0	451ms
5(2097154): 000. 32.  0.  2	355ms
6(262145): 000.  4.  0.  1	240ms
7(524288): 000.  8.  0.  0	230ms
8(262144): 000.  4.  0.  0	115ms
9(131073): 000.  2.  0.  1	109ms

v4 Delete of 4194304 routes: 38015ms
0(3571711): 000. 54.127.255	1616ms
1(3565567): 000. 54.103.255	1340ms
2(3670015): 000. 55.255.255	1297ms
3(3565183): 000. 54.102.127	1226ms
4(3565159): 000. 54.102.103	912ms
5(3604479): 000. 54.255.255	596ms
6(3670016): 000. 56.  0.  0	474ms
7(3565311): 000. 54.102.255	434ms
8(3567615): 000. 54.111.255	388ms
9(3565167): 000. 54.102.111	376ms

After the patch there is one core completely busy with the benchmark,
while previously none of the CPUs was busy. By tuning the balancing
budget sysctl knob, one can spread the balancing work of a route
add/remove across neighbouring changes (at the price of a possibly less
balanced trie and slightly more expensive lookups).

Fixes: fc86a93b46d7 ("fib_trie: Push tnode flushing down to inflate/halve")
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 net/ipv4/fib_trie.c | 53 +++++++++++++++++++++++++++++++--------------
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 2ce2739e7693..5773d479e7d2 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -184,16 +184,9 @@ struct trie {
 
 static struct key_vector *resize(struct trie *t, struct key_vector *tn,
 				 unsigned int *budget);
-static size_t tnode_free_size;
+static atomic_long_t objects_waiting_rcu;
 unsigned int fib_balance_budget = UINT_MAX;
 
-/*
- * synchronize_rcu after call_rcu for that many pages; it should be especially
- * useful before resizing the root node with PREEMPT_NONE configs; the value was
- * obtained experimentally, aiming to avoid visible slowdown.
- */
-static const int sync_pages = 128;
-
 static struct kmem_cache *fn_alias_kmem __ro_after_init;
 static struct kmem_cache *trie_leaf_kmem __ro_after_init;
 
@@ -306,11 +299,16 @@ static const int inflate_threshold_root = 30;
 static void __alias_free_mem(struct rcu_head *head)
 {
 	struct fib_alias *fa = container_of(head, struct fib_alias, rcu);
+
+	atomic_long_dec(&objects_waiting_rcu);
 	kmem_cache_free(fn_alias_kmem, fa);
 }
 
 static inline void alias_free_mem_rcu(struct fib_alias *fa)
 {
+	ASSERT_RTNL();
+
+	atomic_long_inc(&objects_waiting_rcu);
 	call_rcu(&fa->rcu, __alias_free_mem);
 }
 
@@ -318,13 +316,40 @@ static void __node_free_rcu(struct rcu_head *head)
 {
 	struct tnode *n = container_of(head, struct tnode, rcu);
 
+	atomic_long_dec(&objects_waiting_rcu);
 	if (!n->tn_bits)
 		kmem_cache_free(trie_leaf_kmem, n);
 	else
 		kvfree(n);
 }
 
-#define node_free(n) call_rcu(&tn_info(n)->rcu, __node_free_rcu)
+static inline void node_free(struct key_vector *n)
+{
+	ASSERT_RTNL();
+
+	atomic_long_inc(&objects_waiting_rcu);
+	call_rcu(&tn_info(n)->rcu, __node_free_rcu);
+}
+
+static unsigned long fib_shrink_count(struct shrinker *s,
+				      struct shrink_control *sc)
+{
+	return (unsigned long)atomic_long_read(&objects_waiting_rcu);
+}
+
+static unsigned long fib_shrink_scan(struct shrinker *s,
+				    struct shrink_control *sc)
+{
+	long ret = (unsigned long)atomic_long_read(&objects_waiting_rcu);
+
+	synchronize_rcu();
+	return (unsigned long)ret;
+}
+
+static struct shrinker fib_shrinker = {
+	.count_objects = fib_shrink_count,
+	.scan_objects = fib_shrink_scan,
+};
 
 static struct tnode *tnode_alloc(int bits)
 {
@@ -494,16 +519,9 @@ static void tnode_free(struct key_vector *tn)
 
 	while (head) {
 		head = head->next;
-		tnode_free_size += TNODE_SIZE(1ul << tn->bits);
 		node_free(tn);
-
 		tn = container_of(head, struct tnode, rcu)->kv;
 	}
-
-	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
-		tnode_free_size = 0;
-		synchronize_rcu();
-	}
 }
 
 static struct key_vector *replace(struct trie *t,
@@ -2118,6 +2136,9 @@ void __init fib_trie_init(void)
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
 					   0, SLAB_PANIC, NULL);
+
+	if (register_shrinker(&fib_shrinker))
+		panic("IP FIB: failed to register fib_shrinker\n");
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
-- 
2.21.0



* Re: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
  2019-03-26 15:30 ` [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb Dmitry Safonov
@ 2019-03-26 15:39   ` David Ahern
  2019-03-26 17:15     ` Dmitry Safonov
  2019-03-26 23:14     ` Dmitry Safonov
  0 siblings, 2 replies; 18+ messages in thread
From: David Ahern @ 2019-03-26 15:39 UTC (permalink / raw)
  To: Dmitry Safonov, linux-kernel
  Cc: Alexander Duyck, Alexey Kuznetsov, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev

On 3/26/19 9:30 AM, Dmitry Safonov wrote:
> Fib trie has a hard-coded sync_pages limit to call synchronize_rcu().
> The limit is 128 pages, or 512Kb (in the common case of 4Kb pages).
> 
> Unfortunately, at Arista we have use-scenarios with full-view software
> forwarding. At the scale of 100K and more routes, even on 2-core boxes
> the hard-coded limit starts actively shooting us in the leg: the lockup
> detector notices that rtnl_lock is held for seconds.
> The first reason is the previously broken MAX_WORK, which didn't limit
> pending balancing work. While fixing that, I've noticed that the
> bottleneck is actually the number of synchronize_rcu() calls.
> 
> I've tried to fix it with a patch decrementing the number of tnodes in
> an rcu callback, but it didn't affect performance much.
> 
> One possible way to "fix" it is to provide another sysctl to control
> sync_pages, but in my view that's nasty - exposing another
> implementation detail to user-space.

well, that was accepted last week. ;-)

commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f
Author: David Ahern <dsahern@gmail.com>
Date:   Wed Mar 20 09:18:59 2019 -0700

    ipv4: Allow amount of dirty memory from fib resizing to be controllable


Can you see how that change (should backport easily) affects your test
case? From my perspective 16MB was the sweet spot.


* Re: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
  2019-03-26 15:39   ` David Ahern
@ 2019-03-26 17:15     ` Dmitry Safonov
  2019-03-26 17:57       ` David Ahern
  2019-03-26 23:14     ` Dmitry Safonov
  1 sibling, 1 reply; 18+ messages in thread
From: Dmitry Safonov @ 2019-03-26 17:15 UTC (permalink / raw)
  To: David Ahern, linux-kernel
  Cc: Alexander Duyck, Alexey Kuznetsov, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev

Hi David,

On 3/26/19 3:39 PM, David Ahern wrote:
> On 3/26/19 9:30 AM, Dmitry Safonov wrote:
>> Fib trie has a hard-coded sync_pages limit to call synchronize_rcu().
>> The limit is 128 pages, or 512Kb (in the common case of 4Kb pages).
>>
>> Unfortunately, at Arista we have use-scenarios with full-view software
>> forwarding. At the scale of 100K and more routes, even on 2-core boxes
>> the hard-coded limit starts actively shooting us in the leg: the lockup
>> detector notices that rtnl_lock is held for seconds.
>> The first reason is the previously broken MAX_WORK, which didn't limit
>> pending balancing work. While fixing that, I've noticed that the
>> bottleneck is actually the number of synchronize_rcu() calls.
>>
>> I've tried to fix it with a patch decrementing the number of tnodes in
>> an rcu callback, but it didn't affect performance much.
>>
>> One possible way to "fix" it is to provide another sysctl to control
>> sync_pages, but in my view that's nasty - exposing another
>> implementation detail to user-space.
> 
> well, that was accepted last week. ;-)
> 
> commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f
> Author: David Ahern <dsahern@gmail.com>
> Date:   Wed Mar 20 09:18:59 2019 -0700
> 
>     ipv4: Allow amount of dirty memory from fib resizing to be controllable
> 
> 
> Can you see how that change (should backport easily) affects your test
> case? From my perspective 16MB was the sweet spot.

Heh, I based my series on master, so I hadn't seen it yet.

I still wonder whether it's better to expose it to userspace rather
than use a shrinker, but this should probably work for me - I'll test
it in the coming days.

Thanks,
          Dmitry


* Re: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
  2019-03-26 17:15     ` Dmitry Safonov
@ 2019-03-26 17:57       ` David Ahern
  2019-03-26 18:17         ` Dmitry Safonov
  0 siblings, 1 reply; 18+ messages in thread
From: David Ahern @ 2019-03-26 17:57 UTC (permalink / raw)
  To: Dmitry Safonov, linux-kernel
  Cc: Alexander Duyck, Alexey Kuznetsov, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev

On 3/26/19 11:15 AM, Dmitry Safonov wrote:
> I still wonder whether it's better to expose it to userspace rather
> than use a shrinker, but this should probably work for me - I'll test
> it in the coming days.

I did not know about the shrinker, so I cannot comment on whether it is
better or not. If so, my patch can always be reverted since it was just
applied.


* Re: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
  2019-03-26 17:57       ` David Ahern
@ 2019-03-26 18:17         ` Dmitry Safonov
  0 siblings, 0 replies; 18+ messages in thread
From: Dmitry Safonov @ 2019-03-26 18:17 UTC (permalink / raw)
  To: David Ahern, linux-kernel
  Cc: Alexander Duyck, Alexey Kuznetsov, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev

On 3/26/19 5:57 PM, David Ahern wrote:
> On 3/26/19 11:15 AM, Dmitry Safonov wrote:
>> I still wonder whether it's better to expose it to userspace rather
>> than use a shrinker, but this should probably work for me - I'll test
>> it in the coming days.
> 
> I did not know about the shrinker, so I cannot comment on whether it
> is better or not. If so, my patch can always be reverted since it was
> just applied.

Oh, don't misunderstand me - I didn't mean to say your patch is wrong;
I just wondered what made this approach preferable for you.
I've sent my patches as an RFC, so this is probably the place to
discuss what's better, rather than rushing to revert your commit.

Initially, I was also thinking about introducing some sysctl like that,
but thought it a bit "dirty" to expose this to userspace for tuning,
rather than making it work automatically under OOM.

On the other side, with your sysctl there are some minor bits that are
not yet perfect and ask to be improved if we go this way:

1. I've noticed that the limit is only ever increased or reset to zero,
   which in turn can result in more calls to synchronize_rcu() than
   needed.
2. The memory summed up under the limit counts only the tnodes freed in
   tnode_free(), while leaf nodes are not counted there. Also some
   tnodes are being freed with node_free(), if I'm not mistaken.

So please don't take me wrong: your patch may also work for me, but it's
probably worth discussing both ways, and I very much value your opinion here.

Thanks,
          Dmitry

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
  2019-03-26 15:39   ` David Ahern
  2019-03-26 17:15     ` Dmitry Safonov
@ 2019-03-26 23:14     ` Dmitry Safonov
  2019-03-27  3:33       ` Paul E. McKenney
  1 sibling, 1 reply; 18+ messages in thread
From: Dmitry Safonov @ 2019-03-26 23:14 UTC (permalink / raw)
  To: David Ahern, linux-kernel, Paul E. McKenney
  Cc: Alexander Duyck, Alexey Kuznetsov, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev

On 3/26/19 3:39 PM, David Ahern wrote:
> On 3/26/19 9:30 AM, Dmitry Safonov wrote:
>> Fib trie has a hard-coded sync_pages limit to call synchronise_rcu().
>> The limit is 128 pages or 512Kb (considering common case with 4Kb
>> pages).
>>
>> Unfortunately, at Arista we have use-scenarios with full view software
>> forwarding. At the scale of 100K and more routes even on 2 core boxes
>> the hard-coded limit starts actively shooting in the leg: lockup
>> detector notices that rtnl_lock is held for seconds.
>> First reason is previously broken MAX_WORK, that didn't limit pending
>> balancing work. While fixing it, I've noticed that the bottle-neck is
>> actually in the number of synchronise_rcu() calls.
>>
>> I've tried to fix it with a patch to decrement number of tnodes in rcu
>> callback, but it hasn't much affected performance.
>>
>> One possible way to "fix" it - provide another sysctl to control
>> sync_pages - but in my POV it's nasty: exposing another implementation
>> detail to user-space.
> 
> well, that was accepted last week. ;-)
> 
> commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f
> Author: David Ahern <dsahern@gmail.com>
> Date:   Wed Mar 20 09:18:59 2019 -0700
> 
>     ipv4: Allow amount of dirty memory from fib resizing to be controllable
> 
> 
> Can you see how that change (should backport easily) affects your test
> case? From my perspective 16MB was the sweet spot.

FWIW, I would like to +Cc Paul here.

TL;DR: David and I are looking into ways to improve the hard-coded limit
tnode_free_size in net/ipv4/fib_trie.c: currently it's way too low
(512Kb). David created a patch to provide a sysctl that controls the limit,
which would solve the problem for both of us. In parallel, I thought that
exposing this to userspace is not much fun and added a shrinker with
synchronize_rcu(). I'm not at all sure that the latter is actually a sane
solution.
Is there any guarantee that memory to be freed by call_rcu() will get
freed under OOM conditions? Might there be a chance that we don't need any
limit here at all?

Worth mentioning that I don't argue against David's patch - as I pointed
out, it would (will) solve the problem for us both - but with good
intentions I'm wondering whether we can do something here other than a
new sysctl knob.

Thanks,
          Dmitry

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
  2019-03-26 23:14     ` Dmitry Safonov
@ 2019-03-27  3:33       ` Paul E. McKenney
  0 siblings, 0 replies; 18+ messages in thread
From: Paul E. McKenney @ 2019-03-27  3:33 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: David Ahern, linux-kernel, Alexander Duyck, Alexey Kuznetsov,
	David S. Miller, Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
	netdev

On Tue, Mar 26, 2019 at 11:14:43PM +0000, Dmitry Safonov wrote:
> On 3/26/19 3:39 PM, David Ahern wrote:
> > On 3/26/19 9:30 AM, Dmitry Safonov wrote:
> >> Fib trie has a hard-coded sync_pages limit to call synchronise_rcu().
> >> The limit is 128 pages or 512Kb (considering common case with 4Kb
> >> pages).
> >>
> >> Unfortunately, at Arista we have use-scenarios with full view software
> >> forwarding. At the scale of 100K and more routes even on 2 core boxes
> >> the hard-coded limit starts actively shooting in the leg: lockup
> >> detector notices that rtnl_lock is held for seconds.
> >> First reason is previously broken MAX_WORK, that didn't limit pending
> >> balancing work. While fixing it, I've noticed that the bottle-neck is
> >> actually in the number of synchronise_rcu() calls.
> >>
> >> I've tried to fix it with a patch to decrement number of tnodes in rcu
> >> callback, but it hasn't much affected performance.
> >>
> >> One possible way to "fix" it - provide another sysctl to control
> >> sync_pages - but in my POV it's nasty: exposing another implementation
> >> detail to user-space.
> > 
> > well, that was accepted last week. ;-)
> > 
> > commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f
> > Author: David Ahern <dsahern@gmail.com>
> > Date:   Wed Mar 20 09:18:59 2019 -0700
> > 
> >     ipv4: Allow amount of dirty memory from fib resizing to be controllable
> > 
> > 
> > Can you see how that change (should backport easily) affects your test
> > case? From my perspective 16MB was the sweet spot.
> 
> FWIW, I would like to +Cc Paul here.
> 
> TL;DR: David and I are looking into ways to improve the hard-coded limit
> tnode_free_size in net/ipv4/fib_trie.c: currently it's way too low
> (512Kb). David created a patch to provide a sysctl that controls the limit,
> which would solve the problem for both of us. In parallel, I thought that
> exposing this to userspace is not much fun and added a shrinker with
> synchronize_rcu(). I'm not at all sure that the latter is actually a sane
> solution.
> Is there any guarantee that memory to be freed by call_rcu() will get
> freed under OOM conditions? Might there be a chance that we don't need any
> limit here at all?

Yes, unless whatever is causing the OOM is also stalling a CPU or task
that RCU is waiting on.  The extreme case is of course when the OOM is
in fact being caused by the fact that RCU is waiting on a stalled CPU
or task.  Of course, the fact that the CPU or task is being stalled is
a bug in its own right.

So, in the absence of bugs, yes, the memory that was passed to call_rcu()
should be freed within a reasonable length of time, even under OOM
conditions.

> Worth mentioning that I don't argue against David's patch - as I pointed
> out, it would (will) solve the problem for us both - but with good
> intentions I'm wondering whether we can do something here other than a
> new sysctl knob.

An intermediate position would be to have a reasonably high setting so
that the sysctl knob almost never needed to be adjusted.

RCU used to detect OOM conditions and work harder to finish the grace
period in those cases, but this was abandoned because it was found not
to make a significant difference in production.  Which might support
the position of assuming that memory passed to call_rcu() gets freed
reasonably quickly even under OOM conditions.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc()
  2019-03-26 15:30 ` [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc() Dmitry Safonov
@ 2019-04-01 15:40   ` Alexander Duyck
  2019-04-01 15:55     ` Dmitry Safonov
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2019-04-01 15:40 UTC (permalink / raw)
  To: Dmitry Safonov, linux-kernel
  Cc: Alexey Kuznetsov, David Ahern, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev

On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote:
> TNODE_KMALLOC_MAX is not used anywhere, while TNODE_VMALLOC_MAX check in
> tnode_alloc() only adds additional cmp/jmp instructions to tnode
> allocation. During rebalancing of the trie the function can be called
> thousands of times. The runtime check costs a cache line and a branch
> predictor entry. Furthermore, this check is always false on 64-bit
> platforms, as ipv4 has only a 4-byte address range and bits are limited
> by KEYLENGTH (32).
> 
> Move the check under unlikely() and change comparison to BITS_PER_LONG,
> optimizing allocation of tnode during rebalancing (and removing it
> completely for platforms with BITS_PER_LONG > KEYLENGTH).
> 
> Signed-off-by: Dmitry Safonov <dima@arista.com>
> ---
>  net/ipv4/fib_trie.c | 8 +-------
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index a573e37e0615..ad7d56c421cb 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -312,11 +312,6 @@ static inline void alias_free_mem_rcu(struct fib_alias *fa)
>  	call_rcu(&fa->rcu, __alias_free_mem);
>  }
>  
> -#define TNODE_KMALLOC_MAX \
> -	ilog2((PAGE_SIZE - TNODE_SIZE(0)) / sizeof(struct key_vector *))
> -#define TNODE_VMALLOC_MAX \
> -	ilog2((SIZE_MAX - TNODE_SIZE(0)) / sizeof(struct key_vector *))
> -
>  static void __node_free_rcu(struct rcu_head *head)
>  {
>  	struct tnode *n = container_of(head, struct tnode, rcu);
> @@ -333,8 +328,7 @@ static struct tnode *tnode_alloc(int bits)
>  {
>  	size_t size;
>  
> -	/* verify bits is within bounds */
> -	if (bits > TNODE_VMALLOC_MAX)
> +	if ((BITS_PER_LONG <= KEYLENGTH) && unlikely(bits >= BITS_PER_LONG))
>  		return NULL;
>  
>  	/* determine size and verify it is non-zero and didn't overflow */

I think it would be better if we left TNODE_VMALLOC_MAX instead of
replacing it with BITS_PER_LONG. This way we know that we are limited
by the size of the node on 32b systems, and by the KEYLENGTH on 64b
systems. The basic idea is to maintain the logic as to why we are doing
it this way instead of just burying things by using built in constants
that are close enough to work.

So for example I believe TNODE_VMALLOC_MAX is 31 on a 32b system. The
main reason for that is because we have to subtract the TNODE_SIZE from
the upper limit for size. By replacing TNODE_VMALLOC_MAX with
BITS_PER_LONG that becomes abstracted away and it becomes more likely
that somebody will mishandle it later.

- Alex


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc()
  2019-04-01 15:40   ` Alexander Duyck
@ 2019-04-01 15:55     ` Dmitry Safonov
  2019-04-01 17:50       ` Alexander Duyck
  0 siblings, 1 reply; 18+ messages in thread
From: Dmitry Safonov @ 2019-04-01 15:55 UTC (permalink / raw)
  To: Alexander Duyck, Dmitry Safonov, linux-kernel
  Cc: Alexey Kuznetsov, David Ahern, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev

Hi Alexander,

On 4/1/19 4:40 PM, Alexander Duyck wrote:
>> @@ -333,8 +328,7 @@ static struct tnode *tnode_alloc(int bits)
>>  {
>>  	size_t size;
>>  
>> -	/* verify bits is within bounds */
>> -	if (bits > TNODE_VMALLOC_MAX)
>> +	if ((BITS_PER_LONG <= KEYLENGTH) && unlikely(bits >= BITS_PER_LONG))
>>  		return NULL;
>>  
>>  	/* determine size and verify it is non-zero and didn't overflow */
> 
> I think it would be better if we left TNODE_VMALLOC_MAX instead of
> replacing it with BITS_PER_LONG. This way we know that we are limited
> by the size of the node on 32b systems, and by the KEYLENGTH on 64b
> systems. The basic idea is to maintain the logic as to why we are doing
> it this way instead of just burying things by using built in constants
> that are close enough to work.
> 
> So for example I believe TNODE_VMALLOC_MAX is 31 on a 32b system.

This is also true after the change: bits == 31 will *not* return NULL.

> The
> main reason for that is because we have to subtract the TNODE_SIZE from
> the upper limit for size. By replacing TNODE_VMALLOC_MAX with
> BITS_PER_LONG that becomes abstracted away and it becomes more likely
> that somebody will mishandle it later.

So, I wanted to remove the run-time check here on x86_64.
I could do it by adding !CONFIG_64BIT around the check.

But, I thought about the value of the check:
I believe it's here not to limit maximum allocated size:
kzalloc()/vzalloc() will fail and we should be fine with that.

In my opinion it's rather there to check that (1UL << bits) wouldn't result in UB.


Thanks,
          Dmitry

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc()
  2019-04-01 15:55     ` Dmitry Safonov
@ 2019-04-01 17:50       ` Alexander Duyck
  2019-04-04 16:33         ` Dmitry Safonov
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2019-04-01 17:50 UTC (permalink / raw)
  To: Dmitry Safonov, Dmitry Safonov, linux-kernel
  Cc: Alexey Kuznetsov, David Ahern, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev

On Mon, 2019-04-01 at 16:55 +0100, Dmitry Safonov wrote:
> Hi Alexander,
> 
> On 4/1/19 4:40 PM, Alexander Duyck wrote:
> > > @@ -333,8 +328,7 @@ static struct tnode *tnode_alloc(int bits)
> > >  {
> > >  	size_t size;
> > >  
> > > -	/* verify bits is within bounds */
> > > -	if (bits > TNODE_VMALLOC_MAX)
> > > +	if ((BITS_PER_LONG <= KEYLENGTH) && unlikely(bits >= BITS_PER_LONG))
> > >  		return NULL;
> > >  
> > >  	/* determine size and verify it is non-zero and didn't overflow */
> > 
> > I think it would be better if we left TNODE_VMALLOC_MAX instead of
> > replacing it with BITS_PER_LONG. This way we know that we are limited
> > by the size of the node on 32b systems, and by the KEYLENGTH on 64b
> > systems. The basic idea is to maintain the logic as to why we are doing
> > it this way instead of just burying things by using built in constants
> > that are close enough to work.
> > 
> > So for example I believe TNODE_VMALLOC_MAX is 31 on a 32b system.
> 
> This is also true after the change: bits == 31 will *not* return NULL.

Actually now that I think about it TNODE_VMALLOC_MAX is likely much
less than 31. The logic that we have to be concerned with is:
	size = TNODE_SIZE(1ul << bits);

If size is a 32b value, and the size of a pointer is 4 bytes, then our
upper limit is roughly ilog2((4G - 28) / 4), which comes out to 29.
What we are trying to avoid is overflowing the size variable, not
actually limiting the vmalloc itself.

> > The
> > main reason for that is because we have to subtract the TNODE_SIZE from
> > the upper limit for size. By replacing TNODE_VMALLOC_MAX with
> > BITS_PER_LONG that becomes abstracted away and it becomes more likely
> > that somebody will mishandle it later.
> 
> So, I wanted to remove the run-time check here on x86_64.
> I could do it by adding !CONFIG_64BIT around the check.

I have no problem with that. All I am suggesting is that if at all
possible we should use TNODE_VMALLOC_MAX instead of BITS_PER_LONG.

> But, I thought about the value of the check:
> I believe it's here not to limit maximum allocated size:
> kzalloc()/vzalloc() will fail and we should be fine with that.

No, the problem is we don't want to overflow size. The allocation will
succeed, but give us a much smaller allocation than we expected.

> In my opinion it's rather there to check that (1UL << bits) wouldn't result in UB.

Sort of; however, we have to keep in mind that 1ul << bits is a count of
child pointers, so it is also scaled by the size of a pointer. As such
the logic might be better expressed as sizeof(void *) << bits.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 2/4] net/fib: Provide fib_balance_budget sysctl
  2019-03-26 15:30 ` [RFC 2/4] net/fib: Provide fib_balance_budget sysctl Dmitry Safonov
@ 2019-04-01 18:09   ` Alexander Duyck
  2019-04-04 18:31     ` Dmitry Safonov
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2019-04-01 18:09 UTC (permalink / raw)
  To: Dmitry Safonov, linux-kernel
  Cc: Alexey Kuznetsov, David Ahern, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev, linux-doc,
	Jonathan Corbet

On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote:
> Unfortunately, MAX_WORK at this moment is broken: during
> trie_rebalance() climbing upwards resize() is being called for each
> tnode. Which makes the limit useless the bigger trie gets: at each level
> there are 10 attempts to balance tnode and childs, resulting in
> O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is
> the level where we changed the trie originally (adding/removing
> alias/route).
> 
> Which results in the following price of removing one route under big
> trie (similar for some other single routes that results in reallocating
> tnodes by hitting the threshold limits for halve/inflate resize):

The info below doesn't really tell us the price. It might be useful if
you could just time the removal of that one route and provide that
information.

> Before:
>    Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>    Main:
>            Aver depth:     1.99
>            Max depth:      2
>            Leaves:         77825
>            Prefixes:       77825
>            Internal nodes: 77
>              9: 1  10: 76
>            Pointers: 78336
>    Null ptrs: 435
>    Total size: 7912  kB
>    Local:
>    [omitted]
> 
> After:
>    Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>    Main:
>            Aver depth:     3.00
>            Max depth:      3
>            Leaves:         77824
>            Prefixes:       77824
>            Internal nodes: 20491
>              1: 2048  2: 18432  6: 1  11: 10
>            Pointers: 98368
>    Null ptrs: 54
>    Total size: 8865  kB
>    Local:
>    [omitted]
> 

Is this before/after the patch, or just before/after the removal of the
one route? I assume you are just talking about the one route since the
number of prefixes has dropped by 1.

> Provide a sysctl to control amount of pending balancing work.
> (by default unlimited as it was)
> 
> Cc: linux-doc@vger.kernel.org
> Cc: Jonathan Corbet <corbet@lwn.net>
> Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down
> into inflate/halve")
> Signed-off-by: Dmitry Safonov <dima@arista.com>
> ---
>  Documentation/networking/ip-sysctl.txt |  6 +++
>  include/net/ip.h                       |  1 +
>  net/ipv4/fib_trie.c                    | 60 +++++++++++++++-----------
>  net/ipv4/sysctl_net_ipv4.c             |  7 +++
>  4 files changed, 49 insertions(+), 25 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index acdfb5d2bcaa..fb71dacff4dd 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER
>  	0 - Layer 3
>  	1 - Layer 4
>  
> +fib_balance_budget - UNSIGNED INTEGER
> +	Limits the number of resize attempts during balancing fib trie
> +	on adding/removing new routes.
> +	Possible values:
> +	Default: UINT_MAX (0xFFFFFFFF)
> +

So I am not really a fan of the use of UNSIGNED_INTEGER here. I would
much rather see this be a signed integer and the test be for the value
going negative instead of having to test in so many places if the value
is 0.

>  ip_forward_update_priority - INTEGER
>  	Whether to update SKB priority from "TOS" field in IPv4 header after it
>  	is forwarded. The new SKB priority is mapped from TOS field value
> diff --git a/include/net/ip.h b/include/net/ip.h
> index be3cad9c2e4c..305d0e43088b 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -421,6 +421,7 @@ static inline unsigned int ip_skb_dst_mtu(struct sock *sk,
>  	return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU);
>  }
>  
> +extern unsigned int fib_balance_budget;
>  struct dst_metrics *ip_fib_metrics_init(struct net *net, struct nlattr *fc_mx,
>  					int fc_mx_len,
>  					struct netlink_ext_ack *extack);
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index ad7d56c421cb..d90cf9dfd443 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -182,8 +182,10 @@ struct trie {
>  #endif
>  };
>  
> -static struct key_vector *resize(struct trie *t, struct key_vector *tn);
> +static struct key_vector *resize(struct trie *t, struct key_vector *tn,
> +				 unsigned int *budget);
>  static size_t tnode_free_size;
> +unsigned int fib_balance_budget = UINT_MAX;
>  
>  /*
>   * synchronize_rcu after call_rcu for that many pages; it should be especially
> @@ -506,7 +508,8 @@ static void tnode_free(struct key_vector *tn)
>  
>  static struct key_vector *replace(struct trie *t,
>  				  struct key_vector *oldtnode,
> -				  struct key_vector *tn)
> +				  struct key_vector *tn,
> +				  unsigned int *budget)
>  {
>  	struct key_vector *tp = node_parent(oldtnode);
>  	unsigned long i;
> @@ -522,19 +525,19 @@ static struct key_vector *replace(struct trie *t,
>  	tnode_free(oldtnode);
>  
>  	/* resize children now that oldtnode is freed */
> -	for (i = child_length(tn); i;) {
> +	for (i = child_length(tn); i && *budget;) {

So this is an example of the kind of stuff I am talking about. Having a
check for budget here isn't going to add much in the way of value. I
would much rather keep the logic contained within resize, and just do a
check for *budget >= 0.

>  		struct key_vector *inode = get_child(tn, --i);
>  
>  		/* resize child node */
>  		if (tnode_full(tn, inode))
> -			tn = resize(t, inode);
> +			tn = resize(t, inode, budget);
>  	}
>  
>  	return tp;
>  }
>  
> -static struct key_vector *inflate(struct trie *t,
> -				  struct key_vector *oldtnode)
> +static struct key_vector *inflate(struct trie *t, struct key_vector *oldtnode,
> +				  unsigned int *budget)
>  {
>  	struct key_vector *tn;
>  	unsigned long i;
> @@ -621,7 +624,7 @@ static struct key_vector *inflate(struct trie *t,
>  	}
>  
>  	/* setup the parent pointers into and out of this node */
> -	return replace(t, oldtnode, tn);
> +	return replace(t, oldtnode, tn, budget);
>  nomem:
>  	/* all pointers should be clean so we are done */
>  	tnode_free(tn);
> @@ -629,8 +632,8 @@ static struct key_vector *inflate(struct trie *t,
>  	return NULL;
>  }
>  
> -static struct key_vector *halve(struct trie *t,
> -				struct key_vector *oldtnode)
> +static struct key_vector *halve(struct trie *t, struct key_vector *oldtnode,
> +				unsigned int *budget)
>  {
>  	struct key_vector *tn;
>  	unsigned long i;
> @@ -676,7 +679,7 @@ static struct key_vector *halve(struct trie *t,
>  	}
>  
>  	/* setup the parent pointers into and out of this node */
> -	return replace(t, oldtnode, tn);
> +	return replace(t, oldtnode, tn, budget);
>  nomem:
>  	/* all pointers should be clean so we are done */
>  	tnode_free(tn);
> @@ -843,15 +846,15 @@ static inline bool should_collapse(struct key_vector *tn)
>  	return used < 2;
>  }
>  
> -#define MAX_WORK 10
> -static struct key_vector *resize(struct trie *t, struct key_vector *tn)
> +static struct key_vector *resize(struct trie *t, struct key_vector *tn,
> +				 unsigned int *budget)
>  {
>  #ifdef CONFIG_IP_FIB_TRIE_STATS
>  	struct trie_use_stats __percpu *stats = t->stats;
>  #endif
>  	struct key_vector *tp = node_parent(tn);
>  	unsigned long cindex = get_index(tn->key, tp);
> -	int max_work = MAX_WORK;
> +	bool inflated = false;
>  
>  	pr_debug("In tnode_resize %p inflate_threshold=%d threshold=%d\n",
>  		 tn, inflate_threshold, halve_threshold);
> @@ -865,8 +868,8 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>  	/* Double as long as the resulting node has a number of
>  	 * nonempty nodes that are above the threshold.
>  	 */
> -	while (should_inflate(tp, tn) && max_work) {
> -		tp = inflate(t, tn);
> +	while (should_inflate(tp, tn) && *budget) {
> +		tp = inflate(t, tn, budget);
>  		if (!tp) {
>  #ifdef CONFIG_IP_FIB_TRIE_STATS
>  			this_cpu_inc(stats->resize_node_skipped);
> @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>  			break;
>  		}
>  
> -		max_work--;
> +		(*budget)--;

One thing we might explore here would be to look at pulling out the
check of budget at the start of the loop and instead combine the
decrement with the check for < 0 here so we have something like:
		if (--*budget < 0)
			break;

That way what should happen is that we will achieve maximum depth in
any resize and make at least one pass optimizing from the bottom up
instead of the top down.

> +		inflated = true;
>  		tn = get_child(tp, cindex);
>  	}
>  
>  	/* update parent in case inflate failed */
>  	tp = node_parent(tn);
>  
> -	/* Return if at least one inflate is run */
> -	if (max_work != MAX_WORK)
> +	/* Return if at least one inflate is run:
> +	 * microoptimization to not recalculate thresholds
> +	 */
> +	if (inflated)
>  		return tp;

An alternative to having to add another variable would be to look at
switching the loop to a combination of an if and a do/while like so:
	if (should_inflate(tp, tn)) {
		do {
			...
		} while (should_inflate(tp, tn));

		/* update parent in case inflate failed */
		return node_parent(tn);
	}

>  	/* Halve as long as the number of empty children in this
>  	 * node is above threshold.
>  	 */
> -	while (should_halve(tp, tn) && max_work) {
> -		tp = halve(t, tn);
> +	while (should_halve(tp, tn) && *budget) {
> +		tp = halve(t, tn, budget);
>  		if (!tp) {
>  #ifdef CONFIG_IP_FIB_TRIE_STATS
>  			this_cpu_inc(stats->resize_node_skipped);
> @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>  			break;
>  		}
>  
> -		max_work--;
> +		(*budget)--;

Same here. It might make sense to pull the budget check out of the
start of the loop and move it here so that it becomes a depth first
optimization instead of being a shallow one.

>  		tn = get_child(tp, cindex);
>  	}
>  
> @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
>  
>  static void trie_rebalance(struct trie *t, struct key_vector *tn)
>  {
> -	while (!IS_TRIE(tn))
> -		tn = resize(t, tn);
> +	unsigned int budget = fib_balance_budget;
> +
> +	while (budget && !IS_TRIE(tn))
> +		tn = resize(t, tn, &budget);

I would not have the budget check here. At a minimum we should be
replacing trie nodes with leaves in the case that we are down to one
leaf node in the trie.

>  }
>  
>  static int fib_insert_node(struct trie *t, struct key_vector *tp,
> @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb)
>  void fib_table_flush_external(struct fib_table *tb)
>  {
>  	struct trie *t = (struct trie *)tb->tb_data;
> +	unsigned int budget = fib_balance_budget;
>  	struct key_vector *pn = t->kv;
>  	unsigned long cindex = 1;
>  	struct hlist_node *tmp;
> @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb)
>  				update_suffix(pn);
>  
>  			/* resize completed node */
> -			pn = resize(t, pn);
> +			pn = resize(t, pn, &budget);
>  			cindex = get_index(pkey, pn);
>  
>  			continue;

So we would probably want a completely different budget here, or at
least we would want the budget to be multiplied by the leaves we are
removing. The problem is we aren't pulling single addresses, but are
literally splitting the tree into two separate tries. The same budget
isn't going to work for both cases.

> @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb)
>  int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>  {
>  	struct trie *t = (struct trie *)tb->tb_data;
> +	unsigned int budget = fib_balance_budget;
>  	struct key_vector *pn = t->kv;
>  	unsigned long cindex = 1;
>  	struct hlist_node *tmp;
> @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>  				update_suffix(pn);
>  
>  			/* resize completed node */
> -			pn = resize(t, pn);
> +			pn = resize(t, pn, &budget);
>  			cindex = get_index(pkey, pn);
>  
>  			continue;

Similar issues here, though I don't know what the typical amount of
change is we expect to see from a fib_table_flush call versus something
like the fib_table_flush_external call.

> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index ba0fc4b18465..d7274cc442af 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -443,6 +443,13 @@ static struct ctl_table ipv4_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec
>  	},
> +	{
> +		.procname	= "fib_balance_budget",
> +		.data		= &fib_balance_budget,
> +		.maxlen		= sizeof(fib_balance_budget),
> +		.mode		= 0644,
> +		.proc_handler	= proc_douintvec,
> +	},
>  	{
>  		.procname	= "inet_peer_threshold",
>  		.data		= &inet_peer_threshold,


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 3/4] net/fib: Check budget before should_{inflate,halve}()
  2019-03-26 15:30 ` [RFC 3/4] net/fib: Check budget before should_{inflate,halve}() Dmitry Safonov
@ 2019-04-01 18:20   ` Alexander Duyck
  0 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-04-01 18:20 UTC (permalink / raw)
  To: Dmitry Safonov, linux-kernel
  Cc: Alexey Kuznetsov, David Ahern, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev

On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote:
> Those functions are compute-costly, if we're out of budget - better
> omit additional computations.
> 
> Signed-off-by: Dmitry Safonov <dima@arista.com>
> ---
>  net/ipv4/fib_trie.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index d90cf9dfd443..2ce2739e7693 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -868,7 +868,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn,
>  	/* Double as long as the resulting node has a number of
>  	 * nonempty nodes that are above the threshold.
>  	 */
> -	while (should_inflate(tp, tn) && *budget) {
> +	while (*budget && should_inflate(tp, tn)) {
>  		tp = inflate(t, tn, budget);
>  		if (!tp) {
>  #ifdef CONFIG_IP_FIB_TRIE_STATS
> @@ -894,7 +894,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn,
>  	/* Halve as long as the number of empty children in this
>  	 * node is above threshold.
>  	 */
> -	while (should_halve(tp, tn) && *budget) {
> +	while (*budget && should_halve(tp, tn)) {
>  		tp = halve(t, tn, budget);
>  		if (!tp) {
>  #ifdef CONFIG_IP_FIB_TRIE_STATS

Based on my comments in the other patches I would say this is a bad
idea.

Really the budget should allow at least 1 pass through the trie to
attempt to either inflate or deflate the node and all children. This
logic is optimizing for the case where *budget is 0 and that should not
occur that often.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc()
  2019-04-01 17:50       ` Alexander Duyck
@ 2019-04-04 16:33         ` Dmitry Safonov
  0 siblings, 0 replies; 18+ messages in thread
From: Dmitry Safonov @ 2019-04-04 16:33 UTC (permalink / raw)
  To: Alexander Duyck, Dmitry Safonov, linux-kernel
  Cc: Alexey Kuznetsov, David Ahern, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev

On 4/1/19 6:50 PM, Alexander Duyck wrote:
> On Mon, 2019-04-01 at 16:55 +0100, Dmitry Safonov wrote:
> Actually now that I think about it TNODE_VMALLOC_MAX is likely much
> less than 31. The logic that we have to be concerned with is:
> 	size = TNODE_SIZE(1ul << bits);
> 
> If size is a 32b value, and the size of a pointer is 4 bytes, then our
> upper limit is roughly ilog2((4G - 28) / 4), which comes out to 29.
> What we are trying to avoid is overflowing the size variable, not
> actually limiting the vmalloc itself.

Oh yes, I see - managed to forget that size can also overflow inside
TNODE_SIZE().
>> So, I wanted to remove run-time check here on x86_64..
>> I could do it by adding !CONFIG_64BIT around the check.
> 
> I have no problem with that. All I am suggesting is that if at all
> possible we should use TNODE_VMALLOC_MAX instead of BITS_PER_LONG.

Yeah, will rework this part.

Thanks,
          Dmitry


* Re: [RFC 2/4] net/fib: Provide fib_balance_budget sysctl
  2019-04-01 18:09   ` Alexander Duyck
@ 2019-04-04 18:31     ` Dmitry Safonov
  0 siblings, 0 replies; 18+ messages in thread
From: Dmitry Safonov @ 2019-04-04 18:31 UTC (permalink / raw)
  To: Alexander Duyck, linux-kernel
  Cc: Alexey Kuznetsov, David Ahern, David S. Miller, Eric Dumazet,
	Hideaki YOSHIFUJI, Ido Schimmel, netdev, linux-doc,
	Jonathan Corbet

On 4/1/19 7:09 PM, Alexander Duyck wrote:
> On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote:
>> Unfortunately, MAX_WORK at this moment is broken: during
>> trie_rebalance() climbing upwards resize() is being called for each
>> tnode. Which makes the limit useless the bigger trie gets: at each level
>> there are 10 attempts to balance tnode and childs, resulting in
>> O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is
>> the level where we changed the trie originally (adding/removing
>> alias/route).
>>
>> Which results in the following price of removing one route under big
>> trie (similar for some other single routes that results in reallocating
>> tnodes by hitting the threshold limits for halve/inflate resize):
> 
> The info below doesn't really tell us the price. It might be useful if
> you could just time the removal of that one route and provide that
> information.

Yes, well, that was an attempt to point out that MAX_WORK in its current
state doesn't really limit the complexity of rebalancing the trie.

So, most of the time (up to a couple of seconds) is spent in
synchronise_rcu() (which is addressed by 4/4 and a patch from David
Ahern in -next).

But I thought fixing max_work is also worth attention, regardless of the
fix for synchronising RCU. Playing with the sysctl limit provided by
this patch, I've found that even with 4/4 applied one can tune the knob
so that the work gets divided between consecutive route manipulations.
On a 4-core switch the maximum time spent adding a route was more than a
second, and adjusting the work limit brings that down to 300 msec at
most, though the total time spent adding the whole route table stayed
the same.

I will provide more details for v2.

> 
>> Before:
>>    Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>>    Main:
>>            Aver depth:     1.99
>>            Max depth:      2
>>            Leaves:         77825
>>            Prefixes:       77825
>>            Internal nodes: 77
>>              9: 1  10: 76
>>            Pointers: 78336
>>    Null ptrs: 435
>>    Total size: 7912  kB
>>    Local:
>>    [omitted]
>>
>> After:
>>    Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>>    Main:
>>            Aver depth:     3.00
>>            Max depth:      3
>>            Leaves:         77824
>>            Prefixes:       77824
>>            Internal nodes: 20491
>>              1: 2048  2: 18432  6: 1  11: 10
>>            Pointers: 98368
>>    Null ptrs: 54
>>    Total size: 8865  kB
>>    Local:
>>    [omitted]
>>
> 
> Is this before/after the patch, or just before/after the removal of the
> one route? I assume you are are just talking about the one route since
> the number of prefixes has dropped by 1.

Yes, just for removing one route. Should have mentioned that explicitly,
sorry.

> 
>> Provide a sysctl to control amount of pending balancing work.
>> (by default unlimited as it was)
>>
>> Cc: linux-doc@vger.kernel.org
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down
>> into inflate/halve")
>> Signed-off-by: Dmitry Safonov <dima@arista.com>
>> ---
>>  Documentation/networking/ip-sysctl.txt |  6 +++
>>  include/net/ip.h                       |  1 +
>>  net/ipv4/fib_trie.c                    | 60 +++++++++++++++-----------
>>  net/ipv4/sysctl_net_ipv4.c             |  7 +++
>>  4 files changed, 49 insertions(+), 25 deletions(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index acdfb5d2bcaa..fb71dacff4dd 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER
>>  	0 - Layer 3
>>  	1 - Layer 4
>>  
>> +fib_balance_budget - UNSIGNED INTEGER
>> +	Limits the number of resize attempts during balancing fib trie
>> +	on adding/removing new routes.
>> +	Possible values:
>> +	Default: UINT_MAX (0xFFFFFFFF)
>> +
> 
> So I am not really a fan of the use of UNSIGNED_INTEGER here. I would
> much rather see this be a signed integer and the test be for the value
> going negative instead of having to test in so many places if the value
> is 0.

Ok, sounds good to me - will change the type for v2.

[..]
>> @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>>  			break;
>>  		}
>>  
>> -		max_work--;
>> +		(*budget)--;
> 
> One thing we might explore here would be to look at pulling out the
> check of budget at the start of the loop and instead combine the
> decrement with the check for < 0 here so we have something like:
> 		if (--*budget < 0)
> 			break;
> 
> That way what should happen is that we will achieve maximum depth in
> any resize and make at least one pass optimizing from the bottom up
> instead of the top down.

Yeah, sounds reasonable - will do it this way and retest.

> 
>> +		inflated = true;
>>  		tn = get_child(tp, cindex);
>>  	}
>>  
>>  	/* update parent in case inflate failed */
>>  	tp = node_parent(tn);
>>  
>> -	/* Return if at least one inflate is run */
>> -	if (max_work != MAX_WORK)
>> +	/* Return if at least one inflate is run:
>> +	 * microoptimization to not recalculate thresholds
>> +	 */
>> +	if (inflated)
>>  		return tp;
> 
> An alternative to having to add another variable would be to look at
> switching the loop to a combination of an if and a do/while like so:
> 	if (should_inflate(tp, tn)) {
> 		do {
> 			...
> 		} while (should_inflate(tp, tn));
> 	
> 		/* update parent in case inflate failed */
> 		return node_parent(tn);
> 	}

That's neat! Will drop the variable.

> 		
> 
>>  	/* Halve as long as the number of empty children in this
>>  	 * node is above threshold.
>>  	 */
>> -	while (should_halve(tp, tn) && max_work) {
>> -		tp = halve(t, tn);
>> +	while (should_halve(tp, tn) && *budget) {
>> +		tp = halve(t, tn, budget);
>>  		if (!tp) {
>>  #ifdef CONFIG_IP_FIB_TRIE_STATS
>>  			this_cpu_inc(stats->resize_node_skipped);
>> @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>>  			break;
>>  		}
>>  
>> -		max_work--;
>> +		(*budget)--;
> 
> Same here. It might make sense to pull the budget check out of the
> start of the loop and move it here so that it becomes a depth first
> optimization instead of being a shallow one.

Makes sense, will rework.

> 
>>  		tn = get_child(tp, cindex);
>>  	}
>>  
>> @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
>>  
>>  static void trie_rebalance(struct trie *t, struct key_vector *tn)
>>  {
>> -	while (!IS_TRIE(tn))
>> -		tn = resize(t, tn);
>> +	unsigned int budget = fib_balance_budget;
>> +
>> +	while (budget && !IS_TRIE(tn))
>> +		tn = resize(t, tn, &budget);
> 
> I would not have the budget check here. At a minimum we should be
> replacing trie nodes with leaves in the case that we are down to one
> leaf node in the trie.

Yeah, I see.

> 
>>  }
>>  
>>  static int fib_insert_node(struct trie *t, struct key_vector *tp,
>> @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb)
>>  void fib_table_flush_external(struct fib_table *tb)
>>  {
>>  	struct trie *t = (struct trie *)tb->tb_data;
>> +	unsigned int budget = fib_balance_budget;
>>  	struct key_vector *pn = t->kv;
>>  	unsigned long cindex = 1;
>>  	struct hlist_node *tmp;
>> @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb)
>>  				update_suffix(pn);
>>  
>>  			/* resize completed node */
>> -			pn = resize(t, pn);
>> +			pn = resize(t, pn, &budget);
>>  			cindex = get_index(pkey, pn);
>>  
>>  			continue;
> 
> So we would probably want a completely different budget here, or at
> least we would want the budget to be multiplied by the leaves we are
> removing. The problem is we aren't pulling single addresses, but are
> literally splitting the tree into two separate tries. The same budget
> isn't going to work for both cases.

Which will multiply the work (and the time spent).
Hmm, well, I guess we can multiply the budget here and, if needed, relax
the rtnl_mutex pressure by introducing a fib_lock. Does that sound good?

> 
>> @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb)
>>  int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>>  {
>>  	struct trie *t = (struct trie *)tb->tb_data;
>> +	unsigned int budget = fib_balance_budget;
>>  	struct key_vector *pn = t->kv;
>>  	unsigned long cindex = 1;
>>  	struct hlist_node *tmp;
>> @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>>  				update_suffix(pn);
>>  
>>  			/* resize completed node */
>> -			pn = resize(t, pn);
>> +			pn = resize(t, pn, &budget);
>>  			cindex = get_index(pkey, pn);
>>  
>>  			continue;
> 
> Similar issues here, though I don't know what the typical amount of
> change is we expect to see from a fib_table_flush call versus something
> like the fib_table_flush_external call.

Yes. Well, I was also curious whether it's possible to avoid the
resize() call when (flush_all == true), so that exiting a namespace
wouldn't result in rebalancing a possibly huge routing trie.

Thanks for your notes and review,
          Dmitry


Thread overview: 18+ messages
-- links below jump to the message on this page --
2019-03-26 15:30 [RFC 0/4] net/fib: Speed up trie rebalancing for full view Dmitry Safonov
2019-03-26 15:30 ` [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc() Dmitry Safonov
2019-04-01 15:40   ` Alexander Duyck
2019-04-01 15:55     ` Dmitry Safonov
2019-04-01 17:50       ` Alexander Duyck
2019-04-04 16:33         ` Dmitry Safonov
2019-03-26 15:30 ` [RFC 2/4] net/fib: Provide fib_balance_budget sysctl Dmitry Safonov
2019-04-01 18:09   ` Alexander Duyck
2019-04-04 18:31     ` Dmitry Safonov
2019-03-26 15:30 ` [RFC 3/4] net/fib: Check budget before should_{inflate,halve}() Dmitry Safonov
2019-04-01 18:20   ` Alexander Duyck
2019-03-26 15:30 ` [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb Dmitry Safonov
2019-03-26 15:39   ` David Ahern
2019-03-26 17:15     ` Dmitry Safonov
2019-03-26 17:57       ` David Ahern
2019-03-26 18:17         ` Dmitry Safonov
2019-03-26 23:14     ` Dmitry Safonov
2019-03-27  3:33       ` Paul E. McKenney
