All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next] bpf: Optimize lpm trie delete
@ 2017-09-20 16:22 Craig Gallek
  2017-09-20 16:51 ` Daniel Mack
  0 siblings, 1 reply; 5+ messages in thread
From: Craig Gallek @ 2017-09-20 16:22 UTC (permalink / raw)
  To: Daniel Mack, Alexei Starovoitov, Daniel Borkmann, David S . Miller; +Cc: netdev

From: Craig Gallek <kraig@google.com>

Before the delete operator was added, this datastructure maintained
an invariant that intermediate nodes were only present when necessary
to build the tree.  This patch updates the delete operation to reinstate
that invariant by removing unnecessary intermediate nodes after a node is
removed and thus keeping the tree structure at a minimal size.

Suggested-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: Craig Gallek <kraig@google.com>
---
 kernel/bpf/lpm_trie.c | 55 +++++++++++++++++++++++++++------------------------
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 9d58a576b2ae..b5a7d70ec8b5 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 	struct lpm_trie_node __rcu **trim;
 	struct lpm_trie_node *node;
 	unsigned long irq_flags;
-	unsigned int next_bit;
+	unsigned int next_bit = 0;
 	size_t matchlen = 0;
 	int ret = 0;
 
@@ -408,14 +408,12 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 
 	/* Walk the tree looking for an exact key/length match and keeping
 	 * track of where we could begin trimming the tree.  The trim-point
-	 * is the sub-tree along the walk consisting of only single-child
-	 * intermediate nodes and ending at a leaf node that we want to
-	 * remove.
+	 * is the location of the pointer where we will remove a node from the
+	 * tree.
 	 */
 	trim = &trie->root;
-	node = rcu_dereference_protected(
-		trie->root, lockdep_is_held(&trie->lock));
-	while (node) {
+	while ((node = rcu_dereference_protected(
+		       *trim, lockdep_is_held(&trie->lock)))) {
 		matchlen = longest_prefix_match(trie, node, key);
 
 		if (node->prefixlen != matchlen ||
@@ -423,15 +421,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 			break;
 
 		next_bit = extract_bit(key->data, node->prefixlen);
-		/* If we hit a node that has more than one child or is a valid
-		 * prefix itself, do not remove it. Reset the root of the trim
-		 * path to its descendant on our path.
-		 */
-		if (!(node->flags & LPM_TREE_NODE_FLAG_IM) ||
-		    (node->child[0] && node->child[1]))
-			trim = &node->child[next_bit];
-		node = rcu_dereference_protected(
-			node->child[next_bit], lockdep_is_held(&trie->lock));
+		trim = &node->child[next_bit];
 	}
 
 	if (!node || node->prefixlen != key->prefixlen ||
@@ -442,25 +432,38 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 
 	trie->n_entries--;
 
-	/* If the node we are removing is not a leaf node, simply mark it
+	/* If the node we are removing has two children, simply mark it
 	 * as intermediate and we are done.
 	 */
-	if (rcu_access_pointer(node->child[0]) ||
+	if (rcu_access_pointer(node->child[0]) &&
 	    rcu_access_pointer(node->child[1])) {
 		node->flags |= LPM_TREE_NODE_FLAG_IM;
 		goto out;
 	}
 
-	/* trim should now point to the slot holding the start of a path from
-	 * zero or more intermediate nodes to our leaf node for deletion.
-	 */
-	while ((node = rcu_dereference_protected(
-			*trim, lockdep_is_held(&trie->lock)))) {
+	/* If the node has no children, it can be completely removed */
+	if (!rcu_access_pointer(node->child[0]) &&
+	    !rcu_access_pointer(node->child[1])) {
 		RCU_INIT_POINTER(*trim, NULL);
-		trim = rcu_access_pointer(node->child[0]) ?
-			&node->child[0] :
-			&node->child[1];
 		kfree_rcu(node, rcu);
+		goto out;
+	}
+
+	/* If the node has one child, we may be able to collapse the tree
+	 * while removing this node if the node's child is in the same
+	 * 'next bit' slot as this node was in its parent or if the node
+	 * itself is the root.
+	 */
+	if (trim == &trie->root) {
+		next_bit = node->child[0] ? 0 : 1;
+		rcu_assign_pointer(trie->root, node->child[next_bit]);
+		kfree_rcu(node, rcu);
+	} else if (rcu_access_pointer(node->child[next_bit])) {
+		rcu_assign_pointer(*trim, node->child[next_bit]);
+		kfree_rcu(node, rcu);
+	} else {
+		/* If we can't collapse, just mark this node as intermediate */
+		node->flags |= LPM_TREE_NODE_FLAG_IM;
 	}
 
 out:
-- 
2.14.1.821.g8fa685d3b7-goog

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next] bpf: Optimize lpm trie delete
  2017-09-20 16:22 [PATCH net-next] bpf: Optimize lpm trie delete Craig Gallek
@ 2017-09-20 16:51 ` Daniel Mack
  2017-09-20 18:51   ` Craig Gallek
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Mack @ 2017-09-20 16:51 UTC (permalink / raw)
  To: Craig Gallek, Alexei Starovoitov, Daniel Borkmann, David S . Miller
  Cc: netdev

Hi Craig,

Thanks, this looks much cleaner already :)

On 09/20/2017 06:22 PM, Craig Gallek wrote:
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 9d58a576b2ae..b5a7d70ec8b5 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
>  	struct lpm_trie_node __rcu **trim;
>  	struct lpm_trie_node *node;
>  	unsigned long irq_flags;
> -	unsigned int next_bit;
> +	unsigned int next_bit = 0;

This default assignment seems wrong, and I guess you only added it to
squelch a compiler warning?

[...]

> +	/* If the node has one child, we may be able to collapse the tree
> +	 * while removing this node if the node's child is in the same
> +	 * 'next bit' slot as this node was in its parent or if the node
> +	 * itself is the root.
> +	 */
> +	if (trim == &trie->root) {
> +		next_bit = node->child[0] ? 0 : 1;
> +		rcu_assign_pointer(trie->root, node->child[next_bit]);
> +		kfree_rcu(node, rcu);

I don't think you should treat this 'root' case special.

Instead, move the 'next_bit' assignment outside of the condition ...

> +	} else if (rcu_access_pointer(node->child[next_bit])) {
> +		rcu_assign_pointer(*trim, node->child[next_bit]);
> +		kfree_rcu(node, rcu);

... and then this branch would handle the case just fine. Correct?

Otherwise, looks good to me!



Thanks,
Daniel

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next] bpf: Optimize lpm trie delete
  2017-09-20 16:51 ` Daniel Mack
@ 2017-09-20 18:51   ` Craig Gallek
  2017-09-20 22:56     ` Daniel Mack
  0 siblings, 1 reply; 5+ messages in thread
From: Craig Gallek @ 2017-09-20 18:51 UTC (permalink / raw)
  To: Daniel Mack; +Cc: Alexei Starovoitov, Daniel Borkmann, David S . Miller, netdev

On Wed, Sep 20, 2017 at 12:51 PM, Daniel Mack <daniel@zonque.org> wrote:
> Hi Craig,
>
> Thanks, this looks much cleaner already :)
>
> On 09/20/2017 06:22 PM, Craig Gallek wrote:
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index 9d58a576b2ae..b5a7d70ec8b5 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
>>       struct lpm_trie_node __rcu **trim;
>>       struct lpm_trie_node *node;
>>       unsigned long irq_flags;
>> -     unsigned int next_bit;
>> +     unsigned int next_bit = 0;
>
> This default assignment seems wrong, and I guess you only added it to
> squelch a compiler warning?
Yes, this variable is only initialized after the lookup iterations
below (meaning it will never be initialized the the root-removal
case).

> [...]
>
>> +     /* If the node has one child, we may be able to collapse the tree
>> +      * while removing this node if the node's child is in the same
>> +      * 'next bit' slot as this node was in its parent or if the node
>> +      * itself is the root.
>> +      */
>> +     if (trim == &trie->root) {
>> +             next_bit = node->child[0] ? 0 : 1;
>> +             rcu_assign_pointer(trie->root, node->child[next_bit]);
>> +             kfree_rcu(node, rcu);
>
> I don't think you should treat this 'root' case special.
>
> Instead, move the 'next_bit' assignment outside of the condition ...
I'm not quite sure I follow.  Are you saying do something like this:

if (trim == &trie->root) {
  next_bit = node->child[0] ? 0 : 1;
}
if (rcu_access_pointer(node->child[next_bit])) {
...

This would save a couple lines of code, but I think the as-is
implementation is slightly easier to understand.  I don't have a
strong opinion either way, though.

Thanks for the pointers,
Craig

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next] bpf: Optimize lpm trie delete
  2017-09-20 18:51   ` Craig Gallek
@ 2017-09-20 22:56     ` Daniel Mack
  2017-09-21 13:13       ` Craig Gallek
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Mack @ 2017-09-20 22:56 UTC (permalink / raw)
  To: Craig Gallek
  Cc: Alexei Starovoitov, Daniel Borkmann, David S . Miller, netdev

On 09/20/2017 08:51 PM, Craig Gallek wrote:
> On Wed, Sep 20, 2017 at 12:51 PM, Daniel Mack <daniel@zonque.org> wrote:
>> Hi Craig,
>>
>> Thanks, this looks much cleaner already :)
>>
>> On 09/20/2017 06:22 PM, Craig Gallek wrote:
>>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>>> index 9d58a576b2ae..b5a7d70ec8b5 100644
>>> --- a/kernel/bpf/lpm_trie.c
>>> +++ b/kernel/bpf/lpm_trie.c
>>> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
>>>       struct lpm_trie_node __rcu **trim;
>>>       struct lpm_trie_node *node;
>>>       unsigned long irq_flags;
>>> -     unsigned int next_bit;
>>> +     unsigned int next_bit = 0;
>>
>> This default assignment seems wrong, and I guess you only added it to
>> squelch a compiler warning?
> Yes, this variable is only initialized after the lookup iterations
> below (meaning it will never be initialized the the root-removal
> case).

Right, and once set, it's only updated in case we don't have an exact
match and try to drill down further.

>> [...]
>>
>>> +     /* If the node has one child, we may be able to collapse the tree
>>> +      * while removing this node if the node's child is in the same
>>> +      * 'next bit' slot as this node was in its parent or if the node
>>> +      * itself is the root.
>>> +      */
>>> +     if (trim == &trie->root) {
>>> +             next_bit = node->child[0] ? 0 : 1;
>>> +             rcu_assign_pointer(trie->root, node->child[next_bit]);
>>> +             kfree_rcu(node, rcu);
>>
>> I don't think you should treat this 'root' case special.
>>
>> Instead, move the 'next_bit' assignment outside of the condition ...
> I'm not quite sure I follow.  Are you saying do something like this:
> 
> if (trim == &trie->root) {
>   next_bit = node->child[0] ? 0 : 1;
> }
> if (rcu_access_pointer(node->child[next_bit])) {
> ...
> 
> This would save a couple lines of code, but I think the as-is
> implementation is slightly easier to understand.  I don't have a
> strong opinion either way, though.

Me neither :)

My idea was to set

  next_bit = node->child[0] ? 0 : 1;

unconditionally, because it should result in the same in both cases.

It might be a bit of bike shedding, but I dislike this default
assignment, and I believe that not relying on next_bit to be set as a
side effect of the lookup loop makes the code a bit more readable.

WDYT?


Thanks,
Daniel

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next] bpf: Optimize lpm trie delete
  2017-09-20 22:56     ` Daniel Mack
@ 2017-09-21 13:13       ` Craig Gallek
  0 siblings, 0 replies; 5+ messages in thread
From: Craig Gallek @ 2017-09-21 13:13 UTC (permalink / raw)
  To: Daniel Mack; +Cc: Alexei Starovoitov, Daniel Borkmann, David S . Miller, netdev

On Wed, Sep 20, 2017 at 6:56 PM, Daniel Mack <daniel@zonque.org> wrote:
> On 09/20/2017 08:51 PM, Craig Gallek wrote:
>> On Wed, Sep 20, 2017 at 12:51 PM, Daniel Mack <daniel@zonque.org> wrote:
>>> Hi Craig,
>>>
>>> Thanks, this looks much cleaner already :)
>>>
>>> On 09/20/2017 06:22 PM, Craig Gallek wrote:
>>>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>>>> index 9d58a576b2ae..b5a7d70ec8b5 100644
>>>> --- a/kernel/bpf/lpm_trie.c
>>>> +++ b/kernel/bpf/lpm_trie.c
>>>> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
>>>>       struct lpm_trie_node __rcu **trim;
>>>>       struct lpm_trie_node *node;
>>>>       unsigned long irq_flags;
>>>> -     unsigned int next_bit;
>>>> +     unsigned int next_bit = 0;
>>>
>>> This default assignment seems wrong, and I guess you only added it to
>>> squelch a compiler warning?
>> Yes, this variable is only initialized after the lookup iterations
>> below (meaning it will never be initialized the the root-removal
>> case).
>
> Right, and once set, it's only updated in case we don't have an exact
> match and try to drill down further.
>
>>> [...]
>>>
>>>> +     /* If the node has one child, we may be able to collapse the tree
>>>> +      * while removing this node if the node's child is in the same
>>>> +      * 'next bit' slot as this node was in its parent or if the node
>>>> +      * itself is the root.
>>>> +      */
>>>> +     if (trim == &trie->root) {
>>>> +             next_bit = node->child[0] ? 0 : 1;
>>>> +             rcu_assign_pointer(trie->root, node->child[next_bit]);
>>>> +             kfree_rcu(node, rcu);
>>>
>>> I don't think you should treat this 'root' case special.
>>>
>>> Instead, move the 'next_bit' assignment outside of the condition ...
>> I'm not quite sure I follow.  Are you saying do something like this:
>>
>> if (trim == &trie->root) {
>>   next_bit = node->child[0] ? 0 : 1;
>> }
>> if (rcu_access_pointer(node->child[next_bit])) {
>> ...
>>
>> This would save a couple lines of code, but I think the as-is
>> implementation is slightly easier to understand.  I don't have a
>> strong opinion either way, though.
>
> Me neither :)
>
> My idea was to set
>
>   next_bit = node->child[0] ? 0 : 1;
>
> unconditionally, because it should result in the same in both cases.
>
> It might be a bit of bike shedding, but I dislike this default
> assignment, and I believe that not relying on next_bit to be set as a
> side effect of the lookup loop makes the code a bit more readable.
>
> WDYT?
That sounds reasonable.  I'll spin a v2 today if no one else has any comments.

Thanks again,
Craig

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-09-21 13:13 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-20 16:22 [PATCH net-next] bpf: Optimize lpm trie delete Craig Gallek
2017-09-20 16:51 ` Daniel Mack
2017-09-20 18:51   ` Craig Gallek
2017-09-20 22:56     ` Daniel Mack
2017-09-21 13:13       ` Craig Gallek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.