All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
@ 2022-05-12 18:34 Stefano Brivio
  2022-05-16 18:16 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2022-05-12 18:34 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

In the overlap detection performed as part of the insertion operation,
we have to skip nodes that are not active in the current generation or
expired. This was done by adding several conditions in overlap decision
clauses, which made it hard to check for correctness, and debug the
actual issue this patch is fixing.

Consolidate these conditions into a single clause, so that we can
proceed with debugging and fixing the following issue.

Case b3. described in comments covers the insertion of a start
element after an existing end, as a leaf. If all the descendants of
a given element are inactive, functionally, for the purposes of
overlap detection, we still have to treat this as a leaf, but we don't.

If, in the same transaction, we remove a start element to the right,
remove an end element to the left, and add a start element to the right
of an existing, active end element, we don't hit case b3. For example:

- existing intervals: 1-2, 3-5, 6-7
- transaction: delete 3-5, insert 4-5

node traversal might happen as follows:
- 2 (active end)
- 5 (inactive end)
- 3 (inactive start)

when we add 4 as start element, we should simply consider 2 as
preceding end, and consider it as a leaf, because interval 3-5 has been
deleted. If we don't, we'll report an overlap because we forgot about
this.

Add an additional flag which is set as we find an active end, and reset
it if we find an active start (to the left). If we finish the traversal
with this flag set, it means we need to functionally consider the
previous active end as a leaf, and allow insertion instead of reporting
overlap.

Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 net/netfilter/nft_set_rbtree.c | 92 ++++++++++++++++++++--------------
 1 file changed, 54 insertions(+), 38 deletions(-)

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 7325bee7d144..dc2184bbe722 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -222,6 +222,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 	bool overlap = false, dup_end_left = false, dup_end_right = false;
 	struct nft_rbtree *priv = nft_set_priv(set);
 	u8 genmask = nft_genmask_next(net);
+	bool last_left_node_is_end = false;
 	struct nft_rbtree_elem *rbe;
 	struct rb_node *parent, **p;
 	int d;
@@ -287,80 +288,95 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 		d = memcmp(nft_set_ext_key(&rbe->ext),
 			   nft_set_ext_key(&new->ext),
 			   set->klen);
-		if (d < 0) {
+
+		if (d < 0)
 			p = &parent->rb_left;
+		else if (d > 0)
+			p = &parent->rb_right;
+		else if (nft_rbtree_interval_end(rbe))
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
 
+		/* There might be inactive elements in the tree: ignore them by
+		 * traversing them without affecting flags.
+		 *
+		 * We need to reset the dup_end_left and dup_end_right flags,
+		 * though, because those only apply to adjacent nodes.
+		 */
+		if (!nft_set_elem_active(&rbe->ext, genmask) ||
+		    nft_set_elem_expired(&rbe->ext)) {
+			dup_end_left = dup_end_right = false;
+			continue;
+		}
+
+		if (d < 0) {
 			if (nft_rbtree_interval_start(new)) {
-				if (nft_rbtree_interval_end(rbe) &&
-				    nft_set_elem_active(&rbe->ext, genmask) &&
-				    !nft_set_elem_expired(&rbe->ext) && !*p)
-					overlap = false;
+				/* See case b3. described above.
+				 *
+				 * If this is not a leaf, but all nodes below
+				 * this one are inactive, except for a leaf, we
+				 * still have to consider it a leaf for the
+				 * purposes of overlap detection.
+				 *
+				 * Set last_left_node_is_end if this is not a
+				 * leaf and an active end element, and reset it
+				 * if we find an active start element to the
+				 * left.
+				 *
+				 * If we end the traversal with this flag set,
+				 * this node is a leaf for the purposes of case
+				 * b3., and no overlap will be reported.
+				 */
+				if (nft_rbtree_interval_end(rbe)) {
+					if (!*p)
+						overlap = false;
+					else
+						last_left_node_is_end = true;
+				} else {
+					last_left_node_is_end = false;
+				}
 			} else {
+
 				if (dup_end_left && !*p)
 					return -ENOTEMPTY;
 
-				overlap = nft_rbtree_interval_end(rbe) &&
-					  nft_set_elem_active(&rbe->ext,
-							      genmask) &&
-					  !nft_set_elem_expired(&rbe->ext);
-
+				overlap = nft_rbtree_interval_end(rbe);
 				if (overlap) {
 					dup_end_right = true;
 					continue;
 				}
 			}
 		} else if (d > 0) {
-			p = &parent->rb_right;
-
 			if (nft_rbtree_interval_end(new)) {
 				if (dup_end_right && !*p)
 					return -ENOTEMPTY;
 
-				overlap = nft_rbtree_interval_end(rbe) &&
-					  nft_set_elem_active(&rbe->ext,
-							      genmask) &&
-					  !nft_set_elem_expired(&rbe->ext);
-
+				overlap = nft_rbtree_interval_end(rbe);
 				if (overlap) {
 					dup_end_left = true;
 					continue;
 				}
-			} else if (nft_set_elem_active(&rbe->ext, genmask) &&
-				   !nft_set_elem_expired(&rbe->ext)) {
+			} else {
 				overlap = nft_rbtree_interval_end(rbe);
 			}
 		} else {
 			if (nft_rbtree_interval_end(rbe) &&
 			    nft_rbtree_interval_start(new)) {
-				p = &parent->rb_left;
-
-				if (nft_set_elem_active(&rbe->ext, genmask) &&
-				    !nft_set_elem_expired(&rbe->ext))
-					overlap = false;
+				overlap = false;
 			} else if (nft_rbtree_interval_start(rbe) &&
 				   nft_rbtree_interval_end(new)) {
-				p = &parent->rb_right;
-
-				if (nft_set_elem_active(&rbe->ext, genmask) &&
-				    !nft_set_elem_expired(&rbe->ext))
-					overlap = false;
-			} else if (nft_set_elem_active(&rbe->ext, genmask) &&
-				   !nft_set_elem_expired(&rbe->ext)) {
+				overlap = false;
+			} else {
 				*ext = &rbe->ext;
 				return -EEXIST;
-			} else {
-				overlap = false;
-				if (nft_rbtree_interval_end(rbe))
-					p = &parent->rb_left;
-				else
-					p = &parent->rb_right;
 			}
 		}
 
 		dup_end_left = dup_end_right = false;
 	}
 
-	if (overlap)
+	if (overlap && !last_left_node_is_end)
 		return -ENOTEMPTY;
 
 	rb_link_node_rcu(&new->node, parent, p);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
  2022-05-12 18:34 [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf Stefano Brivio
@ 2022-05-16 18:16 ` Pablo Neira Ayuso
  2022-05-17 12:57   ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-16 18:16 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 7301 bytes --]

Hi Stefano,

On Thu, May 12, 2022 at 08:34:21PM +0200, Stefano Brivio wrote:
> In the overlap detection performed as part of the insertion operation,
> we have to skip nodes that are not active in the current generation or
> expired. This was done by adding several conditions in overlap decision
> clauses, which made it hard to check for correctness, and debug the
> actual issue this patch is fixing.
> 
> Consolidate these conditions into a single clause, so that we can
> proceed with debugging and fixing the following issue.
> 
> Case b3. described in comments covers the insertion of a start
> element after an existing end, as a leaf. If all the descendants of
> a given element are inactive, functionally, for the purposes of
> overlap detection, we still have to treat this as a leaf, but we don't.
> 
> If, in the same transaction, we remove a start element to the right,
> remove an end element to the left, and add a start element to the right
> of an existing, active end element, we don't hit case b3. For example:
> 
> - existing intervals: 1-2, 3-5, 6-7
> - transaction: delete 3-5, insert 4-5
> 
> node traversal might happen as follows:
> - 2 (active end)
> - 5 (inactive end)
> - 3 (inactive start)
> 
> when we add 4 as start element, we should simply consider 2 as
> preceding end, and consider it as a leaf, because interval 3-5 has been
> deleted. If we don't, we'll report an overlap because we forgot about
> this.
> 
> Add an additional flag which is set as we find an active end, and reset
> it if we find an active start (to the left). If we finish the traversal
> with this flag set, it means we need to functionally consider the
> previous active end as a leaf, and allow insertion instead of reporting
> overlap.

I can still trigger EEXIST with deletion of existing interval. It
became harder to reproduce after this patch.

After hitting EEXIST, if I do:

        echo "flush ruleset" > test.nft
        nft list ruleset >> test.nft

to dump the existing ruleset, then I run the delete element command
again to remove the interval and it works. Before this patch I could
reproduce it by reloading an existing ruleset dump.

I'm running the script that I'm attaching manually, just one manual
invocation after another.

I ocassionally hit ENOBUFS, but that sounds like a different issue
that I'm looking into.

Thanks.

> Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
> Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  net/netfilter/nft_set_rbtree.c | 92 ++++++++++++++++++++--------------
>  1 file changed, 54 insertions(+), 38 deletions(-)
> 
> diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
> index 7325bee7d144..dc2184bbe722 100644
> --- a/net/netfilter/nft_set_rbtree.c
> +++ b/net/netfilter/nft_set_rbtree.c
> @@ -222,6 +222,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
>  	bool overlap = false, dup_end_left = false, dup_end_right = false;
>  	struct nft_rbtree *priv = nft_set_priv(set);
>  	u8 genmask = nft_genmask_next(net);
> +	bool last_left_node_is_end = false;
>  	struct nft_rbtree_elem *rbe;
>  	struct rb_node *parent, **p;
>  	int d;
> @@ -287,80 +288,95 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
>  		d = memcmp(nft_set_ext_key(&rbe->ext),
>  			   nft_set_ext_key(&new->ext),
>  			   set->klen);
> -		if (d < 0) {
> +
> +		if (d < 0)
>  			p = &parent->rb_left;
> +		else if (d > 0)
> +			p = &parent->rb_right;
> +		else if (nft_rbtree_interval_end(rbe))
> +			p = &parent->rb_left;
> +		else
> +			p = &parent->rb_right;
>  
> +		/* There might be inactive elements in the tree: ignore them by
> +		 * traversing them without affecting flags.
> +		 *
> +		 * We need to reset the dup_end_left and dup_end_right flags,
> +		 * though, because those only apply to adjacent nodes.
> +		 */
> +		if (!nft_set_elem_active(&rbe->ext, genmask) ||
> +		    nft_set_elem_expired(&rbe->ext)) {
> +			dup_end_left = dup_end_right = false;
> +			continue;
> +		}
> +
> +		if (d < 0) {
>  			if (nft_rbtree_interval_start(new)) {
> -				if (nft_rbtree_interval_end(rbe) &&
> -				    nft_set_elem_active(&rbe->ext, genmask) &&
> -				    !nft_set_elem_expired(&rbe->ext) && !*p)
> -					overlap = false;
> +				/* See case b3. described above.
> +				 *
> +				 * If this is not a leaf, but all nodes below
> +				 * this one are inactive, except for a leaf, we
> +				 * still have to consider it a leaf for the
> +				 * purposes of overlap detection.
> +				 *
> +				 * Set last_left_node_is_end if this is not a
> +				 * leaf and an active end element, and reset it
> +				 * if we find an active start element to the
> +				 * left.
> +				 *
> +				 * If we end the traversal with this flag set,
> +				 * this node is a leaf for the purposes of case
> +				 * b3., and no overlap will be reported.
> +				 */
> +				if (nft_rbtree_interval_end(rbe)) {
> +					if (!*p)
> +						overlap = false;
> +					else
> +						last_left_node_is_end = true;
> +				} else {
> +					last_left_node_is_end = false;
> +				}
>  			} else {
> +
>  				if (dup_end_left && !*p)
>  					return -ENOTEMPTY;
>  
> -				overlap = nft_rbtree_interval_end(rbe) &&
> -					  nft_set_elem_active(&rbe->ext,
> -							      genmask) &&
> -					  !nft_set_elem_expired(&rbe->ext);
> -
> +				overlap = nft_rbtree_interval_end(rbe);
>  				if (overlap) {
>  					dup_end_right = true;
>  					continue;
>  				}
>  			}
>  		} else if (d > 0) {
> -			p = &parent->rb_right;
> -
>  			if (nft_rbtree_interval_end(new)) {
>  				if (dup_end_right && !*p)
>  					return -ENOTEMPTY;
>  
> -				overlap = nft_rbtree_interval_end(rbe) &&
> -					  nft_set_elem_active(&rbe->ext,
> -							      genmask) &&
> -					  !nft_set_elem_expired(&rbe->ext);
> -
> +				overlap = nft_rbtree_interval_end(rbe);
>  				if (overlap) {
>  					dup_end_left = true;
>  					continue;
>  				}
> -			} else if (nft_set_elem_active(&rbe->ext, genmask) &&
> -				   !nft_set_elem_expired(&rbe->ext)) {
> +			} else {
>  				overlap = nft_rbtree_interval_end(rbe);
>  			}
>  		} else {
>  			if (nft_rbtree_interval_end(rbe) &&
>  			    nft_rbtree_interval_start(new)) {
> -				p = &parent->rb_left;
> -
> -				if (nft_set_elem_active(&rbe->ext, genmask) &&
> -				    !nft_set_elem_expired(&rbe->ext))
> -					overlap = false;
> +				overlap = false;
>  			} else if (nft_rbtree_interval_start(rbe) &&
>  				   nft_rbtree_interval_end(new)) {
> -				p = &parent->rb_right;
> -
> -				if (nft_set_elem_active(&rbe->ext, genmask) &&
> -				    !nft_set_elem_expired(&rbe->ext))
> -					overlap = false;
> -			} else if (nft_set_elem_active(&rbe->ext, genmask) &&
> -				   !nft_set_elem_expired(&rbe->ext)) {
> +				overlap = false;
> +			} else {
>  				*ext = &rbe->ext;
>  				return -EEXIST;
> -			} else {
> -				overlap = false;
> -				if (nft_rbtree_interval_end(rbe))
> -					p = &parent->rb_left;
> -				else
> -					p = &parent->rb_right;
>  			}
>  		}
>  
>  		dup_end_left = dup_end_right = false;
>  	}
>  
> -	if (overlap)
> +	if (overlap && !last_left_node_is_end)
>  		return -ENOTEMPTY;
>  
>  	rb_link_node_rcu(&new->node, parent, p);
> -- 
> 2.35.1
> 

[-- Attachment #2: rbtree-bug.sh --]
[-- Type: application/x-sh, Size: 1241 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
  2022-05-16 18:16 ` Pablo Neira Ayuso
@ 2022-05-17 12:57   ` Stefano Brivio
  2022-05-20 15:45     ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2022-05-17 12:57 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Mon, 16 May 2022 20:16:53 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> Hi Stefano,
> 
> On Thu, May 12, 2022 at 08:34:21PM +0200, Stefano Brivio wrote:
> > In the overlap detection performed as part of the insertion operation,
> > we have to skip nodes that are not active in the current generation or
> > expired. This was done by adding several conditions in overlap decision
> > clauses, which made it hard to check for correctness, and debug the
> > actual issue this patch is fixing.
> > 
> > Consolidate these conditions into a single clause, so that we can
> > proceed with debugging and fixing the following issue.
> > 
> > Case b3. described in comments covers the insertion of a start
> > element after an existing end, as a leaf. If all the descendants of
> > a given element are inactive, functionally, for the purposes of
> > overlap detection, we still have to treat this as a leaf, but we don't.
> > 
> > If, in the same transaction, we remove a start element to the right,
> > remove an end element to the left, and add a start element to the right
> > of an existing, active end element, we don't hit case b3. For example:
> > 
> > - existing intervals: 1-2, 3-5, 6-7
> > - transaction: delete 3-5, insert 4-5
> > 
> > node traversal might happen as follows:
> > - 2 (active end)
> > - 5 (inactive end)
> > - 3 (inactive start)
> > 
> > when we add 4 as start element, we should simply consider 2 as
> > preceding end, and consider it as a leaf, because interval 3-5 has been
> > deleted. If we don't, we'll report an overlap because we forgot about
> > this.
> > 
> > Add an additional flag which is set as we find an active end, and reset
> > it if we find an active start (to the left). If we finish the traversal
> > with this flag set, it means we need to functionally consider the
> > previous active end as a leaf, and allow insertion instead of reporting
> > overlap.  
> 
> I can still trigger EEXIST with deletion of existing interval. It
> became harder to reproduce after this patch.
> 
> After hitting EEXIST, if I do:
> 
>         echo "flush ruleset" > test.nft
>         nft list ruleset >> test.nft
> 
> to dump the existing ruleset, then I run the delete element command
> again to remove the interval and it works. Before this patch I could
> reproduce it by reloading an existing ruleset dump.
> 
> I'm running the script that I'm attaching manually, just one manual
> invocation after another.

Ouch, sorry for that.

It looks like there's another case where inactive elements still affect
overlap detection in an unexpected way... at least with the structure
of this patch it should be easier to find, I'm looking into that now.

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
  2022-05-17 12:57   ` Stefano Brivio
@ 2022-05-20 15:45     ` Stefano Brivio
  2022-05-23 14:59       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2022-05-20 15:45 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Tue, 17 May 2022 14:57:09 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Mon, 16 May 2022 20:16:53 +0200
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > Hi Stefano,
> > 
> > On Thu, May 12, 2022 at 08:34:21PM +0200, Stefano Brivio wrote:  
> > > In the overlap detection performed as part of the insertion operation,
> > > we have to skip nodes that are not active in the current generation or
> > > expired. This was done by adding several conditions in overlap decision
> > > clauses, which made it hard to check for correctness, and debug the
> > > actual issue this patch is fixing.
> > > 
> > > Consolidate these conditions into a single clause, so that we can
> > > proceed with debugging and fixing the following issue.
> > > 
> > > Case b3. described in comments covers the insertion of a start
> > > element after an existing end, as a leaf. If all the descendants of
> > > a given element are inactive, functionally, for the purposes of
> > > overlap detection, we still have to treat this as a leaf, but we don't.
> > > 
> > > If, in the same transaction, we remove a start element to the right,
> > > remove an end element to the left, and add a start element to the right
> > > of an existing, active end element, we don't hit case b3. For example:
> > > 
> > > - existing intervals: 1-2, 3-5, 6-7
> > > - transaction: delete 3-5, insert 4-5
> > > 
> > > node traversal might happen as follows:
> > > - 2 (active end)
> > > - 5 (inactive end)
> > > - 3 (inactive start)
> > > 
> > > when we add 4 as start element, we should simply consider 2 as
> > > preceding end, and consider it as a leaf, because interval 3-5 has been
> > > deleted. If we don't, we'll report an overlap because we forgot about
> > > this.
> > > 
> > > Add an additional flag which is set as we find an active end, and reset
> > > it if we find an active start (to the left). If we finish the traversal
> > > with this flag set, it means we need to functionally consider the
> > > previous active end as a leaf, and allow insertion instead of reporting
> > > overlap.    
> > 
> > I can still trigger EEXIST with deletion of existing interval. It
> > became harder to reproduce after this patch.
> > 
> > After hitting EEXIST, if I do:
> > 
> >         echo "flush ruleset" > test.nft
> >         nft list ruleset >> test.nft
> > 
> > to dump the existing ruleset, then I run the delete element command
> > again to remove the interval and it works. Before this patch I could
> > reproduce it by reloading an existing ruleset dump.
> > 
> > I'm running the script that I'm attaching manually, just one manual
> > invocation after another.  
> 
> Ouch, sorry for that.
> 
> It looks like there's another case where inactive elements still affect
> overlap detection in an unexpected way... at least with the structure
> of this patch it should be easier to find, I'm looking into that now.

...what a mess. I could figure that part out (it was a case symmetric
to what this patch fixed, in this case resolving to case b5.) but
there's then another case (found by triggering a specific tree topology
with 0044interval_overlap_1) where we first add a start element, then
fail to add the end element because the start element is completely
"hidden" in the tree by inactive nodes.

I tried to solve that with some backtracking, but that looks also
fragile. If I clean up the tree before insertion, instead, that will
only deal with expired nodes, not inactive nodes -- I can't erase
non-expired, inactive nodes because the API expects to find them at
some later point and call nft_rbtree_remove() on them.

Now I'm trying another approach that looks more robust: instead of
descending the tree to find overlaps, just going through it in the same
way nft_rbtree_gc() does (linearly, node by node), marking the
value-wise closest points from left and right _valid_ nodes, and
applying the same reasoning. I need a bit more time for this, but it
looks way simpler. Insertion itself would keep working as it does now.

In hindsight, it looks like it was a terrible idea to try to fix this
implementation. I really underestimated how bad this is. Functionally
speaking, it's not a red-black tree because:

- we can't use it as a sorted binary tree, given that some elements
  "don't matter" for some operations, or have another colour. We might
  try to think of it as some other structure and rebuild from there
  useful properties of sorted binary trees, but I'm not sure a
  "red-brown-black" tree would have any other use making it worth of
  any further research

- end elements being represented as their value plus one also break
  assumptions of sorted trees (this is the issue I'm actually facing
  with backtracking)

- left subtrees store keys greater than right subtrees, but this
  looks consistent so it's just added fun and could be fixed
  trivially (it's all reversed)

By the way, I think we _should_ have similar issues in the lookup
function. Given that it's possible to build a tree with a subtree of at
least three levels with entirely non-valid nodes, I guess we can hide a
valid interval from the lookup too. It's just harder to test.

In the perspective of getting rid of it, I think we need:

1. some "introductory" documentation for nft_set_pipapo -- I just
   got back to it (drawing some diagrams first...)

2. to understand if the performance gap in the few (maybe not
   reasonable) cases where nft_set_rbtree matches faster than
   nft_set_pipapo is acceptable. Summary:
     https://lore.kernel.org/netfilter-devel/be7d4e51603633e7b88e4b0dde54b25a3e5a018b.1583598508.git.sbrivio@redhat.com/

3. a solution for https://bugzilla.netfilter.org/show_bug.cgi?id=1583,
   it's an atomicity issue which has little to do with nft_set_pipapo
   structures themselves, but I couldn't figure out the exact issue
   yet. I'm struggling to find the time for it, so if somebody wants to
   give it a try, I'd be more than happy to reassign it...

Anyway, I'll post a different patch for nft_set_rbtree soon.

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
  2022-05-20 15:45     ` Stefano Brivio
@ 2022-05-23 14:59       ` Pablo Neira Ayuso
  2022-05-25 12:15         ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Pablo Neira Ayuso @ 2022-05-23 14:59 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: netfilter-devel

On Fri, May 20, 2022 at 05:45:24PM +0200, Stefano Brivio wrote:
> On Tue, 17 May 2022 14:57:09 +0200
> Stefano Brivio <sbrivio@redhat.com> wrote:
> 
> > On Mon, 16 May 2022 20:16:53 +0200
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > 
> > > Hi Stefano,
> > > 
> > > On Thu, May 12, 2022 at 08:34:21PM +0200, Stefano Brivio wrote:  
> > > > In the overlap detection performed as part of the insertion operation,
> > > > we have to skip nodes that are not active in the current generation or
> > > > expired. This was done by adding several conditions in overlap decision
> > > > clauses, which made it hard to check for correctness, and debug the
> > > > actual issue this patch is fixing.
> > > > 
> > > > Consolidate these conditions into a single clause, so that we can
> > > > proceed with debugging and fixing the following issue.
> > > > 
> > > > Case b3. described in comments covers the insertion of a start
> > > > element after an existing end, as a leaf. If all the descendants of
> > > > a given element are inactive, functionally, for the purposes of
> > > > overlap detection, we still have to treat this as a leaf, but we don't.
> > > > 
> > > > If, in the same transaction, we remove a start element to the right,
> > > > remove an end element to the left, and add a start element to the right
> > > > of an existing, active end element, we don't hit case b3. For example:
> > > > 
> > > > - existing intervals: 1-2, 3-5, 6-7
> > > > - transaction: delete 3-5, insert 4-5
> > > > 
> > > > node traversal might happen as follows:
> > > > - 2 (active end)
> > > > - 5 (inactive end)
> > > > - 3 (inactive start)
> > > > 
> > > > when we add 4 as start element, we should simply consider 2 as
> > > > preceding end, and consider it as a leaf, because interval 3-5 has been
> > > > deleted. If we don't, we'll report an overlap because we forgot about
> > > > this.
> > > > 
> > > > Add an additional flag which is set as we find an active end, and reset
> > > > it if we find an active start (to the left). If we finish the traversal
> > > > with this flag set, it means we need to functionally consider the
> > > > previous active end as a leaf, and allow insertion instead of reporting
> > > > overlap.    
> > > 
> > > I can still trigger EEXIST with deletion of existing interval. It
> > > became harder to reproduce after this patch.
> > > 
> > > After hitting EEXIST, if I do:
> > > 
> > >         echo "flush ruleset" > test.nft
> > >         nft list ruleset >> test.nft
> > > 
> > > to dump the existing ruleset, then I run the delete element command
> > > again to remove the interval and it works. Before this patch I could
> > > reproduce it by reloading an existing ruleset dump.
> > > 
> > > I'm running the script that I'm attaching manually, just one manual
> > > invocation after another.  
> > 
> > Ouch, sorry for that.
> > 
> > It looks like there's another case where inactive elements still affect
> > overlap detection in an unexpected way... at least with the structure
> > of this patch it should be easier to find, I'm looking into that now.
> 
> ...what a mess. I could figure that part out (it was a case symmetric
> to what this patch fixed, in this case resolving to case b5.) but
> there's then another case (found by triggering a specific tree topology
> with 0044interval_overlap_1) where we first add a start element, then
> fail to add the end element because the start element is completely
> "hidden" in the tree by inactive nodes.
> 
> I tried to solve that with some backtracking, but that looks also
> fragile. If I clean up the tree before insertion, instead, that will
> only deal with expired nodes, not inactive nodes -- I can't erase
> non-expired, inactive nodes because the API expects to find them at
> some later point and call nft_rbtree_remove() on them.
> 
> Now I'm trying another approach that looks more robust: instead of
> descending the tree to find overlaps, just going through it in the same
> way nft_rbtree_gc() does (linearly, node by node), marking the
> value-wise closest points from left and right _valid_ nodes, and
> applying the same reasoning. I need a bit more time for this, but it
> looks way simpler. Insertion itself would keep working as it does now.
> 
> In hindsight, it looks like it was a terrible idea to try to fix this
> implementation. I really underestimated how bad this is. Functionally
> speaking, it's not a red-black tree because:
> 
> - we can't use it as a sorted binary tree, given that some elements
>   "don't matter" for some operations, or have another colour. We might
>   try to think of it as some other structure and rebuild from there
>   useful properties of sorted binary trees, but I'm not sure a
>   "red-brown-black" tree would have any other use making it worth of
>   any further research
> 
> - end elements being represented as their value plus one also break
>   assumptions of sorted trees (this is the issue I'm actually facing
>   with backtracking)
> 
> - left subtrees store keys greater than right subtrees, but this
>   looks consistent so it's just added fun and could be fixed
>   trivially (it's all reversed)
> 
> By the way, I think we _should_ have similar issues in the lookup
> function. Given that it's possible to build a tree with a subtree of at
> least three levels with entirely non-valid nodes, I guess we can hide a
> valid interval from the lookup too. It's just harder to test.

Thanks for the detailed report.

Another possibility? Maintain two trees, one for the existing
generation (this is read-only) and another for the next generation
(insertion/removals are performed on it), then swap them when commit
happens? pipapo has similar requirements, currently it is relying on a
workqueue to make some postprocess after the commit phase. At the
expense of consuming more memory.

> In the perspective of getting rid of it, I think we need:
> 
> 1. some "introductory" documentation for nft_set_pipapo -- I just
>    got back to it (drawing some diagrams first...)
> 
> 2. to understand if the performance gap in the few (maybe not
>    reasonable) cases where nft_set_rbtree matches faster than
>    nft_set_pipapo is acceptable. Summary:
>      https://lore.kernel.org/netfilter-devel/be7d4e51603633e7b88e4b0dde54b25a3e5a018b.1583598508.git.sbrivio@redhat.com/

IIRC pipapo was not too far behind from rbtree for a few scenarios.

> 3. a solution for https://bugzilla.netfilter.org/show_bug.cgi?id=1583,
>    it's an atomicity issue which has little to do with nft_set_pipapo
>    structures themselves, but I couldn't figure out the exact issue
>    yet. I'm struggling to find the time for it, so if somebody wants to
>    give it a try, I'd be more than happy to reassign it...

OK, a different problem, related to pipapo.

> Anyway, I'll post a different patch for nft_set_rbtree soon.

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
  2022-05-23 14:59       ` Pablo Neira Ayuso
@ 2022-05-25 12:15         ` Stefano Brivio
  2022-06-01 11:15           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2022-05-25 12:15 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Mon, 23 May 2022 16:59:30 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Fri, May 20, 2022 at 05:45:24PM +0200, Stefano Brivio wrote:
> > On Tue, 17 May 2022 14:57:09 +0200
> > Stefano Brivio <sbrivio@redhat.com> wrote:
> >   
> > > On Mon, 16 May 2022 20:16:53 +0200
> > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > >   
> > > > Hi Stefano,
> > > > 
> > > > On Thu, May 12, 2022 at 08:34:21PM +0200, Stefano Brivio wrote:    
> > > > > In the overlap detection performed as part of the insertion operation,
> > > > > we have to skip nodes that are not active in the current generation or
> > > > > expired. This was done by adding several conditions in overlap decision
> > > > > clauses, which made it hard to check for correctness, and debug the
> > > > > actual issue this patch is fixing.
> > > > > 
> > > > > Consolidate these conditions into a single clause, so that we can
> > > > > proceed with debugging and fixing the following issue.
> > > > > 
> > > > > Case b3. described in comments covers the insertion of a start
> > > > > element after an existing end, as a leaf. If all the descendants of
> > > > > a given element are inactive, functionally, for the purposes of
> > > > > overlap detection, we still have to treat this as a leaf, but we don't.
> > > > > 
> > > > > If, in the same transaction, we remove a start element to the right,
> > > > > remove an end element to the left, and add a start element to the right
> > > > > of an existing, active end element, we don't hit case b3. For example:
> > > > > 
> > > > > - existing intervals: 1-2, 3-5, 6-7
> > > > > - transaction: delete 3-5, insert 4-5
> > > > > 
> > > > > node traversal might happen as follows:
> > > > > - 2 (active end)
> > > > > - 5 (inactive end)
> > > > > - 3 (inactive start)
> > > > > 
> > > > > when we add 4 as start element, we should simply consider 2 as
> > > > > preceding end, and consider it as a leaf, because interval 3-5 has been
> > > > > deleted. If we don't, we'll report an overlap because we forgot about
> > > > > this.
> > > > > 
> > > > > Add an additional flag which is set as we find an active end, and reset
> > > > > it if we find an active start (to the left). If we finish the traversal
> > > > > with this flag set, it means we need to functionally consider the
> > > > > previous active end as a leaf, and allow insertion instead of reporting
> > > > > overlap.      
> > > > 
> > > > I can still trigger EEXIST with deletion of existing interval. It
> > > > became harder to reproduce after this patch.
> > > > 
> > > > After hitting EEXIST, if I do:
> > > > 
> > > >         echo "flush ruleset" > test.nft
> > > >         nft list ruleset >> test.nft
> > > > 
> > > > to dump the existing ruleset, then I run the delete element command
> > > > again to remove the interval and it works. Before this patch I could
> > > > reproduce it by reloading an existing ruleset dump.
> > > > 
> > > > I'm running the script that I'm attaching manually, just one manual
> > > > invocation after another.    
> > > 
> > > Ouch, sorry for that.
> > > 
> > > It looks like there's another case where inactive elements still affect
> > > overlap detection in an unexpected way... at least with the structure
> > > of this patch it should be easier to find, I'm looking into that now.  
> > 
> > ...what a mess. I could figure that part out (it was a case symmetric
> > to what this patch fixed, in this case resolving to case b5.) but
> > there's then another case (found by triggering a specific tree topology
> > with 0044interval_overlap_1) where we first add a start element, then
> > fail to add the end element because the start element is completely
> > "hidden" in the tree by inactive nodes.
> > 
> > I tried to solve that with some backtracking, but that looks also
> > fragile. If I clean up the tree before insertion, instead, that will
> > only deal with expired nodes, not inactive nodes -- I can't erase
> > non-expired, inactive nodes because the API expects to find them at
> > some later point and call nft_rbtree_remove() on them.
> > 
> > Now I'm trying another approach that looks more robust: instead of
> > descending the tree to find overlaps, just going through it in the same
> > way nft_rbtree_gc() does (linearly, node by node), marking the
> > value-wise closest points from left and right _valid_ nodes, and
> > applying the same reasoning. I need a bit more time for this, but it
> > looks way simpler. Insertion itself would keep working as it does now.
> > 
> > In hindsight, it looks like it was a terrible idea to try to fix this
> > implementation. I really underestimated how bad this is. Functionally
> > speaking, it's not a red-black tree because:
> > 
> > - we can't use it as a sorted binary tree, given that some elements
> >   "don't matter" for some operations, or have another colour. We might
> >   try to think of it as some other structure and rebuild from there
> >   useful properties of sorted binary trees, but I'm not sure a
> >   "red-brown-black" tree would have any other use making it worth of
> >   any further research
> > 
> > - end elements being represented as their value plus one also break
> >   assumptions of sorted trees (this is the issue I'm actually facing
> >   with backtracking)
> > 
> > - left subtrees store keys greater than right subtrees, but this
> >   looks consistent so it's just added fun and could be fixed
> >   trivially (it's all reversed)
> > 
> > By the way, I think we _should_ have similar issues in the lookup
> > function. Given that it's possible to build a tree with a subtree of at
> > least three levels with entirely non-valid nodes, I guess we can hide a
> > valid interval from the lookup too. It's just harder to test.  
> 
> Thanks for the detailed report.
> 
> Another possibility? Maintain two trees, one for the existing
> generation (this is read-only) and another for the next generation
> (insertion/removals are performed on it), then swap them when commit
> happens?

It sounded like a good idea and I actually started sketching it, but
there's one fundamental issue: it doesn't help with overlap detection
because we also want to check insertions that will be part of the live
copy. If, within one transaction, you delete and create elements, the
"dirty" copy is still dirty for the purposes of overlaps.

For the lookup, that might help. Is it worth it right now, though? At
the moment I would go back and try to get overlap detection work
properly, at least, with my previous idea.

> pipapo has similar requirements, currently it is relying on a
> workqueue to make some postprocess after the commit phase. At the
> expense of consuming more memory.

Well, it keeps two copies already: all the insertions and removals are
done on the dirty copy. The two problems we have there are:

- the clean copy might also contain inactive elements (because on a
  "commit" operation the API doesn't guarantee that all inserted
  elements are active), so we need to check for those during lookup,
  which is quite heavy (in my tests that was 4% of the clock cycles
  needed for lookup in a set with 30 000 "port, protocol" entries)

- on every _activate() call, we also need to commit the dirty copy onto
  a clean one, instead of having one commit per transaction (because if
  there's a newly activated item, we need to see it from the lookup),
  so every transaction translates to a number of commit operations for
  the back-end.

  That also makes things a bit complicated and it might very well be
  related to point 3. below

...there's no actual workqueue: garbage collection (for timed out
entries) only happens on commit, I don't see a particular problem with
it.

I think both issues would be solved if we had a more natural API, that
is, having a single call to the back-end implementing a commit
operation, instead of separately activating single entries. I don't
know how complicated this change would be.

From a set back-end perspective it looks trivial (pipapo would be
greatly simplified, hash would also need to keep two copies but we
would remove some complexity by getting rid of some checks).

> > In the perspective of getting rid of it, I think we need:
> > 
> > 1. some "introductory" documentation for nft_set_pipapo -- I just
> >    got back to it (drawing some diagrams first...)
> > 
> > 2. to understand if the performance gap in the few (maybe not
> >    reasonable) cases where nft_set_rbtree matches faster than
> >    nft_set_pipapo is acceptable. Summary:
> >      https://lore.kernel.org/netfilter-devel/be7d4e51603633e7b88e4b0dde54b25a3e5a018b.1583598508.git.sbrivio@redhat.com/  
> 
> IIRC pipapo was not too far behind from rbtree for a few scenarios.

Perhaps it would be good enough (minus points 1. and 3. here) to start
offering it as a default option (the change is trivial, setting
NFT_PIPAPO_MIN_FIELDS to 1) and see if regressions are reported (actually,
I doubt it).

> > 3. a solution for https://bugzilla.netfilter.org/show_bug.cgi?id=1583,
> >    it's an atomicity issue which has little to do with nft_set_pipapo
> >    structures themselves, but I couldn't figure out the exact issue
> >    yet. I'm struggling to find the time for it, so if somebody wants to
> >    give it a try, I'd be more than happy to reassign it...  
> 
> OK, a different problem, related to pipapo.

Yes, I included it here because I wouldn't offer pipapo as rbtree
replacement as long as that issue is there.

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
  2022-05-25 12:15         ` Stefano Brivio
@ 2022-06-01 11:15           ` Pablo Neira Ayuso
  2022-06-03 13:04             ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Pablo Neira Ayuso @ 2022-06-01 11:15 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: netfilter-devel

On Wed, May 25, 2022 at 02:15:07PM +0200, Stefano Brivio wrote:
> On Mon, 23 May 2022 16:59:30 +0200
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
[...]
> > Another possibility? Maintain two trees, one for the existing
> > generation (this is read-only) and another for the next generation
> > (insertion/removals are performed on it), then swap them when commit
> > happens?
> 
> It sounded like a good idea and I actually started sketching it, but
> there's one fundamental issue: it doesn't help with overlap detection
> because we also want to check insertions that will be part of the live
> copy. If, within one transaction, you delete and create elements, the
> "dirty" copy is still dirty for the purposes of overlaps.

Updates on the copy could be done without using the deactivate/active
logic since it is not exposed to the packet path. Then, you use the
copy (next generation of the datastructure) to check for overlaps? We
need keep pointer to two sets in the rbtree private data area, the
generation ID would point to the current set that is being used from
the packet path. The stale tree is released from the commit phase (see
below).

> For the lookup, that might help. Is it worth it right now, though? At
> the moment I would go back and try to get overlap detection work
> properly, at least, with my previous idea.

If your idea is still in planning phase, could you summarize again the
idea? You mentioned about using gc you mentioned, if it is more simple
than my proposal, then it should be good to go too.

> > pipapo has similar requirements, currently it is relying on a
> > workqueue to make some postprocess after the commit phase. At the
> > expense of consuming more memory.
> 
> Well, it keeps two copies already: all the insertions and removals are
> done on the dirty copy. The two problems we have there are:
> 
> - the clean copy might also contain inactive elements (because on a
>   "commit" operation the API doesn't guarantee that all inserted
>   elements are active), so we need to check for those during lookup,
>   which is quite heavy (in my tests that was 4% of the clock cycles
>   needed for lookup in a set with 30 000 "port, protocol" entries)
> 
> - on every _activate() call, we also need to commit the dirty copy onto
>   a clean one, instead of having one commit per transaction (because if
>   there's a newly activated item, we need to see it from the lookup),
>   so every transaction translates to a number of commit operations for
>   the back-end.
> 
>   That also makes things a bit complicated and it might very well be
>   related to point 3. below
> 
> ...there's no actual workqueue: garbage collection (for timed out
> entries) only happens on commit, I don't see a particular problem with
> it.
> 
> I think both issues would be solved if we had a more natural API, that
> is, having a single call to the back-end implementing a commit
> operation, instead of separately activating single entries. I don't
> know how complicated this change would be.

It should be possible to add a ->commit operation to set->ops, then
call it at the end of the commit phase, ie. iterate over the list of
existing sets in the table and call set->ops->commit().

> From a set back-end perspective it looks trivial (pipapo would be
> greatly simplified, hash would also need to keep two copies but we
> would remove some complexity by getting rid of some checks).
> 
> > > In the perspective of getting rid of it, I think we need:
> > > 
> > > 1. some "introductory" documentation for nft_set_pipapo -- I just
> > >    got back to it (drawing some diagrams first...)
> > > 
> > > 2. to understand if the performance gap in the few (maybe not
> > >    reasonable) cases where nft_set_rbtree matches faster than
> > >    nft_set_pipapo is acceptable. Summary:
> > >      https://lore.kernel.org/netfilter-devel/be7d4e51603633e7b88e4b0dde54b25a3e5a018b.1583598508.git.sbrivio@redhat.com/  
> > 
> > IIRC pipapo was not too far behind from rbtree for a few scenarios.
> 
> Perhaps it would be good enough (minus points 1. and 3. here) to start
> offering it as a default option (the change is trivial, setting
> NFT_PIPAPO_MIN_FIELDS to 1) and see if regressions are reported (actually,
> I doubt it).
> 
> > > 3. a solution for https://bugzilla.netfilter.org/show_bug.cgi?id=1583,
> > >    it's an atomicity issue which has little to do with nft_set_pipapo
> > >    structures themselves, but I couldn't figure out the exact issue
> > >    yet. I'm struggling to find the time for it, so if somebody wants to
> > >    give it a try, I'd be more than happy to reassign it...  
> > 
> > OK, a different problem, related to pipapo.
> 
> Yes, I included it here because I wouldn't offer pipapo as rbtree
> replacement as long as that issue is there.

Makes sense, thanks for explaining.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
  2022-06-01 11:15           ` Pablo Neira Ayuso
@ 2022-06-03 13:04             ` Stefano Brivio
  2022-06-06  9:01               ` Pablo Neira Ayuso
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2022-06-03 13:04 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Wed, 1 Jun 2022 13:15:08 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Wed, May 25, 2022 at 02:15:07PM +0200, Stefano Brivio wrote:
> > On Mon, 23 May 2022 16:59:30 +0200
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:  
> [...]
> > > Another possibility? Maintain two trees, one for the existing
> > > generation (this is read-only) and another for the next generation
> > > (insertion/removals are performed on it), then swap them when commit
> > > happens?  
> > 
> > It sounded like a good idea and I actually started sketching it, but
> > there's one fundamental issue: it doesn't help with overlap detection
> > because we also want to check insertions that will be part of the live
> > copy. If, within one transaction, you delete and create elements, the
> > "dirty" copy is still dirty for the purposes of overlaps.  
> 
> Updates on the copy could be done without using the deactivate/active
> logic since it is not exposed to the packet path. Then, you use the
> copy (next generation of the datastructure) to check for overlaps? We
> need keep pointer to two sets in the rbtree private data area, the
> generation ID would point to the current set that is being used from
> the packet path. The stale tree is released from the commit phase (see
> below).

Oh, right, I guess that would work. But at a first glance it looks more
complicated than the other idea:

> > For the lookup, that might help. Is it worth it right now, though? At
> > the moment I would go back and try to get overlap detection work
> > properly, at least, with my previous idea.  
> 
> If your idea is still in planning phase, could you summarize again the
> idea? You mentioned about using gc you mentioned, if it is more simple
> than my proposal, then it should be good to go too.

...hmm, no, forget about gc, that was flawed. I'm just "walking" the
tree (going through elements as a list, instead of descending it),
noting down closest left and right elements to what we're inserting,
and check it with similar criteria to what we already have (but much
simpler, because we don't have to infer anything from what might be in
other leaves/nodes).

That is, if you have elements 3 (start), 5 (end), 7 (start), 8 (end),
and you're inserting 6 as a start, we'll end up the tree walk with 5
(end) on the left and 7 (start) on the right, so we know it's not
overlapping.

If you insert 4 (as start or end), we know we have 3 and 5 around, so
it overlaps.

It's essentially the same as it is now, just dropping a lot of corner
cases and changing the way we go through the tree.

I kind of implemented it, I still need a bit to make it working.

> > > pipapo has similar requirements, currently it is relying on a
> > > workqueue to make some postprocess after the commit phase. At the
> > > expense of consuming more memory.  
> > 
> > Well, it keeps two copies already: all the insertions and removals are
> > done on the dirty copy. The two problems we have there are:
> > 
> > - the clean copy might also contain inactive elements (because on a
> >   "commit" operation the API doesn't guarantee that all inserted
> >   elements are active), so we need to check for those during lookup,
> >   which is quite heavy (in my tests that was 4% of the clock cycles
> >   needed for lookup in a set with 30 000 "port, protocol" entries)
> > 
> > - on every _activate() call, we also need to commit the dirty copy onto
> >   a clean one, instead of having one commit per transaction (because if
> >   there's a newly activated item, we need to see it from the lookup),
> >   so every transaction translates to a number of commit operations for
> >   the back-end.
> > 
> >   That also makes things a bit complicated and it might very well be
> >   related to point 3. below
> > 
> > ...there's no actual workqueue: garbage collection (for timed out
> > entries) only happens on commit, I don't see a particular problem with
> > it.
> > 
> > I think both issues would be solved if we had a more natural API, that
> > is, having a single call to the back-end implementing a commit
> > operation, instead of separately activating single entries. I don't
> > know how complicated this change would be.  
> 
> It should be possible to add a ->commit operation to set->ops, then
> call it at the end of the commit phase, ie. iterate over the list of
> existing sets in the table and call set->ops->commit().

That sounds good, but when would we call it? Can it be independent of
the userspace version? Could we then obsolete the "activate" operation
(of course, by implementing commit() in all the sets)?

> > From a set back-end perspective it looks trivial (pipapo would be
> > greatly simplified, hash would also need to keep two copies but we
> > would remove some complexity by getting rid of some checks).
> >   
> > > > In the perspective of getting rid of it, I think we need:
> > > > 
> > > > 1. some "introductory" documentation for nft_set_pipapo -- I just
> > > >    got back to it (drawing some diagrams first...)
> > > > 
> > > > 2. to understand if the performance gap in the few (maybe not
> > > >    reasonable) cases where nft_set_rbtree matches faster than
> > > >    nft_set_pipapo is acceptable. Summary:
> > > >      https://lore.kernel.org/netfilter-devel/be7d4e51603633e7b88e4b0dde54b25a3e5a018b.1583598508.git.sbrivio@redhat.com/    
> > > 
> > > IIRC pipapo was not too far behind from rbtree for a few scenarios.  
> > 
> > Perhaps it would be good enough (minus points 1. and 3. here) to start
> > offering it as a default option (the change is trivial, setting
> > NFT_PIPAPO_MIN_FIELDS to 1) and see if regressions are reported (actually,
> > I doubt it).
> >   
> > > > 3. a solution for https://bugzilla.netfilter.org/show_bug.cgi?id=1583,
> > > >    it's an atomicity issue which has little to do with nft_set_pipapo
> > > >    structures themselves, but I couldn't figure out the exact issue
> > > >    yet. I'm struggling to find the time for it, so if somebody wants to
> > > >    give it a try, I'd be more than happy to reassign it...    
> > > 
> > > OK, a different problem, related to pipapo.  
> > 
> > Yes, I included it here because I wouldn't offer pipapo as rbtree
> > replacement as long as that issue is there.  
> 
> Makes sense, thanks for explaining.
> 

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
  2022-06-03 13:04             ` Stefano Brivio
@ 2022-06-06  9:01               ` Pablo Neira Ayuso
  2022-06-14  9:58                 ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Pablo Neira Ayuso @ 2022-06-06  9:01 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: netfilter-devel

On Fri, Jun 03, 2022 at 03:04:45PM +0200, Stefano Brivio wrote:
> On Wed, 1 Jun 2022 13:15:08 +0200
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > On Wed, May 25, 2022 at 02:15:07PM +0200, Stefano Brivio wrote:
> > > On Mon, 23 May 2022 16:59:30 +0200
> > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:  
> > [...]
> > > > Another possibility? Maintain two trees, one for the existing
> > > > generation (this is read-only) and another for the next generation
> > > > (insertion/removals are performed on it), then swap them when commit
> > > > happens?  
> > > 
> > > It sounded like a good idea and I actually started sketching it, but
> > > there's one fundamental issue: it doesn't help with overlap detection
> > > because we also want to check insertions that will be part of the live
> > > copy. If, within one transaction, you delete and create elements, the
> > > "dirty" copy is still dirty for the purposes of overlaps.  
> > 
> > Updates on the copy could be done without using the deactivate/active
> > logic since it is not exposed to the packet path. Then, you use the
> > copy (next generation of the datastructure) to check for overlaps? We
> > need keep pointer to two sets in the rbtree private data area, the
> > generation ID would point to the current set that is being used from
> > the packet path. The stale tree is released from the commit phase (see
> > below).
> 
> Oh, right, I guess that would work. But at a first glance it looks more
> complicated than the other idea:

Yes, my idea would trigger a larger rewrite.

> > > For the lookup, that might help. Is it worth it right now, though? At
> > > the moment I would go back and try to get overlap detection work
> > > properly, at least, with my previous idea.  
> > 
> > If your idea is still in planning phase, could you summarize again the
> > idea? You mentioned about using gc you mentioned, if it is more simple
> > than my proposal, then it should be good to go too.
> 
> ...hmm, no, forget about gc, that was flawed. I'm just "walking" the
> tree (going through elements as a list, instead of descending it),
> noting down closest left and right elements to what we're inserting,
> and check it with similar criteria to what we already have (but much
> simpler, because we don't have to infer anything from what might be in
> other leaves/nodes).
> 
> That is, if you have elements 3 (start), 5 (end), 7 (start), 8 (end),
> and you're inserting 6 as a start, we'll end up the tree walk with 5
> (end) on the left and 7 (start) on the right, so we know it's not
> overlapping.
> 
> If you insert 4 (as start or end), we know we have 3 and 5 around, so
> it overlaps.
> 
> It's essentially the same as it is now, just dropping a lot of corner
> cases and changing the way we go through the tree.
> 
> I kind of implemented it, I still need a bit to make it working.

That sounds an incremental fix, I prefer this too.

> > > > pipapo has similar requirements, currently it is relying on a
> > > > workqueue to make some postprocess after the commit phase. At the
> > > > expense of consuming more memory.  
> > > 
> > > Well, it keeps two copies already: all the insertions and removals are
> > > done on the dirty copy. The two problems we have there are:
> > > 
> > > - the clean copy might also contain inactive elements (because on a
> > >   "commit" operation the API doesn't guarantee that all inserted
> > >   elements are active), so we need to check for those during lookup,
> > >   which is quite heavy (in my tests that was 4% of the clock cycles
> > >   needed for lookup in a set with 30 000 "port, protocol" entries)
> > > 
> > > - on every _activate() call, we also need to commit the dirty copy onto
> > >   a clean one, instead of having one commit per transaction (because if
> > >   there's a newly activated item, we need to see it from the lookup),
> > >   so every transaction translates to a number of commit operations for
> > >   the back-end.
> > > 
> > >   That also makes things a bit complicated and it might very well be
> > >   related to point 3. below
> > > 
> > > ...there's no actual workqueue: garbage collection (for timed out
> > > entries) only happens on commit, I don't see a particular problem with
> > > it.
> > > 
> > > I think both issues would be solved if we had a more natural API, that
> > > is, having a single call to the back-end implementing a commit
> > > operation, instead of separately activating single entries. I don't
> > > know how complicated this change would be.  
> > 
> > It should be possible to add a ->commit operation to set->ops, then
> > call it at the end of the commit phase, ie. iterate over the list of
> > existing sets in the table and call set->ops->commit().
> 
> That sounds good, but when would we call it? Can it be independent of
> the userspace version? Could we then obsolete the "activate" operation
> (of course, by implementing commit() in all the sets)?

Call it from nf_tables_commit().

I don't see how we can obsolete "activate" operation, though, the
existing approach works at set element granularity.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
  2022-06-06  9:01               ` Pablo Neira Ayuso
@ 2022-06-14  9:58                 ` Stefano Brivio
  2022-06-16  9:08                   ` Pablo Neira Ayuso
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2022-06-14  9:58 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Mon, 6 Jun 2022 11:01:21 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Fri, Jun 03, 2022 at 03:04:45PM +0200, Stefano Brivio wrote:
> > On Wed, 1 Jun 2022 13:15:08 +0200
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >   
> > > On Wed, May 25, 2022 at 02:15:07PM +0200, Stefano Brivio wrote:  
> > > > On Mon, 23 May 2022 16:59:30 +0200
> > > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:    
> > > [...]  
> > > > > Another possibility? Maintain two trees, one for the existing
> > > > > generation (this is read-only) and another for the next generation
> > > > > (insertion/removals are performed on it), then swap them when commit
> > > > > happens?    
> > > > 
> > > > It sounded like a good idea and I actually started sketching it, but
> > > > there's one fundamental issue: it doesn't help with overlap detection
> > > > because we also want to check insertions that will be part of the live
> > > > copy. If, within one transaction, you delete and create elements, the
> > > > "dirty" copy is still dirty for the purposes of overlaps.    
> > > 
> > > Updates on the copy could be done without using the deactivate/active
> > > logic since it is not exposed to the packet path. Then, you use the
> > > copy (next generation of the datastructure) to check for overlaps? We
> > > need keep pointer to two sets in the rbtree private data area, the
> > > generation ID would point to the current set that is being used from
> > > the packet path. The stale tree is released from the commit phase (see
> > > below).  
> > 
> > Oh, right, I guess that would work. But at a first glance it looks more
> > complicated than the other idea:  
> 
> Yes, my idea would trigger a larger rewrite.
> 
> > > > For the lookup, that might help. Is it worth it right now, though? At
> > > > the moment I would go back and try to get overlap detection work
> > > > properly, at least, with my previous idea.    
> > > 
> > > If your idea is still in planning phase, could you summarize again the
> > > idea? You mentioned about using gc you mentioned, if it is more simple
> > > than my proposal, then it should be good to go too.  
> > 
> > ...hmm, no, forget about gc, that was flawed. I'm just "walking" the
> > tree (going through elements as a list, instead of descending it),
> > noting down closest left and right elements to what we're inserting,
> > and check it with similar criteria to what we already have (but much
> > simpler, because we don't have to infer anything from what might be in
> > other leaves/nodes).
> > 
> > That is, if you have elements 3 (start), 5 (end), 7 (start), 8 (end),
> > and you're inserting 6 as a start, we'll end up the tree walk with 5
> > (end) on the left and 7 (start) on the right, so we know it's not
> > overlapping.
> > 
> > If you insert 4 (as start or end), we know we have 3 and 5 around, so
> > it overlaps.
> > 
> > It's essentially the same as it is now, just dropping a lot of corner
> > cases and changing the way we go through the tree.
> > 
> > I kind of implemented it, I still need a bit to make it working.  
> 
> That sounds an incremental fix, I prefer this too.

...finally posted now.

> > > > > pipapo has similar requirements, currently it is relying on a
> > > > > workqueue to make some postprocess after the commit phase. At the
> > > > > expense of consuming more memory.    
> > > > 
> > > > Well, it keeps two copies already: all the insertions and removals are
> > > > done on the dirty copy. The two problems we have there are:
> > > > 
> > > > - the clean copy might also contain inactive elements (because on a
> > > >   "commit" operation the API doesn't guarantee that all inserted
> > > >   elements are active), so we need to check for those during lookup,
> > > >   which is quite heavy (in my tests that was 4% of the clock cycles
> > > >   needed for lookup in a set with 30 000 "port, protocol" entries)
> > > > 
> > > > - on every _activate() call, we also need to commit the dirty copy onto
> > > >   a clean one, instead of having one commit per transaction (because if
> > > >   there's a newly activated item, we need to see it from the lookup),
> > > >   so every transaction translates to a number of commit operations for
> > > >   the back-end.
> > > > 
> > > >   That also makes things a bit complicated and it might very well be
> > > >   related to point 3. below
> > > > 
> > > > ...there's no actual workqueue: garbage collection (for timed out
> > > > entries) only happens on commit, I don't see a particular problem with
> > > > it.
> > > > 
> > > > I think both issues would be solved if we had a more natural API, that
> > > > is, having a single call to the back-end implementing a commit
> > > > operation, instead of separately activating single entries. I don't
> > > > know how complicated this change would be.    
> > > 
> > > It should be possible to add a ->commit operation to set->ops, then
> > > call it at the end of the commit phase, ie. iterate over the list of
> > > existing sets in the table and call set->ops->commit().  
> > 
> > That sounds good, but when would we call it? Can it be independent of
> > the userspace version? Could we then obsolete the "activate" operation
> > (of course, by implementing commit() in all the sets)?  
> 
> Call it from nf_tables_commit().
> 
> I don't see how we can obsolete "activate" operation, though, the
> existing approach works at set element granularity.

Yes, and that's what I'm arguing against: it would be more natural, in
a transaction, to have a single commit operation for all the elements
at hand -- otherwise it's not so much of a transaction.

To the user it's atomic (minus bugs) because we have tricks to ensure
it, but to the set back-ends it's absolutely not. I think we have this
kind of situation:


nft            <->     core       <->   set back-end    <->    storage
                |                  |                     |

hash:   transaction commit    element commit       element commit

rbtree: transaction commit    element commit       element commit
                                                   ^ problematic to the
                                                   point we're
                                                   considering a
                                                   transaction approach

pipapo: transaction commit    element commit       transaction commit

The single advantage I see of the current approach is that with the
hash back-ends we don't need two copies of the hash table, but that
also has the downside of the nft_set_elem_active(&he->ext, genmask)
check in the lookup function, which should be, in relative terms, even
more expensive than it is in the pipapo implementation, given that hash
back-ends are (in most cases) faster.

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf
  2022-06-14  9:58                 ` Stefano Brivio
@ 2022-06-16  9:08                   ` Pablo Neira Ayuso
  0 siblings, 0 replies; 11+ messages in thread
From: Pablo Neira Ayuso @ 2022-06-16  9:08 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: netfilter-devel

On Tue, Jun 14, 2022 at 11:58:14AM +0200, Stefano Brivio wrote:
> On Mon, 6 Jun 2022 11:01:21 +0200
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
[...]
> > That sounds an incremental fix, I prefer this too.
> 
> ...finally posted now.

Thanks.

[...]
> > I don't see how we can obsolete "activate" operation, though, the
> > existing approach works at set element granularity.
> 
> Yes, and that's what I'm arguing against: it would be more natural, in
> a transaction, to have a single commit operation for all the elements
> at hand -- otherwise it's not so much of a transaction.
> 
> To the user it's atomic (minus bugs) because we have tricks to ensure
> it, but to the set back-ends it's absolutely not. I think we have this
> kind of situation:
> 
> 
> nft            <->     core       <->   set back-end    <->    storage
>                 |                  |                     |
> 
> hash:   transaction commit    element commit       element commit
> 
> rbtree: transaction commit    element commit       element commit
>                                                    ^ problematic to the
>                                                    point we're
>                                                    considering a
>                                                    transaction approach
> 
> pipapo: transaction commit    element commit       transaction commit
> 
> The single advantage I see of the current approach is that with the
> hash back-ends we don't need two copies of the hash table, but that
> also has the downside of the nft_set_elem_active(&he->ext, genmask)
> check in the lookup function, which should be, in relative terms, even
> more expensive than it is in the pipapo implementation, given that hash
> back-ends are (in most cases) faster.

There is also runtime set updates from packet path. In that case, we
cannot keep a copy of the data structure that is being updated from
the control plane while the packet path is also adding/deleting
entries from it.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-06-16  9:08 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-12 18:34 [PATCH nf] nft_set_rbtree: Move clauses for expired nodes, last active node as leaf Stefano Brivio
2022-05-16 18:16 ` Pablo Neira Ayuso
2022-05-17 12:57   ` Stefano Brivio
2022-05-20 15:45     ` Stefano Brivio
2022-05-23 14:59       ` Pablo Neira Ayuso
2022-05-25 12:15         ` Stefano Brivio
2022-06-01 11:15           ` Pablo Neira Ayuso
2022-06-03 13:04             ` Stefano Brivio
2022-06-06  9:01               ` Pablo Neira Ayuso
2022-06-14  9:58                 ` Stefano Brivio
2022-06-16  9:08                   ` Pablo Neira Ayuso

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.