* [PATCH nf] nft_set_rbtree: Switch to node list walk for overlap detection
@ 2022-06-14  1:07 Stefano Brivio
  2022-06-27 16:59 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 9+ messages in thread
From: Stefano Brivio @ 2022-06-14  1:07 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

...instead of a tree descent, which became overly complicated in an
attempt to cover cases where expired or inactive elements would
affect comparisons with the new element being inserted.

Further, it turned out that it's probably impossible to cover all
those cases, as inactive nodes might entirely hide subtrees
consisting of a complete interval plus a node that makes the current
insertion not overlap.

For the insertion operation itself, this essentially reverts back to
the implementation before commit 7c84d41416d8
("netfilter: nft_set_rbtree: Detect partial overlaps on insertion"),
except that cases of complete overlap are already handled in the
overlap detection phase itself, which slightly simplifies the loop to
find the insertion point.

Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 net/netfilter/nft_set_rbtree.c | 194 ++++++++++-----------------------
 1 file changed, 58 insertions(+), 136 deletions(-)

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 7325bee7d144..87a28d2dca77 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -38,10 +38,12 @@ static bool nft_rbtree_interval_start(const struct nft_rbtree_elem *rbe)
 	return !nft_rbtree_interval_end(rbe);
 }
 
-static bool nft_rbtree_equal(const struct nft_set *set, const void *this,
-			     const struct nft_rbtree_elem *interval)
+static int nft_rbtree_cmp(const struct nft_set *set,
+			  const struct nft_rbtree_elem *e1,
+			  const struct nft_rbtree_elem *e2)
 {
-	return memcmp(this, nft_set_ext_key(&interval->ext), set->klen) == 0;
+	return memcmp(nft_set_ext_key(&e1->ext), nft_set_ext_key(&e2->ext),
+		      set->klen);
 }
 
 static bool __nft_rbtree_lookup(const struct net *net, const struct nft_set *set,
@@ -52,7 +54,6 @@ static bool __nft_rbtree_lookup(const struct net *net, const struct nft_set *set
 	const struct nft_rbtree_elem *rbe, *interval = NULL;
 	u8 genmask = nft_genmask_cur(net);
 	const struct rb_node *parent;
-	const void *this;
 	int d;
 
 	parent = rcu_dereference_raw(priv->root.rb_node);
@@ -62,12 +63,11 @@ static bool __nft_rbtree_lookup(const struct net *net, const struct nft_set *set
 
 		rbe = rb_entry(parent, struct nft_rbtree_elem, node);
 
-		this = nft_set_ext_key(&rbe->ext);
-		d = memcmp(this, key, set->klen);
+		d = memcmp(nft_set_ext_key(&rbe->ext), key, set->klen);
 		if (d < 0) {
 			parent = rcu_dereference_raw(parent->rb_left);
 			if (interval &&
-			    nft_rbtree_equal(set, this, interval) &&
+			    !nft_rbtree_cmp(set, rbe, interval) &&
 			    nft_rbtree_interval_end(rbe) &&
 			    nft_rbtree_interval_start(interval))
 				continue;
@@ -219,150 +219,72 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 			       struct nft_rbtree_elem *new,
 			       struct nft_set_ext **ext)
 {
-	bool overlap = false, dup_end_left = false, dup_end_right = false;
+	struct nft_rbtree_elem *rbe, *rbe_le = NULL, *rbe_ge = NULL;
 	struct nft_rbtree *priv = nft_set_priv(set);
 	u8 genmask = nft_genmask_next(net);
-	struct nft_rbtree_elem *rbe;
-	struct rb_node *parent, **p;
+	struct rb_node *node, *parent, **p;
 	int d;
 
-	/* Detect overlaps as we descend the tree. Set the flag in these cases:
-	 *
-	 * a1. _ _ __>|  ?_ _ __|  (insert end before existing end)
-	 * a2. _ _ ___|  ?_ _ _>|  (insert end after existing end)
-	 * a3. _ _ ___? >|_ _ __|  (insert start before existing end)
-	 *
-	 * and clear it later on, as we eventually reach the points indicated by
-	 * '?' above, in the cases described below. We'll always meet these
-	 * later, locally, due to tree ordering, and overlaps for the intervals
-	 * that are the closest together are always evaluated last.
-	 *
-	 * b1. _ _ __>|  !_ _ __|  (insert end before existing start)
-	 * b2. _ _ ___|  !_ _ _>|  (insert end after existing start)
-	 * b3. _ _ ___! >|_ _ __|  (insert start after existing end, as a leaf)
-	 *            '--' no nodes falling in this range
-	 * b4.          >|_ _   !  (insert start before existing start)
-	 *
-	 * Case a3. resolves to b3.:
-	 * - if the inserted start element is the leftmost, because the '0'
-	 *   element in the tree serves as end element
-	 * - otherwise, if an existing end is found immediately to the left. If
-	 *   there are existing nodes in between, we need to further descend the
-	 *   tree before we can conclude the new start isn't causing an overlap
-	 *
-	 * or to b4., which, preceded by a3., means we already traversed one or
-	 * more existing intervals entirely, from the right.
-	 *
-	 * For a new, rightmost pair of elements, we'll hit cases b3. and b2.,
-	 * in that order.
-	 *
-	 * The flag is also cleared in two special cases:
-	 *
-	 * b5. |__ _ _!|<_ _ _   (insert start right before existing end)
-	 * b6. |__ _ >|!__ _ _   (insert end right after existing start)
-	 *
-	 * which always happen as last step and imply that no further
-	 * overlapping is possible.
-	 *
-	 * Another special case comes from the fact that start elements matching
-	 * an already existing start element are allowed: insertion is not
-	 * performed but we return -EEXIST in that case, and the error will be
-	 * cleared by the caller if NLM_F_EXCL is not present in the request.
-	 * This way, request for insertion of an exact overlap isn't reported as
-	 * error to userspace if not desired.
-	 *
-	 * However, if the existing start matches a pre-existing start, but the
-	 * end element doesn't match the corresponding pre-existing end element,
-	 * we need to report a partial overlap. This is a local condition that
-	 * can be noticed without need for a tracking flag, by checking for a
-	 * local duplicated end for a corresponding start, from left and right,
-	 * separately.
+	/* Detect overlaps by going through the list of valid tree nodes: */
+	for (node = rb_first(&priv->root); node != NULL; node = rb_next(node)) {
+		rbe = rb_entry(node, struct nft_rbtree_elem, node);
+
+		if (!nft_set_elem_active(&rbe->ext, genmask) ||
+		    nft_set_elem_expired(&rbe->ext))
+			continue;
+
+		d = nft_rbtree_cmp(set, rbe, new);
+
+		if (d <= 0 && (!rbe_le || nft_rbtree_cmp(set, rbe, rbe_le) > 0))
+			rbe_le = rbe;
+
+		if (d >= 0 && (!rbe_ge || nft_rbtree_cmp(set, rbe, rbe_ge) < 0))
+			rbe_ge = rbe;
+	}
+
+	/* - new start element matching existing start element, or end element
+	 *   matching end element: full overlap reported as -EEXIST, cleared by
+	 *   caller if NLM_F_EXCL is not given
+	 */
+	rbe = rbe_le;
+	if (rbe && !nft_rbtree_cmp(set, new, rbe) &&
+	    nft_rbtree_interval_start(rbe) == nft_rbtree_interval_start(new)) {
+		*ext = &rbe->ext;
+		return -EEXIST;
+	}
+
+	/* - new start element with existing closest, less or equal key value
+	 *   being a start element: partial overlap, reported as -ENOTEMPTY
 	 */
+	if (rbe_le &&
+	    nft_rbtree_interval_start(rbe_le) && nft_rbtree_interval_start(new))
+		return -ENOTEMPTY;
+
+	/* - new end element with existing closest, greater or equal key value
+	 *   being an end element: partial overlap, reported as -ENOTEMPTY
+	 */
+	if (rbe_ge &&
+	    nft_rbtree_interval_end(rbe_ge) && nft_rbtree_interval_end(new))
+		return -ENOTEMPTY;
 
+	/* Accepted element: pick insertion point depending on key value */
 	parent = NULL;
 	p = &priv->root.rb_node;
 	while (*p != NULL) {
 		parent = *p;
 		rbe = rb_entry(parent, struct nft_rbtree_elem, node);
-		d = memcmp(nft_set_ext_key(&rbe->ext),
-			   nft_set_ext_key(&new->ext),
-			   set->klen);
-		if (d < 0) {
-			p = &parent->rb_left;
+		d = nft_rbtree_cmp(set, rbe, new);
 
-			if (nft_rbtree_interval_start(new)) {
-				if (nft_rbtree_interval_end(rbe) &&
-				    nft_set_elem_active(&rbe->ext, genmask) &&
-				    !nft_set_elem_expired(&rbe->ext) && !*p)
-					overlap = false;
-			} else {
-				if (dup_end_left && !*p)
-					return -ENOTEMPTY;
-
-				overlap = nft_rbtree_interval_end(rbe) &&
-					  nft_set_elem_active(&rbe->ext,
-							      genmask) &&
-					  !nft_set_elem_expired(&rbe->ext);
-
-				if (overlap) {
-					dup_end_right = true;
-					continue;
-				}
-			}
-		} else if (d > 0) {
+		if (d < 0)
+			p = &parent->rb_left;
+		else if (d > 0)
+			p = &parent->rb_right;
+		else if (nft_rbtree_interval_end(rbe))
+			p = &parent->rb_left;
+		else
 			p = &parent->rb_right;
-
-			if (nft_rbtree_interval_end(new)) {
-				if (dup_end_right && !*p)
-					return -ENOTEMPTY;
-
-				overlap = nft_rbtree_interval_end(rbe) &&
-					  nft_set_elem_active(&rbe->ext,
-							      genmask) &&
-					  !nft_set_elem_expired(&rbe->ext);
-
-				if (overlap) {
-					dup_end_left = true;
-					continue;
-				}
-			} else if (nft_set_elem_active(&rbe->ext, genmask) &&
-				   !nft_set_elem_expired(&rbe->ext)) {
-				overlap = nft_rbtree_interval_end(rbe);
-			}
-		} else {
-			if (nft_rbtree_interval_end(rbe) &&
-			    nft_rbtree_interval_start(new)) {
-				p = &parent->rb_left;
-
-				if (nft_set_elem_active(&rbe->ext, genmask) &&
-				    !nft_set_elem_expired(&rbe->ext))
-					overlap = false;
-			} else if (nft_rbtree_interval_start(rbe) &&
-				   nft_rbtree_interval_end(new)) {
-				p = &parent->rb_right;
-
-				if (nft_set_elem_active(&rbe->ext, genmask) &&
-				    !nft_set_elem_expired(&rbe->ext))
-					overlap = false;
-			} else if (nft_set_elem_active(&rbe->ext, genmask) &&
-				   !nft_set_elem_expired(&rbe->ext)) {
-				*ext = &rbe->ext;
-				return -EEXIST;
-			} else {
-				overlap = false;
-				if (nft_rbtree_interval_end(rbe))
-					p = &parent->rb_left;
-				else
-					p = &parent->rb_right;
-			}
-		}
-
-		dup_end_left = dup_end_right = false;
 	}
 
-	if (overlap)
-		return -ENOTEMPTY;
-
 	rb_link_node_rcu(&new->node, parent, p);
 	rb_insert_color(&new->node, &priv->root);
 	return 0;
-- 
2.35.1



* Re: [PATCH nf] nft_set_rbtree: Switch to node list walk for overlap detection
  2022-06-14  1:07 [PATCH nf] nft_set_rbtree: Switch to node list walk for overlap detection Stefano Brivio
@ 2022-06-27 16:59 ` Pablo Neira Ayuso
  2022-06-29  8:50   ` Stefano Brivio
  2022-07-01 23:55   ` Stefano Brivio
  0 siblings, 2 replies; 9+ messages in thread
From: Pablo Neira Ayuso @ 2022-06-27 16:59 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 2996 bytes --]

Hi Stefano,

On Tue, Jun 14, 2022 at 03:07:04AM +0200, Stefano Brivio wrote:
> ...instead of a tree descent, which became overly complicated in an
> attempt to cover cases where expired or inactive elements would
> affect comparisons with the new element being inserted.
>
> Further, it turned out that it's probably impossible to cover all
> those cases, as inactive nodes might entirely hide subtrees
> consisting of a complete interval plus a node that makes the current
> insertion not overlap.
>
> For the insertion operation itself, this essentially reverts back to
> the implementation before commit 7c84d41416d8
> ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion"),
> except that cases of complete overlap are already handled in the
> overlap detection phase itself, which slightly simplifies the loop to
> find the insertion point.
>
> Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
> Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  net/netfilter/nft_set_rbtree.c | 194 ++++++++++-----------------------
>  1 file changed, 58 insertions(+), 136 deletions(-)

When running tests, this increases the time to detect overlaps in
my testbed, because of the linear list walk for each element.

So I have been looking at an alternative approach (see attached patch) to
address your comments. The idea is to move the overlapping nodes out of
the tree and instead keep them in a list attached to the element that
stays in the tree.

                        root
                        /  \
                     elem   elem -> update -> update
                            /  \
                         elem  elem

Each rbtree element in the tree has a .pending_list which stores the
elements that supersede the existing (inactive) element. There is also a
.list field which is used to add the element to the .pending_list. Elements
in the tree might have a .pending_list with one or more elements.

The .deactivate path looks for the last (possibly) active element. The
.remove path depends on the element state: a) the element is a singleton
(no pending elements), then erase it from the tree, b) the element has a
pending list, then replace this node in the tree by the first element in
the pending_list and splice the remaining pending_list onto it (there
might be more elements), c) this element is in a pending_list, then just
remove it from the list. This handles both the commit path (which walks
through the list of transactions in the forward direction) and the abort
path (which walks through the list of transactions in the backward
direction).

With this approach, there are no more 'gray' nodes in the tree, as you
describe in your problem statement.

This increases the size of the rbtree element. I can probably just add a
pointer to a new nft_rbtree_elem_update structure (not done in this patch)
and allocate it dynamically on updates, so this only takes 8 extra bytes
per rbtree element; I'm not sure how complex that is.
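For illustration only, this is roughly the layout I have in mind (the
nft_rbtree_elem_update name is just the one mentioned above, it is not
part of the attached patch):

	/* Hypothetical sketch: keep a single pointer in the tree node and
	 * allocate the list heads only when an element gets updated.
	 */
	struct nft_rbtree_elem_update {
		struct list_head	pending_list;	/* elements superseding this node */
		struct list_head	list;		/* links into another node's pending_list */
	};

	struct nft_rbtree_elem {
		struct rb_node			node;
		struct nft_rbtree_elem_update	*update;	/* NULL until first update */
		struct nft_set_ext		ext;
	};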

I'm attaching the patch I've been testing here.

Let me know, thanks.

[-- Attachment #2: rbtree.patch --]
[-- Type: text/x-diff, Size: 9227 bytes --]

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 7325bee7d144..14d15efbe545 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -24,9 +24,36 @@ struct nft_rbtree {
 
 struct nft_rbtree_elem {
 	struct rb_node		node;
+	struct list_head	pending_list;
+	struct list_head	list;
 	struct nft_set_ext	ext;
 };
 
+static struct nft_rbtree_elem *nft_rbtree_elem_leaf(const struct nft_rbtree_elem *rbe, u8 genmask)
+{
+	struct nft_rbtree_elem *elem;
+
+	if (unlikely(!list_empty(&rbe->pending_list))) {
+		elem = list_last_entry(&rbe->pending_list, struct nft_rbtree_elem, list);
+		return elem;
+	}
+
+	return (struct nft_rbtree_elem *)rbe;
+}
+
+static bool __nft_rbtree_elem_active(const struct nft_rbtree_elem *rbe, u8 genmask)
+{
+	return nft_set_elem_active(&rbe->ext, genmask) &&
+	       !nft_set_elem_expired(&rbe->ext);
+}
+
+static bool nft_rbtree_elem_active(const struct nft_rbtree_elem *rbe, u8 genmask)
+{
+	struct nft_rbtree_elem *elem = nft_rbtree_elem_leaf(rbe, genmask);
+
+	return __nft_rbtree_elem_active(elem, genmask);
+}
+
 static bool nft_rbtree_interval_end(const struct nft_rbtree_elem *rbe)
 {
 	return nft_set_ext_exists(&rbe->ext, NFT_SET_EXT_FLAGS) &&
@@ -75,12 +102,7 @@ static bool __nft_rbtree_lookup(const struct net *net, const struct nft_set *set
 		} else if (d > 0)
 			parent = rcu_dereference_raw(parent->rb_right);
 		else {
-			if (!nft_set_elem_active(&rbe->ext, genmask)) {
-				parent = rcu_dereference_raw(parent->rb_left);
-				continue;
-			}
-
-			if (nft_set_elem_expired(&rbe->ext))
+			if (!nft_rbtree_elem_active(rbe, genmask))
 				return false;
 
 			if (nft_rbtree_interval_end(rbe)) {
@@ -97,8 +119,7 @@ static bool __nft_rbtree_lookup(const struct net *net, const struct nft_set *set
 	}
 
 	if (set->flags & NFT_SET_INTERVAL && interval != NULL &&
-	    nft_set_elem_active(&interval->ext, genmask) &&
-	    !nft_set_elem_expired(&interval->ext) &&
+	    nft_rbtree_elem_active(interval, genmask) &&
 	    nft_rbtree_interval_start(interval)) {
 		*ext = &interval->ext;
 		return true;
@@ -155,12 +176,7 @@ static bool __nft_rbtree_get(const struct net *net, const struct nft_set *set,
 			if (flags & NFT_SET_ELEM_INTERVAL_END)
 				interval = rbe;
 		} else {
-			if (!nft_set_elem_active(&rbe->ext, genmask)) {
-				parent = rcu_dereference_raw(parent->rb_left);
-				continue;
-			}
-
-			if (nft_set_elem_expired(&rbe->ext))
+			if (!nft_rbtree_elem_active(rbe, genmask))
 				return false;
 
 			if (!nft_set_ext_exists(&rbe->ext, NFT_SET_EXT_FLAGS) ||
@@ -178,8 +194,7 @@ static bool __nft_rbtree_get(const struct net *net, const struct nft_set *set,
 	}
 
 	if (set->flags & NFT_SET_INTERVAL && interval != NULL &&
-	    nft_set_elem_active(&interval->ext, genmask) &&
-	    !nft_set_elem_expired(&interval->ext) &&
+	    nft_rbtree_elem_active(interval, genmask) &&
 	    ((!nft_rbtree_interval_end(interval) &&
 	      !(flags & NFT_SET_ELEM_INTERVAL_END)) ||
 	     (nft_rbtree_interval_end(interval) &&
@@ -215,6 +230,22 @@ static void *nft_rbtree_get(const struct net *net, const struct nft_set *set,
 	return rbe;
 }
 
+static int nft_rbtree_insert_pending(struct nft_rbtree_elem *rbe,
+				     struct nft_rbtree_elem *new, u8 genmask)
+{
+	struct nft_rbtree_elem *elem;
+
+	if (unlikely(!list_empty(&rbe->pending_list))) {
+		elem = list_last_entry(&rbe->pending_list, struct nft_rbtree_elem, list);
+		if (nft_set_elem_active(&elem->ext, genmask))
+			return -EEXIST;
+	}
+
+	list_add_tail(&new->list, &rbe->pending_list);
+
+	return 0;
+}
+
 static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 			       struct nft_rbtree_elem *new,
 			       struct nft_set_ext **ext)
@@ -226,6 +257,9 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 	struct rb_node *parent, **p;
 	int d;
 
+	INIT_LIST_HEAD(&new->pending_list);
+	INIT_LIST_HEAD(&new->list);
+
 	/* Detect overlaps as we descend the tree. Set the flag in these cases:
 	 *
 	 * a1. _ _ __>|  ?_ _ __|  (insert end before existing end)
@@ -292,17 +326,14 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 
 			if (nft_rbtree_interval_start(new)) {
 				if (nft_rbtree_interval_end(rbe) &&
-				    nft_set_elem_active(&rbe->ext, genmask) &&
-				    !nft_set_elem_expired(&rbe->ext) && !*p)
+				    nft_rbtree_elem_active(rbe, genmask) && !*p)
 					overlap = false;
 			} else {
 				if (dup_end_left && !*p)
 					return -ENOTEMPTY;
 
 				overlap = nft_rbtree_interval_end(rbe) &&
-					  nft_set_elem_active(&rbe->ext,
-							      genmask) &&
-					  !nft_set_elem_expired(&rbe->ext);
+					  nft_rbtree_elem_active(rbe, genmask);
 
 				if (overlap) {
 					dup_end_right = true;
@@ -317,16 +348,13 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 					return -ENOTEMPTY;
 
 				overlap = nft_rbtree_interval_end(rbe) &&
-					  nft_set_elem_active(&rbe->ext,
-							      genmask) &&
-					  !nft_set_elem_expired(&rbe->ext);
+					  nft_rbtree_elem_active(rbe, genmask);
 
 				if (overlap) {
 					dup_end_left = true;
 					continue;
 				}
-			} else if (nft_set_elem_active(&rbe->ext, genmask) &&
-				   !nft_set_elem_expired(&rbe->ext)) {
+			} else if (nft_rbtree_elem_active(rbe, genmask)) {
 				overlap = nft_rbtree_interval_end(rbe);
 			}
 		} else {
@@ -334,26 +362,19 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 			    nft_rbtree_interval_start(new)) {
 				p = &parent->rb_left;
 
-				if (nft_set_elem_active(&rbe->ext, genmask) &&
-				    !nft_set_elem_expired(&rbe->ext))
+				if (nft_rbtree_elem_active(rbe, genmask))
 					overlap = false;
 			} else if (nft_rbtree_interval_start(rbe) &&
 				   nft_rbtree_interval_end(new)) {
 				p = &parent->rb_right;
 
-				if (nft_set_elem_active(&rbe->ext, genmask) &&
-				    !nft_set_elem_expired(&rbe->ext))
+				if (nft_rbtree_elem_active(rbe, genmask))
 					overlap = false;
-			} else if (nft_set_elem_active(&rbe->ext, genmask) &&
-				   !nft_set_elem_expired(&rbe->ext)) {
+			} else if (nft_rbtree_elem_active(rbe, genmask)) {
 				*ext = &rbe->ext;
 				return -EEXIST;
 			} else {
-				overlap = false;
-				if (nft_rbtree_interval_end(rbe))
-					p = &parent->rb_left;
-				else
-					p = &parent->rb_right;
+				return nft_rbtree_insert_pending(rbe, new, genmask);
 			}
 		}
 
@@ -365,6 +386,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 
 	rb_link_node_rcu(&new->node, parent, p);
 	rb_insert_color(&new->node, &priv->root);
+
 	return 0;
 }
 
@@ -389,12 +411,21 @@ static void nft_rbtree_remove(const struct net *net,
 			      const struct nft_set *set,
 			      const struct nft_set_elem *elem)
 {
+	struct nft_rbtree_elem *rbe = elem->priv, *new;
 	struct nft_rbtree *priv = nft_set_priv(set);
-	struct nft_rbtree_elem *rbe = elem->priv;
 
 	write_lock_bh(&priv->lock);
 	write_seqcount_begin(&priv->count);
-	rb_erase(&rbe->node, &priv->root);
+	if (!list_empty(&rbe->list)) {
+		list_del(&rbe->list);
+	} else if (!list_empty(&rbe->pending_list)) {
+		new = list_first_entry(&rbe->pending_list, struct nft_rbtree_elem, list);
+		list_del_init(&new->list);
+		list_splice(&rbe->pending_list, &new->pending_list);
+		rb_replace_node(&rbe->node, &new->node, &priv->root);
+	} else {
+		rb_erase(&rbe->node, &priv->root);
+	}
 	write_seqcount_end(&priv->count);
 	write_unlock_bh(&priv->lock);
 }
@@ -426,9 +457,9 @@ static void *nft_rbtree_deactivate(const struct net *net,
 				   const struct nft_set *set,
 				   const struct nft_set_elem *elem)
 {
+	struct nft_rbtree_elem *rbe, *this = elem->priv, *leaf;
 	const struct nft_rbtree *priv = nft_set_priv(set);
 	const struct rb_node *parent = priv->root.rb_node;
-	struct nft_rbtree_elem *rbe, *this = elem->priv;
 	u8 genmask = nft_genmask_next(net);
 	int d;
 
@@ -450,12 +481,14 @@ static void *nft_rbtree_deactivate(const struct net *net,
 				   nft_rbtree_interval_end(this)) {
 				parent = parent->rb_right;
 				continue;
-			} else if (!nft_set_elem_active(&rbe->ext, genmask)) {
-				parent = parent->rb_left;
-				continue;
+			} else if (!nft_rbtree_elem_active(rbe, genmask)) {
+				break;
 			}
-			nft_rbtree_flush(net, set, rbe);
-			return rbe;
+			leaf = nft_rbtree_elem_leaf(rbe, genmask);
+			if (!nft_rbtree_flush(net, set, leaf))
+				break;
+
+			return leaf;
 		}
 	}
 	return NULL;
@@ -466,7 +499,7 @@ static void nft_rbtree_walk(const struct nft_ctx *ctx,
 			    struct nft_set_iter *iter)
 {
 	struct nft_rbtree *priv = nft_set_priv(set);
-	struct nft_rbtree_elem *rbe;
+	struct nft_rbtree_elem *rbe, *leaf;
 	struct nft_set_elem elem;
 	struct rb_node *node;
 
@@ -476,12 +509,12 @@ static void nft_rbtree_walk(const struct nft_ctx *ctx,
 
 		if (iter->count < iter->skip)
 			goto cont;
-		if (nft_set_elem_expired(&rbe->ext))
-			goto cont;
-		if (!nft_set_elem_active(&rbe->ext, iter->genmask))
+
+		leaf = nft_rbtree_elem_leaf(rbe, iter->genmask);
+		if (!__nft_rbtree_elem_active(leaf, iter->genmask))
 			goto cont;
 
-		elem.priv = rbe;
+		elem.priv = leaf;
 
 		iter->err = iter->fn(ctx, set, iter, &elem);
 		if (iter->err < 0) {


* Re: [PATCH nf] nft_set_rbtree: Switch to node list walk for overlap detection
  2022-06-27 16:59 ` Pablo Neira Ayuso
@ 2022-06-29  8:50   ` Stefano Brivio
  2022-07-01 23:55   ` Stefano Brivio
  1 sibling, 0 replies; 9+ messages in thread
From: Stefano Brivio @ 2022-06-29  8:50 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

Hi Pablo,

On Mon, 27 Jun 2022 18:59:06 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> Hi Stefano,
> 
> On Tue, Jun 14, 2022 at 03:07:04AM +0200, Stefano Brivio wrote:
> > ...instead of a tree descent, which became overly complicated in an
> > attempt to cover cases where expired or inactive elements would
> > affect comparisons with the new element being inserted.
> >
> > Further, it turned out that it's probably impossible to cover all
> > those cases, as inactive nodes might entirely hide subtrees
> > consisting of a complete interval plus a node that makes the current
> > insertion not overlap.
> >
> > For the insertion operation itself, this essentially reverts back to
> > the implementation before commit 7c84d41416d8
> > ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion"),
> > except that cases of complete overlap are already handled in the
> > overlap detection phase itself, which slightly simplifies the loop to
> > find the insertion point.
> >
> > Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
> > Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  net/netfilter/nft_set_rbtree.c | 194 ++++++++++-----------------------
> >  1 file changed, 58 insertions(+), 136 deletions(-)  
> 
> When running tests this is increasing the time to detect overlaps in
> my testbed, because of the linear list walk for each element.
> 
> So I have been looking at an alternative approach (see attached patch) to
> address your comments. The idea is to move out the overlapping nodes
> from the element in the tree, instead keep them in a list.
> 
>                         root
>                         /  \
>                      elem   elem -> update -> update
>                             /  \
>                          elem  elem
> 
> Each rbtree element in the tree .has pending_list which stores the
> element that supersede the existing (inactive) element. There is also a
> .list which is used to add the element to the .pending_list. Elements
> in the tree might have a .pending_list with one or more elements.
> 
> The .deactivate path looks for the last (possibly) active element. The
> .remove path depends on the element state: a) element is singleton (no
> pending elements), then erase from tree, b) element has a pending
> list, then replace the first element in the pending_list by this node,
> and splice pending_list (there might be more elements), c) this
> element is in the pending_list, the just remove it. This handles both
> commit (walks through the list of transaction forward direction) and
> abort path (walks through the list of transactions in backward
> direction).

I think that's brilliant, give me a couple of days to have a thorough
look at it.

-- 
Stefano



* Re: [PATCH nf] nft_set_rbtree: Switch to node list walk for overlap detection
  2022-06-27 16:59 ` Pablo Neira Ayuso
  2022-06-29  8:50   ` Stefano Brivio
@ 2022-07-01 23:55   ` Stefano Brivio
  2022-07-05 11:53     ` Pablo Neira Ayuso
  1 sibling, 1 reply; 9+ messages in thread
From: Stefano Brivio @ 2022-07-01 23:55 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Mon, 27 Jun 2022 18:59:06 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> Hi Stefano,
> 
> On Tue, Jun 14, 2022 at 03:07:04AM +0200, Stefano Brivio wrote:
> > ...instead of a tree descent, which became overly complicated in an
> > attempt to cover cases where expired or inactive elements would
> > affect comparisons with the new element being inserted.
> >
> > Further, it turned out that it's probably impossible to cover all
> > those cases, as inactive nodes might entirely hide subtrees
> > consisting of a complete interval plus a node that makes the current
> > insertion not overlap.
> >
> > For the insertion operation itself, this essentially reverts back to
> > the implementation before commit 7c84d41416d8
> > ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion"),
> > except that cases of complete overlap are already handled in the
> > overlap detection phase itself, which slightly simplifies the loop to
> > find the insertion point.
> >
> > Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
> > Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  net/netfilter/nft_set_rbtree.c | 194 ++++++++++-----------------------
> >  1 file changed, 58 insertions(+), 136 deletions(-)  
> 
> When running tests this is increasing the time to detect overlaps in
> my testbed, because of the linear list walk for each element.

...by the way, I observed it as well, and I was wondering: how bad is
too bad? My guess was that as long as we insert a few thousand elements
(with more, I expect hash or pipapo to be used) in a few seconds, it
should be good enough.

> So I have been looking at an alternative approach (see attached patch) to
> address your comments. The idea is to move out the overlapping nodes
> from the element in the tree, instead keep them in a list.
> 
>                         root
>                         /  \
>                      elem   elem -> update -> update
>                             /  \
>                          elem  elem
> 
> Each rbtree element in the tree .has pending_list which stores the
> element that supersede the existing (inactive) element. There is also a
> .list which is used to add the element to the .pending_list. Elements
> in the tree might have a .pending_list with one or more elements.

I see a problem with this, that perhaps you already solved, but I don't
understand how.

The original issue here was that we have inactive elements in the tree
affecting the way we descend it to look for overlaps. Those inactive
elements are not necessarily overlapping with anything.

If they overlap, the issue is solved with your patch. But if they
don't...?

Sure, we'll grant insertion of overlapping elements in case the overlap
is with an inactive one, but this solves the particular case of
matching elements, not overlapping intervals.

At a first reading, I thought you found some magic way to push out all
inactive elements to some parallel, linked structure, which we can
ignore as we look for overlapping _intervals_. But that doesn't seem to
be the case, right?

-- 
Stefano



* Re: [PATCH nf] nft_set_rbtree: Switch to node list walk for overlap detection
  2022-07-01 23:55   ` Stefano Brivio
@ 2022-07-05 11:53     ` Pablo Neira Ayuso
  2022-07-06 21:12       ` Stefano Brivio
  0 siblings, 1 reply; 9+ messages in thread
From: Pablo Neira Ayuso @ 2022-07-05 11:53 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: netfilter-devel

Hi Stefano,

On Sat, Jul 02, 2022 at 01:55:10AM +0200, Stefano Brivio wrote:
> On Mon, 27 Jun 2022 18:59:06 +0200
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > Hi Stefano,
> > 
> > On Tue, Jun 14, 2022 at 03:07:04AM +0200, Stefano Brivio wrote:
> > > ...instead of a tree descent, which became overly complicated in an
> > > attempt to cover cases where expired or inactive elements would
> > > affect comparisons with the new element being inserted.
> > >
> > > Further, it turned out that it's probably impossible to cover all
> > > those cases, as inactive nodes might entirely hide subtrees
> > > consisting of a complete interval plus a node that makes the current
> > > insertion not overlap.
> > >
> > > For the insertion operation itself, this essentially reverts back to
> > > the implementation before commit 7c84d41416d8
> > > ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion"),
> > > except that cases of complete overlap are already handled in the
> > > overlap detection phase itself, which slightly simplifies the loop to
> > > find the insertion point.
> > >
> > > Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
> > > Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  net/netfilter/nft_set_rbtree.c | 194 ++++++++++-----------------------
> > >  1 file changed, 58 insertions(+), 136 deletions(-)  
> > 
> > When running tests this is increasing the time to detect overlaps in
> > my testbed, because of the linear list walk for each element.
> 
> ...by the way, I observed it as well, and I was wondering: how bad is
> too bad? My guess was that as long as we insert a few thousand elements
> (with more, I expect hash or pipapo to be used) in a few seconds, it
> should be good enough.

From a few seconds to less than 30 seconds in one testbed here.

> > So I have been looking at an alternative approach (see attached patch) to
> > address your comments. The idea is to move out the overlapping nodes
> > from the element in the tree, instead keep them in a list.
> > 
> >                         root
> >                         /  \
> >                      elem   elem -> update -> update
> >                             /  \
> >                          elem  elem
> > 
> > Each rbtree element in the tree .has pending_list which stores the
> > element that supersede the existing (inactive) element. There is also a
> > .list which is used to add the element to the .pending_list. Elements
> > in the tree might have a .pending_list with one or more elements.
> 
> I see a problem with this, that perhaps you already solved, but I don't
> understand how.
> 
> The original issue here was that we have inactive elements in the tree
> affecting the way we descend it to look for overlaps. Those inactive
> elements are not necessarily overlapping with anything.
> 
> If they overlap, the issue is solved with your patch. But if they
> don't...?
>
> Sure, we'll grant insertion of overlapping elements in case the overlap
> is with an inactive one, but this solves the particular case of
> matching elements, not overlapping intervals.
> 
> At a first reading, I thought you found some magic way to push out all
> inactive elements to some parallel, linked structure, which we can
> ignore as we look for overlapping _intervals_. But that doesn't seem to
> be the case, right?

With my patch, when descending the tree, the right or left branch is
selected based solely on the key value (regardless of the element
state); I removed the "turn left when the node is inactive" case. There
are also no more duplicated elements with the same value.


* Re: [PATCH nf] nft_set_rbtree: Switch to node list walk for overlap detection
  2022-07-05 11:53     ` Pablo Neira Ayuso
@ 2022-07-06 21:12       ` Stefano Brivio
  2022-07-15 11:13         ` Pablo Neira Ayuso
  0 siblings, 1 reply; 9+ messages in thread
From: Stefano Brivio @ 2022-07-06 21:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Tue, 5 Jul 2022 13:53:47 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> Hi Stefano,
> 
> On Sat, Jul 02, 2022 at 01:55:10AM +0200, Stefano Brivio wrote:
> > On Mon, 27 Jun 2022 18:59:06 +0200
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >   
> > > Hi Stefano,
> > > 
> > > On Tue, Jun 14, 2022 at 03:07:04AM +0200, Stefano Brivio wrote:  
> > > > ...instead of a tree descent, which became overly complicated in an
> > > > attempt to cover cases where expired or inactive elements would
> > > > affect comparisons with the new element being inserted.
> > > >
> > > > Further, it turned out that it's probably impossible to cover all
> > > > those cases, as inactive nodes might entirely hide subtrees
> > > > consisting of a complete interval plus a node that makes the current
> > > > insertion not overlap.
> > > >
> > > > For the insertion operation itself, this essentially reverts back to
> > > > the implementation before commit 7c84d41416d8
> > > > ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion"),
> > > > except that cases of complete overlap are already handled in the
> > > > overlap detection phase itself, which slightly simplifies the loop to
> > > > find the insertion point.
> > > >
> > > > Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
> > > > Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
> > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > ---
> > > >  net/netfilter/nft_set_rbtree.c | 194 ++++++++++-----------------------
> > > >  1 file changed, 58 insertions(+), 136 deletions(-)    
> > > 
> > > When running tests this is increasing the time to detect overlaps in
> > > my testbed, because of the linear list walk for each element.  
> > 
> > ...by the way, I observed it as well, and I was wondering: how bad is
> > too bad? My guess was that as long as we insert a few thousand elements
> > (with more, I expect hash or pipapo to be used) in a few seconds, it
> > should be good enough.  
> 
> From few seconds to less than 30 seconds in one testbed here.

I didn't understand: so it's really bad? Because sure, I see that
running the test scripts you shared now takes much longer, but I wonder
how far that is from actual use cases.

> > > So I have been looking at an alternative approach (see attached patch) to
> > > address your comments. The idea is to move out the overlapping nodes
> > > from the element in the tree, instead keep them in a list.
> > > 
> > >                         root
> > >                         /  \
> > >                      elem   elem -> update -> update
> > >                             /  \
> > >                          elem  elem
> > > 
> > > Each rbtree element in the tree .has pending_list which stores the
> > > element that supersede the existing (inactive) element. There is also a
> > > .list which is used to add the element to the .pending_list. Elements
> > > in the tree might have a .pending_list with one or more elements.  
> > 
> > I see a problem with this, that perhaps you already solved, but I don't
> > understand how.
> > 
> > The original issue here was that we have inactive elements in the tree
> > affecting the way we descend it to look for overlaps. Those inactive
> > elements are not necessarily overlapping with anything.
> > 
> > If they overlap, the issue is solved with your patch. But if they
> > don't...?
> >
> > Sure, we'll grant insertion of overlapping elements in case the overlap
> > is with an inactive one, but this solves the particular case of
> > matching elements, not overlapping intervals.
> > 
> > At a first reading, I thought you found some magic way to push out all
> > inactive elements to some parallel, linked structure, which we can
> > ignore as we look for overlapping _intervals_. But that doesn't seem to
> > be the case, right?  
> 
> With my patch, when descending the tree, the right or left branch is
> selected uniquely based on the key value (regardless the element
> state)

Hmm, but wait, that was exactly the problem I introduced (or at least
my understanding of it): if subtrees of entirely inactive nodes hide
(by affecting how we descend the tree) active nodes (that affect the
overlap decision), we can't actually reach any conclusion.

It's fine as long as it's inactive leaves, or single nodes, but not
more.

Let's say we need to insert element with key 6, as interval start,
given this tree, where:

- (s) and (e) mark starts and ends
- (i) and (a) mark inactive and active elements

                 4 (s, i)
                 /      \
                /        \
        2 (s, i)          7 (e, i)
        /      \          /      \
       /        \        /        \
  1 (s, a)  3 (e, a)  5 (s, i)  8 (s, i)

we visit elements with keys 4, 7, 5: we have an inactive start on the
left (5), an inactive end to the right (7), and we don't know if it
overlaps, because we couldn't see 1 and 3. Depending on how hard we
tried to fix bugs we hit in the past, we'll consider that an overlap
or not.

Without inactive elements, we would have this tree:

1 (s, a)
     \
      \
    3 (e, a)

where we visit elements with keys 1, 3: we find an active start on
the left, and then the corresponding end, also to the left, so we
conclude that start element with key 6 doesn't overlap.

> I removed the "turn left" when node is inactive case. There
> are also no more duplicated elements with the same value.

This simplifies the handling of those cases; we wouldn't need all those
clauses anymore. But I really think that the existing problem comes from
the fact that we can *not* descend the tree just by selecting key values.

-- 
Stefano



* Re: [PATCH nf] nft_set_rbtree: Switch to node list walk for overlap detection
  2022-07-06 21:12       ` Stefano Brivio
@ 2022-07-15 11:13         ` Pablo Neira Ayuso
  2022-07-17 13:39           ` Stefano Brivio
  2022-07-19 15:47           ` Stefano Brivio
  0 siblings, 2 replies; 9+ messages in thread
From: Pablo Neira Ayuso @ 2022-07-15 11:13 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1404 bytes --]

On Wed, Jul 06, 2022 at 11:12:42PM +0200, Stefano Brivio wrote:
> On Tue, 5 Jul 2022 13:53:47 +0200
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
[...]
> This simplifies the handling of those cases, we wouldn't need all those
> clauses anymore, but I really think that the existing problem comes from
> the fact we can *not* descend the tree just by selecting key values.

Thanks for explaining.

Traversing the rbtree via rb_first() and rb_next() is like an ordered
linear list walk; maybe it is possible to reduce the number of
elements to walk in order to find an overlap?

I'm attaching an incremental patch on top of yours; the idea is:

1) find the closest element whose key is less than the new element
   by descending the tree. This provides the first node to walk.

2) annotate closest active element that is less than the new element,
   walking over the ordered list.

3) annotate closest active element that is more than the new element,
   Stop walking the ordered list.

4) if new element is an exact match, then EEXIST.

5) if new element is end and closest less than element is end, or
   if new element is start and closest less than element is start, or
   if new element is end and closest more than element is end,
   Then ENOTEMPTY.

Inactive/expired elements are skipped while walking the ordered linear
list as usual.

With this incremental patch, I don't observe longer time to load
interval sets.

[-- Attachment #2: x.patch --]
[-- Type: text/x-diff, Size: 3136 bytes --]

commit 796b1f40a42b505d0e614fd2fbb6dad9f4e3c2c5
Author: Pablo Neira Ayuso <pablo@netfilter.org>
Date:   Fri Jul 15 10:38:31 2022 +0200

    x

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 87a28d2dca77..176173f770fd 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -220,13 +220,37 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 			       struct nft_set_ext **ext)
 {
 	struct nft_rbtree_elem *rbe, *rbe_le = NULL, *rbe_ge = NULL;
+	struct rb_node *node, *parent, **p, *first = NULL;
 	struct nft_rbtree *priv = nft_set_priv(set);
 	u8 genmask = nft_genmask_next(net);
-	struct rb_node *node, *parent, **p;
 	int d;
 
+	parent = NULL;
+	p = &priv->root.rb_node;
+	while (*p != NULL) {
+		parent = *p;
+		rbe = rb_entry(parent, struct nft_rbtree_elem, node);
+		d = nft_rbtree_cmp(set, rbe, new);
+
+		if (d < 0)
+			p = &parent->rb_left;
+		else if (d > 0) {
+			first = &rbe->node;
+			p = &parent->rb_right;
+		} else {
+			first = &rbe->node;
+			if (nft_rbtree_interval_end(rbe))
+				p = &parent->rb_left;
+			else
+				p = &parent->rb_right;
+		}
+	}
+
+	if (!first)
+		first = rb_first(&priv->root);
+
 	/* Detect overlaps by going through the list of valid tree nodes: */
-	for (node = rb_first(&priv->root); node != NULL; node = rb_next(node)) {
+	for (node = first; node != NULL; node = rb_next(node)) {
 		rbe = rb_entry(node, struct nft_rbtree_elem, node);
 
 		if (!nft_set_elem_active(&rbe->ext, genmask) ||
@@ -235,9 +259,13 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 
 		d = nft_rbtree_cmp(set, rbe, new);
 
-		if (d <= 0 && (!rbe_le || nft_rbtree_cmp(set, rbe, rbe_le) > 0))
+		/* annotate element coming before new element. */
+		if (d < 0 && (!rbe_le || nft_rbtree_cmp(set, rbe, rbe_le) > 0)) {
 			rbe_le = rbe;
+			break;
+		}
 
+		/* annotate existing element coming after new element. */
 		if (d >= 0 && (!rbe_ge || nft_rbtree_cmp(set, rbe, rbe_ge) < 0))
 			rbe_ge = rbe;
 	}
@@ -246,7 +274,7 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 	 *   matching end element: full overlap reported as -EEXIST, cleared by
 	 *   caller if NLM_F_EXCL is not given
 	 */
-	rbe = rbe_le;
+	rbe = rbe_ge;
 	if (rbe && !nft_rbtree_cmp(set, new, rbe) &&
 	    nft_rbtree_interval_start(rbe) == nft_rbtree_interval_start(new)) {
 		*ext = &rbe->ext;
@@ -257,7 +285,14 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 	 *   being a start element: partial overlap, reported as -ENOTEMPTY
 	 */
 	if (rbe_le &&
-	    nft_rbtree_interval_start(rbe_le) && nft_rbtree_interval_start(new))
+	    nft_rbtree_interval_end(rbe_le) && nft_rbtree_interval_end(new))
+		return -ENOTEMPTY;
+
+	/* - new start element before existing closest, less or equal key value
+	 *   element: partial overlap, reported as -ENOTEMPTY
+	 */
+	if (rbe_ge &&
+	    nft_rbtree_interval_start(rbe_ge) && nft_rbtree_interval_start(new))
 		return -ENOTEMPTY;
 
 	/* - new end element with existing closest, greater or equal key value


* Re: [PATCH nf] nft_set_rbtree: Switch to node list walk for overlap detection
  2022-07-15 11:13         ` Pablo Neira Ayuso
@ 2022-07-17 13:39           ` Stefano Brivio
  2022-07-19 15:47           ` Stefano Brivio
  1 sibling, 0 replies; 9+ messages in thread
From: Stefano Brivio @ 2022-07-17 13:39 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Fri, 15 Jul 2022 13:13:52 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Wed, Jul 06, 2022 at 11:12:42PM +0200, Stefano Brivio wrote:
> > On Tue, 5 Jul 2022 13:53:47 +0200
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:  
> [...]
> > This simplifies the handling of those cases, we wouldn't need all those
> > clauses anymore, but I really think that the existing problem comes from
> > the fact we can *not* descend the tree just by selecting key values.  
> 
> Thanks for explaining.
> 
> The traversal rbtree via rb_first() and rb_next() is like an ordered
> linear list walk, maybe it is possible to reduce the number of
> elements to find an overlap?

Hah, yes, that's what I was also thinking about, but:

> I'm attaching an incremental patch on top of yours, idea is:
> 
> 1) find the closest element whose key is less than the new element
>    by descending the tree. This provides the first node to walk.

this step is also prone to the original issue, that is, the choice of
the first node to walk might also be affected by inactive elements.

Now, I _guess_ the way you implemented it (not checking if elements
are active in this step) should be correct in any case, because the
tree ordering is always correct, so any active element that might be
"hidden" should become part of the walk anyway.

But give me a couple of days to think about it and to check whether there
might be other corner cases.

> 2) annotate closest active element that is less than the new element,
>    walking over the ordered list.
> 
> 3) annotate closest active element that is more than the new element,
>    Stop walking the ordered list.
> 
> 4) if new element is an exact match, then EEXIST.
> 
> 5) if new element is end and closest less than element is end, or
>    if new element is start and closest less than element is start, or
>    if new element is end and closest more than element is end,
>    Then ENOTEMPTY.

All these look safe (and correct) to me.

> Inactive/expired elements are skipped while walking the ordered linear
> list as usual.
> 
> With this incremental patch, I don't observe longer time to load
> interval sets.

Thanks a lot. I'll get back to you in a bit.

-- 
Stefano



* Re: [PATCH nf] nft_set_rbtree: Switch to node list walk for overlap detection
  2022-07-15 11:13         ` Pablo Neira Ayuso
  2022-07-17 13:39           ` Stefano Brivio
@ 2022-07-19 15:47           ` Stefano Brivio
  1 sibling, 0 replies; 9+ messages in thread
From: Stefano Brivio @ 2022-07-19 15:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Fri, 15 Jul 2022 13:13:52 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Wed, Jul 06, 2022 at 11:12:42PM +0200, Stefano Brivio wrote:
> > On Tue, 5 Jul 2022 13:53:47 +0200
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:  
> [...]
> > This simplifies the handling of those cases, we wouldn't need all those
> > clauses anymore, but I really think that the existing problem comes from
> > the fact we can *not* descend the tree just by selecting key values.  
> 
> Thanks for explaining.
> 
> The traversal rbtree via rb_first() and rb_next() is like an ordered
> linear list walk, maybe it is possible to reduce the number of
> elements to find an overlap?
> 
> I'm attaching an incremental patch on top of yours, idea is:
> 
> 1) find the closest element whose key is less than the new element
>    by descending the tree. This provides the first node to walk.

I think this is almost correct, but we need to modify it slightly.
Consider this tree:

      A: 1 (s, a)
       /        \
      /          \
B: 1 (s, i)    C: 3 (e, i)

where, again, 's': starts, 'e': ends, 'a': active, 'i': inactive. Nodes
are additionally named.

We want to insert 2 as a start element, and 'first' in your patch
becomes leaf B, "1 (s, i)" -- not the A node, "1 (s, a)".

Now, depending on the red-black tree insertion implementation (I didn't
bother to check, I guess it should be independent), in the list walk,
A might come before or after B.

If B is before A, fine, we'll meet A and mark it as "rbe_le" (closest
from left).

If A comes before B, we won't meet A, and we won't have a closest
element from the left. This affects the overlapping decision.

The only ambiguity here is represented by elements having the same key,
and a set of such elements is allowed in the tree iff at most one is
active. So we just need to avoid replacing an active "first" by an
inactive node with the same key. Eventually, we'll visit all the nodes
having the same keys, if any:

> +	parent = NULL;
> +	p = &priv->root.rb_node;
> +	while (*p != NULL) {
> +		parent = *p;
> +		rbe = rb_entry(parent, struct nft_rbtree_elem, node);
> +		d = nft_rbtree_cmp(set, rbe, new);
> +
> +		if (d < 0)
> +			p = &parent->rb_left;
> +		else if (d > 0) {
> +			first = &rbe->node;

			if (!first || nft_rbtree_cmp(set, rbe, first) ||
			    nft_set_elem_expired(&first->ext) ||
			    !nft_set_elem_active(&first->ext, genmask))
				first = &rbe->node;

> +			p = &parent->rb_right;
> +		} else {
> +			first = &rbe->node;

...and the same here. Maybe we should re-introduce the "expired or
inactive" helper.
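Something like this, just to spell it out (the name is only a suggestion):

	static bool nft_rbtree_elem_inactive(const struct nft_rbtree_elem *rbe,
					     u8 genmask)
	{
		return !nft_set_elem_active(&rbe->ext, genmask) ||
		       nft_set_elem_expired(&rbe->ext);
	}

so that the expired/inactive pair of checks can be written as a single call.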

> +			if (nft_rbtree_interval_end(rbe))
> +				p = &parent->rb_left;
> +			else
> +				p = &parent->rb_right;
> +		}
> +	}
> +
> +	if (!first)
> +		first = rb_first(&priv->root);

> 2) annotate closest active element that is less than the new element,
>    walking over the ordered list.

The description looks correct to me, but I'm not sure why you add a
break here:

> -		if (d <= 0 && (!rbe_le || nft_rbtree_cmp(set, rbe, rbe_le) > 0))
> +		/* annotate element coming before new element. */
> +		if (d < 0 && (!rbe_le || nft_rbtree_cmp(set, rbe, rbe_le) > 0)) {
>  			rbe_le = rbe;
> +			break;
> +		}

we should stop here iff rbe_ge is already set, right? Otherwise we are
skipping step 3) below.
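Something like this, as a sketch on top of your incremental patch:

		/* annotate element coming before new element. */
		if (d < 0 && (!rbe_le || nft_rbtree_cmp(set, rbe, rbe_le) > 0)) {
			rbe_le = rbe;
			if (rbe_ge)	/* closest greater element already found */
				break;
		}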

> 3) annotate closest active element that is more than the new element,
>    Stop walking the ordered list.
> 
> 4) if new element is an exact match, then EEXIST.
> 
> 5) if new element is end and closest less than element is end, or
>    if new element is start and closest less than element is start, or
>    if new element is end and closest more than element is end,
>    Then ENOTEMPTY.
> 
> Inactive/expired elements are skipped while walking the ordered linear
> list as usual.
> 
> With this incremental patch, I don't observe longer time to load
> interval sets.

Everything else looks good to me, thanks a lot, I hope we're finally
sorting this for good :)

-- 
Stefano


