netdev.vger.kernel.org archive mirror
* [RFC net-next 0/2] prevent sync issues with hw offload of flower
@ 2019-10-02 23:14 John Hurley
  2019-10-02 23:14 ` [RFC net-next 1/2] net: sched: add tp_op for pre_destroy John Hurley
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: John Hurley @ 2019-10-02 23:14 UTC (permalink / raw)
  To: vladbu
  Cc: jiri, netdev, simon.horman, jakub.kicinski, oss-drivers, John Hurley

Hi,

Putting this out as an RFC built on net-next. It fixes some issues
discovered in testing when using the TC API of OvS to generate flower
rules and subsequently offload them to HW. The rules seen may contain
the same match fields, or may be rule modifications that are run as a
delete plus an add. We're seeing race conditions whereby the rules
present in kernel flower get out of sync with those offloaded. Note
that there are some issues that will need to be fixed in the RFC before
it becomes a patch, such as potential races between releasing locks and
re-taking them. However, I'm putting this out for comments or potential
alternative solutions.

The main cause of the races seems to be the chain table of cls_api. If
a tcf_proto is destroyed then it is removed from its chain. If a new
filter is then added to the same chain with the same priority and
protocol, a new tcf_proto will be created - this may happen before the
first is fully removed and the hw offload message sent to the driver.
In cls_flower this means that the fl_ht_insert_unique() function can
pass, because its hashtable is associated with the tcf_proto. We are
then in a position where the 'delete' and the 'add' are in a race to
get offloaded. We also noticed that doing an offload add, then checking
if the tcf_proto is concurrently deleting, then removing the offload if
it is, can extend the window for out-of-order messages. Drivers do not
expect to get duplicate rules. However, in the kernel TC datapath they
are not duplicates, so we can get out of sync here.
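
To illustrate the driver side of this, below is a rough sketch of a
flower offload callback of the kind affected. The code is hypothetical
(the foo_* names are placeholders, not from any real driver); it only
shows that a driver which keys its own tables on the rule match has no
good way to cope with an 'add' and a late 'delete' for the same match
arriving out of order.

static int foo_setup_tc_cls_flower(struct foo_priv *priv,
				   struct flow_cls_offload *cls)
{
	switch (cls->command) {
	case FLOW_CLS_REPLACE:
		/* install the rule in hw, keyed on its match fields */
		return foo_add_rule(priv, cls);
	case FLOW_CLS_DESTROY:
		/* remove the rule from hw; if this arrives after a racing
		 * REPLACE with an identical match, hw and kernel flower
		 * diverge
		 */
		return foo_del_rule(priv, cls);
	default:
		return -EOPNOTSUPP;
	}
}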

The RFC fixes this by adding a pre_destroy hook to cls_api that is
called when a tcf_proto is signaled to be destroyed but before it is
removed from its chain (chain membership essentially acts as the lock
that prevents duplicates in flower). Flower then uses this new hook to
send the hw delete messages for tcf_proto destroys, preventing them
from racing with duplicate adds. It also moves the check for 'deleting'
to before the hw add message is sent.
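
In rough terms, the delete-path ordering the series aims for is
sketched below. This is just a condensed view of the cls_api changes in
patch 1 (locking subtleties and error handling omitted):

	/* sketch only - see patch 1 for the real code */
	tcf_proto_pre_destroy(tp, rtnl_held, extack);	/* flower sends hw deletes */

	/* tp is only unlinked from its chain after the hw deletes, so a
	 * racing add with the same prio/protocol cannot create a
	 * duplicate tcf_proto before the delete reaches the driver
	 */
	mutex_lock(&chain->filter_chain_lock);
	tcf_chain_tp_remove(chain, &chain_info, tp);
	mutex_unlock(&chain->filter_chain_lock);

	tcf_proto_put(tp, rtnl_held, extack);	/* ->destroy() when the ref hits 0 */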

John Hurley (2):
  net: sched: add tp_op for pre_destroy
  net: sched: fix tp destroy race conditions in flower

 include/net/sch_generic.h |  3 +++
 net/sched/cls_api.c       | 29 ++++++++++++++++++++++++-
 net/sched/cls_flower.c    | 55 ++++++++++++++++++++++++++---------------------
 3 files changed, 61 insertions(+), 26 deletions(-)

-- 
2.7.4



* [RFC net-next 1/2] net: sched: add tp_op for pre_destroy
  2019-10-02 23:14 [RFC net-next 0/2] prevent sync issues with hw offload of flower John Hurley
@ 2019-10-02 23:14 ` John Hurley
  2019-10-02 23:14 ` [RFC net-next 2/2] net: sched: fix tp destroy race conditions in flower John Hurley
  2019-10-03 16:26 ` [RFC net-next 0/2] prevent sync issues with hw offload of flower Vlad Buslov
  2 siblings, 0 replies; 11+ messages in thread
From: John Hurley @ 2019-10-02 23:14 UTC (permalink / raw)
  To: vladbu
  Cc: jiri, netdev, simon.horman, jakub.kicinski, oss-drivers, John Hurley

A race condition may exist when a tcf_proto is destroyed. Several
actions occur before the destroy() tcf_proto_op is called, so if no
higher-level locking (e.g. RTNL) is in use then other rules may be
received and processed in parallel before the classifier's specific
destroy functions have completed.

Add a new tcf_proto_op called pre_destroy that is triggered when a
tcf_proto is signaled to be destroyed. This gives classifiers the
option of carrying out work at this point.

Fixes: 1d965c4def07 ("Refactor flower classifier to remove dependency on rtnl lock")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reported-by: Louis Peens <louis.peens@netronome.com>
---
 include/net/sch_generic.h |  3 +++
 net/sched/cls_api.c       | 29 ++++++++++++++++++++++++++++-
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 637548d..e458d6f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -294,6 +294,9 @@ struct tcf_proto_ops {
 					    const struct tcf_proto *,
 					    struct tcf_result *);
 	int			(*init)(struct tcf_proto*);
+	void			(*pre_destroy)(struct tcf_proto *tp,
+					       bool rtnl_held,
+					       struct netlink_ext_ack *extack);
 	void			(*destroy)(struct tcf_proto *tp, bool rtnl_held,
 					   struct netlink_ext_ack *extack);
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 64584a1..aecf716 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -222,6 +222,13 @@ static void tcf_proto_get(struct tcf_proto *tp)
 
 static void tcf_chain_put(struct tcf_chain *chain);
 
+static void tcf_proto_pre_destroy(struct tcf_proto *tp, bool rtnl_held,
+				  struct netlink_ext_ack *extack)
+{
+	if (tp->ops->pre_destroy)
+		tp->ops->pre_destroy(tp, rtnl_held, extack);
+}
+
 static void tcf_proto_destroy(struct tcf_proto *tp, bool rtnl_held,
 			      struct netlink_ext_ack *extack)
 {
@@ -534,9 +541,18 @@ static void tcf_chain_flush(struct tcf_chain *chain, bool rtnl_held)
 
 	mutex_lock(&chain->filter_chain_lock);
 	tp = tcf_chain_dereference(chain->filter_chain, chain);
+	chain->flushing = true;
+	mutex_unlock(&chain->filter_chain_lock);
+
+	while (tp) {
+		tcf_proto_pre_destroy(tp, rtnl_held, NULL);
+		tp = rcu_dereference_protected(tp->next, 1);
+	}
+
+	mutex_lock(&chain->filter_chain_lock);
+	tp = tcf_chain_dereference(chain->filter_chain, chain);
 	RCU_INIT_POINTER(chain->filter_chain, NULL);
 	tcf_chain0_head_change(chain, NULL);
-	chain->flushing = true;
 	mutex_unlock(&chain->filter_chain_lock);
 
 	while (tp) {
@@ -1636,6 +1652,8 @@ static void tcf_chain_tp_delete_empty(struct tcf_chain *chain,
 	struct tcf_proto **pprev;
 	struct tcf_proto *next;
 
+	tcf_proto_pre_destroy(tp, rtnl_held, extack);
+
 	mutex_lock(&chain->filter_chain_lock);
 
 	/* Atomically find and remove tp from chain. */
@@ -2164,6 +2182,15 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		err = -EINVAL;
 		goto errout_locked;
 	} else if (t->tcm_handle == 0) {
+		mutex_unlock(&chain->filter_chain_lock);
+
+		tcf_proto_pre_destroy(tp, rtnl_held, extack);
+		tcf_proto_put(tp, rtnl_held, NULL);
+
+		mutex_lock(&chain->filter_chain_lock);
+		/* Lookup again and take new ref incase of parallel update. */
+		tp = tcf_chain_tp_find(chain, &chain_info, protocol, prio,
+				       false);
 		tcf_chain_tp_remove(chain, &chain_info, tp);
 		mutex_unlock(&chain->filter_chain_lock);
 
-- 
2.7.4



* [RFC net-next 2/2] net: sched: fix tp destroy race conditions in flower
  2019-10-02 23:14 [RFC net-next 0/2] prevent sync issues with hw offload of flower John Hurley
  2019-10-02 23:14 ` [RFC net-next 1/2] net: sched: add tp_op for pre_destroy John Hurley
@ 2019-10-02 23:14 ` John Hurley
  2019-10-03 16:18   ` Vlad Buslov
  2019-10-03 16:26 ` [RFC net-next 0/2] prevent sync issues with hw offload of flower Vlad Buslov
  2 siblings, 1 reply; 11+ messages in thread
From: John Hurley @ 2019-10-02 23:14 UTC (permalink / raw)
  To: vladbu
  Cc: jiri, netdev, simon.horman, jakub.kicinski, oss-drivers, John Hurley

Flower has rule HW offload functions available that drivers can choose to
register for. For the deletion case, these are triggered after filters
have been removed from lookup tables both at the flower level and at the
higher cls_api level. With flower running without RTNL locking, this can
lead to races where HW offload messages get out of order.

Ensure HW offloads stay in line with the kernel tables by triggering
the sending of messages before the kernel processing is completed. For
destroyed tcf_protos, do this at the new pre_destroy hook. Similarly, if
a filter is being added, check that it is not concurrently being deleted
before offloading to hw, rather than the current approach of offloading,
then checking and reversing the offload if required.

Fixes: 1d965c4def07 ("Refactor flower classifier to remove dependency on rtnl lock")
Fixes: 272ffaadeb3e ("net: sched: flower: handle concurrent tcf proto deletion")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reported-by: Louis Peens <louis.peens@netronome.com>
---
 net/sched/cls_flower.c | 55 +++++++++++++++++++++++++++-----------------------
 1 file changed, 30 insertions(+), 25 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 74221e3..3ac47b5 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -513,13 +513,16 @@ static struct cls_fl_filter *__fl_get(struct cls_fl_head *head, u32 handle)
 }
 
 static int __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f,
-		       bool *last, bool rtnl_held,
+		       bool *last, bool rtnl_held, bool do_hw,
 		       struct netlink_ext_ack *extack)
 {
 	struct cls_fl_head *head = fl_head_dereference(tp);
 
 	*last = false;
 
+	if (do_hw && !tc_skip_hw(f->flags))
+		fl_hw_destroy_filter(tp, f, rtnl_held, extack);
+
 	spin_lock(&tp->lock);
 	if (f->deleted) {
 		spin_unlock(&tp->lock);
@@ -534,8 +537,6 @@ static int __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f,
 	spin_unlock(&tp->lock);
 
 	*last = fl_mask_put(head, f->mask);
-	if (!tc_skip_hw(f->flags))
-		fl_hw_destroy_filter(tp, f, rtnl_held, extack);
 	tcf_unbind_filter(tp, &f->res);
 	__fl_put(f);
 
@@ -563,7 +564,7 @@ static void fl_destroy(struct tcf_proto *tp, bool rtnl_held,
 
 	list_for_each_entry_safe(mask, next_mask, &head->masks, list) {
 		list_for_each_entry_safe(f, next, &mask->filters, list) {
-			__fl_delete(tp, f, &last, rtnl_held, extack);
+			__fl_delete(tp, f, &last, rtnl_held, false, extack);
 			if (last)
 				break;
 		}
@@ -574,6 +575,19 @@ static void fl_destroy(struct tcf_proto *tp, bool rtnl_held,
 	tcf_queue_work(&head->rwork, fl_destroy_sleepable);
 }
 
+static void fl_pre_destroy(struct tcf_proto *tp, bool rtnl_held,
+			   struct netlink_ext_ack *extack)
+{
+	struct cls_fl_head *head = fl_head_dereference(tp);
+	struct fl_flow_mask *mask, *next_mask;
+	struct cls_fl_filter *f, *next;
+
+	list_for_each_entry_safe(mask, next_mask, &head->masks, list)
+		list_for_each_entry_safe(f, next, &mask->filters, list)
+			if (!tc_skip_hw(f->flags))
+				fl_hw_destroy_filter(tp, f, rtnl_held, extack);
+}
+
 static void fl_put(struct tcf_proto *tp, void *arg)
 {
 	struct cls_fl_filter *f = arg;
@@ -1588,6 +1602,13 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	if (err)
 		goto errout_mask;
 
+	spin_lock(&tp->lock);
+	if (tp->deleting || (fold && fold->deleted)) {
+		err = -EAGAIN;
+		goto errout_lock;
+	}
+	spin_unlock(&tp->lock);
+
 	if (!tc_skip_hw(fnew->flags)) {
 		err = fl_hw_replace_filter(tp, fnew, rtnl_held, extack);
 		if (err)
@@ -1598,22 +1619,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 		fnew->flags |= TCA_CLS_FLAGS_NOT_IN_HW;
 
 	spin_lock(&tp->lock);
-
-	/* tp was deleted concurrently. -EAGAIN will cause caller to lookup
-	 * proto again or create new one, if necessary.
-	 */
-	if (tp->deleting) {
-		err = -EAGAIN;
-		goto errout_hw;
-	}
-
 	if (fold) {
-		/* Fold filter was deleted concurrently. Retry lookup. */
-		if (fold->deleted) {
-			err = -EAGAIN;
-			goto errout_hw;
-		}
-
 		fnew->handle = handle;
 
 		if (!in_ht) {
@@ -1624,7 +1630,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 						     &fnew->ht_node,
 						     params);
 			if (err)
-				goto errout_hw;
+				goto errout_lock;
 			in_ht = true;
 		}
 
@@ -1667,7 +1673,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 					    INT_MAX, GFP_ATOMIC);
 		}
 		if (err)
-			goto errout_hw;
+			goto errout_lock;
 
 		refcount_inc(&fnew->refcnt);
 		fnew->handle = handle;
@@ -1683,11 +1689,9 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 
 errout_ht:
 	spin_lock(&tp->lock);
-errout_hw:
+errout_lock:
 	fnew->deleted = true;
 	spin_unlock(&tp->lock);
-	if (!tc_skip_hw(fnew->flags))
-		fl_hw_destroy_filter(tp, fnew, rtnl_held, NULL);
 	if (in_ht)
 		rhashtable_remove_fast(&fnew->mask->ht, &fnew->ht_node,
 				       fnew->mask->filter_ht_params);
@@ -1713,7 +1717,7 @@ static int fl_delete(struct tcf_proto *tp, void *arg, bool *last,
 	bool last_on_mask;
 	int err = 0;
 
-	err = __fl_delete(tp, f, &last_on_mask, rtnl_held, extack);
+	err = __fl_delete(tp, f, &last_on_mask, rtnl_held, true, extack);
 	*last = list_empty(&head->masks);
 	__fl_put(f);
 
@@ -2509,6 +2513,7 @@ static struct tcf_proto_ops cls_fl_ops __read_mostly = {
 	.kind		= "flower",
 	.classify	= fl_classify,
 	.init		= fl_init,
+	.pre_destroy	= fl_pre_destroy,
 	.destroy	= fl_destroy,
 	.get		= fl_get,
 	.put		= fl_put,
-- 
2.7.4



* Re: [RFC net-next 2/2] net: sched: fix tp destroy race conditions in flower
  2019-10-02 23:14 ` [RFC net-next 2/2] net: sched: fix tp destroy race conditions in flower John Hurley
@ 2019-10-03 16:18   ` Vlad Buslov
  2019-10-03 16:39     ` John Hurley
  0 siblings, 1 reply; 11+ messages in thread
From: Vlad Buslov @ 2019-10-03 16:18 UTC (permalink / raw)
  To: John Hurley
  Cc: Vlad Buslov, Jiri Pirko, netdev, simon.horman, jakub.kicinski,
	oss-drivers


On Thu 03 Oct 2019 at 02:14, John Hurley <john.hurley@netronome.com> wrote:
> Flower has rule HW offload functions available that drivers can choose to
> register for. For the deletion case, these are triggered after filters
> have been removed from lookup tables both at the flower level, and the
> higher cls_api level. With flower running without RTNL locking, this can
> lead to races where HW offload messages get out of order.
>
> Ensure HW offloads stay in line with the kernel tables by triggering
> the sending of messages before the kernel processing is completed. For
> destroyed tcf_protos, do this at the new pre_destroy hook. Similarly, if
> a filter is being added, check that it is not concurrently being deleted
> before offloading to hw, rather than the current approach of offloading,
> then checking and reversing the offload if required.
>
> Fixes: 1d965c4def07 ("Refactor flower classifier to remove dependency on rtnl lock")
> Fixes: 272ffaadeb3e ("net: sched: flower: handle concurrent tcf proto deletion")
> Signed-off-by: John Hurley <john.hurley@netronome.com>
> Reported-by: Louis Peens <louis.peens@netronome.com>
> ---
>  net/sched/cls_flower.c | 55 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 30 insertions(+), 25 deletions(-)
>
> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> index 74221e3..3ac47b5 100644
> --- a/net/sched/cls_flower.c
> +++ b/net/sched/cls_flower.c
> @@ -513,13 +513,16 @@ static struct cls_fl_filter *__fl_get(struct cls_fl_head *head, u32 handle)
>  }
>  
>  static int __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f,
> -		       bool *last, bool rtnl_held,
> +		       bool *last, bool rtnl_held, bool do_hw,
>  		       struct netlink_ext_ack *extack)
>  {
>  	struct cls_fl_head *head = fl_head_dereference(tp);
>  
>  	*last = false;
>  
> +	if (do_hw && !tc_skip_hw(f->flags))
> +		fl_hw_destroy_filter(tp, f, rtnl_held, extack);
> +
>  	spin_lock(&tp->lock);
>  	if (f->deleted) {
>  		spin_unlock(&tp->lock);
> @@ -534,8 +537,6 @@ static int __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f,
>  	spin_unlock(&tp->lock);
>  
>  	*last = fl_mask_put(head, f->mask);
> -	if (!tc_skip_hw(f->flags))
> -		fl_hw_destroy_filter(tp, f, rtnl_held, extack);
>  	tcf_unbind_filter(tp, &f->res);
>  	__fl_put(f);
>  
> @@ -563,7 +564,7 @@ static void fl_destroy(struct tcf_proto *tp, bool rtnl_held,
>  
>  	list_for_each_entry_safe(mask, next_mask, &head->masks, list) {
>  		list_for_each_entry_safe(f, next, &mask->filters, list) {
> -			__fl_delete(tp, f, &last, rtnl_held, extack);
> +			__fl_delete(tp, f, &last, rtnl_held, false, extack);
>  			if (last)
>  				break;
>  		}
> @@ -574,6 +575,19 @@ static void fl_destroy(struct tcf_proto *tp, bool rtnl_held,
>  	tcf_queue_work(&head->rwork, fl_destroy_sleepable);
>  }
>  
> +static void fl_pre_destroy(struct tcf_proto *tp, bool rtnl_held,
> +			   struct netlink_ext_ack *extack)
> +{
> +	struct cls_fl_head *head = fl_head_dereference(tp);
> +	struct fl_flow_mask *mask, *next_mask;
> +	struct cls_fl_filter *f, *next;
> +
> +	list_for_each_entry_safe(mask, next_mask, &head->masks, list)
> +		list_for_each_entry_safe(f, next, &mask->filters, list)
> +			if (!tc_skip_hw(f->flags))
> +				fl_hw_destroy_filter(tp, f, rtnl_held, extack);
> +}
> +
>  static void fl_put(struct tcf_proto *tp, void *arg)
>  {
>  	struct cls_fl_filter *f = arg;
> @@ -1588,6 +1602,13 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
>  	if (err)
>  		goto errout_mask;
>  
> +	spin_lock(&tp->lock);
> +	if (tp->deleting || (fold && fold->deleted)) {
> +		err = -EAGAIN;
> +		goto errout_lock;
> +	}
> +	spin_unlock(&tp->lock);
> +

But what if one of these flags is set after this block? It would be
possible to insert dangling filters on a tp that is being deleted, or to
call list_replace_rcu() and idr_replace() twice if the same filter is
replaced concurrently, etc.

>  	if (!tc_skip_hw(fnew->flags)) {
>  		err = fl_hw_replace_filter(tp, fnew, rtnl_held, extack);
>  		if (err)
> @@ -1598,22 +1619,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
>  		fnew->flags |= TCA_CLS_FLAGS_NOT_IN_HW;
>  
>  	spin_lock(&tp->lock);
> -
> -	/* tp was deleted concurrently. -EAGAIN will cause caller to lookup
> -	 * proto again or create new one, if necessary.
> -	 */
> -	if (tp->deleting) {
> -		err = -EAGAIN;
> -		goto errout_hw;
> -	}
> -
>  	if (fold) {
> -		/* Fold filter was deleted concurrently. Retry lookup. */
> -		if (fold->deleted) {
> -			err = -EAGAIN;
> -			goto errout_hw;
> -		}
> -
>  		fnew->handle = handle;
>  
>  		if (!in_ht) {
> @@ -1624,7 +1630,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
>  						     &fnew->ht_node,
>  						     params);
>  			if (err)
> -				goto errout_hw;
> +				goto errout_lock;
>  			in_ht = true;
>  		}
>  
> @@ -1667,7 +1673,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
>  					    INT_MAX, GFP_ATOMIC);
>  		}
>  		if (err)
> -			goto errout_hw;
> +			goto errout_lock;
>  
>  		refcount_inc(&fnew->refcnt);
>  		fnew->handle = handle;
> @@ -1683,11 +1689,9 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
>  
>  errout_ht:
>  	spin_lock(&tp->lock);
> -errout_hw:
> +errout_lock:
>  	fnew->deleted = true;
>  	spin_unlock(&tp->lock);
> -	if (!tc_skip_hw(fnew->flags))
> -		fl_hw_destroy_filter(tp, fnew, rtnl_held, NULL);
>  	if (in_ht)
>  		rhashtable_remove_fast(&fnew->mask->ht, &fnew->ht_node,
>  				       fnew->mask->filter_ht_params);
> @@ -1713,7 +1717,7 @@ static int fl_delete(struct tcf_proto *tp, void *arg, bool *last,
>  	bool last_on_mask;
>  	int err = 0;
>  
> -	err = __fl_delete(tp, f, &last_on_mask, rtnl_held, extack);
> +	err = __fl_delete(tp, f, &last_on_mask, rtnl_held, true, extack);
>  	*last = list_empty(&head->masks);
>  	__fl_put(f);
>  
> @@ -2509,6 +2513,7 @@ static struct tcf_proto_ops cls_fl_ops __read_mostly = {
>  	.kind		= "flower",
>  	.classify	= fl_classify,
>  	.init		= fl_init,
> +	.pre_destroy	= fl_pre_destroy,
>  	.destroy	= fl_destroy,
>  	.get		= fl_get,
>  	.put		= fl_put,



* Re: [RFC net-next 0/2] prevent sync issues with hw offload of flower
  2019-10-02 23:14 [RFC net-next 0/2] prevent sync issues with hw offload of flower John Hurley
  2019-10-02 23:14 ` [RFC net-next 1/2] net: sched: add tp_op for pre_destroy John Hurley
  2019-10-02 23:14 ` [RFC net-next 2/2] net: sched: fix tp destroy race conditions in flower John Hurley
@ 2019-10-03 16:26 ` Vlad Buslov
  2019-10-03 16:59   ` John Hurley
  2 siblings, 1 reply; 11+ messages in thread
From: Vlad Buslov @ 2019-10-03 16:26 UTC (permalink / raw)
  To: John Hurley
  Cc: Vlad Buslov, Jiri Pirko, netdev, simon.horman, jakub.kicinski,
	oss-drivers


On Thu 03 Oct 2019 at 02:14, John Hurley <john.hurley@netronome.com> wrote:
> Hi,
>
> Putting this out an RFC built on net-next. It fixes some issues
> discovered in testing when using the TC API of OvS to generate flower
> rules and subsequently offloading them to HW. Rules seen contain the same
> match fields or may be rule modifications run as a delete plus an add.
> We're seeing race conditions whereby the rules present in kernel flower
> are out of sync with those offloaded. Note that there are some issues
> that will need fixed in the RFC before it becomes a patch such as
> potential races between releasing locks and re-taking them. However, I'm
> putting this out for comments or potential alternative solutions.
>
> The main cause of the races seem to be in the chain table of cls_api. If
> a tcf_proto is destroyed then it is removed from its chain. If a new
> filter is then added to the same chain with the same priority and protocol
> a new tcf_proto will be created - this may happen before the first is
> fully removed and the hw offload message sent to the driver. In cls_flower
> this means that the fl_ht_insert_unique() function can pass as its
> hashtable is associated with the tcf_proto. We are then in a position
> where the 'delete' and the 'add' are in a race to get offloaded. We also
> noticed that doing an offload add, then checking if a tcf_proto is
> concurrently deleting, then remove the offload if it is, can extend the
> out of order messages. Drivers do not expect to get duplicate rules.
> However, the kernel TC datapath they are not duplicates so we can get out
> of sync here.
>
> The RFC fixes this by adding a pre_destroy hook to cls_api that is called
> when a tcf_proto is signaled to be destroyed but before it is removed from
> its chain (which is essentially the lock for allowing duplicates in
> flower). Flower then uses this new hook to send the hw delete messages
> from tcf_proto destroys, preventing them racing with duplicate adds. It
> also moves the check for 'deleting' to before the sending the hw add
> message.
>
> John Hurley (2):
>   net: sched: add tp_op for pre_destroy
>   net: sched: fix tp destroy race conditions in flower
>
>  include/net/sch_generic.h |  3 +++
>  net/sched/cls_api.c       | 29 ++++++++++++++++++++++++-
>  net/sched/cls_flower.c    | 55 ++++++++++++++++++++++++++---------------------
>  3 files changed, 61 insertions(+), 26 deletions(-)

Hi John,

Thanks for working on this!

Are there any other sources of the race conditions described in this
letter? When you describe tcf_proto deletion you say "main cause" but
don't provide any others. If tcf_proto is the only problematic part,
then it might be worth looking into alternative ways to force concurrent
users to wait for proto deletion/destruction to be properly finished.
Maybe having some table that maps chain id + prio to a completion would
be a simpler approach? With such infra, tcf_proto_create() could wait
for a previous proto with the same prio and chain to be fully destroyed
(including offloads) before creating a new one.

Regards,
Vlad


* Re: [RFC net-next 2/2] net: sched: fix tp destroy race conditions in flower
  2019-10-03 16:18   ` Vlad Buslov
@ 2019-10-03 16:39     ` John Hurley
  0 siblings, 0 replies; 11+ messages in thread
From: John Hurley @ 2019-10-03 16:39 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: Jiri Pirko, netdev, simon.horman, jakub.kicinski, oss-drivers

On Thu, Oct 3, 2019 at 5:18 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Thu 03 Oct 2019 at 02:14, John Hurley <john.hurley@netronome.com> wrote:
> > Flower has rule HW offload functions available that drivers can choose to
> > register for. For the deletion case, these are triggered after filters
> > have been removed from lookup tables both at the flower level, and the
> > higher cls_api level. With flower running without RTNL locking, this can
> > lead to races where HW offload messages get out of order.
> >
> > Ensure HW offloads stay in line with the kernel tables by triggering
> > the sending of messages before the kernel processing is completed. For
> > destroyed tcf_protos, do this at the new pre_destroy hook. Similarly, if
> > a filter is being added, check that it is not concurrently being deleted
> > before offloading to hw, rather than the current approach of offloading,
> > then checking and reversing the offload if required.
> >
> > Fixes: 1d965c4def07 ("Refactor flower classifier to remove dependency on rtnl lock")
> > Fixes: 272ffaadeb3e ("net: sched: flower: handle concurrent tcf proto deletion")
> > Signed-off-by: John Hurley <john.hurley@netronome.com>
> > Reported-by: Louis Peens <louis.peens@netronome.com>
> > ---
> >  net/sched/cls_flower.c | 55 +++++++++++++++++++++++++++-----------------------
> >  1 file changed, 30 insertions(+), 25 deletions(-)
> >
> > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> > index 74221e3..3ac47b5 100644
> > --- a/net/sched/cls_flower.c
> > +++ b/net/sched/cls_flower.c
> > @@ -513,13 +513,16 @@ static struct cls_fl_filter *__fl_get(struct cls_fl_head *head, u32 handle)
> >  }
> >
> >  static int __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f,
> > -                    bool *last, bool rtnl_held,
> > +                    bool *last, bool rtnl_held, bool do_hw,
> >                      struct netlink_ext_ack *extack)
> >  {
> >       struct cls_fl_head *head = fl_head_dereference(tp);
> >
> >       *last = false;
> >
> > +     if (do_hw && !tc_skip_hw(f->flags))
> > +             fl_hw_destroy_filter(tp, f, rtnl_held, extack);
> > +
> >       spin_lock(&tp->lock);
> >       if (f->deleted) {
> >               spin_unlock(&tp->lock);
> > @@ -534,8 +537,6 @@ static int __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f,
> >       spin_unlock(&tp->lock);
> >
> >       *last = fl_mask_put(head, f->mask);
> > -     if (!tc_skip_hw(f->flags))
> > -             fl_hw_destroy_filter(tp, f, rtnl_held, extack);
> >       tcf_unbind_filter(tp, &f->res);
> >       __fl_put(f);
> >
> > @@ -563,7 +564,7 @@ static void fl_destroy(struct tcf_proto *tp, bool rtnl_held,
> >
> >       list_for_each_entry_safe(mask, next_mask, &head->masks, list) {
> >               list_for_each_entry_safe(f, next, &mask->filters, list) {
> > -                     __fl_delete(tp, f, &last, rtnl_held, extack);
> > +                     __fl_delete(tp, f, &last, rtnl_held, false, extack);
> >                       if (last)
> >                               break;
> >               }
> > @@ -574,6 +575,19 @@ static void fl_destroy(struct tcf_proto *tp, bool rtnl_held,
> >       tcf_queue_work(&head->rwork, fl_destroy_sleepable);
> >  }
> >
> > +static void fl_pre_destroy(struct tcf_proto *tp, bool rtnl_held,
> > +                        struct netlink_ext_ack *extack)
> > +{
> > +     struct cls_fl_head *head = fl_head_dereference(tp);
> > +     struct fl_flow_mask *mask, *next_mask;
> > +     struct cls_fl_filter *f, *next;
> > +
> > +     list_for_each_entry_safe(mask, next_mask, &head->masks, list)
> > +             list_for_each_entry_safe(f, next, &mask->filters, list)
> > +                     if (!tc_skip_hw(f->flags))
> > +                             fl_hw_destroy_filter(tp, f, rtnl_held, extack);
> > +}
> > +
> >  static void fl_put(struct tcf_proto *tp, void *arg)
> >  {
> >       struct cls_fl_filter *f = arg;
> > @@ -1588,6 +1602,13 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
> >       if (err)
> >               goto errout_mask;
> >
> > +     spin_lock(&tp->lock);
> > +     if (tp->deleting || (fold && fold->deleted)) {
> > +             err = -EAGAIN;
> > +             goto errout_lock;
> > +     }
> > +     spin_unlock(&tp->lock);
> > +
>
> But what if one of these flag are set after this block? It would be
> possible to insert dangling filters on tp that is being deleted, or
> double list_replace_rcu() and idr replace() if same filter is replaced
> concurrently, etc.
>

Hi Vlad, yes, I alluded to this in the cover letter - 'Note that there
are some issues that will need to be fixed in the RFC before it becomes
a patch, such as potential races between releasing locks and re-taking
them.'
It is definitely an issue I'd need to think on more. Maybe mark the tp
as 'can't delete' under the first lock here, or something. Although
perhaps this may just push the races to a different place.

> >       if (!tc_skip_hw(fnew->flags)) {
> >               err = fl_hw_replace_filter(tp, fnew, rtnl_held, extack);
> >               if (err)
> > @@ -1598,22 +1619,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
> >               fnew->flags |= TCA_CLS_FLAGS_NOT_IN_HW;
> >
> >       spin_lock(&tp->lock);
> > -
> > -     /* tp was deleted concurrently. -EAGAIN will cause caller to lookup
> > -      * proto again or create new one, if necessary.
> > -      */
> > -     if (tp->deleting) {
> > -             err = -EAGAIN;
> > -             goto errout_hw;
> > -     }
> > -
> >       if (fold) {
> > -             /* Fold filter was deleted concurrently. Retry lookup. */
> > -             if (fold->deleted) {
> > -                     err = -EAGAIN;
> > -                     goto errout_hw;
> > -             }
> > -
> >               fnew->handle = handle;
> >
> >               if (!in_ht) {
> > @@ -1624,7 +1630,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
> >                                                    &fnew->ht_node,
> >                                                    params);
> >                       if (err)
> > -                             goto errout_hw;
> > +                             goto errout_lock;
> >                       in_ht = true;
> >               }
> >
> > @@ -1667,7 +1673,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
> >                                           INT_MAX, GFP_ATOMIC);
> >               }
> >               if (err)
> > -                     goto errout_hw;
> > +                     goto errout_lock;
> >
> >               refcount_inc(&fnew->refcnt);
> >               fnew->handle = handle;
> > @@ -1683,11 +1689,9 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
> >
> >  errout_ht:
> >       spin_lock(&tp->lock);
> > -errout_hw:
> > +errout_lock:
> >       fnew->deleted = true;
> >       spin_unlock(&tp->lock);
> > -     if (!tc_skip_hw(fnew->flags))
> > -             fl_hw_destroy_filter(tp, fnew, rtnl_held, NULL);
> >       if (in_ht)
> >               rhashtable_remove_fast(&fnew->mask->ht, &fnew->ht_node,
> >                                      fnew->mask->filter_ht_params);
> > @@ -1713,7 +1717,7 @@ static int fl_delete(struct tcf_proto *tp, void *arg, bool *last,
> >       bool last_on_mask;
> >       int err = 0;
> >
> > -     err = __fl_delete(tp, f, &last_on_mask, rtnl_held, extack);
> > +     err = __fl_delete(tp, f, &last_on_mask, rtnl_held, true, extack);
> >       *last = list_empty(&head->masks);
> >       __fl_put(f);
> >
> > @@ -2509,6 +2513,7 @@ static struct tcf_proto_ops cls_fl_ops __read_mostly = {
> >       .kind           = "flower",
> >       .classify       = fl_classify,
> >       .init           = fl_init,
> > +     .pre_destroy    = fl_pre_destroy,
> >       .destroy        = fl_destroy,
> >       .get            = fl_get,
> >       .put            = fl_put,
>


* Re: [RFC net-next 0/2] prevent sync issues with hw offload of flower
  2019-10-03 16:26 ` [RFC net-next 0/2] prevent sync issues with hw offload of flower Vlad Buslov
@ 2019-10-03 16:59   ` John Hurley
  2019-10-03 17:19     ` Vlad Buslov
  0 siblings, 1 reply; 11+ messages in thread
From: John Hurley @ 2019-10-03 16:59 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: Jiri Pirko, netdev, simon.horman, jakub.kicinski, oss-drivers

On Thu, Oct 3, 2019 at 5:26 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Thu 03 Oct 2019 at 02:14, John Hurley <john.hurley@netronome.com> wrote:
> > Hi,
> >
> > Putting this out an RFC built on net-next. It fixes some issues
> > discovered in testing when using the TC API of OvS to generate flower
> > rules and subsequently offloading them to HW. Rules seen contain the same
> > match fields or may be rule modifications run as a delete plus an add.
> > We're seeing race conditions whereby the rules present in kernel flower
> > are out of sync with those offloaded. Note that there are some issues
> > that will need fixed in the RFC before it becomes a patch such as
> > potential races between releasing locks and re-taking them. However, I'm
> > putting this out for comments or potential alternative solutions.
> >
> > The main cause of the races seem to be in the chain table of cls_api. If
> > a tcf_proto is destroyed then it is removed from its chain. If a new
> > filter is then added to the same chain with the same priority and protocol
> > a new tcf_proto will be created - this may happen before the first is
> > fully removed and the hw offload message sent to the driver. In cls_flower
> > this means that the fl_ht_insert_unique() function can pass as its
> > hashtable is associated with the tcf_proto. We are then in a position
> > where the 'delete' and the 'add' are in a race to get offloaded. We also
> > noticed that doing an offload add, then checking if a tcf_proto is
> > concurrently deleting, then remove the offload if it is, can extend the
> > out of order messages. Drivers do not expect to get duplicate rules.
> > However, the kernel TC datapath they are not duplicates so we can get out
> > of sync here.
> >
> > The RFC fixes this by adding a pre_destroy hook to cls_api that is called
> > when a tcf_proto is signaled to be destroyed but before it is removed from
> > its chain (which is essentially the lock for allowing duplicates in
> > flower). Flower then uses this new hook to send the hw delete messages
> > from tcf_proto destroys, preventing them racing with duplicate adds. It
> > also moves the check for 'deleting' to before the sending the hw add
> > message.
> >
> > John Hurley (2):
> >   net: sched: add tp_op for pre_destroy
> >   net: sched: fix tp destroy race conditions in flower
> >
> >  include/net/sch_generic.h |  3 +++
> >  net/sched/cls_api.c       | 29 ++++++++++++++++++++++++-
> >  net/sched/cls_flower.c    | 55 ++++++++++++++++++++++++++---------------------
> >  3 files changed, 61 insertions(+), 26 deletions(-)
>
> Hi John,
>
> Thanks for working on this!
>
> Are there any other sources for race conditions described in this
> letter? When you describe tcf_proto deletion you say "main cause" but
> don't provide any others. If tcf_proto is the only problematic part,

Hi Vlad,
Thanks for the input.
The tcf_proto deletion was the cause in the tests we ran. That's not
to say there aren't more that I didn't see in my analysis.

> then it might be worth to look into alternative ways to force concurrent
> users to wait for proto deletion/destruction to be properly finished.
> Maybe having some table that maps chain id + prio to completion would be
> simpler approach? With such infra tcf_proto_create() can wait for
> previous proto with same prio and chain to be fully destroyed (including
> offloads) before creating a new one.

I think a problem with this is that the chain removal functions call
tcf_proto_put() (which calls destroy when the ref is 0), so if other
concurrent processes (like a dump) have references to the tcf_proto
then we may not get the hw offload delete even by the time the chain
deletion function has finished. We would need to make sure this was
tracked - say, after the tcf_proto_destroy() function has completed.
How would you suggest doing the wait? With a replay flag, as happens in
some other places?

To me it seems the main problem is that the tcf_proto being in a chain
almost acts like the lock that prevents duplicate filters getting to the
driver. We need some mechanism to ensure a delete has made it to HW
before we release this 'lock'.

>
> Regards,
> Vlad


* Re: [RFC net-next 0/2] prevent sync issues with hw offload of flower
  2019-10-03 16:59   ` John Hurley
@ 2019-10-03 17:19     ` Vlad Buslov
  2019-10-04 15:39       ` John Hurley
  0 siblings, 1 reply; 11+ messages in thread
From: Vlad Buslov @ 2019-10-03 17:19 UTC (permalink / raw)
  To: John Hurley
  Cc: Vlad Buslov, Jiri Pirko, netdev, simon.horman, jakub.kicinski,
	oss-drivers


On Thu 03 Oct 2019 at 19:59, John Hurley <john.hurley@netronome.com> wrote:
> On Thu, Oct 3, 2019 at 5:26 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Thu 03 Oct 2019 at 02:14, John Hurley <john.hurley@netronome.com> wrote:
>> > Hi,
>> >
>> > Putting this out an RFC built on net-next. It fixes some issues
>> > discovered in testing when using the TC API of OvS to generate flower
>> > rules and subsequently offloading them to HW. Rules seen contain the same
>> > match fields or may be rule modifications run as a delete plus an add.
>> > We're seeing race conditions whereby the rules present in kernel flower
>> > are out of sync with those offloaded. Note that there are some issues
>> > that will need fixed in the RFC before it becomes a patch such as
>> > potential races between releasing locks and re-taking them. However, I'm
>> > putting this out for comments or potential alternative solutions.
>> >
>> > The main cause of the races seem to be in the chain table of cls_api. If
>> > a tcf_proto is destroyed then it is removed from its chain. If a new
>> > filter is then added to the same chain with the same priority and protocol
>> > a new tcf_proto will be created - this may happen before the first is
>> > fully removed and the hw offload message sent to the driver. In cls_flower
>> > this means that the fl_ht_insert_unique() function can pass as its
>> > hashtable is associated with the tcf_proto. We are then in a position
>> > where the 'delete' and the 'add' are in a race to get offloaded. We also
>> > noticed that doing an offload add, then checking if a tcf_proto is
>> > concurrently deleting, then remove the offload if it is, can extend the
>> > out of order messages. Drivers do not expect to get duplicate rules.
>> > However, the kernel TC datapath they are not duplicates so we can get out
>> > of sync here.
>> >
>> > The RFC fixes this by adding a pre_destroy hook to cls_api that is called
>> > when a tcf_proto is signaled to be destroyed but before it is removed from
>> > its chain (which is essentially the lock for allowing duplicates in
>> > flower). Flower then uses this new hook to send the hw delete messages
>> > from tcf_proto destroys, preventing them racing with duplicate adds. It
>> > also moves the check for 'deleting' to before the sending the hw add
>> > message.
>> >
>> > John Hurley (2):
>> >   net: sched: add tp_op for pre_destroy
>> >   net: sched: fix tp destroy race conditions in flower
>> >
>> >  include/net/sch_generic.h |  3 +++
>> >  net/sched/cls_api.c       | 29 ++++++++++++++++++++++++-
>> >  net/sched/cls_flower.c    | 55 ++++++++++++++++++++++++++---------------------
>> >  3 files changed, 61 insertions(+), 26 deletions(-)
>>
>> Hi John,
>>
>> Thanks for working on this!
>>
>> Are there any other sources for race conditions described in this
>> letter? When you describe tcf_proto deletion you say "main cause" but
>> don't provide any others. If tcf_proto is the only problematic part,
>
> Hi Vlad,
> Thanks for the input.
> The tcf_proto deletion was the cause from the tests we ran. That's not
> to say there are not more I wasn't seeing in my analysis.
>
>> then it might be worth to look into alternative ways to force concurrent
>> users to wait for proto deletion/destruction to be properly finished.
>> Maybe having some table that maps chain id + prio to completion would be
>> simpler approach? With such infra tcf_proto_create() can wait for
>> previous proto with same prio and chain to be fully destroyed (including
>> offloads) before creating a new one.
>
> I think a problem with this is that the chain removal functions call
> tcf_proto_put() (which calls destroy when ref is 0) so, if other
> concurrent processes (like a dump) have references to the tcf_proto
> then we may not get the hw offload even by the time the chain deletion
> function has finished. We would need to make sure this was tracked -
> say after the tcf_proto_destroy function has completed.
> How would you suggest doing the wait? With a replay flag as happens in
> some other places?
>
> To me it seems the main problem is that the tcf_proto being in a chain
> almost acts like the lock to prevent duplicates filters getting to the
> driver. We need some mechanism to ensure a delete has made it to HW
> before we release this 'lock'.

Maybe something like:

1. Extend the block with a hash table whose key is the chain id and prio
combined, and whose value is some structure that contains a struct
completion (completed in tcf_proto_destroy(), where we are sure that all
rules were removed from hw) and a reference counter.

2. When cls API wants to delete a proto instance
(tcf_chain_tp_delete_empty(), chain flush, etc.), a new member is added
to the table from 1. with the chain+prio of the proto that is being
deleted (atomically with detaching the proto from the chain).

3. When inserting a new proto, verify that there is no corresponding
entry in the hash table with the same chain+prio. If there is one,
increment the reference counter and wait for the completion. Release the
reference counter when completed.
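
Roughly, the entry in 1. could be something like the sketch below (the
names here are invented just to illustrate the idea, they are not from
any patch):

struct tcf_block_destroy_wait {
	struct rhash_head ht_node;	/* hashed in the block on (chain, prio) */
	u32 chain_index;
	u32 prio;
	refcount_t refcnt;		/* destroying proto + any waiters */
	struct completion done;		/* complete_all() once the proto and
					 * its hw rules are fully gone
					 */
};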

>
>>
>> Regards,
>> Vlad


* Re: [RFC net-next 0/2] prevent sync issues with hw offload of flower
  2019-10-03 17:19     ` Vlad Buslov
@ 2019-10-04 15:39       ` John Hurley
  2019-10-04 15:58         ` Vlad Buslov
  0 siblings, 1 reply; 11+ messages in thread
From: John Hurley @ 2019-10-04 15:39 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: Jiri Pirko, netdev, simon.horman, jakub.kicinski, oss-drivers

On Thu, Oct 3, 2019 at 6:19 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Thu 03 Oct 2019 at 19:59, John Hurley <john.hurley@netronome.com> wrote:
> > On Thu, Oct 3, 2019 at 5:26 PM Vlad Buslov <vladbu@mellanox.com> wrote:
> >>
> >>
> >> On Thu 03 Oct 2019 at 02:14, John Hurley <john.hurley@netronome.com> wrote:
> >> > Hi,
> >> >
> >> > Putting this out an RFC built on net-next. It fixes some issues
> >> > discovered in testing when using the TC API of OvS to generate flower
> >> > rules and subsequently offloading them to HW. Rules seen contain the same
> >> > match fields or may be rule modifications run as a delete plus an add.
> >> > We're seeing race conditions whereby the rules present in kernel flower
> >> > are out of sync with those offloaded. Note that there are some issues
> >> > that will need fixed in the RFC before it becomes a patch such as
> >> > potential races between releasing locks and re-taking them. However, I'm
> >> > putting this out for comments or potential alternative solutions.
> >> >
> >> > The main cause of the races seem to be in the chain table of cls_api. If
> >> > a tcf_proto is destroyed then it is removed from its chain. If a new
> >> > filter is then added to the same chain with the same priority and protocol
> >> > a new tcf_proto will be created - this may happen before the first is
> >> > fully removed and the hw offload message sent to the driver. In cls_flower
> >> > this means that the fl_ht_insert_unique() function can pass as its
> >> > hashtable is associated with the tcf_proto. We are then in a position
> >> > where the 'delete' and the 'add' are in a race to get offloaded. We also
> >> > noticed that doing an offload add, then checking if a tcf_proto is
> >> > concurrently deleting, then remove the offload if it is, can extend the
> >> > out of order messages. Drivers do not expect to get duplicate rules.
> >> > However, the kernel TC datapath they are not duplicates so we can get out
> >> > of sync here.
> >> >
> >> > The RFC fixes this by adding a pre_destroy hook to cls_api that is called
> >> > when a tcf_proto is signaled to be destroyed but before it is removed from
> >> > its chain (which is essentially the lock for allowing duplicates in
> >> > flower). Flower then uses this new hook to send the hw delete messages
> >> > from tcf_proto destroys, preventing them racing with duplicate adds. It
> >> > also moves the check for 'deleting' to before the sending the hw add
> >> > message.
> >> >
> >> > John Hurley (2):
> >> >   net: sched: add tp_op for pre_destroy
> >> >   net: sched: fix tp destroy race conditions in flower
> >> >
> >> >  include/net/sch_generic.h |  3 +++
> >> >  net/sched/cls_api.c       | 29 ++++++++++++++++++++++++-
> >> >  net/sched/cls_flower.c    | 55 ++++++++++++++++++++++++++---------------------
> >> >  3 files changed, 61 insertions(+), 26 deletions(-)
> >>
> >> Hi John,
> >>
> >> Thanks for working on this!
> >>
> >> Are there any other sources for race conditions described in this
> >> letter? When you describe tcf_proto deletion you say "main cause" but
> >> don't provide any others. If tcf_proto is the only problematic part,
> >
> > Hi Vlad,
> > Thanks for the input.
> > The tcf_proto deletion was the cause from the tests we ran. That's not
> > to say there are not more I wasn't seeing in my analysis.
> >
> >> then it might be worth to look into alternative ways to force concurrent
> >> users to wait for proto deletion/destruction to be properly finished.
> >> Maybe having some table that maps chain id + prio to completion would be
> >> simpler approach? With such infra tcf_proto_create() can wait for
> >> previous proto with same prio and chain to be fully destroyed (including
> >> offloads) before creating a new one.
> >
> > I think a problem with this is that the chain removal functions call
> > tcf_proto_put() (which calls destroy when ref is 0) so, if other
> > concurrent processes (like a dump) have references to the tcf_proto
> > then we may not get the hw offload even by the time the chain deletion
> > function has finished. We would need to make sure this was tracked -
> > say after the tcf_proto_destroy function has completed.
> > How would you suggest doing the wait? With a replay flag as happens in
> > some other places?
> >
> > To me it seems the main problem is that the tcf_proto being in a chain
> > almost acts like the lock to prevent duplicates filters getting to the
> > driver. We need some mechanism to ensure a delete has made it to HW
> > before we release this 'lock'.
>
> Maybe something like:

Ok, I'll need to give this more thought.
Initially it does sound like overkill.

>
> 1. Extend block with hash table with key being chain id and prio
> combined and value is some structure that contains struct completion
> (completed in tcf_proto_destroy() where we sure that all rules were
> removed from hw) and a reference counter.
>

Maybe it could live in each chain rather than the block, to be more
fine-grained? Or would this potentially cause a similar issue on
deletion of chains?

> 2. When cls API wants to delete proto instance
> (tcf_chain_tp_delete_empty(), chain flush, etc.), new member is added to
> table from 1. with chain+prio of proto that is being deleted (atomically
> with detaching of proto from chain).
>
> 3. When inserting new proto, verify that there are no corresponding
> entry in hash table with same chain+prio. If there is, increment
> reference counter and wait for completion. Release reference counter
> when completed.

How would the 'wait' work? Loop back via replay flag?
It feels a bit like we are adding a lot more complexity to this and
almost hacking something in to work around a (relatively) newly
introduced problem.

>
> >
> >>
> >> Regards,
> >> Vlad


* Re: [RFC net-next 0/2] prevent sync issues with hw offload of flower
  2019-10-04 15:39       ` John Hurley
@ 2019-10-04 15:58         ` Vlad Buslov
  2019-10-04 16:06           ` John Hurley
  0 siblings, 1 reply; 11+ messages in thread
From: Vlad Buslov @ 2019-10-04 15:58 UTC (permalink / raw)
  To: John Hurley
  Cc: Vlad Buslov, Jiri Pirko, netdev, simon.horman, jakub.kicinski,
	oss-drivers


On Fri 04 Oct 2019 at 18:39, John Hurley <john.hurley@netronome.com> wrote:
> On Thu, Oct 3, 2019 at 6:19 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Thu 03 Oct 2019 at 19:59, John Hurley <john.hurley@netronome.com> wrote:
>> > On Thu, Oct 3, 2019 at 5:26 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >>
>> >>
>> >> On Thu 03 Oct 2019 at 02:14, John Hurley <john.hurley@netronome.com> wrote:
>> >> > Hi,
>> >> >
>> >> > Putting this out an RFC built on net-next. It fixes some issues
>> >> > discovered in testing when using the TC API of OvS to generate flower
>> >> > rules and subsequently offloading them to HW. Rules seen contain the same
>> >> > match fields or may be rule modifications run as a delete plus an add.
>> >> > We're seeing race conditions whereby the rules present in kernel flower
>> >> > are out of sync with those offloaded. Note that there are some issues
>> >> > that will need fixed in the RFC before it becomes a patch such as
>> >> > potential races between releasing locks and re-taking them. However, I'm
>> >> > putting this out for comments or potential alternative solutions.
>> >> >
>> >> > The main cause of the races seem to be in the chain table of cls_api. If
>> >> > a tcf_proto is destroyed then it is removed from its chain. If a new
>> >> > filter is then added to the same chain with the same priority and protocol
>> >> > a new tcf_proto will be created - this may happen before the first is
>> >> > fully removed and the hw offload message sent to the driver. In cls_flower
>> >> > this means that the fl_ht_insert_unique() function can pass as its
>> >> > hashtable is associated with the tcf_proto. We are then in a position
>> >> > where the 'delete' and the 'add' are in a race to get offloaded. We also
>> >> > noticed that doing an offload add, then checking if a tcf_proto is
>> >> > concurrently deleting, then remove the offload if it is, can extend the
>> >> > out of order messages. Drivers do not expect to get duplicate rules.
>> >> > However, the kernel TC datapath they are not duplicates so we can get out
>> >> > of sync here.
>> >> >
>> >> > The RFC fixes this by adding a pre_destroy hook to cls_api that is called
>> >> > when a tcf_proto is signaled to be destroyed but before it is removed from
>> >> > its chain (which is essentially the lock for allowing duplicates in
>> >> > flower). Flower then uses this new hook to send the hw delete messages
>> >> > from tcf_proto destroys, preventing them racing with duplicate adds. It
>> >> > also moves the check for 'deleting' to before the sending the hw add
>> >> > message.
>> >> >
>> >> > John Hurley (2):
>> >> >   net: sched: add tp_op for pre_destroy
>> >> >   net: sched: fix tp destroy race conditions in flower
>> >> >
>> >> >  include/net/sch_generic.h |  3 +++
>> >> >  net/sched/cls_api.c       | 29 ++++++++++++++++++++++++-
>> >> >  net/sched/cls_flower.c    | 55 ++++++++++++++++++++++++++---------------------
>> >> >  3 files changed, 61 insertions(+), 26 deletions(-)
>> >>
>> >> Hi John,
>> >>
>> >> Thanks for working on this!
>> >>
>> >> Are there any other sources for race conditions described in this
>> >> letter? When you describe tcf_proto deletion you say "main cause" but
>> >> don't provide any others. If tcf_proto is the only problematic part,
>> >
>> > Hi Vlad,
>> > Thanks for the input.
>> > The tcf_proto deletion was the cause from the tests we ran. That's not
>> > to say there are not more I wasn't seeing in my analysis.
>> >
>> >> then it might be worth to look into alternative ways to force concurrent
>> >> users to wait for proto deletion/destruction to be properly finished.
>> >> Maybe having some table that maps chain id + prio to completion would be
>> >> simpler approach? With such infra tcf_proto_create() can wait for
>> >> previous proto with same prio and chain to be fully destroyed (including
>> >> offloads) before creating a new one.
>> >
>> > I think a problem with this is that the chain removal functions call
>> > tcf_proto_put() (which calls destroy when ref is 0) so, if other
>> > concurrent processes (like a dump) have references to the tcf_proto
>> > then we may not get the hw offload even by the time the chain deletion
>> > function has finished. We would need to make sure this was tracked -
>> > say after the tcf_proto_destroy function has completed.
>> > How would you suggest doing the wait? With a replay flag as happens in
>> > some other places?
>> >
>> > To me it seems the main problem is that the tcf_proto being in a chain
>> > almost acts like the lock to prevent duplicates filters getting to the
>> > driver. We need some mechanism to ensure a delete has made it to HW
>> > before we release this 'lock'.
>>
>> Maybe something like:
>
> Ok, I'll need to give this more thought.
> Initially it does sound like overkill.
>
>>
>> 1. Extend block with hash table with key being chain id and prio
>> combined and value is some structure that contains struct completion
>> (completed in tcf_proto_destroy() where we sure that all rules were
>> removed from hw) and a reference counter.
>>
>
> Maybe it could live in each chain rather than block to be more fine grained?
> Or would this potentially cause a similar issue on deletion of chains?

IMO just having one per block is straightforward enough, unless there is
a reason to make it per chain.

>
>> 2. When cls API wants to delete proto instance
>> (tcf_chain_tp_delete_empty(), chain flush, etc.), new member is added to
>> table from 1. with chain+prio of proto that is being deleted (atomically
>> with detaching of proto from chain).
>>
>> 3. When inserting new proto, verify that there are no corresponding
>> entry in hash table with same chain+prio. If there is, increment
>> reference counter and wait for completion. Release reference counter
>> when completed.
>
> How would the 'wait' work? Loop back via replay flag?

What is "loop back via replay flag"?
Anyway, I was suggesting to use struct completion from completion.h,
which has following functions in its API:

/**
 * wait_for_completion: - waits for completion of a task
 * @x:  holds the state of this particular completion
 *
 * This waits to be signaled for completion of a specific task. It is NOT
 * interruptible and there is no timeout.
 *
 * See also similar routines (i.e. wait_for_completion_timeout()) with timeout
 * and interrupt capability. Also see complete().
 */
void __sched wait_for_completion(struct completion *x)

/**
 * complete_all: - signals all threads waiting on this completion
 * @x:  holds the state of this particular completion
 *
 * This will wake up all threads waiting on this particular completion event.
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 *
 * Since complete_all() sets the completion of @x permanently to done
 * to allow multiple waiters to finish, a call to reinit_completion()
 * must be used on @x if @x is to be used again. The code must make
 * sure that all waiters have woken and finished before reinitializing
 * @x. Also note that the function completion_done() can not be used
 * to know if there are still waiters after complete_all() has been called.
 */
void complete_all(struct completion *x)
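
So, very roughly, the usage could look like the sketch below (the lookup
and put helpers are made-up names, just to show where the completion API
above would be called):

/* proto destruction path, once all hw deletes for the proto were sent */
complete_all(&entry->done);

/* proto creation path, before creating a proto with the same chain + prio */
entry = block_destroy_wait_lookup(block, chain_index, prio);
if (entry) {
	refcount_inc(&entry->refcnt);
	wait_for_completion(&entry->done);
	block_destroy_wait_put(entry);	/* drop the ref, free on last put */
}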


> It feels a bit like we are adding a lot more complexity to this and
> almost hacking something in to work around a (relatively) newly
> introduced problem.

I'm not insisting on any particular approach, just suggesting something
which in my opinion is easier to implement than reshuffling the locking
in flower and directly targets the problem you described in the cover
letter - new filters can be inserted while concurrent destruction of a
tp with the same chain id and prio is in progress.

>
>>
>> >
>> >>
>> >> Regards,
>> >> Vlad


* Re: [RFC net-next 0/2] prevent sync issues with hw offload of flower
  2019-10-04 15:58         ` Vlad Buslov
@ 2019-10-04 16:06           ` John Hurley
  0 siblings, 0 replies; 11+ messages in thread
From: John Hurley @ 2019-10-04 16:06 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: Jiri Pirko, netdev, simon.horman, jakub.kicinski, oss-drivers

On Fri, Oct 4, 2019 at 4:58 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Fri 04 Oct 2019 at 18:39, John Hurley <john.hurley@netronome.com> wrote:
> > On Thu, Oct 3, 2019 at 6:19 PM Vlad Buslov <vladbu@mellanox.com> wrote:
> >>
> >>
> >> On Thu 03 Oct 2019 at 19:59, John Hurley <john.hurley@netronome.com> wrote:
> >> > On Thu, Oct 3, 2019 at 5:26 PM Vlad Buslov <vladbu@mellanox.com> wrote:
> >> >>
> >> >>
> >> >> On Thu 03 Oct 2019 at 02:14, John Hurley <john.hurley@netronome.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > Putting this out an RFC built on net-next. It fixes some issues
> >> >> > discovered in testing when using the TC API of OvS to generate flower
> >> >> > rules and subsequently offloading them to HW. Rules seen contain the same
> >> >> > match fields or may be rule modifications run as a delete plus an add.
> >> >> > We're seeing race conditions whereby the rules present in kernel flower
> >> >> > are out of sync with those offloaded. Note that there are some issues
> >> >> > that will need fixed in the RFC before it becomes a patch such as
> >> >> > potential races between releasing locks and re-taking them. However, I'm
> >> >> > putting this out for comments or potential alternative solutions.
> >> >> >
> >> >> > The main cause of the races seem to be in the chain table of cls_api. If
> >> >> > a tcf_proto is destroyed then it is removed from its chain. If a new
> >> >> > filter is then added to the same chain with the same priority and protocol
> >> >> > a new tcf_proto will be created - this may happen before the first is
> >> >> > fully removed and the hw offload message sent to the driver. In cls_flower
> >> >> > this means that the fl_ht_insert_unique() function can pass as its
> >> >> > hashtable is associated with the tcf_proto. We are then in a position
> >> >> > where the 'delete' and the 'add' are in a race to get offloaded. We also
> >> >> > noticed that doing an offload add, then checking if a tcf_proto is
> >> >> > concurrently deleting, then remove the offload if it is, can extend the
> >> >> > out of order messages. Drivers do not expect to get duplicate rules.
> >> >> > However, the kernel TC datapath they are not duplicates so we can get out
> >> >> > of sync here.
> >> >> >
> >> >> > The RFC fixes this by adding a pre_destroy hook to cls_api that is called
> >> >> > when a tcf_proto is signaled to be destroyed but before it is removed from
> >> >> > its chain (which is essentially the lock for allowing duplicates in
> >> >> > flower). Flower then uses this new hook to send the hw delete messages
> >> >> > from tcf_proto destroys, preventing them racing with duplicate adds. It
> >> >> > also moves the check for 'deleting' to before the sending the hw add
> >> >> > message.
> >> >> >
> >> >> > John Hurley (2):
> >> >> >   net: sched: add tp_op for pre_destroy
> >> >> >   net: sched: fix tp destroy race conditions in flower
> >> >> >
> >> >> >  include/net/sch_generic.h |  3 +++
> >> >> >  net/sched/cls_api.c       | 29 ++++++++++++++++++++++++-
> >> >> >  net/sched/cls_flower.c    | 55 ++++++++++++++++++++++++++---------------------
> >> >> >  3 files changed, 61 insertions(+), 26 deletions(-)
> >> >>
> >> >> Hi John,
> >> >>
> >> >> Thanks for working on this!
> >> >>
> >> >> Are there any other sources for race conditions described in this
> >> >> letter? When you describe tcf_proto deletion you say "main cause" but
> >> >> don't provide any others. If tcf_proto is the only problematic part,
> >> >
> >> > Hi Vlad,
> >> > Thanks for the input.
> >> > The tcf_proto deletion was the cause from the tests we ran. That's not
> >> > to say there are not more I wasn't seeing in my analysis.
> >> >
> >> >> then it might be worth to look into alternative ways to force concurrent
> >> >> users to wait for proto deletion/destruction to be properly finished.
> >> >> Maybe having some table that maps chain id + prio to completion would be
> >> >> simpler approach? With such infra tcf_proto_create() can wait for
> >> >> previous proto with same prio and chain to be fully destroyed (including
> >> >> offloads) before creating a new one.
> >> >
> >> > I think a problem with this is that the chain removal functions call
> >> > tcf_proto_put() (which calls destroy when ref is 0) so, if other
> >> > concurrent processes (like a dump) have references to the tcf_proto
> >> > then we may not get the hw offload even by the time the chain deletion
> >> > function has finished. We would need to make sure this was tracked -
> >> > say after the tcf_proto_destroy function has completed.
> >> > How would you suggest doing the wait? With a replay flag as happens in
> >> > some other places?
> >> >
> >> > To me it seems the main problem is that the tcf_proto being in a chain
> >> > almost acts like the lock to prevent duplicates filters getting to the
> >> > driver. We need some mechanism to ensure a delete has made it to HW
> >> > before we release this 'lock'.
> >>
> >> Maybe something like:
> >
> > Ok, I'll need to give this more thought.
> > Initially it does sound like overkill.
> >
> >>
> >> 1. Extend block with hash table with key being chain id and prio
> >> combined and value is some structure that contains struct completion
> >> (completed in tcf_proto_destroy() where we sure that all rules were
> >> removed from hw) and a reference counter.
> >>
> >
> > Maybe it could live in each chain rather than block to be more fine grained?
> > Or would this potentially cause a similar issue on deletion of chains?
>
> IMO just having one per block is straightforward enough, unless there is
> a reason to make it per chain.
>
> >
> >> 2. When cls API wants to delete proto instance
> >> (tcf_chain_tp_delete_empty(), chain flush, etc.), new member is added to
> >> table from 1. with chain+prio of proto that is being deleted (atomically
> >> with detaching of proto from chain).
> >>
> >> 3. When inserting new proto, verify that there are no corresponding
> >> entry in hash table with same chain+prio. If there is, increment
> >> reference counter and wait for completion. Release reference counter
> >> when completed.
> >
> > How would the 'wait' work? Loop back via replay flag?
>
> What is "loop back via replay flag"?

Ok, bad description :)
I was referring to the EAGAIN error returns used in other places post RTNL removal.
e.g. https://elixir.bootlin.com/linux/v5.4-rc1/source/net/sched/cls_flower.c#L1606
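
In other words, something roughly along these lines (a paraphrase of the
idea, not the actual code at that link, and the helper name is made up):

static int fl_check_concurrent_delete(struct tcf_proto *tp,
                                      struct cls_fl_filter *fold)
{
        int err = 0;

        /* If the old filter is concurrently being deleted, give up and
         * return -EAGAIN so cls_api drops its references and replays the
         * whole request, rather than racing with the delete.
         */
        spin_lock(&tp->lock);
        if (fold && fold->deleted)
                err = -EAGAIN;
        spin_unlock(&tp->lock);

        return err;
}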

> Anyway, I was suggesting using struct completion from completion.h,
> which has the following functions in its API:
>

Thanks for the pointer.

> /**
>  * wait_for_completion: - waits for completion of a task
>  * @x:  holds the state of this particular completion
>  *
>  * This waits to be signaled for completion of a specific task. It is NOT
>  * interruptible and there is no timeout.
>  *
>  * See also similar routines (i.e. wait_for_completion_timeout()) with timeout
>  * and interrupt capability. Also see complete().
>  */
> void __sched wait_for_completion(struct completion *x)
>
> /**
>  * complete_all: - signals all threads waiting on this completion
>  * @x:  holds the state of this particular completion
>  *
>  * This will wake up all threads waiting on this particular completion event.
>  *
>  * If this function wakes up a task, it executes a full memory barrier before
>  * accessing the task state.
>  *
>  * Since complete_all() sets the completion of @x permanently to done
>  * to allow multiple waiters to finish, a call to reinit_completion()
>  * must be used on @x if @x is to be used again. The code must make
>  * sure that all waiters have woken and finished before reinitializing
>  * @x. Also note that the function completion_done() can not be used
>  * to know if there are still waiters after complete_all() has been called.
>  */
> void complete_all(struct completion *x)
>
>
> > It feels a bit like we are adding a lot more complexity to this and
> > almost hacking something in to work around a (relatively) newly
> > introduced problem.
>
> I'm not insisting on any particular approach, just suggesting something
> which in my opinion is easier to implement than reshuffling locking in
> flower and directly targets the problem you described in the cover letter -
> new filters can be inserted while concurrent destruction of a tp with the
> same chain id and prio is in progress.
>
> >
> >>
> >> >
> >> >>
> >> >> Regards,
> >> >> Vlad

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-10-04 16:06 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-02 23:14 [RFC net-next 0/2] prevent sync issues with hw offload of flower John Hurley
2019-10-02 23:14 ` [RFC net-next 1/2] net: sched: add tp_op for pre_destroy John Hurley
2019-10-02 23:14 ` [RFC net-next 2/2] net: sched: fix tp destroy race conditions in flower John Hurley
2019-10-03 16:18   ` Vlad Buslov
2019-10-03 16:39     ` John Hurley
2019-10-03 16:26 ` [RFC net-next 0/2] prevent sync issues with hw offload of flower Vlad Buslov
2019-10-03 16:59   ` John Hurley
2019-10-03 17:19     ` Vlad Buslov
2019-10-04 15:39       ` John Hurley
2019-10-04 15:58         ` Vlad Buslov
2019-10-04 16:06           ` John Hurley
