* [PATCH net-next 0/2] net/sched: optimizations around action binding and init
@ 2023-12-05 15:30 Pedro Tammela
  2023-12-05 15:30 ` [PATCH net-next 1/2] net/sched: act_api: rely on rcu in tcf_idr_check_alloc Pedro Tammela
  2023-12-05 15:30 ` [PATCH net-next 2/2] net/sched: act_api: skip idr replace on bound actions Pedro Tammela
  0 siblings, 2 replies; 8+ messages in thread
From: Pedro Tammela @ 2023-12-05 15:30 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, jhs, xiyou.wangcong, jiri,
	marcelo.leitner, vladbu, Pedro Tammela

Scaling optimizations for action binding in rtnl-less filters.
We observed noticeable lock contention around idrinfo->lock when
testing on a 56-core system; it disappeared after these patches.

Pedro Tammela (2):
  net/sched: act_api: rely on rcu in tcf_idr_check_alloc
  net/sched: act_api: skip idr replace on bound actions

 include/net/act_api.h |  2 +-
 net/sched/act_api.c   | 67 ++++++++++++++++++++++++++-----------------
 net/sched/cls_api.c   |  2 +-
 3 files changed, 42 insertions(+), 29 deletions(-)

-- 
2.40.1



* [PATCH net-next 1/2] net/sched: act_api: rely on rcu in tcf_idr_check_alloc
  2023-12-05 15:30 [PATCH net-next 0/2] net/sched: optimizations around action binding and init Pedro Tammela
@ 2023-12-05 15:30 ` Pedro Tammela
  2023-12-05 18:34   ` Vlad Buslov
  2023-12-05 15:30 ` [PATCH net-next 2/2] net/sched: act_api: skip idr replace on bound actions Pedro Tammela
  1 sibling, 1 reply; 8+ messages in thread
From: Pedro Tammela @ 2023-12-05 15:30 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, jhs, xiyou.wangcong, jiri,
	marcelo.leitner, vladbu, Pedro Tammela

Instead of relying only on the idrinfo->lock mutex for
bind/alloc logic, rely on a combination of rcu + mutex + atomics
to better scale the case where multiple rtnl-less filters are
binding to the same action object.

Action binding happens when an action index is specified explicitly and
an action exists which such index exists. Example:
  tc actions add action drop index 1
  tc filter add ... matchall action drop index 1
  tc filter add ... matchall action drop index 1
  tc filter add ... matchall action drop index 1
  tc filter ls ...
     filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
     not_in_hw
           action order 1: gact action drop
            random type none pass val 0
            index 1 ref 4 bind 3

   filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
     not_in_hw
           action order 1: gact action drop
            random type none pass val 0
            index 1 ref 4 bind 3

   filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
     not_in_hw
           action order 1: gact action drop
            random type none pass val 0
            index 1 ref 4 bind 3

When no index is specified, as before, grab the mutex and allocate
in the idr the next available id. In this version, as opposed to before,
it's simplified to store the -EBUSY pointer instead of the previous
alloc + replace combination.

When an index is specified, rely on rcu to find if there's an object in
such index. If there's none, fallback to the above, serializing on the
mutex and reserving the specified id. If there's one, it can be an -EBUSY
pointer, in which case we just try again until it's an action, or an action.
Given the rcu guarantees, the action found could be dead and therefore
we need to bump the refcount if it's not 0, handling the case it's
in fact 0.

As bind and the action refcount are already atomics, these increments can
happen without the mutex protection while many tcf_idr_check_alloc race
to bind to the same action instance.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
---
 net/sched/act_api.c | 56 +++++++++++++++++++++++++++------------------
 1 file changed, 34 insertions(+), 22 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index abec5c45b5a4..79a044d2ae02 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -824,43 +824,55 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
 	struct tcf_idrinfo *idrinfo = tn->idrinfo;
 	struct tc_action *p;
 	int ret;
+	u32 max;
 
-again:
-	mutex_lock(&idrinfo->lock);
 	if (*index) {
+again:
+		rcu_read_lock();
 		p = idr_find(&idrinfo->action_idr, *index);
+
 		if (IS_ERR(p)) {
 			/* This means that another process allocated
 			 * index but did not assign the pointer yet.
 			 */
-			mutex_unlock(&idrinfo->lock);
+			rcu_read_unlock();
 			goto again;
 		}
 
-		if (p) {
-			refcount_inc(&p->tcfa_refcnt);
-			if (bind)
-				atomic_inc(&p->tcfa_bindcnt);
-			*a = p;
-			ret = 1;
-		} else {
-			*a = NULL;
-			ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
-					    *index, GFP_KERNEL);
-			if (!ret)
-				idr_replace(&idrinfo->action_idr,
-					    ERR_PTR(-EBUSY), *index);
+		if (!p) {
+			/* Empty slot, try to allocate it */
+			max = *index;
+			rcu_read_unlock();
+			goto new;
+		}
+
+		if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
+			/* Action was deleted in parallel */
+			rcu_read_unlock();
+			return -ENOENT;
 		}
+
+		if (bind)
+			atomic_inc(&p->tcfa_bindcnt);
+		*a = p;
+
+		rcu_read_unlock();
+
+		return 1;
 	} else {
+		/* Find a slot */
 		*index = 1;
-		*a = NULL;
-		ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
-				    UINT_MAX, GFP_KERNEL);
-		if (!ret)
-			idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY),
-				    *index);
+		max = UINT_MAX;
 	}
+
+new:
+	*a = NULL;
+
+	mutex_lock(&idrinfo->lock);
+	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
+			    GFP_KERNEL);
 	mutex_unlock(&idrinfo->lock);
+
 	return ret;
 }
 EXPORT_SYMBOL(tcf_idr_check_alloc);
-- 
2.40.1



* [PATCH net-next 2/2] net/sched: act_api: skip idr replace on bound actions
  2023-12-05 15:30 [PATCH net-next 0/2] net/sched: optimizations around action binding and init Pedro Tammela
  2023-12-05 15:30 ` [PATCH net-next 1/2] net/sched: act_api: rely on rcu in tcf_idr_check_alloc Pedro Tammela
@ 2023-12-05 15:30 ` Pedro Tammela
  1 sibling, 0 replies; 8+ messages in thread
From: Pedro Tammela @ 2023-12-05 15:30 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, jhs, xiyou.wangcong, jiri,
	marcelo.leitner, vladbu, Pedro Tammela

tcf_idr_insert_many will replace the -EBUSY pointer allocated in
tcf_idr_check_alloc with the real action pointer, exposing it
to all operations. This replace is only needed when the action was
just created (ACT_P_CREATED). For actions that were bound to
(init returned 0), the pointer already resides in the idr, making the
operation a nop.

Even though it's a nop, it's still not a cheap operation: internally
the idr code walks the idr and then replaces the appropriate slot.
So if the action was bound, it's better to skip the idr replace entirely.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
---
 include/net/act_api.h |  2 +-
 net/sched/act_api.c   | 11 ++++++-----
 net/sched/cls_api.c   |  2 +-
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 4ae0580b63ca..ea13e1e4a7c2 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -191,7 +191,7 @@ int tcf_idr_create_from_flags(struct tc_action_net *tn, u32 index,
 			      struct nlattr *est, struct tc_action **a,
 			      const struct tc_action_ops *ops, int bind,
 			      u32 flags);
-void tcf_idr_insert_many(struct tc_action *actions[]);
+void tcf_idr_insert_many(struct tc_action *actions[], int init_res[]);
 void tcf_idr_cleanup(struct tc_action_net *tn, u32 index);
 int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
 			struct tc_action **a, int bind);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 79a044d2ae02..64d35ab46993 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1295,7 +1295,7 @@ static const struct nla_policy tcf_action_policy[TCA_ACT_MAX + 1] = {
 	[TCA_ACT_HW_STATS]	= NLA_POLICY_BITFIELD32(TCA_ACT_HW_STATS_ANY),
 };
 
-void tcf_idr_insert_many(struct tc_action *actions[])
+void tcf_idr_insert_many(struct tc_action *actions[], int init_res[])
 {
 	struct tc_action *a;
 	int i;
@@ -1303,11 +1303,12 @@ void tcf_idr_insert_many(struct tc_action *actions[])
 	tcf_act_for_each_action(i, a, actions) {
 		struct tcf_idrinfo *idrinfo;
 
+		if (init_res[i] == 0) /* Bound */
+			continue;
+
 		idrinfo = a->idrinfo;
 		mutex_lock(&idrinfo->lock);
-		/* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc if
-		 * it is just created, otherwise this is just a nop.
-		 */
+		/* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
 		idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
 		mutex_unlock(&idrinfo->lock);
 	}
@@ -1507,7 +1508,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 	/* We have to commit them all together, because if any error happened in
 	 * between, we could not handle the failure gracefully.
 	 */
-	tcf_idr_insert_many(actions);
+	tcf_idr_insert_many(actions, init_res);
 
 	*attr_size = tcf_action_full_attrs_size(sz);
 	err = i - 1;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1976bd163986..6391b341284e 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3309,7 +3309,7 @@ int tcf_exts_validate_ex(struct net *net, struct tcf_proto *tp, struct nlattr **
 			act->type = exts->type = TCA_OLD_COMPAT;
 			exts->actions[0] = act;
 			exts->nr_actions = 1;
-			tcf_idr_insert_many(exts->actions);
+			tcf_idr_insert_many(exts->actions, init_res);
 		} else if (exts->action && tb[exts->action]) {
 			int err;
 
-- 
2.40.1



* Re: [PATCH net-next 1/2] net/sched: act_api: rely on rcu in tcf_idr_check_alloc
  2023-12-05 15:30 ` [PATCH net-next 1/2] net/sched: act_api: rely on rcu in tcf_idr_check_alloc Pedro Tammela
@ 2023-12-05 18:34   ` Vlad Buslov
  2023-12-05 20:19     ` Pedro Tammela
  0 siblings, 1 reply; 8+ messages in thread
From: Vlad Buslov @ 2023-12-05 18:34 UTC (permalink / raw)
  To: Pedro Tammela
  Cc: netdev, davem, edumazet, kuba, pabeni, jhs, xiyou.wangcong, jiri,
	marcelo.leitner

On Tue 05 Dec 2023 at 12:30, Pedro Tammela <pctammela@mojatatu.com> wrote:
> Instead of relying only on the idrinfo->lock mutex for
> bind/alloc logic, rely on a combination of rcu + mutex + atomics
> to better scale the case where multiple rtnl-less filters are
> binding to the same action object.
>
> Action binding happens when an action index is specified explicitly and
> an action exists which such index exists. Example:

Nit: the first sentence looks mangled, extra 'exists' word and probably
'which' should be 'with'.

>   tc actions add action drop index 1
>   tc filter add ... matchall action drop index 1
>   tc filter add ... matchall action drop index 1
>   tc filter add ... matchall action drop index 1
>   tc filter ls ...
>      filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
>      not_in_hw
>            action order 1: gact action drop
>             random type none pass val 0
>             index 1 ref 4 bind 3
>
>    filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
>      not_in_hw
>            action order 1: gact action drop
>             random type none pass val 0
>             index 1 ref 4 bind 3
>
>    filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
>      not_in_hw
>            action order 1: gact action drop
>             random type none pass val 0
>             index 1 ref 4 bind 3
>
> When no index is specified, as before, grab the mutex and allocate
> in the idr the next available id. In this version, as opposed to before,
> it's simplified to store the -EBUSY pointer instead of the previous
> alloc + replace combination.
>
> When an index is specified, rely on rcu to find if there's an object in
> such index. If there's none, fallback to the above, serializing on the
> mutex and reserving the specified id. If there's one, it can be an -EBUSY
> pointer, in which case we just try again until it's an action, or an action.
> Given the rcu guarantees, the action found could be dead and therefore
> we need to bump the refcount if it's not 0, handling the case it's
> in fact 0.
>
> As bind and the action refcount are already atomics, these increments can
> happen without the mutex protection while many tcf_idr_check_alloc race
> to bind to the same action instance.
>
> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> ---
>  net/sched/act_api.c | 56 +++++++++++++++++++++++++++------------------
>  1 file changed, 34 insertions(+), 22 deletions(-)
>
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index abec5c45b5a4..79a044d2ae02 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -824,43 +824,55 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
>  	struct tcf_idrinfo *idrinfo = tn->idrinfo;
>  	struct tc_action *p;
>  	int ret;
> +	u32 max;
>  
> -again:
> -	mutex_lock(&idrinfo->lock);
>  	if (*index) {
> +again:
> +		rcu_read_lock();
>  		p = idr_find(&idrinfo->action_idr, *index);
> +
>  		if (IS_ERR(p)) {
>  			/* This means that another process allocated
>  			 * index but did not assign the pointer yet.
>  			 */
> -			mutex_unlock(&idrinfo->lock);
> +			rcu_read_unlock();
>  			goto again;
>  		}
>  
> -		if (p) {
> -			refcount_inc(&p->tcfa_refcnt);
> -			if (bind)
> -				atomic_inc(&p->tcfa_bindcnt);
> -			*a = p;
> -			ret = 1;
> -		} else {
> -			*a = NULL;
> -			ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
> -					    *index, GFP_KERNEL);
> -			if (!ret)
> -				idr_replace(&idrinfo->action_idr,
> -					    ERR_PTR(-EBUSY), *index);
> +		if (!p) {
> +			/* Empty slot, try to allocate it */
> +			max = *index;
> +			rcu_read_unlock();
> +			goto new;
> +		}
> +
> +		if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
> +			/* Action was deleted in parallel */
> +			rcu_read_unlock();
> +			return -ENOENT;

Current version doesn't return ENOENT since it is synchronous. You are
now introducing basically a change to UAPI since users of this function
(individual actions) are not prepared to retry on ENOENT and will
propagate the error up the call chain. I guess you need to try to create
a new action with specified index instead.
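
For context, callers of tcf_idr_check_alloc() follow roughly this pattern
(a simplified, hypothetical sketch of a per-action init with illustrative
variable names, not verbatim code from any specific action):

	err = tcf_idr_check_alloc(tn, &index, a, bind);
	if (err < 0)
		return err;	/* an -ENOENT here now goes straight to userspace */
	if (!err) {
		/* index reserved for us: create a new action */
		err = tcf_idr_create_from_flags(tn, index, est, a, ops,
						bind, flags);
		if (err) {
			tcf_idr_cleanup(tn, index);
			return err;
		}
		ret = ACT_P_CREATED;
	} else {
		/* err == 1: bound to an existing action */
		if (bind)
			return 0;
	}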

>  		}
> +
> +		if (bind)
> +			atomic_inc(&p->tcfa_bindcnt);
> +		*a = p;
> +
> +		rcu_read_unlock();
> +
> +		return 1;
>  	} else {
> +		/* Find a slot */
>  		*index = 1;
> -		*a = NULL;
> -		ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
> -				    UINT_MAX, GFP_KERNEL);
> -		if (!ret)
> -			idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY),
> -				    *index);
> +		max = UINT_MAX;
>  	}
> +
> +new:
> +	*a = NULL;
> +
> +	mutex_lock(&idrinfo->lock);
> +	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
> +			    GFP_KERNEL);

What if multiple concurrent tasks didn't find the action by index with
rcu and get here, synchronizing on the idrinfo->lock? It looks like
after the one who got the lock first successfully allocates the index
everyone else will fail (also propagating ENOSPC to the user). I guess
you need some mechanism to account for such case and retry.
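
An illustrative interleaving of that case, assuming two tasks try to bind
to the same not-yet-created index:

	task A                              task B
	------                              ------
	rcu_read_lock(); idr_find() -> NULL
	                                    rcu_read_lock(); idr_find() -> NULL
	mutex_lock(&idrinfo->lock)
	idr_alloc_u32(*index) -> 0
	mutex_unlock(&idrinfo->lock)
	                                    mutex_lock(&idrinfo->lock)
	                                    idr_alloc_u32(*index) -> -ENOSPC
	                                    mutex_unlock(&idrinfo->lock)
	                                    -ENOSPC returned to the user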

>  	mutex_unlock(&idrinfo->lock);
> +
>  	return ret;
>  }
>  EXPORT_SYMBOL(tcf_idr_check_alloc);



* Re: [PATCH net-next 1/2] net/sched: act_api: rely on rcu in tcf_idr_check_alloc
  2023-12-05 18:34   ` Vlad Buslov
@ 2023-12-05 20:19     ` Pedro Tammela
  2023-12-06  9:52       ` Vlad Buslov
  0 siblings, 1 reply; 8+ messages in thread
From: Pedro Tammela @ 2023-12-05 20:19 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: netdev, davem, edumazet, kuba, pabeni, jhs, xiyou.wangcong, jiri,
	marcelo.leitner

On 05/12/2023 15:34, Vlad Buslov wrote:
> On Tue 05 Dec 2023 at 12:30, Pedro Tammela <pctammela@mojatatu.com> wrote:
>> Instead of relying only on the idrinfo->lock mutex for
>> bind/alloc logic, rely on a combination of rcu + mutex + atomics
>> to better scale the case where multiple rtnl-less filters are
>> binding to the same action object.
>>
>> Action binding happens when an action index is specified explicitly and
>> an action exists which such index exists. Example:
> 
> Nit: the first sentence looks mangled, extra 'exists' word and probably
> 'which' should be 'with'.
> 
>>    tc actions add action drop index 1
>>    tc filter add ... matchall action drop index 1
>>    tc filter add ... matchall action drop index 1
>>    tc filter add ... matchall action drop index 1
>>    tc filter ls ...
>>       filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
>>       not_in_hw
>>             action order 1: gact action drop
>>              random type none pass val 0
>>              index 1 ref 4 bind 3
>>
>>     filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
>>       not_in_hw
>>             action order 1: gact action drop
>>              random type none pass val 0
>>              index 1 ref 4 bind 3
>>
>>     filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
>>       not_in_hw
>>             action order 1: gact action drop
>>              random type none pass val 0
>>              index 1 ref 4 bind 3
>>
>> When no index is specified, as before, grab the mutex and allocate
>> in the idr the next available id. In this version, as opposed to before,
>> it's simplified to store the -EBUSY pointer instead of the previous
>> alloc + replace combination.
>>
>> When an index is specified, rely on rcu to find if there's an object in
>> such index. If there's none, fallback to the above, serializing on the
>> mutex and reserving the specified id. If there's one, it can be an -EBUSY
>> pointer, in which case we just try again until it's an action, or an action.
>> Given the rcu guarantees, the action found could be dead and therefore
>> we need to bump the refcount if it's not 0, handling the case it's
>> in fact 0.
>>
>> As bind and the action refcount are already atomics, these increments can
>> happen without the mutex protection while many tcf_idr_check_alloc race
>> to bind to the same action instance.
>>
>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
>> ---
>>   net/sched/act_api.c | 56 +++++++++++++++++++++++++++------------------
>>   1 file changed, 34 insertions(+), 22 deletions(-)
>>
>> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>> index abec5c45b5a4..79a044d2ae02 100644
>> --- a/net/sched/act_api.c
>> +++ b/net/sched/act_api.c
>> @@ -824,43 +824,55 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
>>   	struct tcf_idrinfo *idrinfo = tn->idrinfo;
>>   	struct tc_action *p;
>>   	int ret;
>> +	u32 max;
>>   
>> -again:
>> -	mutex_lock(&idrinfo->lock);
>>   	if (*index) {
>> +again:
>> +		rcu_read_lock();
>>   		p = idr_find(&idrinfo->action_idr, *index);
>> +
>>   		if (IS_ERR(p)) {
>>   			/* This means that another process allocated
>>   			 * index but did not assign the pointer yet.
>>   			 */
>> -			mutex_unlock(&idrinfo->lock);
>> +			rcu_read_unlock();
>>   			goto again;
>>   		}
>>   
>> -		if (p) {
>> -			refcount_inc(&p->tcfa_refcnt);
>> -			if (bind)
>> -				atomic_inc(&p->tcfa_bindcnt);
>> -			*a = p;
>> -			ret = 1;
>> -		} else {
>> -			*a = NULL;
>> -			ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
>> -					    *index, GFP_KERNEL);
>> -			if (!ret)
>> -				idr_replace(&idrinfo->action_idr,
>> -					    ERR_PTR(-EBUSY), *index);
>> +		if (!p) {
>> +			/* Empty slot, try to allocate it */
>> +			max = *index;
>> +			rcu_read_unlock();
>> +			goto new;
>> +		}
>> +
>> +		if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
>> +			/* Action was deleted in parallel */
>> +			rcu_read_unlock();
>> +			return -ENOENT;
> 
> Current version doesn't return ENOENT since it is synchronous. You are
> now introducing basically a change to UAPI since users of this function
> (individual actions) are not prepared to retry on ENOENT and will
> propagate the error up the call chain. I guess you need to try to create
> a new action with specified index instead.

I see.
So you are saying that in the case where action foo is deleted and a 
binding in parallel observes the deleted action, it should fallback into 
trying to allocate the index.

We could goto again and hope that idr_find will observe the idr index 
being freed, in which case it would fall back into action allocation if 
it does or simply go via the same path as before (jumping to 'again').

I don't see much problems here, it seems to converge in this scenario
as it eventually transforms into race for action allocation (more below) 
if you have an unfortunate delete with many bindings in flight.
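
Something along these lines, i.e. a rough, untested sketch of that idea on
top of this patch:

	if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
		/* Action is being deleted in parallel: retry the lookup.
		 * Either idr_find() eventually sees the slot freed and we
		 * fall through to allocation, or we find the replacement
		 * action and bind to it.
		 */
		rcu_read_unlock();
		goto again;
	}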

> 
>>   		}
>> +
>> +		if (bind)
>> +			atomic_inc(&p->tcfa_bindcnt);
>> +		*a = p;
>> +
>> +		rcu_read_unlock();
>> +
>> +		return 1;
>>   	} else {
>> +		/* Find a slot */
>>   		*index = 1;
>> -		*a = NULL;
>> -		ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
>> -				    UINT_MAX, GFP_KERNEL);
>> -		if (!ret)
>> -			idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY),
>> -				    *index);
>> +		max = UINT_MAX;
>>   	}
>> +
>> +new:
>> +	*a = NULL;
>> +
>> +	mutex_lock(&idrinfo->lock);
>> +	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
>> +			    GFP_KERNEL);
> 
> What if multiple concurrent tasks didn't find the action by index with
> rcu and get here, synchronizing on the idrinfo->lock? It looks like
> after the one who got the lock first successfully allocates the index
> everyone else will fail (also propagating ENOSPC to the user). 

Correct

> I guess you need some mechanism to account for such case and retry.

Ok, so if I'm binding and it's observed a free index, which means "try 
to allocate" and I get a ENOSPC after jumping to new, try again but this 
time binding into the allocated action.

In this scenario when we come back to 'again' we will wait until -EBUSY 
is replaced with the real pointer. Seems like a big enough window that 
any race for allocating from binding would most probably end up in this 
contention loop.
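
Roughly, as another untested sketch (exact placement of the check aside):

	mutex_lock(&idrinfo->lock);
	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
			    GFP_KERNEL);
	mutex_unlock(&idrinfo->lock);

	if (ret == -ENOSPC && *index)
		goto again;	/* lost the allocation race, bind to the winner */

	return ret;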

However I think when we have these two retry mechanisms there's an 
extremely small window for an infinite loop if an action delete is timed 
just right, in between the action pointer is found and when we grab the 
tcfa_refcnt.

	idr_find (pointer)
	tcfa_refcnt (0)  <-------|
	again:                   |
	idr_find (free index!)   |
	new:                     |
	idr_alloc_u32 (ENOSPC)   |
	again:                   |
	idr_find (EBUSY)         |
	again:                   |
	idr_find (pointer)       |
	<evil delete happens>    |
	------->>>>--------------|

Another potential problem is that this will race with non-binding 
actions. So if the ENOSPC was actually from another unrelated action. A 
practical example would be a race between a binding to an 'action drop 
index 1' and an 'action ok' allocation. Actually it's a general problem 
and not particular to this case here but it seems like we could be 
amplifying it.

I'm conflicted here. If I were to choose one of the two, I would pick 
the action respawning as to me it seems to converge much quicker and 
removes the uapi change (my bad! :).

As for usability, if all of a sudden there's a huge influx of ENOSPC 
errors because users are abusing 'tc filter add ... action index 1 ...' 
in parallel _before_ actually creating the action object, the fix is to just:
tc actions add action index 1 ...
tc filter add ...

As tc has always supported


* Re: [PATCH net-next 1/2] net/sched: act_api: rely on rcu in tcf_idr_check_alloc
  2023-12-05 20:19     ` Pedro Tammela
@ 2023-12-06  9:52       ` Vlad Buslov
  2023-12-08 21:07         ` Pedro Tammela
  0 siblings, 1 reply; 8+ messages in thread
From: Vlad Buslov @ 2023-12-06  9:52 UTC (permalink / raw)
  To: Pedro Tammela
  Cc: netdev, davem, edumazet, kuba, pabeni, jhs, xiyou.wangcong, jiri,
	marcelo.leitner

On Tue 05 Dec 2023 at 17:19, Pedro Tammela <pctammela@mojatatu.com> wrote:
> On 05/12/2023 15:34, Vlad Buslov wrote:
>> On Tue 05 Dec 2023 at 12:30, Pedro Tammela <pctammela@mojatatu.com> wrote:
>>> Instead of relying only on the idrinfo->lock mutex for
>>> bind/alloc logic, rely on a combination of rcu + mutex + atomics
>>> to better scale the case where multiple rtnl-less filters are
>>> binding to the same action object.
>>>
>>> Action binding happens when an action index is specified explicitly and
>>> an action exists which such index exists. Example:
>> Nit: the first sentence looks mangled, extra 'exists' word and probably
>> 'which' should be 'with'.
>> 
>>>    tc actions add action drop index 1
>>>    tc filter add ... matchall action drop index 1
>>>    tc filter add ... matchall action drop index 1
>>>    tc filter add ... matchall action drop index 1
>>>    tc filter ls ...
>>>       filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
>>>       not_in_hw
>>>             action order 1: gact action drop
>>>              random type none pass val 0
>>>              index 1 ref 4 bind 3
>>>
>>>     filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
>>>       not_in_hw
>>>             action order 1: gact action drop
>>>              random type none pass val 0
>>>              index 1 ref 4 bind 3
>>>
>>>     filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
>>>       not_in_hw
>>>             action order 1: gact action drop
>>>              random type none pass val 0
>>>              index 1 ref 4 bind 3
>>>
>>> When no index is specified, as before, grab the mutex and allocate
>>> in the idr the next available id. In this version, as opposed to before,
>>> it's simplified to store the -EBUSY pointer instead of the previous
>>> alloc + replace combination.
>>>
>>> When an index is specified, rely on rcu to find if there's an object in
>>> such index. If there's none, fallback to the above, serializing on the
>>> mutex and reserving the specified id. If there's one, it can be an -EBUSY
>>> pointer, in which case we just try again until it's an action, or an action.
>>> Given the rcu guarantees, the action found could be dead and therefore
>>> we need to bump the refcount if it's not 0, handling the case it's
>>> in fact 0.
>>>
>>> As bind and the action refcount are already atomics, these increments can
>>> happen without the mutex protection while many tcf_idr_check_alloc race
>>> to bind to the same action instance.
>>>
>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
>>> ---
>>>   net/sched/act_api.c | 56 +++++++++++++++++++++++++++------------------
>>>   1 file changed, 34 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>>> index abec5c45b5a4..79a044d2ae02 100644
>>> --- a/net/sched/act_api.c
>>> +++ b/net/sched/act_api.c
>>> @@ -824,43 +824,55 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
>>>   	struct tcf_idrinfo *idrinfo = tn->idrinfo;
>>>   	struct tc_action *p;
>>>   	int ret;
>>> +	u32 max;
>>>   -again:
>>> -	mutex_lock(&idrinfo->lock);
>>>   	if (*index) {
>>> +again:
>>> +		rcu_read_lock();
>>>   		p = idr_find(&idrinfo->action_idr, *index);
>>> +
>>>   		if (IS_ERR(p)) {
>>>   			/* This means that another process allocated
>>>   			 * index but did not assign the pointer yet.
>>>   			 */
>>> -			mutex_unlock(&idrinfo->lock);
>>> +			rcu_read_unlock();
>>>   			goto again;
>>>   		}
>>>   -		if (p) {
>>> -			refcount_inc(&p->tcfa_refcnt);
>>> -			if (bind)
>>> -				atomic_inc(&p->tcfa_bindcnt);
>>> -			*a = p;
>>> -			ret = 1;
>>> -		} else {
>>> -			*a = NULL;
>>> -			ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
>>> -					    *index, GFP_KERNEL);
>>> -			if (!ret)
>>> -				idr_replace(&idrinfo->action_idr,
>>> -					    ERR_PTR(-EBUSY), *index);
>>> +		if (!p) {
>>> +			/* Empty slot, try to allocate it */
>>> +			max = *index;
>>> +			rcu_read_unlock();
>>> +			goto new;
>>> +		}
>>> +
>>> +		if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
>>> +			/* Action was deleted in parallel */
>>> +			rcu_read_unlock();
>>> +			return -ENOENT;
>> Current version doesn't return ENOENT since it is synchronous. You are
>> now introducing basically a change to UAPI since users of this function
>> (individual actions) are not prepared to retry on ENOENT and will
>> propagate the error up the call chain. I guess you need to try to create
>> a new action with specified index instead.
>
> I see.
> So you are saying that in the case where action foo is deleted and a binding in
> parallel observes the deleted action, it should fallback into trying to allocate
> the index.

Correct.

>
> We could goto again and hope that idr_find will observe the idr index being
> freed, in which case it would fall back into action allocation if it does or
> simply go via the same path as before (jumping to 'again').
>
> I don't see much problems here, it seems to converge in this scenario
> as it eventually transforms into race for action allocation (more below) if you
> have an unfortunate delete with many bindings in flight.
>
>> 
>>>   		}
>>> +
>>> +		if (bind)
>>> +			atomic_inc(&p->tcfa_bindcnt);
>>> +		*a = p;
>>> +
>>> +		rcu_read_unlock();
>>> +
>>> +		return 1;
>>>   	} else {
>>> +		/* Find a slot */
>>>   		*index = 1;
>>> -		*a = NULL;
>>> -		ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
>>> -				    UINT_MAX, GFP_KERNEL);
>>> -		if (!ret)
>>> -			idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY),
>>> -				    *index);
>>> +		max = UINT_MAX;
>>>   	}
>>> +
>>> +new:
>>> +	*a = NULL;
>>> +
>>> +	mutex_lock(&idrinfo->lock);
>>> +	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
>>> +			    GFP_KERNEL);
>> What if multiple concurrent tasks didn't find the action by index with
>> rcu and get here, synchronizing on the idrinfo->lock? It looks like
>> after the one who got the lock first successfully allocates the index
>> everyone else will fail (also propagating ENOSPC to the user). 
>
> Correct
>
>> I guess you need some mechanism to account for such case and retry.
>
> Ok, so if I'm binding and it's observed a free index, which means "try to
> allocate" and I get a ENOSPC after jumping to new, try again but this time
> binding into the allocated action.
>
> In this scenario when we come back to 'again' we will wait until -EBUSY is
> replaced with the real pointer. Seems like a big enough window that any race for
> allocating from binding would most probably end up in this contention loop.
>
> However I think when we have these two retry mechanisms there's an extremely
> small window for an infinite loop if an action delete is timed just right, in
> between the action pointer is found and when we grab the tcfa_refcnt.
>
> 	idr_find (pointer)
> 	tcfa_refcnt (0)  <-------|
> 	again:                   |
> 	idr_find (free index!)   |
> 	new:                     |
> 	idr_alloc_u32 (ENOSPC)   |
> 	again:                   |
> 	idr_find (EBUSY)         |
> 	again:                   |
> 	idr_find (pointer)       |
> 	<evil delete happens>    |
> 	------->>>>--------------|

I'm not sure I'm following. Why would this sequence cause infinite loop?

>
> Another potential problem is that this will race with non-binding actions. So
> if the ENOSPC was actually from another unrelated action. A practical example
> would be a race between a binding to an 'action drop index 1' and an 'action ok'
> allocation. Actually it's a general problem and not particular to this case here
> but it seems like we could be amplifying it.
>
> I'm conflicted here. If I were to choose one of the two, I would pick the action
> respawning as to me it seems to converge much quicker and removes the uapi change
> (my bad! :).
>
> As for usability, if all of a sudden there's a huge influx of ENOSPC errors
> because users are abusing 'tc filter add ... action index 1 ...' in parallel
> _before_ actually creating the action object, the fix is to just:
> tc actions add action index 1 ...
> tc filter add ...
>
> As tc has always supported



* Re: [PATCH net-next 1/2] net/sched: act_api: rely on rcu in tcf_idr_check_alloc
  2023-12-06  9:52       ` Vlad Buslov
@ 2023-12-08 21:07         ` Pedro Tammela
  2023-12-11 16:10           ` Vlad Buslov
  0 siblings, 1 reply; 8+ messages in thread
From: Pedro Tammela @ 2023-12-08 21:07 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: netdev, davem, edumazet, kuba, pabeni, jhs, xiyou.wangcong, jiri,
	marcelo.leitner

On 06/12/2023 06:52, Vlad Buslov wrote:
>> Ok, so if I'm binding and it's observed a free index, which means "try to
>> allocate" and I get a ENOSPC after jumping to new, try again but this time
>> binding into the allocated action.
>>
>> In this scenario when we come back to 'again' we will wait until -EBUSY is
>> replaced with the real pointer. Seems like a big enough window that any race for
>> allocating from binding would most probably end up in this contention loop.
>>
>> However I think when we have these two retry mechanisms there's an extremely
>> small window for an infinite loop if an action delete is timed just right, in
>> between the action pointer is found and when we grab the tcfa_refcnt.
>>
>> 	idr_find (pointer)
>> 	tcfa_refcnt (0)  <-------|
>> 	again:                   |
>> 	idr_find (free index!)   |
>> 	new:                     |
>> 	idr_alloc_u32 (ENOSPC)   |
>> 	again:                   |
>> 	idr_find (EBUSY)         |
>> 	again:                   |
>> 	idr_find (pointer)       |
>> 	<evil delete happens>    |
>> 	------->>>>--------------|
> 
> I'm not sure I'm following. Why would this sequence cause infinite loop?
> 

Perhaps I was being overly paranoid. Taking a look again, it seems that 
not only would an evil delete be needed, but the -EBUSY pointer would 
also have to stay in the action idr for a long time. I see it now, it 
looks like it converges.

I was wondering: instead of looping on 'again:' in either scenario you 
presented, what if we return -EAGAIN and let the filter 
infrastructure retry it under rtnl_lock()? At least that will give 
enough breathing room for a call to schedule() to kick in if needed.
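
Untested sketch of where the -EAGAIN could come from, in the two retry spots:

	/* lookup side: action deleted in parallel */
	if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
		rcu_read_unlock();
		return -EAGAIN;	/* caller replays under rtnl_lock() */
	}

	/* allocation side: lost the race for an explicit index */
	if (ret == -ENOSPC && *index)
		ret = -EAGAIN;	/* caller replays and binds to the winner */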


* Re: [PATCH net-next 1/2] net/sched: act_api: rely on rcu in tcf_idr_check_alloc
  2023-12-08 21:07         ` Pedro Tammela
@ 2023-12-11 16:10           ` Vlad Buslov
  0 siblings, 0 replies; 8+ messages in thread
From: Vlad Buslov @ 2023-12-11 16:10 UTC (permalink / raw)
  To: Pedro Tammela
  Cc: netdev, davem, edumazet, kuba, pabeni, jhs, xiyou.wangcong, jiri,
	marcelo.leitner


On Fri 08 Dec 2023 at 18:07, Pedro Tammela <pctammela@mojatatu.com> wrote:
> On 06/12/2023 06:52, Vlad Buslov wrote:
>>> Ok, so if I'm binding and it's observed a free index, which means "try to
>>> allocate" and I get a ENOSPC after jumping to new, try again but this time
>>> binding into the allocated action.
>>>
>>> In this scenario when we come back to 'again' we will wait until -EBUSY is
>>> replaced with the real pointer. Seems like a big enough window that any race for
>>> allocating from binding would most probably end up in this contention loop.
>>>
>>> However I think when we have these two retry mechanisms there's an extremely
>>> small window for an infinite loop if an action delete is timed just right, in
>>> between the action pointer is found and when we grab the tcfa_refcnt.
>>>
>>> 	idr_find (pointer)
>>> 	tcfa_refcnt (0)  <-------|
>>> 	again:                   |
>>> 	idr_find (free index!)   |
>>> 	new:                     |
>>> 	idr_alloc_u32 (ENOSPC)   |
>>> 	again:                   |
>>> 	idr_find (EBUSY)         |
>>> 	again:                   |
>>> 	idr_find (pointer)       |
>>> 	<evil delete happens>    |
>>> 	------->>>>--------------|
>> I'm not sure I'm following. Why would this sequence cause infinite loop?
>> 
>
> Perhaps I was being overly paranoid. Taking a look again, it seems that not only
> would an evil delete be needed, but the -EBUSY pointer would also have to stay
> in the action idr for a long time. I see it now, it looks like it converges.
>
> I was wondering: instead of looping on 'again:' in either scenario you
> presented, what if we return -EAGAIN and let the filter infrastructure retry it
> under rtnl_lock()? At least that will give enough breathing room for a call to
> schedule() to kick in if needed.

Sounds good, but you will need to ensure that both act and cls api
implementations properly retry on EAGAIN (looks like they do, but I only
gave it a cursory glance).
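
For reference, both entry points already have a replay loop for this; very
roughly, and paraphrasing from memory rather than quoting exact code, the
shape in tc_ctl_action and tc_new_tfilter is:

	replay:
		ret = ...;	/* init/change the actions or the filter */
		if (ret == -EAGAIN)
			goto replay;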


