* ipsec tunnel performance degrade
@ 2019-04-22 14:29 Vakul Garg
  2019-04-22 17:45 ` Florian Westphal
  0 siblings, 1 reply; 18+ messages in thread
From: Vakul Garg @ 2019-04-22 14:29 UTC (permalink / raw)
  To: netdev

Hi

Since kernel 4.9, I have been seeing a more than 50% degradation in ipsec performance on my arm64 based systems (with an on-chip crypto accelerator).
(We use only LTS kernels.) My understanding is that this is mainly due to the xfrm flow cache removal in version 4.12.

I am not sure whether any subsequent work could recover the lost performance.
With kernel 4.19, I see that xfrm_state_find() is taking a lot of cpu (more than 15%).
Further, perf shows that a lot of atomic primitives, such as __ll_sc___cmpxchg_case_mb_4() and
__ll_sc_atomic_sub_return(), are being invoked. On a 16-core system, they consume more than 30% of cpu.

These are getting called mainly from xfrm_lookup_with_ifid().

Can someone please help?

Regards

Vakul

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: ipsec tunnel performance degrade
  2019-04-22 14:29 ipsec tunnel performance degrade Vakul Garg
@ 2019-04-22 17:45 ` Florian Westphal
  2019-04-23  2:36   ` Vakul Garg
  0 siblings, 1 reply; 18+ messages in thread
From: Florian Westphal @ 2019-04-22 17:45 UTC (permalink / raw)
  To: Vakul Garg; +Cc: netdev

Vakul Garg <vakul.garg@nxp.com> wrote:
> Post kernel 4.9, I am experiencing more than 50% degrade in ipsec performance on my arm64 based systems (with onchip crypto accelerator).
> (We use only lts kernels). My understanding is that it is mainly due to xfrm flow cache removal in version 4.12.

Yes, likely.

> I am not sure whether any subsequent work could recover the lost performance.
> With kernel 4.19, I see that xfrm_state_find() is taking a lot of cpu (more than 15%).

Can you share details about the setup?

I.e., how many policies, states etc.?
Do you use xfrm interfaces?

> Further, perf show that a lot of atomic primitives such as __ll_sc___cmpxchg_case_mb_4(), 
> __ll_sc_atomic_sub_return() are being invoked. On 16 core system, they consume more than 30% of cpu.

That's not good; perhaps we should look at pcpu refcounts for the xfrm
state structs.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: ipsec tunnel performance degrade
  2019-04-22 17:45 ` Florian Westphal
@ 2019-04-23  2:36   ` Vakul Garg
  2019-04-23  5:56     ` Florian Westphal
  0 siblings, 1 reply; 18+ messages in thread
From: Vakul Garg @ 2019-04-23  2:36 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev



> -----Original Message-----
> From: Florian Westphal <fw@strlen.de>
> Sent: Monday, April 22, 2019 11:16 PM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: netdev@vger.kernel.org
> Subject: Re: ipsec tunnel performance degrade
> 
> Vakul Garg <vakul.garg@nxp.com> wrote:
> > Post kernel 4.9, I am experiencing more than 50% degrade in ipsec
> performance on my arm64 based systems (with onchip crypto accelerator).
> > (We use only lts kernels). My understanding is that it is mainly due to xfrm
> flow cache removal in version 4.12.
> 
> Yes, likely.
> 
> > I am not sure whether any subsequent work could recover the lost
> performance.
> > With kernel 4.19, I see that xfrm_state_find() is taking a lot of cpu (more
> than 15%).
> 
> Can you share details about the setup?
> 
> I.e., how many policies, states etc.?

My setup has 2 ethernet interfaces. I am creating 64 ipsec tunnels for encapsulation.
I use 64 policies and 64 SAs.

> Do you use xfrm interfaces?

I don't think so. I use setkey to create policies/SAs.
Can you please give me a hint about what xfrm interfaces are?

> 
> > Further, perf show that a lot of atomic primitives such as
> > __ll_sc___cmpxchg_case_mb_4(),
> > __ll_sc_atomic_sub_return() are being invoked. On 16 core system, they
> consume more than 30% of cpu.
> 
> Thats not good, perhaps we should look at pcpu refcounts for the xfrm state
> structs.

What other data can I collect?
Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: ipsec tunnel performance degrade
  2019-04-23  2:36   ` Vakul Garg
@ 2019-04-23  5:56     ` Florian Westphal
  2019-04-23 15:04       ` Vakul Garg
  0 siblings, 1 reply; 18+ messages in thread
From: Florian Westphal @ 2019-04-23  5:56 UTC (permalink / raw)
  To: Vakul Garg; +Cc: Florian Westphal, netdev

Vakul Garg <vakul.garg@nxp.com> wrote:
> > Do you use xfrm interfaces?
> 
> I don't think so. I use setkey to create policies/SAs.
> Can you please give me some hint about it?

Then you're not using ipsec interfaces.

> > > Further, perf show that a lot of atomic primitives such as
> > > __ll_sc___cmpxchg_case_mb_4(),
> > > __ll_sc_atomic_sub_return() are being invoked. On 16 core system, they
> > consume more than 30% of cpu.
> > 
> > Thats not good, perhaps we should look at pcpu refcounts for the xfrm state
> > structs.
> 
> What else data can I collect?

I have no further suggestions.  I don't know yet when I will have time
to look into refcnt optimizations.

The idea would be to make them work the same way as dev_hold()/dev_put().

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: ipsec tunnel performance degrade
  2019-04-23  5:56     ` Florian Westphal
@ 2019-04-23 15:04       ` Vakul Garg
  2019-04-23 16:25         ` Florian Westphal
  0 siblings, 1 reply; 18+ messages in thread
From: Vakul Garg @ 2019-04-23 15:04 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev



> -----Original Message-----
> From: Florian Westphal <fw@strlen.de>
> Sent: Tuesday, April 23, 2019 11:27 AM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> Subject: Re: ipsec tunnel performance degrade
> 
> Vakul Garg <vakul.garg@nxp.com> wrote:
> > > Do you use xfrm interfaces?
> >
> > I don't think so. I use setkey to create policies/SAs.
> > Can you please give me some hint about it?
> 
> Then you're not using ipsec interfaces.
> 
Instead of creating policies/SAs with setkey, I shifted to using the 'ip xfrm' commands.
With this, I get a good performance improvement (20% better in one case).
Now the xfrm_state_find() function is not taking much cpu.

Is this what you meant by 'xfrm interfaces'?

> > > > Further, perf show that a lot of atomic primitives such as
> > > > __ll_sc___cmpxchg_case_mb_4(),
> > > > __ll_sc_atomic_sub_return() are being invoked. On 16 core system,
> > > > they
> > > consume more than 30% of cpu.
> > >
> > > Thats not good, perhaps we should look at pcpu refcounts for the
> > > xfrm state structs.
> >
> > What else data can I collect?
> 
> I have no further suggestions.  I don't know yet when I will have time to look
> into refcnt optimizations.
> 
> Idea would be to make them same as dev_hold/put.

I will try to address it. Can you provide some guidance? Thanks.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: ipsec tunnel performance degrade
  2019-04-23 15:04       ` Vakul Garg
@ 2019-04-23 16:25         ` Florian Westphal
  2019-04-23 17:27           ` David Miller
  2019-04-24 10:40           ` [RFC HACK] xfrm: make state refcounting percpu Florian Westphal
  0 siblings, 2 replies; 18+ messages in thread
From: Florian Westphal @ 2019-04-23 16:25 UTC (permalink / raw)
  To: Vakul Garg; +Cc: Florian Westphal, netdev

Vakul Garg <vakul.garg@nxp.com> wrote:
> > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > > Do you use xfrm interfaces?
> > >
> > > I don't think so. I use setkey to create policies/SAs.
> > > Can you please give me some hint about it?
> > 
> > Then you're not using ipsec interfaces.
> > 
> Instead of creating policies/SA using setkey, I shifted to using 'ip xfrm' commands.
> With this, I get good performance improvement (20% better in one case).
> Now xfrm_state_find() function is not taking much cpu.

That's very strange; I have no explanation for this.
It would be good to find the cause: PF_KEY and 'ip xfrm'
are just different control plane frontends, so they should have no impact
on data path performance.

> Is this what you meant by 'xfrm interfaces'?

No, I meant the xfrm network interfaces that were added recently; see
net/xfrm/xfrm_interface.c.

> > I have no further suggestions.  I don't know yet when I will have time to look
> > into refcnt optimizations.
> > 
> > Idea would be to make them same as dev_hold/put.
> 
> I will try to address it. Can you provide some guidance? Thanks.

I will try to put together an ugly POC hack tomorrow that should
illustrate the general idea (and the caveats/bugs that need fixing).

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: ipsec tunnel performance degrade
  2019-04-23 16:25         ` Florian Westphal
@ 2019-04-23 17:27           ` David Miller
  2019-04-24  7:32             ` Vakul Garg
  2019-04-24 10:40           ` [RFC HACK] xfrm: make state refcounting percpu Florian Westphal
  1 sibling, 1 reply; 18+ messages in thread
From: David Miller @ 2019-04-23 17:27 UTC (permalink / raw)
  To: fw; +Cc: vakul.garg, netdev

From: Florian Westphal <fw@strlen.de>
Date: Tue, 23 Apr 2019 18:25:21 +0200

> Vakul Garg <vakul.garg@nxp.com> wrote:
>> > Vakul Garg <vakul.garg@nxp.com> wrote:
>> > > > Do you use xfrm interfaces?
>> > >
>> > > I don't think so. I use setkey to create policies/SAs.
>> > > Can you please give me some hint about it?
>> > 
>> > Then you're not using ipsec interfaces.
>> > 
>> Instead of creating policies/SA using setkey, I shifted to using 'ip xfrm' commands.
>> With this, I get good performance improvement (20% better in one case).
>> Now xfrm_state_find() function is not taking much cpu.
> 
> Thats very strange, I have no explanation for this.
> It would be good to find the cause, PF_KEY and 'ip xfrm'
> are just different control plane frontends, they should have no impact
> on data path performance.

I wonder if the masks and/or prefixes that end up being used are subtly
different for some reason.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: ipsec tunnel performance degrade
  2019-04-23 17:27           ` David Miller
@ 2019-04-24  7:32             ` Vakul Garg
  0 siblings, 0 replies; 18+ messages in thread
From: Vakul Garg @ 2019-04-24  7:32 UTC (permalink / raw)
  To: David Miller, fw; +Cc: netdev



> -----Original Message-----
> From: David Miller <davem@davemloft.net>
> Sent: Tuesday, April 23, 2019 10:57 PM
> To: fw@strlen.de
> Cc: Vakul Garg <vakul.garg@nxp.com>; netdev@vger.kernel.org
> Subject: Re: ipsec tunnel performance degrade
> 
> From: Florian Westphal <fw@strlen.de>
> Date: Tue, 23 Apr 2019 18:25:21 +0200
> 
> > Vakul Garg <vakul.garg@nxp.com> wrote:
> >> > Vakul Garg <vakul.garg@nxp.com> wrote:
> >> > > > Do you use xfrm interfaces?
> >> > >
> >> > > I don't think so. I use setkey to create policies/SAs.
> >> > > Can you please give me some hint about it?
> >> >
> >> > Then you're not using ipsec interfaces.
> >> >
> >> Instead of creating policies/SA using setkey, I shifted to using 'ip xfrm'
> commands.
> >> With this, I get good performance improvement (20% better in one case).
> >> Now xfrm_state_find() function is not taking much cpu.
> >
> > Thats very strange, I have no explanation for this.
> > It would be good to find the cause, PF_KEY and 'ip xfrm'
> > are just different control plane frontends, they should have no impact
> > on data path performance.
> 
> I wonder if the masks and/or prefixes that end up being used are subtly
> different for some reason.

Thanks for the hint.
I figured out why it made a difference.

Basically, with the 'ip xfrm' commands we passed the 'reqid' parameter to
create a linking between policies and states. With 'setkey' we didn't do the same.
Adding the '/unique:[n]' keyword when creating policies yielded the same performance with both
the setkey and 'ip xfrm' based interfaces.
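For readers hitting the same issue, the two equivalent configurations can be sketched roughly as follows (addresses, SPI and key are placeholders, and the exact option spellings should be checked against the setkey(8) and ip-xfrm(8) man pages):

```shell
# setkey variant: '/unique:1' on the policy (with '-u 1' on the SA) ties
# the policy to a single SA, mirroring what 'reqid' does below.
setkey -c <<'EOF'
add 192.0.2.1 192.0.2.2 esp 0x100 -u 1
	-E aes-cbc 0x00112233445566778899aabbccddeeff;
spdadd 10.0.0.0/24 10.0.1.0/24 any -P out ipsec
	esp/tunnel/192.0.2.1-192.0.2.2/unique:1;
EOF

# 'ip xfrm' variant: the same policy/state linkage expressed with 'reqid 1'.
ip xfrm state add src 192.0.2.1 dst 192.0.2.2 proto esp spi 0x100 reqid 1 \
	mode tunnel enc 'cbc(aes)' 0x00112233445566778899aabbccddeeff
ip xfrm policy add src 10.0.0.0/24 dst 10.0.1.0/24 dir out \
	tmpl src 192.0.2.1 dst 192.0.2.2 proto esp reqid 1 mode tunnel
```

With the unique/reqid linkage in place, the policy resolves directly to its SA instead of forcing a full search in xfrm_state_find().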


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC HACK] xfrm: make state refcounting percpu
  2019-04-23 16:25         ` Florian Westphal
  2019-04-23 17:27           ` David Miller
@ 2019-04-24 10:40           ` Florian Westphal
  2019-04-24 11:38             ` Eric Dumazet
  2019-05-03  6:07             ` Steffen Klassert
  1 sibling, 2 replies; 18+ messages in thread
From: Florian Westphal @ 2019-04-24 10:40 UTC (permalink / raw)
  To: vakul.garg; +Cc: steffen.klassert, netdev, Florian Westphal

I'm not sure this is a good idea to begin with: the refcount
is right next to the state spinlock, which is taken for both tx and rx ops,
and this complicates debugging quite a bit.

Also expect UAF or refcount bugs, especially xfrm_states that won't get freed.

In case someone wants to experiment with this, I've placed it here:
https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=xfrm_pcpu_gc_04

Summary of old/new locking model:

Old locking rules:
 From control plane:
   - new states get refcount of 1
   - calls xfrm_state_put() when done
   - calls xfrm_state_hold() to make sure
     state won't go away

 From data plane:
  - calls xfrm_state_hold_rcu, which will tell when
    refcount is already 0 (i.e, state is being destroyed)
  - calls xfrm_state_put when done

 From gc worker:
  - steals current gc list head
  - call synchronize_rcu
  - can free all entries in the list after this

The last xfrm_state_put() observes the refcount transition to 0,
which places the xfrm_state on the gc list, where it's picked
up (and freed) by the xfrm state gc worker.

New locking rules:
 From control plane:
   - (NEW) must *explicitly* call xfrm_state_delete()
     when state should be removed
   - otherwise same as above
 From data plane:
   - calls xfrm_state_hold() to get ref,
     xfrm_state_put when done
 From gc worker:
  - steals current gc list head
  - call synchronize_rcu
  - entries on list MAY STILL BE IN USE, so
    * must check the refcount for each (expensive, as it
      means summing the pcpu refcounts of every state)
    * needs to place entries that it can't free yet
      back on the gc list and resched itself

XXX:
- survives minimal testing (netns + veth + esp tunnel mode)
- refcount for newly allocated states should probably be
  changed to 0, and only do 0 -> 1 increment right before
  state is published (added to state hashes).
- probably misses several error handling spots that now would
  have to call xfrm_state_delete() rather than xfrm_state_put().

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/xfrm.h    | 20 ++--------
 net/key/af_key.c      |  4 +-
 net/xfrm/xfrm_state.c | 90 ++++++++++++++++++++++++-------------------
 net/xfrm/xfrm_user.c  |  2 +-
 4 files changed, 57 insertions(+), 59 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 728d4e7b82c0..f9be94a3a029 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -153,7 +153,7 @@ struct xfrm_state {
 	struct hlist_node	bysrc;
 	struct hlist_node	byspi;
 
-	refcount_t		refcnt;
+	int __percpu		*pcpu_refcnt;
 	spinlock_t		lock;
 
 	struct xfrm_id		id;
@@ -790,26 +790,14 @@ static inline void xfrm_pols_put(struct xfrm_policy **pols, int npols)
 
 void __xfrm_state_destroy(struct xfrm_state *, bool);
 
-static inline void __xfrm_state_put(struct xfrm_state *x)
+static inline void xfrm_state_hold(struct xfrm_state *x)
 {
-	refcount_dec(&x->refcnt);
+	this_cpu_inc(*x->pcpu_refcnt);
 }
 
 static inline void xfrm_state_put(struct xfrm_state *x)
 {
-	if (refcount_dec_and_test(&x->refcnt))
-		__xfrm_state_destroy(x, false);
-}
-
-static inline void xfrm_state_put_sync(struct xfrm_state *x)
-{
-	if (refcount_dec_and_test(&x->refcnt))
-		__xfrm_state_destroy(x, true);
-}
-
-static inline void xfrm_state_hold(struct xfrm_state *x)
-{
-	refcount_inc(&x->refcnt);
+	this_cpu_dec(*x->pcpu_refcnt);
 }
 
 static inline bool addr_match(const void *token1, const void *token2,
diff --git a/net/key/af_key.c b/net/key/af_key.c
index 5651c29cb5bd..21d4776b5431 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -1303,7 +1303,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
 
 out:
 	x->km.state = XFRM_STATE_DEAD;
-	xfrm_state_put(x);
+	xfrm_state_delete(x);
 	return ERR_PTR(err);
 }
 
@@ -1525,7 +1525,7 @@ static int pfkey_add(struct sock *sk, struct sk_buff *skb, const struct sadb_msg
 
 	if (err < 0) {
 		x->km.state = XFRM_STATE_DEAD;
-		__xfrm_state_put(x);
+		xfrm_state_delete(x);
 		goto out;
 	}
 
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 8951c09ae9e9..4614df21ff5c 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -31,6 +31,8 @@
 #define xfrm_state_deref_prot(table, net) \
 	rcu_dereference_protected((table), lockdep_is_held(&(net)->xfrm.xfrm_state_lock))
 
+#define XFRM_GC_WAIT_JIFFIES	2
+
 static void xfrm_state_gc_task(struct work_struct *work);
 
 /* Each xfrm_state may be linked to two tables:
@@ -44,14 +46,9 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
 static __read_mostly seqcount_t xfrm_state_hash_generation = SEQCNT_ZERO(xfrm_state_hash_generation);
 static struct kmem_cache *xfrm_state_cache __ro_after_init;
 
-static DECLARE_WORK(xfrm_state_gc_work, xfrm_state_gc_task);
+static DECLARE_DELAYED_WORK(xfrm_state_gc_work, xfrm_state_gc_task);
 static HLIST_HEAD(xfrm_state_gc_list);
 
-static inline bool xfrm_state_hold_rcu(struct xfrm_state __rcu *x)
-{
-	return refcount_inc_not_zero(&x->refcnt);
-}
-
 static inline unsigned int xfrm_dst_hash(struct net *net,
 					 const xfrm_address_t *daddr,
 					 const xfrm_address_t *saddr,
@@ -472,6 +469,7 @@ static const struct xfrm_mode *xfrm_get_mode(unsigned int encap, int family)
 
 void xfrm_state_free(struct xfrm_state *x)
 {
+	free_percpu(x->pcpu_refcnt);
 	kmem_cache_free(xfrm_state_cache, x);
 }
 EXPORT_SYMBOL(xfrm_state_free);
@@ -499,6 +497,15 @@ static void ___xfrm_state_destroy(struct xfrm_state *x)
 	xfrm_state_free(x);
 }
 
+static int xfrm_state_refcnt_read(const struct xfrm_state *x)
+{
+	int i, refcnt = 0;
+
+	for_each_possible_cpu(i)
+		refcnt += *per_cpu_ptr(x->pcpu_refcnt, i);
+	return refcnt;
+}
+
 static void xfrm_state_gc_task(struct work_struct *work)
 {
 	struct xfrm_state *x;
@@ -511,8 +518,22 @@ static void xfrm_state_gc_task(struct work_struct *work)
 
 	synchronize_rcu();
 
-	hlist_for_each_entry_safe(x, tmp, &gc_list, gclist)
+	hlist_for_each_entry_safe(x, tmp, &gc_list, gclist) {
+		if (xfrm_state_refcnt_read(x))
+			continue;
+
+		hlist_del(&x->gclist);
 		___xfrm_state_destroy(x);
+	}
+
+	if (!hlist_empty(&gc_list)) {
+		spin_lock_bh(&xfrm_state_gc_lock);
+		hlist_for_each_entry_safe(x, tmp, &gc_list, gclist)
+			hlist_add_head(&x->gclist, &xfrm_state_gc_list);
+		spin_unlock_bh(&xfrm_state_gc_lock);
+
+		schedule_delayed_work(&xfrm_state_gc_work, XFRM_GC_WAIT_JIFFIES);
+	}
 }
 
 static enum hrtimer_restart xfrm_timer_handler(struct hrtimer *me)
@@ -611,8 +632,14 @@ struct xfrm_state *xfrm_state_alloc(struct net *net)
 	x = kmem_cache_alloc(xfrm_state_cache, GFP_ATOMIC | __GFP_ZERO);
 
 	if (x) {
+		x->pcpu_refcnt = alloc_percpu(int);
+		if (!x->pcpu_refcnt) {
+			kmem_cache_free(xfrm_state_cache, x);
+			return NULL;
+		}
+
 		write_pnet(&x->xs_net, net);
-		refcount_set(&x->refcnt, 1);
+		xfrm_state_hold(x);
 		atomic_set(&x->tunnel_users, 0);
 		INIT_LIST_HEAD(&x->km.all);
 		INIT_HLIST_NODE(&x->bydst);
@@ -634,22 +661,6 @@ struct xfrm_state *xfrm_state_alloc(struct net *net)
 }
 EXPORT_SYMBOL(xfrm_state_alloc);
 
-void __xfrm_state_destroy(struct xfrm_state *x, bool sync)
-{
-	WARN_ON(x->km.state != XFRM_STATE_DEAD);
-
-	if (sync) {
-		synchronize_rcu();
-		___xfrm_state_destroy(x);
-	} else {
-		spin_lock_bh(&xfrm_state_gc_lock);
-		hlist_add_head(&x->gclist, &xfrm_state_gc_list);
-		spin_unlock_bh(&xfrm_state_gc_lock);
-		schedule_work(&xfrm_state_gc_work);
-	}
-}
-EXPORT_SYMBOL(__xfrm_state_destroy);
-
 int __xfrm_state_delete(struct xfrm_state *x)
 {
 	struct net *net = xs_net(x);
@@ -673,6 +684,12 @@ int __xfrm_state_delete(struct xfrm_state *x)
 		 * is what we are dropping here.
 		 */
 		xfrm_state_put(x);
+
+		spin_lock_bh(&xfrm_state_gc_lock);
+		hlist_add_head(&x->gclist, &xfrm_state_gc_list);
+		spin_unlock_bh(&xfrm_state_gc_lock);
+
+		schedule_delayed_work(&xfrm_state_gc_work, 0);
 		err = 0;
 	}
 
@@ -771,10 +788,7 @@ int xfrm_state_flush(struct net *net, u8 proto, bool task_valid, bool sync)
 				err = xfrm_state_delete(x);
 				xfrm_audit_state_delete(x, err ? 0 : 1,
 							task_valid);
-				if (sync)
-					xfrm_state_put_sync(x);
-				else
-					xfrm_state_put(x);
+				xfrm_state_put(x);
 				if (!err)
 					cnt++;
 
@@ -937,8 +951,8 @@ static struct xfrm_state *__xfrm_state_lookup(struct net *net, u32 mark,
 
 		if ((mark & x->mark.m) != x->mark.v)
 			continue;
-		if (!xfrm_state_hold_rcu(x))
-			continue;
+
+		xfrm_state_hold(x);
 		return x;
 	}
 
@@ -962,8 +976,7 @@ static struct xfrm_state *__xfrm_state_lookup_byaddr(struct net *net, u32 mark,
 
 		if ((mark & x->mark.m) != x->mark.v)
 			continue;
-		if (!xfrm_state_hold_rcu(x))
-			continue;
+		xfrm_state_hold(x);
 		return x;
 	}
 
@@ -1151,10 +1164,7 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
 	}
 out:
 	if (x) {
-		if (!xfrm_state_hold_rcu(x)) {
-			*err = -EAGAIN;
-			x = NULL;
-		}
+		xfrm_state_hold(x);
 	} else {
 		*err = acquire_in_progress ? -EAGAIN : error;
 	}
@@ -1681,7 +1691,7 @@ int xfrm_state_update(struct xfrm_state *x)
 
 		err = 0;
 		x->km.state = XFRM_STATE_DEAD;
-		__xfrm_state_put(x);
+		xfrm_state_put(x);
 	}
 
 fail:
@@ -2372,7 +2382,7 @@ struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family)
 
 void xfrm_flush_gc(void)
 {
-	flush_work(&xfrm_state_gc_work);
+	flush_delayed_work(&xfrm_state_gc_work);
 }
 EXPORT_SYMBOL(xfrm_flush_gc);
 
@@ -2385,7 +2395,7 @@ void xfrm_state_delete_tunnel(struct xfrm_state *x)
 		if (atomic_read(&t->tunnel_users) == 2)
 			xfrm_state_delete(t);
 		atomic_dec(&t->tunnel_users);
-		xfrm_state_put_sync(t);
+		xfrm_state_put(t);
 		x->tunnel = NULL;
 	}
 }
@@ -2531,7 +2541,7 @@ void xfrm_state_fini(struct net *net)
 	unsigned int sz;
 
 	flush_work(&net->xfrm.state_hash_work);
-	flush_work(&xfrm_state_gc_work);
+	flush_delayed_work(&xfrm_state_gc_work);
 	xfrm_state_flush(net, IPSEC_PROTO_ANY, false, true);
 
 	WARN_ON(!list_empty(&net->xfrm.state_all));
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index a131f9ff979e..4587f1342af0 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -675,7 +675,7 @@ static int xfrm_add_sa(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (err < 0) {
 		x->km.state = XFRM_STATE_DEAD;
 		xfrm_dev_state_delete(x);
-		__xfrm_state_put(x);
+		xfrm_state_delete(x);
 		goto out;
 	}
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC HACK] xfrm: make state refcounting percpu
  2019-04-24 10:40           ` [RFC HACK] xfrm: make state refcounting percpu Florian Westphal
@ 2019-04-24 11:38             ` Eric Dumazet
  2019-05-03  6:07             ` Steffen Klassert
  1 sibling, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2019-04-24 11:38 UTC (permalink / raw)
  To: Florian Westphal, vakul.garg; +Cc: steffen.klassert, netdev



On 04/24/2019 03:40 AM, Florian Westphal wrote:
> I'm not sure this is a good idea to begin with, refcount
> is right next to state spinlock which is taken for both tx and rx ops,
> plus this complicates debugging quite a bit.
> 


Why not simply using include/linux/percpu-refcount.h ?

> -		__xfrm_state_put(x);
> +		xfrm_state_delete(x);
>  		goto out;
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC HACK] xfrm: make state refcounting percpu
  2019-04-24 10:40           ` [RFC HACK] xfrm: make state refcounting percpu Florian Westphal
  2019-04-24 11:38             ` Eric Dumazet
@ 2019-05-03  6:07             ` Steffen Klassert
  2019-05-03  6:13               ` Vakul Garg
  2019-05-03 12:06               ` Eric Dumazet
  1 sibling, 2 replies; 18+ messages in thread
From: Steffen Klassert @ 2019-05-03  6:07 UTC (permalink / raw)
  To: Florian Westphal; +Cc: vakul.garg, netdev

On Wed, Apr 24, 2019 at 12:40:23PM +0200, Florian Westphal wrote:
> I'm not sure this is a good idea to begin with, refcount
> is right next to state spinlock which is taken for both tx and rx ops,
> plus this complicates debugging quite a bit.


Hm, what would be the usecase where this could help?

The only thing that comes to my mind is a TX state
with wide selectors. In that case you might see
traffic for this state on a lot of cpus. But in
that case we have a lot of other problems too,
state lock, replay window etc. It might make more
sense to install a full state per cpu as this
would solve all the other problems too (I've
talked about that idea at the IPsec workshop).

In fact RFC 7296 allows to insert multiple SAs
with the same traffic selector, so it is possible
to install one state per cpu. We did a PoC for this
at the IETF meeting the week after the IPsec workshop.

One problem that is not solved completely is that,
from userland point of view, a SA consists of two
states (RX/TX) and this has to be symmetric, i.e.
both ends must have the same number of states.
So if both ends have a different number of cpus,
it is not clear how many states we should install.

We are currently discussing extending the IKEv2 standard
so that we can negotiate the 'optimal' number of
(per cpu) SAs for a connection.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [RFC HACK] xfrm: make state refcounting percpu
  2019-05-03  6:07             ` Steffen Klassert
@ 2019-05-03  6:13               ` Vakul Garg
  2019-05-03  6:22                 ` Steffen Klassert
  2019-05-03 12:06               ` Eric Dumazet
  1 sibling, 1 reply; 18+ messages in thread
From: Vakul Garg @ 2019-05-03  6:13 UTC (permalink / raw)
  To: Steffen Klassert, Florian Westphal; +Cc: netdev



> -----Original Message-----
> From: Steffen Klassert <steffen.klassert@secunet.com>
> Sent: Friday, May 3, 2019 11:38 AM
> To: Florian Westphal <fw@strlen.de>
> Cc: Vakul Garg <vakul.garg@nxp.com>; netdev@vger.kernel.org
> Subject: Re: [RFC HACK] xfrm: make state refcounting percpu
> 
> On Wed, Apr 24, 2019 at 12:40:23PM +0200, Florian Westphal wrote:
> > I'm not sure this is a good idea to begin with, refcount is right next
> > to state spinlock which is taken for both tx and rx ops, plus this
> > complicates debugging quite a bit.
> 
> 
> Hm, what would be the usecase where this could help?
> 
> The only thing that comes to my mind is a TX state with wide selectors. In
> that case you might see traffic for this state on a lot of cpus. But in that case
> we have a lot of other problems too, state lock, replay window etc. It might
> make more sense to install a full state per cpu as this would solve all the
> other problems too (I've talked about that idea at the IPsec workshop).
> 
> In fact RFC 7296 allows to insert multiple SAs with the same traffic selector,
> so it is possible to install one state per cpu. We did a PoC for this at the IETF
> meeting the week after the IPsec workshop.
> 

On a 16-core arm64 processor, I am getting very high cpu usage (~40%) in refcount atomics.
E.g. in dst_release() itself, I see 19% cpu usage in the refcount API.
Will the PoC help here?

> One problem that is not solved completely is that, from userland point of
> view, a SA consists of two states (RX/TX) and this has to be symmetric, i.e.
> both ends must have the same number of states.
> So if both ends have a different number of cpus, it is not clear how many
> states we should install.
> 
> We are currently discussing extending the IKEv2 standard so that we can
> negotiate the 'optimal' number of (per cpu) SAs for a connection.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC HACK] xfrm: make state refcounting percpu
  2019-05-03  6:13               ` Vakul Garg
@ 2019-05-03  6:22                 ` Steffen Klassert
  2019-05-03  6:34                   ` Vakul Garg
  0 siblings, 1 reply; 18+ messages in thread
From: Steffen Klassert @ 2019-05-03  6:22 UTC (permalink / raw)
  To: Vakul Garg; +Cc: Florian Westphal, netdev

On Fri, May 03, 2019 at 06:13:22AM +0000, Vakul Garg wrote:
> 
> 
> > -----Original Message-----
> > From: Steffen Klassert <steffen.klassert@secunet.com>
> > Sent: Friday, May 3, 2019 11:38 AM
> > To: Florian Westphal <fw@strlen.de>
> > Cc: Vakul Garg <vakul.garg@nxp.com>; netdev@vger.kernel.org
> > Subject: Re: [RFC HACK] xfrm: make state refcounting percpu
> > 
> > On Wed, Apr 24, 2019 at 12:40:23PM +0200, Florian Westphal wrote:
> > > I'm not sure this is a good idea to begin with, refcount is right next
> > > to state spinlock which is taken for both tx and rx ops, plus this
> > > complicates debugging quite a bit.
> > 
> > 
> > Hm, what would be the usecase where this could help?
> > 
> > The only thing that comes to my mind is a TX state with wide selectors. In
> > that case you might see traffic for this state on a lot of cpus. But in that case
> > we have a lot of other problems too, state lock, replay window etc. It might
> > make more sense to install a full state per cpu as this would solve all the
> > other problems too (I've talked about that idea at the IPsec workshop).
> > 
> > In fact RFC 7296 allows to insert multiple SAs with the same traffic selector,
> > so it is possible to install one state per cpu. We did a PoC for this at the IETF
> > meeting the week after the IPsec workshop.
> > 
> 
> On 16-core arm64 processor, I am getting very high cpu usage (~ 40 %) in refcount atomics.
> E.g. in function dst_release() itself, I get 19% cpu usage  in refcount api.
> Will the PoC help here?

If your usecase is that what I described above, then yes.

I guess the high cpu usage comes from cacheline bounces
because one SA is used from many cpus simultaneously.
Is that the case?

Also, is this a new problem or was it always like that?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [RFC HACK] xfrm: make state refcounting percpu
  2019-05-03  6:22                 ` Steffen Klassert
@ 2019-05-03  6:34                   ` Vakul Garg
  2019-05-03  6:46                     ` Steffen Klassert
  0 siblings, 1 reply; 18+ messages in thread
From: Vakul Garg @ 2019-05-03  6:34 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: Florian Westphal, netdev



> -----Original Message-----
> From: Steffen Klassert <steffen.klassert@secunet.com>
> Sent: Friday, May 3, 2019 11:52 AM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> Subject: Re: [RFC HACK] xfrm: make state refcounting percpu
> 
> On Fri, May 03, 2019 at 06:13:22AM +0000, Vakul Garg wrote:
> >
> >
> > > -----Original Message-----
> > > From: Steffen Klassert <steffen.klassert@secunet.com>
> > > Sent: Friday, May 3, 2019 11:38 AM
> > > To: Florian Westphal <fw@strlen.de>
> > > Cc: Vakul Garg <vakul.garg@nxp.com>; netdev@vger.kernel.org
> > > Subject: Re: [RFC HACK] xfrm: make state refcounting percpu
> > >
> > > On Wed, Apr 24, 2019 at 12:40:23PM +0200, Florian Westphal wrote:
> > > > I'm not sure this is a good idea to begin with, refcount is right
> > > > next to state spinlock which is taken for both tx and rx ops, plus
> > > > this complicates debugging quite a bit.
> > >
> > >
> > > Hm, what would be the usecase where this could help?
> > >
> > > The only thing that comes to my mind is a TX state with wide
> > > selectors. In that case you might see traffic for this state on a
> > > lot of cpus. But in that case we have a lot of other problems too,
> > > state lock, replay window etc. It might make more sense to install a
> > > full state per cpu as this would solve all the other problems too (I've
> talked about that idea at the IPsec workshop).
> > >
> > > In fact RFC 7296 allows to insert multiple SAs with the same traffic
> > > selector, so it is possible to install one state per cpu. We did a
> > > PoC for this at the IETF meeting the week after the IPsec workshop.
> > >
> >
> > On 16-core arm64 processor, I am getting very high cpu usage (~ 40 %) in
> refcount atomics.
> > E.g. in function dst_release() itself, I get 19% cpu usage  in refcount api.
> > Will the PoC help here?
> 
> If your usecase is that what I described above, then yes.
> 
> I guess the high cpu usage comes from cacheline bounces because one SA is
> used from many cpus simultaneously.
> Is that the case?

I don't see the kernel code aligning refcount variables to the reservation granule
(or cacheline) size, so it is possible that wasteful reservation loss is happening in the atomics.

> 
> Also, is this a new problem or was it always like that?

It has always been like this. On 4-core and 8-core platforms as well, these atomics consume significant cpu
(8-core usage is higher than 4-core).

On the 16-core system, we see no throughput scaling beyond 8 cores.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC HACK] xfrm: make state refcounting percpu
  2019-05-03  6:34                   ` Vakul Garg
@ 2019-05-03  6:46                     ` Steffen Klassert
  2019-05-03  6:48                       ` Vakul Garg
  0 siblings, 1 reply; 18+ messages in thread
From: Steffen Klassert @ 2019-05-03  6:46 UTC (permalink / raw)
  To: Vakul Garg; +Cc: Florian Westphal, netdev

On Fri, May 03, 2019 at 06:34:29AM +0000, Vakul Garg wrote:
> > -----Original Message-----
> > From: Steffen Klassert <steffen.klassert@secunet.com>
> > 
> > Also, is this a new problem or was it always like that?
> 
> It is always like this. On 4-core, 8-core platforms as well, these atomics consume significant cpu 
> (8 core cpu usage is more than 4 core).

Ok, so it is not a regression. That's what I wanted to know.

Did the patch from Florian improve the situation?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [RFC HACK] xfrm: make state refcounting percpu
  2019-05-03  6:46                     ` Steffen Klassert
@ 2019-05-03  6:48                       ` Vakul Garg
  0 siblings, 0 replies; 18+ messages in thread
From: Vakul Garg @ 2019-05-03  6:48 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: Florian Westphal, netdev



> -----Original Message-----
> From: Steffen Klassert <steffen.klassert@secunet.com>
> Sent: Friday, May 3, 2019 12:16 PM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> Subject: Re: [RFC HACK] xfrm: make state refcounting percpu
> 
> On Fri, May 03, 2019 at 06:34:29AM +0000, Vakul Garg wrote:
> > > -----Original Message-----
> > > From: Steffen Klassert <steffen.klassert@secunet.com>
> > >
> > > Also, is this a new problem or was it always like that?
> >
> > It is always like this. On 4-core, 8-core platforms as well, these atomics
> consume significant cpu
> > (8 core cpu usage is more than 4 core).
> 
> Ok, so it is not a regression. That's what I wanted to know.
> 
> Did the patch from Florian improve the situation?

Actually I have not had time to try it yet.
Also, Florian's patch does not make the refcount in 'struct dst' percpu,
so I think it may not help.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC HACK] xfrm: make state refcounting percpu
  2019-05-03  6:07             ` Steffen Klassert
  2019-05-03  6:13               ` Vakul Garg
@ 2019-05-03 12:06               ` Eric Dumazet
  2019-05-03 15:55                 ` Florian Westphal
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2019-05-03 12:06 UTC (permalink / raw)
  To: Steffen Klassert, Florian Westphal; +Cc: vakul.garg, netdev, Eric Dumazet



On 5/3/19 2:07 AM, Steffen Klassert wrote:
> On Wed, Apr 24, 2019 at 12:40:23PM +0200, Florian Westphal wrote:
>> I'm not sure this is a good idea to begin with, refcount
>> is right next to state spinlock which is taken for both tx and rx ops,
>> plus this complicates debugging quite a bit.
> 
> 


For some reason I have not received Florian response.

Florian, when the percpu counters are in nominal mode,
the updates are only in percpu memory, so the cache line containing struct percpu_ref in the
main object is not dirtied.

(This field/object can be placed in a read-mostly location of the structure if needed)

I would definitely try to use this.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC HACK] xfrm: make state refcounting percpu
  2019-05-03 12:06               ` Eric Dumazet
@ 2019-05-03 15:55                 ` Florian Westphal
  0 siblings, 0 replies; 18+ messages in thread
From: Florian Westphal @ 2019-05-03 15:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Steffen Klassert, Florian Westphal, vakul.garg, netdev, Eric Dumazet

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 5/3/19 2:07 AM, Steffen Klassert wrote:
> > On Wed, Apr 24, 2019 at 12:40:23PM +0200, Florian Westphal wrote:
> >> I'm not sure this is a good idea to begin with, refcount
> >> is right next to state spinlock which is taken for both tx and rx ops,
> >> plus this complicates debugging quite a bit.
> > 
> > 
> 
> 
> For some reason I have not received Florian response.
> 
> Florian, when the percpu counters are in nominal mode,
> the updates are only in percpu memory, so the cache line containing struct percpu_ref in the
> main object is not dirtied.

Yes, I understand this.  We'll still serialize anyway due to the
spinlock.

Given that Vakul says the state refcount isn't the main problem, and
Steffen suggests inserting multiple states instead, I don't think
working on this any further makes sense.

Thanks for the pcpu counter infra pointer though, I had not seen it before.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-05-03 15:55 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-22 14:29 ipsec tunnel performance degrade Vakul Garg
2019-04-22 17:45 ` Florian Westphal
2019-04-23  2:36   ` Vakul Garg
2019-04-23  5:56     ` Florian Westphal
2019-04-23 15:04       ` Vakul Garg
2019-04-23 16:25         ` Florian Westphal
2019-04-23 17:27           ` David Miller
2019-04-24  7:32             ` Vakul Garg
2019-04-24 10:40           ` [RFC HACK] xfrm: make state refcounting percpu Florian Westphal
2019-04-24 11:38             ` Eric Dumazet
2019-05-03  6:07             ` Steffen Klassert
2019-05-03  6:13               ` Vakul Garg
2019-05-03  6:22                 ` Steffen Klassert
2019-05-03  6:34                   ` Vakul Garg
2019-05-03  6:46                     ` Steffen Klassert
2019-05-03  6:48                       ` Vakul Garg
2019-05-03 12:06               ` Eric Dumazet
2019-05-03 15:55                 ` Florian Westphal
