* [PATCH 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)'
@ 2020-12-08  9:45 SeongJae Park
  2020-12-08  9:45 ` [PATCH 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works SeongJae Park
  0 siblings, 1 reply; 6+ messages in thread
From: SeongJae Park @ 2020-12-08  9:45 UTC (permalink / raw)
  To: davem; +Cc: SeongJae Park, kuba, kuznet, paulmck, netdev, rcu, linux-kernel

From: SeongJae Park <sjpark@amazon.de>

On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
calls make the number of active slab objects, including those of the
'sock_inode_cache' type, increase rapidly and continuously.  As a
result, the systems come under memory pressure.

'cleanup_net()' and 'fqdir_work_fn()' are the functions that deallocate
the relevant memory objects.  They are invoked asynchronously from work
queues and internally use 'rcu_barrier()' to ensure safe destruction.
'cleanup_net()' works in a batched manner on a single-threaded
workqueue, while 'fqdir_work_fn()' runs once per 'fqdir_exit()' call on
the 'system_wq'.
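
For reference, the batching pattern of 'cleanup_net()' looks roughly
like below (a simplified sketch of net/core/net_namespace.c; locking
and the actual teardown are omitted):

	static LLIST_HEAD(cleanup_list);

	static void cleanup_net(struct work_struct *work)
	{
		/* atomically snapshot all pending netns and handle them */
		struct llist_node *net_kill_list = llist_del_all(&cleanup_list);

		/* ... tear down every netns on the list, then wait for RCU
		 * (including one rcu_barrier()) once for the whole batch ...
		 */
	}

	static DECLARE_WORK(net_cleanup_work, cleanup_net);

	void __put_net(struct net *net)
	{
		/* queue the work only when the list was empty before */
		if (llist_add(&net->cleanup_list, &cleanup_list))
			queue_work(netns_wq, &net_cleanup_work);
	}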

As a result, 'fqdir_work_fn()' is called frequently under the workload
and contention on 'rcu_barrier()' becomes high.  More specifically, the
global mutex, 'rcu_state.barrier_mutex', becomes the bottleneck.

I made 'fqdir_work_fn()' do its work in a batched manner and confirmed
that it resolves the problem.  The following patch implements the
change.  I think this is the right point fix for the issue, but someone
might point at different parts as the cause.

1. User: Frequent 'unshare()' calls
From some point of view, such frequent 'unshare()' calls might simply
seem insane.

2. Global mutex in 'rcu_barrier()'
Because of the global mutex, 'rcu_barrier()' callers can wait for a
long time even when the callbacks they are waiting for have already
completed.  Similar issues could therefore occur with other
'rcu_barrier()' users.  Maybe we can use some wait-queue-like mechanism
to notify the waiters as soon as the thing they wait for has happened.
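
To illustrate the serialization, a greatly simplified sketch of
'rcu_barrier()' (kernel/rcu/tree.c; the real function does much more)
would be:

	void rcu_barrier(void)
	{
		/* every caller in the system serializes on this one mutex */
		mutex_lock(&rcu_state.barrier_mutex);

		/* ... queue a barrier callback on each CPU that has pending
		 * callbacks and wait until all of them have been invoked ...
		 */

		mutex_unlock(&rcu_state.barrier_mutex);
	}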

I personally believe it makes sense to apply the point fix for now and
to improve 'rcu_barrier()' in the long term.  If I'm missing something
or you have a different opinion, please feel free to let me know.

SeongJae Park (1):
  net/ipv4/inet_fragment: Batch fqdir destroy works

 include/net/inet_frag.h  |  2 +-
 net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)

-- 
2.17.1



* [PATCH 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works
  2020-12-08  9:45 [PATCH 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)' SeongJae Park
@ 2020-12-08  9:45 ` SeongJae Park
  2020-12-09 23:16   ` Jakub Kicinski
  2020-12-10  0:17   ` Eric Dumazet
  0 siblings, 2 replies; 6+ messages in thread
From: SeongJae Park @ 2020-12-08  9:45 UTC (permalink / raw)
  To: davem; +Cc: SeongJae Park, kuba, kuznet, paulmck, netdev, rcu, linux-kernel

From: SeongJae Park <sjpark@amazon.de>

In 'fqdir_exit()', a work item for destruction of the 'fqdir' is
enqueued.  The work function, 'fqdir_work_fn()', calls 'rcu_barrier()'.
In case of intensive 'fqdir_exit()' (e.g., frequent
'unshare(CLONE_NEWNET)' system calls), this increased contention could
result in unacceptably high latency of 'rcu_barrier()'.  This commit
avoids such contention by doing the destruction in a batched manner,
similar to that of 'cleanup_net()'.

Signed-off-by: SeongJae Park <sjpark@amazon.de>
---
 include/net/inet_frag.h  |  2 +-
 net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index bac79e817776..558893d8810c 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -20,7 +20,7 @@ struct fqdir {
 
 	/* Keep atomic mem on separate cachelines in structs that include it */
 	atomic_long_t		mem ____cacheline_aligned_in_smp;
-	struct work_struct	destroy_work;
+	struct llist_node	destroy_list;
 };
 
 /**
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 10d31733297d..796b559137c5 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -145,12 +145,19 @@ static void inet_frags_free_cb(void *ptr, void *arg)
 		inet_frag_destroy(fq);
 }
 
+static LLIST_HEAD(destroy_list);
+
 static void fqdir_work_fn(struct work_struct *work)
 {
-	struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
-	struct inet_frags *f = fqdir->f;
+	struct llist_node *kill_list;
+	struct fqdir *fqdir;
+	struct inet_frags *f;
+
+	/* Atomically snapshot the list of fqdirs to destroy */
+	kill_list = llist_del_all(&destroy_list);
 
-	rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
+	llist_for_each_entry(fqdir, kill_list, destroy_list)
+		rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
 
 	/* We need to make sure all ongoing call_rcu(..., inet_frag_destroy_rcu)
 	 * have completed, since they need to dereference fqdir.
@@ -158,10 +165,13 @@ static void fqdir_work_fn(struct work_struct *work)
 	 */
 	rcu_barrier();
 
-	if (refcount_dec_and_test(&f->refcnt))
-		complete(&f->completion);
+	llist_for_each_entry(fqdir, kill_list, destroy_list) {
+		f = fqdir->f;
+		if (refcount_dec_and_test(&f->refcnt))
+			complete(&f->completion);
 
-	kfree(fqdir);
+		kfree(fqdir);
+	}
 }
 
 int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
@@ -184,10 +194,12 @@ int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
 }
 EXPORT_SYMBOL(fqdir_init);
 
+static DECLARE_WORK(fqdir_destroy_work, fqdir_work_fn);
+
 void fqdir_exit(struct fqdir *fqdir)
 {
-	INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
-	queue_work(system_wq, &fqdir->destroy_work);
+	if (llist_add(&fqdir->destroy_list, &destroy_list))
+		queue_work(system_wq, &fqdir_destroy_work);
 }
 EXPORT_SYMBOL(fqdir_exit);
 
-- 
2.17.1



* Re: [PATCH 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works
  2020-12-08  9:45 ` [PATCH 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works SeongJae Park
@ 2020-12-09 23:16   ` Jakub Kicinski
  2020-12-10  6:43     ` SeongJae Park
  2020-12-10  0:17   ` Eric Dumazet
  1 sibling, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2020-12-09 23:16 UTC (permalink / raw)
  To: SeongJae Park
  Cc: davem, SeongJae Park, kuznet, paulmck, netdev, rcu, linux-kernel

On Tue, 8 Dec 2020 10:45:29 +0100 SeongJae Park wrote:
> From: SeongJae Park <sjpark@amazon.de>
> 
> In 'fqdir_exit()', a work item for destruction of the 'fqdir' is
> enqueued.  The work function, 'fqdir_work_fn()', calls 'rcu_barrier()'.
> In case of intensive 'fqdir_exit()' (e.g., frequent
> 'unshare(CLONE_NEWNET)' system calls), this increased contention could
> result in unacceptably high latency of 'rcu_barrier()'.  This commit
> avoids such contention by doing the destruction in a batched manner,
> similar to that of 'cleanup_net()'.
> 
> Signed-off-by: SeongJae Park <sjpark@amazon.de>

Looks fine to me, but you haven't CCed Florian or Eric who were the
last two people to touch this function. Please repost CCing them and
fixing the nit below, thanks!

>  static void fqdir_work_fn(struct work_struct *work)
>  {
> -	struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
> -	struct inet_frags *f = fqdir->f;
> +	struct llist_node *kill_list;
> +	struct fqdir *fqdir;
> +	struct inet_frags *f;

nit: reorder fqdir and f to keep reverse xmas tree variable ordering.
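
I.e., longest declaration line first, e.g.:

	struct llist_node *kill_list;
	struct inet_frags *f;
	struct fqdir *fqdir;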


* Re: [PATCH 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works
  2020-12-08  9:45 ` [PATCH 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works SeongJae Park
  2020-12-09 23:16   ` Jakub Kicinski
@ 2020-12-10  0:17   ` Eric Dumazet
  2020-12-10  7:27     ` SeongJae Park
  1 sibling, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2020-12-10  0:17 UTC (permalink / raw)
  To: SeongJae Park, davem
  Cc: SeongJae Park, kuba, kuznet, paulmck, netdev, rcu, linux-kernel



On 12/8/20 10:45 AM, SeongJae Park wrote:
> From: SeongJae Park <sjpark@amazon.de>
> 
> In 'fqdir_exit()', a work item for destruction of the 'fqdir' is
> enqueued.  The work function, 'fqdir_work_fn()', calls 'rcu_barrier()'.
> In case of intensive 'fqdir_exit()' (e.g., frequent
> 'unshare(CLONE_NEWNET)' system calls), this increased contention could
> result in unacceptably high latency of 'rcu_barrier()'.  This commit
> avoids such contention by doing the destruction in a batched manner,
> similar to that of 'cleanup_net()'.

Any numbers to share ? I have never seen an issue.

> 
> Signed-off-by: SeongJae Park <sjpark@amazon.de>
> ---
>  include/net/inet_frag.h  |  2 +-
>  net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
>  2 files changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index bac79e817776..558893d8810c 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -20,7 +20,7 @@ struct fqdir {
>  
>  	/* Keep atomic mem on separate cachelines in structs that include it */
>  	atomic_long_t		mem ____cacheline_aligned_in_smp;
> -	struct work_struct	destroy_work;
> +	struct llist_node	destroy_list;
>  };
>  
>  /**
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 10d31733297d..796b559137c5 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -145,12 +145,19 @@ static void inet_frags_free_cb(void *ptr, void *arg)
>  		inet_frag_destroy(fq);
>  }
>  
> +static LLIST_HEAD(destroy_list);
> +
>  static void fqdir_work_fn(struct work_struct *work)
>  {
> -	struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
> -	struct inet_frags *f = fqdir->f;
> +	struct llist_node *kill_list;
> +	struct fqdir *fqdir;
> +	struct inet_frags *f;
> +
> +	/* Atomically snapshot the list of fqdirs to destroy */
> +	kill_list = llist_del_all(&destroy_list);
>  
> -	rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
> +	llist_for_each_entry(fqdir, kill_list, destroy_list)
> +		rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
> 


OK, it seems rhashtable_free_and_destroy() has cond_resched() so we are not going
to hold this cpu for long periods.
 
>  	/* We need to make sure all ongoing call_rcu(..., inet_frag_destroy_rcu)
>  	 * have completed, since they need to dereference fqdir.
> @@ -158,10 +165,13 @@ static void fqdir_work_fn(struct work_struct *work)
>  	 */
>  	rcu_barrier();
>  
> -	if (refcount_dec_and_test(&f->refcnt))
> -		complete(&f->completion);
> +	llist_for_each_entry(fqdir, kill_list, destroy_list) {

Don't we need the llist_for_each_entry_safe() variant here ???

> +		f = fqdir->f;
> +		if (refcount_dec_and_test(&f->refcnt))
> +			complete(&f->completion);
>  
> -	kfree(fqdir);
> +		kfree(fqdir);
> +	}
>  }
>  
>  int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
> @@ -184,10 +194,12 @@ int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
>  }
>  EXPORT_SYMBOL(fqdir_init);
>  
> +static DECLARE_WORK(fqdir_destroy_work, fqdir_work_fn);
> +
>  void fqdir_exit(struct fqdir *fqdir)
>  {
> -	INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
> -	queue_work(system_wq, &fqdir->destroy_work);
> +	if (llist_add(&fqdir->destroy_list, &destroy_list))
> +		queue_work(system_wq, &fqdir_destroy_work);
>  }
>  EXPORT_SYMBOL(fqdir_exit);
>  
> 


* Re: [PATCH 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works
  2020-12-09 23:16   ` Jakub Kicinski
@ 2020-12-10  6:43     ` SeongJae Park
  0 siblings, 0 replies; 6+ messages in thread
From: SeongJae Park @ 2020-12-10  6:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: SeongJae Park, davem, SeongJae Park, kuznet, paulmck, netdev,
	rcu, linux-kernel

On Wed, 9 Dec 2020 15:16:59 -0800 Jakub Kicinski <kuba@kernel.org> wrote:

> On Tue, 8 Dec 2020 10:45:29 +0100 SeongJae Park wrote:
> > From: SeongJae Park <sjpark@amazon.de>
> > 
> > In 'fqdir_exit()', a work item for destruction of the 'fqdir' is
> > enqueued.  The work function, 'fqdir_work_fn()', calls 'rcu_barrier()'.
> > In case of intensive 'fqdir_exit()' (e.g., frequent
> > 'unshare(CLONE_NEWNET)' system calls), this increased contention could
> > result in unacceptably high latency of 'rcu_barrier()'.  This commit
> > avoids such contention by doing the destruction in a batched manner,
> > similar to that of 'cleanup_net()'.
> > 
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> 
> Looks fine to me, but you haven't CCed Florian or Eric who were the
> last two people to touch this function. Please repost CCing them and
> fixing the nit below, thanks!

Thank you for letting me know.  I will send the next version
accordingly.

> 
> >  static void fqdir_work_fn(struct work_struct *work)
> >  {
> > -	struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
> > -	struct inet_frags *f = fqdir->f;
> > +	struct llist_node *kill_list;
> > +	struct fqdir *fqdir;
> > +	struct inet_frags *f;
> 
> nit: reorder fqdir and f to keep reverse xmas tree variable ordering.

Hehe, ok, I will. :)


Thanks,
SeongJae Park


* Re: [PATCH 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works
  2020-12-10  0:17   ` Eric Dumazet
@ 2020-12-10  7:27     ` SeongJae Park
  0 siblings, 0 replies; 6+ messages in thread
From: SeongJae Park @ 2020-12-10  7:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: SeongJae Park, davem, SeongJae Park, kuba, kuznet, paulmck,
	netdev, rcu, linux-kernel

On Thu, 10 Dec 2020 01:17:58 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> 
> 
> On 12/8/20 10:45 AM, SeongJae Park wrote:
> > From: SeongJae Park <sjpark@amazon.de>
> > 
> > In 'fqdir_exit()', a work item for destruction of the 'fqdir' is
> > enqueued.  The work function, 'fqdir_work_fn()', calls 'rcu_barrier()'.
> > In case of intensive 'fqdir_exit()' (e.g., frequent
> > 'unshare(CLONE_NEWNET)' system calls), this increased contention could
> > result in unacceptably high latency of 'rcu_barrier()'.  This commit
> > avoids such contention by doing the destruction in a batched manner,
> > similar to that of 'cleanup_net()'.
> 
> Any numbers to share ? I have never seen an issue.

On our machine with 40 CPU cores and 70GB of DRAM, 15GB of available
memory was consumed within 2 minutes while my artificial reproducer was
running.  The reproducer merely repeats 'unshare(CLONE_NEWNET)' in a
loop 50,000 times.  The reproducer is artificial, but it resembles the
behavior of our real workloads; a minimal sketch of it is shown below.
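
The sketch (not the exact program we used; it needs CAP_SYS_ADMIN to
succeed):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		int i;

		for (i = 0; i < 50000; i++) {
			/* each call abandons the previous netns, which is
			 * then destroyed asynchronously
			 */
			if (unshare(CLONE_NEWNET) != 0) {
				perror("unshare");
				return 1;
			}
		}
		return 0;
	}
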
While the reproducer was running, 'cleanup_net()' was called only 4
times.  The first two calls finished quickly, but the third call took
about 30 seconds, and the fourth call had not finished by the time the
reproducer completed.  We also confirmed that the third and fourth
calls were simply waiting for 'rcu_barrier()'.

I think you've not seen this issue before because we are doing very
intensive 'unshare()' calls.  Also, this is not reproducible on all
hardware.  On my 6-CPU machine, the problem didn't reproduce.

> 
> > 
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > ---
> >  include/net/inet_frag.h  |  2 +-
> >  net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
> >  2 files changed, 21 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> > index bac79e817776..558893d8810c 100644
> > --- a/include/net/inet_frag.h
> > +++ b/include/net/inet_frag.h
> > @@ -20,7 +20,7 @@ struct fqdir {
> >  
> >  	/* Keep atomic mem on separate cachelines in structs that include it */
> >  	atomic_long_t		mem ____cacheline_aligned_in_smp;
> > -	struct work_struct	destroy_work;
> > +	struct llist_node	destroy_list;
> >  };
> >  
> >  /**
> > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> > index 10d31733297d..796b559137c5 100644
> > --- a/net/ipv4/inet_fragment.c
> > +++ b/net/ipv4/inet_fragment.c
> > @@ -145,12 +145,19 @@ static void inet_frags_free_cb(void *ptr, void *arg)
> >  		inet_frag_destroy(fq);
> >  }
> >  
> > +static LLIST_HEAD(destroy_list);
> > +
> >  static void fqdir_work_fn(struct work_struct *work)
> >  {
> > -	struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
> > -	struct inet_frags *f = fqdir->f;
> > +	struct llist_node *kill_list;
> > +	struct fqdir *fqdir;
> > +	struct inet_frags *f;
> > +
> > +	/* Atomically snapshot the list of fqdirs to destroy */
> > +	kill_list = llist_del_all(&destroy_list);
> >  
> > -	rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
> > +	llist_for_each_entry(fqdir, kill_list, destroy_list)
> > +		rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
> > 
> 
> 
> OK, it seems rhashtable_free_and_destroy() has cond_resched() so we are not going
> to hold this cpu for long periods.
>  
> >  	/* We need to make sure all ongoing call_rcu(..., inet_frag_destroy_rcu)
> >  	 * have completed, since they need to dereference fqdir.
> > @@ -158,10 +165,13 @@ static void fqdir_work_fn(struct work_struct *work)
> >  	 */
> >  	rcu_barrier();
> >  
> > -	if (refcount_dec_and_test(&f->refcnt))
> > -		complete(&f->completion);
> > +	llist_for_each_entry(fqdir, kill_list, destroy_list) {
> 
> Don't we need the llist_for_each_entry_safe() variant here ???

Oh, indeed.  I will do so in the next version.
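That is, roughly like below (untested), so that the node is not read
again after 'kfree(fqdir)':

	struct fqdir *fqdir, *tmp;

	llist_for_each_entry_safe(fqdir, tmp, kill_list, destroy_list) {
		f = fqdir->f;
		if (refcount_dec_and_test(&f->refcnt))
			complete(&f->completion);

		kfree(fqdir);
	}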

> 
> > +		f = fqdir->f;
> > +		if (refcount_dec_and_test(&f->refcnt))
> > +			complete(&f->completion);
> >  
> > -	kfree(fqdir);
> > +		kfree(fqdir);
> > +	}


Thanks,
SeongJae Park


