All of lore.kernel.org
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: davem@davemloft.net, vyasevic@redhat.com,
	kstewart@linuxfoundation.org, pombredanne@nexb.com,
	vyasevich@gmail.com, mark.rutland@arm.com,
	gregkh@linuxfoundation.org, adobriyan@gmail.com, fw@strlen.de,
	nicolas.dichtel@6wind.com, xiyou.wangcong@gmail.com,
	roman.kapl@sysgo.com, paul@paul-moore.com, dsahern@gmail.com,
	daniel@iogearbox.net, lucien.xin@gmail.com,
	mschiffer@universe-factory.net, rshearma@brocade.com,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	avagin@virtuozzo.com, gorcunov@virtuozzo.com
Subject: Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
Date: Wed, 15 Nov 2017 00:25:01 -0600	[thread overview]
Message-ID: <87tvxw3vpe.fsf@xmission.com> (raw)
In-Reply-To: <151066759055.14465.9783879083192000862.stgit@localhost.localdomain> (Kirill Tkhai's message of "Tue, 14 Nov 2017 16:53:33 +0300")

Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> Curently mutex is used to protect pernet operations list. It makes
> cleanup_net() to execute ->exit methods of the same operations set,
> which was used on the time of ->init, even after net namespace is
> unlinked from net_namespace_list.
>
> But the problem is it's need to synchronize_rcu() after net is removed
> from net_namespace_list():
>
> Destroy net_ns:
> cleanup_net()
>   mutex_lock(&net_mutex)
>   list_del_rcu(&net->list)
>   synchronize_rcu()                                  <--- Sleep there for ages
>   list_for_each_entry_reverse(ops, &pernet_list, list)
>     ops_exit_list(ops, &net_exit_list)
>   list_for_each_entry_reverse(ops, &pernet_list, list)
>     ops_free_list(ops, &net_exit_list)
>   mutex_unlock(&net_mutex)
>
> This primitive is not fast, especially on the systems with many processors
> and/or when preemptible RCU is enabled in config. So, all the time, while
> cleanup_net() is waiting for RCU grace period, creation of new net namespaces
> is not possible, the tasks, who makes it, are sleeping on the same mutex:
>
> Create net_ns:
> copy_net_ns()
>   mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
>
> The solution is to convert net_mutex to the rw_semaphore. Then,
> pernet_operations::init/::exit methods, modifying the net-related data,
> will require down_read() locking only, while down_write() will be used
> for changing pernet_list.
>
> This gives signify performance increase, like you may see below. There
> is measured sequential net namespace creation in a cycle, in single
> thread, without other tasks (single user mode):
>
> 1)int main(int argc, char *argv[])
> {
>         unsigned nr;
>         if (argc < 2) {
>                 fprintf(stderr, "Provide nr iterations arg\n");
>                 return 1;
>         }
>         nr = atoi(argv[1]);
>         while (nr-- > 0) {
>                 if (unshare(CLONE_NEWNET)) {
>                         perror("Can't unshare");
>                         return 1;
>                 }
>         }
>         return 0;
> }
>
> Origin, 100000 unshare():
> 0.03user 23.14system 1:39.85elapsed 23%CPU
>
> Patched, 100000 unshare():
> 0.03user 67.49system 1:08.34elapsed 98%CPU
>
> 2)for i in {1..10000}; do unshare -n bash -c exit; done
>
> Origin:
> real 1m24,190s
> user 0m6,225s
> sys 0m15,132s
>
> Patched:
> real 0m18,235s   (4.6 times faster)
> user 0m4,544s
> sys 0m13,796s
>
> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
> from Linus tree (not in net-next yet).

Using a rwsem to protect the list of operations makes sense.

That should allow removing the sing

I am not wild about taking a the rwsem down_write in
rtnl_link_unregister, and net_ns_barrier.  I think that works but it
goes from being a mild hack to being a pretty bad hack and something
else that can kill the parallelism you are seeking it add.

There are about 204 instances of struct pernet_operations.  That is a
lot of code to have carefully audited to ensure it can in parallel all
at once.  The existence of the exit_batch method, net_ns_barrier,
for_each_net and taking of net_mutex in rtnl_link_unregister all testify
to the fact that there are data structures accessed by multiple network
namespaces.

My preference would be to:

- Add the net_sem in addition to net_mutex with down_write only held in
  register and unregister, and maybe net_ns_barrier and
  rtnl_link_unregister.

- Factor out struct pernet_ops out of struct pernet_operations.  With
  struct pernet_ops not having the exit_batch method.  With pernet_ops
  being embedded an anonymous member of the old struct pernet_operations.

- Add [un]register_pernet_{sys,dev} functions that take a struct
  pernet_ops, that don't take net_mutex.  Have them order the
  pernet_list as:

  pernet_sys
  pernet_subsys
  pernet_device
  pernet_dev

  With the chunk in the middle taking the net_mutex.

  I think I would enforce the ordering with a failure to register
  if a subsystem or a device tried to register out of order.  

- Disable use of the single threaded workqueue if nothing needs the
  net_mutex.

- Add a test mode that deliberartely spawns threads on multiple
  processors and deliberately creates multiple network namespaces
  at the same time.

- Add a test mode that deliberately spawns threads on multiple
  processors and delibearate destrosy multiple network namespaces
  at the same time.
  
- Convert the code to unlocked operation one pernet_operations to at a
  time.  Being careful with the loopback device it's order in the list
  strongly matters.

- Finally remove the unnecessary code.


At the end of the day because all of the operations for one network
namespace will run in parallel with all of the operations for another
network namespace all of the sophistication that goes into batching the
cleanup of multiple network namespaces can be removed.  As different
tasks (not sharing a lock) can wait in syncrhonize_rcu at the same time
without slowing each other down.

I think we can remove the batching but I am afraid that will lead us into
rtnl_lock contention.

I am definitely not comfortable with changing the locking on so much
code without an explanation at all why it is safe in the commit comments
in all 204 instances.  Which combined equal most of the well maintained
and regularly used part of the networking stack.

Eric

> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  include/linux/rtnetlink.h   |    2 +-
>  include/net/net_namespace.h |    9 +++++++--
>  net/core/net_namespace.c    |   40 ++++++++++++++++++++--------------------
>  net/core/rtnetlink.c        |    4 ++--
>  4 files changed, 30 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> index 54bcd970bfd3..36cc009f4580 100644
> --- a/include/linux/rtnetlink.h
> +++ b/include/linux/rtnetlink.h
> @@ -32,7 +32,7 @@ extern int rtnl_trylock(void);
>  extern int rtnl_is_locked(void);
>  
>  extern wait_queue_head_t netdev_unregistering_wq;
> -extern struct mutex net_mutex;
> +extern struct rw_semaphore net_sem;
>  
>  #ifdef CONFIG_PROVE_LOCKING
>  extern bool lockdep_rtnl_is_held(void);
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 10f99dafd5ac..aaed826ccbe7 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -60,7 +60,7 @@ struct net {
>  
>  	struct list_head	list;		/* list of network namespaces */
>  	struct list_head	cleanup_list;	/* namespaces on death row */
> -	struct list_head	exit_list;	/* Use only net_mutex */
> +	struct list_head	exit_list;	/* Use only net_sem */
>  
>  	struct user_namespace   *user_ns;	/* Owning user namespace */
>  	struct ucounts		*ucounts;
> @@ -89,7 +89,12 @@ struct net {
>  	/* core fib_rules */
>  	struct list_head	rules_ops;
>  
> -	struct list_head	fib_notifier_ops;  /* protected by net_mutex */
> +	/*
> +	 * RCU-protected list, modifiable by pernet-init and -exit methods.
> +	 * When net namespace is alive (net::count > 0), all the changes
> +	 * are made under rw_sem held on write.
> +	 */
> +	struct list_head	fib_notifier_ops;
>  
>  	struct net_device       *loopback_dev;          /* The loopback */
>  	struct netns_core	core;
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 6cfdc7c84c48..f502b11b507e 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -29,7 +29,7 @@
>  
>  static LIST_HEAD(pernet_list);
>  static struct list_head *first_device = &pernet_list;
> -DEFINE_MUTEX(net_mutex);
> +DECLARE_RWSEM(net_sem);
>  
>  LIST_HEAD(net_namespace_list);
>  EXPORT_SYMBOL_GPL(net_namespace_list);
> @@ -65,11 +65,11 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
>  {
>  	struct net_generic *ng, *old_ng;
>  
> -	BUG_ON(!mutex_is_locked(&net_mutex));
> +	BUG_ON(!rwsem_is_locked(&net_sem));
>  	BUG_ON(id < MIN_PERNET_OPS_ID);
>  
>  	old_ng = rcu_dereference_protected(net->gen,
> -					   lockdep_is_held(&net_mutex));
> +					   lockdep_is_held(&net_sem));
>  	if (old_ng->s.len > id) {
>  		old_ng->ptr[id] = data;
>  		return 0;
> @@ -278,7 +278,7 @@ struct net *get_net_ns_by_id(struct net *net, int id)
>   */
>  static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
>  {
> -	/* Must be called with net_mutex held */
> +	/* Must be called with net_sem held */
>  	const struct pernet_operations *ops, *saved_ops;
>  	int error = 0;
>  	LIST_HEAD(net_exit_list);
> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>  
>  	get_user_ns(user_ns);
>  
> -	rv = mutex_lock_killable(&net_mutex);
> +	rv = down_read_killable(&net_sem);
>  	if (rv < 0) {
>  		net_free(net);
>  		dec_net_namespaces(ucounts);
> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
>  		list_add_tail_rcu(&net->list, &net_namespace_list);
>  		rtnl_unlock();
>  	}
> -	mutex_unlock(&net_mutex);
> +	up_read(&net_sem);
>  	if (rv < 0) {
>  		dec_net_namespaces(ucounts);
>  		put_user_ns(user_ns);
> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>  	list_replace_init(&cleanup_list, &net_kill_list);
>  	spin_unlock_irq(&cleanup_list_lock);
>  
> -	mutex_lock(&net_mutex);
> +	down_read(&net_sem);
>  
>  	/* Don't let anyone else find us. */
>  	rtnl_lock();
> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>  	list_for_each_entry_reverse(ops, &pernet_list, list)
>  		ops_free_list(ops, &net_exit_list);
>  
> -	mutex_unlock(&net_mutex);
> +	up_read(&net_sem);
>  
>  	/* Ensure there are no outstanding rcu callbacks using this
>  	 * network namespace.
> @@ -513,8 +513,8 @@ static void cleanup_net(struct work_struct *work)
>   */
>  void net_ns_barrier(void)
>  {
> -	mutex_lock(&net_mutex);
> -	mutex_unlock(&net_mutex);
> +	down_write(&net_sem);
> +	up_write(&net_sem);
>  }
>  EXPORT_SYMBOL(net_ns_barrier);
>  
> @@ -841,7 +841,7 @@ static int __init net_ns_init(void)
>  
>  	rcu_assign_pointer(init_net.gen, ng);
>  
> -	mutex_lock(&net_mutex);
> +	down_read(&net_sem);
>  	if (setup_net(&init_net, &init_user_ns))
>  		panic("Could not setup the initial network namespace");
>  
> @@ -851,7 +851,7 @@ static int __init net_ns_init(void)
>  	list_add_tail_rcu(&init_net.list, &net_namespace_list);
>  	rtnl_unlock();
>  
> -	mutex_unlock(&net_mutex);
> +	up_read(&net_sem);
>  
>  	register_pernet_subsys(&net_ns_ops);
>  
> @@ -991,9 +991,9 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
>  int register_pernet_subsys(struct pernet_operations *ops)
>  {
>  	int error;
> -	mutex_lock(&net_mutex);
> +	down_write(&net_sem);
>  	error =  register_pernet_operations(first_device, ops);
> -	mutex_unlock(&net_mutex);
> +	up_write(&net_sem);
>  	return error;
>  }
>  EXPORT_SYMBOL_GPL(register_pernet_subsys);
> @@ -1009,9 +1009,9 @@ EXPORT_SYMBOL_GPL(register_pernet_subsys);
>   */
>  void unregister_pernet_subsys(struct pernet_operations *ops)
>  {
> -	mutex_lock(&net_mutex);
> +	down_write(&net_sem);
>  	unregister_pernet_operations(ops);
> -	mutex_unlock(&net_mutex);
> +	up_write(&net_sem);
>  }
>  EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
>  
> @@ -1037,11 +1037,11 @@ EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
>  int register_pernet_device(struct pernet_operations *ops)
>  {
>  	int error;
> -	mutex_lock(&net_mutex);
> +	down_write(&net_sem);
>  	error = register_pernet_operations(&pernet_list, ops);
>  	if (!error && (first_device == &pernet_list))
>  		first_device = &ops->list;
> -	mutex_unlock(&net_mutex);
> +	up_write(&net_sem);
>  	return error;
>  }
>  EXPORT_SYMBOL_GPL(register_pernet_device);
> @@ -1057,11 +1057,11 @@ EXPORT_SYMBOL_GPL(register_pernet_device);
>   */
>  void unregister_pernet_device(struct pernet_operations *ops)
>  {
> -	mutex_lock(&net_mutex);
> +	down_write(&net_sem);
>  	if (&ops->list == first_device)
>  		first_device = first_device->next;
>  	unregister_pernet_operations(ops);
> -	mutex_unlock(&net_mutex);
> +	up_write(&net_sem);
>  }
>  EXPORT_SYMBOL_GPL(unregister_pernet_device);
>  
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 5ace48926b19..caa215fd170b 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -390,11 +390,11 @@ static void rtnl_lock_unregistering_all(void)
>  void rtnl_link_unregister(struct rtnl_link_ops *ops)
>  {
>  	/* Close the race with cleanup_net() */
> -	mutex_lock(&net_mutex);
> +	down_write(&net_sem);
>  	rtnl_lock_unregistering_all();
>  	__rtnl_link_unregister(ops);
>  	rtnl_unlock();
> -	mutex_unlock(&net_mutex);
> +	up_write(&net_sem);
>  }
>  EXPORT_SYMBOL_GPL(rtnl_link_unregister);
>  

  parent reply	other threads:[~2017-11-15  6:25 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-14 13:53 [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit Kirill Tkhai
2017-11-14 17:07 ` Eric Dumazet
2017-11-14 17:25   ` Kirill Tkhai
2017-11-14 17:44 ` Andrei Vagin
2017-11-14 18:00   ` Eric Dumazet
2017-11-14 22:15     ` Andrei Vagin
2017-11-14 18:04   ` Kirill Tkhai
2017-11-14 18:38     ` Andrei Vagin
2017-11-14 20:43       ` Kirill Tkhai
2017-11-14 18:11 ` Stephen Hemminger
2017-11-14 19:07   ` Eric Dumazet
2017-11-14 18:39 ` Cong Wang
2017-11-14 19:58   ` Kirill Tkhai
2017-11-15  3:19     ` Eric W. Biederman
2017-11-15  9:51       ` Kirill Tkhai
2017-11-15 12:36         ` Kirill Tkhai
2017-11-15 16:31           ` Eric W. Biederman
2017-11-17 18:36             ` Kirill Tkhai
2017-11-17 18:52               ` Eric W. Biederman
2017-11-17 20:16                 ` Kirill Tkhai
2017-11-15  6:25 ` Eric W. Biederman [this message]
2017-11-15  9:49   ` Kirill Tkhai
2017-11-15 16:29     ` Eric W. Biederman
2017-11-16  9:13       ` Kirill Tkhai
2017-11-17 16:46       ` Kirill Tkhai
2017-11-17 18:54         ` Eric W. Biederman
2017-11-17 20:12           ` Kirill Tkhai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87tvxw3vpe.fsf@xmission.com \
    --to=ebiederm@xmission.com \
    --cc=adobriyan@gmail.com \
    --cc=avagin@virtuozzo.com \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dsahern@gmail.com \
    --cc=fw@strlen.de \
    --cc=gorcunov@virtuozzo.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=kstewart@linuxfoundation.org \
    --cc=ktkhai@virtuozzo.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lucien.xin@gmail.com \
    --cc=mark.rutland@arm.com \
    --cc=mschiffer@universe-factory.net \
    --cc=netdev@vger.kernel.org \
    --cc=nicolas.dichtel@6wind.com \
    --cc=paul@paul-moore.com \
    --cc=pombredanne@nexb.com \
    --cc=roman.kapl@sysgo.com \
    --cc=rshearma@brocade.com \
    --cc=vyasevic@redhat.com \
    --cc=vyasevich@gmail.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.