From: ebiederm@xmission.com (Eric W. Biederman)
To: Ying Xue <ying.xue@windriver.com>
Cc: <netdev@vger.kernel.org>, <cwang@twopensource.com>,
	<herbert@gondor.apana.org.au>, <xemul@openvz.org>,
	<davem@davemloft.net>, <eric.dumazet@gmail.com>,
	<maxk@qti.qualcomm.com>, <stephen@networkplumber.org>,
	<tgraf@suug.ch>, <nicolas.dichtel@6wind.com>,
	<tom@herbertland.com>, <jchapman@katalix.com>,
	<erik.hugne@ericsson.com>, <jon.maloy@ericsson.com>,
	<horms@verge.net.au>
Subject: Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
Date: Thu, 07 May 2015 11:14:13 -0500
Message-ID: <87wq0kcqlm.fsf@x220.int.ebiederm.org>
In-Reply-To: <1430988770-28907-1-git-send-email-ying.xue@windriver.com> (Ying Xue's message of "Thu, 7 May 2015 16:52:39 +0800")

Ying Xue <ying.xue@windriver.com> writes:

> When commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
> netlink_kernel_create().") attempted to fix the race between put_net()
> and kernel socket creation, it adopted a complex solution: create the
> netlink socket inside the init_net namespace and then re-attach it to
> the desired namespace right after the socket is created; similarly,
> when closing the socket, move its namespace back to init_net so that
> the socket is destroyed in the same context in which it was created.
>
> But this solution makes the whole thing artificially complex. Its
> design is not only weird, it also forces every kernel module that
> creates kernel sockets to follow the namespace-switch model. More
> importantly, kernel sockets are created in the init_net namespace but
> released in another one, and this inconsistency causes trouble for
> some modules. For example, a tipc socket is inserted into an
> rhashtable when it is created, and each namespace has its own
> rhashtable for tipc sockets. With this approach, a tipc kernel socket
> is inserted into the rhashtable of init_net, but because the socket is
> released in another namespace, it cannot be found in the rhashtable
> of the new namespace.
>
> Therefore, we propose a simpler solution to avoid the race: if we
> find in __put_net() that a cleanup work is still pending, we do not
> queue a new cleanup work, which stops the cleanup process. The new
> proposal not only solves the race, but also lets us avoid unnecessary
> namespace switches when creating kernel sockets. Moreover, it
> guarantees that kernel sockets are always created and released in the
> same namespace.
>
> In this series, we first resolve the race with patch #1, and then
> remove the namespace switches from all relevant kernel modules one by
> one in patches #2 to #9. Once all dependencies on sk_change_net() are
> gone, we can delete that interface completely in patch #10. Lastly, we
> simplify the creation of kernel sockets by changing the behaviour of
> sock_create_kern() and sk_release_kernel(). If a kernel socket is
> created within a namespace other than init_net, we must drop the
> namespace reference once the socket has been allocated in sk_alloc();
> otherwise, the namespace may never be shut down. Therefore, we
> decrease the namespace's reference count once sock_create_kern()
> successfully creates a kernel socket within a namespace other than
> init_net. Correspondingly, the namespace's reference count must be
> increased again before the socket is destroyed in sk_release_kernel().
>
> Any comments are welcome.

I agree that commit 23fe18669e7f ("[NETNS]: Fix race between put_net()
and netlink_kernel_create().") was a hack.

However, it is not appropriate to call get_net on a network namespace
whose count might be zero.  I believe all of your patches currently
rely on that.  Instead, if you are going to avoid changing the network
namespace on a socket (a worthy goal), we need to build something like
sk_release_kernel that does not increase the network namespace
reference count.
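The distinction here is between get_net(), which increments the count
unconditionally, and maybe_get_net() (used in the patch below), which
only takes a reference if the count is still non-zero. The following
user-space sketch models that "increment unless zero" behaviour with a
C11 compare-and-swap loop; model_net, model_inc_not_zero() and
model_maybe_get_net() are hypothetical names, and this is not the
kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net. */
struct model_net {
	atomic_int count;
};

/* Like atomic_inc_not_zero(): increment *v unless it is zero.
 * Returns 1 if the increment happened, 0 otherwise. */
static int model_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return 1;
	}
	return 0;
}

/* Modeled on maybe_get_net(): return net with an extra reference, or
 * NULL if its count has already reached zero (namespace is dying). */
static struct model_net *model_maybe_get_net(struct model_net *net)
{
	if (!model_inc_not_zero(&net->count))
		return NULL;
	return net;
}
```

A caller that gets NULL back must not touch the namespace further,
whereas an unconditional get would silently move a dying namespace's
count from zero back to one.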

The following change shows how it is possible to always know that your
network namespace has a non-zero reference count in the network
namespace initialization methods.  My implementation of
lock_network_namespaces is problematic in that it busy-loops instead of
sleeping while waiting for network namespaces to finish unregistering.
But it is enough to show how the locking and reference counting can be
fixed.

Eric


diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a3abb719221f..81c53ccc5764 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -822,6 +822,49 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
 		ida_remove(&net_generic_ids, *ops->id);
 }
 
+static void unlock_network_namespaces(void)
+{
+	/* Drop the reference count to every network namespace
+	 * and then release the net_mutex.
+	 */
+	struct net *net;
+
+	for_each_net(net)
+		put_net(net);
+
+	mutex_unlock(&net_mutex);
+}
+
+static void lock_network_namespaces(void)
+{
+	/* Take the mutex lock ensuring no new network namespaces
+	 * and take a reference on all existing network namespaces
+	 * allowing network namespace initialization code to take
+	 * further references
+	 */
+	for (;;) {
+		struct net *net, *stop;
+
+		mutex_lock(&net_mutex);
+		for_each_net(net) {
+			if (!maybe_get_net(net))
+				goto undo;
+		}
+		return;
+undo:
+		/* Remember the network namespace whose reference
+		 * count was not acquired. */
+		stop = net;
+		for_each_net(net) {
+			if (net_eq(net, stop))
+				goto undone;
+			put_net(net);
+		}
+undone:
+		mutex_unlock(&net_mutex);
+	}
+}
+
 /**
  *      register_pernet_subsys - register a network namespace subsystem
  *	@ops:  pernet operations structure for the subsystem
@@ -844,9 +887,9 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
 int register_pernet_subsys(struct pernet_operations *ops)
 {
 	int error;
-	mutex_lock(&net_mutex);
+	lock_network_namespaces();
 	error =  register_pernet_operations(first_device, ops);
-	mutex_unlock(&net_mutex);
+	unlock_network_namespaces();
 	return error;
 }
 EXPORT_SYMBOL_GPL(register_pernet_subsys);
@@ -890,11 +933,11 @@ EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
 int register_pernet_device(struct pernet_operations *ops)
 {
 	int error;
-	mutex_lock(&net_mutex);
+	lock_network_namespaces();
 	error = register_pernet_operations(&pernet_list, ops);
 	if (!error && (first_device == &pernet_list))
 		first_device = &ops->list;
-	mutex_unlock(&net_mutex);
+	unlock_network_namespaces();
 	return error;
 }
 EXPORT_SYMBOL_GPL(register_pernet_device);
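The all-or-nothing pinning done by lock_network_namespaces() can be
modeled in user space as follows. This is a hedged sketch, not the patch
itself: it replaces the for_each_net() walk with an array, omits
net_mutex and the retry loop (one attempt is made and its result
returned), and treats put_net() as a plain decrement without queueing
cleanup work. All model_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net. */
struct model_net {
	atomic_int count;
};

/* maybe_get_net()-style helper: take a reference unless count is 0. */
static bool model_maybe_get(struct model_net *net)
{
	int old = atomic_load(&net->count);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&net->count, &old, old + 1))
			return true;
	}
	return false;
}

/* One pass of the loop body in lock_network_namespaces(): try to pin
 * every namespace; on the first failure, drop only the references
 * taken before the failing entry and report failure. */
static bool model_try_pin_all(struct model_net *nets, size_t n)
{
	size_t i, stop;

	for (i = 0; i < n; i++) {
		if (!model_maybe_get(&nets[i]))
			goto undo;
	}
	return true;
undo:
	stop = i;	/* the namespace whose reference was not acquired */
	for (i = 0; i < stop; i++)
		atomic_fetch_sub(&nets[i].count, 1);
	return false;
}

/* Mirrors unlock_network_namespaces(): drop one reference per netns. */
static void model_unpin_all(struct model_net *nets, size_t n)
{
	for (size_t i = 0; i < n; i++)
		atomic_fetch_sub(&nets[i].count, 1);
}
```

On failure the counts are left exactly as they were, mirroring the
stop/undo walk in the patch; in the patch the caller then unlocks
net_mutex and retries, which is the busy-wait noted above.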

Thread overview: 56+ messages
2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create() Ying Xue
2015-05-07  9:04   ` Herbert Xu
2015-05-07 17:19     ` Cong Wang
2015-05-07 17:28       ` Eric W. Biederman
2015-05-08 11:20       ` Eric W. Biederman
2015-05-08 11:20       ` Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 02/11] netlink: avoid unnecessary namespace switch when create netlink kernel sockets Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 03/11] tun: avoid unnecessary namespace switch during kernel socket creation Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 04/11] inet: " Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 05/11] udp_tunnel: avoid to switch namespace for tunnel socket Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 06/11] ip6_udp_tunnel: " Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 07/11] l2tp: avoid to switch namespace for l2tp " Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 08/11] ipvs: avoid to switch namespace for ipvs kernel socket Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 09/11] tipc: fix net leak issue Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 10/11] tipc: remove sk_change_net interface Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 11/11] net: change behaviours of functions of creating and releasing kernel sockets Ying Xue
2015-05-07 16:14 ` Eric W. Biederman [this message]
2015-05-07 18:19   ` [RFC PATCH net-next 00/11] netns: don't switch namespace while creating " Cong Wang
2015-05-07 18:26     ` Eric W. Biederman
2015-05-07 18:53       ` Cong Wang
2015-05-07 18:58         ` Eric W. Biederman
2015-05-07 19:29           ` Cong Wang
2015-05-07 20:01             ` Eric W. Biederman
2015-05-08  9:10               ` Ying Xue
2015-05-08 11:15                 ` Eric W. Biederman
2015-05-08  8:50   ` Ying Xue
2015-05-08  9:25     ` Ying Xue
2015-05-08 11:07     ` Eric W. Biederman
2015-05-08 16:33       ` Cong Wang
2015-05-08 14:07   ` Herbert Xu
2015-05-08 17:36     ` Eric W. Biederman
2015-05-08 20:27       ` Cong Wang
2015-05-08 21:13         ` Cong Wang
2015-05-08 22:08           ` Eric W. Biederman
2015-05-09  1:13       ` Herbert Xu
2015-05-09  1:53         ` Eric W. Biederman
2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
2015-05-09  2:07           ` [PATCH 1/6] tun: Utilize the normal socket network namespace refcounting Eric W. Biederman
2015-05-09  2:08           ` [PATCH 2/6] net: Add a struct net parameter to sock_create_kern Eric W. Biederman
2015-05-12  8:24             ` David Laight
2015-05-12  8:55               ` Eric W. Biederman
2015-05-12 11:48                 ` David Laight
2015-05-12 12:28                   ` Nicolas Dichtel
2015-05-12 13:16                     ` David Laight
2015-05-12 14:15                       ` Nicolas Dichtel
2015-05-12 15:58                       ` Eric W. Biederman
2015-05-12 14:45               ` David Miller
2015-05-09  2:09           ` [PATCH 3/6] net: Pass kern from net_proto_family.create to sk_alloc Eric W. Biederman
2015-05-09 16:51             ` Eric Dumazet
2015-05-09 17:31               ` Eric W. Biederman
2015-05-09  2:10           ` [PATCH 4/6] net: Modify sk_alloc to not reference count the netns of kernel sockets Eric W. Biederman
2015-05-09  2:11           ` [PATCH 5/6] netlink: Create kernel netlink sockets in the proper network namespace Eric W. Biederman
2015-05-09  2:12           ` [PATCH 6/6] net: kill sk_change_net and sk_release_kernel Eric W. Biederman
2015-05-09  2:38           ` [PATCH 0/6] Cleanup the kernel sockets Herbert Xu
2015-05-11 14:53           ` David Miller