All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
@ 2015-05-07  8:52 Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create() Ying Xue
                   ` (11 more replies)
  0 siblings, 12 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

When commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
netlink_kernel_create().") attempted to fix the race between put_net()
and kernel socket's creation, it adopted a complex solution: create
netlink socket inside init_net namespace and then re-attach it to the
desired one right after the socket is created; similarly, when close
the socket, move back its namespace to init_net so that the socket can
be destroyed in the context which is same as the socket creation.

But the solution artificially makes the whole thing complex as its
design is not only weird, but also it causes a bad consequence that
when all kernel modules create kernel sockets, they have to follow
the model of namespace switch. More importantly, with the way kernel
sockets are created in init_net namespace, but they are released in
another new ones. This inconsistent namespace brings some modules many
inconvenience. For example, what tipc socket is inserted to rhashtable
happens in socket's creation, and different namespace has different
rhashtable for tipc socket. With the approach, a tipc kernel socket
will be inserted into the rhashtable of init_net. But as releasing
the socket happens in another one, it causes what the socket cannot
be found from the rhashtable of the new namespace.

Therefore, we propose a simpler solution to avoid the race: if we
find there is still pending a cleanup work in __put_net(), we don't
queue a new cleanup work to stop the cleanup process. The new proposal
not only successfully solves the race, but also it can help us to
avoid unnecessary namespace switches when creating kernel sockets.
Moreover, it can guarantee that both creation and release of kernel
sockets happen in the same namespace at all time.

In the series, we first resolve the race with patch #1, and then
prevent namespace switches from happening in all relevant kernel
modules one by one from patch #2 to patch #9. Until now, as all
dependencies on sk_change_net() are killed, we can delete the
interface completely in patch #10. Lastly, we simplify the code of
creating kernel sockets through changing the original behaviours
of sock_create_kern() and sk_release_kernel(). If a kernel socket
is created within a namespace which is different with init_net,
we must put the reference counter of the namespace once the socket
is successfully allocated in sk_alloc(), otherwise, the namespace
is probably unable to be shut down finally. Therefore, we decrease
namespace's reference counter once a kernel socket is created
successfully by sock_create_kern() within a namespace which is
different with init_net. Similarly, namespace's reference counter
must be increased back before the socket is destroyed in
sk_release_kernel().

Welcome to any comments.

Ying Xue (11):
  netns: Fix race between put_net() and netlink_kernel_create()
  netlink: avoid unnecessary namespace switch when create netlink
    kernel sockets
  tun: avoid unnecessary namespace switch druing kernel socket creation
  inet: avoid unnecessary namespace switch during kernel socket
    creation
  udp_tunnel: avoid to switch namespace for tunnel socket
  ip6_udp_tunnel: avoid to switch namespace for tunnel socket
  l2tp: avoid to switch namespace for l2tp tunnel socket
  ipvs: avoid to switch namespace for ipvs kernel socket
  tipc: fix net leak issue
  tipc: remove sk_change_net interface
  net: change behaviours of functions of creating and releasing kernel
    sockets

 drivers/block/drbd/drbd_receiver.c |    6 ++++--
 drivers/net/tun.c                  |   14 +++++++++-----
 fs/afs/rxrpc.c                     |    3 ++-
 fs/dlm/lowcomms.c                  |   16 ++++++++--------
 include/linux/net.h                |    3 ++-
 include/net/sock.h                 |   16 ----------------
 net/bluetooth/rfcomm/core.c        |    3 ++-
 net/ceph/messenger.c               |    4 ++--
 net/core/net_namespace.c           |   10 ++++++++--
 net/core/sock.c                    |    5 ++---
 net/ipv4/af_inet.c                 |    4 +---
 net/ipv4/udp_tunnel.c              |    4 +---
 net/ipv6/ip6_udp_tunnel.c          |    4 +---
 net/l2tp/l2tp_core.c               |   12 ++++--------
 net/netfilter/ipvs/ip_vs_sync.c    |   16 ++--------------
 net/netlink/af_netlink.c           |    7 ++++---
 net/rxrpc/ar-local.c               |    4 ++--
 net/socket.c                       |    9 +++++++--
 net/tipc/server.c                  |    5 +++++
 19 files changed, 66 insertions(+), 79 deletions(-)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create()
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07  9:04   ` Herbert Xu
  2015-05-07  8:52 ` [RFC PATCH net-next 02/11] netlink: avoid unnecessary namespace switch when create netlink kernel sockets Ying Xue
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

Commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
netlink_kernel_create().") attempts to fix the following race
scenario:

put_net()
  if (atomic_dec_and_test(&net->refcnt))
    /* true */
      __put_net(net);
        queue_work(...);

/*
 * note: the net now has refcnt 0, but still in
 * the global list of net namespaces
 */

== re-schedule ==

register_pernet_subsys(&some_ops);
  register_pernet_operations(&some_ops);
    (*some_ops)->init(net);
      /*
       * we call netlink_kernel_create() here
       * in some places
       */
      netlink_kernel_create();
         sk_alloc();
            get_net(net); /* refcnt = 1 */
         /*
          * now we drop the net refcount not to
          * block the net namespace exit in the
          * future (or this can be done on the
          * error path)
          */
         put_net(sk->sk_net);
             if (atomic_dec_and_test(&...))
                   /*
                    * true. BOOOM! The net is
                    * scheduled for release twice
                    */

In order to prevent the race from happening, the commit adopted the
following solution: create netlink socket inside init_net namespace
and then re-attach it to the desired one right the socket is created;
similarly, when closing the socket, first move its namespace to
init_net so that the socket can be destroyed in the same context of
the socket creation.

Actually the proposal artificially makes the whole thing complex.
Instead there exists a simpler solution to avoid the risk of net
double release: if we find there is still pending a cleanup work
in __put_net(), we don't queue a new cleanup work again. The solution
is not only simple and easily understandable, but also it can help us
to avoid unnecessary namespace change for kernel sockets which will
be made in the future commits.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/core/net_namespace.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 78fc04a..058508f 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -242,6 +242,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	net->dev_base_seq = 1;
 	net->user_ns = user_ns;
 	idr_init(&net->netns_ids);
+	INIT_LIST_HEAD(&net->cleanup_list);
 
 	list_for_each_entry(ops, &pernet_list, list) {
 		error = ops_init(ops, net);
@@ -409,12 +410,17 @@ void __put_net(struct net *net)
 {
 	/* Cleanup the network namespace in process context */
 	unsigned long flags;
+	bool added = false;
 
 	spin_lock_irqsave(&cleanup_list_lock, flags);
-	list_add(&net->cleanup_list, &cleanup_list);
+	if (list_empty(&net->cleanup_list)) {
+		list_add(&net->cleanup_list, &cleanup_list);
+		added = true;
+	}
 	spin_unlock_irqrestore(&cleanup_list_lock, flags);
 
-	queue_work(netns_wq, &net_cleanup_work);
+	if (added)
+		queue_work(netns_wq, &net_cleanup_work);
 }
 EXPORT_SYMBOL_GPL(__put_net);
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 02/11] netlink: avoid unnecessary namespace switch when create netlink kernel sockets
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create() Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 03/11] tun: avoid unnecessary namespace switch during kernel socket creation Ying Xue
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

As now the race between put_net() and netlink_kernel_create() is
resolved, it's unnecessary to switch namespace for a netlink kernel
socket from init_net to its desirable one, which means the kernel
socket is created and released within one namespace. But as the
kernel socket is part of a namespace, we should not hold a reference
count to the namespace, otherwise, probably modules relying on it
cannot be stopped.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/netlink/af_netlink.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ec4adbd..9aec20a 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2479,11 +2479,11 @@ __netlink_kernel_create(struct net *net, int unit, struct module *module,
 	 * So we create one inside init_net and the move it to net.
 	 */
 
-	if (__netlink_create(&init_net, sock, cb_mutex, unit) < 0)
+	if (__netlink_create(net, sock, cb_mutex, unit) < 0)
 		goto out_sock_release_nosk;
 
 	sk = sock->sk;
-	sk_change_net(sk, net);
+	put_net(sock_net(sk));
 
 	if (!cfg || cfg->groups < 32)
 		groups = 32;
@@ -2539,7 +2539,8 @@ EXPORT_SYMBOL(__netlink_kernel_create);
 void
 netlink_kernel_release(struct sock *sk)
 {
-	sk_release_kernel(sk);
+	get_net(sock_net(sk));
+	sock_release(sk->sk_socket);
 }
 EXPORT_SYMBOL(netlink_kernel_release);
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 03/11] tun: avoid unnecessary namespace switch during kernel socket creation
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create() Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 02/11] netlink: avoid unnecessary namespace switch when create netlink kernel sockets Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 04/11] inet: " Ying Xue
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

After the race between put_net() and kernel socket's creation is
solved, it's unnecessary to switch namespace for a kernel socket
from init_net to its desirable one. But as the kernel socket is
released when namespace is shutdown, the namespace is never released
if the kernel socket holds a reference count to the namespace. So we
have to put its namespace's reference counter after the kernel socket
is created with sk_alloc(), and get it again before it's released.

Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 drivers/net/tun.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e470ae5..abf7bae 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -456,6 +456,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 {
 	struct tun_file *ntfile;
 	struct tun_struct *tun;
+	struct sock *sk = &tfile->sk;
 
 	tun = rtnl_dereference(tfile->tun);
 
@@ -482,7 +483,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 		tun_set_real_num_queues(tun);
 	} else if (tfile->detached && clean) {
 		tun = tun_enable_queue(tfile);
-		sock_put(&tfile->sk);
+		sock_put(sk);
 	}
 
 	if (clean) {
@@ -496,7 +497,8 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 
 		BUG_ON(!test_bit(SOCK_EXTERNALLY_ALLOCATED,
 				 &tfile->socket.flags));
-		sk_release_kernel(&tfile->sk);
+		get_net(sock_net(sk));
+		sock_release(sk->sk_socket);
 	}
 }
 
@@ -2155,15 +2157,17 @@ out:
 static int tun_chr_open(struct inode *inode, struct file * file)
 {
 	struct tun_file *tfile;
+	struct net *net;
 
 	DBG1(KERN_INFO, "tunX: tun_chr_open\n");
 
-	tfile = (struct tun_file *)sk_alloc(&init_net, AF_UNSPEC, GFP_KERNEL,
+	net = get_net(current->nsproxy->net_ns);
+	tfile = (struct tun_file *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
 					    &tun_proto);
 	if (!tfile)
 		return -ENOMEM;
 	RCU_INIT_POINTER(tfile->tun, NULL);
-	tfile->net = get_net(current->nsproxy->net_ns);
+	tfile->net = net;
 	tfile->flags = 0;
 	tfile->ifindex = 0;
 
@@ -2174,7 +2178,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
 	tfile->socket.ops = &tun_socket_ops;
 
 	sock_init_data(&tfile->socket, &tfile->sk);
-	sk_change_net(&tfile->sk, tfile->net);
+	put_net(sock_net(&tfile->sk));
 
 	tfile->sk.sk_write_space = tun_sock_write_space;
 	tfile->sk.sk_sndbuf = INT_MAX;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 04/11] inet: avoid unnecessary namespace switch during kernel socket creation
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
                   ` (2 preceding siblings ...)
  2015-05-07  8:52 ` [RFC PATCH net-next 03/11] tun: avoid unnecessary namespace switch during kernel socket creation Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 05/11] udp_tunnel: avoid to switch namespace for tunnel socket Ying Xue
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

As the race between put_net() and kernel socket's creation is gone,
it's unnecessary to switch namespace for a kernel socket from init_net
to its desirable one. But as the kernel socket is part of a namespace,
we should not hold a reference counter to the namespace, otherwise,
probably modules relying on it cannot be stopped.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 include/net/inet_common.h |    3 ++-
 net/ipv4/af_inet.c        |    4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 4a92423..cedc7c7 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -41,7 +41,8 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len,
 
 static inline void inet_ctl_sock_destroy(struct sock *sk)
 {
-	sk_release_kernel(sk);
+	get_net(sock_net(sk));
+	sock_release(sk->sk_socket);
 }
 
 #endif
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8b47a4d..50e6292 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1430,7 +1430,7 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 			 struct net *net)
 {
 	struct socket *sock;
-	int rc = sock_create_kern(family, type, protocol, &sock);
+	int rc = __sock_create(net, family, type, protocol, &sock, 1);
 
 	if (rc == 0) {
 		*sk = sock->sk;
@@ -1441,7 +1441,7 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 		 */
 		(*sk)->sk_prot->unhash(*sk);
 
-		sk_change_net(*sk, net);
+		put_net(sock_net(*sk));
 	}
 	return rc;
 }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 05/11] udp_tunnel: avoid to switch namespace for tunnel socket
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
                   ` (3 preceding siblings ...)
  2015-05-07  8:52 ` [RFC PATCH net-next 04/11] inet: " Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 06/11] ip6_udp_tunnel: " Ying Xue
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

There is no the race between put_net() and kernel socket creation, so
it's unnecessary to switch namespace for a kernel tunnel socket from
init_net to its desirable one.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/ipv4/udp_tunnel.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 6bb98cc..720ab82 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -15,11 +15,11 @@ int udp_sock_create4(struct net *net, struct udp_port_cfg *cfg,
 	struct socket *sock = NULL;
 	struct sockaddr_in udp_addr;
 
-	err = sock_create_kern(AF_INET, SOCK_DGRAM, 0, &sock);
+	err = __sock_create(net, AF_INET, SOCK_DGRAM, 0, &sock, 1);
 	if (err < 0)
 		goto error;
 
-	sk_change_net(sock->sk, net);
+	put_net(sock_net(sock->sk));
 
 	udp_addr.sin_family = AF_INET;
 	udp_addr.sin_addr = cfg->local_ip;
@@ -47,7 +47,8 @@ int udp_sock_create4(struct net *net, struct udp_port_cfg *cfg,
 error:
 	if (sock) {
 		kernel_sock_shutdown(sock, SHUT_RDWR);
-		sk_release_kernel(sock->sk);
+		get_net(sock_net(sock->sk));
+		sock_release(sock);
 	}
 	*sockp = NULL;
 	return err;
@@ -101,7 +102,8 @@ void udp_tunnel_sock_release(struct socket *sock)
 {
 	rcu_assign_sk_user_data(sock->sk, NULL);
 	kernel_sock_shutdown(sock, SHUT_RDWR);
-	sk_release_kernel(sock->sk);
+	get_net(sock_net(sock->sk));
+	sock_release(sock);
 }
 EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 06/11] ip6_udp_tunnel: avoid to switch namespace for tunnel socket
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
                   ` (4 preceding siblings ...)
  2015-05-07  8:52 ` [RFC PATCH net-next 05/11] udp_tunnel: avoid to switch namespace for tunnel socket Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 07/11] l2tp: avoid to switch namespace for l2tp " Ying Xue
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

As there is no the race between put_net() and kernel socket creation,
it's unnecessary to switch namespace for a kernel tunnel socket from
init_net to its desirable one.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/ipv6/ip6_udp_tunnel.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index bba8903..4da0bc5 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -19,11 +19,11 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
 	int err;
 	struct socket *sock = NULL;
 
-	err = sock_create_kern(AF_INET6, SOCK_DGRAM, 0, &sock);
+	err = __sock_create(net, AF_INET6, SOCK_DGRAM, 0, &sock, 1);
 	if (err < 0)
 		goto error;
 
-	sk_change_net(sock->sk, net);
+	put_net(sock_net(sock->sk));
 
 	udp6_addr.sin6_family = AF_INET6;
 	memcpy(&udp6_addr.sin6_addr, &cfg->local_ip6,
@@ -55,7 +55,8 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
 error:
 	if (sock) {
 		kernel_sock_shutdown(sock, SHUT_RDWR);
-		sk_release_kernel(sock->sk);
+		get_net(sock_net(sock->sk));
+		sock_release(sock);
 	}
 	*sockp = NULL;
 	return err;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 07/11] l2tp: avoid to switch namespace for l2tp tunnel socket
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
                   ` (5 preceding siblings ...)
  2015-05-07  8:52 ` [RFC PATCH net-next 06/11] ip6_udp_tunnel: " Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 08/11] ipvs: avoid to switch namespace for ipvs kernel socket Ying Xue
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

As there is no the race between put_net() and kernel socket creation,
it's unnecessary to switch namespace for a kernel tunnel socket from
init_net to its desirable one.

Cc: James Chapman <jchapman@katalix.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/l2tp/l2tp_core.c |   18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index a29a504..aa01daac 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1336,7 +1336,8 @@ static void l2tp_tunnel_del_work(struct work_struct *work)
 	} else {
 		if (sock)
 			kernel_sock_shutdown(sock, SHUT_RDWR);
-		sk_release_kernel(sk);
+		get_net(sock_net(sk));
+		sock_release(sock);
 	}
 
 	l2tp_tunnel_sock_put(sk);
@@ -1399,12 +1400,12 @@ static int l2tp_tunnel_sock_create(struct net *net,
 		if (cfg->local_ip6 && cfg->peer_ip6) {
 			struct sockaddr_l2tpip6 ip6_addr = {0};
 
-			err = sock_create_kern(AF_INET6, SOCK_DGRAM,
-					  IPPROTO_L2TP, &sock);
+			err = __sock_create(net, AF_INET6, SOCK_DGRAM,
+					    IPPROTO_L2TP, &sock, 1);
 			if (err < 0)
 				goto out;
 
-			sk_change_net(sock->sk, net);
+			put_net(sock_net(sock->sk));
 
 			ip6_addr.l2tp_family = AF_INET6;
 			memcpy(&ip6_addr.l2tp_addr, cfg->local_ip6,
@@ -1429,12 +1430,12 @@ static int l2tp_tunnel_sock_create(struct net *net,
 		{
 			struct sockaddr_l2tpip ip_addr = {0};
 
-			err = sock_create_kern(AF_INET, SOCK_DGRAM,
-					  IPPROTO_L2TP, &sock);
+			err = __sock_create(net, AF_INET, SOCK_DGRAM,
+					    IPPROTO_L2TP, &sock, 1);
 			if (err < 0)
 				goto out;
 
-			sk_change_net(sock->sk, net);
+			put_net(sock_net(sock->sk));
 
 			ip_addr.l2tp_family = AF_INET;
 			ip_addr.l2tp_addr = cfg->local_ip;
@@ -1462,7 +1463,8 @@ out:
 	*sockp = sock;
 	if ((err < 0) && sock) {
 		kernel_sock_shutdown(sock, SHUT_RDWR);
-		sk_release_kernel(sock->sk);
+		get_net(sock_net(sock->sk));
+		sock_release(sock);
 		*sockp = NULL;
 	}
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 08/11] ipvs: avoid to switch namespace for ipvs kernel socket
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
                   ` (6 preceding siblings ...)
  2015-05-07  8:52 ` [RFC PATCH net-next 07/11] l2tp: avoid to switch namespace for l2tp " Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 09/11] tipc: fix net leak issue Ying Xue
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

The race between put_net() and kernel socket creation is gone, so
it's unnecessary to change namespace from init_net to a desirable
one.

Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/netfilter/ipvs/ip_vs_sync.c |   32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index 19b9cce..4472fa0 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -1458,7 +1458,7 @@ static struct socket *make_send_sock(struct net *net, int id)
 	int result;
 
 	/* First create a socket move it to right name space later */
-	result = sock_create_kern(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
+	result = __sock_create(net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock, 1);
 	if (result < 0) {
 		pr_err("Error during creation of socket; terminating\n");
 		return ERR_PTR(result);
@@ -1466,9 +1466,10 @@ static struct socket *make_send_sock(struct net *net, int id)
 	/*
 	 * Kernel sockets that are a part of a namespace, should not
 	 * hold a reference to a namespace in order to allow to stop it.
-	 * After sk_change_net should be released using sk_release_kernel.
+	 * After the reference is decreased here with put_net(), it should
+	 * be increased again using get_net() before the socket is released.
 	 */
-	sk_change_net(sock->sk, net);
+	put_net(sock_net(sock->sk));
 	result = set_mcast_if(sock->sk, ipvs->master_mcast_ifn);
 	if (result < 0) {
 		pr_err("Error setting outbound mcast interface\n");
@@ -1497,7 +1498,8 @@ static struct socket *make_send_sock(struct net *net, int id)
 	return sock;
 
 error:
-	sk_release_kernel(sock->sk);
+	get_net(sock_net(sock->sk));
+	sock_release(sock);
 	return ERR_PTR(result);
 }
 
@@ -1518,7 +1520,7 @@ static struct socket *make_receive_sock(struct net *net, int id)
 	int result;
 
 	/* First create a socket */
-	result = sock_create_kern(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
+	result = __sock_create(net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock, 1);
 	if (result < 0) {
 		pr_err("Error during creation of socket; terminating\n");
 		return ERR_PTR(result);
@@ -1526,9 +1528,10 @@ static struct socket *make_receive_sock(struct net *net, int id)
 	/*
 	 * Kernel sockets that are a part of a namespace, should not
 	 * hold a reference to a namespace in order to allow to stop it.
-	 * After sk_change_net should be released using sk_release_kernel.
+	 * After the reference is decreased here with put_net(), it should
+	 * be increased again using get_net() before the socket is released.
 	 */
-	sk_change_net(sock->sk, net);
+	put_net(sock_net(sock->sk));
 	/* it is equivalent to the REUSEADDR option in user-space */
 	sock->sk->sk_reuse = SK_CAN_REUSE;
 	result = sysctl_sync_sock_size(ipvs);
@@ -1554,7 +1557,8 @@ static struct socket *make_receive_sock(struct net *net, int id)
 	return sock;
 
 error:
-	sk_release_kernel(sock->sk);
+	get_net(sock_net(sock->sk));
+	sock_release(sock);
 	return ERR_PTR(result);
 }
 
@@ -1692,7 +1696,8 @@ done:
 		ip_vs_sync_buff_release(sb);
 
 	/* release the sending multicast socket */
-	sk_release_kernel(tinfo->sock->sk);
+	get_net(sock_net(tinfo->sock->sk));
+	sock_release(tinfo->sock);
 	kfree(tinfo);
 
 	return 0;
@@ -1729,7 +1734,8 @@ static int sync_thread_backup(void *data)
 	}
 
 	/* release the sending multicast socket */
-	sk_release_kernel(tinfo->sock->sk);
+	get_net(sock_net(tinfo->sock->sk));
+	sock_release(tinfo->sock);
 	kfree(tinfo->buf);
 	kfree(tinfo);
 
@@ -1854,11 +1860,13 @@ int start_sync_thread(struct net *net, int state, char *mcast_ifn, __u8 syncid)
 	return 0;
 
 outsocket:
-	sk_release_kernel(sock->sk);
+	get_net(sock_net(sock->sk));
+	sock_release(sock);
 
 outtinfo:
 	if (tinfo) {
-		sk_release_kernel(tinfo->sock->sk);
+		get_net(sock_net(tinfo->sock->sk));
+		sock_release(tinfo->sock);
 		kfree(tinfo->buf);
 		kfree(tinfo);
 	}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 09/11] tipc: fix net leak issue
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
                   ` (7 preceding siblings ...)
  2015-05-07  8:52 ` [RFC PATCH net-next 08/11] ipvs: avoid to switch namespace for ipvs kernel socket Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 10/11] tipc: remove sk_change_net interface Ying Xue
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

After tipc internal server kernel socket is created with __sock_create(),
the reference counter to a namespace of the socket is held in sk_alloc().
This causes what the nemespace is never destroyed as the kernel socket is
released on namespace's shutdown. So, after the kernel socket's creation,
we should immediately put the net's reference counter.

Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/tipc/server.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/tipc/server.c b/net/tipc/server.c
index a91a2f7..29d9695 100644
--- a/net/tipc/server.c
+++ b/net/tipc/server.c
@@ -99,6 +99,7 @@ static void tipc_conn_kref_release(struct kref *kref)
 		if (test_bit(CF_SERVER, &con->flags)) {
 			__module_get(sock->ops->owner);
 			__module_get(sk->sk_prot_creator->owner);
+			get_net(sock_net(sock->sk));
 		}
 		saddr->scope = -TIPC_NODE_SCOPE;
 		kernel_bind(sock, (struct sockaddr *)saddr, sizeof(*saddr));
@@ -332,6 +333,9 @@ static struct socket *tipc_create_listen_sock(struct tipc_conn *con)
 				(char *)&s->imp, sizeof(s->imp));
 	if (ret < 0)
 		goto create_err;
+
+	put_net(sock_net(sock->sk));
+
 	ret = kernel_bind(sock, (struct sockaddr *)s->saddr, sizeof(*s->saddr));
 	if (ret < 0)
 		goto create_err;
@@ -377,6 +381,7 @@ static struct socket *tipc_create_listen_sock(struct tipc_conn *con)
 
 create_err:
 	kernel_sock_shutdown(sock, SHUT_RDWR);
+	get_net(sock_net(sock->sk));
 	sock_release(sock);
 	return NULL;
 }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 10/11] tipc: remove sk_change_net interface
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
                   ` (8 preceding siblings ...)
  2015-05-07  8:52 ` [RFC PATCH net-next 09/11] tipc: fix net leak issue Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07  8:52 ` [RFC PATCH net-next 11/11] net: change behaviours of functions of creating and releasing kernel sockets Ying Xue
  2015-05-07 16:14 ` [RFC PATCH net-next 00/11] netns: don't switch namespace while creating " Eric W. Biederman
  11 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

Until now all callers of sk_change_net() don't rely on it to create
kernel sockets, so now it's an appropriate time to purge it completely.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 include/net/sock.h |   16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3a4898e..586e352 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2192,22 +2192,6 @@ void sock_net_set(struct sock *sk, struct net *net)
 	write_pnet(&sk->sk_net, net);
 }
 
-/*
- * Kernel sockets, f.e. rtnl or icmp_socket, are a part of a namespace.
- * They should not hold a reference to a namespace in order to allow
- * to stop it.
- * Sockets after sk_change_net should be released using sk_release_kernel
- */
-static inline void sk_change_net(struct sock *sk, struct net *net)
-{
-	struct net *current_net = sock_net(sk);
-
-	if (!net_eq(current_net, net)) {
-		put_net(current_net);
-		sock_net_set(sk, net);
-	}
-}
-
 static inline struct sock *skb_steal_sock(struct sk_buff *skb)
 {
 	if (skb->sk) {
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [RFC PATCH net-next 11/11] net: change behaviours of functions of creating and releasing kernel sockets
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
                   ` (9 preceding siblings ...)
  2015-05-07  8:52 ` [RFC PATCH net-next 10/11] tipc: remove sk_change_net interface Ying Xue
@ 2015-05-07  8:52 ` Ying Xue
  2015-05-07 16:14 ` [RFC PATCH net-next 00/11] netns: don't switch namespace while creating " Eric W. Biederman
  11 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-07  8:52 UTC (permalink / raw)
  To: netdev
  Cc: cwang, herbert, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

So far it's unnecessary to switch namespace when creating a kernel
socket with __sock_create() within a namespace which is different
with init_net. But after that, we have to explicitly decrease the
namespace's reference counter, and increase it again before release
the socket. To make code as simple as possible, we decide to change
the sock_create_kern() API by adding an extra argument - net that
represents in which namespace the kernel socket is created. If all
kernel modules create kernel sockets with the updated API, the
reference counters of sockets' namespaces which are different with
init_net will be put, and these sockets must be released with
sk_release_kernel() in which corresponding namespaces' reference
counters will be increased before they are released.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 drivers/block/drbd/drbd_receiver.c |    6 ++++--
 fs/afs/rxrpc.c                     |    3 ++-
 fs/dlm/lowcomms.c                  |   16 ++++++++--------
 include/linux/net.h                |    3 ++-
 include/net/inet_common.h          |    3 +--
 net/bluetooth/rfcomm/core.c        |    3 ++-
 net/ceph/messenger.c               |    4 ++--
 net/core/sock.c                    |    5 ++---
 net/ipv4/af_inet.c                 |    4 +---
 net/ipv4/udp_tunnel.c              |   10 +++-------
 net/ipv6/ip6_udp_tunnel.c          |    7 ++-----
 net/l2tp/l2tp_core.c               |   18 ++++++------------
 net/netfilter/ipvs/ip_vs_sync.c    |   36 ++++++++----------------------------
 net/rxrpc/ar-local.c               |    4 ++--
 net/socket.c                       |    9 +++++++--
 15 files changed, 52 insertions(+), 79 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index cee2035..8d86b28 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -598,7 +598,8 @@ static struct socket *drbd_try_connect(struct drbd_connection *connection)
 	memcpy(&peer_in6, &connection->peer_addr, peer_addr_len);
 
 	what = "sock_create_kern";
-	err = sock_create_kern(((struct sockaddr *)&src_in6)->sa_family,
+	err = sock_create_kern(&init_net,
+			       ((struct sockaddr *)&src_in6)->sa_family,
 			       SOCK_STREAM, IPPROTO_TCP, &sock);
 	if (err < 0) {
 		sock = NULL;
@@ -693,7 +694,8 @@ static int prepare_listen_socket(struct drbd_connection *connection, struct acce
 	memcpy(&my_addr, &connection->my_addr, my_addr_len);
 
 	what = "sock_create_kern";
-	err = sock_create_kern(((struct sockaddr *)&my_addr)->sa_family,
+	err = sock_create_kern(&init_net,
+			       ((struct sockaddr *)&my_addr)->sa_family,
 			       SOCK_STREAM, IPPROTO_TCP, &s_listen);
 	if (err) {
 		s_listen = NULL;
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 3a57a1b..69486ca 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -85,7 +85,8 @@ int afs_open_socket(void)
 		return -ENOMEM;
 	}
 
-	ret = sock_create_kern(AF_RXRPC, SOCK_DGRAM, PF_INET, &socket);
+	ret = sock_create_kern(&init_net, AF_RXRPC, SOCK_DGRAM, PF_INET,
+			       &socket);
 	if (ret < 0) {
 		destroy_workqueue(afs_async_calls);
 		_leave(" = %d [socket]", ret);
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index d08e079..754fd6c 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -921,8 +921,8 @@ static int tcp_accept_from_sock(struct connection *con)
 	mutex_unlock(&connections_lock);
 
 	memset(&peeraddr, 0, sizeof(peeraddr));
-	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
-				  IPPROTO_TCP, &newsock);
+	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
+				  SOCK_STREAM, IPPROTO_TCP, &newsock);
 	if (result < 0)
 		return -ENOMEM;
 
@@ -1173,8 +1173,8 @@ static void tcp_connect_to_sock(struct connection *con)
 		goto out;
 
 	/* Create a socket to communicate with */
-	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
-				  IPPROTO_TCP, &sock);
+	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
+				  SOCK_STREAM, IPPROTO_TCP, &sock);
 	if (result < 0)
 		goto out_err;
 
@@ -1258,8 +1258,8 @@ static struct socket *tcp_create_listen_sock(struct connection *con,
 		addr_len = sizeof(struct sockaddr_in6);
 
 	/* Create a socket to communicate with */
-	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
-				  IPPROTO_TCP, &sock);
+	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
+				  SOCK_STREAM, IPPROTO_TCP, &sock);
 	if (result < 0) {
 		log_print("Can't create listening comms socket");
 		goto create_out;
@@ -1365,8 +1365,8 @@ static int sctp_listen_for_all(void)
 
 	log_print("Using SCTP for communications");
 
-	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_SEQPACKET,
-				  IPPROTO_SCTP, &sock);
+	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
+				  SOCK_SEQPACKET, IPPROTO_SCTP, &sock);
 	if (result < 0) {
 		log_print("Can't create comms socket, check SCTP is loaded");
 		goto out;
diff --git a/include/linux/net.h b/include/linux/net.h
index 738ea48..fc1fdc2 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -208,7 +208,8 @@ void sock_unregister(int family);
 int __sock_create(struct net *net, int family, int type, int proto,
 		  struct socket **res, int kern);
 int sock_create(int family, int type, int proto, struct socket **res);
-int sock_create_kern(int family, int type, int proto, struct socket **res);
+int sock_create_kern(struct net *net, int family, int type, int proto,
+		     struct socket **res);
 int sock_create_lite(int family, int type, int proto, struct socket **res);
 void sock_release(struct socket *sock);
 int sock_sendmsg(struct socket *sock, struct msghdr *msg);
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index cedc7c7..4a92423 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -41,8 +41,7 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len,
 
 static inline void inet_ctl_sock_destroy(struct sock *sk)
 {
-	get_net(sock_net(sk));
-	sock_release(sk->sk_socket);
+	sk_release_kernel(sk);
 }
 
 #endif
diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
index 4fea242..6b4bbfb 100644
--- a/net/bluetooth/rfcomm/core.c
+++ b/net/bluetooth/rfcomm/core.c
@@ -200,7 +200,8 @@ static int rfcomm_l2sock_create(struct socket **sock)
 
 	BT_DBG("");
 
-	err = sock_create_kern(PF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_L2CAP, sock);
+	err = sock_create_kern(&init_net, PF_BLUETOOTH, SOCK_SEQPACKET,
+			       BTPROTO_L2CAP, sock);
 	if (!err) {
 		struct sock *sk = (*sock)->sk;
 		sk->sk_data_ready   = rfcomm_l2data_ready;
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 967080a..073262f 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -480,8 +480,8 @@ static int ceph_tcp_connect(struct ceph_connection *con)
 	int ret;
 
 	BUG_ON(con->sock);
-	ret = sock_create_kern(con->peer_addr.in_addr.ss_family, SOCK_STREAM,
-			       IPPROTO_TCP, &sock);
+	ret = sock_create_kern(&init_net, con->peer_addr.in_addr.ss_family,
+			       SOCK_STREAM, IPPROTO_TCP, &sock);
 	if (ret)
 		return ret;
 	sock->sk->sk_allocation = GFP_NOFS;
diff --git a/net/core/sock.c b/net/core/sock.c
index e891bcf..41aa188 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1473,10 +1473,9 @@ void sk_release_kernel(struct sock *sk)
 	if (sk == NULL || sk->sk_socket == NULL)
 		return;
 
-	sock_hold(sk);
-	sock_net_set(sk, get_net(&init_net));
+	if (!net_eq(&init_net, sock_net(sk)))
+		get_net(sock_net(sk));
 	sock_release(sk->sk_socket);
-	sock_put(sk);
 }
 EXPORT_SYMBOL(sk_release_kernel);
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 50e6292..ddc8369 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1430,7 +1430,7 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 			 struct net *net)
 {
 	struct socket *sock;
-	int rc = __sock_create(net, family, type, protocol, &sock, 1);
+	int rc = sock_create_kern(net, family, type, protocol, &sock);
 
 	if (rc == 0) {
 		*sk = sock->sk;
@@ -1440,8 +1440,6 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 		 * we do not wish this socket to see incoming packets.
 		 */
 		(*sk)->sk_prot->unhash(*sk);
-
-		put_net(sock_net(*sk));
 	}
 	return rc;
 }
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 720ab82..de4e134 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -15,12 +15,10 @@ int udp_sock_create4(struct net *net, struct udp_port_cfg *cfg,
 	struct socket *sock = NULL;
 	struct sockaddr_in udp_addr;
 
-	err = __sock_create(net, AF_INET, SOCK_DGRAM, 0, &sock, 1);
+	err = sock_create_kern(net, AF_INET, SOCK_DGRAM, 0, &sock);
 	if (err < 0)
 		goto error;
 
-	put_net(sock_net(sock->sk));
-
 	udp_addr.sin_family = AF_INET;
 	udp_addr.sin_addr = cfg->local_ip;
 	udp_addr.sin_port = cfg->local_udp_port;
@@ -47,8 +45,7 @@ int udp_sock_create4(struct net *net, struct udp_port_cfg *cfg,
 error:
 	if (sock) {
 		kernel_sock_shutdown(sock, SHUT_RDWR);
-		get_net(sock_net(sock->sk));
-		sock_release(sock);
+		sk_release_kernel(sock->sk);
 	}
 	*sockp = NULL;
 	return err;
@@ -102,8 +99,7 @@ void udp_tunnel_sock_release(struct socket *sock)
 {
 	rcu_assign_sk_user_data(sock->sk, NULL);
 	kernel_sock_shutdown(sock, SHUT_RDWR);
-	get_net(sock_net(sock->sk));
-	sock_release(sock);
+	sk_release_kernel(sock->sk);
 }
 EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
 
diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index 4da0bc5..b35c5cd 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -19,12 +19,10 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
 	int err;
 	struct socket *sock = NULL;
 
-	err = __sock_create(net, AF_INET6, SOCK_DGRAM, 0, &sock, 1);
+	err = sock_create_kern(net, AF_INET6, SOCK_DGRAM, 0, &sock);
 	if (err < 0)
 		goto error;
 
-	put_net(sock_net(sock->sk));
-
 	udp6_addr.sin6_family = AF_INET6;
 	memcpy(&udp6_addr.sin6_addr, &cfg->local_ip6,
 	       sizeof(udp6_addr.sin6_addr));
@@ -55,8 +53,7 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
 error:
 	if (sock) {
 		kernel_sock_shutdown(sock, SHUT_RDWR);
-		get_net(sock_net(sock->sk));
-		sock_release(sock);
+		sk_release_kernel(sock->sk);
 	}
 	*sockp = NULL;
 	return err;
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index aa01daac..7e1f736 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1336,8 +1336,7 @@ static void l2tp_tunnel_del_work(struct work_struct *work)
 	} else {
 		if (sock)
 			kernel_sock_shutdown(sock, SHUT_RDWR);
-		get_net(sock_net(sk));
-		sock_release(sock);
+		sk_release_kernel(sk);
 	}
 
 	l2tp_tunnel_sock_put(sk);
@@ -1400,13 +1399,11 @@ static int l2tp_tunnel_sock_create(struct net *net,
 		if (cfg->local_ip6 && cfg->peer_ip6) {
 			struct sockaddr_l2tpip6 ip6_addr = {0};
 
-			err = __sock_create(net, AF_INET6, SOCK_DGRAM,
-					    IPPROTO_L2TP, &sock, 1);
+			err = sock_create_kern(net, AF_INET6, SOCK_DGRAM,
+					       IPPROTO_L2TP, &sock);
 			if (err < 0)
 				goto out;
 
-			put_net(sock_net(sock->sk));
-
 			ip6_addr.l2tp_family = AF_INET6;
 			memcpy(&ip6_addr.l2tp_addr, cfg->local_ip6,
 			       sizeof(ip6_addr.l2tp_addr));
@@ -1430,13 +1427,11 @@ static int l2tp_tunnel_sock_create(struct net *net,
 		{
 			struct sockaddr_l2tpip ip_addr = {0};
 
-			err = __sock_create(net, AF_INET, SOCK_DGRAM,
-					    IPPROTO_L2TP, &sock, 1);
+			err = sock_create_kern(net, AF_INET, SOCK_DGRAM,
+					       IPPROTO_L2TP, &sock);
 			if (err < 0)
 				goto out;
 
-			put_net(sock_net(sock->sk));
-
 			ip_addr.l2tp_family = AF_INET;
 			ip_addr.l2tp_addr = cfg->local_ip;
 			ip_addr.l2tp_conn_id = tunnel_id;
@@ -1463,8 +1458,7 @@ out:
 	*sockp = sock;
 	if ((err < 0) && sock) {
 		kernel_sock_shutdown(sock, SHUT_RDWR);
-		get_net(sock_net(sock->sk));
-		sock_release(sock);
+		sk_release_kernel(sock->sk);
 		*sockp = NULL;
 	}
 
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index 4472fa0..de52dc8 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -1458,18 +1458,11 @@ static struct socket *make_send_sock(struct net *net, int id)
 	int result;
 
 	/* First create a socket move it to right name space later */
-	result = __sock_create(net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock, 1);
+	result = sock_create_kern(net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
 	if (result < 0) {
 		pr_err("Error during creation of socket; terminating\n");
 		return ERR_PTR(result);
 	}
-	/*
-	 * Kernel sockets that are a part of a namespace, should not
-	 * hold a reference to a namespace in order to allow to stop it.
-	 * After the reference is decreased here with put_net(), it should
-	 * be increased again using get_net() before the socket is released.
-	 */
-	put_net(sock_net(sock->sk));
 	result = set_mcast_if(sock->sk, ipvs->master_mcast_ifn);
 	if (result < 0) {
 		pr_err("Error setting outbound mcast interface\n");
@@ -1498,8 +1491,7 @@ static struct socket *make_send_sock(struct net *net, int id)
 	return sock;
 
 error:
-	get_net(sock_net(sock->sk));
-	sock_release(sock);
+	sk_release_kernel(sock->sk);
 	return ERR_PTR(result);
 }
 
@@ -1520,18 +1512,11 @@ static struct socket *make_receive_sock(struct net *net, int id)
 	int result;
 
 	/* First create a socket */
-	result = __sock_create(net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock, 1);
+	result = sock_create_kern(net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
 	if (result < 0) {
 		pr_err("Error during creation of socket; terminating\n");
 		return ERR_PTR(result);
 	}
-	/*
-	 * Kernel sockets that are a part of a namespace, should not
-	 * hold a reference to a namespace in order to allow to stop it.
-	 * After the reference is decreased here with put_net(), it should
-	 * be increased again using get_net() before the socket is released.
-	 */
-	put_net(sock_net(sock->sk));
 	/* it is equivalent to the REUSEADDR option in user-space */
 	sock->sk->sk_reuse = SK_CAN_REUSE;
 	result = sysctl_sync_sock_size(ipvs);
@@ -1557,8 +1542,7 @@ static struct socket *make_receive_sock(struct net *net, int id)
 	return sock;
 
 error:
-	get_net(sock_net(sock->sk));
-	sock_release(sock);
+	sk_release_kernel(sock->sk);
 	return ERR_PTR(result);
 }
 
@@ -1696,8 +1680,7 @@ done:
 		ip_vs_sync_buff_release(sb);
 
 	/* release the sending multicast socket */
-	get_net(sock_net(tinfo->sock->sk));
-	sock_release(tinfo->sock);
+	sk_release_kernel(tinfo->sock->sk);
 	kfree(tinfo);
 
 	return 0;
@@ -1734,8 +1717,7 @@ static int sync_thread_backup(void *data)
 	}
 
 	/* release the sending multicast socket */
-	get_net(sock_net(tinfo->sock->sk));
-	sock_release(tinfo->sock);
+	sk_release_kernel(tinfo->sock->sk);
 	kfree(tinfo->buf);
 	kfree(tinfo);
 
@@ -1860,13 +1842,11 @@ int start_sync_thread(struct net *net, int state, char *mcast_ifn, __u8 syncid)
 	return 0;
 
 outsocket:
-	get_net(sock_net(sock->sk));
-	sock_release(sock);
+	sk_release_kernel(sock->sk);
 
 outtinfo:
 	if (tinfo) {
-		get_net(sock_net(tinfo->sock->sk));
-		sock_release(tinfo->sock);
+		sk_release_kernel(tinfo->sock->sk);
 		kfree(tinfo->buf);
 		kfree(tinfo);
 	}
diff --git a/net/rxrpc/ar-local.c b/net/rxrpc/ar-local.c
index ca904ed..78483b4 100644
--- a/net/rxrpc/ar-local.c
+++ b/net/rxrpc/ar-local.c
@@ -73,8 +73,8 @@ static int rxrpc_create_local(struct rxrpc_local *local)
 	_enter("%p{%d}", local, local->srx.transport_type);
 
 	/* create a socket to represent the local endpoint */
-	ret = sock_create_kern(PF_INET, local->srx.transport_type, IPPROTO_UDP,
-			       &local->socket);
+	ret = sock_create_kern(&init_net, PF_INET, local->srx.transport_type,
+			       IPPROTO_UDP, &local->socket);
 	if (ret < 0) {
 		_leave(" = %d [socket]", ret);
 		return ret;
diff --git a/net/socket.c b/net/socket.c
index 884e329..09414d7 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1213,9 +1213,14 @@ int sock_create(int family, int type, int protocol, struct socket **res)
 }
 EXPORT_SYMBOL(sock_create);
 
-int sock_create_kern(int family, int type, int protocol, struct socket **res)
+int sock_create_kern(struct net *net, int family, int type, int protocol,
+		     struct socket **res)
 {
-	return __sock_create(&init_net, family, type, protocol, res, 1);
+	int err = __sock_create(net, family, type, protocol, res, 1);
+
+	if (!err && !net_eq(&init_net, net))
+		put_net(sock_net((*res)->sk));
+	return err;
 }
 EXPORT_SYMBOL(sock_create_kern);
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create()
  2015-05-07  8:52 ` [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create() Ying Xue
@ 2015-05-07  9:04   ` Herbert Xu
  2015-05-07 17:19     ` Cong Wang
  0 siblings, 1 reply; 56+ messages in thread
From: Herbert Xu @ 2015-05-07  9:04 UTC (permalink / raw)
  To: Ying Xue
  Cc: netdev, cwang, xemul, davem, eric.dumazet, ebiederm, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>
> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>  {
>  	/* Cleanup the network namespace in process context */
>  	unsigned long flags;
> +	bool added = false;
>  
>  	spin_lock_irqsave(&cleanup_list_lock, flags);
> -	list_add(&net->cleanup_list, &cleanup_list);
> +	if (list_empty(&net->cleanup_list)) {
> +		list_add(&net->cleanup_list, &cleanup_list);
> +		added = true;
> +	}

Stop right there.  If a ref count is hitting zero twice you've
got big problems.  Because it means that after hitting zero the
first time it'll become positive again (if only briefly) which
opens you up to a race against the cleanup.

So this idea is a non-starter.

The rules for ref counts are really simple, if you hit zero then
you're dead.  Can someone explain to me in simple terms why this
ref count is special and needs to hit zero twice?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
                   ` (10 preceding siblings ...)
  2015-05-07  8:52 ` [RFC PATCH net-next 11/11] net: change behaviours of functions of creating and releasing kernel sockets Ying Xue
@ 2015-05-07 16:14 ` Eric W. Biederman
  2015-05-07 18:19   ` Cong Wang
                     ` (2 more replies)
  11 siblings, 3 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-07 16:14 UTC (permalink / raw)
  To: Ying Xue
  Cc: netdev, cwang, herbert, xemul, davem, eric.dumazet, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

Ying Xue <ying.xue@windriver.com> writes:

> When commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
> netlink_kernel_create().") attempted to fix the race between put_net()
> and kernel socket's creation, it adopted a complex solution: create
> netlink socket inside init_net namespace and then re-attach it to the
> desired one right after the socket is created; similarly, when close
> the socket, move back its namespace to init_net so that the socket can
> be destroyed in the context which is same as the socket creation.
>
> But the solution artificially makes the whole thing complex as its
> design is not only weird, but also it causes a bad consequence that
> when all kernel modules create kernel sockets, they have to follow
> the model of namespace switch. More importantly, with the way kernel
> sockets are created in init_net namespace, but they are released in
> another new ones. This inconsistent namespace brings some modules many
> inconvenience. For example, what tipc socket is inserted to rhashtable
> happens in socket's creation, and different namespace has different
> rhashtable for tipc socket. With the approach, a tipc kernel socket
> will be inserted into the rhashtable of init_net. But as releasing
> the socket happens in another one, it causes what the socket cannot
> be found from the rhashtable of the new namespace.
>
> Therefore, we propose a simpler solution to avoid the race: if we
> find there is still pending a cleanup work in __put_net(), we don't
> queue a new cleanup work to stop the cleanup process. The new proposal
> not only successfully solves the race, but also it can help us to
> avoid unnecessary namespace switches when creating kernel sockets.
> Moreover, it can guarantee that both creation and release of kernel
> sockets happen in the same namespace at all time.
>
> In the series, we first resolve the race with patch #1, and then
> prevent namespace switches from happening in all relevant kernel
> modules one by one from patch #2 to patch #9. Until now, as all
> dependencies on sk_change_net() are killed, we can delete the
> interface completely in patch #10. Lastly, we simplify the code of
> creating kernel sockets through changing the original behaviours
> of sock_create_kern() and sk_release_kernel(). If a kernel socket
> is created within a namespace which is different with init_net,
> we must put the reference counter of the namespace once the socket
> is successfully allocated in sk_alloc(), otherwise, the namespace
> is probably unable to be shut down finally. Therefore, we decrease
> namespace's reference counter once a kernel socket is created
> successfully by sock_create_kern() within a namespace which is
> different with init_net. Similarly, namespace's reference counter
> must be increased back before the socket is destroyed in
> sk_release_kernel().
>
> Welcome to any comments.

I agree that commit 23fe18669e7f ("[NETNS]: Fix race between put_net()
and netlink_kernel_create()."  was a hack.

However it is not appropriate to call get_net on a network namespace
whose count might be zero.  I believe all of your patches rely on that
currently.  Instead we need to build something like sk_release_kernel
that does not increase the network namespace reference count if you are
going to avoid changing the network namespace on a socket (a worthy
goal).

The following change shows how it is possible to always know that your
network namespace has a non-zero reference count in the network
namespace initialization methods.  My implementation of
lock_network_namespaces is problematic in that it does not sleep
while network namespaces are unregistering.  But it is enough to show
how the locking and reference counting can be fixed.

Eric


diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a3abb719221f..81c53ccc5764 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -822,6 +822,49 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
 		ida_remove(&net_generic_ids, *ops->id);
 }
 
+static void unlock_network_namespaces(void)
+{
+	/* Drop the reference count to every network namespace
+	 * and then release the net_mutex.
+	 */
+	struct net *net;
+
+	for_each_net(net)
+		put_net(net);
+
+	mutex_unlock(&net_mutex);
+}
+
+static void lock_network_namespaces(void)
+{
+	/* Take the mutex lock ensuring no new network namespaces
+	 * and take a reference on all existing network namespaces
+	 * allowing network namespace initialization code to take
+	 * further references
+	 */
+	for (;;) {
+		struct net *net, *stop;
+
+		mutex_lock(&net_mutex);
+		for_each_net(net) {
+			if (!maybe_get_net(net))
+				goto undo;
+		}
+		return;
+undo:
+		/* Remember the network namespace whose reference
+		 * count was not acquired. */
+		stop = net;
+		for_each_net(net) {
+			if (net_eq(net, stop))
+				goto undone;
+			put_net(net);
+		}
+undone:
+		mutex_unlock(&net_mutex);
+	}
+}
+
 /**
  *      register_pernet_subsys - register a network namespace subsystem
  *	@ops:  pernet operations structure for the subsystem
@@ -844,9 +887,9 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
 int register_pernet_subsys(struct pernet_operations *ops)
 {
 	int error;
-	mutex_lock(&net_mutex);
+	lock_network_namespaces();
 	error =  register_pernet_operations(first_device, ops);
-	mutex_unlock(&net_mutex);
+	unlock_network_namespaces();
 	return error;
 }
 EXPORT_SYMBOL_GPL(register_pernet_subsys);
@@ -890,11 +933,11 @@ EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
 int register_pernet_device(struct pernet_operations *ops)
 {
 	int error;
-	mutex_lock(&net_mutex);
+	lock_network_namespaces();
 	error = register_pernet_operations(&pernet_list, ops);
 	if (!error && (first_device == &pernet_list))
 		first_device = &ops->list;
-	mutex_unlock(&net_mutex);
+	unlock_network_namespaces();
 	return error;
 }
 EXPORT_SYMBOL_GPL(register_pernet_device);

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create()
  2015-05-07  9:04   ` Herbert Xu
@ 2015-05-07 17:19     ` Cong Wang
  2015-05-07 17:28       ` Eric W. Biederman
                         ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Cong Wang @ 2015-05-07 17:19 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ying Xue, netdev, xemul, David Miller, Eric Dumazet,
	Eric W. Biederman, maxk, Stephen Hemminger, Thomas Graf,
	Nicolas Dichtel, Tom Herbert, James Chapman, Erik Hugne,
	jon.maloy, horms

On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>
>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>  {
>>       /* Cleanup the network namespace in process context */
>>       unsigned long flags;
>> +     bool added = false;
>>
>>       spin_lock_irqsave(&cleanup_list_lock, flags);
>> -     list_add(&net->cleanup_list, &cleanup_list);
>> +     if (list_empty(&net->cleanup_list)) {
>> +             list_add(&net->cleanup_list, &cleanup_list);
>> +             added = true;
>> +     }
>
> Stop right there.  If a ref count is hitting zero twice you've
> got big problems.  Because it means that after hitting zero the
> first time it'll become positive again (if only briefly) which
> opens you up to a race against the cleanup.

That is true.

>
> So this idea is a non-starter.
>
> The rules for ref counts are really simple, if you hit zero then
> you're dead.  Can someone explain to me in simple terms why this
> ref count is special and needs to hit zero twice?

Because after it hits zero, it queues the final cleanup to a workqueue,
which could be delayed (depends on when the worker will be scheduled),
therefore there is a small window that new ->init() could come in.

You can argue that netns->init() should not see a dead netns, but
unfortunately that seems even harder since removing netns from global
netns list requires rtnl lock which is not doable in __put_net().

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create()
  2015-05-07 17:19     ` Cong Wang
@ 2015-05-07 17:28       ` Eric W. Biederman
  2015-05-08 11:20       ` Eric W. Biederman
  2015-05-08 11:20       ` Ying Xue
  2 siblings, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-07 17:28 UTC (permalink / raw)
  To: Cong Wang
  Cc: Herbert Xu, Ying Xue, netdev, xemul, David Miller, Eric Dumazet,
	maxk, Stephen Hemminger, Thomas Graf, Nicolas Dichtel,
	Tom Herbert, James Chapman, Erik Hugne, jon.maloy, horms

Cong Wang <cwang@twopensource.com> writes:

> On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>>
>>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>>  {
>>>       /* Cleanup the network namespace in process context */
>>>       unsigned long flags;
>>> +     bool added = false;
>>>
>>>       spin_lock_irqsave(&cleanup_list_lock, flags);
>>> -     list_add(&net->cleanup_list, &cleanup_list);
>>> +     if (list_empty(&net->cleanup_list)) {
>>> +             list_add(&net->cleanup_list, &cleanup_list);
>>> +             added = true;
>>> +     }
>>
>> Stop right there.  If a ref count is hitting zero twice you've
>> got big problems.  Because it means that after hitting zero the
>> first time it'll become positive again (if only briefly) which
>> opens you up to a race against the cleanup.
>
> That is true.
>
>>
>> So this idea is a non-starter.
>>
>> The rules for ref counts are really simple, if you hit zero then
>> you're dead.  Can someone explain to me in simple terms why this
>> ref count is special and needs to hit zero twice?
>
> Because after it hits zero, it queues the final cleanup to a workqueue,
> which could be delayed (depends on when the worker will be scheduled),
> therefore there is a small window that new ->init() could come in.
>
> You can argue that netns->init() should not see a dead netns, but
> unfortunately that seems even harder since removing netns from global
> netns list requires rtnl lock which is not doable in __put_net().

See my reply upthread where I have already written the code to ensure
netns->init() does not see a dead netns.  It is a very reasonable
expectation and not all that hard to ensure.

Eric

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-07 16:14 ` [RFC PATCH net-next 00/11] netns: don't switch namespace while creating " Eric W. Biederman
@ 2015-05-07 18:19   ` Cong Wang
  2015-05-07 18:26     ` Eric W. Biederman
  2015-05-08  8:50   ` Ying Xue
  2015-05-08 14:07   ` Herbert Xu
  2 siblings, 1 reply; 56+ messages in thread
From: Cong Wang @ 2015-05-07 18:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ying Xue, netdev, Herbert Xu, xemul, David Miller, Eric Dumazet,
	maxk, Stephen Hemminger, Thomas Graf, Nicolas Dichtel,
	Tom Herbert, James Chapman, Erik Hugne, jon.maloy, horms

On Thu, May 7, 2015 at 9:14 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Ying Xue <ying.xue@windriver.com> writes:
>
>> When commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
>> netlink_kernel_create().") attempted to fix the race between put_net()
>> and kernel socket's creation, it adopted a complex solution: create
>> netlink socket inside init_net namespace and then re-attach it to the
>> desired one right after the socket is created; similarly, when close
>> the socket, move back its namespace to init_net so that the socket can
>> be destroyed in the context which is same as the socket creation.
>>
>> But the solution artificially makes the whole thing complex as its
>> design is not only weird, but also it causes a bad consequence that
>> when all kernel modules create kernel sockets, they have to follow
>> the model of namespace switch. More importantly, with the way kernel
>> sockets are created in init_net namespace, but they are released in
>> another new ones. This inconsistent namespace brings some modules many
>> inconvenience. For example, what tipc socket is inserted to rhashtable
>> happens in socket's creation, and different namespace has different
>> rhashtable for tipc socket. With the approach, a tipc kernel socket
>> will be inserted into the rhashtable of init_net. But as releasing
>> the socket happens in another one, it causes what the socket cannot
>> be found from the rhashtable of the new namespace.
>>
>> Therefore, we propose a simpler solution to avoid the race: if we
>> find there is still pending a cleanup work in __put_net(), we don't
>> queue a new cleanup work to stop the cleanup process. The new proposal
>> not only successfully solves the race, but also it can help us to
>> avoid unnecessary namespace switches when creating kernel sockets.
>> Moreover, it can guarantee that both creation and release of kernel
>> sockets happen in the same namespace at all time.
>>
>> In the series, we first resolve the race with patch #1, and then
>> prevent namespace switches from happening in all relevant kernel
>> modules one by one from patch #2 to patch #9. Until now, as all
>> dependencies on sk_change_net() are killed, we can delete the
>> interface completely in patch #10. Lastly, we simplify the code of
>> creating kernel sockets through changing the original behaviours
>> of sock_create_kern() and sk_release_kernel(). If a kernel socket
>> is created within a namespace which is different with init_net,
>> we must put the reference counter of the namespace once the socket
>> is successfully allocated in sk_alloc(), otherwise, the namespace
>> is probably unable to be shut down finally. Therefore, we decrease
>> namespace's reference counter once a kernel socket is created
>> successfully by sock_create_kern() within a namespace which is
>> different with init_net. Similarly, namespace's reference counter
>> must be increased back before the socket is destroyed in
>> sk_release_kernel().
>>
>> Welcome to any comments.
>
> I agree that commit 23fe18669e7f ("[NETNS]: Fix race between put_net()
> and netlink_kernel_create()."  was a hack.
>
> However it is not appropriate to call get_net on a network namespace
> whose count might be zero.  I believe all of your patches rely on that
> currently.  Instead we need to build something like sk_release_kernel
> that does not increase the network namespace reference count if you are
> going to avoid changing the network namespace on a socket (a worthy
> goal).
>
> The following change shows how it is possible to always know that your
> network namespace has a non-zero reference count in the network
> namespace initialization methods.  My implementation of
> lock_network_namespaces is problematic in that it does not sleep
> while network namespaces are unregistering.  But it is enough to show
> how the locking and reference counting can be fixed.

Why does this have to be so complicated? We can simply avoid
calling ops_init() by skipping those in cleanup_list, no?

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 78fc04a..c7cbd5a 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -242,6 +242,7 @@ static __net_init int setup_net(struct net *net,
struct user_namespace *user_ns)
        net->dev_base_seq = 1;
        net->user_ns = user_ns;
        idr_init(&net->netns_ids);
+       INIT_LIST_HEAD(&net->cleanup_list);

        list_for_each_entry(ops, &pernet_list, list) {
                error = ops_init(ops, net);
@@ -737,6 +738,8 @@ static int __register_pernet_operations(struct
list_head *list,
        list_add_tail(&ops->list, list);
        if (ops->init || (ops->id && ops->size)) {
                for_each_net(net) {
+                       if (!list_empty(net->cleanup_list)) // <---
Need a big comment here;
+                               continue;
                        error = ops_init(ops, net);
                        if (error)
                                goto out_undo;

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-07 18:19   ` Cong Wang
@ 2015-05-07 18:26     ` Eric W. Biederman
  2015-05-07 18:53       ` Cong Wang
  0 siblings, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-07 18:26 UTC (permalink / raw)
  To: Cong Wang
  Cc: Ying Xue, netdev, Herbert Xu, xemul, David Miller, Eric Dumazet,
	maxk, Stephen Hemminger, Thomas Graf, Nicolas Dichtel,
	Tom Herbert, James Chapman, Erik Hugne, jon.maloy, horms

Cong Wang <cwang@twopensource.com> writes:

> On Thu, May 7, 2015 at 9:14 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Ying Xue <ying.xue@windriver.com> writes:
>>
>>> When commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
>>> netlink_kernel_create().") attempted to fix the race between put_net()
>>> and kernel socket's creation, it adopted a complex solution: create
>>> netlink socket inside init_net namespace and then re-attach it to the
>>> desired one right after the socket is created; similarly, when close
>>> the socket, move back its namespace to init_net so that the socket can
>>> be destroyed in the context which is same as the socket creation.
>>>
>>> But the solution artificially makes the whole thing complex as its
>>> design is not only weird, but also it causes a bad consequence that
>>> when all kernel modules create kernel sockets, they have to follow
>>> the model of namespace switch. More importantly, with the way kernel
>>> sockets are created in init_net namespace, but they are released in
>>> another new ones. This inconsistent namespace brings some modules many
>>> inconvenience. For example, what tipc socket is inserted to rhashtable
>>> happens in socket's creation, and different namespace has different
>>> rhashtable for tipc socket. With the approach, a tipc kernel socket
>>> will be inserted into the rhashtable of init_net. But as releasing
>>> the socket happens in another one, it causes what the socket cannot
>>> be found from the rhashtable of the new namespace.
>>>
>>> Therefore, we propose a simpler solution to avoid the race: if we
>>> find there is still pending a cleanup work in __put_net(), we don't
>>> queue a new cleanup work to stop the cleanup process. The new proposal
>>> not only successfully solves the race, but also it can help us to
>>> avoid unnecessary namespace switches when creating kernel sockets.
>>> Moreover, it can guarantee that both creation and release of kernel
>>> sockets happen in the same namespace at all time.
>>>
>>> In the series, we first resolve the race with patch #1, and then
>>> prevent namespace switches from happening in all relevant kernel
>>> modules one by one from patch #2 to patch #9. Until now, as all
>>> dependencies on sk_change_net() are killed, we can delete the
>>> interface completely in patch #10. Lastly, we simplify the code of
>>> creating kernel sockets through changing the original behaviours
>>> of sock_create_kern() and sk_release_kernel(). If a kernel socket
>>> is created within a namespace which is different with init_net,
>>> we must put the reference counter of the namespace once the socket
>>> is successfully allocated in sk_alloc(), otherwise, the namespace
>>> is probably unable to be shut down finally. Therefore, we decrease
>>> namespace's reference counter once a kernel socket is created
>>> successfully by sock_create_kern() within a namespace which is
>>> different with init_net. Similarly, namespace's reference counter
>>> must be increased back before the socket is destroyed in
>>> sk_release_kernel().
>>>
>>> Welcome to any comments.
>>
>> I agree that commit 23fe18669e7f ("[NETNS]: Fix race between put_net()
>> and netlink_kernel_create()."  was a hack.
>>
>> However it is not appropriate to call get_net on a network namespace
>> whose count might be zero.  I believe all of your patches rely on that
>> currently.  Instead we need to build something like sk_release_kernel
>> that does not increase the network namespace reference count if you are
>> going to avoid changing the network namespace on a socket (a worthy
>> goal).
>>
>> The following change shows how it is possible to always know that your
>> network namespace has a non-zero reference count in the network
>> namespace initialization methods.  My implementation of
>> lock_network_namespaces is problematic in that it does not sleep
>> while network namespaces are unregistering.  But it is enough to show
>> how the locking and reference counting can be fixed.
>
> Why does this have to be so complicated? We can simply avoid
> calling ops_init() by skipping those in cleanup_list, no?

The problem is that there is a single list of methods to call and if you
simply skip calling the initialization methods for a struct net and add
yourself to the list cleanup_net will then call the cleanup methods
without calling the cleanup methods.

Simply limiting new network namespace registrations to a point when
network namespaces are not being registered or unregisted seems like
the simplest way to achieve this effect.

> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 78fc04a..c7cbd5a 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -242,6 +242,7 @@ static __net_init int setup_net(struct net *net,
> struct user_namespace *user_ns)
>         net->dev_base_seq = 1;
>         net->user_ns = user_ns;
>         idr_init(&net->netns_ids);
> +       INIT_LIST_HEAD(&net->cleanup_list);
>
>         list_for_each_entry(ops, &pernet_list, list) {
>                 error = ops_init(ops, net);
> @@ -737,6 +738,8 @@ static int __register_pernet_operations(struct
> list_head *list,
>         list_add_tail(&ops->list, list);
>         if (ops->init || (ops->id && ops->size)) {
>                 for_each_net(net) {
> +                       if (!list_empty(net->cleanup_list)) // <---
> Need a big comment here;
> +                               continue;
>                         error = ops_init(ops, net);
>                         if (error)
>                                 goto out_undo;

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-07 18:26     ` Eric W. Biederman
@ 2015-05-07 18:53       ` Cong Wang
  2015-05-07 18:58         ` Eric W. Biederman
  0 siblings, 1 reply; 56+ messages in thread
From: Cong Wang @ 2015-05-07 18:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ying Xue, netdev, Herbert Xu, Pavel Emelyanov, David Miller,
	Eric Dumazet, maxk, Stephen Hemminger, Thomas Graf,
	Nicolas Dichtel, Tom Herbert, James Chapman, Erik Hugne,
	jon.maloy, Simon Horman

On Thu, May 7, 2015 at 11:26 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Cong Wang <cwang@twopensource.com> writes:
>
>>
>> Why does this have to be so complicated? We can simply avoid
>> calling ops_init() by skipping those in cleanup_list, no?
>
> The problem is that there is a single list of methods to call and if you
> simply skip calling the initialization methods for a struct net and add
> yourself to the list cleanup_net will then call the cleanup methods
> without calling the cleanup methods.

If you mean pernet_list, ops->list has been already added before
for_each_net().

>
> Simply limiting new network namespace registrations to a point when
> network namespaces are not being registered or unregisted seems like
> the simplest way to achieve this effect.
>

Literally, any point before ops_init().

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-07 18:53       ` Cong Wang
@ 2015-05-07 18:58         ` Eric W. Biederman
  2015-05-07 19:29           ` Cong Wang
  0 siblings, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-07 18:58 UTC (permalink / raw)
  To: Cong Wang
  Cc: Ying Xue, netdev, Herbert Xu, Pavel Emelyanov, David Miller,
	Eric Dumazet, maxk, Stephen Hemminger, Thomas Graf,
	Nicolas Dichtel, Tom Herbert, James Chapman, Erik Hugne,
	jon.maloy, Simon Horman

Cong Wang <cwang@twopensource.com> writes:

> On Thu, May 7, 2015 at 11:26 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Cong Wang <cwang@twopensource.com> writes:
>>
>>>
>>> Why does this have to be so complicated? We can simply avoid
>>> calling ops_init() by skipping those in cleanup_list, no?
>>
>> The problem is that there is a single list of methods to call and if you
>> simply skip calling the initialization methods for a struct net and add
>> yourself to the list cleanup_net will then call the cleanup methods
>> without calling the cleanup methods.
>
> If you mean pernet_list, ops->list has been already added before
> for_each_net().
>
>>
>> Simply limiting new network namespace registrations to a point when
>> network namespaces are not being registered or unregisted seems like
>> the simplest way to achieve this effect.
>>
>
> Literally, any point before ops_init().

Think about what that what it means to add a set of operations to the
pernet_list and then to skip a network namespace with a count of 0 and
then to have that network namespace exit with those methods on
pernet_list.

Eric

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-07 18:58         ` Eric W. Biederman
@ 2015-05-07 19:29           ` Cong Wang
  2015-05-07 20:01             ` Eric W. Biederman
  0 siblings, 1 reply; 56+ messages in thread
From: Cong Wang @ 2015-05-07 19:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ying Xue, netdev, Herbert Xu, Pavel Emelyanov, David Miller,
	Eric Dumazet, maxk, Stephen Hemminger, Thomas Graf,
	Nicolas Dichtel, Tom Herbert, James Chapman, Erik Hugne,
	jon.maloy, Simon Horman

On Thu, May 7, 2015 at 11:58 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Cong Wang <cwang@twopensource.com> writes:
>
>> On Thu, May 7, 2015 at 11:26 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Cong Wang <cwang@twopensource.com> writes:
>>>
>>>>
>>>> Why does this have to be so complicated? We can simply avoid
>>>> calling ops_init() by skipping those in cleanup_list, no?
>>>
>>> The problem is that there is a single list of methods to call and if you
>>> simply skip calling the initialization methods for a struct net and add
>>> yourself to the list cleanup_net will then call the cleanup methods
>>> without calling the cleanup methods.
>>
>> If you mean pernet_list, ops->list has been already added before
>> for_each_net().
>>
>>>
>>> Simply limiting new network namespace registrations to a point when
>>> network namespaces are not being registered or unregisted seems like
>>> the simplest way to achieve this effect.
>>>
>>
>> Literally, any point before ops_init().
>
> Think about what that what it means to add a set of operations to the
> pernet_list and then to skip a network namespace with a count of 0 and
> then to have that network namespace exit with those methods on
> pernet_list.
>

That is easy to solve, isn't it?

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 78fc04a..d2af11e 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -242,6 +242,7 @@ static __net_init int setup_net(struct net *net,
struct user_namespace *user_ns)
        net->dev_base_seq = 1;
        net->user_ns = user_ns;
        idr_init(&net->netns_ids);
+       INIT_LIST_HEAD(&net->cleanup_list);

        list_for_each_entry(ops, &pernet_list, list) {
                error = ops_init(ops, net);
@@ -734,20 +735,21 @@ static int __register_pernet_operations(struct
list_head *list,
        int error;
        LIST_HEAD(net_exit_list);

-       list_add_tail(&ops->list, list);
        if (ops->init || (ops->id && ops->size)) {
                for_each_net(net) {
+                       if (!list_empty(&net->cleanup_list))
+                               continue;
                        error = ops_init(ops, net);
                        if (error)
                                goto out_undo;
                        list_add_tail(&net->exit_list, &net_exit_list);
                }
        }
+       list_add_tail(&ops->list, list);
        return 0;

 out_undo:
        /* If I have an error cleanup all namespaces I initialized */
-       list_del(&ops->list);
        ops_exit_list(ops, &net_exit_list);
        ops_free_list(ops, &net_exit_list);
        return error;


The problem with your approach is that the code is over complicated,
the netns core code is already too complicated. ;)

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-07 19:29           ` Cong Wang
@ 2015-05-07 20:01             ` Eric W. Biederman
  2015-05-08  9:10               ` Ying Xue
  0 siblings, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-07 20:01 UTC (permalink / raw)
  To: Cong Wang
  Cc: Ying Xue, netdev, Herbert Xu, Pavel Emelyanov, David Miller,
	Eric Dumazet, maxk, Stephen Hemminger, Thomas Graf,
	Nicolas Dichtel, Tom Herbert, James Chapman, Erik Hugne,
	jon.maloy, Simon Horman

Cong Wang <cwang@twopensource.com> writes:

> On Thu, May 7, 2015 at 11:58 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Cong Wang <cwang@twopensource.com> writes:
>>
>>> On Thu, May 7, 2015 at 11:26 AM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>> Cong Wang <cwang@twopensource.com> writes:
>>>>
>>>>>
>>>>> Why does this have to be so complicated? We can simply avoid
>>>>> calling ops_init() by skipping those in cleanup_list, no?
>>>>
>>>> The problem is that there is a single list of methods to call and if you
>>>> simply skip calling the initialization methods for a struct net and add
>>>> yourself to the list cleanup_net will then call the cleanup methods
>>>> without calling the cleanup methods.
>>>
>>> If you mean pernet_list, ops->list has been already added before
>>> for_each_net().
>>>
>>>>
>>>> Simply limiting new network namespace registrations to a point when
>>>> network namespaces are not being registered or unregisted seems like
>>>> the simplest way to achieve this effect.
>>>>
>>>
>>> Literally, any point before ops_init().
>>
>> Think about what that what it means to add a set of operations to the
>> pernet_list and then to skip a network namespace with a count of 0 and
>> then to have that network namespace exit with those methods on
>> pernet_list.
>>
>
> That is easy to solve, isn't it?

Nope.  That doesn't work.

Eric

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-07 16:14 ` [RFC PATCH net-next 00/11] netns: don't switch namespace while creating " Eric W. Biederman
  2015-05-07 18:19   ` Cong Wang
@ 2015-05-08  8:50   ` Ying Xue
  2015-05-08  9:25     ` Ying Xue
  2015-05-08 11:07     ` Eric W. Biederman
  2015-05-08 14:07   ` Herbert Xu
  2 siblings, 2 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-08  8:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev, cwang, herbert, xemul, davem, eric.dumazet, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

On 05/08/2015 12:14 AM, Eric W. Biederman wrote:
> I agree that commit 23fe18669e7f ("[NETNS]: Fix race between put_net()
> and netlink_kernel_create()."  was a hack.
>

Thanks for the agreement :)

> However it is not appropriate to call get_net on a network namespace
> whose count might be zero.

I will explain why it's still safe for us in another mail even if we do this.

  I believe all of your patches rely on that
> currently.

Yes, you are right.

  Instead we need to build something like sk_release_kernel
> that does not increase the network namespace reference count

Please refer to above comments.

 if you are
> going to avoid changing the network namespace on a socket (a worthy
> goal).
> 

Thanks to the conformation for the effort!

> The following change shows how it is possible to always know that your
> network namespace has a non-zero reference count in the network
> namespace initialization methods.  My implementation of
> lock_network_namespaces is problematic in that it does not sleep
> while network namespaces are unregistering.  But it is enough to show
> how the locking and reference counting can be fixed.
> 

If my understanding for your proposal is right, register_pernet_subsys() will
return a failed error code to its caller if it's found that there is a net whose
refcount is zero from net_namespace_list in lock_network_namespaces(), which
means the per namespace operation is not registered at all in this situation.
But in practice we should not reject the registration even if there is a dead
net linked in the global net_namespace_list.

Regards,
Ying

> Eric
> 
> 
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index a3abb719221f..81c53ccc5764 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -822,6 +822,49 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
>  		ida_remove(&net_generic_ids, *ops->id);
>  }
>  
> +static void unlock_network_namespaces(void)
> +{
> +	/* Drop the reference count to every network namespace
> +	 * and then release the net_mutex.
> +	 */
> +	struct net *net;
> +
> +	for_each_net(net)
> +		put_net(net);
> +
> +	mutex_unlock(&net_mutex);
> +}
> +
> +static void lock_network_namespaces(void)
> +{
> +	/* Take the mutex lock ensuring no new network namespaces
> +	 * and take a reference on all existing network namespaces
> +	 * allowing network namespace initialization code to take
> +	 * further references
> +	 */
> +	for (;;) {
> +		struct net *net, *stop;
> +
> +		mutex_lock(&net_mutex);
> +		for_each_net(net) {
> +			if (!maybe_get_net(net))
> +				goto undo;
> +		}
> +		return;
> +undo:
> +		/* Remember the network namespace whose reference
> +		 * count was not acquired. */
> +		stop = net;
> +		for_each_net(net) {
> +			if (net_eq(net, stop))
> +				goto undone;
> +			put_net(net);
> +		}
> +undone:
> +		mutex_unlock(&net_mutex);
> +	}
> +}
> +
>  /**
>   *      register_pernet_subsys - register a network namespace subsystem
>   *	@ops:  pernet operations structure for the subsystem
> @@ -844,9 +887,9 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
>  int register_pernet_subsys(struct pernet_operations *ops)
>  {
>  	int error;
> -	mutex_lock(&net_mutex);
> +	lock_network_namespaces();
>  	error =  register_pernet_operations(first_device, ops);
> -	mutex_unlock(&net_mutex);
> +	unlock_network_namespaces();
>  	return error;
>  }
>  EXPORT_SYMBOL_GPL(register_pernet_subsys);
> @@ -890,11 +933,11 @@ EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
>  int register_pernet_device(struct pernet_operations *ops)
>  {
>  	int error;
> -	mutex_lock(&net_mutex);
> +	lock_network_namespaces();
>  	error = register_pernet_operations(&pernet_list, ops);
>  	if (!error && (first_device == &pernet_list))
>  		first_device = &ops->list;
> -	mutex_unlock(&net_mutex);
> +	unlock_network_namespaces();
>  	return error;
>  }
>  EXPORT_SYMBOL_GPL(register_pernet_device);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-07 20:01             ` Eric W. Biederman
@ 2015-05-08  9:10               ` Ying Xue
  2015-05-08 11:15                 ` Eric W. Biederman
  0 siblings, 1 reply; 56+ messages in thread
From: Ying Xue @ 2015-05-08  9:10 UTC (permalink / raw)
  To: Eric W. Biederman, Cong Wang
  Cc: netdev, Herbert Xu, Pavel Emelyanov, David Miller, Eric Dumazet,
	maxk, Stephen Hemminger, Thomas Graf, Nicolas Dichtel,
	Tom Herbert, James Chapman, Erik Hugne, jon.maloy, Simon Horman

On 05/08/2015 04:01 AM, Eric W. Biederman wrote:
> Cong Wang <cwang@twopensource.com> writes:
> 
>> On Thu, May 7, 2015 at 11:58 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Cong Wang <cwang@twopensource.com> writes:
>>>
>>>> On Thu, May 7, 2015 at 11:26 AM, Eric W. Biederman
>>>> <ebiederm@xmission.com> wrote:
>>>>> Cong Wang <cwang@twopensource.com> writes:
>>>>>
>>>>>>
>>>>>> Why does this have to be so complicated? We can simply avoid
>>>>>> calling ops_init() by skipping those in cleanup_list, no?
>>>>>
>>>>> The problem is that there is a single list of methods to call and if you
>>>>> simply skip calling the initialization methods for a struct net and add
>>>>> yourself to the list cleanup_net will then call the cleanup methods
>>>>> without calling the cleanup methods.
>>>>
>>>> If you mean pernet_list, ops->list has been already added before
>>>> for_each_net().
>>>>
>>>>>
>>>>> Simply limiting new network namespace registrations to a point when
>>>>> network namespaces are not being registered or unregisted seems like
>>>>> the simplest way to achieve this effect.
>>>>>
>>>>
>>>> Literally, any point before ops_init().
>>>
>>> Think about what that what it means to add a set of operations to the
>>> pernet_list and then to skip a network namespace with a count of 0 and
>>> then to have that network namespace exit with those methods on
>>> pernet_list.
>>>
>>
>> That is easy to solve, isn't it?
> 
> Nope.  That doesn't work.
> 

Cong, although I don't know why Eric confirmed your solution did not work, in my
view it really exists a bit fault especially in locking policy. For instance,
net->cleanup_list may be linked to cleanup_list list and probably it's inserted
in net_kill_list too, and the both global lists are protected by two different
locks respectively. But when we check list_empty(&net->cleanup_list), any lock
is not held.

However, except for the point, overall, I think your idea is workable.

So, Eric, can you please further explain why Cong's proposal doesn't work?

Thanks,
Ying

> Eric
> 
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-08  8:50   ` Ying Xue
@ 2015-05-08  9:25     ` Ying Xue
  2015-05-08 11:07     ` Eric W. Biederman
  1 sibling, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-08  9:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev, cwang, herbert, xemul, davem, eric.dumazet, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

On 05/08/2015 04:50 PM, Ying Xue wrote:
> On 05/08/2015 12:14 AM, Eric W. Biederman wrote:
>> I agree that commit 23fe18669e7f ("[NETNS]: Fix race between put_net()
>> and netlink_kernel_create()."  was a hack.
>>
> 
> Thanks for the agreement :)
> 
>> However it is not appropriate to call get_net on a network namespace
>> whose count might be zero.
> 
> I will explain why it's still safe for us in another mail even if we do this.
> 
>   I believe all of your patches rely on that
>> currently.
> 
> Yes, you are right.
> 
>   Instead we need to build something like sk_release_kernel
>> that does not increase the network namespace reference count
> 
> Please refer to above comments.
> 
>  if you are
>> going to avoid changing the network namespace on a socket (a worthy
>> goal).
>>
> 
> Thanks to the conformation for the effort!
> 
>> The following change shows how it is possible to always know that your
>> network namespace has a non-zero reference count in the network
>> namespace initialization methods.  My implementation of
>> lock_network_namespaces is problematic in that it does not sleep
>> while network namespaces are unregistering.  But it is enough to show
>> how the locking and reference counting can be fixed.
>>
> 
> If my understanding for your proposal is right, register_pernet_subsys() will
> return a failed error code to its caller if it's found that there is a net whose
> refcount is zero from net_namespace_list in lock_network_namespaces(), which
> means the per namespace operation is not registered at all in this situation.
> But in practice we should not reject the registration even if there is a dead
> net linked in the global net_namespace_list.
> 

Please consider tipc case below:
tipc_init()
  register_pernet_subsys(&tipc_net_ops)
    lock_network_namespaces();
       for_each_net(net) {
         if (!maybe_get_net(net))
            goto undo; //we should exit from the entire process of registration.

This is obviously unreasonable for us. TIPC module is unable to be loaded
successfully as there is a dead net in net_namespace_list.

Regards,
Ying

> Regards,
> Ying
> 
>> Eric
>>
>>
>> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>> index a3abb719221f..81c53ccc5764 100644
>> --- a/net/core/net_namespace.c
>> +++ b/net/core/net_namespace.c
>> @@ -822,6 +822,49 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
>>  		ida_remove(&net_generic_ids, *ops->id);
>>  }
>>  
>> +static void unlock_network_namespaces(void)
>> +{
>> +	/* Drop the reference count to every network namespace
>> +	 * and then release the net_mutex.
>> +	 */
>> +	struct net *net;
>> +
>> +	for_each_net(net)
>> +		put_net(net);
>> +
>> +	mutex_unlock(&net_mutex);
>> +}
>> +
>> +static void lock_network_namespaces(void)
>> +{
>> +	/* Take the mutex lock ensuring no new network namespaces
>> +	 * and take a reference on all existing network namespaces
>> +	 * allowing network namespace initialization code to take
>> +	 * further references
>> +	 */
>> +	for (;;) {
>> +		struct net *net, *stop;
>> +
>> +		mutex_lock(&net_mutex);
>> +		for_each_net(net) {
>> +			if (!maybe_get_net(net))
>> +				goto undo;
>> +		}
>> +		return;
>> +undo:
>> +		/* Remember the network namespace whose reference
>> +		 * count was not acquired. */
>> +		stop = net;
>> +		for_each_net(net) {
>> +			if (net_eq(net, stop))
>> +				goto undone;
>> +			put_net(net);
>> +		}
>> +undone:
>> +		mutex_unlock(&net_mutex);
>> +	}
>> +}
>> +
>>  /**
>>   *      register_pernet_subsys - register a network namespace subsystem
>>   *	@ops:  pernet operations structure for the subsystem
>> @@ -844,9 +887,9 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
>>  int register_pernet_subsys(struct pernet_operations *ops)
>>  {
>>  	int error;
>> -	mutex_lock(&net_mutex);
>> +	lock_network_namespaces();
>>  	error =  register_pernet_operations(first_device, ops);
>> -	mutex_unlock(&net_mutex);
>> +	unlock_network_namespaces();
>>  	return error;
>>  }
>>  EXPORT_SYMBOL_GPL(register_pernet_subsys);
>> @@ -890,11 +933,11 @@ EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
>>  int register_pernet_device(struct pernet_operations *ops)
>>  {
>>  	int error;
>> -	mutex_lock(&net_mutex);
>> +	lock_network_namespaces();
>>  	error = register_pernet_operations(&pernet_list, ops);
>>  	if (!error && (first_device == &pernet_list))
>>  		first_device = &ops->list;
>> -	mutex_unlock(&net_mutex);
>> +	unlock_network_namespaces();
>>  	return error;
>>  }
>>  EXPORT_SYMBOL_GPL(register_pernet_device);
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-08  8:50   ` Ying Xue
  2015-05-08  9:25     ` Ying Xue
@ 2015-05-08 11:07     ` Eric W. Biederman
  2015-05-08 16:33       ` Cong Wang
  1 sibling, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-08 11:07 UTC (permalink / raw)
  To: Ying Xue
  Cc: netdev, cwang, herbert, xemul, davem, eric.dumazet, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

Ying Xue <ying.xue@windriver.com> writes:

> On 05/08/2015 12:14 AM, Eric W. Biederman wrote:
>> I agree that commit 23fe18669e7f ("[NETNS]: Fix race between put_net()
>> and netlink_kernel_create()."  was a hack.
>>
>
> Thanks for the agreement :)
>
>> However it is not appropriate to call get_net on a network namespace
>> whose count might be zero.
>
> I will explain why it's still safe for us in another mail even if we do this.
>
>   I believe all of your patches rely on that
>> currently.
>
> Yes, you are right.
>
>   Instead we need to build something like sk_release_kernel
>> that does not increase the network namespace reference count
>
> Please refer to above comments.
>
>  if you are
>> going to avoid changing the network namespace on a socket (a worthy
>> goal).
>> 
>
> Thanks to the conformation for the effort!
>
>> The following change shows how it is possible to always know that your
>> network namespace has a non-zero reference count in the network
>> namespace initialization methods.  My implementation of
>> lock_network_namespaces is problematic in that it does not sleep
>> while network namespaces are unregistering.  But it is enough to show
>> how the locking and reference counting can be fixed.
>> 
>
> If my understanding for your proposal is right, register_pernet_subsys() will
> return a failed error code to its caller if it's found that there is a net whose
> refcount is zero from net_namespace_list in lock_network_namespaces(), which
> means the per namespace operation is not registered at all in this situation.
> But in practice we should not reject the registration even if there is a dead
> net linked in the global net_namespace_list.

I agree which is why my implementation of lock_network_namespaces below
loops until there are no exiting network namespaces.

Eric

> Regards,
> Ying
>
>> Eric
>> 
>> 
>> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>> index a3abb719221f..81c53ccc5764 100644
>> --- a/net/core/net_namespace.c
>> +++ b/net/core/net_namespace.c
>> @@ -822,6 +822,49 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
>>  		ida_remove(&net_generic_ids, *ops->id);
>>  }
>>  
>> +static void unlock_network_namespaces(void)
>> +{
>> +	/* Drop the reference count to every network namespace
>> +	 * and then release the net_mutex.
>> +	 */
>> +	struct net *net;
>> +
>> +	for_each_net(net)
>> +		put_net(net);
>> +
>> +	mutex_unlock(&net_mutex);
>> +}
>> +
>> +static void lock_network_namespaces(void)
>> +{
>> +	/* Take the mutex lock ensuring no new network namespaces
>> +	 * and take a reference on all existing network namespaces
>> +	 * allowing network namespace initialization code to take
>> +	 * further references
>> +	 */
>> +	for (;;) {
>> +		struct net *net, *stop;
>> +
>> +		mutex_lock(&net_mutex);
>> +		for_each_net(net) {
>> +			if (!maybe_get_net(net))
>> +				goto undo;
>> +		}
>> +		return;
>> +undo:
>> +		/* Remember the network namespace whose reference
>> +		 * count was not acquired. */
>> +		stop = net;
>> +		for_each_net(net) {
>> +			if (net_eq(net, stop))
>> +				goto undone;
>> +			put_net(net);
>> +		}
>> +undone:
>> +		mutex_unlock(&net_mutex);
>> +	}
>> +}
>> +
>>  /**
>>   *      register_pernet_subsys - register a network namespace subsystem
>>   *	@ops:  pernet operations structure for the subsystem
>> @@ -844,9 +887,9 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
>>  int register_pernet_subsys(struct pernet_operations *ops)
>>  {
>>  	int error;
>> -	mutex_lock(&net_mutex);
>> +	lock_network_namespaces();
>>  	error =  register_pernet_operations(first_device, ops);
>> -	mutex_unlock(&net_mutex);
>> +	unlock_network_namespaces();
>>  	return error;
>>  }
>>  EXPORT_SYMBOL_GPL(register_pernet_subsys);
>> @@ -890,11 +933,11 @@ EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
>>  int register_pernet_device(struct pernet_operations *ops)
>>  {
>>  	int error;
>> -	mutex_lock(&net_mutex);
>> +	lock_network_namespaces();
>>  	error = register_pernet_operations(&pernet_list, ops);
>>  	if (!error && (first_device == &pernet_list))
>>  		first_device = &ops->list;
>> -	mutex_unlock(&net_mutex);
>> +	unlock_network_namespaces();
>>  	return error;
>>  }
>>  EXPORT_SYMBOL_GPL(register_pernet_device);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-08  9:10               ` Ying Xue
@ 2015-05-08 11:15                 ` Eric W. Biederman
  0 siblings, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-08 11:15 UTC (permalink / raw)
  To: Ying Xue
  Cc: Cong Wang, netdev, Herbert Xu, Pavel Emelyanov, David Miller,
	Eric Dumazet, maxk, Stephen Hemminger, Thomas Graf,
	Nicolas Dichtel, Tom Herbert, James Chapman, Erik Hugne,
	jon.maloy, Simon Horman

Ying Xue <ying.xue@windriver.com> writes:

> On 05/08/2015 04:01 AM, Eric W. Biederman wrote:
>> Cong Wang <cwang@twopensource.com> writes:
>> 
>>> On Thu, May 7, 2015 at 11:58 AM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>> Cong Wang <cwang@twopensource.com> writes:
>>>>
>>>>> On Thu, May 7, 2015 at 11:26 AM, Eric W. Biederman
>>>>> <ebiederm@xmission.com> wrote:
>>>>>> Cong Wang <cwang@twopensource.com> writes:
>>>>>>
>>>>>>>
>>>>>>> Why does this have to be so complicated? We can simply avoid
>>>>>>> calling ops_init() by skipping those in cleanup_list, no?
>>>>>>
>>>>>> The problem is that there is a single list of methods to call and if you
>>>>>> simply skip calling the initialization methods for a struct net and add
>>>>>> yourself to the list cleanup_net will then call the cleanup methods
>>>>>> without calling the cleanup methods.
>>>>>
>>>>> If you mean pernet_list, ops->list has been already added before
>>>>> for_each_net().
>>>>>
>>>>>>
>>>>>> Simply limiting new network namespace registrations to a point when
>>>>>> network namespaces are not being registered or unregisted seems like
>>>>>> the simplest way to achieve this effect.
>>>>>>
>>>>>
>>>>> Literally, any point before ops_init().
>>>>
>>>> Think about what that what it means to add a set of operations to the
>>>> pernet_list and then to skip a network namespace with a count of 0 and
>>>> then to have that network namespace exit with those methods on
>>>> pernet_list.
>>>>
>>>
>>> That is easy to solve, isn't it?
>> 
>> Nope.  That doesn't work.
>> 
>
> Cong, although I don't know why Eric confirmed your solution did not work, in my
> view it really exists a bit fault especially in locking policy. For instance,
> net->cleanup_list may be linked to cleanup_list list and probably it's inserted
> in net_kill_list too, and the both global lists are protected by two different
> locks respectively. But when we check list_empty(&net->cleanup_list), any lock
> is not held.
>
> However, except for the point, overall, I think your idea is workable.
>
> So, Eric, can you please further explain why Cong's proposal doesn't
> work?

Because changing where the list assignment happens under the net_mutex
does not affect anything, and I already explained why it did not work
once.

In particular it is possible to call the cleanup methods on a network
namespace where the initialization methods where not called.

If we want to avoid the complexity required to wait for no network
namespaces to be exiting that was implicit in my version of
lock_network_namespace() we need to allocate some per network namespace
per network operations state.  To remember if the per network namespace
operations have been initialized on a network namespace.  Essentially
creating a per network namespace copy of the pernet_list.

My end game would be to work towards reducing the scope of the net_mutex
so potentially we could have two network namespaces exiting at the same
time.

Eric

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create()
  2015-05-07 17:19     ` Cong Wang
  2015-05-07 17:28       ` Eric W. Biederman
@ 2015-05-08 11:20       ` Eric W. Biederman
  2015-05-08 11:20       ` Ying Xue
  2 siblings, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-08 11:20 UTC (permalink / raw)
  To: Cong Wang
  Cc: Herbert Xu, Ying Xue, netdev, xemul, David Miller, Eric Dumazet,
	maxk, Stephen Hemminger, Thomas Graf, Nicolas Dichtel,
	Tom Herbert, James Chapman, Erik Hugne, jon.maloy, horms

Cong Wang <cwang@twopensource.com> writes:

> On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>>
>>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>>  {
>>>       /* Cleanup the network namespace in process context */
>>>       unsigned long flags;
>>> +     bool added = false;
>>>
>>>       spin_lock_irqsave(&cleanup_list_lock, flags);
>>> -     list_add(&net->cleanup_list, &cleanup_list);
>>> +     if (list_empty(&net->cleanup_list)) {
>>> +             list_add(&net->cleanup_list, &cleanup_list);
>>> +             added = true;
>>> +     }
>>
>> Stop right there.  If a ref count is hitting zero twice you've
>> got big problems.  Because it means that after hitting zero the
>> first time it'll become positive again (if only briefly) which
>> opens you up to a race against the cleanup.
>
> That is true.
>
>>
>> So this idea is a non-starter.
>>
>> The rules for ref counts are really simple, if you hit zero then
>> you're dead.  Can someone explain to me in simple terms why this
>> ref count is special and needs to hit zero twice?
>
> Because after it hits zero, it queues the final cleanup to a workqueue,
> which could be delayed (depends on when the worker will be scheduled),
> therefore there is a small window that new ->init() could come in.

The bad case in all of this is if we hit that window and increment the
count.  The namespace could be removed from the lists ->exit() methods
could run.  Then the reference count could drop to zero again.

Having a network namespace alive with cleanup methods having already run
is too nasty to think about.  We really don't want to do that.

Eric

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create()
  2015-05-07 17:19     ` Cong Wang
  2015-05-07 17:28       ` Eric W. Biederman
  2015-05-08 11:20       ` Eric W. Biederman
@ 2015-05-08 11:20       ` Ying Xue
  2 siblings, 0 replies; 56+ messages in thread
From: Ying Xue @ 2015-05-08 11:20 UTC (permalink / raw)
  To: Cong Wang, Herbert Xu
  Cc: netdev, xemul, David Miller, Eric Dumazet, Eric W. Biederman,
	maxk, Stephen Hemminger, Thomas Graf, Nicolas Dichtel,
	Tom Herbert, James Chapman, Erik Hugne, jon.maloy, horms

Thank everyone for the review, comments, and suggestions!
Please see my inline responses to your inputs.

On 05/08/2015 01:19 AM, Cong Wang wrote:
> On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>>
>>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>>  {
>>>       /* Cleanup the network namespace in process context */
>>>       unsigned long flags;
>>> +     bool added = false;
>>>
>>>       spin_lock_irqsave(&cleanup_list_lock, flags);
>>> -     list_add(&net->cleanup_list, &cleanup_list);
>>> +     if (list_empty(&net->cleanup_list)) {
>>> +             list_add(&net->cleanup_list, &cleanup_list);
>>> +             added = true;
>>> +     }
>>
>> Stop right there.  If a ref count is hitting zero twice you've
>> got big problems.  Because it means that after hitting zero the
>> first time it'll become positive again (if only briefly) which
>> opens you up to a race against the cleanup.
> 
> That is true.
> 
>>
>> So this idea is a non-starter.
>>
>> The rules for ref counts are really simple, if you hit zero then
>> you're dead.  Can someone explain to me in simple terms why this
>> ref count is special and needs to hit zero twice?
> 
> Because after it hits zero, it queues the final cleanup to a workqueue,
> which could be delayed (depends on when the worker will be scheduled),
> therefore there is a small window that new ->init() could come in.
> 
> You can argue that netns->init() should not see a dead netns, but
> unfortunately that seems even harder since removing netns from global
> netns list requires rtnl lock which is not doable in __put_net().
> 
> 

Cong's understanding about the race between put_net and kernel creation is right.

But in my opinion the patch is still able to cover all kinds of race conditions.
You may ask why we can get reference count of a dead net again. As the race
scenario is a bit complex, please let me spend a little more time explaining why
it's safe for us. To make the case clear, please let us consider tipc module as
an example:

//Shutdown a namesapce thread:
put_net()
  if (atomic_dec_and_test(&net->refcnt))
    /* true */ -----> the net's refcount is zero now
      __put_net(net);
        queue_work(netns_wq, &net_cleanup_work);

//Worker thread:
cleanup_net()
  mutex_lock(&net_mutex);
  rtnl_lock();
  list_del_rcu(&net->list); //delete a net from net_namespace_list
  rtnl_unlock();
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list()
       ops->exit(net)
  mutex_unlock(&net_mutex);
  net_free(ns);

== re-schedule ==
//the thread of inserting tipc module:
tipc_init()
 register_pernet_subsys(&tipc_net_ops);
 mutex_lock(&net_mutex);
  __register_pernet_operations(&tipc_net_ops)
   for_each_net(net) //iterate the net_namespace_list
    ops_init(ops, net_dead)
      tipc_init_net(net_dead)
        tipc_topsrv_start(net_dead)
          tipc_create_listen_sock(net_dead)
            sock_create_kern(net_dead)
              sk_alloc();
                get_net(net_dead); /* refcnt = 1 */
              put_net(net_dead)
               __put_net(net_dead); /* refcnt = 0 */
                if (!list_empty(&net->cleanup_list))
                   queue_work(netns_wq, &net_cleanup_work);
           /*
            * As the net_dead is linked in net_namespace_list, we don't queue
            * the cleanup work again. cleanup_net() is not called, which means
            * the risk of double free of net is avoided.
            */

The reason why the original risk occurs is that destroying a net has to be done
in a worker thread. After a net refcount is decreased to 0, and before the
process of destroying net is finished in cleanup_net(), there remains a small
race window for us. In above case, during the tiny time, tipc module may be
inserted in another thread in which a TIPC kernel socket is created. During the
socket creation, a dead net refcout is increased from 0 to 1 in sk_allc(), and
then decreased to 0 in sk_change_net(). At the moment, the net's refcout
secondly hits zero in put_net(), subsequently __put_net() is called again, which
means cleanup_net() will be secondly called. So releasing a net twice happens!

However, as __register_pernet_operations() is protected by net_mutex and
cleanup_net() also needs to grab the net_mutex before it destroys a net, the
dead net is temporarily valid during the registration. Therefore, during this
time, we can safely operate a net to be dead like what we are doing in
tipc_init_net(), such as, get its refcout in sk_alloc() and subsequently
decrease it after socket creation. But after the registration is over, the dead
net will be freed when cleanup net thread worker is scheduled.

Actually there is another possible method to avoid namespace change. For
example, when we create a kernel socket, we don't crease its net's refcount at
all in sk_alloc(), correspondingly we don't decrease the net refcout for a
kernel socket in __sk_free(). However, it needs to change many code, and the
solution is also ugly, so it's not feasible.

Of course, if you find other race with the solution, please let me know.

Thanks,
Ying

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-07 16:14 ` [RFC PATCH net-next 00/11] netns: don't switch namespace while creating " Eric W. Biederman
  2015-05-07 18:19   ` Cong Wang
  2015-05-08  8:50   ` Ying Xue
@ 2015-05-08 14:07   ` Herbert Xu
  2015-05-08 17:36     ` Eric W. Biederman
  2 siblings, 1 reply; 56+ messages in thread
From: Herbert Xu @ 2015-05-08 14:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ying Xue, netdev, cwang, xemul, davem, eric.dumazet, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

On Thu, May 07, 2015 at 11:14:13AM -0500, Eric W. Biederman wrote:
>
> The following change shows how it is possible to always know that your
> network namespace has a non-zero reference count in the network
> namespace initialization methods.  My implementation of
> lock_network_namespaces is problematic in that it does not sleep
> while network namespaces are unregistering.  But it is enough to show
> how the locking and reference counting can be fixed.

If name spaces are created/deleted constantly this could take
a long time.  How about forcibly cleaning up entries when we
register a new op?

---8<---
Subject: netns: Do not init dead nets when registering new op

Currently when a new op is registered we will initialise all
current namespaces through that op, even those that are zombies
(zero ref count).  This is bad because it puts the onus on each
op implementor to deal with this and they are bound to slip up.

This patch fixes this by only initialising live namespaces.

This however brings about a new problem as when those zombies
are eventually cleaned up we will call op->exit even on them
even though they have not been initialised.

This is fixed by bringing forward parts of the work of the cleanup
thread and performing them during registration.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index f733656..4931100 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -133,6 +133,7 @@ struct net {
 #endif
 	struct sock		*diag_nlsk;
 	atomic_t		fnhe_genid;
+	bool			dead;
 };
 
 #include <linux/seq_file_net.h>
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 78fc04a..1f888a7 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -339,41 +339,38 @@ struct net *copy_net_ns(unsigned long flags,
 	return net;
 }
 
-static DEFINE_SPINLOCK(cleanup_list_lock);
-static LIST_HEAD(cleanup_list);  /* Must hold cleanup_list_lock to touch */
-
-static void cleanup_net(struct work_struct *work)
+static void unlink_net(struct net *net, struct list_head *net_exit_list)
 {
-	const struct pernet_operations *ops;
-	struct net *net, *tmp;
-	struct list_head net_kill_list;
-	LIST_HEAD(net_exit_list);
+	struct net *tmp;
 
-	/* Atomically snapshot the list of namespaces to cleanup */
-	spin_lock_irq(&cleanup_list_lock);
-	list_replace_init(&cleanup_list, &net_kill_list);
-	spin_unlock_irq(&cleanup_list_lock);
+	if (net->dead)
+		return;
 
-	mutex_lock(&net_mutex);
+	net->dead = true;
+
+	list_add_tail(&net->exit_list, net_exit_list);
 
 	/* Don't let anyone else find us. */
 	rtnl_lock();
-	list_for_each_entry(net, &net_kill_list, cleanup_list) {
-		list_del_rcu(&net->list);
-		list_add_tail(&net->exit_list, &net_exit_list);
-		for_each_net(tmp) {
-			int id = __peernet2id(tmp, net, false);
+	list_del_rcu(&net->list);
 
-			if (id >= 0) {
-				rtnl_net_notifyid(tmp, net, RTM_DELNSID, id);
-				idr_remove(&tmp->netns_ids, id);
-			}
-		}
-		idr_destroy(&net->netns_ids);
+	for_each_net(tmp) {
+		int id = __peernet2id(tmp, net, false);
 
+		if (id >= 0) {
+			rtnl_net_notifyid(tmp, net, RTM_DELNSID, id);
+			idr_remove(&tmp->netns_ids, id);
+		}
 	}
 	rtnl_unlock();
 
+	idr_destroy(&net->netns_ids);
+}
+
+static void exit_all_ops(struct list_head *net_exit_list)
+{
+	const struct pernet_operations *ops;
+
 	/*
 	 * Another CPU might be rcu-iterating the list, wait for it.
 	 * This needs to be before calling the exit() notifiers, so
@@ -383,11 +380,33 @@ static void cleanup_net(struct work_struct *work)
 
 	/* Run all of the network namespace exit methods */
 	list_for_each_entry_reverse(ops, &pernet_list, list)
-		ops_exit_list(ops, &net_exit_list);
+		ops_exit_list(ops, net_exit_list);
 
 	/* Free the net generic variables */
 	list_for_each_entry_reverse(ops, &pernet_list, list)
-		ops_free_list(ops, &net_exit_list);
+		ops_free_list(ops, net_exit_list);
+}
+
+static DEFINE_SPINLOCK(cleanup_list_lock);
+static LIST_HEAD(cleanup_list);  /* Must hold cleanup_list_lock to touch */
+
+static void cleanup_net(struct work_struct *work)
+{
+	struct list_head net_kill_list;
+	struct net *net, *tmp;
+	LIST_HEAD(net_exit_list);
+
+	/* Atomically snapshot the list of namespaces to cleanup */
+	spin_lock_irq(&cleanup_list_lock);
+	list_replace_init(&cleanup_list, &net_kill_list);
+	spin_unlock_irq(&cleanup_list_lock);
+
+	mutex_lock(&net_mutex);
+
+	list_for_each_entry(net, &net_kill_list, cleanup_list)
+		unlink_net(net, &net_exit_list);
+
+	exit_all_ops(&net_exit_list);
 
 	mutex_unlock(&net_mutex);
 
@@ -397,8 +416,7 @@ static void cleanup_net(struct work_struct *work)
 	rcu_barrier();
 
 	/* Finally it is safe to free my network namespace structure */
-	list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {
-		list_del_init(&net->exit_list);
+	list_for_each_entry_safe(net, tmp, &net_kill_list, cleanup_list) {
 		put_user_ns(net->user_ns);
 		net_drop_ns(net);
 	}
@@ -727,30 +745,59 @@ static int __init net_ns_init(void)
 pure_initcall(net_ns_init);
 
 #ifdef CONFIG_NET_NS
+static void grab_all_nets(struct list_head *all_net_list)
+{
+	struct net *net;
+	LIST_HEAD(net_exit_list);
+
+	for_each_net(net) {
+		if (maybe_get_net(net)) 
+			list_add_tail(&net->cleanup_list, all_net_list);
+		else
+			unlink_net(net, &net_exit_list);
+	}
+
+	exit_all_ops(&net_exit_list);
+}
+
+static void drop_all_nets(struct list_head *all_net_list)
+{
+	struct net *net, *next;
+
+	list_for_each_entry_safe(net, next, all_net_list, cleanup_list)
+		put_net(net);
+}
+
 static int __register_pernet_operations(struct list_head *list,
 					struct pernet_operations *ops)
 {
 	struct net *net;
-	int error;
+	int error = 0;
+	LIST_HEAD(all_net_list);
 	LIST_HEAD(net_exit_list);
 
+	grab_all_nets(&all_net_list);
+
 	list_add_tail(&ops->list, list);
 	if (ops->init || (ops->id && ops->size)) {
-		for_each_net(net) {
+		list_for_each_entry(net, &all_net_list, cleanup_list) {
 			error = ops_init(ops, net);
 			if (error)
 				goto out_undo;
 			list_add_tail(&net->exit_list, &net_exit_list);
 		}
 	}
-	return 0;
+
+out_drop_nets:
+	drop_all_nets(&all_net_list);
+	return error;
 
 out_undo:
 	/* If I have an error cleanup all namespaces I initialized */
 	list_del(&ops->list);
 	ops_exit_list(ops, &net_exit_list);
 	ops_free_list(ops, &net_exit_list);
-	return error;
+	goto out_drop_nets;
 }
 
 static void __unregister_pernet_operations(struct pernet_operations *ops)
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-08 11:07     ` Eric W. Biederman
@ 2015-05-08 16:33       ` Cong Wang
  0 siblings, 0 replies; 56+ messages in thread
From: Cong Wang @ 2015-05-08 16:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ying Xue, netdev, Herbert Xu, Pavel Emelyanov, David Miller,
	Eric Dumazet, maxk, Stephen Hemminger, Thomas Graf,
	Nicolas Dichtel, Tom Herbert, James Chapman, Erik Hugne,
	jon.maloy, Simon Horman

On Fri, May 8, 2015 at 4:07 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I agree which is why my implementation of lock_network_namespaces below
> loops until there are no exiting network namespaces.
>

It is easy to have live lock since you keep dropping and acquire netns mutex
in a loop.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-08 14:07   ` Herbert Xu
@ 2015-05-08 17:36     ` Eric W. Biederman
  2015-05-08 20:27       ` Cong Wang
  2015-05-09  1:13       ` Herbert Xu
  0 siblings, 2 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-08 17:36 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ying Xue, netdev, cwang, xemul, davem, eric.dumazet, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

Herbert Xu <herbert@gondor.apana.org.au> writes:

> On Thu, May 07, 2015 at 11:14:13AM -0500, Eric W. Biederman wrote:
>>
>> The following change shows how it is possible to always know that your
>> network namespace has a non-zero reference count in the network
>> namespace initialization methods.  My implementation of
>> lock_network_namespaces is problematic in that it does not sleep
>> while network namespaces are unregistering.  But it is enough to show
>> how the locking and reference counting can be fixed.
>
> If name spaces are created/deleted constantly this could take
> a long time.  How about forcibly cleaning up entries when we
> register a new op?

That certainly seems more sensible than waiting for an indefinite amount
of time, and the code looks reasonably sound.  Placing on things on
lists and then not removing them, and then later stomping the list
pointers gives me flashbacks of other bugs I have fought, but we already
do that in this code, and in this context it appears harmless.

So overall I think I like it.

At the same time while a network namespace remains discoverable we might
be using it in small ways.  So I am wondering if this is the right
approach.

It really is invalid for a network namespace init routine to grab the
reference count of it's network namespace (thus making the network
namespace unfreeable).  So I am wondering if perhaps all we need to do
is find a clean refactoring of the socket code so this case does not
come up at all.

Perhaps just a flag that says this is a kernel socket so don't get/put
the refcount on the network namespace.  

> ---8<---
> Subject: netns: Do not init dead nets when registering new op
>
> Currently when a new op is registered we will initialise all
> current namespaces through that op, even those that are zombies
> (zero ref count).  This is bad because it puts the onus on each
> op implementor to deal with this and they are bound to slip up.
>
> This patch fixes this by only initialising live namespaces.
>
> This however brings about a new problem as when those zombies
> are eventually cleaned up we will call op->exit even on them
> even though they have not been initialised.
>
> This is fixed by bringing forward parts of the work of the cleanup
> thread and performing them during registration.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index f733656..4931100 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -133,6 +133,7 @@ struct net {
>  #endif
>  	struct sock		*diag_nlsk;
>  	atomic_t		fnhe_genid;
> +	bool			dead;
>  };
>  
>  #include <linux/seq_file_net.h>
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 78fc04a..1f888a7 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -339,41 +339,38 @@ struct net *copy_net_ns(unsigned long flags,
>  	return net;
>  }
>  
> -static DEFINE_SPINLOCK(cleanup_list_lock);
> -static LIST_HEAD(cleanup_list);  /* Must hold cleanup_list_lock to touch */
> -
> -static void cleanup_net(struct work_struct *work)
> +static void unlink_net(struct net *net, struct list_head *net_exit_list)
>  {
> -	const struct pernet_operations *ops;
> -	struct net *net, *tmp;
> -	struct list_head net_kill_list;
> -	LIST_HEAD(net_exit_list);
> +	struct net *tmp;
>  
> -	/* Atomically snapshot the list of namespaces to cleanup */
> -	spin_lock_irq(&cleanup_list_lock);
> -	list_replace_init(&cleanup_list, &net_kill_list);
> -	spin_unlock_irq(&cleanup_list_lock);
> +	if (net->dead)
> +		return;
>  
> -	mutex_lock(&net_mutex);
> +	net->dead = true;
> +
> +	list_add_tail(&net->exit_list, net_exit_list);
>  
>  	/* Don't let anyone else find us. */
>  	rtnl_lock();
> -	list_for_each_entry(net, &net_kill_list, cleanup_list) {
> -		list_del_rcu(&net->list);
> -		list_add_tail(&net->exit_list, &net_exit_list);
> -		for_each_net(tmp) {
> -			int id = __peernet2id(tmp, net, false);
> +	list_del_rcu(&net->list);
>  
> -			if (id >= 0) {
> -				rtnl_net_notifyid(tmp, net, RTM_DELNSID, id);
> -				idr_remove(&tmp->netns_ids, id);
> -			}
> -		}
> -		idr_destroy(&net->netns_ids);
> +	for_each_net(tmp) {
> +		int id = __peernet2id(tmp, net, false);
>  
> +		if (id >= 0) {
> +			rtnl_net_notifyid(tmp, net, RTM_DELNSID, id);
> +			idr_remove(&tmp->netns_ids, id);
> +		}
>  	}
>  	rtnl_unlock();
>  
> +	idr_destroy(&net->netns_ids);
> +}
> +
> +static void exit_all_ops(struct list_head *net_exit_list)
> +{
> +	const struct pernet_operations *ops;
> +
>  	/*
>  	 * Another CPU might be rcu-iterating the list, wait for it.
>  	 * This needs to be before calling the exit() notifiers, so
> @@ -383,11 +380,33 @@ static void cleanup_net(struct work_struct *work)
>  
>  	/* Run all of the network namespace exit methods */
>  	list_for_each_entry_reverse(ops, &pernet_list, list)
> -		ops_exit_list(ops, &net_exit_list);
> +		ops_exit_list(ops, net_exit_list);
>  
>  	/* Free the net generic variables */
>  	list_for_each_entry_reverse(ops, &pernet_list, list)
> -		ops_free_list(ops, &net_exit_list);
> +		ops_free_list(ops, net_exit_list);
> +}
> +
> +static DEFINE_SPINLOCK(cleanup_list_lock);
> +static LIST_HEAD(cleanup_list);  /* Must hold cleanup_list_lock to touch */
> +
> +static void cleanup_net(struct work_struct *work)
> +{
> +	struct list_head net_kill_list;
> +	struct net *net, *tmp;
> +	LIST_HEAD(net_exit_list);
> +
> +	/* Atomically snapshot the list of namespaces to cleanup */
> +	spin_lock_irq(&cleanup_list_lock);
> +	list_replace_init(&cleanup_list, &net_kill_list);
> +	spin_unlock_irq(&cleanup_list_lock);
> +
> +	mutex_lock(&net_mutex);
> +
> +	list_for_each_entry(net, &net_kill_list, cleanup_list)
> +		unlink_net(net, &net_exit_list);
> +
> +	exit_all_ops(&net_exit_list);
>  
>  	mutex_unlock(&net_mutex);
>  
> @@ -397,8 +416,7 @@ static void cleanup_net(struct work_struct *work)
>  	rcu_barrier();
>  
>  	/* Finally it is safe to free my network namespace structure */
> -	list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {
> -		list_del_init(&net->exit_list);
> +	list_for_each_entry_safe(net, tmp, &net_kill_list, cleanup_list) {
>  		put_user_ns(net->user_ns);
>  		net_drop_ns(net);
>  	}
> @@ -727,30 +745,59 @@ static int __init net_ns_init(void)
>  pure_initcall(net_ns_init);
>  
>  #ifdef CONFIG_NET_NS
> +static void grab_all_nets(struct list_head *all_net_list)
> +{
> +	struct net *net;
> +	LIST_HEAD(net_exit_list);
> +
> +	for_each_net(net) {
> +		if (maybe_get_net(net)) 
> +			list_add_tail(&net->cleanup_list, all_net_list);
> +		else
> +			unlink_net(net, &net_exit_list);
> +	}
> +
> +	exit_all_ops(&net_exit_list);
> +}
> +
> +static void drop_all_nets(struct list_head *all_net_list)
> +{
> +	struct net *net, *next;
> +
> +	list_for_each_entry_safe(net, next, all_net_list, cleanup_list)
> +		put_net(net);
> +}
> +
>  static int __register_pernet_operations(struct list_head *list,
>  					struct pernet_operations *ops)
>  {
>  	struct net *net;
> -	int error;
> +	int error = 0;
> +	LIST_HEAD(all_net_list);
>  	LIST_HEAD(net_exit_list);
>  
> +	grab_all_nets(&all_net_list);
> +
>  	list_add_tail(&ops->list, list);
>  	if (ops->init || (ops->id && ops->size)) {
> -		for_each_net(net) {
> +		list_for_each_entry(net, &all_net_list, cleanup_list) {
>  			error = ops_init(ops, net);
>  			if (error)
>  				goto out_undo;
>  			list_add_tail(&net->exit_list, &net_exit_list);
>  		}
>  	}
> -	return 0;
> +
> +out_drop_nets:
> +	drop_all_nets(&all_net_list);
> +	return error;
>  
>  out_undo:
>  	/* If I have an error cleanup all namespaces I initialized */
>  	list_del(&ops->list);
>  	ops_exit_list(ops, &net_exit_list);
>  	ops_free_list(ops, &net_exit_list);
> -	return error;
> +	goto out_drop_nets;
>  }
>  
>  static void __unregister_pernet_operations(struct pernet_operations *ops)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-08 17:36     ` Eric W. Biederman
@ 2015-05-08 20:27       ` Cong Wang
  2015-05-08 21:13         ` Cong Wang
  2015-05-09  1:13       ` Herbert Xu
  1 sibling, 1 reply; 56+ messages in thread
From: Cong Wang @ 2015-05-08 20:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Herbert Xu, Ying Xue, netdev, Pavel Emelyanov, David Miller,
	Eric Dumazet, maxk, Stephen Hemminger, Thomas Graf,
	Nicolas Dichtel, Tom Herbert, James Chapman, Erik Hugne,
	jon.maloy, Simon Horman

On Fri, May 8, 2015 at 10:36 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> It really is invalid for a network namespace init routine to grab the
> reference count of it's network namespace (thus making the network
> namespace unfreeable).  So I am wondering if perhaps all we need to do
> is find a clean refactoring of the socket code so this case does not
> come up at all.


Good point!

I _guess_ the reason is these kernel sockets have to exist longer than
netns' life-time, it could be due to on-flying skb's?

On the other hand, we do create some fb_tunnel netdevice in netns init
too, but we don't take a refcnt there, probably because we wait
for netdevice refcnt goes to zero when unregistering.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-08 20:27       ` Cong Wang
@ 2015-05-08 21:13         ` Cong Wang
  2015-05-08 22:08           ` Eric W. Biederman
  0 siblings, 1 reply; 56+ messages in thread
From: Cong Wang @ 2015-05-08 21:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Herbert Xu, Ying Xue, netdev, Pavel Emelyanov, David Miller,
	Eric Dumazet, maxk, Stephen Hemminger, Thomas Graf,
	Nicolas Dichtel, Tom Herbert, James Chapman, Erik Hugne,
	jon.maloy, Simon Horman

On Fri, May 8, 2015 at 1:27 PM, Cong Wang <cwang@twopensource.com> wrote:
> On Fri, May 8, 2015 at 10:36 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> It really is invalid for a network namespace init routine to grab the
>> reference count of it's network namespace (thus making the network
>> namespace unfreeable).  So I am wondering if perhaps all we need to do
>> is find a clean refactoring of the socket code so this case does not
>> come up at all.
>
>
> Good point!
>
> I _guess_ the reason is these kernel sockets have to exist longer than
> netns' life-time, it could be due to on-flying skb's?
>
> On the other hand, we do create some fb_tunnel netdevice in netns init
> too, but we don't take a refcnt there, probably because we wait
> for netdevice refcnt goes to zero when unregistering.

Answer myself, it looks like we don't need to hold the refcnt at all,
since we create the socket at init and release it at uninit, so they
should have the same life-time?

The reason why user-space sockets need this refcnt is they
could have longer life-time than the netns?

This seems the right direction to solve this problem for me,
I am going to try this way to see how far I can go. ;)

Thanks.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-08 21:13         ` Cong Wang
@ 2015-05-08 22:08           ` Eric W. Biederman
  0 siblings, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-08 22:08 UTC (permalink / raw)
  To: Cong Wang
  Cc: Herbert Xu, Ying Xue, netdev, Pavel Emelyanov, David Miller,
	Eric Dumazet, maxk, Stephen Hemminger, Thomas Graf,
	Nicolas Dichtel, Tom Herbert, James Chapman, Erik Hugne,
	jon.maloy, Simon Horman

Cong Wang <cwang@twopensource.com> writes:

> On Fri, May 8, 2015 at 1:27 PM, Cong Wang <cwang@twopensource.com> wrote:
>> On Fri, May 8, 2015 at 10:36 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>>
>>> It really is invalid for a network namespace init routine to grab the
>>> reference count of it's network namespace (thus making the network
>>> namespace unfreeable).  So I am wondering if perhaps all we need to do
>>> is find a clean refactoring of the socket code so this case does not
>>> come up at all.
>>
>>
>> Good point!
>>
>> I _guess_ the reason is these kernel sockets have to exist longer than
>> netns' life-time, it could be due to on-flying skb's?
>>
>> On the other hand, we do create some fb_tunnel netdevice in netns init
>> too, but we don't take a refcnt there, probably because we wait
>> for netdevice refcnt goes to zero when unregistering.
>
> Answer myself, it looks like we don't need to hold the refcnt at all,
> since we create the socket at init and release it at uninit, so they
> should have the same life-time?
>
> The reason why user-space sockets need this refcnt is they
> could have longer life-time than the netns?

User space sockets by design keep the network namespace alive.

> This seems the right direction to solve this problem for me,
> I am going to try this way to see how far I can go. ;)

I will be interested to see what you come up with.

Eric

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-08 17:36     ` Eric W. Biederman
  2015-05-08 20:27       ` Cong Wang
@ 2015-05-09  1:13       ` Herbert Xu
  2015-05-09  1:53         ` Eric W. Biederman
  2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
  1 sibling, 2 replies; 56+ messages in thread
From: Herbert Xu @ 2015-05-09  1:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ying Xue, netdev, cwang, xemul, davem, eric.dumazet, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

On Fri, May 08, 2015 at 12:36:56PM -0500, Eric W. Biederman wrote:
>
> At the same time while a network namespace remains discoverable we might
> be using it in small ways.  So I am wondering if this is the right
> approach.
> 
> It really is invalid for a network namespace init routine to grab the
> reference count of it's network namespace (thus making the network
> namespace unfreeable).  So I am wondering if perhaps all we need to do
> is find a clean refactoring of the socket code so this case does not
> come up at all.
> 
> Perhaps just a flag that says this is a kernel socket so don't get/put
> the refcount on the network namespace.  

No this line of thinking is what led us into the hole in the
first place.

Really it's perfectly valid for the init function to want to
take a reference on net.  For all we know it might be a temporary
reference taken by a third party.  It doesn't even have to be a
socket.

We must hide this subtlety from ops implementors since they have
no knowledge of our implementation.  Expecting them to deal with
this is going to result in bugs, and we have already had multiple
bugs in this area.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets
  2015-05-09  1:13       ` Herbert Xu
@ 2015-05-09  1:53         ` Eric W. Biederman
  2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
  1 sibling, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-09  1:53 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ying Xue, netdev, cwang, xemul, davem, eric.dumazet, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

Herbert Xu <herbert@gondor.apana.org.au> writes:

> On Fri, May 08, 2015 at 12:36:56PM -0500, Eric W. Biederman wrote:
>>
>> At the same time while a network namespace remains discoverable we might
>> be using it in small ways.  So I am wondering if this is the right
>> approach.
>> 
>> It really is invalid for a network namespace init routine to grab the
>> reference count of it's network namespace (thus making the network
>> namespace unfreeable).  So I am wondering if perhaps all we need to do
>> is find a clean refactoring of the socket code so this case does not
>> come up at all.
>> 
>> Perhaps just a flag that says this is a kernel socket so don't get/put
>> the refcount on the network namespace.  
>
> No this line of thinking is what led us into the hole in the
> first place.
>
> Really it's perfectly valid for the init function to want to
> take a reference on net.  For all we know it might be a temporary
> reference taken by a third party.  It doesn't even have to be a
> socket.
>
> We must hide this subtlety from ops implementors since they have
> no knowledge of our implementation.  Expecting them to deal with
> this is going to result in bugs, and we have already had multiple
> bugs in this area.

I won't fundamentally argue.  However the worst hack and subtlety come
from the way we deal with kernel sockets and that is much easier to
clean up so we might as well get that first.

Eric

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 0/6] Cleanup the kernel sockets.
  2015-05-09  1:13       ` Herbert Xu
  2015-05-09  1:53         ` Eric W. Biederman
@ 2015-05-09  2:05         ` Eric W. Biederman
  2015-05-09  2:07           ` [PATCH 1/6] tun: Utilize the normal socket network namespace refcounting Eric W. Biederman
                             ` (7 more replies)
  1 sibling, 8 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-09  2:05 UTC (permalink / raw)
  To: davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, Herbert Xu


Right now the situtation for allocating kernel sockets is a mess.
- sock_create_kern does not take a namespace parameter.
- kernel sockets must not reference count a network namespace and keep
  it alive or else we will have a reference counting loop.
- The way we avoid the reference counting loop with sk_change_net
  and sk_release_kernel are major hacks.

This patchset addresses this mess by fixing sock_create_kern to do
everything necessary to create a kernel socket.  None of the current
users of kernel sockets need the network namespace reference counted.
Either kernel sockets are network namespace aware (and using the current
hacks) or kernel sockets are limited to the initial network namespace
in which case it does not matter.

This patchset starts by addressing tun which should be using normal
userspace sockets like macvtap.

Then sock_create_kern is fixed to take a network namespace.
Then the in kernel status of sockets are passed through to sk_alloc.
Then sk_alloc is fixed to not reference count the network namespace
     of kernel sockets.
Then the callers of sock_create_kern are fixed up to stop using hacks.
Then netlink which uses it's own flavor of sock_create_kern is fixed.

Finally the hacks that are sk_change_net and sk_release_kernel are removed.

When it is all done the code is easier to follow, easier to use, easier
to maintain and shorter by about 70 lines.

Reported-by: Ying Xue <ying.xue@windriver.com>

Eric W. Biederman (6):
      tun: Utilize the normal socket network namespace refcounting.
      net: Add a struct net parameter to sock_create_kern
      net: Pass kern from net_proto_family.create to sk_alloc
      net: Modify sk_alloc to not reference count the netns of kernel sockets.
      netlink: Create kernel netlink sockets in the proper network namespace
      net: kill sk_change_net and sk_release_kernel

 crypto/af_alg.c                    |  4 ++--
 drivers/block/drbd/drbd_receiver.c |  4 ++--
 drivers/isdn/mISDN/socket.c        | 12 ++++++------
 drivers/net/macvtap.c              |  2 +-
 drivers/net/ppp/pppoe.c            |  4 ++--
 drivers/net/ppp/pppox.c            |  2 +-
 drivers/net/ppp/pptp.c             |  4 ++--
 drivers/net/tun.c                  | 26 +++++---------------------
 fs/afs/rxrpc.c                     |  2 +-
 fs/dlm/lowcomms.c                  | 16 ++++++++--------
 include/linux/if_pppox.h           |  2 +-
 include/linux/net.h                |  3 +--
 include/net/af_vsock.h             |  2 +-
 include/net/inet_common.h          |  2 +-
 include/net/llc_conn.h             |  2 +-
 include/net/sock.h                 | 21 +++------------------
 net/appletalk/ddp.c                |  2 +-
 net/atm/common.c                   |  4 ++--
 net/atm/common.h                   |  2 +-
 net/atm/pvc.c                      |  2 +-
 net/atm/svc.c                      |  2 +-
 net/ax25/af_ax25.c                 |  4 ++--
 net/bluetooth/bnep/sock.c          |  2 +-
 net/bluetooth/cmtp/sock.c          |  2 +-
 net/bluetooth/hci_sock.c           |  2 +-
 net/bluetooth/hidp/sock.c          |  2 +-
 net/bluetooth/l2cap_sock.c         | 10 +++++-----
 net/bluetooth/rfcomm/core.c        |  2 +-
 net/bluetooth/rfcomm/sock.c        |  8 ++++----
 net/bluetooth/sco.c                |  8 ++++----
 net/caif/caif_socket.c             |  2 +-
 net/can/af_can.c                   |  2 +-
 net/ceph/messenger.c               |  4 ++--
 net/core/sock.c                    | 30 ++++++++----------------------
 net/decnet/af_decnet.c             |  8 ++++----
 net/ieee802154/socket.c            |  2 +-
 net/ipv4/af_inet.c                 |  6 ++----
 net/ipv4/udp_tunnel.c              |  8 +++-----
 net/ipv6/af_inet6.c                |  2 +-
 net/ipv6/ip6_udp_tunnel.c          |  6 ++----
 net/ipx/af_ipx.c                   |  2 +-
 net/irda/af_irda.c                 |  2 +-
 net/iucv/af_iucv.c                 | 10 +++++-----
 net/key/af_key.c                   |  2 +-
 net/l2tp/l2tp_core.c               | 15 ++++++---------
 net/l2tp/l2tp_ppp.c                |  4 ++--
 net/llc/af_llc.c                   |  2 +-
 net/llc/llc_conn.c                 |  6 +++---
 net/netfilter/ipvs/ip_vs_sync.c    | 30 +++++++++---------------------
 net/netlink/af_netlink.c           | 21 +++++++++------------
 net/netrom/af_netrom.c             |  4 ++--
 net/nfc/af_nfc.c                   |  2 +-
 net/nfc/llcp.h                     |  2 +-
 net/nfc/llcp_core.c                |  2 +-
 net/nfc/llcp_sock.c                |  8 ++++----
 net/nfc/nfc.h                      |  2 +-
 net/nfc/rawsock.c                  |  4 ++--
 net/packet/af_packet.c             |  2 +-
 net/phonet/af_phonet.c             |  2 +-
 net/phonet/pep.c                   |  2 +-
 net/rds/af_rds.c                   |  2 +-
 net/rose/af_rose.c                 |  4 ++--
 net/rxrpc/af_rxrpc.c               |  2 +-
 net/rxrpc/ar-local.c               |  4 ++--
 net/sctp/ipv6.c                    |  2 +-
 net/sctp/protocol.c                |  2 +-
 net/socket.c                       |  7 ++-----
 net/tipc/socket.c                  |  2 +-
 net/unix/af_unix.c                 |  8 ++++----
 net/vmw_vsock/af_vsock.c           |  7 ++++---
 net/vmw_vsock/vmci_transport.c     |  2 +-
 net/x25/af_x25.c                   |  8 ++++----
 72 files changed, 166 insertions(+), 238 deletions(-)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 1/6] tun: Utilize the normal socket network namespace refcounting.
  2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
@ 2015-05-09  2:07           ` Eric W. Biederman
  2015-05-09  2:08           ` [PATCH 2/6] net: Add a struct net parameter to sock_create_kern Eric W. Biederman
                             ` (6 subsequent siblings)
  7 siblings, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-09  2:07 UTC (permalink / raw)
  To: davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, Herbert Xu


There is no need for tun to do the weird network namespace refcounting.
The existing network namespace refcounting in tfile has almost exactly
the same lifetime.  So rewrite the code to use the struct sock network
namespace refcounting and remove the unnecessary hand rolled network
namespace refcounting and the unncesary tfile->net.

This change allows the tun code to directly call sock_put bypassing
sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.

Remove the now unncessary tun_release so that if anything tries to use
the sock_release code path the kernel will oops, and let us know about
the bug.

The macvtap code already uses it's internal socket this way.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/net/tun.c   | 24 ++++--------------------
 include/linux/net.h |  1 -
 net/socket.c        |  3 ---
 3 files changed, 4 insertions(+), 24 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e470ae59d405..3262f3e2b8b2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -146,7 +146,6 @@ struct tun_file {
 	struct socket socket;
 	struct socket_wq wq;
 	struct tun_struct __rcu *tun;
-	struct net *net;
 	struct fasync_struct *fasync;
 	/* only used for fasnyc */
 	unsigned int flags;
@@ -493,10 +492,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 			    tun->dev->reg_state == NETREG_REGISTERED)
 				unregister_netdevice(tun->dev);
 		}
-
-		BUG_ON(!test_bit(SOCK_EXTERNALLY_ALLOCATED,
-				 &tfile->socket.flags));
-		sk_release_kernel(&tfile->sk);
+		sock_put(&tfile->sk);
 	}
 }
 
@@ -1492,18 +1488,10 @@ out:
 	return ret;
 }
 
-static int tun_release(struct socket *sock)
-{
-	if (sock->sk)
-		sock_put(sock->sk);
-	return 0;
-}
-
 /* Ops structure to mimic raw sockets with tun */
 static const struct proto_ops tun_socket_ops = {
 	.sendmsg = tun_sendmsg,
 	.recvmsg = tun_recvmsg,
-	.release = tun_release,
 };
 
 static struct proto tun_proto = {
@@ -1865,7 +1853,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	if (cmd == TUNSETIFF && !tun) {
 		ifr.ifr_name[IFNAMSIZ-1] = '\0';
 
-		ret = tun_set_iff(tfile->net, file, &ifr);
+		ret = tun_set_iff(sock_net(&tfile->sk), file, &ifr);
 
 		if (ret)
 			goto unlock;
@@ -2154,16 +2142,16 @@ out:
 
 static int tun_chr_open(struct inode *inode, struct file * file)
 {
+	struct net *net = current->nsproxy->net_ns;
 	struct tun_file *tfile;
 
 	DBG1(KERN_INFO, "tunX: tun_chr_open\n");
 
-	tfile = (struct tun_file *)sk_alloc(&init_net, AF_UNSPEC, GFP_KERNEL,
+	tfile = (struct tun_file *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
 					    &tun_proto);
 	if (!tfile)
 		return -ENOMEM;
 	RCU_INIT_POINTER(tfile->tun, NULL);
-	tfile->net = get_net(current->nsproxy->net_ns);
 	tfile->flags = 0;
 	tfile->ifindex = 0;
 
@@ -2174,13 +2162,11 @@ static int tun_chr_open(struct inode *inode, struct file * file)
 	tfile->socket.ops = &tun_socket_ops;
 
 	sock_init_data(&tfile->socket, &tfile->sk);
-	sk_change_net(&tfile->sk, tfile->net);
 
 	tfile->sk.sk_write_space = tun_sock_write_space;
 	tfile->sk.sk_sndbuf = INT_MAX;
 
 	file->private_data = tfile;
-	set_bit(SOCK_EXTERNALLY_ALLOCATED, &tfile->socket.flags);
 	INIT_LIST_HEAD(&tfile->next);
 
 	sock_set_flag(&tfile->sk, SOCK_ZEROCOPY);
@@ -2191,10 +2177,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
 static int tun_chr_close(struct inode *inode, struct file *file)
 {
 	struct tun_file *tfile = file->private_data;
-	struct net *net = tfile->net;
 
 	tun_detach(tfile, true);
-	put_net(net);
 
 	return 0;
 }
diff --git a/include/linux/net.h b/include/linux/net.h
index 738ea48be889..8a5e81d2bdf7 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -38,7 +38,6 @@ struct net;
 #define SOCK_NOSPACE		2
 #define SOCK_PASSCRED		3
 #define SOCK_PASSSEC		4
-#define SOCK_EXTERNALLY_ALLOCATED 5
 
 #ifndef ARCH_HAS_SOCKET_TYPES
 /**
diff --git a/net/socket.c b/net/socket.c
index 884e32997698..b5f1f43ed8f4 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -576,9 +576,6 @@ void sock_release(struct socket *sock)
 	if (rcu_dereference_protected(sock->wq, 1)->fasync_list)
 		pr_err("%s: fasync list not empty!\n", __func__);
 
-	if (test_bit(SOCK_EXTERNALLY_ALLOCATED, &sock->flags))
-		return;
-
 	this_cpu_sub(sockets_in_use, 1);
 	if (!sock->file) {
 		iput(SOCK_INODE(sock));
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 2/6] net: Add a struct net parameter to sock_create_kern
  2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
  2015-05-09  2:07           ` [PATCH 1/6] tun: Utilize the normal socket network namespace refcounting Eric W. Biederman
@ 2015-05-09  2:08           ` Eric W. Biederman
  2015-05-12  8:24             ` David Laight
  2015-05-09  2:09           ` [PATCH 3/6] net: Pass kern from net_proto_family.create to sk_alloc Eric W. Biederman
                             ` (5 subsequent siblings)
  7 siblings, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-09  2:08 UTC (permalink / raw)
  To: davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, Herbert Xu


This is long overdue, and is part of cleaning up how we allocate kernel
sockets that don't reference count struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/block/drbd/drbd_receiver.c |  4 ++--
 fs/afs/rxrpc.c                     |  2 +-
 fs/dlm/lowcomms.c                  | 16 ++++++++--------
 include/linux/net.h                |  2 +-
 net/bluetooth/rfcomm/core.c        |  2 +-
 net/ceph/messenger.c               |  4 ++--
 net/ipv4/af_inet.c                 |  2 +-
 net/ipv4/udp_tunnel.c              |  2 +-
 net/ipv6/ip6_udp_tunnel.c          |  2 +-
 net/l2tp/l2tp_core.c               |  4 ++--
 net/netfilter/ipvs/ip_vs_sync.c    |  4 ++--
 net/rxrpc/ar-local.c               |  4 ++--
 net/socket.c                       |  4 ++--
 13 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index cee20354ac37..c097909c589c 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -598,7 +598,7 @@ static struct socket *drbd_try_connect(struct drbd_connection *connection)
 	memcpy(&peer_in6, &connection->peer_addr, peer_addr_len);
 
 	what = "sock_create_kern";
-	err = sock_create_kern(((struct sockaddr *)&src_in6)->sa_family,
+	err = sock_create_kern(&init_net, ((struct sockaddr *)&src_in6)->sa_family,
 			       SOCK_STREAM, IPPROTO_TCP, &sock);
 	if (err < 0) {
 		sock = NULL;
@@ -693,7 +693,7 @@ static int prepare_listen_socket(struct drbd_connection *connection, struct acce
 	memcpy(&my_addr, &connection->my_addr, my_addr_len);
 
 	what = "sock_create_kern";
-	err = sock_create_kern(((struct sockaddr *)&my_addr)->sa_family,
+	err = sock_create_kern(&init_net, ((struct sockaddr *)&my_addr)->sa_family,
 			       SOCK_STREAM, IPPROTO_TCP, &s_listen);
 	if (err) {
 		s_listen = NULL;
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 3a57a1b0fb51..b50642870a43 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -85,7 +85,7 @@ int afs_open_socket(void)
 		return -ENOMEM;
 	}
 
-	ret = sock_create_kern(AF_RXRPC, SOCK_DGRAM, PF_INET, &socket);
+	ret = sock_create_kern(&init_net, AF_RXRPC, SOCK_DGRAM, PF_INET, &socket);
 	if (ret < 0) {
 		destroy_workqueue(afs_async_calls);
 		_leave(" = %d [socket]", ret);
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index d08e079ea5d3..754fd6c0b747 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -921,8 +921,8 @@ static int tcp_accept_from_sock(struct connection *con)
 	mutex_unlock(&connections_lock);
 
 	memset(&peeraddr, 0, sizeof(peeraddr));
-	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
-				  IPPROTO_TCP, &newsock);
+	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
+				  SOCK_STREAM, IPPROTO_TCP, &newsock);
 	if (result < 0)
 		return -ENOMEM;
 
@@ -1173,8 +1173,8 @@ static void tcp_connect_to_sock(struct connection *con)
 		goto out;
 
 	/* Create a socket to communicate with */
-	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
-				  IPPROTO_TCP, &sock);
+	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
+				  SOCK_STREAM, IPPROTO_TCP, &sock);
 	if (result < 0)
 		goto out_err;
 
@@ -1258,8 +1258,8 @@ static struct socket *tcp_create_listen_sock(struct connection *con,
 		addr_len = sizeof(struct sockaddr_in6);
 
 	/* Create a socket to communicate with */
-	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
-				  IPPROTO_TCP, &sock);
+	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
+				  SOCK_STREAM, IPPROTO_TCP, &sock);
 	if (result < 0) {
 		log_print("Can't create listening comms socket");
 		goto create_out;
@@ -1365,8 +1365,8 @@ static int sctp_listen_for_all(void)
 
 	log_print("Using SCTP for communications");
 
-	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_SEQPACKET,
-				  IPPROTO_SCTP, &sock);
+	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
+				  SOCK_SEQPACKET, IPPROTO_SCTP, &sock);
 	if (result < 0) {
 		log_print("Can't create comms socket, check SCTP is loaded");
 		goto out;
diff --git a/include/linux/net.h b/include/linux/net.h
index 8a5e81d2bdf7..04aa06852771 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -207,7 +207,7 @@ void sock_unregister(int family);
 int __sock_create(struct net *net, int family, int type, int proto,
 		  struct socket **res, int kern);
 int sock_create(int family, int type, int proto, struct socket **res);
-int sock_create_kern(int family, int type, int proto, struct socket **res);
+int sock_create_kern(struct net *net, int family, int type, int proto, struct socket **res);
 int sock_create_lite(int family, int type, int proto, struct socket **res);
 void sock_release(struct socket *sock);
 int sock_sendmsg(struct socket *sock, struct msghdr *msg);
diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
index 4fea24275b17..29709fbfd1f5 100644
--- a/net/bluetooth/rfcomm/core.c
+++ b/net/bluetooth/rfcomm/core.c
@@ -200,7 +200,7 @@ static int rfcomm_l2sock_create(struct socket **sock)
 
 	BT_DBG("");
 
-	err = sock_create_kern(PF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_L2CAP, sock);
+	err = sock_create_kern(&init_net, PF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_L2CAP, sock);
 	if (!err) {
 		struct sock *sk = (*sock)->sk;
 		sk->sk_data_ready   = rfcomm_l2data_ready;
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 967080a9f043..073262fea6dd 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -480,8 +480,8 @@ static int ceph_tcp_connect(struct ceph_connection *con)
 	int ret;
 
 	BUG_ON(con->sock);
-	ret = sock_create_kern(con->peer_addr.in_addr.ss_family, SOCK_STREAM,
-			       IPPROTO_TCP, &sock);
+	ret = sock_create_kern(&init_net, con->peer_addr.in_addr.ss_family,
+			       SOCK_STREAM, IPPROTO_TCP, &sock);
 	if (ret)
 		return ret;
 	sock->sk->sk_allocation = GFP_NOFS;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8b47a4d79d04..09f4d024dfe5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1430,7 +1430,7 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 			 struct net *net)
 {
 	struct socket *sock;
-	int rc = sock_create_kern(family, type, protocol, &sock);
+	int rc = sock_create_kern(&init_net, family, type, protocol, &sock);
 
 	if (rc == 0) {
 		*sk = sock->sk;
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 6bb98cc193c9..4e2837476967 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -15,7 +15,7 @@ int udp_sock_create4(struct net *net, struct udp_port_cfg *cfg,
 	struct socket *sock = NULL;
 	struct sockaddr_in udp_addr;
 
-	err = sock_create_kern(AF_INET, SOCK_DGRAM, 0, &sock);
+	err = sock_create_kern(&init_net, AF_INET, SOCK_DGRAM, 0, &sock);
 	if (err < 0)
 		goto error;
 
diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index bba8903e871f..478576b61214 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -19,7 +19,7 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
 	int err;
 	struct socket *sock = NULL;
 
-	err = sock_create_kern(AF_INET6, SOCK_DGRAM, 0, &sock);
+	err = sock_create_kern(&init_net, AF_INET6, SOCK_DGRAM, 0, &sock);
 	if (err < 0)
 		goto error;
 
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index a29a504492af..ae513a2fe7f3 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1399,7 +1399,7 @@ static int l2tp_tunnel_sock_create(struct net *net,
 		if (cfg->local_ip6 && cfg->peer_ip6) {
 			struct sockaddr_l2tpip6 ip6_addr = {0};
 
-			err = sock_create_kern(AF_INET6, SOCK_DGRAM,
+			err = sock_create_kern(&init_net, AF_INET6, SOCK_DGRAM,
 					  IPPROTO_L2TP, &sock);
 			if (err < 0)
 				goto out;
@@ -1429,7 +1429,7 @@ static int l2tp_tunnel_sock_create(struct net *net,
 		{
 			struct sockaddr_l2tpip ip_addr = {0};
 
-			err = sock_create_kern(AF_INET, SOCK_DGRAM,
+			err = sock_create_kern(&init_net, AF_INET, SOCK_DGRAM,
 					  IPPROTO_L2TP, &sock);
 			if (err < 0)
 				goto out;
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index 19b9cce6c210..2e9a5b5d1239 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -1458,7 +1458,7 @@ static struct socket *make_send_sock(struct net *net, int id)
 	int result;
 
 	/* First create a socket move it to right name space later */
-	result = sock_create_kern(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
+	result = sock_create_kern(&init_net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
 	if (result < 0) {
 		pr_err("Error during creation of socket; terminating\n");
 		return ERR_PTR(result);
@@ -1518,7 +1518,7 @@ static struct socket *make_receive_sock(struct net *net, int id)
 	int result;
 
 	/* First create a socket */
-	result = sock_create_kern(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
+	result = sock_create_kern(&init_net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
 	if (result < 0) {
 		pr_err("Error during creation of socket; terminating\n");
 		return ERR_PTR(result);
diff --git a/net/rxrpc/ar-local.c b/net/rxrpc/ar-local.c
index ca904ed5400a..78483b4602bf 100644
--- a/net/rxrpc/ar-local.c
+++ b/net/rxrpc/ar-local.c
@@ -73,8 +73,8 @@ static int rxrpc_create_local(struct rxrpc_local *local)
 	_enter("%p{%d}", local, local->srx.transport_type);
 
 	/* create a socket to represent the local endpoint */
-	ret = sock_create_kern(PF_INET, local->srx.transport_type, IPPROTO_UDP,
-			       &local->socket);
+	ret = sock_create_kern(&init_net, PF_INET, local->srx.transport_type,
+			       IPPROTO_UDP, &local->socket);
 	if (ret < 0) {
 		_leave(" = %d [socket]", ret);
 		return ret;
diff --git a/net/socket.c b/net/socket.c
index b5f1f43ed8f4..9963a0b53a64 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1210,9 +1210,9 @@ int sock_create(int family, int type, int protocol, struct socket **res)
 }
 EXPORT_SYMBOL(sock_create);
 
-int sock_create_kern(int family, int type, int protocol, struct socket **res)
+int sock_create_kern(struct net *net, int family, int type, int protocol, struct socket **res)
 {
-	return __sock_create(&init_net, family, type, protocol, res, 1);
+	return __sock_create(net, family, type, protocol, res, 1);
 }
 EXPORT_SYMBOL(sock_create_kern);
 
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 3/6] net: Pass kern from net_proto_family.create to sk_alloc
  2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
  2015-05-09  2:07           ` [PATCH 1/6] tun: Utilize the normal socket network namespace refcounting Eric W. Biederman
  2015-05-09  2:08           ` [PATCH 2/6] net: Add a struct net parameter to sock_create_kern Eric W. Biederman
@ 2015-05-09  2:09           ` Eric W. Biederman
  2015-05-09 16:51             ` Eric Dumazet
  2015-05-09  2:10           ` [PATCH 4/6] net: Modify sk_alloc to not reference count the netns of kernel sockets Eric W. Biederman
                             ` (4 subsequent siblings)
  7 siblings, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-09  2:09 UTC (permalink / raw)
  To: davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, Herbert Xu


In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 crypto/af_alg.c                |  4 ++--
 drivers/isdn/mISDN/socket.c    | 12 ++++++------
 drivers/net/macvtap.c          |  2 +-
 drivers/net/ppp/pppoe.c        |  4 ++--
 drivers/net/ppp/pppox.c        |  2 +-
 drivers/net/ppp/pptp.c         |  4 ++--
 drivers/net/tun.c              |  2 +-
 include/linux/if_pppox.h       |  2 +-
 include/net/af_vsock.h         |  2 +-
 include/net/llc_conn.h         |  2 +-
 include/net/sock.h             |  2 +-
 net/appletalk/ddp.c            |  2 +-
 net/atm/common.c               |  4 ++--
 net/atm/common.h               |  2 +-
 net/atm/pvc.c                  |  2 +-
 net/atm/svc.c                  |  2 +-
 net/ax25/af_ax25.c             |  4 ++--
 net/bluetooth/bnep/sock.c      |  2 +-
 net/bluetooth/cmtp/sock.c      |  2 +-
 net/bluetooth/hci_sock.c       |  2 +-
 net/bluetooth/hidp/sock.c      |  2 +-
 net/bluetooth/l2cap_sock.c     | 10 +++++-----
 net/bluetooth/rfcomm/sock.c    |  8 ++++----
 net/bluetooth/sco.c            |  8 ++++----
 net/caif/caif_socket.c         |  2 +-
 net/can/af_can.c               |  2 +-
 net/core/sock.c                |  3 ++-
 net/decnet/af_decnet.c         |  8 ++++----
 net/ieee802154/socket.c        |  2 +-
 net/ipv4/af_inet.c             |  2 +-
 net/ipv6/af_inet6.c            |  2 +-
 net/ipx/af_ipx.c               |  2 +-
 net/irda/af_irda.c             |  2 +-
 net/iucv/af_iucv.c             | 10 +++++-----
 net/key/af_key.c               |  2 +-
 net/l2tp/l2tp_ppp.c            |  4 ++--
 net/llc/af_llc.c               |  2 +-
 net/llc/llc_conn.c             |  6 +++---
 net/netlink/af_netlink.c       | 11 +++++------
 net/netrom/af_netrom.c         |  4 ++--
 net/nfc/af_nfc.c               |  2 +-
 net/nfc/llcp.h                 |  2 +-
 net/nfc/llcp_core.c            |  2 +-
 net/nfc/llcp_sock.c            |  8 ++++----
 net/nfc/nfc.h                  |  2 +-
 net/nfc/rawsock.c              |  4 ++--
 net/packet/af_packet.c         |  2 +-
 net/phonet/af_phonet.c         |  2 +-
 net/phonet/pep.c               |  2 +-
 net/rds/af_rds.c               |  2 +-
 net/rose/af_rose.c             |  4 ++--
 net/rxrpc/af_rxrpc.c           |  2 +-
 net/sctp/ipv6.c                |  2 +-
 net/sctp/protocol.c            |  2 +-
 net/tipc/socket.c              |  2 +-
 net/unix/af_unix.c             |  8 ++++----
 net/vmw_vsock/af_vsock.c       |  7 ++++---
 net/vmw_vsock/vmci_transport.c |  2 +-
 net/x25/af_x25.c               |  8 ++++----
 59 files changed, 109 insertions(+), 108 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index f22cc56fd1b3..5ad0d5354535 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -244,7 +244,7 @@ int af_alg_accept(struct sock *sk, struct socket *newsock)
 	if (!type)
 		goto unlock;
 
-	sk2 = sk_alloc(sock_net(sk), PF_ALG, GFP_KERNEL, &alg_proto);
+	sk2 = sk_alloc(sock_net(sk), PF_ALG, GFP_KERNEL, &alg_proto, 0);
 	err = -ENOMEM;
 	if (!sk2)
 		goto unlock;
@@ -324,7 +324,7 @@ static int alg_create(struct net *net, struct socket *sock, int protocol,
 		return -EPROTONOSUPPORT;
 
 	err = -ENOMEM;
-	sk = sk_alloc(net, PF_ALG, GFP_KERNEL, &alg_proto);
+	sk = sk_alloc(net, PF_ALG, GFP_KERNEL, &alg_proto, kern);
 	if (!sk)
 		goto out;
 
diff --git a/drivers/isdn/mISDN/socket.c b/drivers/isdn/mISDN/socket.c
index 8dc7290089bb..0d29b5a6356d 100644
--- a/drivers/isdn/mISDN/socket.c
+++ b/drivers/isdn/mISDN/socket.c
@@ -601,14 +601,14 @@ static const struct proto_ops data_sock_ops = {
 };
 
 static int
-data_sock_create(struct net *net, struct socket *sock, int protocol)
+data_sock_create(struct net *net, struct socket *sock, int protocol, int kern)
 {
 	struct sock *sk;
 
 	if (sock->type != SOCK_DGRAM)
 		return -ESOCKTNOSUPPORT;
 
-	sk = sk_alloc(net, PF_ISDN, GFP_KERNEL, &mISDN_proto);
+	sk = sk_alloc(net, PF_ISDN, GFP_KERNEL, &mISDN_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
@@ -756,14 +756,14 @@ static const struct proto_ops base_sock_ops = {
 
 
 static int
-base_sock_create(struct net *net, struct socket *sock, int protocol)
+base_sock_create(struct net *net, struct socket *sock, int protocol, int kern)
 {
 	struct sock *sk;
 
 	if (sock->type != SOCK_RAW)
 		return -ESOCKTNOSUPPORT;
 
-	sk = sk_alloc(net, PF_ISDN, GFP_KERNEL, &mISDN_proto);
+	sk = sk_alloc(net, PF_ISDN, GFP_KERNEL, &mISDN_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
@@ -785,7 +785,7 @@ mISDN_sock_create(struct net *net, struct socket *sock, int proto, int kern)
 
 	switch (proto) {
 	case ISDN_P_BASE:
-		err = base_sock_create(net, sock, proto);
+		err = base_sock_create(net, sock, proto, kern);
 		break;
 	case ISDN_P_TE_S0:
 	case ISDN_P_NT_S0:
@@ -799,7 +799,7 @@ mISDN_sock_create(struct net *net, struct socket *sock, int proto, int kern)
 	case ISDN_P_B_L2DTMF:
 	case ISDN_P_B_L2DSP:
 	case ISDN_P_B_L2DSPHDLC:
-		err = data_sock_create(net, sock, proto);
+		err = data_sock_create(net, sock, proto, kern);
 		break;
 	default:
 		return err;
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 8c350c5d54ad..0398631a3c24 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -476,7 +476,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
 
 	err = -ENOMEM;
 	q = (struct macvtap_queue *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
-					     &macvtap_proto);
+					     &macvtap_proto, 0);
 	if (!q)
 		goto out;
 
diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index aa1dd926623a..f86c5ab334aa 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -546,11 +546,11 @@ static struct proto pppoe_sk_proto __read_mostly = {
  * Initialize a new struct sock.
  *
  **********************************************************************/
-static int pppoe_create(struct net *net, struct socket *sock)
+static int pppoe_create(struct net *net, struct socket *sock, int kern)
 {
 	struct sock *sk;
 
-	sk = sk_alloc(net, PF_PPPOX, GFP_KERNEL, &pppoe_sk_proto);
+	sk = sk_alloc(net, PF_PPPOX, GFP_KERNEL, &pppoe_sk_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
diff --git a/drivers/net/ppp/pppox.c b/drivers/net/ppp/pppox.c
index 2940e9fe351b..0e1b30622477 100644
--- a/drivers/net/ppp/pppox.c
+++ b/drivers/net/ppp/pppox.c
@@ -118,7 +118,7 @@ static int pppox_create(struct net *net, struct socket *sock, int protocol,
 	    !try_module_get(pppox_protos[protocol]->owner))
 		goto out;
 
-	rc = pppox_protos[protocol]->create(net, sock);
+	rc = pppox_protos[protocol]->create(net, sock, kern);
 
 	module_put(pppox_protos[protocol]->owner);
 out:
diff --git a/drivers/net/ppp/pptp.c b/drivers/net/ppp/pptp.c
index e3bfbd4d0136..14839bc0aaf5 100644
--- a/drivers/net/ppp/pptp.c
+++ b/drivers/net/ppp/pptp.c
@@ -561,14 +561,14 @@ static void pptp_sock_destruct(struct sock *sk)
 	skb_queue_purge(&sk->sk_receive_queue);
 }
 
-static int pptp_create(struct net *net, struct socket *sock)
+static int pptp_create(struct net *net, struct socket *sock, int kern)
 {
 	int error = -ENOMEM;
 	struct sock *sk;
 	struct pppox_sock *po;
 	struct pptp_opt *opt;
 
-	sk = sk_alloc(net, PF_PPPOX, GFP_KERNEL, &pptp_sk_proto);
+	sk = sk_alloc(net, PF_PPPOX, GFP_KERNEL, &pptp_sk_proto, kern);
 	if (!sk)
 		goto out;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3262f3e2b8b2..1a1c4f7b3ec5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2148,7 +2148,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
 	DBG1(KERN_INFO, "tunX: tun_chr_open\n");
 
 	tfile = (struct tun_file *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
-					    &tun_proto);
+					    &tun_proto, 0);
 	if (!tfile)
 		return -ENOMEM;
 	RCU_INIT_POINTER(tfile->tun, NULL);
diff --git a/include/linux/if_pppox.h b/include/linux/if_pppox.h
index 66a7d7600f43..b49cf923becc 100644
--- a/include/linux/if_pppox.h
+++ b/include/linux/if_pppox.h
@@ -74,7 +74,7 @@ static inline struct sock *sk_pppox(struct pppox_sock *po)
 struct module;
 
 struct pppox_proto {
-	int		(*create)(struct net *net, struct socket *sock);
+	int		(*create)(struct net *net, struct socket *sock, int kern);
 	int		(*ioctl)(struct socket *sock, unsigned int cmd,
 				 unsigned long arg);
 	struct module	*owner;
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 172632dd9930..db639a4c5ab8 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -74,7 +74,7 @@ void vsock_pending_work(struct work_struct *work);
 struct sock *__vsock_create(struct net *net,
 			    struct socket *sock,
 			    struct sock *parent,
-			    gfp_t priority, unsigned short type);
+			    gfp_t priority, unsigned short type, int kern);
 
 /**** TRANSPORT ****/
 
diff --git a/include/net/llc_conn.h b/include/net/llc_conn.h
index 0134681acc4c..fe994d2e5286 100644
--- a/include/net/llc_conn.h
+++ b/include/net/llc_conn.h
@@ -96,7 +96,7 @@ static __inline__ char llc_backlog_type(struct sk_buff *skb)
 }
 
 struct sock *llc_sk_alloc(struct net *net, int family, gfp_t priority,
-			  struct proto *prot);
+			  struct proto *prot, int kern);
 void llc_sk_free(struct sock *sk);
 
 void llc_sk_reset(struct sock *sk);
diff --git a/include/net/sock.h b/include/net/sock.h
index 3a4898ec8c67..d8dcf91732b0 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1514,7 +1514,7 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
 
 
 struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
-		      struct proto *prot);
+		      struct proto *prot, int kern);
 void sk_free(struct sock *sk);
 void sk_release_kernel(struct sock *sk);
 struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority);
diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
index 3b7ad43c7dad..d5871ac493eb 100644
--- a/net/appletalk/ddp.c
+++ b/net/appletalk/ddp.c
@@ -1030,7 +1030,7 @@ static int atalk_create(struct net *net, struct socket *sock, int protocol,
 	if (sock->type != SOCK_RAW && sock->type != SOCK_DGRAM)
 		goto out;
 	rc = -ENOMEM;
-	sk = sk_alloc(net, PF_APPLETALK, GFP_KERNEL, &ddp_proto);
+	sk = sk_alloc(net, PF_APPLETALK, GFP_KERNEL, &ddp_proto, kern);
 	if (!sk)
 		goto out;
 	rc = 0;
diff --git a/net/atm/common.c b/net/atm/common.c
index ed0466637e13..49a872db7e42 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -141,7 +141,7 @@ static struct proto vcc_proto = {
 	.release_cb = vcc_release_cb,
 };
 
-int vcc_create(struct net *net, struct socket *sock, int protocol, int family)
+int vcc_create(struct net *net, struct socket *sock, int protocol, int family, int kern)
 {
 	struct sock *sk;
 	struct atm_vcc *vcc;
@@ -149,7 +149,7 @@ int vcc_create(struct net *net, struct socket *sock, int protocol, int family)
 	sock->sk = NULL;
 	if (sock->type == SOCK_STREAM)
 		return -EINVAL;
-	sk = sk_alloc(net, family, GFP_KERNEL, &vcc_proto);
+	sk = sk_alloc(net, family, GFP_KERNEL, &vcc_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 	sock_init_data(sock, sk);
diff --git a/net/atm/common.h b/net/atm/common.h
index 4d6f5b2068ac..959436b87182 100644
--- a/net/atm/common.h
+++ b/net/atm/common.h
@@ -10,7 +10,7 @@
 #include <linux/poll.h> /* for poll_table */
 
 
-int vcc_create(struct net *net, struct socket *sock, int protocol, int family);
+int vcc_create(struct net *net, struct socket *sock, int protocol, int family, int kern);
 int vcc_release(struct socket *sock);
 int vcc_connect(struct socket *sock, int itf, short vpi, int vci);
 int vcc_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
diff --git a/net/atm/pvc.c b/net/atm/pvc.c
index ae0324021407..040207ec399f 100644
--- a/net/atm/pvc.c
+++ b/net/atm/pvc.c
@@ -136,7 +136,7 @@ static int pvc_create(struct net *net, struct socket *sock, int protocol,
 		return -EAFNOSUPPORT;
 
 	sock->ops = &pvc_proto_ops;
-	return vcc_create(net, sock, protocol, PF_ATMPVC);
+	return vcc_create(net, sock, protocol, PF_ATMPVC, kern);
 }
 
 static const struct net_proto_family pvc_family_ops = {
diff --git a/net/atm/svc.c b/net/atm/svc.c
index 1ba23f5018e7..3fa0a9ee98d1 100644
--- a/net/atm/svc.c
+++ b/net/atm/svc.c
@@ -660,7 +660,7 @@ static int svc_create(struct net *net, struct socket *sock, int protocol,
 		return -EAFNOSUPPORT;
 
 	sock->ops = &svc_proto_ops;
-	error = vcc_create(net, sock, protocol, AF_ATMSVC);
+	error = vcc_create(net, sock, protocol, AF_ATMSVC, kern);
 	if (error)
 		return error;
 	ATM_SD(sock)->local.sas_family = AF_ATMSVC;
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index 330c1f4a5a0b..4273533d22b1 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -855,7 +855,7 @@ static int ax25_create(struct net *net, struct socket *sock, int protocol,
 		return -ESOCKTNOSUPPORT;
 	}
 
-	sk = sk_alloc(net, PF_AX25, GFP_ATOMIC, &ax25_proto);
+	sk = sk_alloc(net, PF_AX25, GFP_ATOMIC, &ax25_proto, kern);
 	if (sk == NULL)
 		return -ENOMEM;
 
@@ -881,7 +881,7 @@ struct sock *ax25_make_new(struct sock *osk, struct ax25_dev *ax25_dev)
 	struct sock *sk;
 	ax25_cb *ax25, *oax25;
 
-	sk = sk_alloc(sock_net(osk), PF_AX25, GFP_ATOMIC,	osk->sk_prot);
+	sk = sk_alloc(sock_net(osk), PF_AX25, GFP_ATOMIC, osk->sk_prot, 0);
 	if (sk == NULL)
 		return NULL;
 
diff --git a/net/bluetooth/bnep/sock.c b/net/bluetooth/bnep/sock.c
index bde2bdd9e929..b5116fa9835e 100644
--- a/net/bluetooth/bnep/sock.c
+++ b/net/bluetooth/bnep/sock.c
@@ -202,7 +202,7 @@ static int bnep_sock_create(struct net *net, struct socket *sock, int protocol,
 	if (sock->type != SOCK_RAW)
 		return -ESOCKTNOSUPPORT;
 
-	sk = sk_alloc(net, PF_BLUETOOTH, GFP_ATOMIC, &bnep_proto);
+	sk = sk_alloc(net, PF_BLUETOOTH, GFP_ATOMIC, &bnep_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
diff --git a/net/bluetooth/cmtp/sock.c b/net/bluetooth/cmtp/sock.c
index d82787d417bd..ce86a7bae844 100644
--- a/net/bluetooth/cmtp/sock.c
+++ b/net/bluetooth/cmtp/sock.c
@@ -205,7 +205,7 @@ static int cmtp_sock_create(struct net *net, struct socket *sock, int protocol,
 	if (sock->type != SOCK_RAW)
 		return -ESOCKTNOSUPPORT;
 
-	sk = sk_alloc(net, PF_BLUETOOTH, GFP_ATOMIC, &cmtp_proto);
+	sk = sk_alloc(net, PF_BLUETOOTH, GFP_ATOMIC, &cmtp_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 56f9edbf3d05..5b14dcafcd08 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -1377,7 +1377,7 @@ static int hci_sock_create(struct net *net, struct socket *sock, int protocol,
 
 	sock->ops = &hci_sock_ops;
 
-	sk = sk_alloc(net, PF_BLUETOOTH, GFP_ATOMIC, &hci_sk_proto);
+	sk = sk_alloc(net, PF_BLUETOOTH, GFP_ATOMIC, &hci_sk_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
diff --git a/net/bluetooth/hidp/sock.c b/net/bluetooth/hidp/sock.c
index cb3fdde1968a..008ba439bd62 100644
--- a/net/bluetooth/hidp/sock.c
+++ b/net/bluetooth/hidp/sock.c
@@ -235,7 +235,7 @@ static int hidp_sock_create(struct net *net, struct socket *sock, int protocol,
 	if (sock->type != SOCK_RAW)
 		return -ESOCKTNOSUPPORT;
 
-	sk = sk_alloc(net, PF_BLUETOOTH, GFP_ATOMIC, &hidp_proto);
+	sk = sk_alloc(net, PF_BLUETOOTH, GFP_ATOMIC, &hidp_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index a7278f05eafb..244287706f91 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -43,7 +43,7 @@ static struct bt_sock_list l2cap_sk_list = {
 static const struct proto_ops l2cap_sock_ops;
 static void l2cap_sock_init(struct sock *sk, struct sock *parent);
 static struct sock *l2cap_sock_alloc(struct net *net, struct socket *sock,
-				     int proto, gfp_t prio);
+				     int proto, gfp_t prio, int kern);
 
 bool l2cap_is_socket(struct socket *sock)
 {
@@ -1193,7 +1193,7 @@ static struct l2cap_chan *l2cap_sock_new_connection_cb(struct l2cap_chan *chan)
 	}
 
 	sk = l2cap_sock_alloc(sock_net(parent), NULL, BTPROTO_L2CAP,
-			      GFP_ATOMIC);
+			      GFP_ATOMIC, 0);
 	if (!sk) {
 		release_sock(parent);
 		return NULL;
@@ -1523,12 +1523,12 @@ static struct proto l2cap_proto = {
 };
 
 static struct sock *l2cap_sock_alloc(struct net *net, struct socket *sock,
-				     int proto, gfp_t prio)
+				     int proto, gfp_t prio, int kern)
 {
 	struct sock *sk;
 	struct l2cap_chan *chan;
 
-	sk = sk_alloc(net, PF_BLUETOOTH, prio, &l2cap_proto);
+	sk = sk_alloc(net, PF_BLUETOOTH, prio, &l2cap_proto, kern);
 	if (!sk)
 		return NULL;
 
@@ -1574,7 +1574,7 @@ static int l2cap_sock_create(struct net *net, struct socket *sock, int protocol,
 
 	sock->ops = &l2cap_sock_ops;
 
-	sk = l2cap_sock_alloc(net, sock, protocol, GFP_ATOMIC);
+	sk = l2cap_sock_alloc(net, sock, protocol, GFP_ATOMIC, kern);
 	if (!sk)
 		return -ENOMEM;
 
diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
index 825e8fb5114b..b2338e971b33 100644
--- a/net/bluetooth/rfcomm/sock.c
+++ b/net/bluetooth/rfcomm/sock.c
@@ -269,12 +269,12 @@ static struct proto rfcomm_proto = {
 	.obj_size	= sizeof(struct rfcomm_pinfo)
 };
 
-static struct sock *rfcomm_sock_alloc(struct net *net, struct socket *sock, int proto, gfp_t prio)
+static struct sock *rfcomm_sock_alloc(struct net *net, struct socket *sock, int proto, gfp_t prio, int kern)
 {
 	struct rfcomm_dlc *d;
 	struct sock *sk;
 
-	sk = sk_alloc(net, PF_BLUETOOTH, prio, &rfcomm_proto);
+	sk = sk_alloc(net, PF_BLUETOOTH, prio, &rfcomm_proto, kern);
 	if (!sk)
 		return NULL;
 
@@ -324,7 +324,7 @@ static int rfcomm_sock_create(struct net *net, struct socket *sock,
 
 	sock->ops = &rfcomm_sock_ops;
 
-	sk = rfcomm_sock_alloc(net, sock, protocol, GFP_ATOMIC);
+	sk = rfcomm_sock_alloc(net, sock, protocol, GFP_ATOMIC, kern);
 	if (!sk)
 		return -ENOMEM;
 
@@ -969,7 +969,7 @@ int rfcomm_connect_ind(struct rfcomm_session *s, u8 channel, struct rfcomm_dlc *
 		goto done;
 	}
 
-	sk = rfcomm_sock_alloc(sock_net(parent), NULL, BTPROTO_RFCOMM, GFP_ATOMIC);
+	sk = rfcomm_sock_alloc(sock_net(parent), NULL, BTPROTO_RFCOMM, GFP_ATOMIC, 0);
 	if (!sk)
 		goto done;
 
diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c
index 4322c833e748..6b6e59dc54cf 100644
--- a/net/bluetooth/sco.c
+++ b/net/bluetooth/sco.c
@@ -460,11 +460,11 @@ static struct proto sco_proto = {
 	.obj_size	= sizeof(struct sco_pinfo)
 };
 
-static struct sock *sco_sock_alloc(struct net *net, struct socket *sock, int proto, gfp_t prio)
+static struct sock *sco_sock_alloc(struct net *net, struct socket *sock, int proto, gfp_t prio, int kern)
 {
 	struct sock *sk;
 
-	sk = sk_alloc(net, PF_BLUETOOTH, prio, &sco_proto);
+	sk = sk_alloc(net, PF_BLUETOOTH, prio, &sco_proto, kern);
 	if (!sk)
 		return NULL;
 
@@ -501,7 +501,7 @@ static int sco_sock_create(struct net *net, struct socket *sock, int protocol,
 
 	sock->ops = &sco_sock_ops;
 
-	sk = sco_sock_alloc(net, sock, protocol, GFP_ATOMIC);
+	sk = sco_sock_alloc(net, sock, protocol, GFP_ATOMIC, kern);
 	if (!sk)
 		return -ENOMEM;
 
@@ -1026,7 +1026,7 @@ static void sco_conn_ready(struct sco_conn *conn)
 		bh_lock_sock(parent);
 
 		sk = sco_sock_alloc(sock_net(parent), NULL,
-				    BTPROTO_SCO, GFP_ATOMIC);
+				    BTPROTO_SCO, GFP_ATOMIC, 0);
 		if (!sk) {
 			bh_unlock_sock(parent);
 			sco_conn_unlock(conn);
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index 4ec0c803aef1..78a04ebb113c 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -1047,7 +1047,7 @@ static int caif_create(struct net *net, struct socket *sock, int protocol,
 	 * is really not used at all in the net/core or socket.c but the
 	 * initialization makes sure that sock->state is not uninitialized.
 	 */
-	sk = sk_alloc(net, PF_CAIF, GFP_KERNEL, &prot);
+	sk = sk_alloc(net, PF_CAIF, GFP_KERNEL, &prot, kern);
 	if (!sk)
 		return -ENOMEM;
 
diff --git a/net/can/af_can.c b/net/can/af_can.c
index 32d710eaf1fc..d4d404bdfc9a 100644
--- a/net/can/af_can.c
+++ b/net/can/af_can.c
@@ -179,7 +179,7 @@ static int can_create(struct net *net, struct socket *sock, int protocol,
 
 	sock->ops = cp->ops;
 
-	sk = sk_alloc(net, PF_CAN, GFP_KERNEL, cp->prot);
+	sk = sk_alloc(net, PF_CAN, GFP_KERNEL, cp->prot, kern);
 	if (!sk) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/core/sock.c b/net/core/sock.c
index e891bcf325ca..cbc3789b830c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1396,9 +1396,10 @@ EXPORT_SYMBOL_GPL(sock_update_netprioidx);
  *	@family: protocol family
  *	@priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
  *	@prot: struct proto associated with this new sock instance
+ *	@kern: is this to be a kernel socket?
  */
 struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
-		      struct proto *prot)
+		      struct proto *prot, int kern)
 {
 	struct sock *sk;
 
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 754484b3cd0e..675cf94e04f8 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -468,10 +468,10 @@ static struct proto dn_proto = {
 	.obj_size		= sizeof(struct dn_sock),
 };
 
-static struct sock *dn_alloc_sock(struct net *net, struct socket *sock, gfp_t gfp)
+static struct sock *dn_alloc_sock(struct net *net, struct socket *sock, gfp_t gfp, int kern)
 {
 	struct dn_scp *scp;
-	struct sock *sk = sk_alloc(net, PF_DECnet, gfp, &dn_proto);
+	struct sock *sk = sk_alloc(net, PF_DECnet, gfp, &dn_proto, kern);
 
 	if  (!sk)
 		goto out;
@@ -693,7 +693,7 @@ static int dn_create(struct net *net, struct socket *sock, int protocol,
 	}
 
 
-	if ((sk = dn_alloc_sock(net, sock, GFP_KERNEL)) == NULL)
+	if ((sk = dn_alloc_sock(net, sock, GFP_KERNEL, kern)) == NULL)
 		return -ENOBUFS;
 
 	sk->sk_protocol = protocol;
@@ -1096,7 +1096,7 @@ static int dn_accept(struct socket *sock, struct socket *newsock, int flags)
 
 	cb = DN_SKB_CB(skb);
 	sk->sk_ack_backlog--;
-	newsk = dn_alloc_sock(sock_net(sk), newsock, sk->sk_allocation);
+	newsk = dn_alloc_sock(sock_net(sk), newsock, sk->sk_allocation, 0);
 	if (newsk == NULL) {
 		release_sock(sk);
 		kfree_skb(skb);
diff --git a/net/ieee802154/socket.c b/net/ieee802154/socket.c
index b60c65f70346..7aaaf967df58 100644
--- a/net/ieee802154/socket.c
+++ b/net/ieee802154/socket.c
@@ -1014,7 +1014,7 @@ static int ieee802154_create(struct net *net, struct socket *sock,
 	}
 
 	rc = -ENOMEM;
-	sk = sk_alloc(net, PF_IEEE802154, GFP_KERNEL, proto);
+	sk = sk_alloc(net, PF_IEEE802154, GFP_KERNEL, proto, kern);
 	if (!sk)
 		goto out;
 	rc = 0;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 09f4d024dfe5..e2dd9cb99d61 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -317,7 +317,7 @@ lookup_protocol:
 	WARN_ON(!answer_prot->slab);
 
 	err = -ENOBUFS;
-	sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot);
+	sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot, kern);
 	if (!sk)
 		goto out;
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 4632afa57e05..f3866c0b6cfe 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -167,7 +167,7 @@ lookup_protocol:
 	WARN_ON(!answer_prot->slab);
 
 	err = -ENOBUFS;
-	sk = sk_alloc(net, PF_INET6, GFP_KERNEL, answer_prot);
+	sk = sk_alloc(net, PF_INET6, GFP_KERNEL, answer_prot, kern);
 	if (!sk)
 		goto out;
 
diff --git a/net/ipx/af_ipx.c b/net/ipx/af_ipx.c
index 4ea5d7497b5f..48d0dc89b58d 100644
--- a/net/ipx/af_ipx.c
+++ b/net/ipx/af_ipx.c
@@ -1347,7 +1347,7 @@ static int ipx_create(struct net *net, struct socket *sock, int protocol,
 		goto out;
 
 	rc = -ENOMEM;
-	sk = sk_alloc(net, PF_IPX, GFP_KERNEL, &ipx_proto);
+	sk = sk_alloc(net, PF_IPX, GFP_KERNEL, &ipx_proto, kern);
 	if (!sk)
 		goto out;
 
diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c
index ee0ea25c8e7a..fae6822cc367 100644
--- a/net/irda/af_irda.c
+++ b/net/irda/af_irda.c
@@ -1100,7 +1100,7 @@ static int irda_create(struct net *net, struct socket *sock, int protocol,
 	}
 
 	/* Allocate networking socket */
-	sk = sk_alloc(net, PF_IRDA, GFP_KERNEL, &irda_proto);
+	sk = sk_alloc(net, PF_IRDA, GFP_KERNEL, &irda_proto, kern);
 	if (sk == NULL)
 		return -ENOMEM;
 
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 6daa52a18d40..918151c11348 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -535,12 +535,12 @@ static void iucv_sock_init(struct sock *sk, struct sock *parent)
 		sk->sk_type = parent->sk_type;
 }
 
-static struct sock *iucv_sock_alloc(struct socket *sock, int proto, gfp_t prio)
+static struct sock *iucv_sock_alloc(struct socket *sock, int proto, gfp_t prio, int kern)
 {
 	struct sock *sk;
 	struct iucv_sock *iucv;
 
-	sk = sk_alloc(&init_net, PF_IUCV, prio, &iucv_proto);
+	sk = sk_alloc(&init_net, PF_IUCV, prio, &iucv_proto, kern);
 	if (!sk)
 		return NULL;
 	iucv = iucv_sk(sk);
@@ -602,7 +602,7 @@ static int iucv_sock_create(struct net *net, struct socket *sock, int protocol,
 		return -ESOCKTNOSUPPORT;
 	}
 
-	sk = iucv_sock_alloc(sock, protocol, GFP_KERNEL);
+	sk = iucv_sock_alloc(sock, protocol, GFP_KERNEL, kern);
 	if (!sk)
 		return -ENOMEM;
 
@@ -1723,7 +1723,7 @@ static int iucv_callback_connreq(struct iucv_path *path,
 	}
 
 	/* Create the new socket */
-	nsk = iucv_sock_alloc(NULL, sk->sk_type, GFP_ATOMIC);
+	nsk = iucv_sock_alloc(NULL, sk->sk_type, GFP_ATOMIC, 0);
 	if (!nsk) {
 		err = pr_iucv->path_sever(path, user_data);
 		iucv_path_free(path);
@@ -1933,7 +1933,7 @@ static int afiucv_hs_callback_syn(struct sock *sk, struct sk_buff *skb)
 		goto out;
 	}
 
-	nsk = iucv_sock_alloc(NULL, sk->sk_type, GFP_ATOMIC);
+	nsk = iucv_sock_alloc(NULL, sk->sk_type, GFP_ATOMIC, 0);
 	bh_lock_sock(sk);
 	if ((sk->sk_state != IUCV_LISTEN) ||
 	    sk_acceptq_is_full(sk) ||
diff --git a/net/key/af_key.c b/net/key/af_key.c
index f0d52d721b3a..9e834ec475a9 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -149,7 +149,7 @@ static int pfkey_create(struct net *net, struct socket *sock, int protocol,
 		return -EPROTONOSUPPORT;
 
 	err = -ENOMEM;
-	sk = sk_alloc(net, PF_KEY, GFP_KERNEL, &key_proto);
+	sk = sk_alloc(net, PF_KEY, GFP_KERNEL, &key_proto, kern);
 	if (sk == NULL)
 		goto out;
 
diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index e9b0dec56b8e..f56c9f69e9f2 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -542,12 +542,12 @@ static int pppol2tp_backlog_recv(struct sock *sk, struct sk_buff *skb)
 
 /* socket() handler. Initialize a new struct sock.
  */
-static int pppol2tp_create(struct net *net, struct socket *sock)
+static int pppol2tp_create(struct net *net, struct socket *sock, int kern)
 {
 	int error = -ENOMEM;
 	struct sock *sk;
 
-	sk = sk_alloc(net, PF_PPPOX, GFP_KERNEL, &pppol2tp_sk_proto);
+	sk = sk_alloc(net, PF_PPPOX, GFP_KERNEL, &pppol2tp_sk_proto, kern);
 	if (!sk)
 		goto out;
 
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 17a8dff06090..8fd9febaa5ba 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -168,7 +168,7 @@ static int llc_ui_create(struct net *net, struct socket *sock, int protocol,
 
 	if (likely(sock->type == SOCK_DGRAM || sock->type == SOCK_STREAM)) {
 		rc = -ENOMEM;
-		sk = llc_sk_alloc(net, PF_LLC, GFP_KERNEL, &llc_proto);
+		sk = llc_sk_alloc(net, PF_LLC, GFP_KERNEL, &llc_proto, kern);
 		if (sk) {
 			rc = 0;
 			llc_ui_sk_init(sock, sk);
diff --git a/net/llc/llc_conn.c b/net/llc/llc_conn.c
index 81a61fce3afb..3e821daf9dd4 100644
--- a/net/llc/llc_conn.c
+++ b/net/llc/llc_conn.c
@@ -768,7 +768,7 @@ static struct sock *llc_create_incoming_sock(struct sock *sk,
 					     struct llc_addr *daddr)
 {
 	struct sock *newsk = llc_sk_alloc(sock_net(sk), sk->sk_family, GFP_ATOMIC,
-					  sk->sk_prot);
+					  sk->sk_prot, 0);
 	struct llc_sock *newllc, *llc = llc_sk(sk);
 
 	if (!newsk)
@@ -931,9 +931,9 @@ static void llc_sk_init(struct sock *sk)
  *	Allocates a LLC sock and initializes it. Returns the new LLC sock
  *	or %NULL if there's no memory available for one
  */
-struct sock *llc_sk_alloc(struct net *net, int family, gfp_t priority, struct proto *prot)
+struct sock *llc_sk_alloc(struct net *net, int family, gfp_t priority, struct proto *prot, int kern)
 {
-	struct sock *sk = sk_alloc(net, family, priority, prot);
+	struct sock *sk = sk_alloc(net, family, priority, prot, kern);
 
 	if (!sk)
 		goto out;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ec4adbdcb9b4..d0a82b412b24 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1117,14 +1117,15 @@ static struct proto netlink_proto = {
 };
 
 static int __netlink_create(struct net *net, struct socket *sock,
-			    struct mutex *cb_mutex, int protocol)
+			    struct mutex *cb_mutex, int protocol,
+			    int kern)
 {
 	struct sock *sk;
 	struct netlink_sock *nlk;
 
 	sock->ops = &netlink_ops;
 
-	sk = sk_alloc(net, PF_NETLINK, GFP_KERNEL, &netlink_proto);
+	sk = sk_alloc(net, PF_NETLINK, GFP_KERNEL, &netlink_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
@@ -1186,7 +1187,7 @@ static int netlink_create(struct net *net, struct socket *sock, int protocol,
 	if (err < 0)
 		goto out;
 
-	err = __netlink_create(net, sock, cb_mutex, protocol);
+	err = __netlink_create(net, sock, cb_mutex, protocol, kern);
 	if (err < 0)
 		goto out_module;
 
@@ -2472,14 +2473,12 @@ __netlink_kernel_create(struct net *net, int unit, struct module *module,
 
 	if (sock_create_lite(PF_NETLINK, SOCK_DGRAM, unit, &sock))
 		return NULL;
-
 	/*
 	 * We have to just have a reference on the net from sk, but don't
 	 * get_net it. Besides, we cannot get and then put the net here.
 	 * So we create one inside init_net and the move it to net.
 	 */
-
-	if (__netlink_create(&init_net, sock, cb_mutex, unit) < 0)
+	if (__netlink_create(&init_net, sock, cb_mutex, unit, 0) < 0)
 		goto out_sock_release_nosk;
 
 	sk = sock->sk;
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index b987fd56c3c5..ed212ffc1d9d 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -433,7 +433,7 @@ static int nr_create(struct net *net, struct socket *sock, int protocol,
 	if (sock->type != SOCK_SEQPACKET || protocol != 0)
 		return -ESOCKTNOSUPPORT;
 
-	sk = sk_alloc(net, PF_NETROM, GFP_ATOMIC, &nr_proto);
+	sk = sk_alloc(net, PF_NETROM, GFP_ATOMIC, &nr_proto, kern);
 	if (sk  == NULL)
 		return -ENOMEM;
 
@@ -476,7 +476,7 @@ static struct sock *nr_make_new(struct sock *osk)
 	if (osk->sk_type != SOCK_SEQPACKET)
 		return NULL;
 
-	sk = sk_alloc(sock_net(osk), PF_NETROM, GFP_ATOMIC, osk->sk_prot);
+	sk = sk_alloc(sock_net(osk), PF_NETROM, GFP_ATOMIC, osk->sk_prot, 0);
 	if (sk == NULL)
 		return NULL;
 
diff --git a/net/nfc/af_nfc.c b/net/nfc/af_nfc.c
index 2277276f52bc..54e40fa47822 100644
--- a/net/nfc/af_nfc.c
+++ b/net/nfc/af_nfc.c
@@ -40,7 +40,7 @@ static int nfc_sock_create(struct net *net, struct socket *sock, int proto,
 
 	read_lock(&proto_tab_lock);
 	if (proto_tab[proto] &&	try_module_get(proto_tab[proto]->owner)) {
-		rc = proto_tab[proto]->create(net, sock, proto_tab[proto]);
+		rc = proto_tab[proto]->create(net, sock, proto_tab[proto], kern);
 		module_put(proto_tab[proto]->owner);
 	}
 	read_unlock(&proto_tab_lock);
diff --git a/net/nfc/llcp.h b/net/nfc/llcp.h
index de1789e3cc82..1f68724d44d3 100644
--- a/net/nfc/llcp.h
+++ b/net/nfc/llcp.h
@@ -225,7 +225,7 @@ void nfc_llcp_send_to_raw_sock(struct nfc_llcp_local *local,
 			       struct sk_buff *skb, u8 direction);
 
 /* Sock API */
-struct sock *nfc_llcp_sock_alloc(struct socket *sock, int type, gfp_t gfp);
+struct sock *nfc_llcp_sock_alloc(struct socket *sock, int type, gfp_t gfp, int kern);
 void nfc_llcp_sock_free(struct nfc_llcp_sock *sock);
 void nfc_llcp_accept_unlink(struct sock *sk);
 void nfc_llcp_accept_enqueue(struct sock *parent, struct sock *sk);
diff --git a/net/nfc/llcp_core.c b/net/nfc/llcp_core.c
index b18f07ccb504..98876274a1ee 100644
--- a/net/nfc/llcp_core.c
+++ b/net/nfc/llcp_core.c
@@ -934,7 +934,7 @@ static void nfc_llcp_recv_connect(struct nfc_llcp_local *local,
 		sock->ssap = ssap;
 	}
 
-	new_sk = nfc_llcp_sock_alloc(NULL, parent->sk_type, GFP_ATOMIC);
+	new_sk = nfc_llcp_sock_alloc(NULL, parent->sk_type, GFP_ATOMIC, 0);
 	if (new_sk == NULL) {
 		reason = LLCP_DM_REJ;
 		release_sock(&sock->sk);
diff --git a/net/nfc/llcp_sock.c b/net/nfc/llcp_sock.c
index 9578bd6a4f3e..b7de0da46acd 100644
--- a/net/nfc/llcp_sock.c
+++ b/net/nfc/llcp_sock.c
@@ -942,12 +942,12 @@ static void llcp_sock_destruct(struct sock *sk)
 	}
 }
 
-struct sock *nfc_llcp_sock_alloc(struct socket *sock, int type, gfp_t gfp)
+struct sock *nfc_llcp_sock_alloc(struct socket *sock, int type, gfp_t gfp, int kern)
 {
 	struct sock *sk;
 	struct nfc_llcp_sock *llcp_sock;
 
-	sk = sk_alloc(&init_net, PF_NFC, gfp, &llcp_sock_proto);
+	sk = sk_alloc(&init_net, PF_NFC, gfp, &llcp_sock_proto, kern);
 	if (!sk)
 		return NULL;
 
@@ -993,7 +993,7 @@ void nfc_llcp_sock_free(struct nfc_llcp_sock *sock)
 }
 
 static int llcp_sock_create(struct net *net, struct socket *sock,
-			    const struct nfc_protocol *nfc_proto)
+			    const struct nfc_protocol *nfc_proto, int kern)
 {
 	struct sock *sk;
 
@@ -1009,7 +1009,7 @@ static int llcp_sock_create(struct net *net, struct socket *sock,
 	else
 		sock->ops = &llcp_sock_ops;
 
-	sk = nfc_llcp_sock_alloc(sock, sock->type, GFP_ATOMIC);
+	sk = nfc_llcp_sock_alloc(sock, sock->type, GFP_ATOMIC, kern);
 	if (sk == NULL)
 		return -ENOMEM;
 
diff --git a/net/nfc/nfc.h b/net/nfc/nfc.h
index a8ce80b47720..5c93e8412a26 100644
--- a/net/nfc/nfc.h
+++ b/net/nfc/nfc.h
@@ -30,7 +30,7 @@ struct nfc_protocol {
 	struct proto *proto;
 	struct module *owner;
 	int (*create)(struct net *net, struct socket *sock,
-		      const struct nfc_protocol *nfc_proto);
+		      const struct nfc_protocol *nfc_proto, int kern);
 };
 
 struct nfc_rawsock {
diff --git a/net/nfc/rawsock.c b/net/nfc/rawsock.c
index 82b4e8024778..e9a91488fe3d 100644
--- a/net/nfc/rawsock.c
+++ b/net/nfc/rawsock.c
@@ -334,7 +334,7 @@ static void rawsock_destruct(struct sock *sk)
 }
 
 static int rawsock_create(struct net *net, struct socket *sock,
-			  const struct nfc_protocol *nfc_proto)
+			  const struct nfc_protocol *nfc_proto, int kern)
 {
 	struct sock *sk;
 
@@ -348,7 +348,7 @@ static int rawsock_create(struct net *net, struct socket *sock,
 	else
 		sock->ops = &rawsock_ops;
 
-	sk = sk_alloc(net, PF_NFC, GFP_ATOMIC, nfc_proto->proto);
+	sk = sk_alloc(net, PF_NFC, GFP_ATOMIC, nfc_proto->proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 5102c3cc4eec..94713276a1d9 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2832,7 +2832,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	sock->state = SS_UNCONNECTED;
 
 	err = -ENOBUFS;
-	sk = sk_alloc(net, PF_PACKET, GFP_KERNEL, &packet_proto);
+	sk = sk_alloc(net, PF_PACKET, GFP_KERNEL, &packet_proto, kern);
 	if (sk == NULL)
 		goto out;
 
diff --git a/net/phonet/af_phonet.c b/net/phonet/af_phonet.c
index 32ab87d34828..10d42f3220ab 100644
--- a/net/phonet/af_phonet.c
+++ b/net/phonet/af_phonet.c
@@ -97,7 +97,7 @@ static int pn_socket_create(struct net *net, struct socket *sock, int protocol,
 		goto out;
 	}
 
-	sk = sk_alloc(net, PF_PHONET, GFP_KERNEL, pnp->prot);
+	sk = sk_alloc(net, PF_PHONET, GFP_KERNEL, pnp->prot, kern);
 	if (sk == NULL) {
 		err = -ENOMEM;
 		goto out;
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index 6de2aeb98a1f..850a86cde0b3 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -845,7 +845,7 @@ static struct sock *pep_sock_accept(struct sock *sk, int flags, int *errp)
 	}
 
 	/* Create a new to-be-accepted sock */
-	newsk = sk_alloc(sock_net(sk), PF_PHONET, GFP_KERNEL, sk->sk_prot);
+	newsk = sk_alloc(sock_net(sk), PF_PHONET, GFP_KERNEL, sk->sk_prot, 0);
 	if (!newsk) {
 		pep_reject_conn(sk, skb, PN_PIPE_ERR_OVERLOAD, GFP_KERNEL);
 		err = -ENOBUFS;
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 10443377fb9d..3d83641f2861 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -440,7 +440,7 @@ static int rds_create(struct net *net, struct socket *sock, int protocol,
 	if (sock->type != SOCK_SEQPACKET || protocol)
 		return -ESOCKTNOSUPPORT;
 
-	sk = sk_alloc(net, AF_RDS, GFP_ATOMIC, &rds_proto);
+	sk = sk_alloc(net, AF_RDS, GFP_ATOMIC, &rds_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index 8ae603069a1a..36dbc2da3661 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -520,7 +520,7 @@ static int rose_create(struct net *net, struct socket *sock, int protocol,
 	if (sock->type != SOCK_SEQPACKET || protocol != 0)
 		return -ESOCKTNOSUPPORT;
 
-	sk = sk_alloc(net, PF_ROSE, GFP_ATOMIC, &rose_proto);
+	sk = sk_alloc(net, PF_ROSE, GFP_ATOMIC, &rose_proto, kern);
 	if (sk == NULL)
 		return -ENOMEM;
 
@@ -559,7 +559,7 @@ static struct sock *rose_make_new(struct sock *osk)
 	if (osk->sk_type != SOCK_SEQPACKET)
 		return NULL;
 
-	sk = sk_alloc(sock_net(osk), PF_ROSE, GFP_ATOMIC, &rose_proto);
+	sk = sk_alloc(sock_net(osk), PF_ROSE, GFP_ATOMIC, &rose_proto, 0);
 	if (sk == NULL)
 		return NULL;
 
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 0095b9a0b779..25d60ed15284 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -632,7 +632,7 @@ static int rxrpc_create(struct net *net, struct socket *sock, int protocol,
 	sock->ops = &rxrpc_rpc_ops;
 	sock->state = SS_UNCONNECTED;
 
-	sk = sk_alloc(net, PF_RXRPC, GFP_KERNEL, &rxrpc_proto);
+	sk = sk_alloc(net, PF_RXRPC, GFP_KERNEL, &rxrpc_proto, kern);
 	if (!sk)
 		return -ENOMEM;
 
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 0e4198ee2370..e703ff7fed40 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -635,7 +635,7 @@ static struct sock *sctp_v6_create_accept_sk(struct sock *sk,
 	struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
 	struct sctp6_sock *newsctp6sk;
 
-	newsk = sk_alloc(sock_net(sk), PF_INET6, GFP_KERNEL, sk->sk_prot);
+	newsk = sk_alloc(sock_net(sk), PF_INET6, GFP_KERNEL, sk->sk_prot, 0);
 	if (!newsk)
 		goto out;
 
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 53b7acde9aa3..59e80356672b 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -550,7 +550,7 @@ static struct sock *sctp_v4_create_accept_sk(struct sock *sk,
 					     struct sctp_association *asoc)
 {
 	struct sock *newsk = sk_alloc(sock_net(sk), PF_INET, GFP_KERNEL,
-			sk->sk_prot);
+			sk->sk_prot, 0);
 	struct inet_sock *newinet;
 
 	if (!newsk)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 9074b5cede38..8f3c8e2cef8e 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -342,7 +342,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock,
 	}
 
 	/* Allocate socket's protocol area */
-	sk = sk_alloc(net, AF_TIPC, GFP_KERNEL, &tipc_proto);
+	sk = sk_alloc(net, AF_TIPC, GFP_KERNEL, &tipc_proto, kern);
 	if (sk == NULL)
 		return -ENOMEM;
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5266ea7b922b..941b3d26e3bf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -620,7 +620,7 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;
 
-static struct sock *unix_create1(struct net *net, struct socket *sock)
+static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 {
 	struct sock *sk = NULL;
 	struct unix_sock *u;
@@ -629,7 +629,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock)
 	if (atomic_long_read(&unix_nr_socks) > 2 * get_max_files())
 		goto out;
 
-	sk = sk_alloc(net, PF_UNIX, GFP_KERNEL, &unix_proto);
+	sk = sk_alloc(net, PF_UNIX, GFP_KERNEL, &unix_proto, kern);
 	if (!sk)
 		goto out;
 
@@ -688,7 +688,7 @@ static int unix_create(struct net *net, struct socket *sock, int protocol,
 		return -ESOCKTNOSUPPORT;
 	}
 
-	return unix_create1(net, sock) ? 0 : -ENOMEM;
+	return unix_create1(net, sock, kern) ? 0 : -ENOMEM;
 }
 
 static int unix_release(struct socket *sock)
@@ -1088,7 +1088,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	err = -ENOMEM;
 
 	/* create new sock for complete connection */
-	newsk = unix_create1(sock_net(sk), NULL);
+	newsk = unix_create1(sock_net(sk), NULL, 0);
 	if (newsk == NULL)
 		goto out;
 
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 2ec86e652a19..df5fc6b340f1 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -581,13 +581,14 @@ struct sock *__vsock_create(struct net *net,
 			    struct socket *sock,
 			    struct sock *parent,
 			    gfp_t priority,
-			    unsigned short type)
+			    unsigned short type,
+			    int kern)
 {
 	struct sock *sk;
 	struct vsock_sock *psk;
 	struct vsock_sock *vsk;
 
-	sk = sk_alloc(net, AF_VSOCK, priority, &vsock_proto);
+	sk = sk_alloc(net, AF_VSOCK, priority, &vsock_proto, kern);
 	if (!sk)
 		return NULL;
 
@@ -1866,7 +1867,7 @@ static int vsock_create(struct net *net, struct socket *sock,
 
 	sock->state = SS_UNCONNECTED;
 
-	return __vsock_create(net, sock, NULL, GFP_KERNEL, 0) ? 0 : -ENOMEM;
+	return __vsock_create(net, sock, NULL, GFP_KERNEL, 0, kern) ? 0 : -ENOMEM;
 }
 
 static const struct net_proto_family vsock_family_ops = {
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index c294da095461..1f63daff3965 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -1022,7 +1022,7 @@ static int vmci_transport_recv_listen(struct sock *sk,
 	}
 
 	pending = __vsock_create(sock_net(sk), NULL, sk, GFP_KERNEL,
-				 sk->sk_type);
+				 sk->sk_type, 0);
 	if (!pending) {
 		vmci_transport_send_reset(sk, pkt);
 		return -ENOMEM;
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index c3ab230e4493..a750f330b8dd 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -515,10 +515,10 @@ static struct proto x25_proto = {
 	.obj_size = sizeof(struct x25_sock),
 };
 
-static struct sock *x25_alloc_socket(struct net *net)
+static struct sock *x25_alloc_socket(struct net *net, int kern)
 {
 	struct x25_sock *x25;
-	struct sock *sk = sk_alloc(net, AF_X25, GFP_ATOMIC, &x25_proto);
+	struct sock *sk = sk_alloc(net, AF_X25, GFP_ATOMIC, &x25_proto, kern);
 
 	if (!sk)
 		goto out;
@@ -553,7 +553,7 @@ static int x25_create(struct net *net, struct socket *sock, int protocol,
 		goto out;
 
 	rc = -ENOBUFS;
-	if ((sk = x25_alloc_socket(net)) == NULL)
+	if ((sk = x25_alloc_socket(net, kern)) == NULL)
 		goto out;
 
 	x25 = x25_sk(sk);
@@ -602,7 +602,7 @@ static struct sock *x25_make_new(struct sock *osk)
 	if (osk->sk_type != SOCK_SEQPACKET)
 		goto out;
 
-	if ((sk = x25_alloc_socket(sock_net(osk))) == NULL)
+	if ((sk = x25_alloc_socket(sock_net(osk), 0)) == NULL)
 		goto out;
 
 	x25 = x25_sk(sk);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 4/6] net: Modify sk_alloc to not reference count the netns of kernel sockets.
  2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
                             ` (2 preceding siblings ...)
  2015-05-09  2:09           ` [PATCH 3/6] net: Pass kern from net_proto_family.create to sk_alloc Eric W. Biederman
@ 2015-05-09  2:10           ` Eric W. Biederman
  2015-05-09  2:11           ` [PATCH 5/6] netlink: Create kernel netlink sockets in the proper network namespace Eric W. Biederman
                             ` (3 subsequent siblings)
  7 siblings, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-09  2:10 UTC (permalink / raw)
  To: davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, Herbert Xu


Now that sk_alloc knows when a kernel socket is being allocated modify
it to not reference count the network namespace of kernel sockets.

Keep track of if a socket needs reference counting by adding a flag to
struct sock called sk_net_refcnt.

Update all of the callers of sock_create_kern to stop using
sk_change_net and sk_release_kernel as those hacks are no longer
needed, to avoid reference counting a kernel socket.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/inet_common.h       |  2 +-
 include/net/sock.h              |  2 ++
 net/core/sock.c                 |  8 ++++++--
 net/ipv4/af_inet.c              |  4 +---
 net/ipv4/udp_tunnel.c           |  8 +++-----
 net/ipv6/ip6_udp_tunnel.c       |  6 ++----
 net/l2tp/l2tp_core.c            | 15 ++++++---------
 net/netfilter/ipvs/ip_vs_sync.c | 30 +++++++++---------------------
 8 files changed, 30 insertions(+), 45 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 4a92423eefa5..279f83591971 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -41,7 +41,7 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len,
 
 static inline void inet_ctl_sock_destroy(struct sock *sk)
 {
-	sk_release_kernel(sk);
+	sock_release(sk->sk_socket);
 }
 
 #endif
diff --git a/include/net/sock.h b/include/net/sock.h
index d8dcf91732b0..9e6b2c0b4741 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -184,6 +184,7 @@ struct sock_common {
 	unsigned char		skc_reuse:4;
 	unsigned char		skc_reuseport:1;
 	unsigned char		skc_ipv6only:1;
+	unsigned char		skc_net_refcnt:1;
 	int			skc_bound_dev_if;
 	union {
 		struct hlist_node	skc_bind_node;
@@ -323,6 +324,7 @@ struct sock {
 #define sk_reuse		__sk_common.skc_reuse
 #define sk_reuseport		__sk_common.skc_reuseport
 #define sk_ipv6only		__sk_common.skc_ipv6only
+#define sk_net_refcnt		__sk_common.skc_net_refcnt
 #define sk_bound_dev_if		__sk_common.skc_bound_dev_if
 #define sk_bind_node		__sk_common.skc_bind_node
 #define sk_prot			__sk_common.skc_prot
diff --git a/net/core/sock.c b/net/core/sock.c
index cbc3789b830c..9e8968e9ebeb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1412,7 +1412,10 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		 */
 		sk->sk_prot = sk->sk_prot_creator = prot;
 		sock_lock_init(sk);
-		sock_net_set(sk, get_net(net));
+		sk->sk_net_refcnt = kern ? 0 : 1;
+		if (likely(sk->sk_net_refcnt))
+			get_net(net);
+		sock_net_set(sk, net);
 		atomic_set(&sk->sk_wmem_alloc, 1);
 
 		sock_update_classid(sk);
@@ -1446,7 +1449,8 @@ static void __sk_free(struct sock *sk)
 	if (sk->sk_peer_cred)
 		put_cred(sk->sk_peer_cred);
 	put_pid(sk->sk_peer_pid);
-	put_net(sock_net(sk));
+	if (likely(sk->sk_net_refcnt))
+		put_net(sock_net(sk));
 	sk_prot_free(sk->sk_prot_creator, sk);
 }
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e2dd9cb99d61..235d36afece3 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1430,7 +1430,7 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 			 struct net *net)
 {
 	struct socket *sock;
-	int rc = sock_create_kern(&init_net, family, type, protocol, &sock);
+	int rc = sock_create_kern(net, family, type, protocol, &sock);
 
 	if (rc == 0) {
 		*sk = sock->sk;
@@ -1440,8 +1440,6 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 		 * we do not wish this socket to see incoming packets.
 		 */
 		(*sk)->sk_prot->unhash(*sk);
-
-		sk_change_net(*sk, net);
 	}
 	return rc;
 }
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 4e2837476967..933ea903f7b8 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -15,12 +15,10 @@ int udp_sock_create4(struct net *net, struct udp_port_cfg *cfg,
 	struct socket *sock = NULL;
 	struct sockaddr_in udp_addr;
 
-	err = sock_create_kern(&init_net, AF_INET, SOCK_DGRAM, 0, &sock);
+	err = sock_create_kern(net, AF_INET, SOCK_DGRAM, 0, &sock);
 	if (err < 0)
 		goto error;
 
-	sk_change_net(sock->sk, net);
-
 	udp_addr.sin_family = AF_INET;
 	udp_addr.sin_addr = cfg->local_ip;
 	udp_addr.sin_port = cfg->local_udp_port;
@@ -47,7 +45,7 @@ int udp_sock_create4(struct net *net, struct udp_port_cfg *cfg,
 error:
 	if (sock) {
 		kernel_sock_shutdown(sock, SHUT_RDWR);
-		sk_release_kernel(sock->sk);
+		sock_release(sock);
 	}
 	*sockp = NULL;
 	return err;
@@ -101,7 +99,7 @@ void udp_tunnel_sock_release(struct socket *sock)
 {
 	rcu_assign_sk_user_data(sock->sk, NULL);
 	kernel_sock_shutdown(sock, SHUT_RDWR);
-	sk_release_kernel(sock->sk);
+	sock_release(sock);
 }
 EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
 
diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index 478576b61214..e1a1136bda7c 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -19,12 +19,10 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
 	int err;
 	struct socket *sock = NULL;
 
-	err = sock_create_kern(&init_net, AF_INET6, SOCK_DGRAM, 0, &sock);
+	err = sock_create_kern(net, AF_INET6, SOCK_DGRAM, 0, &sock);
 	if (err < 0)
 		goto error;
 
-	sk_change_net(sock->sk, net);
-
 	udp6_addr.sin6_family = AF_INET6;
 	memcpy(&udp6_addr.sin6_addr, &cfg->local_ip6,
 	       sizeof(udp6_addr.sin6_addr));
@@ -55,7 +53,7 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
 error:
 	if (sock) {
 		kernel_sock_shutdown(sock, SHUT_RDWR);
-		sk_release_kernel(sock->sk);
+		sock_release(sock);
 	}
 	*sockp = NULL;
 	return err;
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index ae513a2fe7f3..f6b090df3930 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1334,9 +1334,10 @@ static void l2tp_tunnel_del_work(struct work_struct *work)
 		if (sock)
 			inet_shutdown(sock, 2);
 	} else {
-		if (sock)
+		if (sock) {
 			kernel_sock_shutdown(sock, SHUT_RDWR);
-		sk_release_kernel(sk);
+			sock_release(sock);
+		}
 	}
 
 	l2tp_tunnel_sock_put(sk);
@@ -1399,13 +1400,11 @@ static int l2tp_tunnel_sock_create(struct net *net,
 		if (cfg->local_ip6 && cfg->peer_ip6) {
 			struct sockaddr_l2tpip6 ip6_addr = {0};
 
-			err = sock_create_kern(&init_net, AF_INET6, SOCK_DGRAM,
+			err = sock_create_kern(net, AF_INET6, SOCK_DGRAM,
 					  IPPROTO_L2TP, &sock);
 			if (err < 0)
 				goto out;
 
-			sk_change_net(sock->sk, net);
-
 			ip6_addr.l2tp_family = AF_INET6;
 			memcpy(&ip6_addr.l2tp_addr, cfg->local_ip6,
 			       sizeof(ip6_addr.l2tp_addr));
@@ -1429,13 +1428,11 @@ static int l2tp_tunnel_sock_create(struct net *net,
 		{
 			struct sockaddr_l2tpip ip_addr = {0};
 
-			err = sock_create_kern(&init_net, AF_INET, SOCK_DGRAM,
+			err = sock_create_kern(net, AF_INET, SOCK_DGRAM,
 					  IPPROTO_L2TP, &sock);
 			if (err < 0)
 				goto out;
 
-			sk_change_net(sock->sk, net);
-
 			ip_addr.l2tp_family = AF_INET;
 			ip_addr.l2tp_addr = cfg->local_ip;
 			ip_addr.l2tp_conn_id = tunnel_id;
@@ -1462,7 +1459,7 @@ out:
 	*sockp = sock;
 	if ((err < 0) && sock) {
 		kernel_sock_shutdown(sock, SHUT_RDWR);
-		sk_release_kernel(sock->sk);
+		sock_release(sock);
 		*sockp = NULL;
 	}
 
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index 2e9a5b5d1239..b08ba9538d12 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -1457,18 +1457,12 @@ static struct socket *make_send_sock(struct net *net, int id)
 	struct socket *sock;
 	int result;
 
-	/* First create a socket move it to right name space later */
-	result = sock_create_kern(&init_net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
+	/* First create a socket */
+	result = sock_create_kern(net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
 	if (result < 0) {
 		pr_err("Error during creation of socket; terminating\n");
 		return ERR_PTR(result);
 	}
-	/*
-	 * Kernel sockets that are a part of a namespace, should not
-	 * hold a reference to a namespace in order to allow to stop it.
-	 * After sk_change_net should be released using sk_release_kernel.
-	 */
-	sk_change_net(sock->sk, net);
 	result = set_mcast_if(sock->sk, ipvs->master_mcast_ifn);
 	if (result < 0) {
 		pr_err("Error setting outbound mcast interface\n");
@@ -1497,7 +1491,7 @@ static struct socket *make_send_sock(struct net *net, int id)
 	return sock;
 
 error:
-	sk_release_kernel(sock->sk);
+	sock_release(sock);
 	return ERR_PTR(result);
 }
 
@@ -1518,17 +1512,11 @@ static struct socket *make_receive_sock(struct net *net, int id)
 	int result;
 
 	/* First create a socket */
-	result = sock_create_kern(&init_net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
+	result = sock_create_kern(net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
 	if (result < 0) {
 		pr_err("Error during creation of socket; terminating\n");
 		return ERR_PTR(result);
 	}
-	/*
-	 * Kernel sockets that are a part of a namespace, should not
-	 * hold a reference to a namespace in order to allow to stop it.
-	 * After sk_change_net should be released using sk_release_kernel.
-	 */
-	sk_change_net(sock->sk, net);
 	/* it is equivalent to the REUSEADDR option in user-space */
 	sock->sk->sk_reuse = SK_CAN_REUSE;
 	result = sysctl_sync_sock_size(ipvs);
@@ -1554,7 +1542,7 @@ static struct socket *make_receive_sock(struct net *net, int id)
 	return sock;
 
 error:
-	sk_release_kernel(sock->sk);
+	sock_release(sock);
 	return ERR_PTR(result);
 }
 
@@ -1692,7 +1680,7 @@ done:
 		ip_vs_sync_buff_release(sb);
 
 	/* release the sending multicast socket */
-	sk_release_kernel(tinfo->sock->sk);
+	sock_release(tinfo->sock);
 	kfree(tinfo);
 
 	return 0;
@@ -1729,7 +1717,7 @@ static int sync_thread_backup(void *data)
 	}
 
 	/* release the sending multicast socket */
-	sk_release_kernel(tinfo->sock->sk);
+	sock_release(tinfo->sock);
 	kfree(tinfo->buf);
 	kfree(tinfo);
 
@@ -1854,11 +1842,11 @@ int start_sync_thread(struct net *net, int state, char *mcast_ifn, __u8 syncid)
 	return 0;
 
 outsocket:
-	sk_release_kernel(sock->sk);
+	sock_release(sock);
 
 outtinfo:
 	if (tinfo) {
-		sk_release_kernel(tinfo->sock->sk);
+		sock_release(tinfo->sock);
 		kfree(tinfo->buf);
 		kfree(tinfo);
 	}
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 5/6] netlink: Create kernel netlink sockets in the proper network namespace
  2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
                             ` (3 preceding siblings ...)
  2015-05-09  2:10           ` [PATCH 4/6] net: Modify sk_alloc to not reference count the netns of kernel sockets Eric W. Biederman
@ 2015-05-09  2:11           ` Eric W. Biederman
  2015-05-09  2:12           ` [PATCH 6/6] net: kill sk_change_net and sk_release_kernel Eric W. Biederman
                             ` (2 subsequent siblings)
  7 siblings, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-09  2:11 UTC (permalink / raw)
  To: davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, Herbert Xu


Utilize the new functionality of sk_alloc so that nothing needs to be
done to suprress the reference counting on kernel sockets.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/netlink/af_netlink.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index d0a82b412b24..5e3883bad65a 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2473,16 +2473,11 @@ __netlink_kernel_create(struct net *net, int unit, struct module *module,
 
 	if (sock_create_lite(PF_NETLINK, SOCK_DGRAM, unit, &sock))
 		return NULL;
-	/*
-	 * We have to just have a reference on the net from sk, but don't
-	 * get_net it. Besides, we cannot get and then put the net here.
-	 * So we create one inside init_net and the move it to net.
-	 */
-	if (__netlink_create(&init_net, sock, cb_mutex, unit, 0) < 0)
+
+	if (__netlink_create(net, sock, cb_mutex, unit, 1) < 0)
 		goto out_sock_release_nosk;
 
 	sk = sock->sk;
-	sk_change_net(sk, net);
 
 	if (!cfg || cfg->groups < 32)
 		groups = 32;
@@ -2538,7 +2533,10 @@ EXPORT_SYMBOL(__netlink_kernel_create);
 void
 netlink_kernel_release(struct sock *sk)
 {
-	sk_release_kernel(sk);
+	if (sk == NULL || sk->sk_socket == NULL)
+		return;
+
+	sock_release(sk->sk_socket);
 }
 EXPORT_SYMBOL(netlink_kernel_release);
 
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 6/6] net: kill sk_change_net and sk_release_kernel
  2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
                             ` (4 preceding siblings ...)
  2015-05-09  2:11           ` [PATCH 5/6] netlink: Create kernel netlink sockets in the proper network namespace Eric W. Biederman
@ 2015-05-09  2:12           ` Eric W. Biederman
  2015-05-09  2:38           ` [PATCH 0/6] Cleanup the kernel sockets Herbert Xu
  2015-05-11 14:53           ` David Miller
  7 siblings, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-09  2:12 UTC (permalink / raw)
  To: davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, Herbert Xu


These functions are no longer needed and no longer used kill them.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/sock.h | 17 -----------------
 net/core/sock.c    | 19 -------------------
 2 files changed, 36 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 9e6b2c0b4741..d882f4c8e438 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1518,7 +1518,6 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
 struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		      struct proto *prot, int kern);
 void sk_free(struct sock *sk);
-void sk_release_kernel(struct sock *sk);
 struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority);
 
 struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
@@ -2194,22 +2193,6 @@ void sock_net_set(struct sock *sk, struct net *net)
 	write_pnet(&sk->sk_net, net);
 }
 
-/*
- * Kernel sockets, f.e. rtnl or icmp_socket, are a part of a namespace.
- * They should not hold a reference to a namespace in order to allow
- * to stop it.
- * Sockets after sk_change_net should be released using sk_release_kernel
- */
-static inline void sk_change_net(struct sock *sk, struct net *net)
-{
-	struct net *current_net = sock_net(sk);
-
-	if (!net_eq(current_net, net)) {
-		put_net(current_net);
-		sock_net_set(sk, net);
-	}
-}
-
 static inline struct sock *skb_steal_sock(struct sk_buff *skb)
 {
 	if (skb->sk) {
diff --git a/net/core/sock.c b/net/core/sock.c
index 9e8968e9ebeb..c18738a795b0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1466,25 +1466,6 @@ void sk_free(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_free);
 
-/*
- * Last sock_put should drop reference to sk->sk_net. It has already
- * been dropped in sk_change_net. Taking reference to stopping namespace
- * is not an option.
- * Take reference to a socket to remove it from hash _alive_ and after that
- * destroy it in the context of init_net.
- */
-void sk_release_kernel(struct sock *sk)
-{
-	if (sk == NULL || sk->sk_socket == NULL)
-		return;
-
-	sock_hold(sk);
-	sock_net_set(sk, get_net(&init_net));
-	sock_release(sk->sk_socket);
-	sock_put(sk);
-}
-EXPORT_SYMBOL(sk_release_kernel);
-
 static void sk_update_clone(const struct sock *sk, struct sock *newsk)
 {
 	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/6] Cleanup the kernel sockets.
  2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
                             ` (5 preceding siblings ...)
  2015-05-09  2:12           ` [PATCH 6/6] net: kill sk_change_net and sk_release_kernel Eric W. Biederman
@ 2015-05-09  2:38           ` Herbert Xu
  2015-05-11 14:53           ` David Miller
  7 siblings, 0 replies; 56+ messages in thread
From: Herbert Xu @ 2015-05-09  2:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms

On Fri, May 08, 2015 at 09:05:33PM -0500, Eric W. Biederman wrote:
> 
> Right now the situtation for allocating kernel sockets is a mess.
> - sock_create_kern does not take a namespace parameter.
> - kernel sockets must not reference count a network namespace and keep
>   it alive or else we will have a reference counting loop.
> - The way we avoid the reference counting loop with sk_change_net
>   and sk_release_kernel are major hacks.
> 
> This patchset addresses this mess by fixing sock_create_kern to do
> everything necessary to create a kernel socket.  None of the current
> users of kernel sockets need the network namespace reference counted.
> Either kernel sockets are network namespace aware (and using the current
> hacks) or kernel sockets are limited to the initial network namespace
> in which case it does not matter.
> 
> This patchset starts by addressing tun which should be using normal
> userspace sockets like macvtap.
> 
> Then sock_create_kern is fixed to take a network namespace.
> Then the in kernel status of sockets are passed through to sk_alloc.
> Then sk_alloc is fixed to not reference count the network namespace
>      of kernel sockets.
> Then the callers of sock_create_kern are fixed up to stop using hacks.
> Then netlink which uses it's own flavor of sock_create_kern is fixed.
> 
> Finally the hacks that are sk_change_net and sk_release_kernel are removed.
> 
> When it is all done the code is easier to follow, easier to use, easier
> to maintain and shorter by about 70 lines.
> 
> Reported-by: Ying Xue <ying.xue@windriver.com>

Looks good to me.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/6] net: Pass kern from net_proto_family.create to sk_alloc
  2015-05-09  2:09           ` [PATCH 3/6] net: Pass kern from net_proto_family.create to sk_alloc Eric W. Biederman
@ 2015-05-09 16:51             ` Eric Dumazet
  2015-05-09 17:31               ` Eric W. Biederman
  0 siblings, 1 reply; 56+ messages in thread
From: Eric Dumazet @ 2015-05-09 16:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, Ying Xue, netdev, cwang, xemul, maxk, stephen, tgraf,
	nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy, horms,
	Herbert Xu

On Fri, 2015-05-08 at 21:09 -0500, Eric W. Biederman wrote:
> In preparation for changing how struct net is refcounted
> on kernel sockets pass the knowledge that we are creating
> a kernel socket from sock_create_kern through to sk_alloc.

...

>  
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 3262f3e2b8b2..1a1c4f7b3ec5 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2148,7 +2148,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>  	DBG1(KERN_INFO, "tunX: tun_chr_open\n");
>  
>  	tfile = (struct tun_file *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
> -					    &tun_proto);
> +					    &tun_proto, 0);
>  	if (!tfile)
>  		return -ENOMEM;
>  	RCU_INIT_POINTER(tfile->tun, NULL);
...

> index 3a4898ec8c67..d8dcf91732b0 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1514,7 +1514,7 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
>  
> 
>  struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
> -		      struct proto *prot);
> +		      struct proto *prot, int kern);


Problem adding lot of parameters to functions, is that 0 (or false)
means nothing when we read the code later and we have to go back to
sk_alloc() definition to check.

Sure, at the time you write the code everything is clear, but in 2 years
we wont have the context in our heads.

I would use an enumeration to ease code readability maybe.

enum sk_alloc_kern {
	SK_ALLOC_USER,
	SK_ALLOC_KERN,
};

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/6] net: Pass kern from net_proto_family.create to sk_alloc
  2015-05-09 16:51             ` Eric Dumazet
@ 2015-05-09 17:31               ` Eric W. Biederman
  0 siblings, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-09 17:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, Ying Xue, netdev, cwang, xemul, maxk, stephen, tgraf,
	nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy, horms,
	Herbert Xu

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Fri, 2015-05-08 at 21:09 -0500, Eric W. Biederman wrote:
>> In preparation for changing how struct net is refcounted
>> on kernel sockets pass the knowledge that we are creating
>> a kernel socket from sock_create_kern through to sk_alloc.
>
> ...
>
>>  
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 3262f3e2b8b2..1a1c4f7b3ec5 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2148,7 +2148,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>>  	DBG1(KERN_INFO, "tunX: tun_chr_open\n");
>>  
>>  	tfile = (struct tun_file *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
>> -					    &tun_proto);
>> +					    &tun_proto, 0);
>>  	if (!tfile)
>>  		return -ENOMEM;
>>  	RCU_INIT_POINTER(tfile->tun, NULL);
> ...
>
>> index 3a4898ec8c67..d8dcf91732b0 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -1514,7 +1514,7 @@ static inline void unlock_sock_fast(struct sock *sk, bool slow)
>>  
>> 
>>  struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>> -		      struct proto *prot);
>> +		      struct proto *prot, int kern);
>
>
> Problem adding lot of parameters to functions, is that 0 (or false)
> means nothing when we read the code later and we have to go back to
> sk_alloc() definition to check.
>
> Sure, at the time you write the code everything is clear, but in 2 years
> we wont have the context in our heads.

Possibly.  In most cases what we are doing is passing the kern parameter
from net_proto_family.create (aka inet_create) to sk_alloc.   Which has
enough local context that unless you are debugging that particular
aspect of the code it should be fairly obviously locally what is
happening.  We are passing the parameter through to sk_alloc.

> I would use an enumeration to ease code readability maybe.
>
> enum sk_alloc_kern {
> 	SK_ALLOC_USER,
> 	SK_ALLOC_KERN,
> };

If you want to add such an enumeration to the entire call chain of
__sock_create, security_socket_create, security_post_socket_create,
net_proto_family.create and sk_alloc and the handful of protocol
specific helper functions in between feel free.

I can't think of an obviously better convention than what we are using
now, and I certainly can't think of an appropriate name.

In most cases it does not matter as this parameter is simply passed
through.

There are only a handful of places where a hard coded constant is used,
and I always used a 0 to ensure there would be no change in existing
behavior.  But I really don't think there are enough of them to matter.


For myself the main issue is fixing the mess that is creating kernel
sockets and how they deal with reference counting.  Being consistent
with existing conventions and not taking on more than I had to, to get
that done appears to be the best course.

Eric

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/6] Cleanup the kernel sockets.
  2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
                             ` (6 preceding siblings ...)
  2015-05-09  2:38           ` [PATCH 0/6] Cleanup the kernel sockets Herbert Xu
@ 2015-05-11 14:53           ` David Miller
  7 siblings, 0 replies; 56+ messages in thread
From: David Miller @ 2015-05-11 14:53 UTC (permalink / raw)
  To: ebiederm
  Cc: ying.xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, herbert

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Fri, 08 May 2015 21:05:33 -0500

> Right now the situtation for allocating kernel sockets is a mess.
> - sock_create_kern does not take a namespace parameter.
> - kernel sockets must not reference count a network namespace and keep
>   it alive or else we will have a reference counting loop.
> - The way we avoid the reference counting loop with sk_change_net
>   and sk_release_kernel are major hacks.
> 
> This patchset addresses this mess by fixing sock_create_kern to do
> everything necessary to create a kernel socket.  None of the current
> users of kernel sockets need the network namespace reference counted.
> Either kernel sockets are network namespace aware (and using the current
> hacks) or kernel sockets are limited to the initial network namespace
> in which case it does not matter.
> 
> This patchset starts by addressing tun which should be using normal
> userspace sockets like macvtap.
> 
> Then sock_create_kern is fixed to take a network namespace.
> Then the in kernel status of sockets are passed through to sk_alloc.
> Then sk_alloc is fixed to not reference count the network namespace
>      of kernel sockets.
> Then the callers of sock_create_kern are fixed up to stop using hacks.
> Then netlink which uses it's own flavor of sock_create_kern is fixed.
> 
> Finally the hacks that are sk_change_net and sk_release_kernel are removed.
> 
> When it is all done the code is easier to follow, easier to use, easier
> to maintain and shorter by about 70 lines.
> 
> Reported-by: Ying Xue <ying.xue@windriver.com>

Looks good, applied to net-next, thanks Eric.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 2/6] net: Add a struct net parameter to sock_create_kern
  2015-05-09  2:08           ` [PATCH 2/6] net: Add a struct net parameter to sock_create_kern Eric W. Biederman
@ 2015-05-12  8:24             ` David Laight
  2015-05-12  8:55               ` Eric W. Biederman
  2015-05-12 14:45               ` David Miller
  0 siblings, 2 replies; 56+ messages in thread
From: David Laight @ 2015-05-12  8:24 UTC (permalink / raw)
  To: 'Eric W. Biederman', davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, Herbert Xu

From: Eric W. Biederman
> Sent: 09 May 2015 03:08
>
> This is long overdue, and is part of cleaning up how we allocate kernel
> sockets that don't reference count struct net.
...
> diff --git a/net/socket.c b/net/socket.c
> index b5f1f43ed8f4..9963a0b53a64 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -1210,9 +1210,9 @@ int sock_create(int family, int type, int protocol, struct socket **res)
>  }
>  EXPORT_SYMBOL(sock_create);
> 
> -int sock_create_kern(int family, int type, int protocol, struct socket **res)
> +int sock_create_kern(struct net *net, int family, int type, int protocol, struct socket **res)
>  {
> -	return __sock_create(&init_net, family, type, protocol, res, 1);
> +	return __sock_create(net, family, type, protocol, res, 1);
>  }
>  EXPORT_SYMBOL(sock_create_kern);

Wouldn't it involve far less churn to add a new function that uses a non-default
namespace?

Changing the function prototype will a PITA for anyone doing back-ports of fixes.
(And more so for anyone trying to get a driver to build against kernels
that might have this change back-ported.)

	David

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 2/6] net: Add a struct net parameter to sock_create_kern
  2015-05-12  8:24             ` David Laight
@ 2015-05-12  8:55               ` Eric W. Biederman
  2015-05-12 11:48                 ` David Laight
  2015-05-12 14:45               ` David Miller
  1 sibling, 1 reply; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-12  8:55 UTC (permalink / raw)
  To: David Laight, davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, Herbert Xu



On May 12, 2015 3:24:11 AM CDT, David Laight <David.Laight@ACULAB.COM> wrote:
>From: Eric W. Biederman
>> Sent: 09 May 2015 03:08
>>
>> This is long overdue, and is part of cleaning up how we allocate
>kernel
>> sockets that don't reference count struct net.
>...
>> diff --git a/net/socket.c b/net/socket.c
>> index b5f1f43ed8f4..9963a0b53a64 100644
>> --- a/net/socket.c
>> +++ b/net/socket.c
>> @@ -1210,9 +1210,9 @@ int sock_create(int family, int type, int
>protocol, struct socket **res)
>>  }
>>  EXPORT_SYMBOL(sock_create);
>> 
>> -int sock_create_kern(int family, int type, int protocol, struct
>socket **res)
>> +int sock_create_kern(struct net *net, int family, int type, int
>protocol, struct socket **res)
>>  {
>> -	return __sock_create(&init_net, family, type, protocol, res, 1);
>> +	return __sock_create(net, family, type, protocol, res, 1);
>>  }
>>  EXPORT_SYMBOL(sock_create_kern);
>
>Wouldn't it involve far less churn to add a new function that uses a
>non-default
>namespace?

The goal is comprehensible and maintainable kernel code.

Which network namespace your socket is in, is an important property and something you probably care about if you are creating kernel sockets.

Having a second function is more maintenance and results in harder to understand code.  A major fraction of the callers even before this change wanted to be outside the initial network namespace.

This change should have been made years ago, but unfortunately it was not.

>Changing the function prototype will a PITA for anyone doing back-ports
>of fixes.
>(And more so for anyone trying to get a driver to build against kernels
>that might have this change back-ported.)

Yep dealing with backports sucks, my sympathies.

Eric

 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 2/6] net: Add a struct net parameter to sock_create_kern
  2015-05-12  8:55               ` Eric W. Biederman
@ 2015-05-12 11:48                 ` David Laight
  2015-05-12 12:28                   ` Nicolas Dichtel
  0 siblings, 1 reply; 56+ messages in thread
From: David Laight @ 2015-05-12 11:48 UTC (permalink / raw)
  To: 'Eric W. Biederman', davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, nicolas.dichtel, tom, jchapman, erik.hugne, jon.maloy,
	horms, Herbert Xu

From: Eric W. Biederman
> Sent: 12 May 2015 09:55
> 
> On May 12, 2015 3:24:11 AM CDT, David Laight <David.Laight@ACULAB.COM> wrote:
...
> >Wouldn't it involve far less churn to add a new function that uses a
> >non-default namespace?
> 
> The goal is comprehensible and maintainable kernel code.
> 
> Which network namespace your socket is in, is an important property and something you probably care
> about if you are creating kernel sockets.

That rather depends on whether you've anywhere to get a namespace from.
eg something like ceph/messenger.c

> Having a second function is more maintenance and results in harder to understand code.  A major
> fraction of the callers even before this change wanted to be outside the initial network namespace.

A static inline in the header file wouldn't cause any maintenance issues.

	David


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/6] net: Add a struct net parameter to sock_create_kern
  2015-05-12 11:48                 ` David Laight
@ 2015-05-12 12:28                   ` Nicolas Dichtel
  2015-05-12 13:16                     ` David Laight
  0 siblings, 1 reply; 56+ messages in thread
From: Nicolas Dichtel @ 2015-05-12 12:28 UTC (permalink / raw)
  To: David Laight, 'Eric W. Biederman', davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, tom, jchapman, erik.hugne, jon.maloy, horms, Herbert Xu

Le 12/05/2015 13:48, David Laight a écrit :
> From: Eric W. Biederman
>> Sent: 12 May 2015 09:55
>>
>> On May 12, 2015 3:24:11 AM CDT, David Laight <David.Laight@ACULAB.COM> wrote:
> ...
>>> Wouldn't it involve far less churn to add a new function that uses a
>>> non-default namespace?
>>
>> The goal is comprehensible and maintainable kernel code.
>>
>> Which network namespace your socket is in, is an important property and something you probably care
>> about if you are creating kernel sockets.
>
> That rather depends on whether you've anywhere to get a namespace from.
> eg something like ceph/messenger.c
sk_net(con->sock->sk)?

This parameter is essential, hiding it will just hides bugs.
Having this parameter forces the developer to ask himself what is the best
value.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 2/6] net: Add a struct net parameter to sock_create_kern
  2015-05-12 12:28                   ` Nicolas Dichtel
@ 2015-05-12 13:16                     ` David Laight
  2015-05-12 14:15                       ` Nicolas Dichtel
  2015-05-12 15:58                       ` Eric W. Biederman
  0 siblings, 2 replies; 56+ messages in thread
From: David Laight @ 2015-05-12 13:16 UTC (permalink / raw)
  To: 'nicolas.dichtel@6wind.com', 'Eric W. Biederman', davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, tom, jchapman, erik.hugne, jon.maloy, horms, Herbert Xu

From: Nicolas Dichtel [mailto:nicolas.dichtel@6wind.com]
> Sent: 12 May 2015 13:29
> Le 12/05/2015 13:48, David Laight a écrit :
> > From: Eric W. Biederman
> >> Sent: 12 May 2015 09:55
> >>
> >> On May 12, 2015 3:24:11 AM CDT, David Laight <David.Laight@ACULAB.COM> wrote:
> > ...
> >>> Wouldn't it involve far less churn to add a new function that uses a
> >>> non-default namespace?
> >>
> >> The goal is comprehensible and maintainable kernel code.
> >>
> >> Which network namespace your socket is in, is an important property and something you probably care
> >> about if you are creating kernel sockets.
> >
> > That rather depends on whether you've anywhere to get a namespace from.
> > eg something like ceph/messenger.c
> sk_net(con->sock->sk)?

What if you don't already have a socket?
Just an IP(v6) address ?

> This parameter is essential, hiding it will just hides bugs.
> Having this parameter forces the developer to ask himself what is the best
> value.

What if the answer is 'NFI'?
Which requires the answer be pushed back to the 'application' configuration?
Most users will have no idea either.

	David

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/6] net: Add a struct net parameter to sock_create_kern
  2015-05-12 13:16                     ` David Laight
@ 2015-05-12 14:15                       ` Nicolas Dichtel
  2015-05-12 15:58                       ` Eric W. Biederman
  1 sibling, 0 replies; 56+ messages in thread
From: Nicolas Dichtel @ 2015-05-12 14:15 UTC (permalink / raw)
  To: David Laight, 'Eric W. Biederman', davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, tom, jchapman, erik.hugne, jon.maloy, horms, Herbert Xu

Le 12/05/2015 15:16, David Laight a écrit :
> From: Nicolas Dichtel [mailto:nicolas.dichtel@6wind.com]
>> Sent: 12 May 2015 13:29
>> Le 12/05/2015 13:48, David Laight a écrit :
>>> From: Eric W. Biederman
>>>> Sent: 12 May 2015 09:55
>>>>
>>>> On May 12, 2015 3:24:11 AM CDT, David Laight <David.Laight@ACULAB.COM> wrote:
>>> ...
>>>>> Wouldn't it involve far less churn to add a new function that uses a
>>>>> non-default namespace?
>>>>
>>>> The goal is comprehensible and maintainable kernel code.
>>>>
>>>> Which network namespace your socket is in, is an important property and something you probably care
>>>> about if you are creating kernel sockets.
>>>
>>> That rather depends on whether you've anywhere to get a namespace from.
>>> eg something like ceph/messenger.c
>> sk_net(con->sock->sk)?
>
> What if you don't already have a socket?
> Just an IP(v6) address ?
So you have an IPv6 address and you don't know on which netns this address
belongs? I can't imagine how it can work.

>
>> This parameter is essential, hiding it will just hides bugs.
>> Having this parameter forces the developer to ask himself what is the best
>> value.
>
> What if the answer is 'NFI'?
> Which requires the answer be pushed back to the 'application' configuration?
Your application is bound to a netns.

> Most users will have no idea either.
If you support netns, you *have to know*.


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/6] net: Add a struct net parameter to sock_create_kern
  2015-05-12  8:24             ` David Laight
  2015-05-12  8:55               ` Eric W. Biederman
@ 2015-05-12 14:45               ` David Miller
  1 sibling, 0 replies; 56+ messages in thread
From: David Miller @ 2015-05-12 14:45 UTC (permalink / raw)
  To: David.Laight
  Cc: ebiederm, ying.xue, netdev, cwang, xemul, eric.dumazet, maxk,
	stephen, tgraf, nicolas.dichtel, tom, jchapman, erik.hugne,
	jon.maloy, horms, herbert

From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 12 May 2015 08:24:11 +0000

> Changing the function prototype will a PITA for anyone doing back-ports of fixes.
> (And more so for anyone trying to get a driver to build against kernels
> that might have this change back-ported.)

As someone who actually does such backports every week or two for up
to 5 different -stable releases, I really don't find it to be a big
deal, and I've therefore developed a very low level of sympathy for
this argument so please don't make it as a reason to not make an
interface change.

Thanks.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 2/6] net: Add a struct net parameter to sock_create_kern
  2015-05-12 13:16                     ` David Laight
  2015-05-12 14:15                       ` Nicolas Dichtel
@ 2015-05-12 15:58                       ` Eric W. Biederman
  1 sibling, 0 replies; 56+ messages in thread
From: Eric W. Biederman @ 2015-05-12 15:58 UTC (permalink / raw)
  To: David Laight, 'nicolas.dichtel@6wind.com', davem
  Cc: Ying Xue, netdev, cwang, xemul, eric.dumazet, maxk, stephen,
	tgraf, tom, jchapman, erik.hugne, jon.maloy, horms, Herbert Xu



On May 12, 2015 8:16:23 AM CDT, David Laight <David.Laight@ACULAB.COM> wrote:
>From: Nicolas Dichtel [mailto:nicolas.dichtel@6wind.com]
>> Sent: 12 May 2015 13:29
>> Le 12/05/2015 13:48, David Laight a écrit :
>> > From: Eric W. Biederman
>> >> Sent: 12 May 2015 09:55
>> >>
>> >> On May 12, 2015 3:24:11 AM CDT, David Laight
><David.Laight@ACULAB.COM> wrote:
>> > ...
>> >>> Wouldn't it involve far less churn to add a new function that
>uses a
>> >>> non-default namespace?
>> >>
>> >> The goal is comprehensible and maintainable kernel code.
>> >>
>> >> Which network namespace your socket is in, is an important
>property and something you probably care
>> >> about if you are creating kernel sockets.
>> >
>> > That rather depends on whether you've anywhere to get a namespace
>from.
>> > eg something like ceph/messenger.c
>> sk_net(con->sock->sk)?
>
>What if you don't already have a socket?
>Just an IP(v6) address ?
>
>> This parameter is essential, hiding it will just hides bugs.
>> Having this parameter forces the developer to ask himself what is the
>best
>> value.
>
>What if the answer is 'NFI'?
>Which requires the answer be pushed back to the 'application'
>configuration?
>Most users will have no idea either.

current->nsproxy->netns.

Capture it at mount time.  Call get_net then and put_net after you have cleaned up during unmount.

Not hard, and the parameter is doing what it is supposed to be doing getting you to ask the question, and realize there is an unhandled issue.

Eric

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2015-05-12 15:58 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-07  8:52 [RFC PATCH net-next 00/11] netns: don't switch namespace while creating kernel sockets Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 01/11] netns: Fix race between put_net() and netlink_kernel_create() Ying Xue
2015-05-07  9:04   ` Herbert Xu
2015-05-07 17:19     ` Cong Wang
2015-05-07 17:28       ` Eric W. Biederman
2015-05-08 11:20       ` Eric W. Biederman
2015-05-08 11:20       ` Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 02/11] netlink: avoid unnecessary namespace switch when create netlink kernel sockets Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 03/11] tun: avoid unnecessary namespace switch during kernel socket creation Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 04/11] inet: " Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 05/11] udp_tunnel: avoid to switch namespace for tunnel socket Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 06/11] ip6_udp_tunnel: " Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 07/11] l2tp: avoid to switch namespace for l2tp " Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 08/11] ipvs: avoid to switch namespace for ipvs kernel socket Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 09/11] tipc: fix net leak issue Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 10/11] tipc: remove sk_change_net interface Ying Xue
2015-05-07  8:52 ` [RFC PATCH net-next 11/11] net: change behaviours of functions of creating and releasing kernel sockets Ying Xue
2015-05-07 16:14 ` [RFC PATCH net-next 00/11] netns: don't switch namespace while creating " Eric W. Biederman
2015-05-07 18:19   ` Cong Wang
2015-05-07 18:26     ` Eric W. Biederman
2015-05-07 18:53       ` Cong Wang
2015-05-07 18:58         ` Eric W. Biederman
2015-05-07 19:29           ` Cong Wang
2015-05-07 20:01             ` Eric W. Biederman
2015-05-08  9:10               ` Ying Xue
2015-05-08 11:15                 ` Eric W. Biederman
2015-05-08  8:50   ` Ying Xue
2015-05-08  9:25     ` Ying Xue
2015-05-08 11:07     ` Eric W. Biederman
2015-05-08 16:33       ` Cong Wang
2015-05-08 14:07   ` Herbert Xu
2015-05-08 17:36     ` Eric W. Biederman
2015-05-08 20:27       ` Cong Wang
2015-05-08 21:13         ` Cong Wang
2015-05-08 22:08           ` Eric W. Biederman
2015-05-09  1:13       ` Herbert Xu
2015-05-09  1:53         ` Eric W. Biederman
2015-05-09  2:05         ` [PATCH 0/6] Cleanup the " Eric W. Biederman
2015-05-09  2:07           ` [PATCH 1/6] tun: Utilize the normal socket network namespace refcounting Eric W. Biederman
2015-05-09  2:08           ` [PATCH 2/6] net: Add a struct net parameter to sock_create_kern Eric W. Biederman
2015-05-12  8:24             ` David Laight
2015-05-12  8:55               ` Eric W. Biederman
2015-05-12 11:48                 ` David Laight
2015-05-12 12:28                   ` Nicolas Dichtel
2015-05-12 13:16                     ` David Laight
2015-05-12 14:15                       ` Nicolas Dichtel
2015-05-12 15:58                       ` Eric W. Biederman
2015-05-12 14:45               ` David Miller
2015-05-09  2:09           ` [PATCH 3/6] net: Pass kern from net_proto_family.create to sk_alloc Eric W. Biederman
2015-05-09 16:51             ` Eric Dumazet
2015-05-09 17:31               ` Eric W. Biederman
2015-05-09  2:10           ` [PATCH 4/6] net: Modify sk_alloc to not reference count the netns of kernel sockets Eric W. Biederman
2015-05-09  2:11           ` [PATCH 5/6] netlink: Create kernel netlink sockets in the proper network namespace Eric W. Biederman
2015-05-09  2:12           ` [PATCH 6/6] net: kill sk_change_net and sk_release_kernel Eric W. Biederman
2015-05-09  2:38           ` [PATCH 0/6] Cleanup the kernel sockets Herbert Xu
2015-05-11 14:53           ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.