* [RFC PATCH v2 net-next] netlink: avoid namespace change while creating socket
From: Ying Xue @ 2015-05-06  6:08 UTC
  To: netdev; +Cc: herbert, xemul, davem, eric.dumazet, ebiederm

Commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
netlink_kernel_create().") attempts to fix the following race
scenario:

put_net()
  if (atomic_dec_and_test(&net->refcnt))
    /* true */
      __put_net(net);
        queue_work(...);

/*
 * note: the net now has refcnt 0, but still in
 * the global list of net namespaces
 */

== re-schedule ==

register_pernet_subsys(&some_ops);
  register_pernet_operations(&some_ops);
    (*some_ops)->init(net);
      /*
       * we call netlink_kernel_create() here
       * in some places
       */
      netlink_kernel_create();
         sk_alloc();
            get_net(net); /* refcnt = 1 */
         /*
          * now we drop the net refcount not to
          * block the net namespace exit in the
          * future (or this can be done on the
          * error path)
          */
         put_net(sk->sk_net);
             if (atomic_dec_and_test(&...))
                   /*
                    * true. BOOOM! The net is
                    * scheduled for release twice
                    */

To prevent the race, that commit adopted the following solution: create
the netlink socket inside the init_net namespace and re-attach it to the
desired namespace right after the socket is created; similarly, when
closing the socket, move it back to init_net so that the socket is
destroyed in the same context in which it was created.

This makes the whole thing artificially complex. There is a simpler way
to avoid the double release of the net: if sk_alloc() finds that the net
reference counter has already reached zero before it would be
incremented, then the namespace exit running in the workqueue has not
finished yet, and sk_alloc() should fail immediately. This is safe
because once the refcount reaches zero, the net will definitely be
destroyed later in the workqueue whether or not we take a reference.
This solution is simpler and easier to understand, and it also avoids
the redundant namespace change.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
v2 Changes:
  Kernel sockets should not hold a reference count on a namespace;
  otherwise, modules relying on them probably cannot be stopped. The
  previous version, however, kept the reference on the net taken for
  the sk in sk_alloc(). This version corrects that behaviour by
  dropping the net reference once the sk has been created successfully.

 net/core/sock.c          |    7 ++++++-
 net/netlink/af_netlink.c |   11 ++++++-----
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index e891bcf..9442387 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1411,7 +1411,12 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		 */
 		sk->sk_prot = sk->sk_prot_creator = prot;
 		sock_lock_init(sk);
-		sock_net_set(sk, get_net(net));
+		net = maybe_get_net(net);
+		if (!net) {
+			sk_prot_free(prot, sk);
+			return NULL;
+		}
+		sock_net_set(sk, net);
 		atomic_set(&sk->sk_wmem_alloc, 1);
 
 		sock_update_classid(sk);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ec4adbd..ca3f63a 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2475,15 +2475,15 @@ __netlink_kernel_create(struct net *net, int unit, struct module *module,
 
 	/*
 	 * We have to just have a reference on the net from sk, but don't
-	 * get_net it. Besides, we cannot get and then put the net here.
-	 * So we create one inside init_net and the move it to net.
+	 * get_net it as netlink kernel sockets are a part of the net. So
+	 * we put the net here and get it again before releasing the socket.
 	 */
 
-	if (__netlink_create(&init_net, sock, cb_mutex, unit) < 0)
+	if (__netlink_create(net, sock, cb_mutex, unit) < 0)
 		goto out_sock_release_nosk;
 
 	sk = sock->sk;
-	sk_change_net(sk, net);
+	put_net(sock_net(sk));
 
 	if (!cfg || cfg->groups < 32)
 		groups = 32;
@@ -2539,7 +2539,8 @@ EXPORT_SYMBOL(__netlink_kernel_create);
 void
 netlink_kernel_release(struct sock *sk)
 {
-	sk_release_kernel(sk);
+	get_net(sock_net(sk));
+	sock_release(sk->sk_socket);
 }
 EXPORT_SYMBOL(netlink_kernel_release);
 
-- 
1.7.9.5


* Re: [RFC PATCH v2 net-next] netlink: avoid namespace change while creating socket
From: Cong Wang @ 2015-05-06 23:03 UTC
  To: Ying Xue
  Cc: netdev, Herbert Xu, xemul, David Miller, Eric Dumazet, Eric W. Biederman

On Tue, May 5, 2015 at 11:08 PM, Ying Xue <ying.xue@windriver.com> wrote:
> Commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
> netlink_kernel_create().") attempts to fix the following race
> scenario:
>
> put_net()
>   if (atomic_dec_and_test(&net->refcnt))
>     /* true */
>       __put_net(net);
>         queue_work(...);
>
> /*
>  * note: the net now has refcnt 0, but still in
>  * the global list of net namespaces
>  */
>
> == re-schedule ==
>
> register_pernet_subsys(&some_ops);
>   register_pernet_operations(&some_ops);
>     (*some_ops)->init(net);
>       /*
>        * we call netlink_kernel_create() here
>        * in some places
>        */
>       netlink_kernel_create();
>          sk_alloc();
>             get_net(net); /* refcnt = 1 */
>          /*
>           * now we drop the net refcount not to
>           * block the net namespace exit in the
>           * future (or this can be done on the
>           * error path)
>           */
>          put_net(sk->sk_net);
>              if (atomic_dec_and_test(&...))
>                    /*
>                     * true. BOOOM! The net is
>                     * scheduled for release twice
>                     */
>
> To prevent the race, that commit adopted the following solution: create
> the netlink socket inside the init_net namespace and re-attach it to the
> desired namespace right after the socket is created; similarly, when
> closing the socket, move it back to init_net so that the socket is
> destroyed in the same context in which it was created.
>
> This makes the whole thing artificially complex. There is a simpler way
> to avoid the double release of the net: if sk_alloc() finds that the net
> reference counter has already reached zero before it would be
> incremented, then the namespace exit running in the workqueue has not
> finished yet, and sk_alloc() should fail immediately. This is safe
> because once the refcount reaches zero, the net will definitely be
> destroyed later in the workqueue whether or not we take a reference.
> This solution is simpler and easier to understand, and it also avoids
> the redundant namespace change.
>

Hmm, why does the caller have to handle a race condition in the callee?
Isn't this solvable at the netns API layer?

How about the following patch (PoC ONLY)? __put_net() should be able
to detect pending cleanup work, shouldn't it?

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 78fc04a..ded15a7 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -242,6 +242,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
        net->dev_base_seq = 1;
        net->user_ns = user_ns;
        idr_init(&net->netns_ids);
+       INIT_LIST_HEAD(&net->cleanup_list);

        list_for_each_entry(ops, &pernet_list, list) {
                error = ops_init(ops, net);
@@ -409,12 +410,17 @@ void __put_net(struct net *net)
 {
        /* Cleanup the network namespace in process context */
        unsigned long flags;
+       bool added = false;

        spin_lock_irqsave(&cleanup_list_lock, flags);
-       list_add(&net->cleanup_list, &cleanup_list);
+       if (list_empty(&net->cleanup_list)) {
+               list_add(&net->cleanup_list, &cleanup_list);
+               added = true;
+       }
        spin_unlock_irqrestore(&cleanup_list_lock, flags);

-       queue_work(netns_wq, &net_cleanup_work);
+       if (added)
+               queue_work(netns_wq, &net_cleanup_work);
 }
 EXPORT_SYMBOL_GPL(__put_net);


* Re: [RFC PATCH v2 net-next] netlink: avoid namespace change while creating socket
From: Ying Xue @ 2015-05-07  8:57 UTC
  To: Cong Wang
  Cc: netdev, Herbert Xu, xemul, David Miller, Eric Dumazet, Eric W. Biederman

On 05/07/2015 07:03 AM, Cong Wang wrote:
> On Tue, May 5, 2015 at 11:08 PM, Ying Xue <ying.xue@windriver.com> wrote:
>> Commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
>> netlink_kernel_create().") attempts to fix the following race
>> scenario:
>>
>> put_net()
>>   if (atomic_dec_and_test(&net->refcnt))
>>     /* true */
>>       __put_net(net);
>>         queue_work(...);
>>
>> /*
>>  * note: the net now has refcnt 0, but still in
>>  * the global list of net namespaces
>>  */
>>
>> == re-schedule ==
>>
>> register_pernet_subsys(&some_ops);
>>   register_pernet_operations(&some_ops);
>>     (*some_ops)->init(net);
>>       /*
>>        * we call netlink_kernel_create() here
>>        * in some places
>>        */
>>       netlink_kernel_create();
>>          sk_alloc();
>>             get_net(net); /* refcnt = 1 */
>>          /*
>>           * now we drop the net refcount not to
>>           * block the net namespace exit in the
>>           * future (or this can be done on the
>>           * error path)
>>           */
>>          put_net(sk->sk_net);
>>              if (atomic_dec_and_test(&...))
>>                    /*
>>                     * true. BOOOM! The net is
>>                     * scheduled for release twice
>>                     */
>>
>> To prevent the race, that commit adopted the following solution: create
>> the netlink socket inside the init_net namespace and re-attach it to the
>> desired namespace right after the socket is created; similarly, when
>> closing the socket, move it back to init_net so that the socket is
>> destroyed in the same context in which it was created.
>>
>> This makes the whole thing artificially complex. There is a simpler way
>> to avoid the double release of the net: if sk_alloc() finds that the net
>> reference counter has already reached zero before it would be
>> incremented, then the namespace exit running in the workqueue has not
>> finished yet, and sk_alloc() should fail immediately. This is safe
>> because once the refcount reaches zero, the net will definitely be
>> destroyed later in the workqueue whether or not we take a reference.
>> This solution is simpler and easier to understand, and it also avoids
>> the redundant namespace change.
>>
> 
> Hmm, why does the caller have to handle a race condition in the callee?
> Isn't this solvable at the netns API layer?
> 
> How about the following patch (PoC ONLY)? __put_net() should be able
> to detect pending cleanup work, shouldn't it?
> 

Thanks, Cong! Your idea below is clearly better than mine.

I have created a whole series based on the idea. Please review it.
In addition, I was not sure whether I should use your "Suggested-by" or
"Signed-off-by" tag in the first patch. In the end I chose
"Suggested-by". If you think the tag is inappropriate, please let me know
and I will replace it with a "Signed-off-by" tag in the next version.

Thanks,
Ying

> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 78fc04a..ded15a7 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -242,6 +242,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
>         net->dev_base_seq = 1;
>         net->user_ns = user_ns;
>         idr_init(&net->netns_ids);
> +       INIT_LIST_HEAD(&net->cleanup_list);
> 
>         list_for_each_entry(ops, &pernet_list, list) {
>                 error = ops_init(ops, net);
> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>  {
>         /* Cleanup the network namespace in process context */
>         unsigned long flags;
> +       bool added = false;
> 
>         spin_lock_irqsave(&cleanup_list_lock, flags);
> -       list_add(&net->cleanup_list, &cleanup_list);
> +       if (list_empty(&net->cleanup_list)) {
> +               list_add(&net->cleanup_list, &cleanup_list);
> +               added = true;
> +       }
>         spin_unlock_irqrestore(&cleanup_list_lock, flags);
> 
> -       queue_work(netns_wq, &net_cleanup_work);
> +       if (added)
> +               queue_work(netns_wq, &net_cleanup_work);
>  }
>  EXPORT_SYMBOL_GPL(__put_net);
> 
> 


* Re: [RFC PATCH v2 net-next] netlink: avoid namespace change while creating socket
From: Cong Wang @ 2015-05-07 18:21 UTC
  To: Ying Xue
  Cc: netdev, Herbert Xu, Pavel Emelyanov, David Miller, Eric Dumazet,
	Eric W. Biederman

On Thu, May 7, 2015 at 1:57 AM, Ying Xue <ying.xue@windriver.com> wrote:
>
> Thanks, Cong! Your idea below is clearly better than mine.
>
> I have created a whole series based on the idea. Please review it.
> In addition, I was not sure whether I should use your "Suggested-by" or
> "Signed-off-by" tag in the first patch. In the end I chose
> "Suggested-by". If you think the tag is inappropriate, please let me know
> and I will replace it with a "Signed-off-by" tag in the next version.
>

That is fairly easy to choose:

If you use any code from me, you need my Signed-off-by.
If you only take a suggestion without any code, then use Suggested-by.


* Re: [RFC PATCH v2 net-next] netlink: avoid namespace change while creating socket
From: Ying Xue @ 2015-05-08  8:51 UTC
  To: Cong Wang
  Cc: netdev, Herbert Xu, Pavel Emelyanov, David Miller, Eric Dumazet,
	Eric W. Biederman

On 05/08/2015 02:21 AM, Cong Wang wrote:
> On Thu, May 7, 2015 at 1:57 AM, Ying Xue <ying.xue@windriver.com> wrote:
>>
>> Thanks, Cong! Your idea below is clearly better than mine.
>>
>> I have created a whole series based on the idea. Please review it.
>> In addition, I was not sure whether I should use your "Suggested-by" or
>> "Signed-off-by" tag in the first patch. In the end I chose
>> "Suggested-by". If you think the tag is inappropriate, please let me know
>> and I will replace it with a "Signed-off-by" tag in the next version.
>>
> 
> That is fairly easy to choose:
> 
> If you use any code from me, you need my Signed-off-by.
> If you only take a suggestion without any code, then use Suggested-by.
> 
> 

OK, thanks for the clarification.

Regards,
Ying

