linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH linux-next] ns: do not allocate a new nsproxy at each call
@ 2013-10-15 17:35 Guillaume Gaudonville
  2013-10-15 18:40 ` Eric W. Biederman
  0 siblings, 1 reply; 12+ messages in thread
From: Guillaume Gaudonville @ 2013-10-15 17:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: ebiederm, serge.hallyn, akpm, viro, davem, cmetcalf,
	Guillaume Gaudonville

Currently, at each call of setns system call a new nsproxy is allocated,
the old nsproxy namespaces are copied into the new one and the old nsproxy
is freed if the task was the only one to use it.

It can creates large delays on hardware with large number of cpus since
to free a nsproxy a synchronize_rcu() call is done.

When a task is the only one to use a nsproxy, only the task can do an action
that will make this nsproxy to be shared by another task or thread (fork,...).
So when the refcount of the nsproxy is equal to 1, we can simply update the
current nsproxy field without allocating a new one and freeing the old one.

The install operations of each kind of namespace cannot fails, so there's no
need to check for an error and calling ops->install().

Tested on TileGX (36 cores) and Intel (32 cores).

Reported-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Guillaume Gaudonville <guillaume.gaudonville@6wind.com>
---
 kernel/nsproxy.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index afc0456..afc04ac 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -255,6 +255,18 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
 	if (nstype && (ops->type != nstype))
 		goto out;
 
+	/*
+	 * If count == 1, only the current task can increment it,
+	 * by doing a fork for example so we can safely update the
+	 * current nsproxy pointers without allocate a new one,
+	 * update it and destroy the old one
+	 */
+	if (atomic_read(&tsk->nsproxy->count) == 1) {
+		err = ops->install(tsk->nsproxy, ei->ns);
+		fput(file);
+		return err;
+	}
+
 	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
 	if (IS_ERR(new_nsproxy)) {
 		err = PTR_ERR(new_nsproxy);
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH linux-next] ns: do not allocate a new nsproxy at each call
  2013-10-15 17:35 [RFC PATCH linux-next] ns: do not allocate a new nsproxy at each call Guillaume Gaudonville
@ 2013-10-15 18:40 ` Eric W. Biederman
  2013-10-16  8:48   ` Guillaume Gaudonville
  0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2013-10-15 18:40 UTC (permalink / raw)
  To: Guillaume Gaudonville
  Cc: linux-kernel, serge.hallyn, akpm, viro, davem, cmetcalf,
	Guillaume Gaudonville

"Guillaume Gaudonville" <gaudonville@6wind.com> writes:

> Currently, at each call of setns system call a new nsproxy is allocated,
> the old nsproxy namespaces are copied into the new one and the old nsproxy
> is freed if the task was the only one to use it.
>
> It can creates large delays on hardware with large number of cpus since
> to free a nsproxy a synchronize_rcu() call is done.
>
> When a task is the only one to use a nsproxy, only the task can do an action
> that will make this nsproxy to be shared by another task or thread (fork,...).
> So when the refcount of the nsproxy is equal to 1, we can simply update the
> current nsproxy field without allocating a new one and freeing the old one.
>
> The install operations of each kind of namespace cannot fails, so there's no
> need to check for an error and calling ops->install().
>
> Tested on TileGX (36 cores) and Intel (32 cores).

This may be worth doing (I am a little scared of a design that has setns
on a fast path) but right now this isn't safe.

Currently pidns_install ends with:

	put_pid_ns(nsproxy->pid_ns_for_children);
	nsproxy->pid_ns_for_children = get_pid_ns(new);
	return 0;


And netns_install ends with:

	put_net(nsproxy->net_ns);
	nsproxy->net_ns = get_net(net);
	return 0;

The put before the set is not atomic and is not safe unless
the nsproxy is private.  I think this is fixable but it requires a more
indepth look at the code than you have done.

Mind if I ask where this comes up?


> Reported-by: Chris Metcalf <cmetcalf@tilera.com>
> Signed-off-by: Guillaume Gaudonville <guillaume.gaudonville@6wind.com>
> ---
>  kernel/nsproxy.c |   12 ++++++++++++
>  1 files changed, 12 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index afc0456..afc04ac 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -255,6 +255,18 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
>  	if (nstype && (ops->type != nstype))
>  		goto out;
>  
> +	/*
> +	 * If count == 1, only the current task can increment it,
> +	 * by doing a fork for example so we can safely update the
> +	 * current nsproxy pointers without allocate a new one,
> +	 * update it and destroy the old one
> +	 */
> +	if (atomic_read(&tsk->nsproxy->count) == 1) {
> +		err = ops->install(tsk->nsproxy, ei->ns);
> +		fput(file);
> +		return err;
> +	}

As a minor nit, but to match the rest of the code in this function that
should read:

> +	if (atomic_read(&tsk->nsproxy->count) == 1) {
> +		err = ops->install(tsk->nsproxy, ei->ns);
> +		goto out;
> +	}

There is no need to add an additional exit point to reason about.

> +
>  	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
>  	if (IS_ERR(new_nsproxy)) {
>  		err = PTR_ERR(new_nsproxy);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH linux-next] ns: do not allocate a new nsproxy at each call
  2013-10-15 18:40 ` Eric W. Biederman
@ 2013-10-16  8:48   ` Guillaume Gaudonville
  2013-10-16 19:42     ` Eric W. Biederman
  0 siblings, 1 reply; 12+ messages in thread
From: Guillaume Gaudonville @ 2013-10-16  8:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Guillaume Gaudonville, linux-kernel, serge.hallyn, akpm, viro,
	davem, cmetcalf

On 10/15/2013 08:40 PM, Eric W. Biederman wrote:
> "Guillaume Gaudonville" <gaudonville@6wind.com> writes:
>
>> Currently, at each call of setns system call a new nsproxy is allocated,
>> the old nsproxy namespaces are copied into the new one and the old nsproxy
>> is freed if the task was the only one to use it.
>>
>> It can creates large delays on hardware with large number of cpus since
>> to free a nsproxy a synchronize_rcu() call is done.
>>
>> When a task is the only one to use a nsproxy, only the task can do an action
>> that will make this nsproxy to be shared by another task or thread (fork,...).
>> So when the refcount of the nsproxy is equal to 1, we can simply update the
>> current nsproxy field without allocating a new one and freeing the old one.
>>
>> The install operations of each kind of namespace cannot fails, so there's no
>> need to check for an error and calling ops->install().
>>
>> Tested on TileGX (36 cores) and Intel (32 cores).
> This may be worth doing (I am a little scared of a design that has setns
> on a fast path) but right now this isn't safe.
>
> Currently pidns_install ends with:
>
> 	put_pid_ns(nsproxy->pid_ns_for_children);
> 	nsproxy->pid_ns_for_children = get_pid_ns(new);
> 	return 0;
>
>
> And netns_install ends with:
>
> 	put_net(nsproxy->net_ns);
> 	nsproxy->net_ns = get_net(net);
> 	return 0;
>
> The put before the set is not atomic and is not safe unless
> the nsproxy is private.  I think this is fixable but it requires a more
> indepth look at the code than you have done.
My expectation was that nobody else but the task itself could increase the
nsproxy refcount (ie. use it), if the refcount was equal to 1. So there
was no possible races while playing with nsproxy in that case.
Do you mean that someone else than the task itself could play with the 
nsproxy,
even if its refcount is equal to 1?

>
> Mind if I ask where this comes up?
The issue has been seen on a daemon that is performing monitoring and
configuration tasks in different netns.
>
>
>> Reported-by: Chris Metcalf <cmetcalf@tilera.com>
>> Signed-off-by: Guillaume Gaudonville <guillaume.gaudonville@6wind.com>
>> ---
>>   kernel/nsproxy.c |   12 ++++++++++++
>>   1 files changed, 12 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> index afc0456..afc04ac 100644
>> --- a/kernel/nsproxy.c
>> +++ b/kernel/nsproxy.c
>> @@ -255,6 +255,18 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
>>   	if (nstype && (ops->type != nstype))
>>   		goto out;
>>   
>> +	/*
>> +	 * If count == 1, only the current task can increment it,
>> +	 * by doing a fork for example so we can safely update the
>> +	 * current nsproxy pointers without allocate a new one,
>> +	 * update it and destroy the old one
>> +	 */
>> +	if (atomic_read(&tsk->nsproxy->count) == 1) {
>> +		err = ops->install(tsk->nsproxy, ei->ns);
>> +		fput(file);
>> +		return err;
>> +	}
> As a minor nit, but to match the rest of the code in this function that
> should read:
>
>> +	if (atomic_read(&tsk->nsproxy->count) == 1) {
>> +		err = ops->install(tsk->nsproxy, ei->ns);
>> +		goto out;
>> +	}
> There is no need to add an additional exit point to reason about.
>
>> +
>>   	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
>>   	if (IS_ERR(new_nsproxy)) {
>>   		err = PTR_ERR(new_nsproxy);


-- 
Guillaume Gaudonville
6WIND
Software Engineer

Tel: +33 1 39 30 92 53
Mob: +33 6 47 85 34 33
Fax: +33 1 39 30 92 11
guillaume.gaudonville@6wind.com
www.6wind.com
Join the Multicore Packet Processing Forum: www.multicorepacketprocessing.com

Ce courriel ainsi que toutes les pièces jointes, est uniquement destiné à son ou ses destinataires. Il contient des informations confidentielles qui sont la propriété de 6WIND. Toute révélation, distribution ou copie des informations qu'il contient est strictement interdite. Si vous avez reçu ce message par erreur, veuillez immédiatement le signaler à l'émetteur et détruire toutes les données reçues

This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and contains information that is confidential and proprietary to 6WIND. All unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH linux-next] ns: do not allocate a new nsproxy at each call
  2013-10-16  8:48   ` Guillaume Gaudonville
@ 2013-10-16 19:42     ` Eric W. Biederman
  2013-10-18 14:46       ` [RFC PATCH linux-next v2] " Guillaume Gaudonville
  0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2013-10-16 19:42 UTC (permalink / raw)
  To: Guillaume Gaudonville
  Cc: Guillaume Gaudonville, linux-kernel, serge.hallyn, akpm, viro,
	davem, cmetcalf

Guillaume Gaudonville <guillaume.gaudonville@6wind.com> writes:

> On 10/15/2013 08:40 PM, Eric W. Biederman wrote:
>> "Guillaume Gaudonville" <gaudonville@6wind.com> writes:
>>
>>> Currently, at each call of setns system call a new nsproxy is allocated,
>>> the old nsproxy namespaces are copied into the new one and the old nsproxy
>>> is freed if the task was the only one to use it.
>>>
>>> It can creates large delays on hardware with large number of cpus since
>>> to free a nsproxy a synchronize_rcu() call is done.
>>>
>>> When a task is the only one to use a nsproxy, only the task can do an action
>>> that will make this nsproxy to be shared by another task or thread (fork,...).
>>> So when the refcount of the nsproxy is equal to 1, we can simply update the
>>> current nsproxy field without allocating a new one and freeing the old one.
>>>
>>> The install operations of each kind of namespace cannot fails, so there's no
>>> need to check for an error and calling ops->install().
>>>
>>> Tested on TileGX (36 cores) and Intel (32 cores).
>> This may be worth doing (I am a little scared of a design that has setns
>> on a fast path) but right now this isn't safe.
>>
>> Currently pidns_install ends with:
>>
>> 	put_pid_ns(nsproxy->pid_ns_for_children);
>> 	nsproxy->pid_ns_for_children = get_pid_ns(new);
>> 	return 0;
>>
>>
>> And netns_install ends with:
>>
>> 	put_net(nsproxy->net_ns);
>> 	nsproxy->net_ns = get_net(net);
>> 	return 0;
>>
>> The put before the set is not atomic and is not safe unless
>> the nsproxy is private.  I think this is fixable but it requires a more
>> indepth look at the code than you have done.

> My expectation was that nobody else but the task itself could increase the
> nsproxy refcount (ie. use it), if the refcount was equal to 1. So there
> was no possible races while playing with nsproxy in that case.
> Do you mean that someone else than the task itself could play with the
> nsproxy,
> even if its refcount is equal to 1?

It is not required to increase the nsproxy refcount to use nsproxy.  It
is possible to find a living task and look at it's nsproxy using just
rcu protection.  get_net_ns_by_pid is one example of a place where we do
that.

If the refcount was all that was protecting nsproxy the problematic
synchronize_rcu call would be unnecessary.

Look for task_nsproxy if you want to find other readers of the nsproxy
that don't increase the reference count.

>> Mind if I ask where this comes up?
> The issue has been seen on a daemon that is performing monitoring and
> configuration tasks in different netns.

Interesting...  In practice I would think that daemon by opening a few
choice netlink sockets would not need to change network namespaces at
all frequently, and you can also see the net information in /proc and
/sys without changing your default net.  So I am wondering about the
daemon.

That said if we can optimize things without getting ourselves into a
maintenance nightmare I am happy to see a patch optimizing things.

Eric

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH linux-next v2] ns: do not allocate a new nsproxy at each call
  2013-10-16 19:42     ` Eric W. Biederman
@ 2013-10-18 14:46       ` Guillaume Gaudonville
  2013-10-18 22:34         ` Eric W. Biederman
  0 siblings, 1 reply; 12+ messages in thread
From: Guillaume Gaudonville @ 2013-10-18 14:46 UTC (permalink / raw)
  To: ebiederm
  Cc: linux-kernel, serge.hallyn, akpm, viro, davem, cmetcalf,
	Guillaume Gaudonville

Currently, at each call of setns system call a new nsproxy is allocated,
the old nsproxy namespaces are copied into the new one and the old nsproxy
is freed if the task was the only one to use it.

It can creates large delays on hardware with large number of cpus since
to free a nsproxy a synchronize_rcu() call is done.

When a task is the only one to use a nsproxy, only the task can do an action
that will make this nsproxy to be shared by another task or thread (fork,...).
So when the refcount of the nsproxy is equal to 1, we can simply update the
current nsproxy field without allocating a new one and freeing the old one.

The install operations of each kind of namespace cannot fails, so there's no
need to check for an error and calling ops->install().

However since we can have readers of the nsproxy that are not the current task,
we need to protect access to each namespace pointer in the nsproxy. This is
done by assigning it using rcu_assign_pointer() and when it is possible
that the reader is not the current task, read the pointer using
rcu_dereference().

Finally the install function of each namespace type must be modified to ensure
that the refcount of the old namespace is released after the assignment in
nsproxy.

On kernel 3.12-rc1, using a small application that does:

- call setns on a first net namespace and open a socket,
- call setns on a second net namespace and open a socket,
- switch back to the first namespace and close the socket,
- switch back to the second namespace and close the socket,

On an Intel Westmere with 24 logical cores at 3.33 GHz, it gives the
following results using the time command:

- without the proposed patch:

  root@blackcloudy:~# time ./test_x86

  real    0m0.130s
  user    0m0.000s
  sys     0m0.000s

- with the proposed patch:

  root@blackcloudy:~# time ./test_x86

  real    0m0.020s
  user    0m0.000s
  sys     0m0.000s

Reported-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Guillaume Gaudonville <guillaume.gaudonville@6wind.com>
---

v2:
  - protect readers, by releasing namespaces refcount after updating the
    nsproxy pointer,
  - protect readers, by using rcu_assign_pointer() to affect nsproxy
    pointers,
  - readers need to use rcu_dereference() to access the namespace and
    must take a reference on it before leaving the rcu_read_lock section
    (this last part was already present),
  - do not add additional exit point in setns syscall.

There are still 2 suspicious functions, nfs_server_list_open() and
nfs_volume_list_open(). They are accessing directly to the net_ns
like below:

struct net *net = pid_ns->child_reaper->nsproxy->net_ns;

It seems to me that currently they should access it under rcu_read_lock()
and using task_nsproxy(pid_ns->child_reaper). It looks like a bug, no?

And then with this proposed patch they should access the netns through
a rcu_dereference and take a reference on the netns. I didn't
modify them for now, but if it is confirmed I can send a patch
fixing the first issue and then send a v3 of this proposed patch.

 fs/namespace.c           |    9 +++++----
 fs/proc/proc_net.c       |    2 +-
 fs/proc_namespace.c      |    2 +-
 ipc/namespace.c          |    9 +++++----
 kernel/nsproxy.c         |   11 +++++++++++
 kernel/pid_namespace.c   |    7 ++++---
 kernel/utsname.c         |    9 +++++----
 net/core/net_namespace.c |   11 ++++++-----
 8 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index a36e806..1890a87 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2840,7 +2840,7 @@ static void *mntns_get(struct task_struct *task)
 	rcu_read_lock();
 	nsproxy = task_nsproxy(task);
 	if (nsproxy) {
-		ns = nsproxy->mnt_ns;
+		ns = rcu_dereference(nsproxy->mnt_ns);
 		get_mnt_ns(ns);
 	}
 	rcu_read_unlock();
@@ -2856,7 +2856,7 @@ static void mntns_put(void *ns)
 static int mntns_install(struct nsproxy *nsproxy, void *ns)
 {
 	struct fs_struct *fs = current->fs;
-	struct mnt_namespace *mnt_ns = ns;
+	struct mnt_namespace *old_mnt_ns, *mnt_ns = ns;
 	struct path root;
 
 	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
@@ -2868,8 +2868,9 @@ static int mntns_install(struct nsproxy *nsproxy, void *ns)
 		return -EINVAL;
 
 	get_mnt_ns(mnt_ns);
-	put_mnt_ns(nsproxy->mnt_ns);
-	nsproxy->mnt_ns = mnt_ns;
+	old_mnt_ns = nsproxy->mnt_ns;
+	rcu_assign_pointer(nsproxy->mnt_ns, mnt_ns);
+	put_mnt_ns(old_mnt_ns);
 
 	/* Find the root */
 	root.mnt    = &mnt_ns->root->mnt;
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index b4ac657..39b93a5 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -111,7 +111,7 @@ static struct net *get_proc_task_net(struct inode *dir)
 	if (task != NULL) {
 		ns = task_nsproxy(task);
 		if (ns != NULL)
-			net = get_net(ns->net_ns);
+			net = get_net(rcu_dereference(ns->net_ns));
 	}
 	rcu_read_unlock();
 
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 5fe34c3..3a507e7 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -239,7 +239,7 @@ static int mounts_open_common(struct inode *inode, struct file *file,
 		put_task_struct(task);
 		goto err;
 	}
-	ns = nsp->mnt_ns;
+	ns = rcu_dereference(nsp->mnt_ns);
 	if (!ns) {
 		rcu_read_unlock();
 		put_task_struct(task);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7c1fa45..94f5fd3 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -156,7 +156,7 @@ static void *ipcns_get(struct task_struct *task)
 	rcu_read_lock();
 	nsproxy = task_nsproxy(task);
 	if (nsproxy)
-		ns = get_ipc_ns(nsproxy->ipc_ns);
+		ns = get_ipc_ns(rcu_dereference(nsproxy->ipc_ns));
 	rcu_read_unlock();
 
 	return ns;
@@ -169,15 +169,16 @@ static void ipcns_put(void *ns)
 
 static int ipcns_install(struct nsproxy *nsproxy, void *new)
 {
-	struct ipc_namespace *ns = new;
+	struct ipc_namespace *old_ns, *ns = new;
 	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
 	    !nsown_capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/* Ditch state from the old ipc namespace */
 	exit_sem(current);
-	put_ipc_ns(nsproxy->ipc_ns);
-	nsproxy->ipc_ns = get_ipc_ns(ns);
+	old_ns = nsproxy->ipc_ns;
+	rcu_assign_pointer(nsproxy->ipc_ns, get_ipc_ns(ns));
+	put_ipc_ns(old_ns);
 	return 0;
 }
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index afc0456..4ad9f9f 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -255,6 +255,17 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
 	if (nstype && (ops->type != nstype))
 		goto out;
 
+	/*
+	 * If count == 1, only the current task can increment it,
+	 * by doing a fork for example so we can safely update the
+	 * current nsproxy pointers without allocate a new one,
+	 * update it and destroy the old one
+	 */
+	if (atomic_read(&tsk->nsproxy->count) == 1) {
+		err = ops->install(tsk->nsproxy, ei->ns);
+		goto out;
+	}
+
 	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
 	if (IS_ERR(new_nsproxy)) {
 		err = PTR_ERR(new_nsproxy);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 69473c4..ac10df4 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -326,7 +326,7 @@ static void pidns_put(void *ns)
 static int pidns_install(struct nsproxy *nsproxy, void *ns)
 {
 	struct pid_namespace *active = task_active_pid_ns(current);
-	struct pid_namespace *ancestor, *new = ns;
+	struct pid_namespace *old, *ancestor, *new = ns;
 
 	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
 	    !nsown_capable(CAP_SYS_ADMIN))
@@ -349,8 +349,9 @@ static int pidns_install(struct nsproxy *nsproxy, void *ns)
 	if (ancestor != active)
 		return -EINVAL;
 
-	put_pid_ns(nsproxy->pid_ns);
-	nsproxy->pid_ns = get_pid_ns(new);
+	old = nsproxy->pid_ns;
+	rcu_assign_pointer(nsproxy->pid_ns, get_pid_ns(new));
+	put_pid_ns(old);
 	return 0;
 }
 
diff --git a/kernel/utsname.c b/kernel/utsname.c
index a47fc5d..ba6581b 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -96,7 +96,7 @@ static void *utsns_get(struct task_struct *task)
 	rcu_read_lock();
 	nsproxy = task_nsproxy(task);
 	if (nsproxy) {
-		ns = nsproxy->uts_ns;
+		ns = rcu_dereference(nsproxy->uts_ns);
 		get_uts_ns(ns);
 	}
 	rcu_read_unlock();
@@ -111,15 +111,16 @@ static void utsns_put(void *ns)
 
 static int utsns_install(struct nsproxy *nsproxy, void *new)
 {
-	struct uts_namespace *ns = new;
+	struct uts_namespace *old_ns, *ns = new;
 
 	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
 	    !nsown_capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
 	get_uts_ns(ns);
-	put_uts_ns(nsproxy->uts_ns);
-	nsproxy->uts_ns = ns;
+	old_ns = nsproxy->uts_ns;
+	rcu_assign_pointer(nsproxy->uts_ns, ns);
+	put_uts_ns(old_ns);
 	return 0;
 }
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 80e271d..966d435 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -374,7 +374,7 @@ struct net *get_net_ns_by_pid(pid_t pid)
 		struct nsproxy *nsproxy;
 		nsproxy = task_nsproxy(tsk);
 		if (nsproxy)
-			net = get_net(nsproxy->net_ns);
+			net = get_net(rcu_dereference(nsproxy->net_ns));
 	}
 	rcu_read_unlock();
 	return net;
@@ -634,7 +634,7 @@ static void *netns_get(struct task_struct *task)
 	rcu_read_lock();
 	nsproxy = task_nsproxy(task);
 	if (nsproxy)
-		net = get_net(nsproxy->net_ns);
+		net = get_net(rcu_dereference(nsproxy->net_ns));
 	rcu_read_unlock();
 
 	return net;
@@ -647,14 +647,15 @@ static void netns_put(void *ns)
 
 static int netns_install(struct nsproxy *nsproxy, void *ns)
 {
-	struct net *net = ns;
+	struct net *old_net, *net = ns;
 
 	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) ||
 	    !nsown_capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	put_net(nsproxy->net_ns);
-	nsproxy->net_ns = get_net(net);
+	old_net = nsproxy->net_ns;
+	rcu_assign_pointer(nsproxy->net_ns, get_net(net));
+	put_net(old_net);
 	return 0;
 }
 
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH linux-next v2] ns: do not allocate a new nsproxy at each call
  2013-10-18 14:46       ` [RFC PATCH linux-next v2] " Guillaume Gaudonville
@ 2013-10-18 22:34         ` Eric W. Biederman
  2013-10-22 15:52           ` Guillaume Gaudonville
  0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2013-10-18 22:34 UTC (permalink / raw)
  To: Guillaume Gaudonville
  Cc: linux-kernel, serge.hallyn, akpm, viro, davem, cmetcalf,
	Guillaume Gaudonville, Paul E. McKenney

"Guillaume Gaudonville" <gaudonville@6wind.com> writes:

> Currently, at each call of setns system call a new nsproxy is allocated,
> the old nsproxy namespaces are copied into the new one and the old nsproxy
> is freed if the task was the only one to use it.

In principle this looks ok.  However you are not using rcu properly.

What you are doing is just far enough outside of normal rcu usage my
brain refuses to think it through today.

Paul can you give us a hand?

Specific code comments below.

It looks like what we really want for the pointer variables in nsproxy
is an atomic pointer type.  We have something like that with
ACCESS_ONCE.    Without a prebuilt idiom I am not thinking through this
issue clearly right now.

Maybe you are right that we need to push the rcu protection down a
level.  So we can have free reads and inexpensive writes.

> It can creates large delays on hardware with large number of cpus since
> to free a nsproxy a synchronize_rcu() call is done.
>
> When a task is the only one to use a nsproxy, only the task can do an action
> that will make this nsproxy to be shared by another task or thread (fork,...).
> So when the refcount of the nsproxy is equal to 1, we can simply update the
> current nsproxy field without allocating a new one and freeing the old one.
>
> The install operations of each kind of namespace cannot fails, so there's no
> need to check for an error and calling ops->install().
>
> However since we can have readers of the nsproxy that are not the current task,
> we need to protect access to each namespace pointer in the nsproxy. This is
> done by assigning it using rcu_assign_pointer() and when it is possible
> that the reader is not the current task, read the pointer using
> rcu_dereference().
>
> Finally the install function of each namespace type must be modified to ensure
> that the refcount of the old namespace is released after the assignment in
> nsproxy.
>
> On kernel 3.12-rc1, using a small application that does:
>
> - call setns on a first net namespace and open a socket,
> - call setns on a second net namespace and open a socket,
> - switch back to the first namespace and close the socket,
> - switch back to the second namespace and close the socket,

Note.  You don't need to switch namespaces for any operation except
opening the socket.  Sockets are always fixed in a single network
namespace.

Part of me wonders if this is the time to introduce the socketat system
call I threatend people with a while ago that takes a netns file
descriptor and gives you a socket in the specified namespace.

> On an Intel Westmere with 24 logical cores at 3.33 GHz, it gives the
> following results using the time command:
>
> - without the proposed patch:
>
>   root@blackcloudy:~# time ./test_x86
>
>   real    0m0.130s
>   user    0m0.000s
>   sys     0m0.000s
>
> - with the proposed patch:
>
>   root@blackcloudy:~# time ./test_x86
>
>   real    0m0.020s
>   user    0m0.000s
>   sys     0m0.000s
>
> Reported-by: Chris Metcalf <cmetcalf@tilera.com>
> Signed-off-by: Guillaume Gaudonville <guillaume.gaudonville@6wind.com>
> ---
>
> v2:
>   - protect readers, by releasing namespaces refcount after updating the
>     nsproxy pointer,
>   - protect readers, by using rcu_assign_pointer() to affect nsproxy
>     pointers,
>   - readers need to use rcu_dereference() to access the namespace and
>     must take a reference on it before leaving the rcu_read_lock section
>     (this last part was already present),
>   - do not add additional exit point in setns syscall.
>
> There are still 2 suspicious functions, nfs_server_list_open() and
> nfs_volume_list_open(). They are accessing directly to the net_ns
> like below:
>
> struct net *net = pid_ns->child_reaper->nsproxy->net_ns;
>
> It seems to me that currently they should access it under rcu_read_lock()
> and using task_nsproxy(pid_ns->child_reaper). It looks like a bug, no?
>
> And then with this proposed patch they should access the netns through
> a rcu_dereference and take a reference on the netns. I didn't
> modify them for now, but if it is confirmed I can send a patch
> fixing the first issue and then send a v3 of this proposed patch.
>
>  fs/namespace.c           |    9 +++++----
>  fs/proc/proc_net.c       |    2 +-
>  fs/proc_namespace.c      |    2 +-
>  ipc/namespace.c          |    9 +++++----
>  kernel/nsproxy.c         |   11 +++++++++++
>  kernel/pid_namespace.c   |    7 ++++---
>  kernel/utsname.c         |    9 +++++----
>  net/core/net_namespace.c |   11 ++++++-----
>  8 files changed, 38 insertions(+), 22 deletions(-)
>

[snip]

> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index afc0456..4ad9f9f 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -255,6 +255,17 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
>  	if (nstype && (ops->type != nstype))
>  		goto out;
>  
> +	/*
> +	 * If count == 1, only the current task can increment it,
> +	 * by doing a fork for example so we can safely update the
> +	 * current nsproxy pointers without allocate a new one,
> +	 * update it and destroy the old one
> +	 */
> +	if (atomic_read(&tsk->nsproxy->count) == 1) {
> +		err = ops->install(tsk->nsproxy, ei->ns);
> +		goto out;
> +	}

Typically to modify something you would need a lock, and the barriers
that implies.  We don't need a lock but I don't know if missing the
barriers is a problem.

> +
>  	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
>  	if (IS_ERR(new_nsproxy)) {
>  		err = PTR_ERR(new_nsproxy);

[snip]
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 80e271d..966d435 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -374,7 +374,7 @@ struct net *get_net_ns_by_pid(pid_t pid)
>  		struct nsproxy *nsproxy;
>  		nsproxy = task_nsproxy(tsk);
>  		if (nsproxy)
> -			net = get_net(nsproxy->net_ns);
> +			net = get_net(rcu_dereference(nsproxy->net_ns));

net_ns is not rcu protected so rcu_derference is misleading and wrong.
Perhaps ACCESS_ONCE is what we want here.

>  	}
>  	rcu_read_unlock();
>  	return net;

[snip]

> @@ -647,14 +647,15 @@ static void netns_put(void *ns)
>  
>  static int netns_install(struct nsproxy *nsproxy, void *ns)
>  {
> -	struct net *net = ns;
> +	struct net *old_net, *net = ns;
>  
>  	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) ||
>  	    !nsown_capable(CAP_SYS_ADMIN))
>  		return -EPERM;
>  
> -	put_net(nsproxy->net_ns);
> -	nsproxy->net_ns = get_net(net);
> +	old_net = nsproxy->net_ns;
> +	rcu_assign_pointer(nsproxy->net_ns, get_net(net));
> +	put_net(old_net);

The ordering of operations is correct.  rcu_assign_pointer
is not correct because net_ns is not rcu protected.

>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH linux-next v2] ns: do not allocate a new nsproxy at each call
  2013-10-18 22:34         ` Eric W. Biederman
@ 2013-10-22 15:52           ` Guillaume Gaudonville
  2013-10-22 16:53             ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Guillaume Gaudonville @ 2013-10-22 15:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, serge.hallyn, akpm, viro, davem, cmetcalf,
	Paul E. McKenney

On 10/19/2013 12:34 AM, Eric W. Biederman wrote:
> "Guillaume Gaudonville" <gaudonville@6wind.com> writes:
>
>> Currently, at each call of setns system call a new nsproxy is allocated,
>> the old nsproxy namespaces are copied into the new one and the old nsproxy
>> is freed if the task was the only one to use it.
> In principle this looks ok.  However you are not using rcu properly.
>
> What you are doing is just far enough outside of normal rcu usage my
> brain refuses to think it through today.
Understood, since they are not dereferenced under rcu_read_lock()
and not freed under rcu protection.
> Paul can you give us a hand?
>
> Specific code comments below.
>
> It looks like what we really want for the pointer variables in nsproxy
> is an atomic pointer type.  We have something like that with
> ACCESS_ONCE.    Without a prebuilt idiom I am not thinking through this
> issue clearly right now.
We also want to avoid taking a reference on the old namespace just after 
the
install() function do the put. So we want to read the pointer  and 
increment the
refcount in an atomic way to avoid incrementing a refcount that has 
already gone
to zero.

However, I think this is not needed if we accept to fail to get a 
reference on the
namespace and use the maybe_get_net() (and equivalent for other 
namespaces) in functions
accessing the namespace from the nsproxy. Then we can use the 
ACCESS_ONCE to read the
pointer before giving it to maybe_get_net().
Finally I think a memory barrier is needed to ensure that no compiler 
reordering is done
between the pointer assignment and the put and that the new pointer is 
visible to other
cores before the put.
> Maybe you are right that we need to push the rcu protection down a
> level.  So we can have free reads and inexpensive writes.
>
>> It can creates large delays on hardware with large number of cpus since
>> to free a nsproxy a synchronize_rcu() call is done.
>>
>> When a task is the only one to use a nsproxy, only the task can do an action
>> that will make this nsproxy to be shared by another task or thread (fork,...).
>> So when the refcount of the nsproxy is equal to 1, we can simply update the
>> current nsproxy field without allocating a new one and freeing the old one.
>>
>> The install operations of each kind of namespace cannot fails, so there's no
>> need to check for an error and calling ops->install().
>>
>> However since we can have readers of the nsproxy that are not the current task,
>> we need to protect access to each namespace pointer in the nsproxy. This is
>> done by assigning it using rcu_assign_pointer() and when it is possible
>> that the reader is not the current task, read the pointer using
>> rcu_dereference().
>>
>> Finally the install function of each namespace type must be modified to ensure
>> that the refcount of the old namespace is released after the assignment in
>> nsproxy.
>>
>> On kernel 3.12-rc1, using a small application that does:
>>
>> - call setns on a first net namespace and open a socket,
>> - call setns on a second net namespace and open a socket,
>> - switch back to the first namespace and close the socket,
>> - switch back to the second namespace and close the socket,
> Note.  You don't need to switch namespaces for any operation except
> opening the socket.  Sockets are always fixed in a single network
> namespace.
>
> Part of me wonders if this is the time to introduce the socketat system
> call I threatend people with a while ago that takes a netns file
> descriptor and gives you a socket in the specified namespace.
>
>> On an Intel Westmere with 24 logical cores at 3.33 GHz, it gives the
>> following results using the time command:
>>
>> - without the proposed patch:
>>
>>    root@blackcloudy:~# time ./test_x86
>>
>>    real    0m0.130s
>>    user    0m0.000s
>>    sys     0m0.000s
>>
>> - with the proposed patch:
>>
>>    root@blackcloudy:~# time ./test_x86
>>
>>    real    0m0.020s
>>    user    0m0.000s
>>    sys     0m0.000s
>>
>> Reported-by: Chris Metcalf <cmetcalf@tilera.com>
>> Signed-off-by: Guillaume Gaudonville <guillaume.gaudonville@6wind.com>
>> ---
>>
>> v2:
>>    - protect readers, by releasing namespaces refcount after updating the
>>      nsproxy pointer,
>>    - protect readers, by using rcu_assign_pointer() to affect nsproxy
>>      pointers,
>>    - readers need to use rcu_dereference() to access the namespace and
>>      must take a reference on it before leaving the rcu_read_lock section
>>      (this last part was already present),
>>    - do not add additional exit point in setns syscall.
>>
>> There are still 2 suspicious functions, nfs_server_list_open() and
>> nfs_volume_list_open(). They are accessing directly to the net_ns
>> like below:
>>
>> struct net *net = pid_ns->child_reaper->nsproxy->net_ns;
>>
>> It seems to me that currently they should access it under rcu_read_lock()
>> and using task_nsproxy(pid_ns->child_reaper). It looks like a bug, no?
Do you agree there's also an issue around there?
>>
>> And then with this proposed patch they should access the netns through
>> a rcu_dereference and take a reference on the netns. I didn't
>> modify them for now, but if it is confirmed I can send a patch
>> fixing the first issue and then send a v3 of this proposed patch.
>>
>>   fs/namespace.c           |    9 +++++----
>>   fs/proc/proc_net.c       |    2 +-
>>   fs/proc_namespace.c      |    2 +-
>>   ipc/namespace.c          |    9 +++++----
>>   kernel/nsproxy.c         |   11 +++++++++++
>>   kernel/pid_namespace.c   |    7 ++++---
>>   kernel/utsname.c         |    9 +++++----
>>   net/core/net_namespace.c |   11 ++++++-----
>>   8 files changed, 38 insertions(+), 22 deletions(-)
>>
> [snip]
>
>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> index afc0456..4ad9f9f 100644
>> --- a/kernel/nsproxy.c
>> +++ b/kernel/nsproxy.c
>> @@ -255,6 +255,17 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
>>   	if (nstype && (ops->type != nstype))
>>   		goto out;
>>   
>> +	/*
>> +	 * If count == 1, only the current task can increment it,
>> +	 * by doing a fork for example so we can safely update the
>> +	 * current nsproxy pointers without allocate a new one,
>> +	 * update it and destroy the old one
>> +	 */
>> +	if (atomic_read(&tsk->nsproxy->count) == 1) {
>> +		err = ops->install(tsk->nsproxy, ei->ns);
>> +		goto out;
>> +	}
> Typically to modify something you would need a lock, and the barriers
> that implies.  We don't need a lock but I don't know if missing the
> barriers is a problem.
>
>> +
>>   	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
>>   	if (IS_ERR(new_nsproxy)) {
>>   		err = PTR_ERR(new_nsproxy);
> [snip]
>> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>> index 80e271d..966d435 100644
>> --- a/net/core/net_namespace.c
>> +++ b/net/core/net_namespace.c
>> @@ -374,7 +374,7 @@ struct net *get_net_ns_by_pid(pid_t pid)
>>   		struct nsproxy *nsproxy;
>>   		nsproxy = task_nsproxy(tsk);
>>   		if (nsproxy)
>> -			net = get_net(nsproxy->net_ns);
>> +			net = get_net(rcu_dereference(nsproxy->net_ns));
> net_ns is not rcu protected so rcu_derference is misleading and wrong.
> Perhaps ACCESS_ONCE is what we want here.
Agreed. Following above comments it would become something like:
- net = get_net(nsproxy->net_ns);
+ net = maybe_get_net(ACCESS_ONCE(nsproxy->net_ns));
>
>>   	}
>>   	rcu_read_unlock();
>>   	return net;
> [snip]
>
>> @@ -647,14 +647,15 @@ static void netns_put(void *ns)
>>   
>>   static int netns_install(struct nsproxy *nsproxy, void *ns)
>>   {
>> -	struct net *net = ns;
>> +	struct net *old_net, *net = ns;
>>   
>>   	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) ||
>>   	    !nsown_capable(CAP_SYS_ADMIN))
>>   		return -EPERM;
>>   
>> -	put_net(nsproxy->net_ns);
>> -	nsproxy->net_ns = get_net(net);
>> +	old_net = nsproxy->net_ns;
>> +	rcu_assign_pointer(nsproxy->net_ns, get_net(net));
>> +	put_net(old_net);
> The ordering of operations is correct.  rcu_assign_pointer
> is not correct because net_ns is not rcu protected.
Agreed, I think we need a barrier between the pointer assignment and
the put, something like:

nsproxy->net_ns = get_net(net);
smp_wmb();
put_net(old_net);
>>   	return 0;
>>   }



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH linux-next v2] ns: do not allocate a new nsproxy at each call
  2013-10-22 15:52           ` Guillaume Gaudonville
@ 2013-10-22 16:53             ` Paul E. McKenney
  2013-10-22 18:47               ` Eric W. Biederman
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2013-10-22 16:53 UTC (permalink / raw)
  To: Guillaume Gaudonville
  Cc: Eric W. Biederman, linux-kernel, serge.hallyn, akpm, viro, davem,
	cmetcalf

On Tue, Oct 22, 2013 at 05:52:49PM +0200, Guillaume Gaudonville wrote:
> On 10/19/2013 12:34 AM, Eric W. Biederman wrote:
> >"Guillaume Gaudonville" <gaudonville@6wind.com> writes:
> >
> >>Currently, at each call of setns system call a new nsproxy is allocated,
> >>the old nsproxy namespaces are copied into the new one and the old nsproxy
> >>is freed if the task was the only one to use it.
> >In principle this looks ok.  However you are not using rcu properly.
> >
> >What you are doing is just far enough outside of normal rcu usage my
> >brain refuses to think it through today.
> Understood, since they are not dereferenced under rcu_read_lock()
> and not freed under rcu protection.
> >Paul can you give us a hand?

Maybe...

First let me see if I understand what you are trying to do...

The pattern of code is consistent with a situation where you add data
elements to a linked structure, but restrict removals in one of the
following ways:

1.	Never do removals.

2.	Any data elements removed are leaked, so that they are never
	reused.

3.	Any data elements removed are added elsewhere in the overall
	linked structure, and the possibility of readers being carried
	along with a given data element is correctly dealt with.  This
	is actually useful in some cases, but it is harder to get right
	than it might initially appear.

In this case, insertions can use rcu_assign_pointer() or one of
the higher-level primitives based on rcu_assign_pointer(), such as
list_add_rcu().  Traversals can use rcu_dereference().

If for some reason rcu_assign_pointer() is considered bad, you would
need to do something like the following:

	q = kmalloc(...);
	initialize(q);
	smp_wmb();
	ACCESS_ONCE(global_q) = q;

Similarly, if for some reason rcu_dereference() is considered bad, you
would need to do something like the following:

	q = ACCESS_ONCE(global_q);
	smp_read_barrier_depends();
	do_something_with(q);

But I would recommend sticking with rcu_assign_pointer() and
rcu_dereference().  They encapsulate the needed operations nicely
and their action is well understood.

Of course, if you don't have RCU read-side critical sections, then
rcu_dereference() will complain given CONFIG_PROVE_RCU=y.  One way
to avoid this is to use rcu_dereference_raw() instead of
rcu_dereference(), but in this case please add a comment saying
what you are doing.

Does this make sense, or am I confused about what you are trying to do?

							Thanx, Paul

> >Specific code comments below.
> >
> >It looks like what we really want for the pointer variables in nsproxy
> >is an atomic pointer type.  We have something like that with
> >ACCESS_ONCE.    Without a prebuilt idiom I am not thinking through this
> >issue clearly right now.
> We also want to avoid taking a reference on the old namespace just
> after the
> install() function do the put. So we want to read the pointer  and
> increment the
> refcount in an atomic way to avoid incrementing a refcount that has
> already gone
> to zero.
> 
> However, I think this is not needed if we accept to fail to get a
> reference on the
> namespace and use the maybe_get_net() (and equivalent for other
> namespaces) in functions
> accessing the namespace from the nsproxy. Then we can use the
> ACCESS_ONCE to read the
> pointer before giving it to maybe_get_net().
> Finally I think a memory barrier is needed to ensure that no
> compiler reordering is done
> between the pointer assignment and the put and that the new pointer
> is visible to other
> cores before the put.
> >Maybe you are right that we need to push the rcu protection down a
> >level.  So we can have free reads and inexpensive writes.
> >
> >>It can creates large delays on hardware with large number of cpus since
> >>to free a nsproxy a synchronize_rcu() call is done.
> >>
> >>When a task is the only one to use a nsproxy, only the task can do an action
> >>that will make this nsproxy to be shared by another task or thread (fork,...).
> >>So when the refcount of the nsproxy is equal to 1, we can simply update the
> >>current nsproxy field without allocating a new one and freeing the old one.
> >>
> >>The install operations of each kind of namespace cannot fails, so there's no
> >>need to check for an error and calling ops->install().
> >>
> >>However since we can have readers of the nsproxy that are not the current task,
> >>we need to protect access to each namespace pointer in the nsproxy. This is
> >>done by assigning it using rcu_assign_pointer() and when it is possible
> >>that the reader is not the current task, read the pointer using
> >>rcu_dereference().
> >>
> >>Finally the install function of each namespace type must be modified to ensure
> >>that the refcount of the old namespace is released after the assignment in
> >>nsproxy.
> >>
> >>On kernel 3.12-rc1, using a small application that does:
> >>
> >>- call setns on a first net namespace and open a socket,
> >>- call setns on a second net namespace and open a socket,
> >>- switch back to the first namespace and close the socket,
> >>- switch back to the second namespace and close the socket,
> >Note.  You don't need to switch namespaces for any operation except
> >opening the socket.  Sockets are always fixed in a single network
> >namespace.
> >
> >Part of me wonders if this is the time to introduce the socketat system
> >call I threatend people with a while ago that takes a netns file
> >descriptor and gives you a socket in the specified namespace.
> >
> >>On an Intel Westmere with 24 logical cores at 3.33 GHz, it gives the
> >>following results using the time command:
> >>
> >>- without the proposed patch:
> >>
> >>   root@blackcloudy:~# time ./test_x86
> >>
> >>   real    0m0.130s
> >>   user    0m0.000s
> >>   sys     0m0.000s
> >>
> >>- with the proposed patch:
> >>
> >>   root@blackcloudy:~# time ./test_x86
> >>
> >>   real    0m0.020s
> >>   user    0m0.000s
> >>   sys     0m0.000s
> >>
> >>Reported-by: Chris Metcalf <cmetcalf@tilera.com>
> >>Signed-off-by: Guillaume Gaudonville <guillaume.gaudonville@6wind.com>
> >>---
> >>
> >>v2:
> >>   - protect readers, by releasing namespaces refcount after updating the
> >>     nsproxy pointer,
> >>   - protect readers, by using rcu_assign_pointer() to affect nsproxy
> >>     pointers,
> >>   - readers need to use rcu_dereference() to access the namespace and
> >>     must take a reference on it before leaving the rcu_read_lock section
> >>     (this last part was already present),
> >>   - do not add additional exit point in setns syscall.
> >>
> >>There are still 2 suspicious functions, nfs_server_list_open() and
> >>nfs_volume_list_open(). They are accessing directly to the net_ns
> >>like below:
> >>
> >>struct net *net = pid_ns->child_reaper->nsproxy->net_ns;
> >>
> >>It seems to me that currently they should access it under rcu_read_lock()
> >>and using task_nsproxy(pid_ns->child_reaper). It looks like a bug, no?
> Do you agree there's also an issue around there?
> >>
> >>And then with this proposed patch they should access the netns through
> >>a rcu_dereference and take a reference on the netns. I didn't
> >>modify them for now, but if it is confirmed I can send a patch
> >>fixing the first issue and then send a v3 of this proposed patch.
> >>
> >>  fs/namespace.c           |    9 +++++----
> >>  fs/proc/proc_net.c       |    2 +-
> >>  fs/proc_namespace.c      |    2 +-
> >>  ipc/namespace.c          |    9 +++++----
> >>  kernel/nsproxy.c         |   11 +++++++++++
> >>  kernel/pid_namespace.c   |    7 ++++---
> >>  kernel/utsname.c         |    9 +++++----
> >>  net/core/net_namespace.c |   11 ++++++-----
> >>  8 files changed, 38 insertions(+), 22 deletions(-)
> >>
> >[snip]
> >
> >>diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> >>index afc0456..4ad9f9f 100644
> >>--- a/kernel/nsproxy.c
> >>+++ b/kernel/nsproxy.c
> >>@@ -255,6 +255,17 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
> >>  	if (nstype && (ops->type != nstype))
> >>  		goto out;
> >>+	/*
> >>+	 * If count == 1, only the current task can increment it,
> >>+	 * by doing a fork for example so we can safely update the
> >>+	 * current nsproxy pointers without allocate a new one,
> >>+	 * update it and destroy the old one
> >>+	 */
> >>+	if (atomic_read(&tsk->nsproxy->count) == 1) {
> >>+		err = ops->install(tsk->nsproxy, ei->ns);
> >>+		goto out;
> >>+	}
> >Typically to modify something you would need a lock, and the barriers
> >that implies.  We don't need a lock but I don't know if missing the
> >barriers is a problem.
> >
> >>+
> >>  	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
> >>  	if (IS_ERR(new_nsproxy)) {
> >>  		err = PTR_ERR(new_nsproxy);
> >[snip]
> >>diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> >>index 80e271d..966d435 100644
> >>--- a/net/core/net_namespace.c
> >>+++ b/net/core/net_namespace.c
> >>@@ -374,7 +374,7 @@ struct net *get_net_ns_by_pid(pid_t pid)
> >>  		struct nsproxy *nsproxy;
> >>  		nsproxy = task_nsproxy(tsk);
> >>  		if (nsproxy)
> >>-			net = get_net(nsproxy->net_ns);
> >>+			net = get_net(rcu_dereference(nsproxy->net_ns));
> >net_ns is not rcu protected so rcu_derference is misleading and wrong.
> >Perhaps ACCESS_ONCE is what we want here.
> Agreed. Following above comments it would become something like:
> - net = get_net(nsproxy->net_ns);
> + net = maybe_get_net(ACCESS_ONCE(nsproxy->net_ns));
> >
> >>  	}
> >>  	rcu_read_unlock();
> >>  	return net;
> >[snip]
> >
> >>@@ -647,14 +647,15 @@ static void netns_put(void *ns)
> >>  static int netns_install(struct nsproxy *nsproxy, void *ns)
> >>  {
> >>-	struct net *net = ns;
> >>+	struct net *old_net, *net = ns;
> >>  	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) ||
> >>  	    !nsown_capable(CAP_SYS_ADMIN))
> >>  		return -EPERM;
> >>-	put_net(nsproxy->net_ns);
> >>-	nsproxy->net_ns = get_net(net);
> >>+	old_net = nsproxy->net_ns;
> >>+	rcu_assign_pointer(nsproxy->net_ns, get_net(net));
> >>+	put_net(old_net);
> >The ordering of operations is correct.  rcu_assign_pointer
> >is not correct because net_ns is not rcu protected.
> Agreed, I think we need a barrier between the pointer assignment and
> the put, something like:
> 
> nsproxy->net_ns = get_net(net);
> smp_wmb();
> put_net(old_net);
> >>  	return 0;
> >>  }
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH linux-next v2] ns: do not allocate a new nsproxy at each call
  2013-10-22 16:53             ` Paul E. McKenney
@ 2013-10-22 18:47               ` Eric W. Biederman
  2013-10-22 19:44                 ` Eric W. Biederman
  0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2013-10-22 18:47 UTC (permalink / raw)
  To: paulmck
  Cc: Guillaume Gaudonville, linux-kernel, serge.hallyn, akpm, viro,
	davem, cmetcalf

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Tue, Oct 22, 2013 at 05:52:49PM +0200, Guillaume Gaudonville wrote:
>> On 10/19/2013 12:34 AM, Eric W. Biederman wrote:
>> >"Guillaume Gaudonville" <gaudonville@6wind.com> writes:
>> >
>> >>Currently, at each call of setns system call a new nsproxy is allocated,
>> >>the old nsproxy namespaces are copied into the new one and the old nsproxy
>> >>is freed if the task was the only one to use it.
>> >In principle this looks ok.  However you are not using rcu properly.
>> >
>> >What you are doing is just far enough outside of normal rcu usage my
>> >brain refuses to think it through today.
>> Understood, since they are not dereferenced under rcu_read_lock()
>> and not freed under rcu protection.
>> >Paul can you give us a hand?
>
> Maybe...
>
> First let me see if I understand what you are trying to do...
>
> The pattern of code is consistent with a situation where you add data
> elements to a linked structure, but restrict removals in one of the
> following ways:
>
> 1.	Never do removals.
>
> 2.	Any data elements removed are leaked, so that they are never
> 	reused.
>
> 3.	Any data elements removed are added elsewhere in the overall
> 	linked structure, and the possibility of readers being carried
> 	along with a given data element is correctly dealt with.  This
> 	is actually useful in some cases, but it is harder to get right
> 	than it might initially appear.
>
> In this case, insertions can use rcu_assign_pointer() or one of
> the higher-level primitives based on rcu_assign_pointer(), such as
> list_add_rcu().  Traversals can use rcu_dereference().
>
> If for some reason rcu_assign_pointer() is considered bad, you would
> need to do something like the following:
>
> 	q = kmalloc(...);
> 	initialize(q);
> 	smp_wmb();
> 	ACCESS_ONCE(global_q) = q;
>
> Similarly, if for some reason rcu_dereference() is considered bad, you
> would need to do something like the following:
>
> 	q = ACCESS_ONCE(global_q);
> 	smp_read_barrier_depends();
> 	do_something_with(q);
>
> But I would recommend sticking with rcu_assign_pointer() and
> rcu_dereference().  They encapsulate the needed operations nicely
> and their action is well understood.
>
> Of course, if you don't have RCU read-side critical sections, then
> rcu_dereference() will complain given CONFIG_PROVE_RCU=y.  One way
> to avoid this is to use rcu_dereference_raw() instead of
> rcu_dereference(), but in this case please add a comment saying
> what you are doing.
>
> Does this make sense, or am I confused about what you are trying to
> do?

Roughly I think it makes sense.  I am still not certain if
rcu_assign_pointer without the barriers that come with a lock is safe.

My brain has processes this enough to say that we need to mark the
pointer in nsproxy as rcu pointers and use normal rcu discipline on them
regardless of the outcome of this patch.

Unfortunately there is a moderately deep problem with this approach.

Today we have normal refcounting and no rcu on the namespaces as seen by
nsproxy.  And we get our rcu liveness guarantees by calling
synchronize_rcu before putting the nsproxy.  Generally that is not a
problem.

There are cases where the synchronize_rcu makes the setns syscall take
more time than people would like to pay.  Guillaume was attempting to
optimize that out.

Howevever fundamentally without chaning the namespace reference counting
we can not optimize out the synchronize_rcu, or else things like put_net
will execute too soon and we may point at a trashed data structure
inside of the rcu critical section.

This review has pointed out quite a bit of bit rot in the code that
needs to be cleaned up.  Unfortunately the optimization is invalid.

The best I can see happening is adding a rcu head into nsproxy and
using call_rcu to delay the dropping of the reference counts, and I
don't think that is worth it.

>> >>There are still 2 suspicious functions, nfs_server_list_open() and
>> >>nfs_volume_list_open(). They are accessing directly to the net_ns
>> >>like below:
>> >>
>> >>struct net *net = pid_ns->child_reaper->nsproxy->net_ns;

Ick.

>> >>
>> >>It seems to me that currently they should access it under rcu_read_lock()
>> >>and using task_nsproxy(pid_ns->child_reaper). It looks like a bug, no?
>> Do you agree there's also an issue around there?

Yes.  That pid_ns->child_reaper access is ugly.   The child_reaper in
protected by the task_list_lock which is not held there.

Then once you have safely read child_reaper than you need task_nsproxy
and all of the rcu fun.

>> >>@@ -647,14 +647,15 @@ static void netns_put(void *ns)
>> >>  static int netns_install(struct nsproxy *nsproxy, void *ns)
>> >>  {
>> >>-	struct net *net = ns;
>> >>+	struct net *old_net, *net = ns;
>> >>  	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) ||
>> >>  	    !nsown_capable(CAP_SYS_ADMIN))
>> >>  		return -EPERM;
>> >>-	put_net(nsproxy->net_ns);
>> >>-	nsproxy->net_ns = get_net(net);
>> >>+	old_net = nsproxy->net_ns;
>> >>+	rcu_assign_pointer(nsproxy->net_ns, get_net(net));
>> >>+	put_net(old_net);
>> >The ordering of operations is correct.  rcu_assign_pointer
>> >is not correct because net_ns is not rcu protected.
>> Agreed, I think we need a barrier between the pointer assignment and
>> the put, something like:
>> 
>> nsproxy->net_ns = get_net(net);
>> smp_wmb();
>> put_net(old_net);


There is a deep problem here.  From Paul's comments and thinking about it
this line seems fine.

	rcu_assign_pointer(nsproxy->net_ns, get_net(net));

However to be safe with the current guarantees the code needs to be:
	synchronize_rcu();
	put_net(old_net);

I am pretty certain we want to apply a patch that does some rcu things
with nsproxy for documentation purposes.  Plus the rcu_dereferences you
added.  Something like the patch below.

I don't know if that will make using current->nsproxy->foo_ns warn but
shrug.

Eric

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index b4ec59d159ac..63c19cdfdbfd 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -28,11 +28,11 @@ struct fs_struct;
  */
 struct nsproxy {
        atomic_t count;
-       struct uts_namespace *uts_ns;
-       struct ipc_namespace *ipc_ns;
-       struct mnt_namespace *mnt_ns;
-       struct pid_namespace *pid_ns_for_children;
-       struct net           *net_ns;
+       struct uts_namespace __rcu *uts_ns;
+       struct ipc_namespace __rcu *ipc_ns;
+       struct mnt_namespace __rcu *mnt_ns;
+       struct pid_namespace __rcu *pid_ns_for_children;
+       struct net           __rcu *net_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 373f3abac94e..70c88d96053f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1200,7 +1200,7 @@ struct task_struct {
 /* open file information */
        struct files_struct *files;
 /* namespaces */
-       struct nsproxy *nsproxy;
+       struct nsproxy __rcu *nsproxy;
 /* signal handlers */
        struct signal_struct *signal;
        struct sighand_struct *sighand;





^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH linux-next v2] ns: do not allocate a new nsproxy at each call
  2013-10-22 18:47               ` Eric W. Biederman
@ 2013-10-22 19:44                 ` Eric W. Biederman
  2013-10-22 21:42                   ` Paul E. McKenney
  2013-10-23  8:27                   ` Guillaume Gaudonville
  0 siblings, 2 replies; 12+ messages in thread
From: Eric W. Biederman @ 2013-10-22 19:44 UTC (permalink / raw)
  To: Guillaume Gaudonville
  Cc: linux-kernel, serge.hallyn, akpm, viro, davem, cmetcalf, paulmck


To be succint.

Mutation of nsproxy in place was a distraction.

What is crucial to the current operation of the code is

synchronize_rcu();
put_pid_ns();
put_net_ns();
...

To remove the syncrhonize_rcu we would have to either user call_rcu or
make certain all of the namespaces have some kind of rcu liveness
guarantee (which many of them do) and use something like maybe_get_net.

If you are going to pursue this the maybe_get_net direction is my
preference as that is what we would need if we did not have nsproxy
and so will be simpler overall.

Hmm.  On the side of simple it may be appropriate to revisit the patch
that started using rcu protection for nsproxy.   I doesn't look like
the original reasons for nsproxy being rcu protected exist any more,
so reverting to task_lock protect may be enough..

And it would result in faster/simpler code that only slows down when we
perform a remote access, which should be far from common.

commit cf7b708c8d1d7a27736771bcf4c457b332b0f818
Author: Pavel Emelyanov <xemul@openvz.org>
Date:   Thu Oct 18 23:39:54 2007 -0700

    Make access to task's nsproxy lighter
    
    When someone wants to deal with some other taks's namespaces it has to lock
    the task and then to get the desired namespace if the one exists.  This is
    slow on read-only paths and may be impossible in some cases.
    
    E.g.  Oleg recently noticed a race between unshare() and the (sent for
    review in cgroups) pid namespaces - when the task notifies the parent it
    has to know the parent's namespace, but taking the task_lock() is
    impossible there - the code is under write locked tasklist lock.
    
    On the other hand switching the namespace on task (daemonize) and releasing
    the namespace (after the last task exit) is rather rare operation and we
    can sacrifice its speed to solve the issues above.
    
    The access to other task namespaces is proposed to be performed
    like this:
    
         rcu_read_lock();
         nsproxy = task_nsproxy(tsk);
         if (nsproxy != NULL) {
                 / *
                   * work with the namespaces here
                   * e.g. get the reference on one of them
                   * /
         } / *
             * NULL task_nsproxy() means that this task is
             * almost dead (zombie)
             * /
         rcu_read_unlock();
    
    This patch has passed the review by Eric and Oleg :) and,
    of course, tested.


Eric

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH linux-next v2] ns: do not allocate a new nsproxy at each call
  2013-10-22 19:44                 ` Eric W. Biederman
@ 2013-10-22 21:42                   ` Paul E. McKenney
  2013-10-23  8:27                   ` Guillaume Gaudonville
  1 sibling, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2013-10-22 21:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Guillaume Gaudonville, linux-kernel, serge.hallyn, akpm, viro,
	davem, cmetcalf

On Tue, Oct 22, 2013 at 12:44:20PM -0700, Eric W. Biederman wrote:
> 
> To be succint.
> 
> Mutation of nsproxy in place was a distraction.
> 
> What is crucial to the current operation of the code is
> 
> synchronize_rcu();
> put_pid_ns();
> put_net_ns();
> ...
> 
> To remove the syncrhonize_rcu we would have to either user call_rcu or
> make certain all of the namespaces have some kind of rcu liveness
> guarantee (which many of them do) and use something like maybe_get_net.

You beat me to the call_rcu() suggestion.  If the callback needs to
do something that might sleep, the usual trick is a workqueue scheduled
from the RCU callback.

> If you are going to pursue this the maybe_get_net direction is my
> preference as that is what we would need if we did not have nsproxy
> and so will be simpler overall.
> 
> Hmm.  On the side of simple it may be appropriate to revisit the patch
> that started using rcu protection for nsproxy.   I doesn't look like
> the original reasons for nsproxy being rcu protected exist any more,
> so reverting to task_lock protect may be enough..
> 
> And it would result in faster/simpler code that only slows down when we
> perform a remote access, which should be far from common.

That can be a good option, also.  ;-)

							Thanx, Paul

> commit cf7b708c8d1d7a27736771bcf4c457b332b0f818
> Author: Pavel Emelyanov <xemul@openvz.org>
> Date:   Thu Oct 18 23:39:54 2007 -0700
> 
>     Make access to task's nsproxy lighter
>     
>     When someone wants to deal with some other taks's namespaces it has to lock
>     the task and then to get the desired namespace if the one exists.  This is
>     slow on read-only paths and may be impossible in some cases.
>     
>     E.g.  Oleg recently noticed a race between unshare() and the (sent for
>     review in cgroups) pid namespaces - when the task notifies the parent it
>     has to know the parent's namespace, but taking the task_lock() is
>     impossible there - the code is under write locked tasklist lock.
>     
>     On the other hand switching the namespace on task (daemonize) and releasing
>     the namespace (after the last task exit) is rather rare operation and we
>     can sacrifice its speed to solve the issues above.
>     
>     The access to other task namespaces is proposed to be performed
>     like this:
>     
>          rcu_read_lock();
>          nsproxy = task_nsproxy(tsk);
>          if (nsproxy != NULL) {
>                  / *
>                    * work with the namespaces here
>                    * e.g. get the reference on one of them
>                    * /
>          } / *
>              * NULL task_nsproxy() means that this task is
>              * almost dead (zombie)
>              * /
>          rcu_read_unlock();
>     
>     This patch has passed the review by Eric and Oleg :) and,
>     of course, tested.
> 
> 
> Eric
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH linux-next v2] ns: do not allocate a new nsproxy at each call
  2013-10-22 19:44                 ` Eric W. Biederman
  2013-10-22 21:42                   ` Paul E. McKenney
@ 2013-10-23  8:27                   ` Guillaume Gaudonville
  1 sibling, 0 replies; 12+ messages in thread
From: Guillaume Gaudonville @ 2013-10-23  8:27 UTC (permalink / raw)
  To: Eric W. Biederman, paulmck
  Cc: linux-kernel, serge.hallyn, akpm, viro, davem, cmetcalf

On 10/22/2013 09:44 PM, Eric W. Biederman wrote:
> To be succint.
>
> Mutation of nsproxy in place was a distraction.
>
> What is crucial to the current operation of the code is
>
> synchronize_rcu();
> put_pid_ns();
> put_net_ns();
> ...
>
> To remove the syncrhonize_rcu we would have to either user call_rcu or
> make certain all of the namespaces have some kind of rcu liveness
> guarantee (which many of them do) and use something like maybe_get_net.
>
> If you are going to pursue this the maybe_get_net direction is my
> preference as that is what we would need if we did not have nsproxy
> and so will be simpler overall.
>
> Hmm.  On the side of simple it may be appropriate to revisit the patch
> that started using rcu protection for nsproxy.   I doesn't look like
> the original reasons for nsproxy being rcu protected exist any more,
> so reverting to task_lock protect may be enough..
>
> And it would result in faster/simpler code that only slows down when we
> perform a remote access, which should be far from common.
Ok, let me think a bit of these new directions and I'll come back to you,
thanks for your help guys.
> commit cf7b708c8d1d7a27736771bcf4c457b332b0f818
> Author: Pavel Emelyanov <xemul@openvz.org>
> Date:   Thu Oct 18 23:39:54 2007 -0700
>
>      Make access to task's nsproxy lighter
>      
>      When someone wants to deal with some other taks's namespaces it has to lock
>      the task and then to get the desired namespace if the one exists.  This is
>      slow on read-only paths and may be impossible in some cases.
>      
>      E.g.  Oleg recently noticed a race between unshare() and the (sent for
>      review in cgroups) pid namespaces - when the task notifies the parent it
>      has to know the parent's namespace, but taking the task_lock() is
>      impossible there - the code is under write locked tasklist lock.
>      
>      On the other hand switching the namespace on task (daemonize) and releasing
>      the namespace (after the last task exit) is rather rare operation and we
>      can sacrifice its speed to solve the issues above.
>      
>      The access to other task namespaces is proposed to be performed
>      like this:
>      
>           rcu_read_lock();
>           nsproxy = task_nsproxy(tsk);
>           if (nsproxy != NULL) {
>                   / *
>                     * work with the namespaces here
>                     * e.g. get the reference on one of them
>                     * /
>           } / *
>               * NULL task_nsproxy() means that this task is
>               * almost dead (zombie)
>               * /
>           rcu_read_unlock();
>      
>      This patch has passed the review by Eric and Oleg :) and,
>      of course, tested.
>
>
> Eric


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-10-23  8:27 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-15 17:35 [RFC PATCH linux-next] ns: do not allocate a new nsproxy at each call Guillaume Gaudonville
2013-10-15 18:40 ` Eric W. Biederman
2013-10-16  8:48   ` Guillaume Gaudonville
2013-10-16 19:42     ` Eric W. Biederman
2013-10-18 14:46       ` [RFC PATCH linux-next v2] " Guillaume Gaudonville
2013-10-18 22:34         ` Eric W. Biederman
2013-10-22 15:52           ` Guillaume Gaudonville
2013-10-22 16:53             ` Paul E. McKenney
2013-10-22 18:47               ` Eric W. Biederman
2013-10-22 19:44                 ` Eric W. Biederman
2013-10-22 21:42                   ` Paul E. McKenney
2013-10-23  8:27                   ` Guillaume Gaudonville

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).