linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] pid_ns: Allow to get pid_for_children ns before child_reaper is created
@ 2017-05-23 16:29 Kirill Tkhai
  2017-05-27 11:01 ` Eric W. Biederman
  0 siblings, 1 reply; 4+ messages in thread
From: Kirill Tkhai @ 2017-05-23 16:29 UTC (permalink / raw)
  To: mhocko, avagin, peterz, linux-kernel, oleg, rppt, ebiederm, luto,
	gorcunov, akpm, mingo, ktkhai, serge

This patch prohibits pid allocation till child_reaper
of pid namespace is set, and it makes possible and safe
to get just unshared pid_ns from "/proc/[pid]/ns/pid_for_children"
file. This may be useful to determine user_ns of such a created
pid_ns, which is not possible now.

It was prohibited till now, because the architecture of pid namespaces
assumes child reaper is the firstly created process of the namespace,
and it initializes pid_namespace::proc_mnt. Child reaper creation
mustn't race with creation of another processes from this namespace,
otherwise a process with pid > 1 may die before pid_namespace::proc_mnt
is populated and it will get a null pointer dereference in proc_flush_task().
Also, child reaper mustn't die before processes from the namespace.

The patch prevents such races. It allows to alloc_pid() only if
ns->child_reaper is already set, and it guarantees, that
pid_namespace::proc_mnt is already populated. Also, we do the assignment
under the tasklist_lock in the copy_process() stage, when it can't fail.
This guarantees the child_reaper will be hashed before the concurrent
process, so the concurrent process can't die before it. When child reaper
dies before the concurrent hashes to task list, fork() of the concurrent
will aborts as it's prohibited after commit 3fd372262166:
"pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()".
So, we can't safely allow to get "/proc/[pid]/ns/pid_for_children"
since it's created.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Serge Hallyn <serge@hallyn.com>
CC: Michal Hocko <mhocko@suse.com>
CC: Andrei Vagin <avagin@openvz.org>
CC: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <peterz@infradead.org>
---
 kernel/pid.c           |    3 ++-
 kernel/pid_namespace.c |    9 ---------
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index fd1cde1e4576..eeeb01fdd87c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -334,7 +334,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
-	if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
+	if (!(ns->nr_hashed & PIDNS_HASH_ADDING) ||
+	    (!ns->child_reaper && !is_child_reaper(pid)))
 		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {
 		hlist_add_head_rcu(&upid->pid_chain,
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 74a5a7255b4d..51dd1d490542 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -385,15 +385,6 @@ static struct ns_common *pidns_for_children_get(struct task_struct *task)
 	}
 	task_unlock(task);
 
-	if (ns) {
-		read_lock(&tasklist_lock);
-		if (!ns->child_reaper) {
-			put_pid_ns(ns);
-			ns = NULL;
-		}
-		read_unlock(&tasklist_lock);
-	}
-
 	return ns ? &ns->ns : NULL;
 }
 

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] pid_ns: Allow to get pid_for_children ns before child_reaper is created
  2017-05-23 16:29 [PATCH] pid_ns: Allow to get pid_for_children ns before child_reaper is created Kirill Tkhai
@ 2017-05-27 11:01 ` Eric W. Biederman
  2017-05-29 10:49   ` Kirill Tkhai
  0 siblings, 1 reply; 4+ messages in thread
From: Eric W. Biederman @ 2017-05-27 11:01 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: mhocko, avagin, peterz, linux-kernel, oleg, rppt, luto, gorcunov,
	akpm, mingo, serge

Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> This patch prohibits pid allocation till child_reaper
> of pid namespace is set, and it makes possible and safe
> to get just unshared pid_ns from "/proc/[pid]/ns/pid_for_children"
> file. This may be useful to determine user_ns of such a created
> pid_ns, which is not possible now.
>
> It was prohibited till now, because the architecture of pid namespaces
> assumes child reaper is the firstly created process of the namespace,
> and it initializes pid_namespace::proc_mnt. Child reaper creation
> mustn't race with creation of another processes from this namespace,
> otherwise a process with pid > 1 may die before pid_namespace::proc_mnt
> is populated and it will get a null pointer dereference in proc_flush_task().
> Also, child reaper mustn't die before processes from the namespace.

This patch introduces the possibility that two or more processes may
have the same pid namespace (with no processes) as pid_ns_for_children.

Which means you can now have a race for the first pid in alloc_pid.
Making it indeterminant who allocates the init process.  Which is not
acceptable.

It is not acceptable on two grounds.
1) It is a bogus user space semantic.  Because userspace needs to
   know who allocates init.
2) It is horrible for maintenance becuase now the code has to be very
   clever to deal with a case that no one cares about.  Which is
   a general formula for buggy code.

Eric


> The patch prevents such races. It allows to alloc_pid() only if
> ns->child_reaper is already set, and it guarantees, that
> pid_namespace::proc_mnt is already populated. Also, we do the assignment
> under the tasklist_lock in the copy_process() stage, when it can't fail.
> This guarantees the child_reaper will be hashed before the concurrent
> process, so the concurrent process can't die before it. When child reaper
> dies before the concurrent hashes to task list, fork() of the concurrent
> will aborts as it's prohibited after commit 3fd372262166:
> "pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()".
> So, we can't safely allow to get "/proc/[pid]/ns/pid_for_children"
> since it's created.
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: "Eric W. Biederman" <ebiederm@xmission.com>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Andy Lutomirski <luto@kernel.org>
> CC: Serge Hallyn <serge@hallyn.com>
> CC: Michal Hocko <mhocko@suse.com>
> CC: Andrei Vagin <avagin@openvz.org>
> CC: Cyrill Gorcunov <gorcunov@openvz.org>
> CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
> CC: Ingo Molnar <mingo@kernel.org>
> CC: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/pid.c           |    3 ++-
>  kernel/pid_namespace.c |    9 ---------
>  2 files changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index fd1cde1e4576..eeeb01fdd87c 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -334,7 +334,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  
>  	upid = pid->numbers + ns->level;
>  	spin_lock_irq(&pidmap_lock);
> -	if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
> +	if (!(ns->nr_hashed & PIDNS_HASH_ADDING) ||
> +	    (!ns->child_reaper && !is_child_reaper(pid)))
>  		goto out_unlock;
>  	for ( ; upid >= pid->numbers; --upid) {
>  		hlist_add_head_rcu(&upid->pid_chain,
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 74a5a7255b4d..51dd1d490542 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -385,15 +385,6 @@ static struct ns_common *pidns_for_children_get(struct task_struct *task)
>  	}
>  	task_unlock(task);
>  
> -	if (ns) {
> -		read_lock(&tasklist_lock);
> -		if (!ns->child_reaper) {
> -			put_pid_ns(ns);
> -			ns = NULL;
> -		}
> -		read_unlock(&tasklist_lock);
> -	}
> -
>  	return ns ? &ns->ns : NULL;
>  }
>  

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] pid_ns: Allow to get pid_for_children ns before child_reaper is created
  2017-05-27 11:01 ` Eric W. Biederman
@ 2017-05-29 10:49   ` Kirill Tkhai
  2017-06-08 12:17     ` Kirill Tkhai
  0 siblings, 1 reply; 4+ messages in thread
From: Kirill Tkhai @ 2017-05-29 10:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mhocko, avagin, peterz, linux-kernel, oleg, rppt, luto, gorcunov,
	akpm, mingo, serge

On 27.05.2017 14:01, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> This patch prohibits pid allocation till child_reaper
>> of pid namespace is set, and it makes possible and safe
>> to get just unshared pid_ns from "/proc/[pid]/ns/pid_for_children"
>> file. This may be useful to determine user_ns of such a created
>> pid_ns, which is not possible now.
>>
>> It was prohibited till now, because the architecture of pid namespaces
>> assumes child reaper is the firstly created process of the namespace,
>> and it initializes pid_namespace::proc_mnt. Child reaper creation
>> mustn't race with creation of another processes from this namespace,
>> otherwise a process with pid > 1 may die before pid_namespace::proc_mnt
>> is populated and it will get a null pointer dereference in proc_flush_task().
>> Also, child reaper mustn't die before processes from the namespace.
> 
> This patch introduces the possibility that two or more processes may
> have the same pid namespace (with no processes) as pid_ns_for_children.
> 
> Which means you can now have a race for the first pid in alloc_pid.
> Making it indeterminant who allocates the init process.  Which is not
> acceptable.
> 
> It is not acceptable on two grounds.
> 1) It is a bogus user space semantic.  Because userspace needs to
>    know who allocates init.
> 2) It is horrible for maintenance becuase now the code has to be very
>    clever to deal with a case that no one cares about.  Which is
>    a general formula for buggy code.

We may disallow setns() if there is no child reaper created, and
this solves all above issues. Please see v2 below, it has no problems
you pointed.

[PATCH v2]pid_ns: Allow to get pid_for_children ns before child_reaper is created

This patch prohibits setns() on a pid namespace till its child_reaper
is set, and it makes possible and safe to get just unshared pid_ns
from "/proc/[pid]/ns/pid_for_children" file. This may be useful
to determine user_ns of such a created pid_ns, which is not possible now.

It was not possible till now, because the architecture of pid namespaces
assumes child reaper is the first created process of the namespace,
and it initializes pid_namespace::proc_mnt. Child reaper creation
mustn't race with creation of another processes from this namespace,
otherwise a process with pid > 1 may die before pid_namespace::proc_mnt
is populated and it will get a null pointer dereference in proc_flush_task().
Also, child reaper mustn't die before processes from the namespace.

The patch prevents such races. It allows to setns() on a pid namespace
only if ns->child_reaper is already set, and this guarantees, that
only pid namespace creator may establish child reaper.
So, we can safely allow to get "/proc/[pid]/ns/pid_for_children"
since it's created, and to analyse it.

v2: Don't race for child reaper creation.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Serge Hallyn <serge@hallyn.com>
CC: Michal Hocko <mhocko@suse.com>
CC: Andrei Vagin <avagin@openvz.org>
CC: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <peterz@infradead.org>
---
 kernel/pid_namespace.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 74a5a7255b4d..5e7b3fd0d4c2 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -385,15 +385,6 @@ static struct ns_common *pidns_for_children_get(struct task_struct *task)
 	}
 	task_unlock(task);
 
-	if (ns) {
-		read_lock(&tasklist_lock);
-		if (!ns->child_reaper) {
-			put_pid_ns(ns);
-			ns = NULL;
-		}
-		read_unlock(&tasklist_lock);
-	}
-
 	return ns ? &ns->ns : NULL;
 }
 
@@ -428,6 +419,15 @@ static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
 	if (ancestor != active)
 		return -EINVAL;
 
+	/*
+	 * Disallow processes to use pid namespace till its
+	 * creator makes child reaper. Otherwise, several
+	 * processes race for that, and it's not clear who
+	 * establishes init.
+	 */
+	if (!new->child_reaper)
+		return -ESRCH;
+
 	put_pid_ns(nsproxy->pid_ns_for_children);
 	nsproxy->pid_ns_for_children = get_pid_ns(new);
 	return 0;

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] pid_ns: Allow to get pid_for_children ns before child_reaper is created
  2017-05-29 10:49   ` Kirill Tkhai
@ 2017-06-08 12:17     ` Kirill Tkhai
  0 siblings, 0 replies; 4+ messages in thread
From: Kirill Tkhai @ 2017-06-08 12:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mhocko, avagin, peterz, linux-kernel, oleg, rppt, luto, gorcunov,
	akpm, mingo, serge

ping

On 29.05.2017 13:49, Kirill Tkhai wrote:
> On 27.05.2017 14:01, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>
>>> This patch prohibits pid allocation till child_reaper
>>> of pid namespace is set, and it makes possible and safe
>>> to get just unshared pid_ns from "/proc/[pid]/ns/pid_for_children"
>>> file. This may be useful to determine user_ns of such a created
>>> pid_ns, which is not possible now.
>>>
>>> It was prohibited till now, because the architecture of pid namespaces
>>> assumes child reaper is the firstly created process of the namespace,
>>> and it initializes pid_namespace::proc_mnt. Child reaper creation
>>> mustn't race with creation of another processes from this namespace,
>>> otherwise a process with pid > 1 may die before pid_namespace::proc_mnt
>>> is populated and it will get a null pointer dereference in proc_flush_task().
>>> Also, child reaper mustn't die before processes from the namespace.
>>
>> This patch introduces the possibility that two or more processes may
>> have the same pid namespace (with no processes) as pid_ns_for_children.
>>
>> Which means you can now have a race for the first pid in alloc_pid.
>> Making it indeterminant who allocates the init process.  Which is not
>> acceptable.
>>
>> It is not acceptable on two grounds.
>> 1) It is a bogus user space semantic.  Because userspace needs to
>>    know who allocates init.
>> 2) It is horrible for maintenance becuase now the code has to be very
>>    clever to deal with a case that no one cares about.  Which is
>>    a general formula for buggy code.
> 
> We may disallow setns() if there is no child reaper created, and
> this solves all above issues. Please see v2 below, it has no problems
> you pointed.
> 
> [PATCH v2]pid_ns: Allow to get pid_for_children ns before child_reaper is created
> 
> This patch prohibits setns() on a pid namespace till its child_reaper
> is set, and it makes possible and safe to get just unshared pid_ns
> from "/proc/[pid]/ns/pid_for_children" file. This may be useful
> to determine user_ns of such a created pid_ns, which is not possible now.
> 
> It was not possible till now, because the architecture of pid namespaces
> assumes child reaper is the first created process of the namespace,
> and it initializes pid_namespace::proc_mnt. Child reaper creation
> mustn't race with creation of another processes from this namespace,
> otherwise a process with pid > 1 may die before pid_namespace::proc_mnt
> is populated and it will get a null pointer dereference in proc_flush_task().
> Also, child reaper mustn't die before processes from the namespace.
> 
> The patch prevents such races. It allows to setns() on a pid namespace
> only if ns->child_reaper is already set, and this guarantees, that
> only pid namespace creator may establish child reaper.
> So, we can safely allow to get "/proc/[pid]/ns/pid_for_children"
> since it's created, and to analyse it.
> 
> v2: Don't race for child reaper creation.
> 
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: "Eric W. Biederman" <ebiederm@xmission.com>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Andy Lutomirski <luto@kernel.org>
> CC: Serge Hallyn <serge@hallyn.com>
> CC: Michal Hocko <mhocko@suse.com>
> CC: Andrei Vagin <avagin@openvz.org>
> CC: Cyrill Gorcunov <gorcunov@openvz.org>
> CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
> CC: Ingo Molnar <mingo@kernel.org>
> CC: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/pid_namespace.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 74a5a7255b4d..5e7b3fd0d4c2 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -385,15 +385,6 @@ static struct ns_common *pidns_for_children_get(struct task_struct *task)
>  	}
>  	task_unlock(task);
>  
> -	if (ns) {
> -		read_lock(&tasklist_lock);
> -		if (!ns->child_reaper) {
> -			put_pid_ns(ns);
> -			ns = NULL;
> -		}
> -		read_unlock(&tasklist_lock);
> -	}
> -
>  	return ns ? &ns->ns : NULL;
>  }
>  
> @@ -428,6 +419,15 @@ static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
>  	if (ancestor != active)
>  		return -EINVAL;
>  
> +	/*
> +	 * Disallow processes to use pid namespace till its
> +	 * creator makes child reaper. Otherwise, several
> +	 * processes race for that, and it's not clear who
> +	 * establishes init.
> +	 */
> +	if (!new->child_reaper)
> +		return -ESRCH;
> +
>  	put_pid_ns(nsproxy->pid_ns_for_children);
>  	nsproxy->pid_ns_for_children = get_pid_ns(new);
>  	return 0;
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-06-08 12:17 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-23 16:29 [PATCH] pid_ns: Allow to get pid_for_children ns before child_reaper is created Kirill Tkhai
2017-05-27 11:01 ` Eric W. Biederman
2017-05-29 10:49   ` Kirill Tkhai
2017-06-08 12:17     ` Kirill Tkhai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).