linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Introduce the cloning with pids functionality
@ 2011-11-10 17:15 Pavel Emelyanov
  2011-11-10 17:15 ` [PATCH 1/3] pids: Make alloc_pid return error Pavel Emelyanov
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-10 17:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Glauber Costa, Nathan Lynch, Tejun Heo,
	Linux Kernel Mailing List, Oleg Nesterov, Serge Hallyn,
	Daniel Lezcano

When restoring a task (or a set of tasks) we need to recreate them 
with exactly the same pid(s) as they had before. Thus we need the 
ability to create a task with specified pid. The proposal is to reuse 
the already free CLONE_STOPPED clone flag, introduce the new one
called CLONE_CHILD_USEPIDS and point to the desired pids with the
child_tidptr.

There are two cnages from the previous set doing the same.

1. The child_tidptr points to an array of pids for current namespace 
   and its ancestors. When 0 is met in this array the pid number for
   the corresponding namespace is generated, rather than set.

2. For security reasons after a regular clone/fork is done in a namespace
   further cloning with predefined pid is not allowed.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/3] pids: Make alloc_pid return error
  2011-11-10 17:15 [PATCH 0/3] Introduce the cloning with pids functionality Pavel Emelyanov
@ 2011-11-10 17:15 ` Pavel Emelyanov
  2011-11-10 18:00   ` Oleg Nesterov
  2011-11-10 17:15 ` [PATCH 2/3] pids: Split alloc_pidmap into parts Pavel Emelyanov
  2011-11-10 17:16 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
  2 siblings, 1 reply; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-10 17:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Glauber Costa, Nathan Lynch, Tejun Heo,
	Linux Kernel Mailing List, Oleg Nesterov, Serge Hallyn,
	Daniel Lezcano

Currently the alloc_pid just reports the pid allocation failre
and this gets reported as ENOMEM to user. With the clone-with-pid
ability the error codes set will be extended and this patch eases
this extension.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |    6 +++---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..45a5f54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1253,10 +1253,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 	}
 
 	p->pid = pid_nr(pid);
diff --git a/kernel/pid.c b/kernel/pid.c
index e432057..2d9704c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -214,7 +214,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return -ENOMEM;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
@@ -281,7 +281,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 {
 	struct pid *pid;
 	enum pid_type type;
-	int i, nr;
+	int i, nr = -ENOMEM;
 	struct pid_namespace *tmp;
 	struct upid *upid;
 
@@ -321,7 +321,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/3] pids: Split alloc_pidmap into parts
  2011-11-10 17:15 [PATCH 0/3] Introduce the cloning with pids functionality Pavel Emelyanov
  2011-11-10 17:15 ` [PATCH 1/3] pids: Make alloc_pid return error Pavel Emelyanov
@ 2011-11-10 17:15 ` Pavel Emelyanov
  2011-11-10 18:12   ` Oleg Nesterov
  2011-11-10 17:16 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
  2 siblings, 1 reply; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-10 17:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Glauber Costa, Nathan Lynch, Tejun Heo,
	Linux Kernel Mailing List, Oleg Nesterov, Serge Hallyn,
	Daniel Lezcano

The map's page allocation code is moved to separate function
to make clone-with-pids patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
 kernel/pid.c |   37 ++++++++++++++++++++++---------------
 1 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 2d9704c..700dda3 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -159,6 +159,25 @@ static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
 	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us
+	 * installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (!map->page) {
+		map->page = page;
+		page = NULL;
+	}
+	spin_unlock_irq(&pidmap_lock);
+	kfree(page);
+
+	return map->page ? 0 : -1;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -176,22 +195,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	 */
 	max_scan = DIV_ROUND_UP(pid_max, BITS_PER_PAGE) - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (!map->page) {
-				map->page = page;
-				page = NULL;
-			}
-			spin_unlock_irq(&pidmap_lock);
-			kfree(page);
-			if (unlikely(!map->page))
+		if (unlikely(!map->page))
+			if (alloc_pidmap_page(map))
 				break;
-		}
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-10 17:15 [PATCH 0/3] Introduce the cloning with pids functionality Pavel Emelyanov
  2011-11-10 17:15 ` [PATCH 1/3] pids: Make alloc_pid return error Pavel Emelyanov
  2011-11-10 17:15 ` [PATCH 2/3] pids: Split alloc_pidmap into parts Pavel Emelyanov
@ 2011-11-10 17:16 ` Pavel Emelyanov
  2011-11-10 17:30   ` Tejun Heo
  2011-11-10 18:46   ` Oleg Nesterov
  2 siblings, 2 replies; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-10 17:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Glauber Costa, Nathan Lynch, Tejun Heo,
	Linux Kernel Mailing List, Oleg Nesterov, Serge Hallyn,
	Daniel Lezcano

When restoring a task (or a set of tasks) we need to recreate them
with exactly the same pid(s) as they had before. Thus we need the
ability to create a task with specified pid. The proposal is to reuse
the already free CLONE_STOPPED clone flag, introduce the new one
called CLONE_CHILD_USEPIDS and point to the desired pids with the
child_tidptr.

The child_tidptr points to an array of pids for current namespace and
its ancestors. When 0 is met in this array the pid number for the
corresponding namespace is generated, rather than set.

For security reasons after a regular clone/fork is done in a namespace
further cloning with predefined pid is not allowed.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
 include/linux/pid.h   |    2 +-
 include/linux/sched.h |    3 +--
 kernel/fork.c         |    4 +++-
 kernel/pid.c          |   39 +++++++++++++++++++++++++++++++++++++--
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b152d44..95aa618 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, int __user *want_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 41d0237..5472c4e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_CHILD_USEPIDS	0x02000000	/* use the pids given by user */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
diff --git a/kernel/fork.c b/kernel/fork.c
index 45a5f54..26c67ff 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1253,7 +1253,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns,
+				(clone_flags & CLONE_CHILD_USEPIDS) ?
+					child_tidptr : NULL);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 700dda3..fb6f74f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -224,6 +224,27 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -ENOMEM;
 }
 
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+
+	if (unlikely(!map->page))
+		if (alloc_pidmap_page(map))
+			return -ENOMEM;
+
+	if (pid_ns->last_pid != 0)
+		return -EPERM;
+
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+
+	return pid;
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
 {
 	int offset;
@@ -284,7 +305,7 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, int __user *want_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -298,7 +319,21 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = 0;
+		if (unlikely(want_pids != NULL)) {
+			if (get_user(nr, want_pids)) {
+				nr = -EFAULT;
+				goto out_free;
+			}
+
+			if (nr != 0) {
+				want_pids++;
+				nr = set_pidmap(tmp, nr);
+			} else
+				want_pids = NULL; /* optimize above */
+		}
+		if (nr == 0)
+			nr = alloc_pidmap(tmp);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-10 17:16 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
@ 2011-11-10 17:30   ` Tejun Heo
  2011-11-10 17:36     ` Pavel Emelyanov
  2011-11-10 18:46   ` Oleg Nesterov
  1 sibling, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2011-11-10 17:30 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Linux Kernel Mailing List, Oleg Nesterov, Serge Hallyn,
	Daniel Lezcano

Hello, Pavel.

>  	for (i = ns->level; i >= 0; i--) {
> -		nr = alloc_pidmap(tmp);
> +		nr = 0;
> +		if (unlikely(want_pids != NULL)) {
> +			if (get_user(nr, want_pids)) {
> +				nr = -EFAULT;
> +				goto out_free;
> +			}
> +
> +			if (nr != 0) {
> +				want_pids++;
> +				nr = set_pidmap(tmp, nr);
> +			} else
> +				want_pids = NULL; /* optimize above */
> +		}
> +		if (nr == 0)
> +			nr = alloc_pidmap(tmp);

Hmmm... is it necessary to be able to replicate pids on all
namespaces?  Also, isn't it a bit weird to be able to request PIDs in
the namespaces which is beyond the task which requested cloning?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-10 17:30   ` Tejun Heo
@ 2011-11-10 17:36     ` Pavel Emelyanov
  2011-11-10 17:45       ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-10 17:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Linux Kernel Mailing List, Oleg Nesterov, Serge Hallyn,
	Daniel Lezcano

On 11/10/2011 09:30 PM, Tejun Heo wrote:
> Hello, Pavel.
> 
>>  	for (i = ns->level; i >= 0; i--) {
>> -		nr = alloc_pidmap(tmp);
>> +		nr = 0;
>> +		if (unlikely(want_pids != NULL)) {
>> +			if (get_user(nr, want_pids)) {
>> +				nr = -EFAULT;
>> +				goto out_free;
>> +			}
>> +
>> +			if (nr != 0) {
>> +				want_pids++;
>> +				nr = set_pidmap(tmp, nr);
>> +			} else
>> +				want_pids = NULL; /* optimize above */
>> +		}
>> +		if (nr == 0)
>> +			nr = alloc_pidmap(tmp);
> 
> Hmmm... is it necessary to be able to replicate pids on all
> namespaces? 

Not for us (I mean OpenVZ). But since we should (in theory) be able to
recreate the nested set of namespaces it would be good if the API allows
for this from the very beginning.

> Also, isn't it a bit weird to be able to request PIDs in
> the namespaces which is beyond the task which requested cloning?

It is, but the last_pid != 0 check will abort this request with EPERM :)
Do you think some other behavior would be better?

> Thanks.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-10 17:36     ` Pavel Emelyanov
@ 2011-11-10 17:45       ` Tejun Heo
  2011-11-11 10:04         ` Pavel Emelyanov
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2011-11-10 17:45 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Linux Kernel Mailing List, Oleg Nesterov, Serge Hallyn,
	Daniel Lezcano

Hello,

On Thu, Nov 10, 2011 at 09:36:07PM +0400, Pavel Emelyanov wrote:
> > Hmmm... is it necessary to be able to replicate pids on all
> > namespaces? 
> 
> Not for us (I mean OpenVZ). But since we should (in theory) be able to
> recreate the nested set of namespaces it would be good if the API allows
> for this from the very beginning.

I see.

> > Also, isn't it a bit weird to be able to request PIDs in
> > the namespaces which is beyond the task which requested cloning?
> 
> It is, but the last_pid != 0 check will abort this request with EPERM :)
> Do you think some other behavior would be better?

Right, missed that part.  The only suggestion then is that clearing
wants_pid to NULL isn't an optimization but should be part of API -
ie. want_pids is 0 terminated; otherwise, the caller must know at
which ns depth it is.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] pids: Make alloc_pid return error
  2011-11-10 17:15 ` [PATCH 1/3] pids: Make alloc_pid return error Pavel Emelyanov
@ 2011-11-10 18:00   ` Oleg Nesterov
  2011-11-11 10:02     ` Pavel Emelyanov
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-10 18:00 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/10, Pavel Emelyanov wrote:
>
> @@ -281,7 +281,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  {
>  	struct pid *pid;
>  	enum pid_type type;
> -	int i, nr;
> +	int i, nr = -ENOMEM;
>  	struct pid_namespace *tmp;
>  	struct upid *upid;

This doesn't look right at first glance... I mean, if
the first kmem_cache_alloc(ns->pid_cachep) fails, this -ENOMEM
won't be returned as ERR_PTR().

> @@ -321,7 +321,7 @@ out_free:
>  		free_pidmap(pid->numbers + i);
>
>  	kmem_cache_free(ns->pid_cachep, pid);
> -	pid = NULL;
> +	pid = ERR_PTR(nr);
>  	goto out;

Off-topic, but with or withoit this patch this "goto out" looks
strange imho. Why not a simple

	-	pid = NULL;
	-	goto out;
	+	return ERR_PTR(nr);

instead? But this is minor and subjective, I won't insist.

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/3] pids: Split alloc_pidmap into parts
  2011-11-10 17:15 ` [PATCH 2/3] pids: Split alloc_pidmap into parts Pavel Emelyanov
@ 2011-11-10 18:12   ` Oleg Nesterov
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-10 18:12 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/10, Pavel Emelyanov wrote:
>
> The map's page allocation code is moved to separate function
> to make clone-with-pids patching simpler.

Sorry for the really cosmetic nit, but I simply can't resist...

> +static int alloc_pidmap_page(struct pidmap *map)
> +{
> +	void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +
> +	/*
> +	 * Free the page if someone raced with us
> +	 * installing it:
> +	 */
> +	spin_lock_irq(&pidmap_lock);
> +	if (!map->page) {
> +		map->page = page;
> +		page = NULL;
> +	}
> +	spin_unlock_irq(&pidmap_lock);
> +	kfree(page);
> +
> +	return map->page ? 0 : -1;

Again, I won't insist, but "return !map->page" looks more readable.
Even better (imho) would be to return map->page, and change the
single caller to check "if (!alloc_pidnap_page())".

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-10 17:16 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
  2011-11-10 17:30   ` Tejun Heo
@ 2011-11-10 18:46   ` Oleg Nesterov
  2011-11-10 18:56     ` Oleg Nesterov
  1 sibling, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-10 18:46 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/10, Pavel Emelyanov wrote:
>
> The child_tidptr points to an array of pids for current namespace and
> its ancestors. When 0 is met in this array the pid number for the
> corresponding namespace is generated, rather than set.

I must have missed something, but I can't unserstand how this works.

> For security reasons after a regular clone/fork is done in a namespace
> further cloning with predefined pid is not allowed.

I guess, this is pid_ns->last_pid != 0 check in set_pidmap(), right ?

> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> +{
> +	int offset;
> +	struct pidmap *map;
> +
> +	offset = pid & BITS_PER_PAGE_MASK;
> +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> +
> +	if (unlikely(!map->page))
> +		if (alloc_pidmap_page(map))
> +			return -ENOMEM;
> +
> +	if (pid_ns->last_pid != 0)
> +		return -EPERM;

OK, but it should be always true, no? IOW, set_pidmap() should always
fail?

Unless: you are using CLONE_NEWPID along with CLONE_CHILD_USEPIDS and
this child_tidptr array has only one pid (before zero pid).

So, could you please explain what I have missed?

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-10 18:46   ` Oleg Nesterov
@ 2011-11-10 18:56     ` Oleg Nesterov
  2011-11-11 10:11       ` Pavel Emelyanov
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-10 18:56 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

forgot to mention...

On 11/10, Oleg Nesterov wrote:
>
> On 11/10, Pavel Emelyanov wrote:
> >
> > The child_tidptr points to an array of pids for current namespace and
> > its ancestors. When 0 is met in this array the pid number for the
> > corresponding namespace is generated, rather than set.
>
> I must have missed something, but I can't unserstand how this works.
>
> > For security reasons after a regular clone/fork is done in a namespace
> > further cloning with predefined pid is not allowed.
>
> I guess, this is pid_ns->last_pid != 0 check in set_pidmap(), right ?
>
> > +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> > +{
> > +	int offset;
> > +	struct pidmap *map;
> > +
> > +	offset = pid & BITS_PER_PAGE_MASK;
> > +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> > +
> > +	if (unlikely(!map->page))
> > +		if (alloc_pidmap_page(map))
> > +			return -ENOMEM;
> > +
> > +	if (pid_ns->last_pid != 0)
> > +		return -EPERM;
>
> OK, but it should be always true, no? IOW, set_pidmap() should always
> fail?
>
> Unless: you are using CLONE_NEWPID along with CLONE_CHILD_USEPIDS and
> this child_tidptr array has only one pid (before zero pid).

And, if you do clone(CLONE_NEWPID | CLONE_CHILD_USEPIDS), then
new_ns->child_reaper == NULL (unless you pass "1" in child_tidptr[]) ?

> So, could you please explain what I have missed?

please ;) I guess I misread this patch completely. Help!

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] pids: Make alloc_pid return error
  2011-11-10 18:00   ` Oleg Nesterov
@ 2011-11-11 10:02     ` Pavel Emelyanov
  0 siblings, 0 replies; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-11 10:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/10/2011 10:00 PM, Oleg Nesterov wrote:
> On 11/10, Pavel Emelyanov wrote:
>>
>> @@ -281,7 +281,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>>  {
>>  	struct pid *pid;
>>  	enum pid_type type;
>> -	int i, nr;
>> +	int i, nr = -ENOMEM;
>>  	struct pid_namespace *tmp;
>>  	struct upid *upid;
> 
> This doesn't look right at first glance... I mean, if
> the first kmem_cache_alloc(ns->pid_cachep) fails, this -ENOMEM
> won't be returned as ERR_PTR().

Yikes! Will fix, thanks!

>> @@ -321,7 +321,7 @@ out_free:
>>  		free_pidmap(pid->numbers + i);
>>
>>  	kmem_cache_free(ns->pid_cachep, pid);
>> -	pid = NULL;
>> +	pid = ERR_PTR(nr);
>>  	goto out;
> 
> Off-topic, but with or withoit this patch this "goto out" looks
> strange imho. Why not a simple
> 
> 	-	pid = NULL;
> 	-	goto out;
> 	+	return ERR_PTR(nr);
> 
> instead? But this is minor and subjective, I won't insist.

Hm... OK, I will brush up the error paths a little bit more :)

> Oleg.
> 
> .
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-10 17:45       ` Tejun Heo
@ 2011-11-11 10:04         ` Pavel Emelyanov
  0 siblings, 0 replies; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-11 10:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Linux Kernel Mailing List, Oleg Nesterov, Serge Hallyn,
	Daniel Lezcano

On 11/10/2011 09:45 PM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Nov 10, 2011 at 09:36:07PM +0400, Pavel Emelyanov wrote:
>>> Hmmm... is it necessary to be able to replicate pids on all
>>> namespaces? 
>>
>> Not for us (I mean OpenVZ). But since we should (in theory) be able to
>> recreate the nested set of namespaces it would be good if the API allows
>> for this from the very beginning.
> 
> I see.
> 
>>> Also, isn't it a bit weird to be able to request PIDs in
>>> the namespaces which is beyond the task which requested cloning?
>>
>> It is, but the last_pid != 0 check will abort this request with EPERM :)
>> Do you think some other behavior would be better?
> 
> Right, missed that part.  The only suggestion then is that clearing
> wants_pid to NULL isn't an optimization but should be part of API -
> ie. want_pids is 0 terminated; otherwise, the caller must know at
> which ns depth it is.

Mm... I don't understand your concern.

If the want_pid[n] == 0 at some n value the further pid numbers should be 
generated, not set. I can NOT set want_pid to NULL here, but keep it point
to the n'th array member, but this will make the code go through the 
get_user() again and again. This want_pid = NULL is just to make the first
if (want_pid == NULL) check work.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-10 18:56     ` Oleg Nesterov
@ 2011-11-11 10:11       ` Pavel Emelyanov
  2011-11-11 15:25         ` Oleg Nesterov
  0 siblings, 1 reply; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-11 10:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

>>> The child_tidptr points to an array of pids for current namespace and
>>> its ancestors. When 0 is met in this array the pid number for the
>>> corresponding namespace is generated, rather than set.
>>
>> I must have missed something, but I can't unserstand how this works.
>>
>>> For security reasons after a regular clone/fork is done in a namespace
>>> further cloning with predefined pid is not allowed.
>>
>> I guess, this is pid_ns->last_pid != 0 check in set_pidmap(), right ?

Thanks for the feedback, Oleg! Please, see my explanation below.

>>> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
>>> +{
>>> +	int offset;
>>> +	struct pidmap *map;
>>> +
>>> +	offset = pid & BITS_PER_PAGE_MASK;
>>> +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
>>> +
>>> +	if (unlikely(!map->page))
>>> +		if (alloc_pidmap_page(map))
>>> +			return -ENOMEM;
>>> +
>>> +	if (pid_ns->last_pid != 0)
>>> +		return -EPERM;
>>
>> OK, but it should be always true, no? IOW, set_pidmap() should always
>> fail?
>>
>> Unless: you are using CLONE_NEWPID along with CLONE_CHILD_USEPIDS and
>> this child_tidptr array has only one pid (before zero pid).
> 
> And, if you do clone(CLONE_NEWPID | CLONE_CHILD_USEPIDS), then
> new_ns->child_reaper == NULL (unless you pass "1" in child_tidptr[]) ?
> 
>> So, could you please explain what I have missed?
> 
> please ;) I guess I misread this patch completely. Help!

This is how I plan to use this functionality.

When creating an init of a container being restored I call

   pids[0] = 1;
   pids[1] = 0;

   clone(CLONE_NEWPID | CLONE_CHILD_USEPIDS, &pids)

At this point the newly created namespace will have last_pid == 0 and will allow
for this init to be created. Then this created "init" task will have to read pids
from image files and call

pids[0] = <pid>
pids[1] = 0

clone(CLONE_CHILD_USEPIDS, &pids);

one by one. At this point the last_pid is still 0 and this new tasks with given
pids will be created. The newly created tasks if they have children too will have
to call the same code snippet.

After the restore is completed and new tasks are fork()-ed the last_pid gets finally
updated and new CLONE_CHILD_USEPIDS will return the EPERM in this namespace not
allowing for pids confusion.

And for the init_pid_ns the last_pid is set to non zero early at boot (when the kthreadd
is created) and thus pids abuse isn't allowed for the non-containerized system from
the very boot.

Does this sound OK?

> Oleg.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 10:11       ` Pavel Emelyanov
@ 2011-11-11 15:25         ` Oleg Nesterov
  2011-11-11 15:58           ` Pavel Emelyanov
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-11 15:25 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/11, Pavel Emelyanov wrote:
>
> >> Unless: you are using CLONE_NEWPID along with CLONE_CHILD_USEPIDS and
> >> this child_tidptr array has only one pid (before zero pid).
> >
> > And, if you do clone(CLONE_NEWPID | CLONE_CHILD_USEPIDS), then
> > new_ns->child_reaper == NULL (unless you pass "1" in child_tidptr[]) ?
> >
> >> So, could you please explain what I have missed?
> >
> > please ;) I guess I misread this patch completely. Help!
>
> This is how I plan to use this functionality.
>
> When creating an init of a container being restored I call
>
>    pids[0] = 1;
>    pids[1] = 0;
>
>    clone(CLONE_NEWPID | CLONE_CHILD_USEPIDS, &pids)

Yep, this is clear. In this case everything works because the pid_ns
has no pids (and thus ->last_pid == 0).

But. Let me repeat the question, what if you do the same with
pids[0] = 2 /* anything != 1 */ ? In this case we create the new
pid_ns, but its ->child_reaper is NULL. Unless I missed something.

> Then this created "init" task will have to read pids
> from image files and call
>
> pids[0] = <pid>
> pids[1] = 0
>
> clone(CLONE_CHILD_USEPIDS, &pids);
>
> one by one. At this point the last_pid is still 0

Yes, understood. set_pidmap() bypasses the last_pid logic.

Clever hack^Wtrick ;)

May be this deserves a comment above "if (pid_ns->last_pid != 0)",
and perhaps it would be more clean to do this check before anything
else.

Hmm. It seems, we can make a simpler patch to achieve the (roughly)
same effect. Without touching copy_process/alloc_pid paths. What if
we simply add PR_SET_LAST_PID? (or something else).

In this case the new init (created normally) read the pids from image
file and does prcrl(PR_SET_LAST_PID, pid-1) before the next fork.

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 15:25         ` Oleg Nesterov
@ 2011-11-11 15:58           ` Pavel Emelyanov
  2011-11-11 16:06             ` Tejun Heo
  2011-11-11 16:39             ` Oleg Nesterov
  0 siblings, 2 replies; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-11 15:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/11/2011 07:25 PM, Oleg Nesterov wrote:
> On 11/11, Pavel Emelyanov wrote:
>>
>>>> Unless: you are using CLONE_NEWPID along with CLONE_CHILD_USEPIDS and
>>>> this child_tidptr array has only one pid (before zero pid).
>>>
>>> And, if you do clone(CLONE_NEWPID | CLONE_CHILD_USEPIDS), then
>>> new_ns->child_reaper == NULL (unless you pass "1" in child_tidptr[]) ?
>>>
>>>> So, could you please explain what I have missed?
>>>
>>> please ;) I guess I misread this patch completely. Help!
>>
>> This is how I plan to use this functionality.
>>
>> When creating an init of a container being restored I call
>>
>>    pids[0] = 1;
>>    pids[1] = 0;
>>
>>    clone(CLONE_NEWPID | CLONE_CHILD_USEPIDS, &pids)
> 
> Yep, this is clear. In this case everything works because the pid_ns
> has no pids (and thus ->last_pid == 0).
> 
> But. Let me repeat the question, what if you do the same with
> pids[0] = 2 /* anything != 1 */ ? In this case we create the new
> pid_ns, but its ->child_reaper is NULL. Unless I missed something.

Hm... You're right here. I've missed the fact, then in recent kernels
child_reaper is set under pid == 1 condition (was clone_flags & CLONE_NEWPID).

How about if I fix it by disabling the simultaneous use of CLONE_NEWPID and
CLONE_CHILD_USEPIDS and checking for last_pid != 1 in the set_pidmap?

>> Then this created "init" task will have to read pids
>> from image files and call
>>
>> pids[0] = <pid>
>> pids[1] = 0
>>
>> clone(CLONE_CHILD_USEPIDS, &pids);
>>
>> one by one. At this point the last_pid is still 0
> 
> Yes, understood. set_pidmap() bypasses the last_pid logic.
> 
> Clever hack^Wtrick ;)

:)

> May be this deserves a comment above "if (pid_ns->last_pid != 0)",
> and perhaps it would be more clean to do this check before anything
> else.

OK, will fix this.

> Hmm. It seems, we can make a simpler patch to achieve the (roughly)
> same effect. Without touching copy_process/alloc_pid paths. What if
> we simply add PR_SET_LAST_PID? (or something else).
> 
> In this case the new init (created normally) read the pids from image
> file and does prcrl(PR_SET_LAST_PID, pid-1) before the next fork.
> 
> What do you think?

This will make it impossible to fork() children on restore in parallel. And
I don't want to lose this ability :(

> Oleg.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 15:58           ` Pavel Emelyanov
@ 2011-11-11 16:06             ` Tejun Heo
  2011-11-11 16:10               ` Tejun Heo
  2011-11-11 16:17               ` Pavel Emelyanov
  2011-11-11 16:39             ` Oleg Nesterov
  1 sibling, 2 replies; 38+ messages in thread
From: Tejun Heo @ 2011-11-11 16:06 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Andrew Morton, Cyrill Gorcunov, Glauber Costa,
	Nathan Lynch, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

Hello,

On Fri, Nov 11, 2011 at 7:58 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
>> Hmm. It seems, we can make a simpler patch to achieve the (roughly)
>> same effect. Without touching copy_process/alloc_pid paths. What if
>> we simply add PR_SET_LAST_PID? (or something else).
>>
>> In this case the new init (created normally) read the pids from image
>> file and does prcrl(PR_SET_LAST_PID, pid-1) before the next fork.
>>
>> What do you think?
>
> This will make it impossible to fork() children on restore in parallel. And
> I don't want to lose this ability :(

It's highly unlikely that the ability to fork in parallel would
contribute to any meaningful speedup. That is not the critical path by
*far* and I don't think it's worth optimizing for. Forking in serial
and restoring the rest of states in parallel should be enough.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 16:06             ` Tejun Heo
@ 2011-11-11 16:10               ` Tejun Heo
  2011-11-11 16:18                 ` Pavel Emelyanov
  2011-11-11 16:17               ` Pavel Emelyanov
  1 sibling, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2011-11-11 16:10 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Andrew Morton, Cyrill Gorcunov, Glauber Costa,
	Nathan Lynch, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

> It's highly unlikely that the ability to fork in parallel would
> contribute to any meaningful speedup. That is not the critical path by
> *far*

To add a bit, it's also one of the more *serialized* path. Doing it in
parallel might actually hurt, so if losing the ability to do it makes
it less intrusive, that's the better way.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 16:06             ` Tejun Heo
  2011-11-11 16:10               ` Tejun Heo
@ 2011-11-11 16:17               ` Pavel Emelyanov
  2011-11-11 16:48                 ` Oleg Nesterov
  1 sibling, 1 reply; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-11 16:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Andrew Morton, Cyrill Gorcunov, Glauber Costa,
	Nathan Lynch, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/11/2011 08:06 PM, Tejun Heo wrote:
> Hello,
> 
> On Fri, Nov 11, 2011 at 7:58 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
>>> Hmm. It seems, we can make a simpler patch to achieve the (roughly)
>>> same effect. Without touching copy_process/alloc_pid paths. What if
>>> we simply add PR_SET_LAST_PID? (or something else).
>>>
>>> In this case the new init (created normally) read the pids from image
>>> file and does prcrl(PR_SET_LAST_PID, pid-1) before the next fork.
>>>
>>> What do you think?
>>
>> This will make it impossible to fork() children on restore in parallel. And
>> I don't want to lose this ability :(
> 
> It's highly unlikely that the ability to fork in parallel would
> contribute to any meaningful speedup. That is not the critical path by
> *far* and I don't think it's worth optimizing for. Forking in serial
> and restoring the rest of states in parallel should be enough.

Well, I wouldn't say that for sure, but anyway. I'm not talking here about the
performance issues only. If we accept that we fork tasks one-by-one, then this
creates great synchronization problems.

First of all - let's imagine that we just want to clone a set of tasks. Then
each of them will have to fork its kids, then report first of them that he's
OK to fork and wait for it to report back, that forking is done. Then do the
same for the rest of them. This is not impossible, but painful.

Next - let's consider we have some tasks sharing various resources, e.g. mm-s
or fd-tables. This means, that these tasks should be cloned in the carefully
calculated sequence with CLONE_XXX flags set. In this case the described above
scheme with fork() serialization simply won't work and we'll have to invent
some fancy messaging with "now X fork with Y pid" and "X done with forking,
please go on" messages.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 16:10               ` Tejun Heo
@ 2011-11-11 16:18                 ` Pavel Emelyanov
  2011-11-11 16:22                   ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-11 16:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Andrew Morton, Cyrill Gorcunov, Glauber Costa,
	Nathan Lynch, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/11/2011 08:10 PM, Tejun Heo wrote:
>> It's highly unlikely that the ability to fork in parallel would
>> contribute to any meaningful speedup. That is not the critical path by
>> *far*
> 
> To add a bit, it's also one of the more *serialized* path. Doing it in
> parallel might actually hurt, 

Can you elaborate on this, please?

> so if losing the ability to do it makes it less intrusive, that's the 
> better way.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 16:18                 ` Pavel Emelyanov
@ 2011-11-11 16:22                   ` Tejun Heo
  2011-11-11 16:49                     ` Pavel Emelyanov
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2011-11-11 16:22 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Andrew Morton, Cyrill Gorcunov, Glauber Costa,
	Nathan Lynch, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On Fri, Nov 11, 2011 at 08:18:19PM +0400, Pavel Emelyanov wrote:
> > To add a bit, it's also one of the more *serialized* path. Doing it in
> > parallel might actually hurt, 
> 
> Can you elaborate on this, please?

It goes through a lot of locks - tasklist_lock, mmlist_lock, other
subsys locks depending on CLONE_* options.  If you try to do that on
multiple CPUs in parallel, depending on the level of concurrency,
contention and extra cacheline bounces may dominate the overhead.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 15:58           ` Pavel Emelyanov
  2011-11-11 16:06             ` Tejun Heo
@ 2011-11-11 16:39             ` Oleg Nesterov
  2011-11-11 16:55               ` Pavel Emelyanov
  1 sibling, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-11 16:39 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/11, Pavel Emelyanov wrote:
>
> On 11/11/2011 07:25 PM, Oleg Nesterov wrote:
> >
> > But. Let me repeat the question, what if you do the same with
> > pids[0] = 2 /* anything != 1 */ ? In this case we create the new
> > pid_ns, but its ->child_reaper is NULL. Unless I missed something.
>
> Hm... You're right here. I've missed the fact, then in recent kernels
> child_reaper is set under pid == 1 condition (was clone_flags & CLONE_NEWPID).

Yes, I always hated the "cleanup" which removed CLONE_NEWPID from
copy_process. This is_child_reaper() simply hides CLONE_NEWPID from
grep.

But this is offtopic. We should not create ->child_reaper with pid_nr != 1.

> How about if I fix it by disabling the simultaneous use of CLONE_NEWPID and
> CLONE_CHILD_USEPIDS and checking for last_pid != 1 in the set_pidmap?

I think this should work...

> > Hmm. It seems, we can make a simpler patch to achieve the (roughly)
> > same effect. Without touching copy_process/alloc_pid paths. What if
> > we simply add PR_SET_LAST_PID? (or something else).
> >
> > In this case the new init (created normally) read the pids from image
> > file and does prcrl(PR_SET_LAST_PID, pid-1) before the next fork.
> >
> > What do you think?
>
> This will make it impossible to fork() children on restore in parallel. And
> I don't want to lose this ability :(

Yes, this is true. You need some form of synchronization in user-space.
But, otoh, prctl/sysctl/whatever is much simpler. Both from implementation
pov and from understanding/using. You can even do, say, pthread_create()
to make a thread with the desired tid. And of course I like the fact we
do not add the new hacks into copy_process's paths.

And. If you want to restore the process tree, then these new children
have to cooperate anyway. Say, nobody can clone() without
CLONE_CHILD_USEPIDS before we restore all pids.

Yes, sysctl+clone should be "atomic", but that is all. Does it really
hurt? OK, if nothing else, can't you do somthing like

	int fork_with_pid(int pid)
	{
		int ret;
		int pipefd[2];

		pipe(pipefd);

	retry:
		prcrl(PR_SET_LAST_PID, pid-1);
		ret = fork();

		if (ret == 0) {
			/* child, wait from parent's ACK */
			read(pipefd[0], 1, NULL);
			return 0;
		}

		/* raced with another user of PR_SET_LAST_PID */
		if (unlikely(ret != pid) {
			kill(ret, SIGKILL);
			waitpid(ret);
			goto retry;
		}

		close(pipefd[1]);
		return pid;
	}

?

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 16:17               ` Pavel Emelyanov
@ 2011-11-11 16:48                 ` Oleg Nesterov
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-11 16:48 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Tejun Heo, Andrew Morton, Cyrill Gorcunov, Glauber Costa,
	Nathan Lynch, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/11, Pavel Emelyanov wrote:
>
> First of all - let's imagine that we just want to clone a set of tasks. Then
> each of them will have to fork its kids, then report first of them that he's
> OK to fork and wait for it to report back, that forking is done. Then do the
> same for the rest of them. This is not impossible, but painful.

Why we should wait/report? They can do this in parallel, only
"set_last_pid + fork" needs the synchronization or "check + retry".

> Next - let's consider we have some tasks sharing various resources, e.g. mm-s
> or fd-tables. This means, that these tasks should be cloned in the carefully
> calculated sequence with CLONE_XXX flags set. In this case the described above
> scheme with fork() serialization simply won't work and we'll have to invent
> some fancy messaging with "now X fork with Y pid" and "X done with forking,
> please go on" messages.

Can't understand...

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 16:22                   ` Tejun Heo
@ 2011-11-11 16:49                     ` Pavel Emelyanov
  2011-11-11 17:02                       ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-11 16:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Andrew Morton, Cyrill Gorcunov, Glauber Costa,
	Nathan Lynch, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/11/2011 08:22 PM, Tejun Heo wrote:
> On Fri, Nov 11, 2011 at 08:18:19PM +0400, Pavel Emelyanov wrote:
>>> To add a bit, it's also one of the more *serialized* path. Doing it in
>>> parallel might actually hurt, 
>>
>> Can you elaborate on this, please?
> 
> It goes through a lot of locks - tasklist_lock, mmlist_lock, other
> subsys locks depending on CLONE_* options.  If you try to do that on
> multiple CPUs in parallel, depending on the level of concurrency,
> contention and extra cacheline bounces may dominate the overhead.

O_O Here are the times of forks on my box (4cores X 2threads, Xeon, RHEL6)

1 cpu 500k forks                                         - 37s
2 cpus on different cores 500k forks on each in parallel - 39s
4 cpus on different cores 500k forks on each in parallel - 41s

8 cpus 500k forks on each in parallel                    - 1m5s

So the fork() scaling seems quite good to me.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 16:39             ` Oleg Nesterov
@ 2011-11-11 16:55               ` Pavel Emelyanov
  2011-11-13 18:59                 ` Oleg Nesterov
  0 siblings, 1 reply; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-11 16:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

>> This will make it impossible to fork() children on restore in parallel. And
>> I don't want to lose this ability :(
> 
> Yes, this is true. You need some form of synchronization in user-space.
> But, otoh, prctl/sysctl/whatever is much simpler. Both from implementation
> pov and from understanding/using. You can even do, say, pthread_create()
> to make a thread with the desired tid. And of course I like the fact we
> do not add the new hacks into copy_process's paths.
> 
> And. If you want to restore the process tree, then these new children
> have to cooperate anyway. Say, nobody can clone() without
> CLONE_CHILD_USEPIDS before we restore all pids.
> 
> Yes, sysctl+clone should be "atomic", but that is all. Does it really
> hurt? OK, if nothing else, can't you do somthing like
> 
> 	int fork_with_pid(int pid)
> 	{
> 		int ret;
> 		int pipefd[2];
> 
> 		pipe(pipefd);
> 
> 	retry:
> 		prcrl(PR_SET_LAST_PID, pid-1);
> 		ret = fork();
> 
> 		if (ret == 0) {
> 			/* child, wait from parent's ACK */
> 			read(pipefd[0], 1, NULL);
> 			return 0;
> 		}
> 
> 		/* raced with another user of PR_SET_LAST_PID */
> 		if (unlikely(ret != pid) {
> 			kill(ret, SIGKILL);
> 			waitpid(ret);
> 			goto retry;
> 		}
> 
> 		close(pipefd[1]);
> 		return pid;
> 	}
> 
> ?

Nope, as I said to Tejun, we will most likely not forks children in the depth-first
order, since tasks can share resources and we'll have to calculate the necessary fork
order. Thus this simple interaction simply won't work, more complexity will be required.

But I don't insist. If the CLONE_CHILD_USEPIDS has absolutely no way in the kernel we'll 
have to go the uglier path.

> Oleg.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 16:49                     ` Pavel Emelyanov
@ 2011-11-11 17:02                       ` Tejun Heo
  2011-11-11 17:13                         ` Pavel Emelyanov
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2011-11-11 17:02 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Andrew Morton, Cyrill Gorcunov, Glauber Costa,
	Nathan Lynch, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

Hello,

On Fri, Nov 11, 2011 at 08:49:50PM +0400, Pavel Emelyanov wrote:
> 1 cpu 500k forks                                         - 37s

That's ~14k forks per sec.  Do you still think you need parallel
forking?

> 2 cpus on different cores 500k forks on each in parallel - 39s
> 4 cpus on different cores 500k forks on each in parallel - 41s
> 
> 8 cpus 500k forks on each in parallel                    - 1m5s
> 
> So the fork() scaling seems quite good to me.

Yeah, looks pretty good actually.  Hmmm, this is on a single socket w/
shared cache where cacheline bouncing is quite cheap, right?  Also,
how are those forking processes related?  On multiple sockets, it's
gonna scale worse.  Dunno how much tho.

At any rate, if you do the rest in paralllel, whether forking is
parallel or not is immaterial.  Let's just do something least
intrusive.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 17:02                       ` Tejun Heo
@ 2011-11-11 17:13                         ` Pavel Emelyanov
  2011-11-13 19:28                           ` Oleg Nesterov
  0 siblings, 1 reply; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-11 17:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Andrew Morton, Cyrill Gorcunov, Glauber Costa,
	Nathan Lynch, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/11/2011 09:02 PM, Tejun Heo wrote:
> Hello,
> 
> On Fri, Nov 11, 2011 at 08:49:50PM +0400, Pavel Emelyanov wrote:
>> 1 cpu 500k forks                                         - 37s
> 
> That's ~14k forks per sec.  Do you still think you need parallel
> forking?
> 
>> 2 cpus on different cores 500k forks on each in parallel - 39s
>> 4 cpus on different cores 500k forks on each in parallel - 41s
>>
>> 8 cpus 500k forks on each in parallel                    - 1m5s
>>
>> So the fork() scaling seems quite good to me.
> 
> Yeah, looks pretty good actually.  Hmmm, this is on a single socket w/
> shared cache where cacheline bouncing is quite cheap, right?  Also,
> how are those forking processes related?  On multiple sockets, it's
> gonna scale worse.  Dunno how much tho.
> 
> At any rate, if you do the rest in paralllel, whether forking is
> parallel or not is immaterial.  Let's just do something least
> intrusive.

Hm, so intrusiveness is your main concern here, I see.

OK, let's assume we go with sysctl setting the last_pid.

One of the major concerns with previous attempts have been - someone creates 
a process with a pid that was in use by some app recently and screws things 
up with pid reuse. My approach solves this, how can sysctl handle it? Allowing 
the last_pid change by the CAP_SYA_ADMIN only is not an option, since people
are looking forward to non-root restore.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 16:55               ` Pavel Emelyanov
@ 2011-11-13 18:59                 ` Oleg Nesterov
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-13 18:59 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Tejun Heo, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/11, Pavel Emelyanov wrote:
>
> > Yes, sysctl+clone should be "atomic", but that is all. Does it really
> > hurt? OK, if nothing else, can't you do somthing like
> ...
> Nope, as I said to Tejun, we will most likely not forks children in the depth-first
> order,

But this doesn't matter. I think you misunderstood a bit. You can do
set_last_pid+clone at any time. It doesn't depen on previous clone's.
Unless, of course, the necessary pid_nr was already used, but this is
equally true for CLONE_CHILD_USEPIDS.

> But I don't insist.

You should ;)

> If the CLONE_CHILD_USEPIDS has absolutely no way in the kernel we'll
> have to go the uglier path.

All I think is: we should discuss everything we can before we add the
new API. Probably we need a wider CC.

Yes, personally I can't say I like CLONE_CHILD_USEPIDS very much. But
I agree that sysctl(set_last_pid) (or whatever) is not perfect too.
To me, it has only one but very important (imho) advantage: it is much
simpler and understandable.

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-11 17:13                         ` Pavel Emelyanov
@ 2011-11-13 19:28                           ` Oleg Nesterov
  2011-11-14 10:28                             ` Pavel Emelyanov
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-13 19:28 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Tejun Heo, Andrew Morton, Cyrill Gorcunov, Glauber Costa,
	Nathan Lynch, Linux Kernel Mailing List, Serge Hallyn,
	Daniel Lezcano

On 11/11, Pavel Emelyanov wrote:
>
> Hm, so intrusiveness is your main concern here, I see.

To me, this is important.

> OK, let's assume we go with sysctl setting the last_pid.
>
> One of the major concerns with previous attempts have been - someone creates
> a process with a pid that was in use by some app recently and screws things
> up with pid reuse.

Good point.

> My approach solves this,

Yes. Although, can't resist, in a subtle way (imho). CLONE_CHILD_USEPIDS
stops working after a clone() without this flag.

> how can sysctl handle it? Allowing
> the last_pid change by the CAP_SYA_ADMIN

Yes, when I suggested set_last_pid I assumed that it needs CAP_SYS_ADMIN.

> only is not an option, since people
> are looking forward to non-root restore.

But CLONE_NEWPID needs CAP_SYS_ADMIN too ?

Anyway, I do not pretend I understand the problem space. And probably
it would be more convenient to change the creds before forking some
children with the predefined pids, I do not know.

So yes, I agree, CLONE_CHILD_USEPIDS wins here. Perhaps set_last_pid
needs another sysctl(set_last_pid_allowed)/whatever, or another idea.
Or we should use CLONE_CHILD_USEPIDS ;)

Let me repeat. It is not that I strongly against CLONE_CHILD_USEPIDS
(although yes, I can't say personally I like it very much ;). Just
I am trying to ensure we can't make something more clear/clean/simple,
at least from the kernel pov. Especially because you are trying to
establish the new user-visible API.

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-13 19:28                           ` Oleg Nesterov
@ 2011-11-14 10:28                             ` Pavel Emelyanov
  0 siblings, 0 replies; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-14 10:28 UTC (permalink / raw)
  To: Oleg Nesterov, Tejun Heo
  Cc: Andrew Morton, Cyrill Gorcunov, Glauber Costa, Nathan Lynch,
	Linux Kernel Mailing List, Serge Hallyn, Daniel Lezcano

>> Hm, so intrusiveness is your main concern here, I see.
> 
> To me, this is important.
> 
> Let me repeat. It is not that I strongly against CLONE_CHILD_USEPIDS
> (although yes, I can't say personally I like it very much ;). Just
> I am trying to ensure we can't make something more clear/clean/simple,
> at least from the kernel pov. Especially because you are trying to
> establish the new user-visible API.

OK, I see. I think I will fix your comment then and will send the v2 set trying
to get the feedback from wider audience. Oleg, Tejun, thank you for your ;)

> Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 18:36   ` Oleg Nesterov
@ 2011-11-18 10:05     ` Pavel Emelyanov
  0 siblings, 0 replies; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-18 10:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17/2011 10:36 PM, Oleg Nesterov wrote:
> My previous objection was wrong, I should try to find something else ;)
> 
> On 11/17, Pavel Emelyanov wrote:
>>
>> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
>> +{
>> +	int offset;
>> +	struct pidmap *map;
>> +
>> +	/*
>> +	 * When creating a new pid namespace we must make its init
>> +	 * have pid == 1 in it.
>> +	 */
>> +	if (pid_ns->child_reaper == NULL)
>> +		return 0;
>> +
>> +	/*
>> +	 * Don't allow to create a task with a pid which has recently
>> +	 * belonged to some other (dead already) task. Only init (of
>> +	 * a freshly created namespace) and his clones can do this.
>> +	 */
>> +	if (pid_ns->last_pid != 1)
>> +		return -EPERM;
>> +
>> +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> 
> probably we should check that map < PIDMAP_ENTRIES ?

Yes, you're right here. The checks for given pid being correct are required. Will fix.

> Oleg.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 17:28   ` Linus Torvalds
@ 2011-11-17 19:04     ` Oleg Nesterov
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-17 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Emelyanov, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17, Linus Torvalds wrote:
>
> Ok. Patches 1-2 look sane on their own. I think we can merge them
> regardless in the next cycle if Oleg & co agree. Oleg?

I am not sure 1/3 really makes sense without 3/3... although it
cleanups the "goto out" logic in alloc_pid().

Anyway I agree, the patches look fine.

> Patch 3 obviously is generating discussion and actually introduces new
> functionality, so this one is the contentious one..
>
>                   Linus
>
> On Thu, Nov 17, 2011 at 9:43 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
> >
> > When restoring a task (or a set of tasks) we need to recreate them
> > with exactly the same pid(s) as they had before.  [...]

Yes.

Just in case, I believe technically 3/3 is correct too, modulo the
small problem with the unchecked access to ->pidmap[] (unless I missed
something again).

I am not sure about pid_max... probably we do not care in this case?

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 11:43 ` [PATCH 3/3] pids: Make it possible to clone " Pavel Emelyanov
  2011-11-17 15:32   ` Oleg Nesterov
  2011-11-17 17:28   ` Linus Torvalds
@ 2011-11-17 18:36   ` Oleg Nesterov
  2011-11-18 10:05     ` Pavel Emelyanov
  2 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-17 18:36 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

My previous objection was wrong, I should try to find something else ;)

On 11/17, Pavel Emelyanov wrote:
>
> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> +{
> +	int offset;
> +	struct pidmap *map;
> +
> +	/*
> +	 * When creating a new pid namespace we must make its init
> +	 * have pid == 1 in it.
> +	 */
> +	if (pid_ns->child_reaper == NULL)
> +		return 0;
> +
> +	/*
> +	 * Don't allow to create a task with a pid which has recently
> +	 * belonged to some other (dead already) task. Only init (of
> +	 * a freshly created namespace) and his clones can do this.
> +	 */
> +	if (pid_ns->last_pid != 1)
> +		return -EPERM;
> +
> +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];

probably we should check that map < PIDMAP_ENTRIES ?

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 11:43 ` [PATCH 3/3] pids: Make it possible to clone " Pavel Emelyanov
  2011-11-17 15:32   ` Oleg Nesterov
@ 2011-11-17 17:28   ` Linus Torvalds
  2011-11-17 19:04     ` Oleg Nesterov
  2011-11-17 18:36   ` Oleg Nesterov
  2 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2011-11-17 17:28 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Oleg Nesterov,
	Cyrill Gorcunov, James Bottomley

Ok. Patches 1-2 look sane on their own. I think we can merge them
regardless in the next cycle if Oleg & co agree. Oleg?

Patch 3 obviously is generating discussion and actually introduces new
functionality, so this one is the contentious one..

                  Linus

On Thu, Nov 17, 2011 at 9:43 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
>
> When restoring a task (or a set of tasks) we need to recreate them
> with exactly the same pid(s) as they had before.  [...]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 15:49     ` Pavel Emelyanov
@ 2011-11-17 16:00       ` Oleg Nesterov
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-17 16:00 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17, Pavel Emelyanov wrote:
>
> On 11/17/2011 07:32 PM, Oleg Nesterov wrote:
> > On 11/17, Pavel Emelyanov wrote:
> >>
> >> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> >> +{
> >> +	int offset;
> >> +	struct pidmap *map;
> >> +
> >> +	/*
> >> +	 * When creating a new pid namespace we must make its init
> >> +	 * have pid == 1 in it.
> >> +	 */
> >> +	if (pid_ns->child_reaper == NULL)
> >> +		return 0;
> >
> > Do we really need this check? please see below...
> >
> >> +	/*
> >> +	 * Don't allow to create a task with a pid which has recently
> >> +	 * belonged to some other (dead already) task. Only init (of
> >> +	 * a freshly created namespace) and his clones can do this.
> >> +	 */
> >> +	if (pid_ns->last_pid != 1)
> >> +		return -EPERM;
> >
> > ->last_pid == 1. This means that pid_nr == 1 was already created
> > in this namespace via CLONE_NEWPID, and the child with this pid
> > must be ->child_reaper, no?
>
> If you use the CLONE_NEWPID | CLONE_CHILD_USEPIDS

Ah wait... I misread the check above as if it returns the error if
->child_reaper == NULL.

So, CLONE_NEWPID simply ignores child_tidptr[0], alloc_pid()
fallbacks to alloc_pidmap() after set_pidmap() returns 0.

> > Cough. I really think 45a68628 should be reverted ;) IMHO it
> > complicates the understanding of CLONE_NEWPID logic.
>
> If we remove it, then it's OK to remove the check above, but in this case we
> make it possible to have an init with pid != 1. This is flexible, but ... strange.

No, I didn't mean we should allow ->child_reaper with pid != 1, sorry
for confusion.

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 15:32   ` Oleg Nesterov
@ 2011-11-17 15:49     ` Pavel Emelyanov
  2011-11-17 16:00       ` Oleg Nesterov
  0 siblings, 1 reply; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-17 15:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17/2011 07:32 PM, Oleg Nesterov wrote:
> On 11/17, Pavel Emelyanov wrote:
>>
>> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
>> +{
>> +	int offset;
>> +	struct pidmap *map;
>> +
>> +	/*
>> +	 * When creating a new pid namespace we must make its init
>> +	 * have pid == 1 in it.
>> +	 */
>> +	if (pid_ns->child_reaper == NULL)
>> +		return 0;
> 
> Do we really need this check? please see below...
> 
>> +	/*
>> +	 * Don't allow to create a task with a pid which has recently
>> +	 * belonged to some other (dead already) task. Only init (of
>> +	 * a freshly created namespace) and his clones can do this.
>> +	 */
>> +	if (pid_ns->last_pid != 1)
>> +		return -EPERM;
> 
> ->last_pid == 1. This means that pid_nr == 1 was already created
> in this namespace via CLONE_NEWPID, and the child with this pid
> must be ->child_reaper, no?

If you use the CLONE_NEWPID | CLONE_CHILD_USEPIDS then you should provide the 1st
pid in array is 1. Otherwise init in this new pid namespace will have pid != 1 and
the child_reaper assignment (yes, the 45a68628 commit) will be lost :(

> IOW. if copy_process() allocs the first pid in the new pid_ns, it
> always sets ->child_reaper.

Fixup - if it allocates pid == 1, then it sets the child reaper.

> Cough. I really think 45a68628 should be reverted ;) IMHO it
> complicates the understanding of CLONE_NEWPID logic.

If we remove it, then it's OK to remove the check above, but in this case we
make it possible to have an init with pid != 1. This is flexible, but ... strange.

> Oleg.
> 
> .
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 11:43 ` [PATCH 3/3] pids: Make it possible to clone " Pavel Emelyanov
@ 2011-11-17 15:32   ` Oleg Nesterov
  2011-11-17 15:49     ` Pavel Emelyanov
  2011-11-17 17:28   ` Linus Torvalds
  2011-11-17 18:36   ` Oleg Nesterov
  2 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2011-11-17 15:32 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17, Pavel Emelyanov wrote:
>
> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> +{
> +	int offset;
> +	struct pidmap *map;
> +
> +	/*
> +	 * When creating a new pid namespace we must make its init
> +	 * have pid == 1 in it.
> +	 */
> +	if (pid_ns->child_reaper == NULL)
> +		return 0;

Do we really need this check? please see below...

> +	/*
> +	 * Don't allow to create a task with a pid which has recently
> +	 * belonged to some other (dead already) task. Only init (of
> +	 * a freshly created namespace) and his clones can do this.
> +	 */
> +	if (pid_ns->last_pid != 1)
> +		return -EPERM;

->last_pid == 1. This means that pid_nr == 1 was already created
in this namespace via CLONE_NEWPID, and the child with this pid
must be ->child_reaper, no?

IOW. if copy_process() allocs the first pid in the new pid_ns, it
always sets ->child_reaper.


Cough. I really think 45a68628 should be reverted ;) IMHO it
complicates the understanding of CLONE_NEWPID logic.

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 11:41 [RFC][PATCH 0/3] fork: Add the ability to create " Pavel Emelyanov
@ 2011-11-17 11:43 ` Pavel Emelyanov
  2011-11-17 15:32   ` Oleg Nesterov
                     ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Pavel Emelyanov @ 2011-11-17 11:43 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List
  Cc: Tejun Heo, Oleg Nesterov, Cyrill Gorcunov, James Bottomley

When restoring a task (or a set of tasks) we need to recreate them
with exactly the same pid(s) as they had before. Thus we need the
ability to create a task with specified pid. The proposal is to reuse
the already free CLONE_STOPPED clone flag, introduce the new one
called CLONE_CHILD_USEPIDS and point to the desired pids with the
child_tidptr.

The child_tidptr points to an array of pids for current namespace and
its ancestors. When 0 is met in this array the pid number for the
corresponding namespace is generated, rather than set.

For security reasons after a regular clone/fork is done in a namespace
further cloning with predefined pid is not allowed.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
 include/linux/pid.h   |    2 +-
 include/linux/sched.h |    3 +-
 kernel/fork.c         |    4 ++-
 kernel/pid.c          |   50 +++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b152d44..95aa618 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, int __user *want_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 41d0237..5472c4e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_CHILD_USEPIDS	0x02000000	/* use the pids given by user */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
diff --git a/kernel/fork.c b/kernel/fork.c
index 45a5f54..26c67ff 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1253,7 +1253,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns,
+				(clone_flags & CLONE_CHILD_USEPIDS) ?
+					child_tidptr : NULL);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 86bf7d2..fc7d35c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -224,6 +224,38 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -ENOMEM;
 }
 
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	/*
+	 * When creating a new pid namespace we must make its init
+	 * have pid == 1 in it.
+	 */
+	if (pid_ns->child_reaper == NULL)
+		return 0;
+
+	/*
+	 * Don't allow to create a task with a pid which has recently
+	 * belonged to some other (dead already) task. Only init (of
+	 * a freshly created namespace) and his clones can do this.
+	 */
+	if (pid_ns->last_pid != 1)
+		return -EPERM;
+
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+	offset = pid & BITS_PER_PAGE_MASK;
+
+	if (alloc_pidmap_page(map))
+		return -ENOMEM;
+
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+
+	return pid;
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
 {
 	int offset;
@@ -284,7 +316,7 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, int __user *want_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -298,7 +330,21 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = 0;
+		if (unlikely(want_pids != NULL)) {
+			if (get_user(nr, want_pids)) {
+				nr = -EFAULT;
+				goto out_free;
+			}
+
+			if (nr != 0) {
+				want_pids++;
+				nr = set_pidmap(tmp, nr);
+			} else
+				want_pids = NULL; /* optimize above */
+		}
+		if (nr == 0)
+			nr = alloc_pidmap(tmp);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2011-11-18 10:05 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-10 17:15 [PATCH 0/3] Introduce the cloning with pids functionality Pavel Emelyanov
2011-11-10 17:15 ` [PATCH 1/3] pids: Make alloc_pid return error Pavel Emelyanov
2011-11-10 18:00   ` Oleg Nesterov
2011-11-11 10:02     ` Pavel Emelyanov
2011-11-10 17:15 ` [PATCH 2/3] pids: Split alloc_pidmap into parts Pavel Emelyanov
2011-11-10 18:12   ` Oleg Nesterov
2011-11-10 17:16 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
2011-11-10 17:30   ` Tejun Heo
2011-11-10 17:36     ` Pavel Emelyanov
2011-11-10 17:45       ` Tejun Heo
2011-11-11 10:04         ` Pavel Emelyanov
2011-11-10 18:46   ` Oleg Nesterov
2011-11-10 18:56     ` Oleg Nesterov
2011-11-11 10:11       ` Pavel Emelyanov
2011-11-11 15:25         ` Oleg Nesterov
2011-11-11 15:58           ` Pavel Emelyanov
2011-11-11 16:06             ` Tejun Heo
2011-11-11 16:10               ` Tejun Heo
2011-11-11 16:18                 ` Pavel Emelyanov
2011-11-11 16:22                   ` Tejun Heo
2011-11-11 16:49                     ` Pavel Emelyanov
2011-11-11 17:02                       ` Tejun Heo
2011-11-11 17:13                         ` Pavel Emelyanov
2011-11-13 19:28                           ` Oleg Nesterov
2011-11-14 10:28                             ` Pavel Emelyanov
2011-11-11 16:17               ` Pavel Emelyanov
2011-11-11 16:48                 ` Oleg Nesterov
2011-11-11 16:39             ` Oleg Nesterov
2011-11-11 16:55               ` Pavel Emelyanov
2011-11-13 18:59                 ` Oleg Nesterov
2011-11-17 11:41 [RFC][PATCH 0/3] fork: Add the ability to create " Pavel Emelyanov
2011-11-17 11:43 ` [PATCH 3/3] pids: Make it possible to clone " Pavel Emelyanov
2011-11-17 15:32   ` Oleg Nesterov
2011-11-17 15:49     ` Pavel Emelyanov
2011-11-17 16:00       ` Oleg Nesterov
2011-11-17 17:28   ` Linus Torvalds
2011-11-17 19:04     ` Oleg Nesterov
2011-11-17 18:36   ` Oleg Nesterov
2011-11-18 10:05     ` Pavel Emelyanov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).