linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
@ 2011-11-17 11:41 Pavel Emelyanov
  2011-11-17 11:42 ` [PATCH 1/3] pids: Make alloc_pid return error Pavel Emelyanov
                   ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-17 11:41 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List
  Cc: Tejun Heo, Oleg Nesterov, Cyrill Gorcunov, James Bottomley

Gentlemen, please, find some time for this, your ACK/NACK on the API proposal
is required badly.


There's currently a work in progress with checkpoint-restore functionality
in the userspace. Most of the API for doing this kernel already provides, but
sometimes it's not enough. One of the required things is the ability to
create a process with its pids (in different pid namespaces) set to some
given values, rather than generated. Currently kernel doesn't allow for this,
so an API extension is required.

The proposal is to introduce the CLONE_CHILD_USEPIDS flag for clone() syscall
and pass the pids values in the child_tidptr. In order not to introduce the
hole for the pid-reuse attack, using this flag will result in EPERM in case
the pid namespace we're trying to create pid in has at least one pid (except
for the init's one) generated with regular fork()/clone().

Currently Tejun and Oleg are worrying only about the intrusiveness of this
approach, although Oleg agrees, that it solves all the problems it should. The
previous attempts to implement the similar stuff stopped, but no objections
against this were expressed. So the decision of whether it's OK to go this
way or not is required.


The API will be used like in the code below

	/* restore new pid namespace with an init in it */
	pid = clone(CLONE_NEWPID);
	if (pid)
		return 0;

	/*
	 * init of a new pid namespace.
	 * recreate the process tree
	 */

restore_children:
	while (1) {
		pid = next_pid_from_image();
		if (!pid)
			/* no more children */
			break;

		pid = clone(CLONE_CHILD_USEPIDS, &pid);
		if (pid == 0)
			goto restore_children;
	}

	/*
	 * the process tree is recreated, can proceed with restoring
	 * other stuff
	 */


Thanks,
Pavel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/3] pids: Make alloc_pid return error
  2011-11-17 11:41 [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids Pavel Emelyanov
@ 2011-11-17 11:42 ` Pavel Emelyanov
  2011-11-17 11:42 ` [PATCH 2/3] pids: Split alloc_pidmap into parts Pavel Emelyanov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-17 11:42 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List
  Cc: Tejun Heo, Oleg Nesterov, Cyrill Gorcunov, James Bottomley

Currently the alloc_pid just reports the pid allocation failre
and this gets reported as ENOMEM to user. With the clone-with-pid
ability the error codes set will be extended and this patch eases
this extension.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |    9 ++++-----
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..45a5f54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1253,10 +1253,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 	}
 
 	p->pid = pid_nr(pid);
diff --git a/kernel/pid.c b/kernel/pid.c
index e432057..4816f43 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -214,7 +214,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return -ENOMEM;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
@@ -281,7 +281,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 {
 	struct pid *pid;
 	enum pid_type type;
-	int i, nr;
+	int i, nr = -ENOMEM;
 	struct pid_namespace *tmp;
 	struct upid *upid;
 
@@ -313,7 +313,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 				&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
 	spin_unlock_irq(&pidmap_lock);
 
-out:
 	return pid;
 
 out_free:
@@ -321,8 +320,8 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
-	goto out;
+out:
+	return ERR_PTR(nr);
 }
 
 struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/3] pids: Split alloc_pidmap into parts
  2011-11-17 11:41 [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids Pavel Emelyanov
  2011-11-17 11:42 ` [PATCH 1/3] pids: Make alloc_pid return error Pavel Emelyanov
@ 2011-11-17 11:42 ` Pavel Emelyanov
  2011-11-17 11:43 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
  2011-11-17 15:49 ` [RFC][PATCH 0/3] fork: Add the ability to create " Oleg Nesterov
  3 siblings, 0 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-17 11:42 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List
  Cc: Tejun Heo, Oleg Nesterov, Cyrill Gorcunov, James Bottomley

The map's page allocation code is moved to separate function
to make clone-with-pids patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
 kernel/pid.c |   39 +++++++++++++++++++++++----------------
 1 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 4816f43..86bf7d2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -159,6 +159,26 @@ static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
 	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	if (unlikely(!map->page)) {
+		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+		/*
+		 * Free the page if someone raced with us
+		 * installing it:
+		 */
+		spin_lock_irq(&pidmap_lock);
+		if (!map->page) {
+			map->page = page;
+			page = NULL;
+		}
+		spin_unlock_irq(&pidmap_lock);
+		kfree(page);
+	}
+
+	return !map->page;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -176,22 +196,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	 */
 	max_scan = DIV_ROUND_UP(pid_max, BITS_PER_PAGE) - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (!map->page) {
-				map->page = page;
-				page = NULL;
-			}
-			spin_unlock_irq(&pidmap_lock);
-			kfree(page);
-			if (unlikely(!map->page))
-				break;
-		}
+		if (alloc_pidmap_page(map))
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 11:41 [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids Pavel Emelyanov
  2011-11-17 11:42 ` [PATCH 1/3] pids: Make alloc_pid return error Pavel Emelyanov
  2011-11-17 11:42 ` [PATCH 2/3] pids: Split alloc_pidmap into parts Pavel Emelyanov
@ 2011-11-17 11:43 ` Pavel Emelyanov
  2011-11-17 15:32   ` Oleg Nesterov
                     ` (2 more replies)
  2011-11-17 15:49 ` [RFC][PATCH 0/3] fork: Add the ability to create " Oleg Nesterov
  3 siblings, 3 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-17 11:43 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List
  Cc: Tejun Heo, Oleg Nesterov, Cyrill Gorcunov, James Bottomley

When restoring a task (or a set of tasks) we need to recreate them
with exactly the same pid(s) as they had before. Thus we need the
ability to create a task with specified pid. The proposal is to reuse
the already free CLONE_STOPPED clone flag, introduce the new one
called CLONE_CHILD_USEPIDS and point to the desired pids with the
child_tidptr.

The child_tidptr points to an array of pids for current namespace and
its ancestors. When 0 is met in this array the pid number for the
corresponding namespace is generated, rather than set.

For security reasons after a regular clone/fork is done in a namespace
further cloning with predefined pid is not allowed.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
 include/linux/pid.h   |    2 +-
 include/linux/sched.h |    3 +-
 kernel/fork.c         |    4 ++-
 kernel/pid.c          |   50 +++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b152d44..95aa618 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, int __user *want_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 41d0237..5472c4e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_CHILD_USEPIDS	0x02000000	/* use the pids given by user */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
diff --git a/kernel/fork.c b/kernel/fork.c
index 45a5f54..26c67ff 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1253,7 +1253,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns,
+				(clone_flags & CLONE_CHILD_USEPIDS) ?
+					child_tidptr : NULL);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 86bf7d2..fc7d35c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -224,6 +224,38 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -ENOMEM;
 }
 
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	/*
+	 * When creating a new pid namespace we must make its init
+	 * have pid == 1 in it.
+	 */
+	if (pid_ns->child_reaper == NULL)
+		return 0;
+
+	/*
+	 * Don't allow to create a task with a pid which has recently
+	 * belonged to some other (dead already) task. Only init (of
+	 * a freshly created namespace) and his clones can do this.
+	 */
+	if (pid_ns->last_pid != 1)
+		return -EPERM;
+
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+	offset = pid & BITS_PER_PAGE_MASK;
+
+	if (alloc_pidmap_page(map))
+		return -ENOMEM;
+
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+
+	return pid;
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
 {
 	int offset;
@@ -284,7 +316,7 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, int __user *want_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -298,7 +330,21 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = 0;
+		if (unlikely(want_pids != NULL)) {
+			if (get_user(nr, want_pids)) {
+				nr = -EFAULT;
+				goto out_free;
+			}
+
+			if (nr != 0) {
+				want_pids++;
+				nr = set_pidmap(tmp, nr);
+			} else
+				want_pids = NULL; /* optimize above */
+		}
+		if (nr == 0)
+			nr = alloc_pidmap(tmp);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 11:43 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
@ 2011-11-17 15:32   ` Oleg Nesterov
  2011-11-17 15:49     ` Pavel Emelyanov
  2011-11-17 17:28   ` Linus Torvalds
  2011-11-17 18:36   ` Oleg Nesterov
  2 siblings, 1 reply; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-17 15:32 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17, Pavel Emelyanov wrote:
>
> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> +{
> +	int offset;
> +	struct pidmap *map;
> +
> +	/*
> +	 * When creating a new pid namespace we must make its init
> +	 * have pid == 1 in it.
> +	 */
> +	if (pid_ns->child_reaper == NULL)
> +		return 0;

Do we really need this check? please see below...

> +	/*
> +	 * Don't allow to create a task with a pid which has recently
> +	 * belonged to some other (dead already) task. Only init (of
> +	 * a freshly created namespace) and his clones can do this.
> +	 */
> +	if (pid_ns->last_pid != 1)
> +		return -EPERM;

->last_pid == 1. This means that pid_nr == 1 was already created
in this namespace via CLONE_NEWPID, and the child with this pid
must be ->child_reaper, no?

IOW. if copy_process() allocs the first pid in the new pid_ns, it
always sets ->child_reaper.


Cough. I really think 45a68628 should be reverted ;) IMHO it
complicates the understanding of CLONE_NEWPID logic.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-17 11:41 [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids Pavel Emelyanov
                   ` (2 preceding siblings ...)
  2011-11-17 11:43 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
@ 2011-11-17 15:49 ` Oleg Nesterov
  2011-11-17 16:01   ` Pavel Emelyanov
  3 siblings, 1 reply; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-17 15:49 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17, Pavel Emelyanov wrote:
>
> Gentlemen, please, find some time for this, your ACK/NACK on the API proposal
> is required badly.

Please.

> The proposal is to introduce the CLONE_CHILD_USEPIDS flag for clone() syscall
> and pass the pids values in the child_tidptr. In order not to introduce the
> hole for the pid-reuse attack, using this flag will result in EPERM in case
> the pid namespace we're trying to create pid in has at least one pid (except
> for the init's one) generated with regular fork()/clone().
>
> Currently Tejun and Oleg are worrying only about the intrusiveness of this
> approach, although Oleg agrees, that it solves all the problems it should. The
> previous attempts to implement the similar stuff stopped, but no objections
> against this were expressed. So the decision of whether it's OK to go this
> way or not is required.

Yes, personally I'd prefer /proc/set_last_pid (or something similar) which
simply writes to pid_ns->last_pid. Perhaps it is less convenient from the
user-space pov (serialization, security) but it is much simpler.

OTOH, I do not pretend I understand the user-space needs, so I won't argue.
This series seems correct, the bugs we discussed are fixed.

But. Speaking of API, it differs a bit compared to the previous version...

> The API will be used like in the code below
>
> 	/* restore new pid namespace with an init in it */
> 	pid = clone(CLONE_NEWPID);

Yes, CLONE_NEWPID | CLONE_CHILD_USEPIDS is not possible.

Then how the array of pids in child_tidptr[] can be useful? If CLONE_NEWPID
can't restore the pid_nr's in the parent namespaces, then probably this
doesn't makes sense at all?

IOW. I think we should either allow CLONE_NEWPID | CLONE_CHILD_USEPIDS
(with additional check in set_pidmap() to ensure that CLONE_NEWPID
 comes with child_tidptr[0] == 1), or we should treat the "overloaded"
child_tidptr as a simple pid_t.

Again, I won't insist. Just I want to be sure we do not miss something
adding the new API.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 15:32   ` Oleg Nesterov
@ 2011-11-17 15:49     ` Pavel Emelyanov
  2011-11-17 16:00       ` Oleg Nesterov
  0 siblings, 1 reply; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-17 15:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17/2011 07:32 PM, Oleg Nesterov wrote:
> On 11/17, Pavel Emelyanov wrote:
>>
>> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
>> +{
>> +	int offset;
>> +	struct pidmap *map;
>> +
>> +	/*
>> +	 * When creating a new pid namespace we must make its init
>> +	 * have pid == 1 in it.
>> +	 */
>> +	if (pid_ns->child_reaper == NULL)
>> +		return 0;
> 
> Do we really need this check? please see below...
> 
>> +	/*
>> +	 * Don't allow to create a task with a pid which has recently
>> +	 * belonged to some other (dead already) task. Only init (of
>> +	 * a freshly created namespace) and his clones can do this.
>> +	 */
>> +	if (pid_ns->last_pid != 1)
>> +		return -EPERM;
> 
> ->last_pid == 1. This means that pid_nr == 1 was already created
> in this namespace via CLONE_NEWPID, and the child with this pid
> must be ->child_reaper, no?

If you use the CLONE_NEWPID | CLONE_CHILD_USEPIDS then you should provide the 1st
pid in array is 1. Otherwise init in this new pid namespace will have pid != 1 and
the child_reaper assignment (yes, the 45a68628 commit) will be lost :(

> IOW. if copy_process() allocs the first pid in the new pid_ns, it
> always sets ->child_reaper.

Fixup - if it allocates pid == 1, then it sets the child reaper.

> Cough. I really think 45a68628 should be reverted ;) IMHO it
> complicates the understanding of CLONE_NEWPID logic.

If we remove it, then it's OK to remove the check above, but in this case we
make it possible to have an init with pid != 1. This is flexible, but ... strange.

> Oleg.
> 
> .
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 15:49     ` Pavel Emelyanov
@ 2011-11-17 16:00       ` Oleg Nesterov
  0 siblings, 0 replies; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-17 16:00 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17, Pavel Emelyanov wrote:
>
> On 11/17/2011 07:32 PM, Oleg Nesterov wrote:
> > On 11/17, Pavel Emelyanov wrote:
> >>
> >> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> >> +{
> >> +	int offset;
> >> +	struct pidmap *map;
> >> +
> >> +	/*
> >> +	 * When creating a new pid namespace we must make its init
> >> +	 * have pid == 1 in it.
> >> +	 */
> >> +	if (pid_ns->child_reaper == NULL)
> >> +		return 0;
> >
> > Do we really need this check? please see below...
> >
> >> +	/*
> >> +	 * Don't allow to create a task with a pid which has recently
> >> +	 * belonged to some other (dead already) task. Only init (of
> >> +	 * a freshly created namespace) and his clones can do this.
> >> +	 */
> >> +	if (pid_ns->last_pid != 1)
> >> +		return -EPERM;
> >
> > ->last_pid == 1. This means that pid_nr == 1 was already created
> > in this namespace via CLONE_NEWPID, and the child with this pid
> > must be ->child_reaper, no?
>
> If you use the CLONE_NEWPID | CLONE_CHILD_USEPIDS

Ah wait... I misread the check above as if it returns the error if
->child_reaper == NULL.

So, CLONE_NEWPID simply ignores child_tidptr[0], alloc_pid()
fallbacks to alloc_pidmap() after set_pidmap() returns 0.

> > Cough. I really think 45a68628 should be reverted ;) IMHO it
> > complicates the understanding of CLONE_NEWPID logic.
>
> If we remove it, then it's OK to remove the check above, but in this case we
> make it possible to have an init with pid != 1. This is flexible, but ... strange.

No, I didn't mean we should allow ->child_reaper with pid != 1, sorry
for confusion.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-17 15:49 ` [RFC][PATCH 0/3] fork: Add the ability to create " Oleg Nesterov
@ 2011-11-17 16:01   ` Pavel Emelyanov
  2011-11-17 16:02     ` Oleg Nesterov
  2011-11-18 23:30     ` Tejun Heo
  0 siblings, 2 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-17 16:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17/2011 07:49 PM, Oleg Nesterov wrote:
> On 11/17, Pavel Emelyanov wrote:
>>
>> Gentlemen, please, find some time for this, your ACK/NACK on the API proposal
>> is required badly.
> 
> Please.
> 
>> The proposal is to introduce the CLONE_CHILD_USEPIDS flag for clone() syscall
>> and pass the pids values in the child_tidptr. In order not to introduce the
>> hole for the pid-reuse attack, using this flag will result in EPERM in case
>> the pid namespace we're trying to create pid in has at least one pid (except
>> for the init's one) generated with regular fork()/clone().
>>
>> Currently Tejun and Oleg are worrying only about the intrusiveness of this
>> approach, although Oleg agrees, that it solves all the problems it should. The
>> previous attempts to implement the similar stuff stopped, but no objections
>> against this were expressed. So the decision of whether it's OK to go this
>> way or not is required.
> 
> Yes, personally I'd prefer /proc/set_last_pid (or something similar) which
> simply writes to pid_ns->last_pid. Perhaps it is less convenient from the
> user-space pov (serialization, security) but it is much simpler.

Yes, this is also possible. I have a working prototype of /proc/sys/kernel/ns_last_pid
with the security issue solved, but setting sysctl then cloning seems more obfuscating
to me than just passing an array of pids to clone.

> OTOH, I do not pretend I understand the user-space needs, so I won't argue.
> This series seems correct, the bugs we discussed are fixed.
> 
> But. Speaking of API, it differs a bit compared to the previous version...
> 
>> The API will be used like in the code below
>>
>> 	/* restore new pid namespace with an init in it */
>> 	pid = clone(CLONE_NEWPID);
> 
> Yes, CLONE_NEWPID | CLONE_CHILD_USEPIDS is not possible.

It should be. If we (in theory, but) restore two pid namespaces with one being
a child of another we will have to create an init of the child ns with predefined
pid in the parent ns.

> Then how the array of pids in child_tidptr[] can be useful? If CLONE_NEWPID
> can't restore the pid_nr's in the parent namespaces, then probably this
> doesn't makes sense at all?
> 
> IOW. I think we should either allow CLONE_NEWPID | CLONE_CHILD_USEPIDS
> (with additional check in set_pidmap() to ensure that CLONE_NEWPID
>  comes with child_tidptr[0] == 1), or we should treat the "overloaded"
> child_tidptr as a simple pid_t.

The child_tidptr[0] == 1 check will also work. Currently I check for the
ns->child_reaper being NULL instead.

> Again, I won't insist. Just I want to be sure we do not miss something
> adding the new API.
> 
> Oleg.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-17 16:01   ` Pavel Emelyanov
@ 2011-11-17 16:02     ` Oleg Nesterov
  2011-11-18 23:30     ` Tejun Heo
  1 sibling, 0 replies; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-17 16:02 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17, Pavel Emelyanov wrote:
>
> On 11/17/2011 07:49 PM, Oleg Nesterov wrote:
> >
> > But. Speaking of API, it differs a bit compared to the previous version...
> >
> >> The API will be used like in the code below
> >>
> >> 	/* restore new pid namespace with an init in it */
> >> 	pid = clone(CLONE_NEWPID);
> >
> > Yes, CLONE_NEWPID | CLONE_CHILD_USEPIDS is not possible.
>
> It should be.

Yes, I was wrong, sorry. I misread the ->child_reaper == NULL in 3/3.

We simply ignore child_tidptr[0] in this case. This should work.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 11:43 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
  2011-11-17 15:32   ` Oleg Nesterov
@ 2011-11-17 17:28   ` Linus Torvalds
  2011-11-17 19:04     ` Oleg Nesterov
  2011-11-17 18:36   ` Oleg Nesterov
  2 siblings, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2011-11-17 17:28 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Oleg Nesterov,
	Cyrill Gorcunov, James Bottomley

Ok. Patches 1-2 look sane on their own. I think we can merge them
regardless in the next cycle if Oleg & co agree. Oleg?

Patch 3 obviously is generating discussion and actually introduces new
functionality, so this one is the contentious one..

                  Linus

On Thu, Nov 17, 2011 at 9:43 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
>
> When restoring a task (or a set of tasks) we need to recreate them
> with exactly the same pid(s) as they had before.  [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 11:43 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
  2011-11-17 15:32   ` Oleg Nesterov
  2011-11-17 17:28   ` Linus Torvalds
@ 2011-11-17 18:36   ` Oleg Nesterov
  2011-11-18 10:05     ` Pavel Emelyanov
  2 siblings, 1 reply; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-17 18:36 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

My previous objection was wrong, I should try to find something else ;)

On 11/17, Pavel Emelyanov wrote:
>
> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> +{
> +	int offset;
> +	struct pidmap *map;
> +
> +	/*
> +	 * When creating a new pid namespace we must make its init
> +	 * have pid == 1 in it.
> +	 */
> +	if (pid_ns->child_reaper == NULL)
> +		return 0;
> +
> +	/*
> +	 * Don't allow to create a task with a pid which has recently
> +	 * belonged to some other (dead already) task. Only init (of
> +	 * a freshly created namespace) and his clones can do this.
> +	 */
> +	if (pid_ns->last_pid != 1)
> +		return -EPERM;
> +
> +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];

probably we should check that map < PIDMAP_ENTRIES ?

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 17:28   ` Linus Torvalds
@ 2011-11-17 19:04     ` Oleg Nesterov
  0 siblings, 0 replies; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-17 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Emelyanov, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17, Linus Torvalds wrote:
>
> Ok. Patches 1-2 look sane on their own. I think we can merge them
> regardless in the next cycle if Oleg & co agree. Oleg?

I am not sure 1/3 really makes sense without 3/3... although it
cleanups the "goto out" logic in alloc_pid().

Anyway I agree, the patches look fine.

> Patch 3 obviously is generating discussion and actually introduces new
> functionality, so this one is the contentious one..
>
>                   Linus
>
> On Thu, Nov 17, 2011 at 9:43 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
> >
> > When restoring a task (or a set of tasks) we need to recreate them
> > with exactly the same pid(s) as they had before.  [...]

Yes.

Just in case, I believe technically 3/3 is correct too, modulo the
small problem with the unchecked access to ->pidmap[] (unless I missed
something again).

I am not sure about pid_max... probably we do not care in this case?

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] pids: Make it possible to clone tasks with given pids
  2011-11-17 18:36   ` Oleg Nesterov
@ 2011-11-18 10:05     ` Pavel Emelyanov
  0 siblings, 0 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-18 10:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Tejun Heo, Cyrill Gorcunov,
	James Bottomley

On 11/17/2011 10:36 PM, Oleg Nesterov wrote:
> My previous objection was wrong, I should try to find something else ;)
> 
> On 11/17, Pavel Emelyanov wrote:
>>
>> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
>> +{
>> +	int offset;
>> +	struct pidmap *map;
>> +
>> +	/*
>> +	 * When creating a new pid namespace we must make its init
>> +	 * have pid == 1 in it.
>> +	 */
>> +	if (pid_ns->child_reaper == NULL)
>> +		return 0;
>> +
>> +	/*
>> +	 * Don't allow to create a task with a pid which has recently
>> +	 * belonged to some other (dead already) task. Only init (of
>> +	 * a freshly created namespace) and his clones can do this.
>> +	 */
>> +	if (pid_ns->last_pid != 1)
>> +		return -EPERM;
>> +
>> +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> 
> probably we should check that map < PIDMAP_ENTRIES ?

Yes, you're right here. The checks for given pid being correct are required. Will fix.

> Oleg.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-17 16:01   ` Pavel Emelyanov
  2011-11-17 16:02     ` Oleg Nesterov
@ 2011-11-18 23:30     ` Tejun Heo
  2011-11-21  9:15       ` Pavel Emelyanov
  1 sibling, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2011-11-18 23:30 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Alan Cox,
	Roland McGrath, Linux Kernel Mailing List, Cyrill Gorcunov,
	James Bottomley

Hello,

On Thu, Nov 17, 2011 at 08:01:03PM +0400, Pavel Emelyanov wrote:
> > Yes, personally I'd prefer /proc/set_last_pid (or something similar) which
> > simply writes to pid_ns->last_pid. Perhaps it is less convenient from the
> > user-space pov (serialization, security) but it is much simpler.
> 
> Yes, this is also possible. I have a working prototype of /proc/sys/kernel/ns_last_pid
> with the security issue solved, but setting sysctl then cloning seems more obfuscating
> to me than just passing an array of pids to clone.

Do you mind sharing the patch?  It doesn't have to be perfect.  I'm
just curious how it looks.  IMHO the suggested pid array passing is
good enough and not too intrusive but, if there's something simpler
from kernel side, given that this is a very specialized interface, I
think we definitely need to consider that.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-18 23:30     ` Tejun Heo
@ 2011-11-21  9:15       ` Pavel Emelyanov
  2011-11-21 22:50         ` Tejun Heo
  0 siblings, 1 reply; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-21  9:15 UTC (permalink / raw)
  To: Tejun Heo, Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Alan Cox, Roland McGrath,
	Linux Kernel Mailing List, Cyrill Gorcunov, James Bottomley

On 11/19/2011 03:30 AM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Nov 17, 2011 at 08:01:03PM +0400, Pavel Emelyanov wrote:
>>> Yes, personally I'd prefer /proc/set_last_pid (or something similar) which
>>> simply writes to pid_ns->last_pid. Perhaps it is less convenient from the
>>> user-space pov (serialization, security) but it is much simpler.
>>
>> Yes, this is also possible. I have a working prototype of /proc/sys/kernel/ns_last_pid
>> with the security issue solved, but setting sysctl then cloning seems more obfuscating
>> to me than just passing an array of pids to clone.
> 
> Do you mind sharing the patch?

Sure! First of all, we need to change the ctl_table_root->permission callback to pass
the required operations (MAY_XXX) into it, rather than just getting the mode allowed.
The API change it like in the patch below (plus we need to patch the net/ sysctl's, since
they use this API):

--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -1052,10 +1052,12 @@ struct ctl_table_root {
 	struct ctl_table_set default_set;
 	struct ctl_table_set *(*lookup)(struct ctl_table_root *root,
 					   struct nsproxy *namespaces);
-	int (*permissions)(struct ctl_table_root *root,
-			struct nsproxy *namespaces, struct ctl_table *table);
+	int (*permissions)(struct nsproxy *namespaces,
+			struct ctl_table *table, int op);
 };
 
 /* struct ctl_table_header is used to maintain dynamic lists of
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1706,7 +1706,7 @@ void register_sysctl_root(struct ctl_table_root *root)
  * some sysctl variables are readonly even to root.
  */
 
-static int test_perm(int mode, int op)
+int sysctl_test_perm(int mode, int op)
 {
 	if (!current_euid())
 		mode >>= 6;
@@ -1722,11 +1722,9 @@ int sysctl_perm(struct ctl_table_root *root, struct ctl_table *table, int op)
 	int mode;
 
 	if (root->permissions)
-		mode = root->permissions(root, current->nsproxy, table);
+		return root->permissions(current->nsproxy, table, op);
 	else
-		mode = table->mode;
-
-	return test_perm(mode, op);
+		return sysctl_test_perm(table->mode, op);
 }
 
 static void sysctl_set_parent(struct ctl_table *parent, struct ctl_table *table)

---

Then I introduce the kernel.ns_last_pid sysctl that is allows for MAY_OPEN | MAP_WRITE for
the namespace's init only and allows for MAY_WRITE for anyone else. Thus, if we want to
write to this file from non-init task it must have the respective fd inherited from the init
on fork. It works OK for checkpoint/restore.

The patch is:


diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index e9c9adc..3686a07 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -15,6 +15,7 @@
 #include <linux/acct.h>
 #include <linux/slab.h>
 #include <linux/proc_fs.h>
+#include <linux/sysctl.h>
 
 #define BITS_PER_PAGE		(PAGE_SIZE*8)
 
@@ -191,9 +192,54 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	return;
 }
 
+static int pid_ns_ctl_permissions(struct nsproxy *namespaces,
+			struct ctl_table *table, int op)
+{
+	int mode = 0644;
+
+	if ((op & MAY_OPEN) &&
+			current != namespaces->pid_ns->child_reaper)
+		/*
+		 * Writing to this sysctl is allowed only for init
+		 * and to whoever it grands the open file
+		 */
+		mode &= ~0222;
+
+	return sysctl_test_perm(mode, op);
+}
+
+static struct ctl_table_root pid_ns_root = {
+	.permissions = pid_ns_ctl_permissions,
+};
+
+static int pid_ns_ctl_handler(struct ctl_table *table, int write,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table tmp = *table;
+	tmp.data = &current->nsproxy->pid_ns->last_pid;
+	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
+}
+
+static struct ctl_table pid_ns_ctl_table[] = {
+	.permissions = pid_ns_ctl_permissions,
+};
+
+static int pid_ns_ctl_handler(struct ctl_table *table, int write,
+		     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table tmp = *table;
+	tmp.data = &current->nsproxy->pid_ns->last_pid;
+	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
+}
+
+static struct ctl_table pid_ns_ctl_table[] = {
+	{
+		.procname = "ns_last_pid",
+		.maxlen = sizeof(int),
+		.proc_handler = pid_ns_ctl_handler,
+	},
+	{ }
+};
+
+static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
+
 static __init int pid_namespaces_init(void)
 {
 	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+
+	setup_sysctl_set(&pid_ns_root.default_set, NULL, NULL);
+	register_sysctl_root(&pid_ns_root);
+	__register_sysctl_paths(&pid_ns_root, current->nsproxy,
+			kern_path, pid_ns_ctl_table);
+
 	return 0;
 }
 

> It doesn't have to be perfect.  I'm just curious how it looks.
> IMHO the suggested pid array passing is good enough and not too intrusive
> but, if there's something simpler from kernel side, given that this is a
> very specialized interface, I think we definitely need to consider that.

Well, after a bit more thinking I found one more pros for this sysctl - when restoring
a container we'll have the possibility to set the last_pid to what we want to prevent the
pids reuse after the restore.

> Thank you.
> 


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-21  9:15       ` Pavel Emelyanov
@ 2011-11-21 22:50         ` Tejun Heo
  2011-11-22 11:11           ` Pavel Emelyanov
  2011-11-22 21:16           ` Oleg Nesterov
  0 siblings, 2 replies; 46+ messages in thread
From: Tejun Heo @ 2011-11-21 22:50 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Alan Cox,
	Roland McGrath, Linux Kernel Mailing List, Cyrill Gorcunov,
	James Bottomley

Hello, Pavel.

On Mon, Nov 21, 2011 at 01:15:02PM +0400, Pavel Emelyanov wrote:
> Then I introduce the kernel.ns_last_pid sysctl that is allows for MAY_OPEN | MAP_WRITE for
> the namespace's init only and allows for MAY_WRITE for anyone else. Thus, if we want to
> write to this file from non-init task it must have the respective fd inherited from the init
> on fork. It works OK for checkpoint/restore.
> 
> The patch is:
> 
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index e9c9adc..3686a07 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -15,6 +15,7 @@
>  #include <linux/acct.h>
>  #include <linux/slab.h>
>  #include <linux/proc_fs.h>
> +#include <linux/sysctl.h>
>  
>  #define BITS_PER_PAGE		(PAGE_SIZE*8)
>  
> @@ -191,9 +192,54 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	return;
>  }
>  
> +static int pid_ns_ctl_permissions(struct nsproxy *namespaces,
> +			struct ctl_table *table, int op)
> +{
> +	int mode = 0644;
> +
> +	if ((op & MAY_OPEN) &&
> +			current != namespaces->pid_ns->child_reaper)
> +		/*
> +		 * Writing to this sysctl is allowed only for init
> +		 * and to whoever it grands the open file
> +		 */
> +		mode &= ~0222;
> +
> +	return sysctl_test_perm(mode, op);
> +}
> +
> +static struct ctl_table_root pid_ns_root = {
> +	.permissions = pid_ns_ctl_permissions,
> +};

Hmmm... I hope this could be prettier.  I'm having trouble following
where the MAY_OPEN comes from.  Can you please explain?  Can't we for
now allow this for root and then later allow CAP_CHECKPOINT that
Cyrill suggested?  Or do we want to allow setting pids even w/o CR for
NS creator?

> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> +		     void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	struct ctl_table tmp = *table;
> +	tmp.data = &current->nsproxy->pid_ns->last_pid;
> +	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
> +}

Probably better to call set_last_pid() on write path instead?

> Well, after a bit more thinking I found one more pros for this
> sysctl - when restoring a container we'll have the possibility to
> set the last_pid to what we want to prevent the pids reuse after the
> restore.

Hmmm... I personally like this one better.  Restoring multilevel pids
would be more tedious but should still be possible and I really like
that it's staying out of clone path and all modifications are to ns
and pid code.  Oleg, what do you think?

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-21 22:50         ` Tejun Heo
@ 2011-11-22 11:11           ` Pavel Emelyanov
  2011-11-22 12:04             ` Pedro Alves
  2011-11-22 15:23             ` Tejun Heo
  2011-11-22 21:16           ` Oleg Nesterov
  1 sibling, 2 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-22 11:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Alan Cox,
	Roland McGrath, Linux Kernel Mailing List, Cyrill Gorcunov,
	James Bottomley

>> +static int pid_ns_ctl_permissions(struct nsproxy *namespaces,
>> +			struct ctl_table *table, int op)
>> +{
>> +	int mode = 0644;
>> +
>> +	if ((op & MAY_OPEN) &&
>> +			current != namespaces->pid_ns->child_reaper)
>> +		/*
>> +		 * Writing to this sysctl is allowed only for init
>> +		 * and to whoever it grands the open file
>> +		 */
>> +		mode &= ~0222;
>> +
>> +	return sysctl_test_perm(mode, op);
>> +}
>> +
>> +static struct ctl_table_root pid_ns_root = {
>> +	.permissions = pid_ns_ctl_permissions,
>> +};
> 
> Hmmm... I hope this could be prettier.  I'm having trouble following
> where the MAY_OPEN comes from.  Can you please explain?

>From this calltrace:

 pid_ns_ctl_permissions
 sysctl_perm
 proc_sys_permission
 inode_permission
 do_last <<<<< MAY_OPEN appears here
 path_openat
 do_filp_open
 do_sys_open
 sys_open


> Can't we for now allow this for root and then later allow CAP_CHECKPOINT 
> that Cyrill suggested?  Or do we want to allow setting pids even w/o CR 
> for NS creator?

I think that systemd guys can play with it. E.g. respawning daemons with predefined
pids sounds like an interesting thing to play with.

>> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>> +		     void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +	struct ctl_table tmp = *table;
>> +	tmp.data = &current->nsproxy->pid_ns->last_pid;
>> +	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
>> +}
> 
> Probably better to call set_last_pid() on write path instead?

Why? The usage of this sysctl is going to be synchronized  by external locks,
so why should we care?

>> Well, after a bit more thinking I found one more pros for this
>> sysctl - when restoring a container we'll have the possibility to
>> set the last_pid to what we want to prevent the pids reuse after the
>> restore.
> 
> Hmmm... I personally like this one better.  Restoring multilevel pids
> would be more tedious but should still be possible and I really like
> that it's staying out of clone path and all modifications are to ns
> and pid code.  Oleg, what do you think?
> 
> Thank you.
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-22 11:11           ` Pavel Emelyanov
@ 2011-11-22 12:04             ` Pedro Alves
  2011-11-22 15:33               ` Tejun Heo
  2011-11-22 15:23             ` Tejun Heo
  1 sibling, 1 reply; 46+ messages in thread
From: Pedro Alves @ 2011-11-22 12:04 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Tejun Heo, Oleg Nesterov, Linus Torvalds, Andrew Morton,
	Alan Cox, Roland McGrath, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On Tuesday 22 November 2011 11:11:02, Pavel Emelyanov wrote:
> > Can't we for now allow this for root and then later allow CAP_CHECKPOINT 
> > that Cyrill suggested?  Or do we want to allow setting pids even w/o CR 
> > for NS creator?
> 
> I think that systemd guys can play with it. E.g. respawning daemons with predefined
> pids sounds like an interesting thing to play with.

This whole userspace C/R stuff and being able to set the child's pid has potential
of being very useful for GDB too, allowing a much better reimplementation of its
old checkpointing feature [*], and allowing for a faster reverse debugging
implementation, by being able to do faster rewinding -- restore snapshot and replay
instructions up to N (by single stepping or running to breakpoint), rather than
manually undoing the effects of each instruction, one by one.

IOW, root only would be a shame from GDB's perspective.

[*] GDB has an old checkpointing feature ("checkpoint; info checkpoints" commands)
based on forcing the tracee to fork, and holding on the fork child behind the
scenes as a checkpoint.  To restore the debugging state to a previous checkpoint,
gdb swaps the current debuggee for the fork child as transparently for the user
as was possible. Obviously this has a bunch of limitations and downsides like
only working on non-threaded programs, and the inferior's pid changing...

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-22 11:11           ` Pavel Emelyanov
  2011-11-22 12:04             ` Pedro Alves
@ 2011-11-22 15:23             ` Tejun Heo
  2011-11-22 15:29               ` Tejun Heo
  2011-11-22 16:30               ` Pavel Emelyanov
  1 sibling, 2 replies; 46+ messages in thread
From: Tejun Heo @ 2011-11-22 15:23 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Alan Cox,
	Roland McGrath, Linux Kernel Mailing List, Cyrill Gorcunov,
	James Bottomley

Hello,

On Tue, Nov 22, 2011 at 03:11:02PM +0400, Pavel Emelyanov wrote:
> > Hmmm... I hope this could be prettier.  I'm having trouble following
> > where the MAY_OPEN comes from.  Can you please explain?
> 
> From this calltrace:
> 
>  pid_ns_ctl_permissions
>  sysctl_perm
>  proc_sys_permission
>  inode_permission
>  do_last <<<<< MAY_OPEN appears here
>  path_openat
>  do_filp_open
>  do_sys_open
>  sys_open

Thanks a lot. :)

> > Can't we for now allow this for root and then later allow CAP_CHECKPOINT 
> > that Cyrill suggested?  Or do we want to allow setting pids even w/o CR 
> > for NS creator?
> 
> I think that systemd guys can play with it. E.g. respawning daemons with predefined
> pids sounds like an interesting thing to play with.

But wouldn't CAP_CHECKPOINT be enough for systemd?

> >> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> >> +		     void __user *buffer, size_t *lenp, loff_t *ppos)
> >> +{
> >> +	struct ctl_table tmp = *table;
> >> +	tmp.data = &current->nsproxy->pid_ns->last_pid;
> >> +	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
> >> +}
> > 
> > Probably better to call set_last_pid() on write path instead?
> 
> Why? The usage of this sysctl is going to be synchronized  by external locks,
> so why should we care?

I think the question should usually be the other way around.  Why
deviate when the deviation doesn't earn any tangible benefit?  If you
think setting it explicitly is justified, explain why in the comment
of the setter and places where those explicit settings are.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-22 15:23             ` Tejun Heo
@ 2011-11-22 15:29               ` Tejun Heo
  2011-11-22 16:30               ` Pavel Emelyanov
  1 sibling, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2011-11-22 15:29 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Alan Cox,
	Roland McGrath, Linux Kernel Mailing List, Cyrill Gorcunov,
	James Bottomley

On Tue, Nov 22, 2011 at 07:23:12AM -0800, Tejun Heo wrote:
> I think the question should usually be the other way around.  Why
> deviate when the deviation doesn't earn any tangible benefit?  If you
> think setting it explicitly is justified, explain why in the comment
> of the setter and places where those explicit settings are.

Hmmm... I think I trimmed a bit too much.  Let me add back a bit.

So, either just use set_last_pid() or explain in the comment of
set_last_pid() that there are other places which set last_pid but it
won't race because they're synchronized through outer lock and similar
comment where the explicit setting of last_pid is too.  As it
currently stands, it really is difficult to find out who else would be
changing last_pid - it's buried in an argument to a proc parse
function.  IMHO just using set_last_pid() would be better here.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-22 12:04             ` Pedro Alves
@ 2011-11-22 15:33               ` Tejun Heo
  2011-11-23 16:20                 ` Pedro Alves
  0 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2011-11-22 15:33 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Pavel Emelyanov, Oleg Nesterov, Linus Torvalds, Andrew Morton,
	Alan Cox, Roland McGrath, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

Hello,

On Tue, Nov 22, 2011 at 12:04:38PM +0000, Pedro Alves wrote:
> This whole userspace C/R stuff and being able to set the child's pid has potential
> of being very useful for GDB too, allowing a much better reimplementation of its
> old checkpointing feature [*], and allowing for a faster reverse debugging
> implementation, by being able to do faster rewinding -- restore snapshot and replay
> instructions up to N (by single stepping or running to breakpoint), rather than
> manually undoing the effects of each instruction, one by one.
> 
> IOW, root only would be a shame from GDB's perspective.

Would CAP_CHECKPOINT be a shame too?  I'm reluctant about priviledge
through fd inheritance mostly because of its unusualness.  I don't
think priv management is a good problem space for small creative
solutions.  We're much better off with mundane mechanisms which people
are already familiar with and is easy to account for.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-22 15:23             ` Tejun Heo
  2011-11-22 15:29               ` Tejun Heo
@ 2011-11-22 16:30               ` Pavel Emelyanov
  2011-11-22 16:44                 ` Linus Torvalds
  1 sibling, 1 reply; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-22 16:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Linus Torvalds, Andrew Morton, Alan Cox,
	Roland McGrath, Linux Kernel Mailing List, Cyrill Gorcunov,
	James Bottomley

On 11/22/2011 07:23 PM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Nov 22, 2011 at 03:11:02PM +0400, Pavel Emelyanov wrote:
>>> Hmmm... I hope this could be prettier.  I'm having trouble following
>>> where the MAY_OPEN comes from.  Can you please explain?
>>
>> From this calltrace:
>>
>>  pid_ns_ctl_permissions
>>  sysctl_perm
>>  proc_sys_permission
>>  inode_permission
>>  do_last <<<<< MAY_OPEN appears here
>>  path_openat
>>  do_filp_open
>>  do_sys_open
>>  sys_open
> 
> Thanks a lot. :)
> 
>>> Can't we for now allow this for root and then later allow CAP_CHECKPOINT 
>>> that Cyrill suggested?  Or do we want to allow setting pids even w/o CR 
>>> for NS creator?
>>
>> I think that systemd guys can play with it. E.g. respawning daemons with predefined
>> pids sounds like an interesting thing to play with.
> 
> But wouldn't CAP_CHECKPOINT be enough for systemd?

It would, but what's the point in granting to a systemd (which can be a container's
init by the way) the ability to use the _whole_ checkpoint/restore engine?

Even more - protecting with the capability implies, that any task might want to play
with it. But what's the point for an arbitrary task, that just _lives_ in a pid namespace
to set the last_pid of its namespace?

>>>> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>>>> +		     void __user *buffer, size_t *lenp, loff_t *ppos)
>>>> +{
>>>> +	struct ctl_table tmp = *table;
>>>> +	tmp.data = &current->nsproxy->pid_ns->last_pid;
>>>> +	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
>>>> +}
>>>
>>> Probably better to call set_last_pid() on write path instead?
>>
>> Why? The usage of this sysctl is going to be synchronized  by external locks,
>> so why should we care?
> 
> I think the question should usually be the other way around.  Why
> deviate when the deviation doesn't earn any tangible benefit?  If you
> think setting it explicitly is justified, explain why in the comment
> of the setter and places where those explicit settings are.

The set_last_pid() is the way to update the last_pid by two concurrent updaters. Since
setting the last_pid via sysctl is racy by its nature, using that race protection is
just pointless.

And yes, I agree, that writing this comment is a good idea :)

> Thanks.
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-22 16:30               ` Pavel Emelyanov
@ 2011-11-22 16:44                 ` Linus Torvalds
  2011-11-22 19:29                   ` Pavel Emelyanov
  2012-01-26 23:28                   ` Kay Sievers
  0 siblings, 2 replies; 46+ messages in thread
From: Linus Torvalds @ 2011-11-22 16:44 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Tejun Heo, Oleg Nesterov, Andrew Morton, Alan Cox,
	Roland McGrath, Linux Kernel Mailing List, Cyrill Gorcunov,
	James Bottomley

On Tue, Nov 22, 2011 at 8:30 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
>>>
>>> I think that systemd guys can play with it. E.g. respawning daemons with predefined
>>> pids sounds like an interesting thing to play with.
>>
>> But wouldn't CAP_CHECKPOINT be enough for systemd?
>
> It would, but what's the point in granting to a systemd (which can be a container's
> init by the way) the ability to use the _whole_ checkpoint/restore engine?

Christ, stop making it sound like we would *want* systemd to do even
more odd things.

Quite frankly, any feature that is sold with ".. and systemd can use
this fox Xyz", is a *misfeature* in my opinion.  Core infrastructure
like systemd should use a *minimal* interface, not some random
extended features.

I'm starting to really dislike this whole feature discussion.

                Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-22 16:44                 ` Linus Torvalds
@ 2011-11-22 19:29                   ` Pavel Emelyanov
  2012-01-26 23:28                   ` Kay Sievers
  1 sibling, 0 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-22 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Oleg Nesterov, Andrew Morton, Alan Cox,
	Roland McGrath, Linux Kernel Mailing List, Cyrill Gorcunov,
	James Bottomley

On 11/22/2011 08:44 PM, Linus Torvalds wrote:
> On Tue, Nov 22, 2011 at 8:30 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
>>>>
>>>> I think that systemd guys can play with it. E.g. respawning daemons with predefined
>>>> pids sounds like an interesting thing to play with.
>>>
>>> But wouldn't CAP_CHECKPOINT be enough for systemd?
>>
>> It would, but what's the point in granting to a systemd (which can be a container's
>> init by the way) the ability to use the _whole_ checkpoint/restore engine?
> 
> Christ, stop making it sound like we would *want* systemd to do even
> more odd things.
> 
> Quite frankly, any feature that is sold with ".. and systemd can use
> this fox Xyz", is a *misfeature* in my opinion.  Core infrastructure
> like systemd should use a *minimal* interface, not some random
> extended features.

Damn, surely it should use a minimal! But our opinion doesn't prevent this daemon
from doing very weird stuff, and (believe it or not) I'm not trying to sell this
feature for systemd. Just trying to minimize the impact from systemd's use of it :(

> I'm starting to really dislike this whole feature discussion.
> 
>                 Linus
> .
> 

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-21 22:50         ` Tejun Heo
  2011-11-22 11:11           ` Pavel Emelyanov
@ 2011-11-22 21:16           ` Oleg Nesterov
  1 sibling, 0 replies; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-22 21:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Linus Torvalds, Andrew Morton, Alan Cox,
	Roland McGrath, Linux Kernel Mailing List, Cyrill Gorcunov,
	James Bottomley

On 11/21, Tejun Heo wrote:
>
> > +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> > +		     void __user *buffer, size_t *lenp, loff_t *ppos)
> > +{
> > +	struct ctl_table tmp = *table;
> > +	tmp.data = &current->nsproxy->pid_ns->last_pid;
> > +	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
> > +}
>
> Probably better to call set_last_pid() on write path instead?

I am not sure... set_last_pid() is "special", it tries to avoid the
races with itself to prevent the pid-reuse. It doesn't hurt, but
perhaps

	set_last_pid(pid_ns, pid_ns->last_pid, new_pid);

looks a bit confusing.

Hmm. On the second thought, perhaps this makes sense anyway. Just
to keep the "only set_last_pid() can change ->last_pid" property.

But this is almost cosmetic.

> > Well, after a bit more thinking I found one more pros for this
> > sysctl - when restoring a container we'll have the possibility to
> > set the last_pid to what we want to prevent the pids reuse after the
> > restore.
>
> Hmmm... I personally like this one better.  Restoring multilevel pid
> would be more tedious but should still be possible and I really like
> that it's staying out of clone path and all modifications are to ns
> and pid code.  Oleg, what do you think?

Obviously, I'd prefer this one too ;)

But. Personally I do not like the fact that only init can open this
file for writing... (I guess Pavel already hates me ;)

If we add this sysctl, then I think there should be some way to use
outside of "checkpoint-restore" world. For example, see the comment
from Pedro. This use-case looks unexpected (to me), but reasonable.
Or. Say, set_last_pid can be useful to test the pid-reuse races.

In any case. To me, it is not really good to have /proc/*/set_last_pid
without the ability to use it somehow on the running system.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-22 15:33               ` Tejun Heo
@ 2011-11-23 16:20                 ` Pedro Alves
  2011-11-23 16:24                   ` Tejun Heo
  0 siblings, 1 reply; 46+ messages in thread
From: Pedro Alves @ 2011-11-23 16:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Oleg Nesterov, Linus Torvalds, Andrew Morton,
	Alan Cox, Roland McGrath, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

Hello Tejun,

On Tuesday 22 November 2011 15:33:26, Tejun Heo wrote:
> Hello,
> 
> On Tue, Nov 22, 2011 at 12:04:38PM +0000, Pedro Alves wrote:
> > This whole userspace C/R stuff and being able to set the child's pid has potential
> > of being very useful for GDB too, allowing a much better reimplementation of its
> > old checkpointing feature [*], and allowing for a faster reverse debugging
> > implementation, by being able to do faster rewinding -- restore snapshot and replay
> > instructions up to N (by single stepping or running to breakpoint), rather than
> > manually undoing the effects of each instruction, one by one.
> > 
> > IOW, root only would be a shame from GDB's perspective.
> 
> Would CAP_CHECKPOINT be a shame too?  

I think CAP_CHECKPOINT (or something through some LSM) would be
definitely better.

> I'm reluctant about priviledge
> through fd inheritance mostly because of its unusualness.  I don't
> think priv management is a good problem space for small creative
> solutions.  We're much better off with mundane mechanisms which people
> are already familiar with and is easy to account for.

fd inheritance wouldn't work for gdb; a user spawned gdb
wouldn't inherit an open fd to kernel.ns_last_pid from anywhere.

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-23 16:20                 ` Pedro Alves
@ 2011-11-23 16:24                   ` Tejun Heo
  2011-11-23 17:26                     ` Oleg Nesterov
  2011-11-23 18:19                     ` Pavel Emelyanov
  0 siblings, 2 replies; 46+ messages in thread
From: Tejun Heo @ 2011-11-23 16:24 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Pavel Emelyanov, Oleg Nesterov, Linus Torvalds, Andrew Morton,
	Alan Cox, Roland McGrath, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

Hello,

On Wed, Nov 23, 2011 at 04:20:44PM +0000, Pedro Alves wrote:
> > Would CAP_CHECKPOINT be a shame too?  
> 
> I think CAP_CHECKPOINT (or something through some LSM) would be
> definitely better.
> 
> > I'm reluctant about priviledge
> > through fd inheritance mostly because of its unusualness.  I don't
> > think priv management is a good problem space for small creative
> > solutions.  We're much better off with mundane mechanisms which people
> > are already familiar with and is easy to account for.
> 
> fd inheritance wouldn't work for gdb; a user spawned gdb
> wouldn't inherit an open fd to kernel.ns_last_pid from anywhere.

I see.  So, let's do it for root for now and later add finer grained
CAP as necessary/viable.  Pavel, what do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-23 16:24                   ` Tejun Heo
@ 2011-11-23 17:26                     ` Oleg Nesterov
  2011-11-23 17:37                       ` Tejun Heo
  2011-11-23 18:19                     ` Pavel Emelyanov
  1 sibling, 1 reply; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-23 17:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pedro Alves, Pavel Emelyanov, Linus Torvalds, Andrew Morton,
	Alan Cox, Roland McGrath, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On 11/23, Tejun Heo wrote:
>
> I see.  So, let's do it for root for now and later add finer grained
> CAP as necessary/viable.

Agreed.

Stupid off-topic question, I am just curious... Why /proc/sys
dislikes chmod ? I do not understand this code, but it seems
that it would be simple to change proc_sys_setattr() to respect
ATTR_MODE and update PROC_I(inode)->sysctl_entry->mode, no?

> Pavel, what do you think?

Yes...

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-23 17:26                     ` Oleg Nesterov
@ 2011-11-23 17:37                       ` Tejun Heo
  0 siblings, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2011-11-23 17:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pedro Alves, Pavel Emelyanov, Linus Torvalds, Andrew Morton,
	Alan Cox, Roland McGrath, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

Hey, Oleg.

On Wed, Nov 23, 2011 at 06:26:36PM +0100, Oleg Nesterov wrote:
> On 11/23, Tejun Heo wrote:
> >
> > I see.  So, let's do it for root for now and later add finer grained
> > CAP as necessary/viable.
> 
> Agreed.
> 
> Stupid off-topic question, I am just curious... Why /proc/sys
> dislikes chmod ? I do not understand this code, but it seems
> that it would be simple to change proc_sys_setattr() to respect
> ATTR_MODE and update PROC_I(inode)->sysctl_entry->mode, no?

I have no idea either but suspect it's something historical.  It came
from sysctl and sysctl didn't have any way to diddle with permissions,
so when later adding proc interface, it probably just stayed that way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-23 16:24                   ` Tejun Heo
  2011-11-23 17:26                     ` Oleg Nesterov
@ 2011-11-23 18:19                     ` Pavel Emelyanov
  2011-11-23 20:14                       ` Pavel Emelyanov
  1 sibling, 1 reply; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-23 18:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pedro Alves, Oleg Nesterov, Linus Torvalds, Andrew Morton,
	Alan Cox, Roland McGrath, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On 11/23/2011 08:24 PM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Nov 23, 2011 at 04:20:44PM +0000, Pedro Alves wrote:
>>> Would CAP_CHECKPOINT be a shame too?  
>>
>> I think CAP_CHECKPOINT (or something through some LSM) would be
>> definitely better.
>>
>>> I'm reluctant about priviledge
>>> through fd inheritance mostly because of its unusualness.  I don't
>>> think priv management is a good problem space for small creative
>>> solutions.  We're much better off with mundane mechanisms which people
>>> are already familiar with and is easy to account for.
>>
>> fd inheritance wouldn't work for gdb; a user spawned gdb
>> wouldn't inherit an open fd to kernel.ns_last_pid from anywhere.
> 
> I see.  So, let's do it for root for now and later add finer grained
> CAP as necessary/viable.  Pavel, what do you think?

OK, I'll send the respective patches soon.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-23 18:19                     ` Pavel Emelyanov
@ 2011-11-23 20:14                       ` Pavel Emelyanov
  2011-11-24 17:31                         ` Oleg Nesterov
  0 siblings, 1 reply; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-23 20:14 UTC (permalink / raw)
  To: Tejun Heo, Pedro Alves, Oleg Nesterov
  Cc: Linux Kernel Mailing List, Cyrill Gorcunov, James Bottomley

On 11/23/2011 10:19 PM, Pavel Emelyanov wrote:
> On 11/23/2011 08:24 PM, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Nov 23, 2011 at 04:20:44PM +0000, Pedro Alves wrote:
>>>> Would CAP_CHECKPOINT be a shame too?  
>>>
>>> I think CAP_CHECKPOINT (or something through some LSM) would be
>>> definitely better.
>>>
>>>> I'm reluctant about priviledge
>>>> through fd inheritance mostly because of its unusualness.  I don't
>>>> think priv management is a good problem space for small creative
>>>> solutions.  We're much better off with mundane mechanisms which people
>>>> are already familiar with and is easy to account for.
>>>
>>> fd inheritance wouldn't work for gdb; a user spawned gdb
>>> wouldn't inherit an open fd to kernel.ns_last_pid from anywhere.
>>
>> I see.  So, let's do it for root for now and later add finer grained
>> CAP as necessary/viable.  Pavel, what do you think?
> 
> OK, I'll send the respective patches soon.

Hm... Started testing this stuff and thought about Pedro's wish to use this
in gdb one more time :(

The thing is, that we (in checkpoint/restore) are going to use this sysctl
when creating a pid namespace from scratch, thus having full control over
all the forks happening in this namespace.

But when it comes to the ability for gdb to create a task with a given pid in 
a _living_ namespace this solution simply won't work! It doesn't guarantee,
that after setting the last_pid via sysctl this last_pid stays the same at
the time we do call fork()/clone(). Because there are other tasks that can call
fork themselves ignoring any lask_pid locking we can play with.

That said I see only two real-life scenarios of how to use _this_ API:

1. creating tasks in a new pid namespace, making sure all the fork-ers care
   about the proper locking;
2. forking tasks in a loop checking that getpid() returns desired value and
   hoping that other tasks do not fork() at speed high enough for spoiling
   every single last_pid value set via sysctl.

Is any of these scenarios suitable for Pedro? Other thoughts on this?

>> Thanks.
>>
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-23 20:14                       ` Pavel Emelyanov
@ 2011-11-24 17:31                         ` Oleg Nesterov
  2011-11-25 10:14                           ` Pavel Emelyanov
  0 siblings, 1 reply; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-24 17:31 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Tejun Heo, Pedro Alves, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On 11/24, Pavel Emelyanov wrote:
>
> Hm... Started testing this stuff and thought about Pedro's wish to use this
> in gdb one more time :(
>
> The thing is, that we (in checkpoint/restore) are going to use this sysctl
> when creating a pid namespace from scratch, thus having full control over
> all the forks happening in this namespace.
>
> But when it comes to the ability for gdb to create a task with a given pid in
> a _living_ namespace this solution simply won't work! It doesn't guarantee,
> that after setting the last_pid via sysctl this last_pid stays the same at
> the time we do call fork()/clone(). Because there are other tasks that can call
> fork themselves ignoring any lask_pid locking we can play with.

Sure. ->last_pid is a hint, nothing more.

But nothing can work reliably on the running system. Even if we can't
race with another fork(), the required pid_nr can be already in use.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-24 17:31                         ` Oleg Nesterov
@ 2011-11-25 10:14                           ` Pavel Emelyanov
  2011-11-25 16:22                             ` Oleg Nesterov
                                               ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-25 10:14 UTC (permalink / raw)
  To: Oleg Nesterov, Tejun Heo, Pedro Alves
  Cc: Linux Kernel Mailing List, Cyrill Gorcunov, James Bottomley

On 11/24/2011 09:31 PM, Oleg Nesterov wrote:
> On 11/24, Pavel Emelyanov wrote:
>>
>> Hm... Started testing this stuff and thought about Pedro's wish to use this
>> in gdb one more time :(
>>
>> The thing is, that we (in checkpoint/restore) are going to use this sysctl
>> when creating a pid namespace from scratch, thus having full control over
>> all the forks happening in this namespace.
>>
>> But when it comes to the ability for gdb to create a task with a given pid in
>> a _living_ namespace this solution simply won't work! It doesn't guarantee,
>> that after setting the last_pid via sysctl this last_pid stays the same at
>> the time we do call fork()/clone(). Because there are other tasks that can call
>> fork themselves ignoring any lask_pid locking we can play with.
> 
> Sure. ->last_pid is a hint, nothing more.
> 
> But nothing can work reliably on the running system. Even if we can't
> race with another fork(), the required pid_nr can be already in use.
> 
> Oleg.

OK, here's another proposal that seem to suit all of us:

1. me wants to clone tasks with pids set
2. Pedro wants to fork task with not changing pids and w/o root perms
3. Oleg and Tejun want to have little intrusion into fork() path

The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
pid on the current. The subsequent fork() uses this pid, this pid survives and keeps
its bit in the pidmap after detach. The 2nd fork() after the 1st task death thus
can reuse the same pid again. This basic thing doesn't require root perms at all
and safe against pid reuse problems. When requesting for pid reservation task may
specify a pid number it wants to have, but this requires root perms (CAP_SYS_ADMIN).

Pedro, I suppose this will work for your checkpoint feature in gdb, am I right?

Few comments about intrusion:

 * the common path - if (pid != &init_struct_pid) - on fork is just modified
 * we have -1 argument to copy_process
 * one more field on struct pid is OK, since it size doesn't change (32 bit level is
   anyway not required, it's OK to reduce on down to 16 bits)
 * no clone flags extension
 * no new locking - the reserved pid manipulations happen under tasklist_lock and
   existing common paths do not require more of it
 * yes, we have +1 member on task_struct :(

Current API problems:

 * Only one fork() with pid at a time. Next call to PR_RESERVE_PID will kill the
   previous reservation (don't know how to fix)
 * No way to fork() an init of a pid sub-namespace with desired pid in current
   (can be fixed for a flag for PR_RESERVE_PID saying that we need a pid for a
    namespace of a next level)
 * No way to grab existing pid for reserve (can be fixed, if someone wants this)

Oleg, Tejun, do you agree with such an approach?

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b152d44..b87a4ac 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -57,13 +57,16 @@ struct upid {
 struct pid
 {
 	atomic_t count;
-	unsigned int level;
+	unsigned short flags;
+	unsigned short level;
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
 	struct rcu_head rcu;
 	struct upid numbers[1];
 };
 
+#define PID_RESERVED	0x1
+
 extern struct pid init_struct_pid;
 
 struct pid_link
@@ -72,6 +75,11 @@ struct pid_link
 	struct pid *pid;
 };
 
+int reserve_pid(int nr);
+void __unreserve_pid(struct task_struct *p);
+void unreserve_pid(void);
+int pid_attached(struct pid *pid);
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
@@ -119,7 +127,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, int want_pid);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..6cc443c 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -102,4 +102,7 @@
 
 #define PR_MCE_KILL_GET 34
 
+#define PR_RESERVE_PID		35
+#define PR_UNRESERVE_PID	36
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68daf4f..a991b03 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1336,6 +1336,7 @@ struct task_struct {
 	struct pid_link pids[PIDTYPE_MAX];
 	struct list_head thread_group;
 
+	struct pid *rsv_pid;
 	struct completion *vfork_done;		/* for vfork() */
 	int __user *set_child_tid;		/* CLONE_CHILD_SETTID */
 	int __user *clear_child_tid;		/* CLONE_CHILD_CLEARTID */
diff --git a/kernel/exit.c b/kernel/exit.c
index d0b7d98..ae028f8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -177,6 +177,8 @@ repeat:
 	proc_flush_task(p);
 
 	write_lock_irq(&tasklist_lock);
+	if (p->rsv_pid)
+		__unreserve_pid(p);
 	ptrace_release_task(p);
 	__exit_signal(p);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index ba0d172..088d8fc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -297,6 +297,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	tsk->btrace_seq = 0;
 #endif
 	tsk->splice_pipe = NULL;
+	tsk->rsv_pid = NULL;
 
 	account_kernel_stack(ti, 1);
 
@@ -1049,11 +1050,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					struct pt_regs *regs,
 					unsigned long stack_size,
 					int __user *child_tidptr,
-					struct pid *pid,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
+	struct pid *pid = current->rsv_pid;
 	int cgroup_callbacks_done = 0;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
@@ -1249,9 +1250,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (retval)
 		goto bad_fork_cleanup_io;
 
-	if (pid != &init_struct_pid) {
+	if (pid == NULL || pid_attached(pid)) {
 		retval = -ENOMEM;
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, 0);
 		if (!pid)
 			goto bad_fork_cleanup_io;
 	}
@@ -1383,7 +1384,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	return p;
 
 bad_fork_free_pid:
-	if (pid != &init_struct_pid)
+	if (pid != current->rsv_pid)
 		free_pid(pid);
 bad_fork_cleanup_io:
 	if (p->io_context)
@@ -1447,8 +1448,8 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct task_struct *task;
 	struct pt_regs regs;
 
-	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+	current->rsv_pid = &init_struct_pid;
+	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL, 0);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1508,7 +1509,7 @@ long do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
diff --git a/kernel/pid.c b/kernel/pid.c
index fa5f722..86e7c6d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -159,6 +159,26 @@ static void set_last_pid(struct pid_namespace *pid_ns, int base, int pid)
 	} while ((prev != last_write) && (pid_before(base, last_write, pid)));
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	if (unlikely(!map->page)) {
+		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+		/*
+		 * Free the page if someone raced with us
+		 * installing it:
+		 */
+		spin_lock_irq(&pidmap_lock);
+		if (!map->page) {
+			map->page = page;
+			page = NULL;
+		}
+		spin_unlock_irq(&pidmap_lock);
+		kfree(page);
+	}
+
+	return map->page != NULL;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -176,22 +196,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	 */
 	max_scan = DIV_ROUND_UP(pid_max, BITS_PER_PAGE) - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (!map->page) {
-				map->page = page;
-				page = NULL;
-			}
-			spin_unlock_irq(&pidmap_lock);
-			kfree(page);
-			if (unlikely(!map->page))
-				break;
-		}
+		if (!alloc_pidmap_page(map))
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
@@ -217,6 +224,23 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -1;
 }
 
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+	if (!alloc_pidmap_page(map))
+		return -1;
+
+	if (test_and_set_bit(offset, map->page))
+		return -1;
+
+	set_last_pid(pid_ns, pid_ns->last_pid, pid);
+	return pid;
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
 {
 	int offset;
@@ -277,7 +301,7 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, int want_pid)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -291,7 +315,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		if (want_pid) {
+			nr = set_pidmap(tmp, want_pid);
+			want_pid = 0;
+		} else
+			nr = alloc_pidmap(tmp);
 		if (nr < 0)
 			goto out_free;
 
@@ -301,6 +329,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	}
 
 	get_pid_ns(ns);
+	pid->flags = 0;
 	pid->level = ns->level;
 	atomic_set(&pid->count, 1);
 	for (type = 0; type < PIDTYPE_MAX; ++type)
@@ -359,12 +388,22 @@ void attach_pid(struct task_struct *task, enum pid_type type,
 	hlist_add_head_rcu(&link->node, &pid->tasks[type]);
 }
 
+int pid_attached(struct pid *pid)
+{
+	int tmp;
+
+	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
+		if (!hlist_empty(&pid->tasks[tmp]))
+			return 1;
+
+	return 0;
+}
+
 static void __change_pid(struct task_struct *task, enum pid_type type,
 			struct pid *new)
 {
 	struct pid_link *link;
 	struct pid *pid;
-	int tmp;
 
 	link = &task->pids[type];
 	pid = link->pid;
@@ -372,11 +411,11 @@ static void __change_pid(struct task_struct *task, enum pid_type type,
 	hlist_del_rcu(&link->node);
 	link->pid = new;
 
-	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
-		if (!hlist_empty(&pid->tasks[tmp]))
-			return;
+	if (pid_attached(pid))
+		return;
 
-	free_pid(pid);
+	if (!(pid->flags & PID_RESERVED))
+		free_pid(pid);
 }
 
 void detach_pid(struct task_struct *task, enum pid_type type)
@@ -534,6 +573,48 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 	return pid;
 }
 
+static void do_unreserve_pid(struct pid *pid)
+{
+	pid->flags &= ~PID_RESERVED;
+	if (!pid_attached(pid))
+		free_pid(pid);
+}
+
+int reserve_pid(int nr)
+{
+	struct task_struct *me = current;
+	struct pid *pid;
+
+	if (nr != 0 && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	pid = alloc_pid(me->nsproxy->pid_ns, nr);
+	if (pid == NULL)
+		return -ENOMEM;
+
+	write_lock_irq(&tasklist_lock);
+	if (me->rsv_pid)
+		do_unreserve_pid(me->rsv_pid);
+	pid->flags |= PID_RESERVED;
+	me->rsv_pid = pid;
+	write_unlock_irq(&tasklist_lock);
+
+	return pid_nr_ns(pid, me->nsproxy->pid_ns);
+}
+
+void __unreserve_pid(struct task_struct *tsk)
+{
+	do_unreserve_pid(tsk->rsv_pid);
+	tsk->rsv_pid = NULL;
+}
+
+void unreserve_pid(void)
+{
+	write_lock_irq(&tasklist_lock);
+	__unreserve_pid(current);
+	write_unlock_irq(&tasklist_lock);
+}
+
 /*
  * The pid hash table is scaled according to the amount of memory in the
  * machine.  From a minimum of 16 slots up to 4096 slots at one gigabyte or
diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..8886672 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1841,6 +1841,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_RESERVE_PID:
+			error = reserve_pid(arg2);
+			break;
+		case PR_UNRESERVE_PID:
+			unreserve_pid();
+			break;
 		default:
 			error = -EINVAL;
 			break;

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-25 10:14                           ` Pavel Emelyanov
@ 2011-11-25 16:22                             ` Oleg Nesterov
  2011-11-25 16:44                               ` Pavel Emelyanov
  2011-11-27  9:41                             ` [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids Konstantin Khlebnikov
  2011-11-27 18:47                             ` Tejun Heo
  2 siblings, 1 reply; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-25 16:22 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Tejun Heo, Pedro Alves, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On 11/25, Pavel Emelyanov wrote:
>
> The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
> pid on the current. The subsequent fork() uses this pid,

Oh. This is subjective, yes, but this doesn't clean to me.

Amd why?? On the running system PR_RESERVE_PID can obviously fail anyway.
It only helps to avoid the race with another fork.

>  * one more field on struct pid is OK, since it size doesn't change (32 bit level is
>    anyway not required, it's OK to reduce on down to 16 bits)

Even if sizeof is the same, the new member and the code which plays
with ->flags doesn't make the things better ;)

>  * yes, we have +1 member on task_struct :(

Yes, and this task_struct->rsv_pid acts as implicit parameter for the
next clone(). Doesn't look very nice to me. Plus the code complications.

> Oleg, Tejun, do you agree with such an approach?

If set_last_pid doesn't work, I'd prefer CLONE_CHILD_USEPIDS.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-25 16:22                             ` Oleg Nesterov
@ 2011-11-25 16:44                               ` Pavel Emelyanov
  2011-11-25 16:54                                 ` Oleg Nesterov
  0 siblings, 1 reply; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-25 16:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tejun Heo, Pedro Alves, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On 11/25/2011 08:22 PM, Oleg Nesterov wrote:
> On 11/25, Pavel Emelyanov wrote:
>>
>> The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
>> pid on the current. The subsequent fork() uses this pid,
> 
> Oh. This is subjective, yes, but this doesn't clean to me.
> 
> Amd why?? On the running system PR_RESERVE_PID can obviously fail anyway.
> It only helps to avoid the race with another fork.

No. It can fail if you try to allocate a pid with given number. The API allows for
pid generation. AFAIU this can help with Pedro's requirements to resurrect task with
the same pid value it used to have before.

>>  * one more field on struct pid is OK, since it size doesn't change (32 bit level is
>>    anyway not required, it's OK to reduce on down to 16 bits)
> 
> Even if sizeof is the same, the new member and the code which plays
> with ->flags doesn't make the things better ;)
> 
>>  * yes, we have +1 member on task_struct :(
> 
> Yes, and this task_struct->rsv_pid acts as implicit parameter for the
> next clone(). Doesn't look very nice to me. Plus the code complications.

Well, the last_pid is also an implicit parameter for the next clone() with sysctl
approach :) But the code complication is the problem, yes :(

>> Oleg, Tejun, do you agree with such an approach?
> 
> If set_last_pid doesn't work, I'd prefer CLONE_CHILD_USEPIDS.

OK, thanks.

> Oleg.
> 
> .
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-25 16:44                               ` Pavel Emelyanov
@ 2011-11-25 16:54                                 ` Oleg Nesterov
  2011-11-25 17:03                                   ` Pavel Emelyanov
  0 siblings, 1 reply; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-25 16:54 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Tejun Heo, Pedro Alves, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On 11/25, Pavel Emelyanov wrote:
>
> On 11/25/2011 08:22 PM, Oleg Nesterov wrote:
> > On 11/25, Pavel Emelyanov wrote:
> >>
> >> The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
> >> pid on the current. The subsequent fork() uses this pid,
> >
> > Oh. This is subjective, yes, but this doesn't clean to me.
> >
> > Amd why?? On the running system PR_RESERVE_PID can obviously fail anyway.
> > It only helps to avoid the race with another fork.
>
> No. It can fail if you try to allocate a pid with given number. The API allows for
> pid generation. AFAIU this can help with Pedro's requirements to resurrect task with
> the same pid value it used to have before.

Yes gdb can do fork() in a row (until it unreserves the pid) and the
pid will be the same.

OK, I misunderstood. I thought you insist that PR_RESERVE_PID itself
is reliable.

But this can only work in the simplest case. How you can restore the
multithread tracee? You need to unreserve/reserve the previous pid,
and we have the same problems again, no?

> > Yes, and this task_struct->rsv_pid acts as implicit parameter for the
> > next clone(). Doesn't look very nice to me. Plus the code complications.
>
> Well, the last_pid is also an implicit parameter for the next clone() with sysctl
> approach :)

Yes. but it is already here ;)

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-25 16:54                                 ` Oleg Nesterov
@ 2011-11-25 17:03                                   ` Pavel Emelyanov
  2011-11-25 22:36                                     ` Pedro Alves
  0 siblings, 1 reply; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-25 17:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tejun Heo, Pedro Alves, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On 11/25/2011 08:54 PM, Oleg Nesterov wrote:
> On 11/25, Pavel Emelyanov wrote:
>>
>> On 11/25/2011 08:22 PM, Oleg Nesterov wrote:
>>> On 11/25, Pavel Emelyanov wrote:
>>>>
>>>> The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
>>>> pid on the current. The subsequent fork() uses this pid,
>>>
>>> Oh. This is subjective, yes, but this doesn't clean to me.
>>>
>>> Amd why?? On the running system PR_RESERVE_PID can obviously fail anyway.
>>> It only helps to avoid the race with another fork.
>>
>> No. It can fail if you try to allocate a pid with given number. The API allows for
>> pid generation. AFAIU this can help with Pedro's requirements to resurrect task with
>> the same pid value it used to have before.
> 
> Yes gdb can do fork() in a row (until it unreserves the pid) and the
> pid will be the same.
> 
> OK, I misunderstood. I thought you insist that PR_RESERVE_PID itself
> is reliable.
> 
> But this can only work in the simplest case.

Yup!

> How you can restore the multithread tracee?

Don't know :) But if this approach sounds promising (I see, that now it's not, but...) I
can think more on it.

> You need to unreserve/reserve the previous pid, and we have the same problems again, no?

With the existing patch - yes, but as I said above - we need to decide which direction to
go and then I'll think further.

By now your opinion is to better stay where we are ;) but if moving is unavoidable, then
it's better to take the CLONE_CHILD_USEPIDS route. That's my position as well.

>>> Yes, and this task_struct->rsv_pid acts as implicit parameter for the
>>> next clone(). Doesn't look very nice to me. Plus the code complications.
>>
>> Well, the last_pid is also an implicit parameter for the next clone() with sysctl
>> approach :)
> 
> Yes. but it is already here ;)
> 
> Oleg.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks  with given pids
  2011-11-25 17:03                                   ` Pavel Emelyanov
@ 2011-11-25 22:36                                     ` Pedro Alves
  2011-11-27 16:24                                       ` [RFC][PATCH 0/3] fork: Add the ability to create tasks with?given pids Oleg Nesterov
  0 siblings, 1 reply; 46+ messages in thread
From: Pedro Alves @ 2011-11-25 22:36 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Tejun Heo, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On Friday 25 November 2011 17:03:26, Pavel Emelyanov wrote:
> On 11/25/2011 08:54 PM, Oleg Nesterov wrote:
> > How you can restore the multithread tracee?
> 
> Don't know :) But if this approach sounds promising (I see, that now it's not, but...) I
> can think more on it.
> 
> > You need to unreserve/reserve the previous pid, and we have the same problems again, no?
> 
> With the existing patch - yes, but as I said above - we need to decide which direction to
> go and then I'll think further.

Thanks for thinking about all this.  Being able to reserve pids would be
nice, but I won't pretend to know the kernel's internals enough to be able
to suggest a sane and acceptable way to do it.  We'd have to be able
to restore multi-threaded tracees (which would also mean that there are
pids which leaders and others which are clones), and, we'd have to support
a single-threaded tracer debugging (and spawning) more than one process,
while not all tracees are involved in C/R.  Maybe this (reservation) issue
should be be considered an orthogonal mechanism for now.

> By now your opinion is to better stay where we are ;) but if moving is unavoidable, then
> it's better to take the CLONE_CHILD_USEPIDS route. That's my position as well.

>From the perspective of a client that is
going to use this on a live system, CLONE_CHILD_USEPIDS seems a little better,
in that the pid race is only against another task reusing the same pid,
while with setting last_pid, you have a try/whoops-not-the-pid-I-want/kill/retry/rinse/repeat/
loop racing against all fork/clone's in the system, along with possibly
needing to first to do a kill(PID, 0) to check whether the PID is
available (unless setting last_pid already detects that).

BTW, it's not only GDB that would want this for live systems.
Check out Berkeley Lab's C/R (https://ftg.lbl.gov/projects/CheckpointRestart/),
where these guys use mixed kernel/userspace C/R in clusters for high-end
scientific computing to e.g., migrate tasks between nodes, and pause/resume
parallel MPI jobs (on live systems).  (Apologies if everyone already knows
about this :-) .)

>From what I read from their papers, in their approach, from userspace, they
spawn new children as usual, with whatever pids the kernel wants, and then
afterwards (from userspace, but through a kernel module), magically change
the process and threads's pids to the pids they really want.  They also fixup
the parent pids, and session ids after the fact, along the way.

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-25 10:14                           ` Pavel Emelyanov
  2011-11-25 16:22                             ` Oleg Nesterov
@ 2011-11-27  9:41                             ` Konstantin Khlebnikov
  2011-11-27 17:34                               ` Oleg Nesterov
  2011-11-27 18:47                             ` Tejun Heo
  2 siblings, 1 reply; 46+ messages in thread
From: Konstantin Khlebnikov @ 2011-11-27  9:41 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Tejun Heo, Pedro Alves, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

Pavel Emelyanov wrote:
> OK, here's another proposal that seem to suit all of us:
>
> 1. me wants to clone tasks with pids set
> 2. Pedro wants to fork task with not changing pids and w/o root perms
> 3. Oleg and Tejun want to have little intrusion into fork() path
>
> The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
> pid on the current. The subsequent fork() uses this pid, this pid survives and keeps
> its bit in the pidmap after detach. The 2nd fork() after the 1st task death thus
> can reuse the same pid again. This basic thing doesn't require root perms at all
> and safe against pid reuse problems. When requesting for pid reservation task may
> specify a pid number it wants to have, but this requires root perms (CAP_SYS_ADMIN).
>
> Pedro, I suppose this will work for your checkpoint feature in gdb, am I right?
>
> Few comments about intrusion:
>
>   * the common path - if (pid !=&init_struct_pid) - on fork is just modified
>   * we have -1 argument to copy_process
>   * one more field on struct pid is OK, since it size doesn't change (32 bit level is
>     anyway not required, it's OK to reduce on down to 16 bits)
>   * no clone flags extension
>   * no new locking - the reserved pid manipulations happen under tasklist_lock and
>     existing common paths do not require more of it
>   * yes, we have +1 member on task_struct :(
>
> Current API problems:
>
>   * Only one fork() with pid at a time. Next call to PR_RESERVE_PID will kill the
>     previous reservation (don't know how to fix)
>   * No way to fork() an init of a pid sub-namespace with desired pid in current
>     (can be fixed for a flag for PR_RESERVE_PID saying that we need a pid for a
>      namespace of a next level)

>   * No way to grab existing pid for reserve (can be fixed, if someone wants this)

We can add flag to sys_wait4(), and stash pid in wait_task_zombie(), right before release_task()
code will looks something like this:

-       if (p != NULL)
+       if (p != NULL) {
+               if ((wo->wo_flags & WCATCHPID) && !current->pid_stash) {
+                       struct pid *pid = task_pid(p);
+
+                       pid->flags |= PID_STASHED;
+                       current->pid_stash = get_pid(pid);
+               }
                 release_task(p);
+       }

And next fork() creates child with the same pid.
So, struct pid will work like boomerang =)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with?given pids
  2011-11-25 22:36                                     ` Pedro Alves
@ 2011-11-27 16:24                                       ` Oleg Nesterov
  0 siblings, 0 replies; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-27 16:24 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Pavel Emelyanov, Tejun Heo, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On 11/25, Pedro Alves wrote:
>
> From the perspective of a client that is
> going to use this on a live system, CLONE_CHILD_USEPIDS seems a little better,
> in that the pid race is only against another task reusing the same pid,

Yes. Except you can't use on a live system at all. It simply doesn't
work after the first "normal" clone() without CLONE_CHILD_USEPIDS.

Although we can probably allow CLONE_CHILD_USEPIDS for CAP_SYS_ADMIN.

> while with setting last_pid, you have a try/whoops-not-the-pid-I-want/kill/retry/rinse/repeat/
> loop racing against all fork/clone's in the system,

Yes, setting last_pid can race with another fork(). I agree this
sucks. But simple ;)

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-27  9:41                             ` [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids Konstantin Khlebnikov
@ 2011-11-27 17:34                               ` Oleg Nesterov
  0 siblings, 0 replies; 46+ messages in thread
From: Oleg Nesterov @ 2011-11-27 17:34 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Pavel Emelyanov, Tejun Heo, Pedro Alves,
	Linux Kernel Mailing List, Cyrill Gorcunov, James Bottomley

On 11/27, Konstantin Khlebnikov wrote:
>
> We can add flag to sys_wait4(), and stash pid in wait_task_zombie(), right before release_task()
> code will looks something like this:
>
> -       if (p != NULL)
> +       if (p != NULL) {
> +               if ((wo->wo_flags & WCATCHPID) && !current->pid_stash) {
> +                       struct pid *pid = task_pid(p);
> +
> +                       pid->flags |= PID_STASHED;
> +                       current->pid_stash = get_pid(pid);
> +               }
>                 release_task(p);
> +       }
>
> And next fork() creates child with the same pid.
> So, struct pid will work like boomerang =)

Like PR_RESERVE_PID, this can only help if the tracee (or whatever)
is single-threaded and it is the natural child.

Personally I do not think such a limited interface makes sense.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-25 10:14                           ` Pavel Emelyanov
  2011-11-25 16:22                             ` Oleg Nesterov
  2011-11-27  9:41                             ` [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids Konstantin Khlebnikov
@ 2011-11-27 18:47                             ` Tejun Heo
  2011-11-28 10:38                               ` Pavel Emelyanov
  2 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2011-11-27 18:47 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Pedro Alves, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

Hello, Pavel.

On Fri, Nov 25, 2011 at 02:14:56PM +0400, Pavel Emelyanov wrote:
> OK, here's another proposal that seem to suit all of us:
> 
> 1. me wants to clone tasks with pids set
> 2. Pedro wants to fork task with not changing pids and w/o root perms
> 3. Oleg and Tejun want to have little intrusion into fork() path
> 
> The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
> pid on the current. The subsequent fork() uses this pid, this pid survives and keeps
> its bit in the pidmap after detach. The 2nd fork() after the 1st task death thus
> can reuse the same pid again. This basic thing doesn't require root perms at all
> and safe against pid reuse problems. When requesting for pid reservation task may
> specify a pid number it wants to have, but this requires root perms (CAP_SYS_ADMIN).
> 
> Pedro, I suppose this will work for your checkpoint feature in gdb, am I right?
> 
> Few comments about intrusion:
> 
>  * the common path - if (pid != &init_struct_pid) - on fork is just modified
>  * we have -1 argument to copy_process
>  * one more field on struct pid is OK, since it size doesn't change (32 bit level is
>    anyway not required, it's OK to reduce on down to 16 bits)
>  * no clone flags extension
>  * no new locking - the reserved pid manipulations happen under tasklist_lock and
>    existing common paths do not require more of it
>  * yes, we have +1 member on task_struct :(
> 
> Current API problems:
> 
>  * Only one fork() with pid at a time. Next call to PR_RESERVE_PID will kill the
>    previous reservation (don't know how to fix)
>  * No way to fork() an init of a pid sub-namespace with desired pid in current
>    (can be fixed for a flag for PR_RESERVE_PID saying that we need a pid for a
>     namespace of a next level)
>  * No way to grab existing pid for reserve (can be fixed, if someone wants this)
> 
> Oleg, Tejun, do you agree with such an approach?

Hmmm... Any attempt to reserve PIDs without full control over the
namespace is futile.  It can never be complete / reliable.  Let's just
forget about it.  If anyone, including gdb, wants to have fun with CR,
let them manage namespace too; otherwise, it's never gonna be
reliable.

If you take the above out, setting last_pid is as simple as it gets
and good enough.  It's essentially few tens of lines of code to add
userland interface for setting one pid_t value.  Let's restrict
manipulation to root for now and see whether finer grained CAP_* makes
sense as we go along.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-27 18:47                             ` Tejun Heo
@ 2011-11-28 10:38                               ` Pavel Emelyanov
  2011-11-28 16:25                                 ` Tejun Heo
  0 siblings, 1 reply; 46+ messages in thread
From: Pavel Emelyanov @ 2011-11-28 10:38 UTC (permalink / raw)
  To: Tejun Heo, Oleg Nesterov
  Cc: Pedro Alves, Linux Kernel Mailing List, Cyrill Gorcunov, James Bottomley

On 11/27/2011 10:47 PM, Tejun Heo wrote:
> Hello, Pavel.
> 
> On Fri, Nov 25, 2011 at 02:14:56PM +0400, Pavel Emelyanov wrote:
>> OK, here's another proposal that seem to suit all of us:
>>
>> 1. me wants to clone tasks with pids set
>> 2. Pedro wants to fork task with not changing pids and w/o root perms
>> 3. Oleg and Tejun want to have little intrusion into fork() path
>>
>> The proposal is to implement the PR_RESERVE_PID prctl which allocates and puts a
>> pid on the current. The subsequent fork() uses this pid, this pid survives and keeps
>> its bit in the pidmap after detach. The 2nd fork() after the 1st task death thus
>> can reuse the same pid again. This basic thing doesn't require root perms at all
>> and safe against pid reuse problems. When requesting for pid reservation task may
>> specify a pid number it wants to have, but this requires root perms (CAP_SYS_ADMIN).
>>
>> Pedro, I suppose this will work for your checkpoint feature in gdb, am I right?
>>
>> Few comments about intrusion:
>>
>>  * the common path - if (pid != &init_struct_pid) - on fork is just modified
>>  * we have -1 argument to copy_process
>>  * one more field on struct pid is OK, since it size doesn't change (32 bit level is
>>    anyway not required, it's OK to reduce on down to 16 bits)
>>  * no clone flags extension
>>  * no new locking - the reserved pid manipulations happen under tasklist_lock and
>>    existing common paths do not require more of it
>>  * yes, we have +1 member on task_struct :(
>>
>> Current API problems:
>>
>>  * Only one fork() with pid at a time. Next call to PR_RESERVE_PID will kill the
>>    previous reservation (don't know how to fix)
>>  * No way to fork() an init of a pid sub-namespace with desired pid in current
>>    (can be fixed for a flag for PR_RESERVE_PID saying that we need a pid for a
>>     namespace of a next level)
>>  * No way to grab existing pid for reserve (can be fixed, if someone wants this)
>>
>> Oleg, Tejun, do you agree with such an approach?
> 
> Hmmm... Any attempt to reserve PIDs without full control over the
> namespace is futile.  It can never be complete / reliable. 

Why? What's the _real_ problem with the 

pid = prctl(PR_RESERVE_PID, 0); /* let the kernel _generate_ a pid for us */
while (1) {
	real_pid = fork();
	BUG_ON(pid != real_pid);
	if (real_pid == 0)
		return do_child();

	wait();
}

model? Let's temporarily forget about the single reserved pid implementation
limitation and concentrate on the approach itself.

> Let's just
> forget about it.  If anyone, including gdb, wants to have fun with CR,
> let them manage namespace too; otherwise, it's never gonna be
> reliable.
> 
> If you take the above out, setting last_pid is as simple as it gets
> and good enough.  It's essentially few tens of lines of code to add
> userland interface for setting one pid_t value.  Let's restrict
> manipulation to root for now and see whether finer grained CAP_* makes
> sense as we go along.

That's OK for me, I'll send the patches soon, but I'd like to hear for some sane 
explanation of the above.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-28 10:38                               ` Pavel Emelyanov
@ 2011-11-28 16:25                                 ` Tejun Heo
  0 siblings, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2011-11-28 16:25 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Pedro Alves, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

Hello, Pavel.

On Mon, Nov 28, 2011 at 02:38:46PM +0400, Pavel Emelyanov wrote:
> > Hmmm... Any attempt to reserve PIDs without full control over the
> > namespace is futile.  It can never be complete / reliable. 
> 
> Why? What's the _real_ problem with the 
> 
> pid = prctl(PR_RESERVE_PID, 0); /* let the kernel _generate_ a pid for us */
> while (1) {
> 	real_pid = fork();
> 	BUG_ON(pid != real_pid);
> 	if (real_pid == 0)
> 		return do_child();
> 
> 	wait();
> }
> 
> model? Let's temporarily forget about the single reserved pid implementation
> limitation and concentrate on the approach itself.

PID is ns-shared resource.  If you don't have full control over it and
there are other tasks allocating from it, there is no way to reserve
specific pid reliably no matter what you do.  The only things you can
do are - either reserve the pids you want before anyone else takes it
or somehow revoke pids held by other tasks.

Full ns control + set_last_pid essentially gives the ns owner full
reservation + a way to control allocation.

I suppose you're suggesting that with reserve approach, we can also
support recycling pids of existing tasks which is suggested to be
useful for systemd and gdb.

Using this kind of black magic for general system management seems
like a really bad idea to me.  It is extremely obscure and unexpected
and we actually should be looking to dissuade such usage even if the
natural implementation of the mechanism allows for it.

For gdb, it *might* be useful but the usage isn't out there yet and
the suggested mechanism isn't enough to support the suggested usage
(ie. multithread).  We have neither concrete problem or solution.

So, let's do the simple 30 line non-invasive thing now and worry about
the complex problem when it's actually necessary.  It's not like the
ability to set last_pid is gonna interfere with future changes or
anything.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids
  2011-11-22 16:44                 ` Linus Torvalds
  2011-11-22 19:29                   ` Pavel Emelyanov
@ 2012-01-26 23:28                   ` Kay Sievers
  1 sibling, 0 replies; 46+ messages in thread
From: Kay Sievers @ 2012-01-26 23:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Emelyanov, Tejun Heo, Oleg Nesterov, Andrew Morton,
	Alan Cox, Roland McGrath, Linux Kernel Mailing List,
	Cyrill Gorcunov, James Bottomley

On Tue, Nov 22, 2011 at 17:44, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Nov 22, 2011 at 8:30 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
>>>>
>>>> I think that systemd guys can play with it. E.g. respawning daemons with predefined
>>>> pids sounds like an interesting thing to play with.
>>>
>>> But wouldn't CAP_CHECKPOINT be enough for systemd?
>>
>> It would, but what's the point in granting to a systemd (which can be a container's
>> init by the way) the ability to use the _whole_ checkpoint/restore engine?
>
> Christ, stop making it sound like we would *want* systemd to do even
> more odd things.
>
> Quite frankly, any feature that is sold with ".. and systemd can use
> this fox Xyz", is a *misfeature* in my opinion.  Core infrastructure
> like systemd should use a *minimal* interface, not some random
> extended features.

No worries, there is no plan from the systemd maintainers side to use
or need predictable PIDs. We still start, stop and restart service the
old-school way, and will leave checkpoint/restart to somebody else. :)

Kay

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2012-01-26 23:29 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-17 11:41 [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids Pavel Emelyanov
2011-11-17 11:42 ` [PATCH 1/3] pids: Make alloc_pid return error Pavel Emelyanov
2011-11-17 11:42 ` [PATCH 2/3] pids: Split alloc_pidmap into parts Pavel Emelyanov
2011-11-17 11:43 ` [PATCH 3/3] pids: Make it possible to clone tasks with given pids Pavel Emelyanov
2011-11-17 15:32   ` Oleg Nesterov
2011-11-17 15:49     ` Pavel Emelyanov
2011-11-17 16:00       ` Oleg Nesterov
2011-11-17 17:28   ` Linus Torvalds
2011-11-17 19:04     ` Oleg Nesterov
2011-11-17 18:36   ` Oleg Nesterov
2011-11-18 10:05     ` Pavel Emelyanov
2011-11-17 15:49 ` [RFC][PATCH 0/3] fork: Add the ability to create " Oleg Nesterov
2011-11-17 16:01   ` Pavel Emelyanov
2011-11-17 16:02     ` Oleg Nesterov
2011-11-18 23:30     ` Tejun Heo
2011-11-21  9:15       ` Pavel Emelyanov
2011-11-21 22:50         ` Tejun Heo
2011-11-22 11:11           ` Pavel Emelyanov
2011-11-22 12:04             ` Pedro Alves
2011-11-22 15:33               ` Tejun Heo
2011-11-23 16:20                 ` Pedro Alves
2011-11-23 16:24                   ` Tejun Heo
2011-11-23 17:26                     ` Oleg Nesterov
2011-11-23 17:37                       ` Tejun Heo
2011-11-23 18:19                     ` Pavel Emelyanov
2011-11-23 20:14                       ` Pavel Emelyanov
2011-11-24 17:31                         ` Oleg Nesterov
2011-11-25 10:14                           ` Pavel Emelyanov
2011-11-25 16:22                             ` Oleg Nesterov
2011-11-25 16:44                               ` Pavel Emelyanov
2011-11-25 16:54                                 ` Oleg Nesterov
2011-11-25 17:03                                   ` Pavel Emelyanov
2011-11-25 22:36                                     ` Pedro Alves
2011-11-27 16:24                                       ` [RFC][PATCH 0/3] fork: Add the ability to create tasks with?given pids Oleg Nesterov
2011-11-27  9:41                             ` [RFC][PATCH 0/3] fork: Add the ability to create tasks with given pids Konstantin Khlebnikov
2011-11-27 17:34                               ` Oleg Nesterov
2011-11-27 18:47                             ` Tejun Heo
2011-11-28 10:38                               ` Pavel Emelyanov
2011-11-28 16:25                                 ` Tejun Heo
2011-11-22 15:23             ` Tejun Heo
2011-11-22 15:29               ` Tejun Heo
2011-11-22 16:30               ` Pavel Emelyanov
2011-11-22 16:44                 ` Linus Torvalds
2011-11-22 19:29                   ` Pavel Emelyanov
2012-01-26 23:28                   ` Kay Sievers
2011-11-22 21:16           ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).